response variable:

  • quantitative
  • qualitative / categorical

methods for classification

  • first predict the probability that the observation belongs to each category of a qualitative variable, then assign the observation to the class with the highest predicted probability.

Reading tips:

  • The discussion of logistic regression is used as a jumping-off point for a discussion of generalized linear models, and in particular Poisson regression.

4.1 Overview

In the classification setting, we have a set of training observations \((x_1,y_1),...,(x_n,y_n)\) that we can use to build a classifier.

Data set: Default, with columns such as income and balance.

4.2 Why not Linear Regression?

For example, $$
\begin{align}
\begin{split}
Y = \left\{
\begin{array}{lr}
    1 & \text{if stroke}\\
    2 & \text{if drug overdose}\\
    3 & \text{if epileptic seizure}
\end{array}
\right.
\end{split}
\end{align}
$$

Facts:

  • This coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure.
  • In other words, the coding treats drug overdose as the average of the other two conditions: \(0.5\cdot\text{epileptic seizure} + 0.5\cdot\text{stroke} = \text{drug overdose}\)
  • If the response variable’s values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression.
  • Curiously, it turns out that the classifications that we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure we discuss in Section 4.4.
  • For a binary response with a \(0/1\) coding as above, regression by least squares is not completely unreasonable: it can be shown that the \(X\hat \beta\) obtained using linear regression is in fact an estimate of \(Pr(\text{drug overdose}|X)\) in this special case. However, if we use linear regression, some of our estimates might be outside the [0, 1] interval, making them hard to interpret as probabilities.

Summary: There are at least two reasons not to perform classification using a regression method:

(a) a regression method cannot accommodate a qualitative response with more than two classes;

(b) a regression method will not provide meaningful estimates of \(Pr(Y |X)\), even with just two classes.

4.3 Logistic Regression

Logistic regression models the probability that Y belongs to a particular category.

4.3.1 The Logistic Model

First, consider a linear regression model: \(p(X)=\beta_0+\beta_1 X\). Any time a straight line is fit to a binary response that is coded as \(0\) or \(1\), in principle we can always predict \(p(X) < 0\) for some values of \(X\) and \(p(X) > 1\) for others (unless the range of \(X\) is limited). To avoid this problem, we must model \(p(X)\) using a function that gives outputs between \(0\) and \(1\) for all values of \(X\). In logistic regression, we use the logistic function,$$p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$$
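As a quick numeric illustration (my own sketch, not the book's R lab), the logistic function can be evaluated directly; the coefficients below are made-up values of roughly the magnitude reported for the Default/balance fit, and every output lies strictly between 0 and 1:

```python
import numpy as np

def logistic(x, beta0, beta1):
    """p(X) = exp(beta0 + beta1*x) / (1 + exp(beta0 + beta1*x))."""
    z = beta0 + beta1 * x
    return np.exp(z) / (1.0 + np.exp(z))

# Illustrative coefficients only (roughly the scale of the Default/balance example).
beta0, beta1 = -10.65, 0.0055
balance = np.array([0.0, 500.0, 1000.0, 2000.0, 3000.0])
print(logistic(balance, beta0, beta1))  # every value lies in (0, 1)
```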

To fit the model above, we use a method called maximum likelihood estimation (MLE).

After a bit of manipulation of above equation, we find that $$\frac{p(X)}{1 - p(X)} = e^{\beta_0+\beta_1X}$$

The quantity \(p(X)/[1-p(X)]\) is called the odds, and can take on any value between \(0\) and \(\infty\).

By taking the logarithm of both sides, we arrive at $$log\Big(\frac{p(X)}{1-p(X)}\Big)=\beta_0+\beta_1X$$

The left-hand side is called the log odds or logit. We see that the logistic regression model has a logit that is linear in \(X\).

The amount that \(p(X)\) changes due to a one-unit change in \(X\) depends on the current value of \(X\). But regardless of the value of \(X\), if \(\beta_1\) is positive then increasing \(X\) will be associated with increasing \(p(X)\), and if \(\beta_1\) is negative then increasing \(X\) will be associated with decreasing \(p(X)\).

4.3.2 Estimating the Regression Coefficients

The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for \(\beta_0\) and \(\beta_1\) such that the predicted probability \(\hat p(x_i)\) of default for each individual corresponds as closely as possible to the individual's observed default status.

Here, the likelihood function is $$\ell(\beta_0,\beta_1)=\prod_{i:y_i = 1}p(x_i)\prod_{i':y_{i'}=0} (1 - p(x_{i'}))$$ and the estimates \(\hat\beta_0\) and \(\hat\beta_1\) are chosen to maximize it.
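A minimal sketch of maximum likelihood for simple logistic regression, using simulated data (the data, the true coefficients, and the optimizer choice are all my own assumptions, not the book's lab): minimizing the negative log of the likelihood above recovers the coefficients used to generate the data.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulated data: one predictor, binary response drawn from a known logistic model.
n = 1000
x = rng.normal(size=n)
true_b0, true_b1 = -1.0, 2.0
y = rng.binomial(1, 1 / (1 + np.exp(-(true_b0 + true_b1 * x))))

def neg_log_likelihood(beta):
    b0, b1 = beta
    z = b0 + b1 * x
    # log l(b0, b1) = sum_i [ y_i * z_i - log(1 + exp(z_i)) ]; we minimize its negative.
    return -np.sum(y * z - np.logaddexp(0.0, z))

fit = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
print(fit.x)  # estimates (beta0_hat, beta1_hat), close to (-1.0, 2.0) for this sample
```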

4.3.3 Making Predictions

Maximum likelihood is a very general approach that is used to fit many of the non-linear models that we examine throughout this book.

For a qualitative predictor with two categories, we can simply create a dummy variable that takes on values 0/1.

4.3.4 Multiple Logistic Regression

\[log ( \frac{p(X)}{1 - p(X)} ) = \beta_0 + \beta_1X_1+...+\beta_pX_p
\]

where \(X = (X_1,...,X_p)\) are predictors.

The above equation is equivalent to $$p(X) = \frac{e^{\beta_0 + \beta_1X_1+...+\beta_pX_p}}{1+e^{\beta_0 + \beta_1X_1+...+\beta_pX_p}}$$

confounding

For the Default dataset, the negative coefficient for student in the multiple logistic regression indicates that, for a fixed value of balance and income, a student is less likely to default than a non-student.

But the bar plot, which shows the default rates for students and non-students averaged over all values of balance and income, suggests the opposite effect: the overall student default rate is higher than the non-student default rate.

Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students.

![[Pasted image 20221017160629.png]]

![[Pasted image 20221017160707.png]]

(Confounding) This simple example illustrates the dangers and subtleties associated with performing regressions involving only a single predictor when other predictors may also be relevant. As in the linear regression setting, the results obtained using one predictor may be quite different from those obtained using multiple predictors, especially when there is correlation among the predictors.
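The confounding pattern above can be reproduced on simulated data (a hedged sketch; every number below is made up): being a student raises balance, balance raises the default probability, and the direct effect of student is slightly negative, so the marginal and conditional comparisons point in opposite directions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Simulate: students tend to carry higher balances; default depends on balance,
# with a small *negative* direct effect of being a student.
student = rng.binomial(1, 0.3, n)
balance = rng.normal(800 + 700 * student, 400, n).clip(min=0)
logit = -10.0 + 0.005 * balance - 0.7 * student
default = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Marginal comparison: students default *more* overall ...
print("overall default rate, students    :", default[student == 1].mean())
print("overall default rate, non-students:", default[student == 0].mean())

# ... but conditional on a similar balance, students default *less*.
band = (balance > 1400) & (balance < 1600)
print("default rate near balance 1500, students    :", default[(student == 1) & band].mean())
print("default rate near balance 1500, non-students:", default[(student == 0) & band].mean())
```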

4.3.5 Multinomial Logistic Regression

Want to do: classify a response variable that has more than two classes

However, the logistic regression approach that we have seen in this section only allows for K = 2 classes for the response variable.

It turns out that it is possible to extend the two-class logistic regression approach to the setting of K > 2 classes. This extension is sometimes known as multinomial logistic regression.

  1. We first select a single class to serve as the baseline; without loss of generality, we select the Kth class for this role. Then the model becomes $$Pr(Y=k|X=x)=\frac{e^{\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p}}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}$$ for \(k=1,...,K-1\) and $$Pr(Y=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}$$
  2. so that $$log(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)})=\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p$$ for \(k=1,...,K-1\). This indicates that the log odds between any pair of classes is linear in the features.
  3. The coefficient estimates will differ between two fits that use different baseline classes, but the fitted values (predictions), the log odds between any pair of classes, and the other key model outputs will remain the same.
Softmax coding

The softmax coding is equivalent to the baseline coding just described, in the sense that the fitted values, the log odds between any pair of classes, and the other key model outputs will remain the same, regardless of coding.

In the softmax coding, rather than selecting a baseline class, we treat all K classes symmetrically, and assume that for \(k = 1,...,K\), $$Pr(Y=k|X=x)=\frac{e^{\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p}}{\sum_{l=1}^{K}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}$$ Thus, rather than estimating coefficients for K − 1 classes, we actually estimate coefficients for all K classes. The log odds ratio between the \(k\)th and \(k'\)th classes equals $$log(\frac{Pr(Y=k|X=x)}{Pr(Y=k\prime|X=x)})=(\beta_{k0}-\beta_{k\prime 0}) + (\beta_{k1}-\beta_{k\prime 1}) x_1+...+(\beta_{kp}-\beta_{k\prime p})x_p $$
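A small numeric check of the coding-invariance claim (the coefficients below are arbitrary, made-up values): subtracting one class's coefficient vector from all of them changes the individual coefficients but leaves the softmax probabilities, and hence every pairwise log odds, unchanged.

```python
import numpy as np

def softmax_probs(B, x):
    """Pr(Y=k|X=x) under the softmax coding; B has one row (b_k0, b_k1, ..., b_kp) per class."""
    scores = B[:, 0] + B[:, 1:] @ x
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

# Arbitrary coefficients for K = 3 classes and p = 2 predictors.
B = np.array([[ 0.5,  1.0, -2.0],
              [-1.0,  0.3,  0.7],
              [ 2.0, -0.5,  0.1]])
x = np.array([1.2, -0.4])

p1 = softmax_probs(B, x)
p2 = softmax_probs(B - B[2], x)   # re-express with class 3 as a zero "baseline"
print(p1)
print(p2)                          # identical probabilities despite different coefficients
```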

4.4 Generative Models for Classification

In statistical jargon, we model the conditional distribution of the response Y , given the predictor(s) X.

In this new approach, we model the distribution of the predictors X separately in each of the response classes (i.e. for each value of Y ). We then use Bayes’ theorem to flip these around into estimates for \(Pr(Y = k|X = x)\). When the distribution of X within each class is assumed to be normal, it turns out that the model is very similar in form to logistic regression.

advantages:

  • When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable. The methods that we consider in this section do not suffer from this problem.
  • If the distribution of the predictors X is approximately normal in each of the classes and the sample size is small, then the approaches in this section may be more accurate than logistic regression.
  • The methods in this section can be naturally extended to the case of more than two response classes. (In the case of more than two response classes, we can also use multinomial logistic regression from Section 4.3.5.)

Let \(f_k(X) ≡ Pr(X|Y = k)\) denote the density function of X for an observation that comes from the \(k\)th class. Then Bayes’ theorem states that $$Pr(Y= k|X = x)=\frac{\pi_kf_k(x)}{\sum_{l=1}^K\pi_l f_l(x)}$$

  • \(p_k(x) = Pr(Y = k|X = x)\) is the posterior probability that an observation \(X = x\) belongs to the \(k\)th class.
  • In general, estimating \(\pi_k\) is easy if we have a random sample from the population: we simply compute the fraction of the training observations that belong to the \(k\)th class.
  • As we will see, to estimate \(f_k(x)\), we will typically have to make some simplifying assumptions.

4.4.1 Linear Discriminant Analysis for p=1

Assumptions:

  1. \(f_k(x)\) is normal or Gaussian
  2. \(\sigma_1^2=...=\sigma_K^2=\sigma^2\) (a shared variance)

so we have $$p_k(x)=\frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)}{\sum_{l=1}^K \pi_l\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)}$$

Taking the log and discarding terms that do not depend on \(k\), this is equivalent to assigning the observation to the class for which \(\delta_k(x) = x \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}+log(\pi_k)\) is largest.

The Bayes decision boundary is the point for which \(\delta_1(x) = \delta_2(x)\); one can show that this amounts to$$x=\frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \frac{\mu_1 + \mu_2}{2}$$ (Assume \(\pi_i = \pi_j\) for all i,j)

In practice, the following estimates are used: $$\hat \mu_k = \frac{1}{n_k}\sum_{i:y_i=k}x_i$$$$\hat \sigma^2 = \frac{1}{n-K} \sum_{k=1}^{K}\sum_{i:y_i=k}(x_i-\hat \mu_k)^2$$

The discriminant functions \(\hat \delta_k(x)\) are linear functions of \(x\) (hence the name linear discriminant analysis).

Summary: the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance \(\sigma^2\), and plugging estimates for these parameters into the Bayes classifier.
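A minimal numpy sketch of this recipe on simulated data (my own illustration, not the book's lab): estimate \(\hat\pi_k\), \(\hat\mu_k\), and the pooled \(\hat\sigma^2\), plug them into \(\hat\delta_k(x)\), and classify to the largest score.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two classes with different means and a shared variance (the LDA assumptions).
x0 = rng.normal(-1.0, 1.0, 200)
x1 = rng.normal( 1.5, 1.0, 300)
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(200, int), np.ones(300, int)])

K, n = 2, len(x)
pi_hat = np.array([(y == k).mean() for k in range(K)])
mu_hat = np.array([x[y == k].mean() for k in range(K)])
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in range(K)) / (n - K)

def delta(x_new):
    """delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)."""
    return x_new * mu_hat / sigma2_hat - mu_hat**2 / (2 * sigma2_hat) + np.log(pi_hat)

print(delta(0.0))            # discriminant scores for a new observation x = 0
print(delta(0.0).argmax())   # predicted class: the one with the largest score
```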

4.4.2 Linear Discriminant Analysis for p > 1

Assumptions:

  • X = (X1, X2,...,Xp) is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific multivariate mean vector and a common covariance matrix.

Note: the multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.

An example in which \(Var(X_1) = Var(X_2)\) and \(Cor(X_1, X_2) = 0\): this surface has a characteristic bell shape, like a symmetric, round mound centered at the mean. If \(Cor(X_1, X_2) \neq 0\), the bell is stretched and squashed along a diagonal.

![[Pasted image 20221017165257.png]]

we write \(X \sim N(\mu, \Sigma)\). Here \(E(X) = \mu\) is the mean of X (a vector with p components), and \(Cov(X) = \Sigma\) is the p × p covariance matrix of \(X\). Formally, the multivariate Gaussian density is defined as $$f(x)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}exp\Big(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\Big)$$

The discriminant functions $$\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + log \pi_k$$ are linear functions of \(x\). The Bayes decision boundaries are the sets where $$x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k = x^T \Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l$$ (assuming \(\pi_k = \pi_l\) for all \(k\), \(l\)).
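A hedged numpy sketch of the \(p>1\) case on simulated data (the class means, the common covariance, and the test point are all made up): estimate \(\hat\pi_k\), \(\hat\mu_k\), and a pooled \(\hat\Sigma\), then evaluate the linear discriminant scores.

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])       # common covariance for both classes
X0 = rng.multivariate_normal([-1.0, 0.0], Sigma, 200)
X1 = rng.multivariate_normal([ 1.0, 1.0], Sigma, 200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

K, n = 2, len(y)
pi_hat = np.array([(y == k).mean() for k in range(K)])
mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])
# Pooled (common) covariance estimate.
Sigma_hat = sum((X[y == k] - mu_hat[k]).T @ (X[y == k] - mu_hat[k]) for k in range(K)) / (n - K)
Sigma_inv = np.linalg.inv(Sigma_hat)

def lda_scores(x_new):
    """Linear discriminant scores delta_k(x) for one observation."""
    return np.array([x_new @ Sigma_inv @ mu_hat[k]
                     - 0.5 * mu_hat[k] @ Sigma_inv @ mu_hat[k]
                     + np.log(pi_hat[k]) for k in range(K)])

x_new = np.array([0.2, 0.5])
print(lda_scores(x_new), lda_scores(x_new).argmax())   # classify to the largest score
```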

Summary:

  • we need to estimate the unknown parameters \(\mu_1,...,\mu_K, \pi_1,..., \pi_K\), and \(\Sigma\);
  • To assign a new observation \(X = x\), LDA plugs these estimates to obtain quantities \(\hat \delta_k(x)\), and classifies to the class for which \(\hat \delta_k(x)\) is largest.
  • \(\hat \delta_k(x)\) is a linear function of \(x\); that is, the LDA decision rule depends on \(x\) only through a linear combination of its elements.

Class-specific performance is also important in medicine and biology, where the terms sensitivity and specificity characterize the performance of a classifier or screening test.

LDA is trying to approximate the Bayes classifier, which has the lowest total error rate out of all classifiers.

The Bayes classifier works by assigning an observation to the class for which the posterior probability \(p_k(X)\) is greatest. Thus, the Bayes classifier, and by extension LDA, uses a threshold of 50% for the posterior probability of default in order to assign an observation to the default class. However, this threshold can be changed to some other value, depending on the relative costs of the two types of error.

Figure 4.7 illustrates the trade-off that results from modifying the threshold value for the posterior probability of default.

As the threshold is reduced, the error rate among individuals who default decreases steadily, but the error rate among the individuals who do not default increases.

![[Pasted image 20221017190149.png]]

ROC curve

The ROC (receiver operating characteristics) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.

![[Pasted image 20221017190418.png]]

The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC, the better the classifier.
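A minimal sketch of computing an ROC curve and its AUC with scikit-learn (the library choice and the simulated labels/scores are my own assumptions; the chapter itself does not use Python):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(4)
# Simulated true labels and classifier scores (e.g. posterior probabilities from LDA).
y_true = rng.binomial(1, 0.3, 1000)
scores = np.where(y_true == 1, rng.normal(1.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000))

fpr, tpr, thresholds = roc_curve(y_true, scores)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_true, scores))       # area under the ROC curve
```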

![[Pasted image 20221017190710.png]]

4.4.3 Quadratic Discriminant Analysis

Assumptions:

  1. (like LDA) the observations within each class are drawn from a Gaussian distribution
  2. (like LDA) each class has its own mean vector
  3. (unlike LDA) each class has its own covariance matrix

the Bayes classifier assigns an observation \(X = x\) to the class for which $$\delta_k(x) = -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)-\frac{1}{2}log|\Sigma_k|+log \pi_k$$ $$= -\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k-\frac{1}{2}log|\Sigma_k|+log\pi_k$$ is largest. The quantity \(x\) appears as a quadratic function in \(\delta_k(x)\), which is where QDA gets its name.
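A hedged numpy sketch of the QDA discriminant on simulated data with genuinely different class covariances (all values below are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
# Two classes with *different* covariance matrices (the QDA setting).
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 300)
X1 = rng.multivariate_normal([1.5, 1.0], [[2.0, 0.8], [0.8, 0.5]], 300)
X, y = np.vstack([X0, X1]), np.array([0] * 300 + [1] * 300)

K = 2
pi_hat = np.array([(y == k).mean() for k in range(K)])
mu_hat = [X[y == k].mean(axis=0) for k in range(K)]
Sigma_hat = [np.cov(X[y == k], rowvar=False) for k in range(K)]

def qda_score(x_new, k):
    """delta_k(x) = -0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) - 0.5 log|Sigma_k| + log pi_k."""
    d = x_new - mu_hat[k]
    return (-0.5 * d @ np.linalg.inv(Sigma_hat[k]) @ d
            - 0.5 * np.log(np.linalg.det(Sigma_hat[k]))
            + np.log(pi_hat[k]))

x_new = np.array([1.0, 0.5])
print([qda_score(x_new, k) for k in range(K)])   # classify to the larger score
```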

why would one prefer LDA to QDA, or vice-versa?

The answer lies in the bias-variance trade-off. With \(p\) predictors, estimating a single covariance matrix requires estimating \(p(p+1)/2\) parameters; QDA estimates a separate covariance matrix for each class, for a total of \(Kp(p+1)/2\) parameters.

LDA is a much less flexible classifier than QDA, and so has substantially lower variance. But there is a trade-off: if LDA’s assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias.

Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.

4.4.4 Naive Bayes

Assumption:

  1. Within the \(k\)th class, the \(p\) predictors are independent: instead of assuming that \(f_k(x)\) belongs to a particular family of multivariate distributions (e.g. multivariate normal), we assume that for \(k = 1,...,K\), \(f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)\), where \(f_{kj}\) is the density function of the \(j\)th predictor among observations in the \(k\)th class.
Why is this assumption so powerful?

Essentially, estimating a p-dimensional density function is challenging because we must consider not only the marginal distribution of each predictor (the distribution of each predictor on its own) but also the joint distribution of the predictors (the association between the different predictors).

It often leads to pretty decent results, especially in settings where n is not large enough relative to p for us to effectively estimate the joint distribution of the predictors within each class.

Essentially, the naive Bayes assumption introduces some bias, but reduces variance, leading to a classifier that works quite well in practice as a result of the bias-variance trade-off.

\[Pr(Y=k|X=x) = \frac{\pi_k f_{k1}(x_1)f_{k2}(x_2)\cdots f_{kp}(x_p)}{\sum_{l=1}^K\pi_l f_{l1}(x_1)f_{l2}(x_2)\cdots f_{lp}(x_p)}
\]

To estimate the one-dimensional density function \(f_{kj}\) using training data \(x_{1j} ,...,x_{nj}\) , we have a few options:

  • If \(X_j\) is quantitative, then we can assume that \(X_j |Y = k \sim N(\mu_{jk}, \sigma^2_{jk})\). While this may sound a bit like QDA, there is one key difference, in that here we are assuming that the predictors are independent; this amounts to QDA with an additional assumption that the class-specific covariance matrix is diagonal.
  • If \(X_j\) is quantitative, then another option is to use a non-parametric estimate for \(f_{kj}\) . A very simple way to do this is by making a histogram for the observations of the \(j\)th predictor within each class. Then we can estimate \(f_{kj}(x_j)\) as the fraction of the training observations in the kth class that belong to the same histogram bin as \(x_j\) . Alternatively, we can use a kernel density estimator, which is essentially a smoothed version of a histogram.
  • If \(X_j\) is qualitative, then we can simply count the proportion of training observations for the jth predictor corresponding to each class.

We expect to see a greater pay-off to using naive Bayes relative to LDA or QDA in instances where p is larger or n is smaller, so that reducing the variance is very important.
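Following the first option above (a univariate Gaussian density for each predictor within each class), here is a minimal sketch of naive Bayes on simulated data; it is equivalent to QDA with diagonal class covariance matrices. The data and class parameters are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
# Simulated data: 2 classes, 3 predictors, independent within each class.
X0 = rng.normal([0.0, 0.0, 0.0], [1.0, 2.0, 0.5], size=(300, 3))
X1 = rng.normal([1.0, -0.5, 0.5], [1.5, 1.0, 0.5], size=(300, 3))
X, y = np.vstack([X0, X1]), np.array([0] * 300 + [1] * 300)

K = 2
pi_hat = np.array([(y == k).mean() for k in range(K)])
mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])
sd_hat = np.array([X[y == k].std(axis=0, ddof=1) for k in range(K)])

def posterior(x_new):
    # f_k(x) = prod_j f_kj(x_j); each f_kj is a univariate Gaussian density.
    fk = np.array([norm.pdf(x_new, mu_hat[k], sd_hat[k]).prod() for k in range(K)])
    unnorm = pi_hat * fk
    return unnorm / unnorm.sum()          # Bayes' theorem

print(posterior(np.array([0.5, 0.0, 0.2])))
```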

4.5 A Comparison of Classification Methods

4.5.1 An Analytical Comparison

Equivalently, we can set \(K\) as the baseline class and assign an observation to the class that maximizes $$log(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)})$$

  1. For LDA, $$log(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)})=a_k+\sum_{j=1}^pb_{kj}x_j$$ where \(a_k = log(\frac{\pi_k}{\pi_K}) - \frac{1}{2}(\mu_k+\mu_K)^T\Sigma^{-1}(\mu_k - \mu_K)\) and \(b_{kj}\) is the \(j\)th component of \(\Sigma^{-1}(\mu_k-\mu_K)\).

  2. For QDA, $$log(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)})=a_k+\sum_{j=1}^pb_{kj}x_j+\sum_{j=1}^p\sum_{l=1}^p c_{kjl}x_jx_l$$ where \(a_k\), \(b_{kj}\), and \(c_{kjl}\) are functions of \(\pi_k\), \(\pi_K\), \(\mu_k\), \(\mu_K\), \(\Sigma_k\) and \(\Sigma_K\).
  3. For naive Bayes, $$log(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}) = a_k + \sum_{j=1}^pg_{kj}(x_j)$$ where \(a_k = log(\frac{\pi_k}{\pi_K})\) and \(g_{kj}(x_j)= log(\frac{f_{kj}(x_j)}{f_{Kj}(x_j)})\). Hence, the right-hand side takes the form of a generalized additive model (Ch. 7).

Summary1:

  • LDA is a special case of QDA with \(c_{kjl} = 0\) for all \(j = 1,...,p\), \(l = 1,...,p\), and \(k = 1,...,K\). (Of course, this is not surprising, since LDA is simply a restricted version of QDA with \(\Sigma_1 = ··· = \Sigma_K = \Sigma.\))
  • Any classifier with a linear decision boundary is a special case of naive Bayes with \(g_{kj}(x_j) = b_{kj}x_j\) . In particular, this means that LDA is a special case of naive Bayes!
  • If we model \(f_{kj} (x_j )\) in the naive Bayes classifier using a one-dimensional Gaussian distribution \(N(\mu_{kj} , \sigma^2_j )\), then we end up with \(g_{kj} (x_j ) = b_{kj}x_j\) where \(b_{kj} = (\mu_{kj} - \mu_{Kj} )/\sigma^2_j\) . In this case, naive Bayes is actually a special case of LDA with \(\Sigma\) restricted to be a diagonal matrix with jth diagonal element equal to \(\sigma^2_j\) .
  • Neither QDA nor naive Bayes is a special case of the other. Naive Bayes can produce a more flexible fit, since any choice can be made for \(g_{kj}(x_j)\). However, it is restricted to a purely additive fit: the terms for different predictors are never multiplied together. By contrast, QDA includes multiplicative terms of the form \(c_{kjl}x_jx_l\). Therefore, QDA has the potential to be more accurate in settings where interactions among the predictors are important in discriminating between classes.
  4. For multinomial logistic regression, $$log(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)})=\beta_{k0}+\sum_{l=1}^p\beta_{kl}x_l$$ This is identical in form to the linear expression for LDA above. In LDA, the coefficients are functions of estimates of \(\pi_k\), \(\mu_k\), and \(\Sigma\), obtained by assuming that \(X_1,...,X_p\) follow a normal distribution within each class. In logistic regression, the coefficients are instead chosen to maximize the likelihood function. Thus, we expect LDA to outperform logistic regression when the normality assumption (approximately) holds, and we expect logistic regression to perform better when it does not.
  5. For K-nearest neighbors (KNN), in order to make a prediction for an observation \(X = x\), the training observations that are closest to \(x\) are identified. Then \(X\) is assigned to the class to which the plurality of these observations belong. Hence KNN is a completely non-parametric approach: no assumptions are made about the shape of the decision boundary (a minimal KNN sketch follows this list).
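The KNN sketch referenced in item 5, using scikit-learn on simulated data (the library and the data are my own assumptions): the boundary is learned purely from the neighbors, with the number of neighbors controlling its smoothness.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
# Simulated two-class data with a non-linear (circular) decision boundary.
X = rng.uniform(-2, 2, size=(500, 2))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) < 1.5).astype(int)

knn = KNeighborsClassifier(n_neighbors=5)       # K controls the smoothness of the boundary
knn.fit(X, y)
print(knn.predict([[0.0, 0.0], [1.8, 1.8]]))    # inside vs. outside the circle
```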

Summary2:

  • Because KNN is completely non-parametric, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear, provided that \(n\) is very large and \(p\) is small.
  • KNN requires a lot of observations relative to the number of predictors—that is, \(n\) much larger than \(p\).
  • KNN is non-parametric, and thus tends to reduce the bias while incurring a lot of variance.
  • In settings where the decision boundary is non-linear but \(n\) is only modest, or \(p\) is not very small, then QDA may be preferred to KNN.
  • Unlike logistic regression, KNN does not tell us which predictors are important.

4.5.2 An Empirical Comparison

  • When the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well.
  • When the boundaries are moderately non-linear, QDA or naive Bayes may give better results.
  • for much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.
  • Finally, recall from Chapter 3 that in the regression setting we can accommodate a non-linear relationship between the predictors and the response by performing regression using transformations of the predictors. A similar approach could be taken in the classification setting. For instance, we could create a more flexible version of logistic regression by including \(X^2, X^3\), and even \(X^4\) as predictors. If we added all possible quadratic terms and cross-products to LDA, the form of the model would be the same as the QDA model, although the parameter estimates would be different.

4.6 Generalized Linear Models

dataset: Bikeshare

4.6.2 Poisson Regression on the Bikeshare Data

Poisson Distribution:

$$P(Y=k)=\frac{e^{-\lambda}\lambda^k}{k!}$$ for \(k=0,1,2,...\)

Properties:

  • \(\lambda = E(Y)=Var(Y)\)

The Poisson distribution is typically used to model counts.

Assumption: by using Poisson regression, we implicitly assume that the mean bike usage in a given hour equals the variance of bike usage during that hour. The model is $$log(\lambda(X_1,...,X_p)) = \beta_0+\beta_1X_1+...+\beta_pX_p$$ To estimate the coefficients, we use the maximum likelihood approach.
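A hedged sketch of Poisson regression with statsmodels; the Bikeshare data are not loaded here, so the predictors, counts, and coefficient values below are simulated stand-ins of my own choosing.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
# Simulated stand-in for hour-level bike-share counts driven by two predictors.
n = 2000
temp = rng.uniform(0, 1, n)
workingday = rng.binomial(1, 0.7, n)
lam = np.exp(1.0 + 2.0 * temp + 0.3 * workingday)   # log(lambda) linear in the predictors
bikers = rng.poisson(lam)

X = sm.add_constant(np.column_stack([temp, workingday]))
fit = sm.GLM(bikers, X, family=sm.families.Poisson()).fit()
print(fit.params)   # estimates of (beta_0, beta_1, beta_2); exp() of a coefficient is a rate ratio
```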

>In fact, the variance in the Bikeshare data appears to be much higher than the mean, a situation referred to as overdispersion. This causes the Z-values to be inflated in Table 4.11. A more careful analysis should account for this overdispersion to obtain more accurate Z-values, and there are a variety of methods for doing this. But they are beyond the scope of this book.

4.6.3 Generalized Linear Models in Greater Generality

We have now seen three types of regression models: linear, logistic and Poisson. They share some common characteristics:

  • Each approach uses predictors \(X_1,...,X_p\) to predict a response \(Y\). We assume that, conditional on \(X_1,...,X_p\), \(Y\) belongs to a certain family of distributions. For linear regression, we typically assume that \(Y\) follows a Gaussian or normal distribution. For logistic regression, we assume that \(Y\) follows a Bernoulli distribution. Finally, for Poisson regression, we assume that \(Y\) follows a Poisson distribution.
  • The mean \(E(Y|X_1,...,X_p)\) can be transformed via a link function \(\eta\) so that the transformed mean \(\eta(E(Y|X_1,...,X_p))\) is a linear function of the predictors. The link functions for linear, logistic and Poisson regression are \(\eta(\mu) = \mu\), \(\eta(\mu) = log(\mu/(1 - \mu))\), and \(\eta(\mu) = log(\mu)\), respectively.
  • The Gaussian, Bernoulli and Poisson distributions are all members of a wider class of distributions, known as the exponential family. In general, we can perform a regression by modeling the response \(Y\) as coming from a particular member of the exponential family, and then transforming the mean of the response so that the transformed mean is a linear function of the predictors. Any regression approach that follows this very general recipe is known as a generalized linear model (GLM). Thus, linear regression, logistic regression, and Poisson regression are three examples of GLMs. Other examples not covered here include Gamma regression and negative binomial regression.
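To illustrate the shared GLM recipe, the same statsmodels call can swap families (and hence default link functions); this is my own sketch on simulated data, not the book's lab.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.normal(size=500)
X = sm.add_constant(x)

y_gauss   = 1.0 + 2.0 * x + rng.normal(size=500)                  # linear regression
y_binom   = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.5 * x))))   # logistic regression
y_poisson = rng.poisson(np.exp(0.2 + 0.8 * x))                    # Poisson regression

for name, y, fam in [("Gaussian (identity link)", y_gauss,   sm.families.Gaussian()),
                     ("Binomial (logit link)",    y_binom,   sm.families.Binomial()),
                     ("Poisson (log link)",       y_poisson, sm.families.Poisson())]:
    fit = sm.GLM(y, X, family=fam).fit()   # same GLM recipe, different family and link
    print(name, fit.params)
```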
