Statistical Machine Learning-Introduction to Statistical Learning-Reading Notes-CH4-Classification
response variable:
- quantitative
- qualitative / categorical
methods for classification
- first predict the probability that the observation belongs to each category of the qualitative response, then assign the observation to the class with the highest predicted probability.
Reading tips:
- The discussion of logistic regression is used as a jumping-off point for a discussion of generalized linear models, and in particular Poisson regression.
4.1 Overview
In the classification setting, we have a set of training observations \((x_1,y_1),...,(x_n,y_n)\) that we can use to build a classifier.
Data Set: Default, columns income, balance,
4.2 Why not Linear Regression?
For example, $$
Y = \left\{
\begin{array}{ll}
    1     & \text{if stroke}\\
    2     & \text{if drug overdose}\\
    3     & \text{if epileptic seizure}
\end{array}
\right.
$$
Facts:
- This coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure.
- \(0.5*\text{epileptic seizure} + 0.5*\text{stroke} == 1 * \text{drug overdose}\)
- If the response variable’s values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression.
- Curiously, it turns out that the classifications that we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure we discuss in Section 4.4.
- For a binary response with a \(0/1\) coding as above, regression by least squares is not completely unreasonable: it can be shown that the \(X\hat \beta\) obtained using linear regression is in fact an estimate of \(Pr(\text{drug overdose}|X)\) in this special case. However, if we use linear regression, some of our estimates might be outside the [0, 1] interval, making them hard to interpret as probabilities.
Summary: There are at least two reasons not to perform classification using a regression method:
(a) a regression method cannot accommodate a qualitative response with more than two classes;
(b) a regression method will not provide meaningful estimates of \(Pr(Y |X)\), even with just two classes.
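Point (b) can be checked numerically. The sketch below fits a least-squares line to a 0/1 response and looks at the range of the fitted values. The data are synthetic and made up for illustration (they are not the Default data), so this is only a demonstration of the effect, not a reproduction of anything in the book.

```python
import numpy as np

# Synthetic data (an assumption for illustration): the response flips from 0 to 1 at x = 5.
x = np.linspace(0.0, 10.0, 21)
y = (x > 5.0).astype(float)

# Ordinary least squares fit of y on x with an intercept.
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

# Fitted values at the ends of the range fall outside [0, 1],
# so they cannot be read as probabilities.
print("smallest fitted value:", fitted.min())
print("largest fitted value:", fitted.max())
```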
4.3 Logistic Regression
Logistic regression models the probability that Y belongs to a particular category.
4.3.1 The Logistic Model
First, consider a linear regression model: \(p(X)=\beta_0+\beta_1 X\). Any time a straight line is fit to a binary response that is coded as \(0\) or \(1\), in principle we can always predict \(p(X) < 0\) for some values of \(X\) and \(p(X) > 1\) for others (unless the range of \(X\) is limited). To avoid this problem, we must model \(p(X)\) using a function that gives outputs between \(0\) and \(1\) for all values of \(X\). In logistic regression, we use the logistic function, $$p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$$
To fit the model above, we use a method called maximum likelihood estimation (MLE).
After a bit of manipulation of above equation, we find that $$\frac{p(X)}{1 - p(X)} = e^{\beta_0+\beta_1X}$$
The quantity \(p(X)/[1-p(X)]\) is called the odds, and can take on any value between \(0\) and \(\infty\).
By taking the logarithm of both sides, we arrive at $$log\Big(\frac{p(X)}{1-p(X)}\Big)=\beta_0+\beta_1X$$
The left-hand side is called the log odds or logit. We see that the logistic regression model has a logit that is linear in \(X\).
The amount that \(p(X)\) changes due to a one-unit change in \(X\) depends on the current value of \(X\). But regardless of the value of \(X\), if \(\beta_1\) is positive then increasing \(X\) will be associated with increasing \(p(X)\), and if \(\beta_1\) is negative then increasing \(X\) will be associated with decreasing \(p(X)\).
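The relationship between probability, odds and logit can be sketched in a few lines. The coefficient values below are hypothetical and chosen only for illustration; the function names are likewise my own.

```python
import numpy as np

def logistic(x, beta0, beta1):
    """p(X) = e^(beta0 + beta1*x) / (1 + e^(beta0 + beta1*x))."""
    eta = beta0 + beta1 * x
    return 1.0 / (1.0 + np.exp(-eta))

def odds(p):
    """Odds p / (1 - p), which lie in (0, infinity)."""
    return p / (1.0 - p)

def logit(p):
    """Log odds; linear in x under the logistic model."""
    return np.log(odds(p))

beta0, beta1 = -3.0, 0.5              # hypothetical coefficients
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
p = logistic(x, beta0, beta1)

# A one-unit increase in x changes p(X) by an amount that depends on x ...
dp = logistic(x + 1.0, beta0, beta1) - p
# ... but always changes the log odds by exactly beta1.
dlogit = logit(logistic(x + 1.0, beta0, beta1)) - logit(p)
print(dp)
print(dlogit)
```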
4.3.2 Estimating the Regression Coefficients
The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for \(\beta_0\) and \(\beta_1\) such that the predicted probability \(\hat p(x_i)\) of default for each individual corresponds as closely as possible to the individual's observed default status.
Here, the likelihood function is $$\ell(\beta_0,\beta_1)=\prod_{i:y_i = 1}p(x_i)\prod_{i':y_{i'} = 0} (1 - p(x_{i'}))$$
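As a rough illustration of what maximizing this likelihood involves, the sketch below maximizes the log of the likelihood with Newton's method. The choice of Newton's method, the function names and the tiny data set are all assumptions made for this example; the notes do not say which optimizer is used.

```python
import numpy as np

def log_likelihood(beta, x, y):
    """Log of the likelihood: sum_i [ y_i * eta_i - log(1 + e^eta_i) ]."""
    eta = beta[0] + beta[1] * x
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def fit_logistic(x, y, n_iter=25):
    """Maximize the log-likelihood with Newton's method (one possible optimizer)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                       # gradient of the log-likelihood
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Hypothetical data: the classes overlap, so the maximum likelihood estimate is finite.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1], dtype=float)

beta_hat = fit_logistic(x, y)
print("beta0, beta1:", beta_hat)
print("log-likelihood:", log_likelihood(beta_hat, x, y))
```

If the two classes were perfectly separated, the estimates would grow without bound and this loop would fail to settle, which is the instability mentioned in Section 4.4.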
4.3.3 Making Predictions
Maximum likelihood is a very general approach that is used to fit many of the non-linear models that we examine throughout this book.
For qualitative predictors with 2 categories, we may simply create a dummy variable that takes the values 0/1.
4.3.4 Multiple Logistic Regression
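The logistic model generalizes to \(p\) predictors as $$log\Big(\frac{p(X)}{1-p(X)}\Big)=\beta_0 + \beta_1X_1+...+\beta_pX_p$$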
where \(X = (X_1,...,X_p)\) are predictors.
The above equation is equivalent to $$p(X) = \frac{e^{\beta_0 + \beta_1X_1+...+\beta_pX_p}}{1+e^{\beta_0 + \beta_1X_1+...+\beta_pX_p}}$$
confounding
For the Default dataset, the negative coefficient for student in the multiple logistic regression indicates that for a fixed value of balance and income, a student is less likely to default than a non-student.
But the bar plot, which shows the default rates for students and non-students averaged over all values of balance and income, suggests the opposite effect: the overall student default rate is higher than the non-student default rate.
Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students.
![[Pasted image 20221017160629.png]]
![[Pasted image 20221017160707.png]]
(Confounding) This simple example illustrates the dangers and subtleties associated with performing regressions involving only a single predictor when other predictors may also be relevant. As in the linear regression setting, the results obtained using one predictor may be quite different from those obtained using multiple predictors, especially when there is correlation among the predictors.
4.3.5 Multinomial Logistic Regression
Want to do: classify a response variable that has more than two classes
However, the logistic regression approach that we have seen in this section only allows for K = 2 classes for the response variable.
It turns out that it is possible to extend the two-class logistic regression approach to the setting of K > 2 classes. This extension is sometimes known as multinomial logistic regression.
- we first select a single class to serve as the baseline; without loss of generality, we select the \(K\)th class for this role. Then we replace the model by $$Pr(Y=k|X=x)=\frac{e^{\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p}}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}$$ for \(k=1,...,K-1\) and $$Pr(Y=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}$$
- so that $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\Big)=\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p$$ for \(k=1,...,K-1\). This indicates that the log odds between any pair of classes is linear in the features.
- The coefficient estimates will differ between two models fitted with different choices of baseline, but the fitted values (predictions), the log odds between any pair of classes, and the other key model outputs will remain the same.
Softmax coding
The softmax coding is equivalent to the coding just described in the sense that the fitted values, log odds between any pair of classes, and other key model outputs will remain the same, regardless of coding.
In the softmax coding, rather than selecting a baseline class, we treat all K classes symmetrically, and assume that for \(k = 1,...,K\), $$Pr(Y=k|X=x)=\frac{e^{\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p}}{\sum_{l=1}^{K}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}$$ Thus, rather than estimating coefficients for K − 1 classes, we actually estimate coefficients for all K classes. The log odds ratio between the \(k\)th and \(k'\)th classes equals $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=k'|X=x)}\Big)=(\beta_{k0}-\beta_{k' 0}) + (\beta_{k1}-\beta_{k' 1}) x_1+...+(\beta_{kp}-\beta_{k' p})x_p $$
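The claimed equivalence of the two codings can be checked numerically. In the sketch below the coefficient matrix is hypothetical; subtracting the last row from every row turns a softmax coding (each class's exponentiated score divided by the sum over all K classes) into a baseline coding with class K as the baseline, and the predicted probabilities should not change.

```python
import numpy as np

def softmax_probs(B, x):
    """Softmax coding. B has shape (K, p+1); row k holds (beta_k0, beta_k1, ..., beta_kp)."""
    scores = B[:, 0] + B[:, 1:] @ x
    w = np.exp(scores - scores.max())   # shifting all scores leaves the ratios unchanged
    return w / w.sum()

def baseline_probs(B_base, x):
    """Baseline coding. B_base has shape (K-1, p+1); class K is the baseline."""
    w = np.exp(B_base[:, 0] + B_base[:, 1:] @ x)
    denom = 1.0 + w.sum()
    return np.append(w / denom, 1.0 / denom)

# Hypothetical coefficients for K = 3 classes and p = 2 predictors.
B = np.array([[ 0.2,  1.0, -0.5],
              [-0.3,  0.4,  0.8],
              [ 0.5, -0.7,  0.1]])
x = np.array([1.5, -2.0])

B_base = (B - B[-1])[:-1]               # coefficients relative to class K
p_soft = softmax_probs(B, x)
p_base = baseline_probs(B_base, x)
print(p_soft)
print(p_base)
```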
4.4 Generative Models for Classification
Logistic regression directly models \(Pr(Y = k|X = x)\). In statistical jargon, it models the conditional distribution of the response Y, given the predictor(s) X.
In this new approach, we model the distribution of the predictors X separately in each of the response classes (i.e. for each value of Y ). We then use Bayes’ theorem to flip these around into estimates for \(Pr(Y = k|X = x)\). When the distribution of X within each class is assumed to be normal, it turns out that the model is very similar in form to logistic regression.
Advantages over logistic regression:
- When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable. The methods that we consider in this section do not suffer from this problem.
- If the distribution of the predictors X is approximately normal in each of the classes and the sample size is small, then the approaches in this section may be more accurate than logistic regression.
- The methods in this section can be naturally extended to the case of more than two response classes. (In the case of more than two response classes, we can also use multinomial logistic regression from Section 4.3.5.)
Let \(f_k(X) ≡ Pr(X|Y = k)\) denote the density function of X for an observation that comes from the \(k\)th class. Then Bayes’ theorem states that $$Pr(Y= k|X = x)=\frac{\pi_kf_k(x)}{\sum_{l=1}^K\pi_l f_l(x)}$$
- \(p_k(x) = Pr(Y = k|X = x)\) is the posterior probability that an observation \(X = x\) belongs to the \(k\)th class.
- In general, estimating the prior probability \(\pi_k = Pr(Y = k)\) is easy if we have a random sample from the population: we simply compute the fraction of the training observations that belong to the \(k\)th class.
- As we will see, to estimate \(f_k(x)\), we will typically have to make some simplifying assumptions.
4.4.1 Linear Discriminant Analysis for p=1
Assumptions:
- \(f_k(x)\) is normal or Gaussian
- \(\sigma_1^2=...=\sigma_K^2=\sigma^2\) (a shared variance)
so we have $$p_k(x)=\frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)}{\sum_{l=1}^K \pi_l\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)}$$
Taking log,  this is equivalent to assigning the observation to the class for which \(\delta_k(x) = x \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}+log(\pi_k)\) is largest.
When \(K = 2\) and \(\pi_1 = \pi_2\), the Bayes decision boundary is the point for which \(\delta_1(x) = \delta_2(x)\); one can show that this amounts to $$x=\frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \frac{\mu_1 + \mu_2}{2}$$
In practice, the following estimates are used: $$\hat \mu_k = \frac{1}{n_k}\sum_{i:y_i=k}x_i$$$$\hat \sigma^2 = \frac{1}{n-K} \sum_{k=1}^{K}\sum_{i:y_i=k}(x_i-\hat \mu_k)^2$$
The discriminant functions \(\hat \delta_k(x)\) are linear functions of \(x\).
Summary: the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance \(\sigma^2\), and plugging estimates for these parameters into the Bayes classifier.
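The plug-in recipe for p = 1 can be sketched as follows. The function names and the toy data are hypothetical; the estimates follow the formulas for \(\hat \mu_k\) and \(\hat \sigma^2\) given above, with \(\hat \pi_k\) taken as the class fraction.

```python
import numpy as np

def fit_lda_1d(x, y):
    """Estimate priors, class means and the pooled variance (denominator n - K)."""
    classes = np.unique(y)
    n, K = len(x), len(classes)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([x[y == k].mean() for k in classes])
    pooled = sum(((x[y == k] - m) ** 2).sum() for k, m in zip(classes, means))
    return classes, priors, means, pooled / (n - K)

def discriminants(x0, priors, means, sigma2):
    """delta_k(x0) = x0 * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k), for every class k."""
    return x0 * means / sigma2 - means ** 2 / (2.0 * sigma2) + np.log(priors)

# Hypothetical one-dimensional training data with two equally sized classes.
x = np.array([-1.5, -1.0, -0.5, 0.0, 1.0, 1.5, 2.0, 2.5])
y = np.array([1, 1, 1, 1, 2, 2, 2, 2])

classes, priors, means, sigma2 = fit_lda_1d(x, y)
boundary = means.mean()   # with K = 2 and equal priors: (mu_1 + mu_2) / 2

for x0 in (0.4, 0.6):
    label = classes[np.argmax(discriminants(x0, priors, means, sigma2))]
    print(x0, "->", label)
```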
4.4.2 Linear Discriminant Analysis for p > 1
Assumptions:
- X = (X1, X2,...,Xp) is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific multivariate mean vector and a common covariance matrix.
Note: the multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.
An example in which \(Var(X_1) = Var(X_2)\) and \(Cor(X_1, X_2) = 0\); this surface has a characteristic bell shape (like a centrally symmetric wotou, a cone-shaped steamed bun). If \(Cor(X_1, X_2) \neq 0\), the surface looks like a wotou that has been squashed.
![[Pasted image 20221017165257.png]]
We write \(X \sim N(\mu, \Sigma)\). Here \(E(X) = \mu\) is the mean of X (a vector with p components), and \(Cov(X) = \Sigma\) is the p × p covariance matrix of \(X\). Formally, the multivariate Gaussian density is defined as $$f(x)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}exp(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu))$$
The discriminant functions $$\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + log \pi_k$$ are linear functions of \(x\). The Bayes decision boundaries are the values of \(x\) for which \(\delta_k(x) = \delta_l(x)\), i.e. $$x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k = x^T \Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l$$ for \(k \neq l\) (assuming \(\pi_k = \pi_l\) for all k, l).
Summary:
- we need to estimate the unknown parameters \(\mu_1,...,\mu_K, \pi_1,..., \pi_K\), and \(\Sigma\);
- To assign a new observation \(X = x\), LDA plugs these estimates to obtain quantities \(\hat \delta_k(x)\), and classifies to the class for which \(\hat \delta_k(x)\) is largest.
- \(\hat \delta_k(x)\) is a linear function of \(x\); that is, the LDA decision rule depends on \(x\) only through a linear combination of its elements.
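The three steps in this summary can be sketched with NumPy. The function names and the toy data are hypothetical, and the pooled covariance with denominator \(n - K\) is assumed here as the multivariate analogue of the \(\hat \sigma^2\) estimate used for p = 1.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, class mean vectors and the common (pooled) covariance matrix."""
    classes = np.unique(y)
    n, p = X.shape
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigma = np.zeros((p, p))
    for k, mu in zip(classes, means):
        R = X[y == k] - mu
        Sigma += R.T @ R
    Sigma /= n - len(classes)          # assumed analogue of the p = 1 estimate
    return classes, priors, means, Sigma

def lda_predict(X_new, classes, priors, means, Sigma):
    """Assign each row of X_new to the class with the largest discriminant."""
    A = np.linalg.solve(Sigma, means.T)                      # column k is Sigma^{-1} mu_k
    const = -0.5 * np.sum(means.T * A, axis=0) + np.log(priors)
    delta = X_new @ A + const                                # linear in x
    return classes[np.argmax(delta, axis=1)]

# Hypothetical two-class data with p = 2 predictors.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.5],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0], [4.0, 4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

params = fit_lda(X, y)
pred_train = lda_predict(X, *params)
pred_new = lda_predict(np.array([[0.5, 0.5], [3.5, 3.5]]), *params)
print(pred_train, pred_new)
```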
Class-specific performance is also important in medicine and biology, where the terms sensitivity and specificity characterize the performance of a classifier or screening test.
LDA is trying to approximate the Bayes classifier, which has the lowest total error rate out of all classifiers.
The Bayes classifier works by assigning an observation to the class for which the posterior probability \(p_k(X)\) is greatest. Thus, in the two-class case, the Bayes classifier, and by extension LDA, uses a threshold of 50% for the posterior probability of default in order to assign an observation to the default class. However, this threshold can be changed to 80% or another value.
Figure 4.7 illustrates the trade-off that results from modifying the threshold value for the posterior probability of default.
As the threshold is reduced, the error rate among individuals who default decreases steadily, but the error rate among the individuals who do not default increases.
![[Pasted image 20221017190149.png]]
ROC curve
The ROC (receiver operating characteristics) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.
![[Pasted image 20221017190418.png]]
The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the (ROC) curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC, the better the classifier.
![[Pasted image 20221017190710.png]]
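The threshold trade-off and the AUC can be illustrated on a handful of scores. The scores and labels below are hypothetical. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, counting ties as one half), which equals the area under the ROC curve.

```python
import numpy as np

def sensitivity_specificity(scores, labels, threshold):
    """Classify as positive when score >= threshold, then report both rates."""
    pred = scores >= threshold
    sensitivity = np.mean(pred[labels == 1])     # true positive rate
    specificity = np.mean(~pred[labels == 0])    # true negative rate
    return sensitivity, specificity

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

# Hypothetical posterior probabilities and true labels (1 = default).
scores = np.array([0.10, 0.40, 0.35, 0.80])
labels = np.array([0, 0, 1, 1])

for t in (0.5, 0.3):
    print("threshold", t, sensitivity_specificity(scores, labels, t))
print("AUC:", auc(scores, labels))
```

Lowering the threshold from 0.5 to 0.3 raises the sensitivity and lowers the specificity in this toy example, which is the trade-off described above.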
4.4.3 Quadratic Discriminant Analysis
Assumptions:
- (like LDA) the observations from each class are drawn from a Gaussian distribution
- (like LDA) each class has its own mean vector
- (unlike LDA) each class has its own covariance matrix
The Bayes classifier assigns an observation \(X = x\) to the class for which $$\begin{aligned}\delta_k(x) &= -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)-\frac{1}{2}log|\Sigma_k|+log \pi_k \\
&= -\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k-\frac{1}{2}log|\Sigma_k|+log\pi_k\end{aligned}$$ is largest. The quantity \(x\) appears as a quadratic function.
Why would one prefer LDA to QDA, or vice-versa?
The answer lies in the bias-variance trade-off. When there are p predictors, then estimating a covariance matrix requires estimating p(p+1)/2 parameters. QDA estimates a separate covariance matrix for each class, for a total of Kp(p+1)/2 parameters.
LDA is a much less flexible classifier than QDA, and so has substantially lower variance. But there is a trade-off: if LDA’s assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias.
Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.
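To get a feel for how quickly the number of covariance parameters grows, the counts \(p(p+1)/2\) and \(Kp(p+1)/2\) can be tabulated. The helper names and the choice of K = 3 are hypothetical.

```python
def covariance_params(p):
    """Free parameters in one symmetric p x p covariance matrix: p(p+1)/2."""
    return p * (p + 1) // 2

def qda_covariance_params(p, K):
    """QDA estimates one covariance matrix per class: K * p(p+1)/2."""
    return K * covariance_params(p)

for p in (2, 10, 50):
    print(p, covariance_params(p), qda_covariance_params(p, K=3))
```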
4.4.4 Naive Bayes
Assumption:
- Instead of assuming that the functions \(f_1,...,f_K\) belong to a particular family of distributions (e.g. multivariate normal), we assume that within the \(k\)th class, the \(p\) predictors are independent, i.e. for \(k = 1,...,K\), \(f_k(x) = f_{k1}(x_1) × f_{k2}(x_2) × ··· × f_{kp}(x_p)\), where \(f_{kj}\) is the density function of the \(j\)th predictor among observations in the \(k\)th class.
Why is this assumption so powerful?
Essentially, estimating a p-dimensional density function is challenging because we must consider not only the marginal distribution of each predictor (that is, the distribution of each predictor on its own) but also the joint distribution of the predictors (that is, the association between the different predictors).
In practice, the naive Bayes assumption often leads to pretty decent results, especially in settings where n is not large enough relative to p for us to effectively estimate the joint distribution of the predictors within each class.
Essentially, the naive Bayes assumption introduces some bias, but reduces variance, leading to a classifier that works quite well in practice as a result of the bias-variance trade-off.
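Plugging the naive Bayes assumption into Bayes' theorem gives the posterior probability $$Pr(Y=k|X=x)=\frac{\pi_k \times f_{k1}(x_1) \times ... \times f_{kp}(x_p)}{\sum_{l=1}^K \pi_l \times f_{l1}(x_1) \times ... \times f_{lp}(x_p)}$$ for \(k = 1,...,K\).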
To estimate the one-dimensional density function \(f_{kj}\) using training data \(x_{1j} ,...,x_{nj}\) , we have a few options:
- If \(X_j\) is quantitative, then we can assume that \(X_j |Y = k \sim N(\mu_{jk}, \sigma^2_{jk})\). While this may sound a bit like QDA, there is one key difference, in that here we are assuming that the predictors are independent; this amounts to QDA with an additional assumption that the class-specific covariance matrix is diagonal.
- If \(X_j\) is quantitative, then another option is to use a non-parametric estimate for \(f_{kj}\) . A very simple way to do this is by making a histogram for the observations of the \(j\)th predictor within each class. Then we can estimate \(f_{kj}(x_j)\) as the fraction of the training observations in the kth class that belong to the same histogram bin as \(x_j\) . Alternatively, we can use a kernel density estimator, which is essentially a smoothed version of a histogram.
- If \(X_j\) is qualitative, then we can simply count the proportion of training observations for the jth predictor corresponding to each class.
We expect to see a greater pay-off to using naive Bayes relative to LDA or QDA in instances where p is larger or n is smaller, so that reducing the variance is very important.
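A minimal sketch of the first option (a one-dimensional Gaussian for each quantitative predictor within each class) is given below. The function names and toy data are hypothetical, and working on the log scale is a numerical convenience I have added, not something the notes prescribe.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per class: prior, and a separate mean and variance for every predictor."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    variances = np.array([X[y == k].var(axis=0, ddof=1) for k in classes])
    return classes, priors, means, variances

def nb_posterior(x, priors, means, variances):
    """Posterior Pr(Y = k | X = x) with f_k(x) = prod_j N(x_j; mu_kj, sigma^2_kj)."""
    log_dens = -0.5 * (np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances)
    log_joint = np.log(priors) + log_dens.sum(axis=1)   # log(pi_k) + sum_j log f_kj(x_j)
    w = np.exp(log_joint - log_joint.max())              # normalise on a stable scale
    return w / w.sum()

# Hypothetical two-class data with p = 2 quantitative predictors.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.5],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0], [4.0, 4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

classes, priors, means, variances = fit_gaussian_nb(X, y)
post_a = nb_posterior(np.array([0.5, 0.5]), priors, means, variances)
post_b = nb_posterior(np.array([3.5, 3.5]), priors, means, variances)
print(post_a, post_b)
```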
4.5 A Comparison of Classification Methods
4.5.1 An Analytical Comparison
Equivalently, we can set \(K\) as the baseline class and assign an observation to the class that maximizes $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\Big)$$
- For LDA, $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\Big)=a_k+\sum_{j=1}^pb_{kj}x_j$$ where \(a_k = log(\frac{\pi_k}{\pi_K}) - \frac{1}{2}(\mu_k+\mu_K)^T\Sigma^{-1}(\mu_k - \mu_K)\) and \(b_{kj}\) is the \(j\)th component of \(\Sigma^{-1}(\mu_k-\mu_K)\).
- For QDA, $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\Big)=a_k+\sum_{j=1}^pb_{kj}x_j+\sum_{j=1}^p\sum_{l=1}^p c_{kjl}x_jx_l$$ where \(a_k\), \(b_{kj}\), and \(c_{kjl}\) are functions of \(\pi_k\), \(\pi_K\), \(\mu_k\), \(\mu_K\), \(\Sigma_k\) and \(\Sigma_K\).
- For naive Bayes, $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\Big) = a_k + \sum_{j=1}^pg_{kj}(x_j)$$ where \(a_k = log(\frac{\pi_k}{\pi_K})\) and \(g_{kj}(x_j)= log(\frac{f_{kj}(x_j)}{f_{Kj}(x_j)})\). Hence, the right-hand side takes the form of a generalized additive model (CH7).
Summary1:
- LDA is a special case of QDA with \(c_{kjl} = 0\) for all \(j = 1,...,p\), \(l = 1,...,p\), and \(k = 1,...,K\). (Of course, this is not surprising, since LDA is simply a restricted version of QDA with \(\Sigma_1 = ··· = \Sigma_K = \Sigma.\))
- Any classifier with a linear decision boundary is a special case of naive Bayes with \(g_{kj}(x_j) = b_{kj}x_j\) . In particular, this means that LDA is a special case of naive Bayes!
- If we model \(f_{kj} (x_j )\) in the naive Bayes classifier using a one-dimensional Gaussian distribution \(N(\mu_{kj} , \sigma^2_j )\), then we end up with \(g_{kj} (x_j ) = b_{kj}x_j\) where \(b_{kj} = (\mu_{kj} - \mu_{Kj} )/\sigma^2_j\) . In this case, naive Bayes is actually a special case of LDA with \(\Sigma\) restricted to be a diagonal matrix with jth diagonal element equal to \(\sigma^2_j\) .
- Neither QDA nor naive Bayes is a special case of the other. Naive Bayes can produce a more flexible fit, since any choice can be made for \(g_{kj} (x_j )\). However, it is restricted to a purely additive fit: these terms are never multiplied. By contrast, QDA includes multiplicative terms of the form \(c_{kjl}x_jx_l\). Therefore, QDA has the potential to be more accurate in settings where interactions among the predictors are important in discriminating between classes.
- For multinomial logistic regression, $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\Big)=\beta_{k0}+\sum_{l=1}^p\beta_{kl}x_l$$ This is identical to the linear form of LDA. In LDA, the coefficients are functions of estimates of \(\pi_k\), \(\pi_K\), \(\mu_k\), \(\mu_K\) and \(\Sigma\), obtained by assuming that \(X_1,...,X_p\) follow a normal distribution within each class. In logistic regression, by contrast, the coefficients are chosen to maximize the likelihood function. Thus, we expect LDA to outperform logistic regression when the normality assumption (approximately) holds, and we expect logistic regression to perform better when it does not.
- For K-nearest neighbors (KNN), in order to make a prediction for an observation \(X = x\), the training observations that are closest to \(x\) are identified. Then \(X\) is assigned to the class to which the plurality of these observations belong. Hence KNN is a completely non-parametric approach: **no assumptions are made about the shape of the decision boundary**.
Summary2:
- Because KNN is completely non-parametric, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear, provided that \(n\) is very large and \(p\) is small.
- KNN requires a lot of observations relative to the number of predictors—that is, \(n\) much larger than \(p\).
- KNN is non-parametric, and thus tends to reduce the bias while incurring a lot of variance.
- In settings where the decision boundary is non-linear but \(n\) is only modest, or \(p\) is not very small, then QDA may be preferred to KNN.
- Unlike logistic regression, KNN does not tell us which predictors are important.
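The KNN rule described above fits in a few lines. This sketch assumes Euclidean distance and breaks ties in favour of the smallest class label; neither choice is specified in the notes, and the function name and toy data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Assign x_new to the plurality class among its k nearest training observations."""
    dists = np.linalg.norm(X_train - x_new, axis=1)    # Euclidean distance (an assumption)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # ties go to the smallest label

# Hypothetical two-class training data with p = 2 predictors.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.5],
                    [3.0, 3.0], [4.0, 3.0], [3.0, 4.0], [4.0, 4.5]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.2, 0.2])))
print(knn_predict(X_train, y_train, np.array([3.6, 3.6])))
```

No coefficients are estimated here, which is why KNN gives no direct indication of which predictors matter.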
4.5.2 An Empirical Comparison
- When the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well.
- When the boundaries are moderately non-linear, QDA or naive Bayes may give better results.
- For much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.
- Finally, recall from Chapter 3 that in the regression setting we can accommodate a non-linear relationship between the predictors and the response by performing regression using transformations of the predictors. A similar approach could be taken in the classification setting. For instance, we could create a more flexible version of logistic regression by including \(X^2, X^3\), and even \(X^4\) as predictors. If we added all possible quadratic terms and cross-products to LDA, the form of the model would be the same as the QDA model, although the parameter estimates would be different.
4.6 Generalized Linear Models
dataset: Bikeshare
4.6.2 Poisson Regression on the Bikeshare Data
Poisson Distribution: $$Pr(Y=k)=\frac{e^{-\lambda}\lambda^k}{k!} \text{ for } k = 0, 1, 2, ...$$
properties:
- $\lambda = E(Y)=Var(Y)$
Poisson distribution is typically used to model **counts**.
Assumption: by using Poisson regression, we implicitly assume that **mean** bike usage in a given hour equals the **variance** of bike usage during that hour. The model lets the mean \(\lambda\) depend on the predictors through $$log(\lambda(X_1,...,X_p)) = \beta_0+\beta_1X_1+...+\beta_pX_p$$ To estimate the coefficients, we use the MLE approach.
>In fact, the variance in the Bikeshare data appears to be much higher than the mean, a situation referred to as overdispersion. This causes the Z-values to be inflated in Table 4.11. A more careful analysis should account for this overdispersion to obtain more accurate Z-values, and there are a variety of methods for doing this. But they are beyond the scope of this book.
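
To make the maximum likelihood step for Poisson regression concrete, the sketch below fits the log-linear model with Newton's method on a tiny made-up data set. The counts are hypothetical (they are not the Bikeshare data), the function name is my own, and the starting value (intercept set to the log of the mean count) is a choice made here for stability.

```python
import numpy as np

def fit_poisson(X, y, n_iter=50):
    """Newton's method for log(lambda) = X @ beta; the first column of X is the intercept."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())              # start from the intercept-only fit
    for _ in range(n_iter):
        lam = np.exp(X @ beta)
        grad = X.T @ (y - lam)              # gradient of the Poisson log-likelihood
        hess = X.T @ (X * lam[:, None])     # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Hypothetical counts that grow roughly exponentially in x.
x = np.arange(6, dtype=float)
y = np.array([1, 1, 2, 4, 6, 10], dtype=float)
X = np.column_stack([np.ones_like(x), x])

beta_hat = fit_poisson(X, y)
lam_hat = np.exp(X @ beta_hat)
print("coefficients:", beta_hat)
print("fitted means:", lam_hat)
```

Because of the log link, the fitted means are always positive, and a one-unit increase in x multiplies the fitted mean by \(e^{\beta_1}\).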
### 4.6.3 Generalized Linear Models in Greater Generality
three types of regression models: linear, logistic and Poisson.
common characteristics:
- Each approach uses predictors $X_1,...,X_p$ to predict a response $Y$ . We assume that, conditional on $X_1,...,X_p$, $Y$ belongs to a certain family of distributions. For linear regression, we typically assume that $Y$ follows a **Gaussian** or normal distribution. For logistic regression, we assume that $Y$ follows a **Bernoulli** distribution. Finally, for Poisson regression, we assume that $Y$ follows a **Poisson** distribution.
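- Each approach models the mean of $Y$ as a function of the predictors. $E(Y|X_1,...,X_p)$ can be expressed using a **link function**, $\eta$, which transforms the mean so that the transformed mean is a linear function of the predictors: $\eta(E(Y|X_1,...,X_p)) = \beta_0+\beta_1X_1+...+\beta_pX_p$. The link functions for linear, logistic and Poisson regression are $\eta(\mu) = \mu$, $\eta(\mu) = log(\mu/(1 - \mu))$, and $\eta(\mu) = log(\mu)$, respectively.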
- The Gaussian, Bernoulli and Poisson distributions are all members of a wider class of distributions, known as the **exponential family**. In general, we can perform a regression by modeling the response $Y$ as coming from a particular member of the exponential family, and then transforming the mean of the response so that the transformed mean is a linear function of the predictors. Any regression approach that follows this very general recipe is known as a **generalized linear model (GLM)**. Thus, linear regression, logistic regression, and Poisson regression are three examples of GLMs. Other examples not covered here include Gamma regression and negative binomial regression.
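The three link functions and their inverses can be collected in one small table, which makes the shared GLM recipe visible: compute the linear predictor, then map it back to the mean with the inverse link. The dictionary layout, the function name and the coefficient values are hypothetical.

```python
import numpy as np

# (link, inverse link) pairs for the three models discussed above.
LINKS = {
    "linear":   (lambda mu: mu,                       lambda eta: eta),
    "logistic": (lambda mu: np.log(mu / (1.0 - mu)),  lambda eta: 1.0 / (1.0 + np.exp(-eta))),
    "poisson":  (lambda mu: np.log(mu),               lambda eta: np.exp(eta)),
}

def glm_mean(X, beta, family):
    """E(Y | X) = inverse_link(X @ beta), where X includes an intercept column."""
    _, inverse_link = LINKS[family]
    return inverse_link(X @ beta)

X = np.array([[1.0, -2.0], [1.0, 0.0], [1.0, 3.0]])   # intercept plus one predictor
beta = np.array([0.5, 0.8])                           # hypothetical coefficients

for family in LINKS:
    print(family, glm_mean(X, beta, family))
```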