Jul 10, 2009; 10:46pm

predict.glm -> which class does it predict?

Hi,

I have a question about logistic regression in R.

Suppose I have a small list of proteins P1, P2, P3 that predict a 
two-class target T, say cancer/noncancer. Let's further say I know that I 
can build a simple logistic regression model in R:

model <- glm(T ~ ., data=d.f(Y), family=binomial)   (Y is the dataset of 
the proteins).

This works fine. T is a factor with levels cancer, noncancer. 
The proteins are numeric.

Now I want to use predict.glm to predict new data:

predict(model, newdata=testsamples, type="response")    (testsamples is 
a small set of new samples).

The result is a vector of probabilities, one for each sample in 
testsamples. But the probability of WHAT? Of belonging to the first level 
of T? Or to the second level of T?

Is the following expression 
factor(predict(model, newdata=testsamples, type="response") >= 0.5) 
TRUE when the new sample is classified as cancer, or when it is 
classified as noncancer? And why not the other way around?

Thank you,

Peter

______________________________________________ 
[hidden email] mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. 

Jul 10, 2009; 11:37pm

Re: predict.glm -> which class does it predict?

On Jul 10, 2009, at 9:46 AM, Peter Schüffler wrote:

> The result is a vector of probabilities, one for each sample in   
> testsamples. But the probability of WHAT? Of belonging to the first   
> level of T? Or to the second level of T? 

> Is the following expression 
> factor(predict(model, newdata=testsamples, type="response") >= 0.5) 
> TRUE when the new sample is classified as cancer, or when it is   
> classified as noncancer? And why not the other way around? 


As per the Details section of ?glm:

A typical predictor has the form response ~ terms where response is   
the (numeric) response vector and terms is a series of terms which   
specifies a linear predictor for response. ***For binomial and   
quasibinomial families the response can also be specified as a factor   
(when the first level denotes failure and all others success)*** or as   
a two-column matrix with the columns giving the numbers of successes   
and failures. A terms specification of the form first + second   
indicates all the terms in first together with all the terms in second   
with any duplicates removed.

So, given your description above, you are predicting   
"noncancer"...that is, you are predicting the probability of the   
second level of the factor ("success"), given the covariates.

If you want to predict "cancer", alter the factor levels thusly:

T <- factor(T, levels = c("noncancer", "cancer"))

By default, R sorts the factor levels alphabetically, so "cancer" would   
be first.

Think of it in terms of using a 0/1 integer code for absence/presence,   
where you are predicting the probability of a '1', that is, the   
presence of the event or characteristic of interest.
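
To make the effect of the level order concrete, here is a minimal sketch. The data frame d, its columns status and p1, and the new samples are made-up values for illustration only, not the original poster's data. The same model is fitted twice, once with the default (alphabetical) levels and once with the levels reordered, and the two sets of predicted probabilities should sum to 1.

```r
# Hypothetical toy data; 'status' and 'p1' are illustrative names only.
d <- data.frame(
  status = c("cancer", "noncancer", "noncancer", "noncancer",
             "cancer", "noncancer", "noncancer", "noncancer"),
  p1     = c(2.1, 0.4, 0.9, 1.2, 1.0, 0.3, 1.5, 1.8)
)

# Default ordering: alphabetical, so "cancer" is the first level ("failure")
# and the fitted model describes P(noncancer).
d$y_default <- factor(d$status)
levels(d$y_default)   # "cancer" "noncancer"

# Reordered: "noncancer" is now first, so the model describes P(cancer).
d$y_cancer <- factor(d$status, levels = c("noncancer", "cancer"))

fit_default <- glm(y_default ~ p1, data = d, family = binomial)
fit_cancer  <- glm(y_cancer  ~ p1, data = d, family = binomial)

# Two hypothetical new samples.
new <- data.frame(p1 = c(0.5, 2.0))
p_noncancer <- predict(fit_default, newdata = new, type = "response")
p_cancer    <- predict(fit_cancer,  newdata = new, type = "response")

# The two fits are mirror images of each other.
all.equal(unname(p_noncancer + p_cancer), c(1, 1))
```

Reordering the levels changes only which class is treated as the "1"; the fit itself is the same model with the signs of the coefficients flipped.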

BTW, using 'T' as the name of the response vector is not a good habit:

> T 
[1] TRUE

'T' is a predefined variable whose initial value is TRUE. Unlike the   
reserved word TRUE, it can be reassigned. R is generally smart enough   
to know the difference, but it is better to avoid getting into trouble   
by not using it.
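
A short sketch of the kind of trouble meant here, assuming a fresh R session in which nobody has redefined T yet. The variable names before, masked and after are only there to record the intermediate results.

```r
before <- isTRUE(T)   # T starts out as an ordinary variable bound to TRUE

T <- 0                # legal: T is not a reserved word, so this masks it
masked <- isTRUE(T)   # code that uses T as shorthand for TRUE now misbehaves

rm(T)                 # drop our copy; the built-in definition is visible again
after <- isTRUE(T)

c(before = before, masked = masked, after = after)

# By contrast, TRUE itself is a reserved word and cannot be assigned to:
# TRUE <- 0           # this line would be an error
```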

HTH,

Marc Schwartz

Jul 10, 2009; 11:48pm

Re: predict.glm -> which class does it predict?

It's the probability of the 2nd level of a factor response (termed 
"success" in the documentation, even when you're modeling the probability 
of disease or death...), just like when interpreting the logistic 
regression itself.

I find it easiest to sort out this kind of issue by experimentation in 
simplified situations. E.g.

> x <- sample(c("A","B"),10,replace=TRUE)
> x
 [1] "B" "A" "B" "B" "A" "B" "B" "A" "B" "A"
> table(x)
x
A B
4 6

(notice that the relative frequency of B is 0.6)

> glm(x~1,binomial) 
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1 
In addition: Warning message: 
In model.matrix.default(mt, mf, contrasts) : 
   variable 'x' converted to a factor

(OK, so it won't go without conversion to factor. This is a good thing.)

> glm(factor(x)~1,binomial)

Call:  glm(formula = factor(x) ~ 1, family = binomial)

Coefficients: 
(Intercept) 
      0.4055

Degrees of Freedom: 9 Total (i.e. Null);  9 Residual 
Null Deviance:    13.46 
Residual Deviance: 13.46 AIC: 15.46

(The intercept is positive, corresponding to log odds for a probability 
greater than 0.5; i.e., the level being modeled must be "B": 
0.4055 == log(6/4).)

> predict(glm(factor(x)~1,binomial))
        1         2         3         4         5         6         7         8
0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651
        9        10
0.4054651 0.4054651
> predict(glm(factor(x)~1,binomial),type="response")
  1   2   3   4   5   6   7   8   9  10
0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
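
Because sample() gives a different draw on every run, the experiment above is easier to repeat with the vector typed in directly. The sketch below uses the same ten values as the transcript and checks the two relationships used in the argument: the intercept equals the log odds of "B", and the inverse logit of the intercept equals the relative frequency of "B". The names fit and b0 are just local choices.

```r
# The same ten values as in the transcript, entered by hand so the result
# does not depend on the random draw.
x <- factor(c("B", "A", "B", "B", "A", "B", "B", "A", "B", "A"))

fit <- glm(x ~ 1, family = binomial)
b0  <- unname(coef(fit))   # intercept on the log-odds (link) scale

# Intercept versus the log odds of "B" (6 B's against 4 A's).
c(intercept = b0, log_odds_B = log(6 / 4))

# plogis() is the inverse logit: it maps log odds back to a probability.
c(fitted_prob = plogis(b0), freq_B = mean(x == "B"))
```

Note that predict() without a type argument returns values on the link scale (log odds), which is why the first predict() call above prints 0.4054651 rather than 0.6.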

As for why it's not the other way around, well, if it had been, then you 
could have asked the same question....

-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B 
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K 
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918 
~~~~~~~~~~ - ([hidden email])              FAX: (+45) 35327907

Jul 11, 2009; 1:27am

Re: predict.glm -> which class does it predict?

Or more specifically:

> resp <- factor(c("cancer", "noncancer", "noncancer", "noncancer")) 
> mod <- glm(resp ~ 1, family = binomial) 
> predict(mod, type = "response") 
   1    2    3    4 
0.75 0.75 0.75 0.75

and since noncancer occurs 75% of the time in the sample, it is clearly 
predicting the probability of noncancer.
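
Building on that, one way to turn the probabilities into class labels without having to remember which level TRUE stands for is to index levels(resp) directly. This is only a sketch: the 0.5 cutoff is the one from the original question, and pred_class is a name chosen here for illustration.

```r
resp <- factor(c("cancer", "noncancer", "noncancer", "noncancer"))
mod  <- glm(resp ~ 1, family = binomial)

# Probability of the second level of resp, i.e. "noncancer".
p <- predict(mod, type = "response")

# (p >= 0.5) is FALSE/TRUE; adding 1 gives index 1 (first level) or
# index 2 (second level), so the labels come straight from the factor.
pred_class <- factor(levels(resp)[(p >= 0.5) + 1], levels = levels(resp))
pred_class
```

So in the original question, TRUE corresponds to the second level, "noncancer", as long as the levels are in their default alphabetical order.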

Jul 11, 2009; 2:10am

Re: predict.glm -> which class does it predict?

In reply to this post by Peter Dalgaard
> As for why it's not the other way around, well, if it had been, then you 
> could have asked the same question....

...and come to think of it, it is rather convenient that it meshes 
with the default ordering of levels in factor(x) if x is 0/1 or FALSE/TRUE.
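
A quick check of that point, as a sketch with a made-up 0/1 vector y01: the default level order puts 0 (or FALSE) first, so fitting the numeric vector and fitting its factor version should give the same intercept.

```r
# Default level ordering for 0/1 and logical input.
levels(factor(c(1, 0, 1, 1)))          # "0" "1"
levels(factor(c(TRUE, FALSE, TRUE)))   # "FALSE" "TRUE"

# Hypothetical 0/1 response, fitted as numeric and as a factor.
y01 <- c(1, 0, 1, 1, 0, 1)
fit_num <- glm(y01 ~ 1, family = binomial)
fit_fac <- glm(factor(y01) ~ 1, family = binomial)

# Both model the probability of a 1, so the intercepts agree.
all.equal(coef(fit_num), coef(fit_fac))
```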

-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B 
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K 
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918 
~~~~~~~~~~ - ([hidden email])              FAX: (+45) 35327907
