[Machine Learning][The Analytics Edge][Predicting Earnings from Census Data]

census = read.csv("census.csv")
library(caTools)
set.seed(2000)
spl = sample.split(census$over50k,SplitRatio = 0.6)
train = subset(census,spl == TRUE)
test = subset(census, spl == FALSE)
# use the logistic regression
glm = glm(over50k ~. , data = train, family = "binomial")
summary(glm) #pr(>|z|) if it is smaller than 0.1, the variables are significant

#accuracy
glm.pred = predict(glm, newdata = test, type = "response")
table(test$over50k,glm.pred >= 0.5)

(9051+1888)/nrow(test)

#baseline accuracy of test - more frequent outcome
table(test$over50k)
9713/nrow(test)

#ROC & ACU
library(ROCR)
#Then we can generate the confusion matrix
ROCpred = prediction(glm.pred, test$over50k)
plot(performance(ROCpred,measure="tpr",x.measure="fpr"),colorize = TRUE)
as.numeric(performance(ROCpred, "auc")@y.values)

#Problem 2.1 - A CART Model
library(rpart)
library(rpart.plot)
CTree = rpart(over50k ~. , data = train, method = "class")
prp(CTree)

# accuracy of the CART model
CTree.pred = predict(CTree, newdata = test, type = "class")

table(test$over50k,CTree.pred)
(9243+1596)/nrow(test)

#use another way- generate probabilities and use a threshold of 0.5 like in logistic regression
CTree.pred1 = predict(CTree, newdata = test)
p = CTree.pred1[,2] # the column of over 50k
table(test$over50k, p) # p<=0.5 it is same with the <=50k, p>0.5 means >50k

# ROC curve for the CART model - WOW
#removing the type="class" argument when making predictions
library(ROCR)
library(arulesViz)
CTree.ROCpred = prediction(CTree.pred1[,2],test$over50k)
# plot(CTree.ROCpred) can not run
plot(performance(CTree.ROCpred,measure="tpr",x.measure="fpr"),colorize = TRUE)

# to caculate the auc
as.numeric(performance(CTree.ROCpred,"auc")@y.values)

# another way to seek for auc
CTree.ROCpred2 = prediction(p,test$over50k)
as.numeric(performance(CTree.ROCpred2,"auc")@y.values)

#Problem 3.1 - A Random Forest Model
set.seed(1)
trainSmall = train[sample(nrow(train),2000),]

set.seed(1)
library(randomForest)
RFC = randomForest(over50k ~., data = trainSmall)
RFC.pred = predict(RFC,newdata = test) #using a threshold of 0.5, no need to set the type = "class"
table(test$over50k,RFC.pred)
(9586+1093)/nrow(test) # a little difference is allowed

#compute metrics that give us insight into which variables are important.
vu = varUsed(RFC, count = TRUE)
vusorted = sort(vu, decreasing = FALSE, index.return = TRUE)
dotchart(vnsorted$x, names(RFC$forest$xlevel[vusorted$ix]))

#another way to find the important variables - impurity
varImpPlot(RFC)

# select cp by Cross-validation for the CART Trees
library(caret)
library(e1071)
set.seed(2)
#Specify that we are going to use k-fold cross validation with 10 folds:
numFolds = trainControl(method = "cv", number = 10)
#Specify the grid of cp values that we wish to evaluate:
cartGrid = expand.grid(.cp = seq(0.002,0.1,0.002))
#run the train function and view the result:
tr = train(over50k ~.,data = train, method = "rpart", trControl = numFolds, tuneGrid = cartGrid)
tr # The final value used for the model was cp = 0.002.

CTree2 = rpart(over50k ~., data = train, method = "class", cp = 0.002)
CTree2.pred = predict(CTree2, newdata = test, type = "class")
table(test$over50k, CTree2.pred)
(9178+1838)/nrow(test)
prp(CTree2) # shoould be 18 splits

[Machine Learning][The Analytics Edge][Predicting Earnings from Census Data]的更多相关文章

Machine Learning for Developers
Machine Learning for Developers Most developers these days have heard of machine learning, but when ...
How do I learn machine learning?
https://www.quora.com/How-do-I-learn-machine-learning-1?redirected_qid=6578644 How Can I Learn X? ...
Course Machine Learning Note
Machine Learning Note Introduction Introduction What is Machine Learning? Two definitions of Machine ...
[C2P3] Andrew Ng - Machine Learning
##Advice for Applying Machine Learning Applying machine learning in practice is not always straightf ...
Why The Golden Age Of Machine Learning is Just Beginning
Why The Golden Age Of Machine Learning is Just Beginning Even though the buzz around neural networks ...
Introducing: Machine Learning in R（转）
Machine learning is a branch in computer science that studies the design of algorithms that can lear ...
Azure Machine Learning
About me In my spare time, I love learning new technologies and going to hackathons. Our hackathon p ...
Getting started with machine learning in Python
Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...
Google's Machine Learning Crash Course #01# Introducing ML & Framing & Fundamental terminology
INDEX Introducing ML Framing Fundamental machine learning terminology Introducing ML What you learn ...

随机推荐

globals和locals的区别
Python的两个内置函数,locals 和globals,它们提供了基于字典的访问局部和全局变量的方式. 1.locals()是只读的.globals()不是.这里说的只读,是值对于原有变量的只读. ...
数论卷积公式and莫比乌斯反演
数论卷积: 对于两个数论函数f(x),g(x) f(n)g(n)=∑ f(d)g(n/d) d|n 莫比乌斯函数: 设一个数n=(p1^k1)*(p2^k2)*(p3^k3)*..........*( ...
mysqli用户权限操作
此操作指令在 mysql 的数据库中所以要 use mysql 查询mysqli中所有用户的权限 select host,user form user; 添加用户 grant all privil ...
IDEA—— 找不到或无法加载主类Main
最近使用idea,编写了一个项目,发现老是找不到main,网上找了一大圈的解决方案,都不行.灵机一动升级了jdk就可以了,之前用的是1.7的,换成了1.8的就好了.
一台电脑上配置多个tomcat同时运行
好使 1 1.配置运行tomcat 首先要配置java的jdk环境,这个就不在写了不懂去网上查查,这里主要介绍再jdk环境没配置好的情况下如何配置运行多个tomcat 2.第一个tomcat: ...
HDFS知识点总结
学习完Hadoop权威指南有一段时间了,现在再回顾和总结一下HDFS的知识点. 1.HDFS的设计 HDFS是什么:HDFS即Hadoop分布式文件系统(Hadoop Distributed File ...
vue+窗格切换+田字+dicom显示_02
环境:vue+webpack+cornerstone ide:vs code 需求:窗格设置+拼图设置分析: 由于时间的原因,我也没有Baidu更好的显示窗格的方法,所以使用比较笨的方法,通过组件显 ...
调用免费的web service(天气预报)
”免费WebService”, 找到提供天气预报Webservice的网络地址 http://ws.webxml.com.cn/WebServices/WeatherWS.asmx?wsdl 在url ...
关于spring的一些注解
vue.js插值，插入图片，属性
<html><head><title>Insert title here</title><script type="text/javas ...

[Machine Learning][The Analytics Edge][Predicting Earnings from Census Data]

[Machine Learning][The Analytics Edge][Predicting Earnings from Census Data]的更多相关文章

随机推荐

热门专题