There is a plethora of classification algorithms available to people who have a bit of coding experience and a set of data. A common machine learning method is the random forest, which is a good place to start.

This is a use case in R of the randomForest package applied to a data set from UCI's Machine Learning Data Repository.

Are These Mushrooms Edible?

If someone gave you thousands of rows of data with dozens of columns about mushrooms, could you identify which characteristics make a mushroom edible or poisonous? How much would you trust your model? Would it be enough for you to make a decision on whether or not to eat a mushroom you find? (That's a bad decision roughly 100% of the time).

The randomForest package does all of the heavy lifting behind the scenes. While this "magic" is incredibly nice for the end user, it's important to understand what it is you're doing. Keep this in mind for absolutely any package you use in R or any other language.

"To know how to run these programs is impressive, but to truly understand how and why they work is what makes you an expert!" -Haley Stoltzman (my wife is a genius)

Here is an article which explains things in layman's terms - A Gentle Introduction to Random Forests, Ensembles, and Performance Metrics in a Commercial System.

I created a function to grab and clean up the data. This happened to be a very manual process so I borrowed a lot of the code from others. Later on, I found that the data set had already been cleaned up by someone else and presented as a .csv file, but I decided to use my function anyway.

source('helper_functions.R')
library(randomForest)
library(e1071)
library(caret)
library(ggplot2)
set.seed(123)

I brought the data in as a data frame. The first column is "Edible", which could just as well be labeled "Class", as this is what we're trying to predict in the classification. We'll find only two values here, "Edible" and "Poisonous" (keep in mind that random forest easily handles more than two classes).

I printed the first few rows and the output shows us there are 23 columns (including "Edible"). I am not a mushroom expert but most of this data makes sense to try and utilize.

#Import Data via Custom Function
data = fetchAndCleanData()
head(data)
##      Edible CapShape CapSurface CapColor Bruises    Odor GillAttachment
## 1 Poisonous Convex Smooth Brown True Pungent Free
## 2 Edible Convex Smooth Yellow True Almond Free
## 3 Edible Bell Smooth White True Anise Free
## 4 Poisonous Convex Scaly White True Pungent Free
## 5 Edible Convex Smooth Gray False None Free
## 6 Edible Convex Scaly Yellow True Almond Free
## GillSpacing GillSize GillColor StalkShape StalkRoot
## 1 Close Narrow Black Enlarging Equal
## 2 Close Broad Black Enlarging Club
## 3 Close Broad Brown Enlarging Club
## 4 Close Narrow Brown Enlarging Equal
## 5 Crowded Broad Black Tapering Equal
## 6 Close Broad Brown Enlarging Club
## StalkSurfaceAboveRing StalkSurfaceBelowRing StalkColorAboveRing
## 1 Smooth Smooth White
## 2 Smooth Smooth White
## 3 Smooth Smooth White
## 4 Smooth Smooth White
## 5 Smooth Smooth White
## 6 Smooth Smooth White
## StalkColorBelowRing VeilType VeilColor RingNumber RingType
## 1 White Partial White One Pendant
## 2 White Partial White One Pendant
## 3 White Partial White One Pendant
## 4 White Partial White One Pendant
## 5 White Partial White One Evanescent
## 6 White Partial White One Pendant
## SporePrintColor Population Habitat
## 1 Black Scattered Urban
## 2 Brown Numerous Grasses
## 3 Brown Numerous Meadows
## 4 Black Scattered Urban
## 5 Brown Abundnant Grasses
## 6 Black Numerous Grasses

It's important to know that R's random forest package cannot use rows with missing data. Using the summary() function can help to identify issues. This data doesn't have missing information.

summary(data) #no missing data appears
##        Edible        CapShape      CapSurface      CapColor
## Edible :4208 Convex :3656 Scaly :3244 Brown :2284
## Poisonous:3916 Flat :3152 Smooth :2556 Gray :1840
## Knobbed: 828 Fibrous:2320 Red :1500
## Bell : 452 Grooves: 4 Yellow :1072
## Sunken : 32 f : 0 White :1040
## Conical: 4 g : 0 Buff : 168
## (Other): 0 (Other): 0 (Other): 220
## Bruises Odor GillAttachment GillSpacing
## f : 0 None :3528 a : 0 c : 0
## t : 0 Foul :2160 f : 0 w : 0
## True :3376 Fishy : 576 Attached : 210 Close :6812
## False:4748 Spicy : 576 Descending: 0 Crowded:1312
## Almond : 400 Free :7914 Distant: 0
## Anise : 400 Notched : 0
## (Other): 484
## GillSize GillColor StalkShape StalkRoot
## b : 0 Buff :1728 e : 0 Bulbous:3776
## n : 0 Pink :1492 t : 0 Missing:2480
## Broad :5612 White :1202 Enlarging:3516 Equal :1120
## Narrow:2512 Brown :1048 Tapering :4608 Club : 556
## Gray : 752 Rooted : 192
## Chocolate: 732 ? : 0
## (Other) :1170 (Other): 0
## StalkSurfaceAboveRing StalkSurfaceBelowRing StalkColorAboveRing
## Smooth :5176 Smooth :4936 White :4464
## Silky :2372 Silky :2304 Pink :1872
## Fibrous: 552 Fibrous: 600 Gray : 576
## Scaly : 24 Scaly : 284 Brown : 448
## f : 0 f : 0 Buff : 432
## k : 0 k : 0 Orange : 192
## (Other): 0 (Other): 0 (Other): 140
## StalkColorBelowRing VeilType VeilColor RingNumber
## White :4384 p : 0 White :7924 n : 0
## Pink :1872 Partial :8124 Brown : 96 o : 0
## Gray : 576 Universal: 0 Orange : 96 t : 0
## Brown : 512 Yellow : 8 None: 36
## Buff : 432 n : 0 One :7488
## Orange : 192 o : 0 Two : 600
## (Other): 156 (Other): 0
## RingType SporePrintColor Population Habitat
## Pendant :3968 White :2388 Several :4040 Woods :3148
## Evanescent:2776 Brown :1968 Solitary :1712 Grasses:2148
## Large :1296 Black :1872 Scattered:1248 Paths :1144
## Flaring : 48 Chocolate:1632 Numerous : 400 Leaves : 832
## None : 36 Green : 72 Abundnant: 384 Urban : 368
## e : 0 Buff : 48 Clustered: 340 Meadows: 292
## (Other) : 0 (Other) : 144 (Other) : 0 (Other): 192
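
Reading summary() output works for a quick look, but a direct count of NA values is harder to misread. The following is a minimal sketch on a made-up toy data frame (the column values are invented for illustration and are not the real mushroom data). It also shows one thing worth keeping in mind: the summary above lists 2480 StalkRoot values under a level called "Missing". That is an ordinary factor level rather than an NA, so it would presumably be treated as just another category.

```r
# Toy data frame standing in for the mushroom data (values are made up).
toy <- data.frame(
  Edible    = factor(c("Edible", "Poisonous", "Edible", NA)),
  StalkRoot = factor(c("Bulbous", "Missing", NA, "Club")),
  Odor      = factor(c("None", "Foul", "Almond", "None"))
)

# Count the NA values in each column.
na.counts <- colSums(is.na(toy))
print(na.counts)

# Keep only the rows with no NA in any column.
toy.clean <- toy[complete.cases(toy), ]
print(nrow(toy.clean))

# A placeholder level such as "Missing" is not an NA, so count it separately.
placeholder.count <- sum(toy$StalkRoot == "Missing", na.rm = TRUE)
print(placeholder.count)
```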

I want to explore the data before fitting a model to get an idea of what to expect. Each plot puts one variable on the x-axis and another on the y-axis, and uses color to show whether each mushroom is edible or poisonous.

In these plots, edible is shown as green and poisonous is shown as red. I'm looking for spots where there exists an overwhelming majority of one color.

A comparison of "CapSurface" to "CapShape" shows us:

  • CapShape Bell is more likely to be edible
  • CapShape Convex or Flat have a mix of edible and poisonous and make up the majority of the data
  • CapSurface alone does not tell us a lot of information
  • CapSurface Fibrous + CapShape Bell, Knobbed, or Sunken are likely to be edible
  • These variables will likely increase information gain but may not be incredibly strong

# Jittered scatter of CapSurface against CapShape, colored by class
p = ggplot(data, aes(x = CapShape,
                     y = CapSurface,
                     color = Edible))
p + geom_jitter(alpha = 0.3) +
  scale_color_manual(breaks = c('Edible', 'Poisonous'),
                     values = c('darkgreen', 'red'))

A comparison of "StalkColorBelowRing" to "StalkColorAboveRing" shows us:

  • StalkColorAboveRing Gray is almost always going to be edible
  • StalkColorBelowRing Gray is almost always going to be edible
  • StalkColorBelowRing Buff is almost always going to be poisonous
  • This list could go on...
  • These variables are likely to increase information gain by a fair amount

# Jittered scatter of the two stalk color variables, colored by class
p = ggplot(data, aes(x = StalkColorBelowRing,
                     y = StalkColorAboveRing,
                     color = Edible))
p + geom_jitter(alpha = 0.3) +
  scale_color_manual(breaks = c('Edible', 'Poisonous'),
                     values = c('darkgreen', 'red'))

A comparison of "Odor" to "SporePrintColor" shows us:

  • Odor Foul, Fishy, Pungent, Creosote, and Spicy are highly likely to be poisonous
  • Odor Almond and Anise are highly likely to be edible.
  • Odor None appears to be primarily edible
    • However, if it has SporePrintColor Green it is highly likely to be poisonous!
  • These variables are likely going to lead to a lot of information gain

# Jittered scatter of SporePrintColor against Odor, colored by class
p = ggplot(data, aes(x = Odor,
                     y = SporePrintColor,
                     color = Edible))
p + geom_jitter(alpha = 0.3) +
  scale_color_manual(breaks = c('Edible', 'Poisonous'),
                     values = c('darkgreen', 'red'))

Due to how strong those variables looked, I decided to plot them strictly as edible or poisonous and found:

  • Odor is an excellent indicator of edible or poisonous
  • Odor None is the only tricky one - there is data where it would be classified as edible or poisonous
  • SporePrintColor is not as strong as odor when it stands alone - there is a lot of overlap between the columns

# Odor plotted directly against the class
p = ggplot(data, aes(x = Edible,
                     y = Odor,
                     color = Edible))
p + geom_jitter(alpha = 0.2) +
  scale_color_manual(breaks = c('Edible', 'Poisonous'),
                     values = c('darkgreen', 'red'))

# SporePrintColor plotted directly against the class
p = ggplot(data, aes(x = Edible,
                     y = SporePrintColor,
                     color = Edible))
p + geom_jitter(alpha = 0.2) +
  scale_color_manual(breaks = c('Edible', 'Poisonous'),
                     values = c('darkgreen', 'red'))
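
The jitter plots give a visual impression, and a cross-tabulation can put numbers on the same question. Here is a small sketch using base R's table() and prop.table(). The rows are invented for illustration, so the proportions say nothing about real mushrooms; with the real data frame you would pass data$Odor and data$Edible instead.

```r
# Made-up rows for illustration only; not the real mushroom data.
toy <- data.frame(
  Odor   = c("None", "None", "None", "Foul", "Foul", "Almond", "Almond", "None"),
  Edible = c("Edible", "Edible", "Poisonous", "Poisonous", "Poisonous",
             "Edible", "Edible", "Edible")
)

# Counts of each class within each odor.
counts <- table(toy$Odor, toy$Edible)
print(counts)

# Row-wise proportions: the share of edible vs. poisonous for each odor.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

An odor whose row is close to 0 or 1 in either column corresponds to one of the nearly single-colored columns in the plots.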

Before fitting a model it's important to split the data into two parts: training data and test data. There's no perfect way to know exactly how much data you should use to train your model. In this example I used 5% for training and 95% for testing. This is not typical; most of what I see is around 60%/40% or 70%/30% for the train/test split.

If you hold back too little data for testing, you have little chance of noticing that your model has overfit the training data. Overfitting is a classic mistake people make when first entering the field of machine learning. I won't go into the details, but there are classes dedicated to this subject, and the Wikipedia article on overfitting is a reasonable starting point.

Initially, I ran this at higher levels of training data and it had perfect prediction with zero false positives or negatives. That's not as fun to look at as an example so I scaled down the training data which created more bad predictions.

#Create data for training
sample.ind = sample(2,
nrow(data),
replace = T,
prob = c(0.05,0.95))
data.dev = data[sample.ind==1,]
data.val = data[sample.ind==2,]
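
The sample(2, ...) call above assigns each row to a group independently, so the group sizes only come out approximately at 5% and 95%. If you want an exact split, one alternative is to sample row indices without replacement. This is a sketch of a more conventional 70/30 split, not the code used in this post; it uses the built-in iris data as a stand-in, and the names train.idx, train and test are my own.

```r
set.seed(123)

# iris ships with R; it stands in for the mushroom data frame here.
n <- nrow(iris)

# Draw exactly 70% of the row indices without replacement.
train.idx <- sample(seq_len(n), size = round(0.7 * n))

train <- iris[train.idx, ]
test  <- iris[-train.idx, ]

print(c(train = nrow(train), test = nrow(test)))
```

Because the test set is built from the indices that were not drawn, no row can end up in both sets.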

I wanted to know the split of edible to poisonous mushrooms in the data set and compare it to the training and test data. The random sample appears to have created roughly the same ratio of edible to poisonous upon creating train and test data.

Edible % / Poisonous % :

  • Data: 52 / 48
  • Train: 50 / 50
  • Test: 52 / 48
# Original Data
table(data$Edible)/nrow(data)
## Edible Poisonous
## 0.5179714 0.4820286
# Training Data
table(data.dev$Edible)/nrow(data.dev)
## Edible Poisonous
## 0.4962779 0.5037221
# Testing Data
table(data.val$Edible)/nrow(data.val)
## Edible Poisonous
## 0.5191037 0.4808963

I finally fit the random forest model to the training data. Plotting the model shows us that after about 20 trees, not much changes in terms of error. It fluctuates a bit but not to a large degree.

#Fit Random Forest Model
rf = randomForest(Edible ~ .,
ntree = 100,
data = data.dev)
plot(rf)

Printing the model shows that 4 variables were tried at each split and that the out-of-bag (OOB) estimate of the error rate is 0.25%. The OOB predictions were almost perfect: only one of the 403 training mushrooms was classified incorrectly. In this confusion matrix the rows are the actual classes and the columns are the predicted classes, so the model predicted 1 mushroom to be edible when it was actually poisonous. If we consider edible to be "positive", this means we would have had 1 false positive.

print(rf)
## Call:
## randomForest(formula = Edible ~ ., data = data.dev, ntree = 100)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 0.25%
## Confusion matrix:
## Edible Poisonous class.error
## Edible 200 0 0.000000000
## Poisonous 1 202 0.004926108
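
To see where the class.error column and the 0.25% figure come from, the numbers can be recomputed by hand. This sketch simply retypes the counts from the printed matrix into a base R matrix; it does not need the randomForest package.

```r
# Counts copied from the printed confusion matrix above.
# Rows are the actual classes, columns are the OOB predictions.
conf <- matrix(c(200,   0,
                   1, 202),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual    = c("Edible", "Poisonous"),
                               predicted = c("Edible", "Poisonous")))

# Per-class error: misclassified rows divided by the row total.
class.error <- 1 - diag(conf) / rowSums(conf)
print(class.error)

# Overall OOB error: all off-diagonal counts over all rows, as a percentage.
oob.error <- 1 - sum(diag(conf)) / sum(conf)
print(round(100 * oob.error, 2))
```

The Poisonous row works out to 1/203 and the overall error to 1/403, which match the printed values.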

It's always important to look at what is shown in terms of variable importance. This plot indicates what variables had the greatest impact in the classification model.

I limited it to 10 for the plot.

# Variable Importance
varImpPlot(rf,
sort = T,
n.var=10,
main="Top 10 - Variable Importance")

Odor is by far the most important variable in terms of "MeanDecreaseGini", a measure that plays a role similar to information gain in this example. The rest of the results are listed below. It's interesting to notice that "VeilType" created no information gain, so I looked into it in the initial data. The reason is clear: there is only one VeilType in the data (Partial, for all 8124 rows), so it doesn't offer any differentiation and couldn't possibly impact the results.

#Variable Importance (type = 2 is the mean decrease in node impurity)
var.imp = data.frame(importance(rf,
                                type = 2))
# copy the row names into a column
var.imp$Variables = row.names(var.imp)
# sort by importance, largest first
print(var.imp[order(var.imp$MeanDecreaseGini, decreasing = T),])
## MeanDecreaseGini Variables
## Odor 69.3536782 Odor
## SporePrintColor 27.3837625 SporePrintColor
## GillColor 18.1981987 GillColor
## StalkSurfaceAboveRing 12.3172400 StalkSurfaceAboveRing
## RingType 11.3114967 RingType
## GillSize 11.1085947 GillSize
## Population 7.2591707 Population
## Bruises 7.2212660 Bruises
## CapColor 5.6746095 CapColor
## Habitat 5.4768013 Habitat
## StalkRoot 5.3053036 StalkRoot
## StalkSurfaceBelowRing 4.6080070 StalkSurfaceBelowRing
## GillSpacing 4.1186021 GillSpacing
## StalkShape 2.6858568 StalkShape
## StalkColorBelowRing 2.5570551 StalkColorBelowRing
## RingNumber 2.0463027 RingNumber
## StalkColorAboveRing 1.9823127 StalkColorAboveRing
## CapSurface 1.0200298 CapSurface
## CapShape 0.5779989 CapShape
## VeilColor 0.1522645 VeilColor
## GillAttachment 0.0275000 GillAttachment
## VeilType 0.0000000 VeilType
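
The zero for VeilType can be illustrated with a hand-rolled version of the Gini calculation. This is a simplified sketch, not how randomForest computes MeanDecreaseGini internally: it scores a single multi-way split on a tiny made-up data set, whereas the package accumulates impurity decreases over the splits in every tree. The function names gini and gini.decrease are my own.

```r
# Gini impurity of a vector of class labels: 1 - sum of squared class shares.
gini <- function(labels) {
  shares <- table(labels) / length(labels)
  1 - sum(shares^2)
}

# Impurity decrease from splitting the labels into one group per feature value.
gini.decrease <- function(feature, labels) {
  groups   <- split(labels, feature, drop = TRUE)
  weights  <- sapply(groups, length) / length(labels)
  children <- sapply(groups, gini)
  gini(labels) - sum(weights * children)
}

# Made-up toy data: one constant column, one perfectly separating column.
labels <- c("Edible", "Edible", "Poisonous", "Poisonous")
veil   <- c("Partial", "Partial", "Partial", "Partial")
odor   <- c("Almond", "Almond", "Foul", "Foul")

print(gini.decrease(veil, labels))  # constant column
print(gini.decrease(odor, labels))  # separating column
```

A column with a single value cannot separate anything, so its only "child" group has the same impurity as the parent and the decrease is zero. A column that separates the classes perfectly removes all of the impurity.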

I then used the model to predict whether each mushroom in the training data set is edible or poisonous. It predicted the response variable perfectly, with zero false positives or false negatives. Keep in mind that these are the same rows the model was trained on, so this says little about how it handles new data.

# Predicting response variable
data.dev$predicted.response = predict(rf, data.dev)
# Create Confusion Matrix
print(
  confusionMatrix(data = data.dev$predicted.response,
                  reference = data.dev$Edible,
                  positive = 'Edible'))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Edible Poisonous
## Edible 200 0
## Poisonous 0 203
##
## Accuracy : 1
## 95% CI : (0.9909, 1)
## No Information Rate : 0.5037
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.4963
## Detection Rate : 0.4963
## Detection Prevalence : 0.4963
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : Edible

Now it was time to see how the model did with data it had not seen before - making predictions on the test data.

It did a decent job, with 99.3% accuracy and a very narrow confidence interval (0.9906 to 0.9945). It did have 48 false negatives and 8 false positives. The false positives are the dangerous ones: poisonous mushrooms predicted to be edible, which could be deadly if you were actually choosing to eat mushrooms based on this model.

# Predicting response variable
data.val$predicted.response <- predict(rf, data.val)
# Create Confusion Matrix
print(
  confusionMatrix(data = data.val$predicted.response,
                  reference = data.val$Edible,
                  positive = 'Edible'))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Edible Poisonous
## Edible 3960 8
## Poisonous 48 3705
##
## Accuracy : 0.9927
## 95% CI : (0.9906, 0.9945)
## No Information Rate : 0.5191
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9855
## Mcnemar's Test P-Value : 1.872e-07
##
## Sensitivity : 0.9880
## Specificity : 0.9978
## Pos Pred Value : 0.9980
## Neg Pred Value : 0.9872
## Prevalence : 0.5191
## Detection Rate : 0.5129
## Detection Prevalence : 0.5139
## Balanced Accuracy : 0.9929
##
## 'Positive' Class : Edible
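
The headline statistics in this output are simple ratios of the four cells in the matrix. As a sketch, here they are recomputed in base R from the printed counts, with edible as the positive class. The variable names tp, fp, fn and tn are my own shorthand.

```r
# Counts copied from the test-set confusion matrix above.
tp <- 3960  # predicted edible, actually edible
fp <- 8     # predicted edible, actually poisonous
fn <- 48    # predicted poisonous, actually edible
tn <- 3705  # predicted poisonous, actually poisonous

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)   # share of edible mushrooms found
specificity <- tn / (tn + fp)   # share of poisonous mushrooms found
ppv         <- tp / (tp + fp)   # Pos Pred Value
npv         <- tn / (tn + fn)   # Neg Pred Value

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv), 4))
```

For this problem, specificity is the number to watch, since every miss there is a poisonous mushroom labeled edible.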

Unfortunately, I have no idea how reliable this data is or how it was captured. There is likely some background information available, but either way I would never choose whether or not to eat an unknown mushroom based on this model (and neither should you).

Code used in this post is on my GitHub

Reposted from: https://stoltzmaniac.com/random-forest-classification-of-mushrooms/
