This section is an introduction to machine learning. It covers two simple models:

Decision Trees and Random Forests

It does not go into their internal principles; it only covers the basics of how to call them.

1. How Models Work

We take a simple decision tree as the example.

This step of capturing patterns from data is called fitting or training the model.

The data used to train the model is called the training data.

After the model has been fit, you can apply it to new data to predict prices of additional homes.
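
As a minimal sketch of this fit-then-predict cycle (the toy data and column names here are invented purely for illustration; the real Melbourne dataset is introduced in the next section):

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    # toy training data: two features plus the known price of each home
    train = pd.DataFrame({'Rooms': [2, 3, 4, 4],
                          'Landsize': [150, 200, 300, 450],
                          'Price': [300000, 400000, 550000, 700000]})

    model = DecisionTreeRegressor(random_state=0)
    model.fit(train[['Rooms', 'Landsize']], train['Price'])   # fitting / training

    # apply the fitted model to new homes it has never seen
    new_homes = pd.DataFrame({'Rooms': [3, 5], 'Landsize': [220, 500]})
    print(model.predict(new_homes))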

2. Basic Data Exploration

Use describe() from pandas to explore the data:

    import pandas as pd

    melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
    melbourne_data = pd.read_csv(melbourne_file_path)

    melbourne_data.describe()

    output: a table of summary statistics (count, mean, std, min, quartiles, max) for each numeric column

Note: what the reported values mean

count: the number of non-missing values

mean: the average value

std: the standard deviation, which measures how spread out the values are

min, 25%, 50%, 75%, max: sort each column from lowest to highest; min is the smallest value, 25% is the value one quarter of the way up (larger than 25% of the values and smaller than the other 75%), 50% is the median, 75% is three quarters of the way up, and max is the largest value
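
A tiny self-contained example of what describe() reports (the column of numbers here is invented purely for illustration):

    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 100])
    print(s.describe())
    # count      5.0    -> five non-missing values
    # mean      22.0    -> the average
    # std       ~43.6   -> standard deviation (spread around the mean)
    # min 1.0, 25% 2.0, 50% 3.0, 75% 4.0, max 100.0
    #                    -> minimum, quartiles and maximum of the sorted values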

3. Your First Machine Learning Model

  • Selecting Data for Modeling

 

  import pandas as pd

  melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
  melbourne_data = pd.read_csv(melbourne_file_path)

  melbourne_data.columns

  • Selecting The Prediction Target

Method: use dot-notation to select the prediction target:

  y = melbourne_data.Price
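
Dot-notation only works when the column name is a valid Python identifier (and does not clash with a DataFrame method); the equivalent bracket notation always works:

  y = melbourne_data['Price']   # same result as melbourne_data.Price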

  • Choosing "Features"

  

  melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

  X = melbourne_data[melbourne_features]

Check that the data was loaded correctly:

  

  X.head()

Explore the basic characteristics of the data:

  X.describe()

  • Building Your Model

We use scikit-learn to build the model; its guide to supervised learning is linked below.

The underlying theory can be explored on your own as needed:

https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

Steps to build a model:

    •   Define:

        What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.

    •   Fit:

        Capture patterns from provided data. This is the heart of modeling.

    •   Predict:

        Just what it sounds like.

    •   Evaluate:

        Determine how accurate the model's predictions are.


Implementation:


    from sklearn.tree import DecisionTreeRegressor

    melbourne_model = DecisionTreeRegressor(random_state=1)
    melbourne_model.fit(X, y)

Print the first few rows:

  

    print(X.head())

The predicted prices are as follows:

    print(melbourne_model.predict(X.head()))

4. Model Validation

Since the predicted prices will differ from the actual prices, we need a way to measure how large that gap is.

We use the Mean Absolute Error (MAE). For each home,

    error = actual - predicted

and MAE is the average of the absolute values of these errors.
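
A tiny worked example (the actual and predicted prices here are made up for illustration):

    from sklearn.metrics import mean_absolute_error

    actual    = [300000, 400000, 550000]
    predicted = [310000, 380000, 560000]

    # errors: -10000, +20000, -10000 -> absolute errors: 10000, 20000, 10000
    # MAE = (10000 + 20000 + 10000) / 3 = 13333.33...
    print(mean_absolute_error(actual, predicted))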

In practice, we split the data into two parts: one part for training, called the training data, and one part for validation, called the validation data.

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error
    from sklearn.tree import DecisionTreeRegressor

    # split the data into a training set and a validation set
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

    melbourne_model = DecisionTreeRegressor()
    melbourne_model.fit(train_X, train_y)

    val_predictions = melbourne_model.predict(val_X)
    print(mean_absolute_error(val_y, val_predictions))

5. Underfitting and Overfitting

  • overfitting: a model matches the training data almost perfectly, but does poorly in validation and on other new data.
  • underfitting: a model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data.

The more leaves we allow the model to have, the more we move from the underfitting region toward the overfitting region.

  from sklearn.metrics import mean_absolute_error
  from sklearn.tree import DecisionTreeRegressor

  # return the validation MAE for a decision tree limited to max_leaf_nodes leaves
  def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
      model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
      model.fit(train_X, train_y)
      preds_val = model.predict(val_X)
      mae = mean_absolute_error(val_y, preds_val)
      return mae

We can use a loop to compare and choose the most suitable max_leaf_nodes:

    for max_leaf_nodes in [5, 50, 500, 5000]:
        my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
        print(max_leaf_nodes, my_mae)
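
A compact way to pick the best value programmatically (a sketch; the scores dictionary and the name best_tree_size are introduced here for illustration):

    candidate_sizes = [5, 50, 500, 5000]
    scores = {size: get_mae(size, train_X, val_X, train_y, val_y)
              for size in candidate_sizes}
    best_tree_size = min(scores, key=scores.get)   # leaf count with the lowest validation MAE
    print(best_tree_size, scores[best_tree_size])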

In the end, we find that the MAE is smallest when max_leaf_nodes is 500. Next, we try a different kind of model.

6. Random Forests

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error

    forest_model = RandomForestRegressor(random_state=1)
    forest_model.fit(train_X, train_y)
    melb_preds = forest_model.predict(val_X)
    print(mean_absolute_error(val_y, melb_preds))

We find that the resulting error is smaller than that of the single decision tree.

One of the best features of Random Forest models is that they generally work reasonably well even without this tuning.
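
If you do want to experiment, the same validation approach still applies; a small sketch comparing a few settings of n_estimators (the candidate values here are chosen arbitrarily):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error

    for n in [50, 100, 200]:
        model = RandomForestRegressor(n_estimators=n, random_state=1)
        model.fit(train_X, train_y)
        preds = model.predict(val_X)
        print(n, mean_absolute_error(val_y, preds))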

7. Machine Learning Competitions

  • Build a Random Forest model with all of your data (this whole workflow is sketched in the code after this list)
  • Read in the "test" data, which doesn't include values for the target. Predict home values in the test data with your Random Forest model.
  • Submit those predictions to the competition and see your score.
  • Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.
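
A minimal sketch of that workflow, assuming Kaggle-style file names and columns (train.csv, test.csv, an Id column and a SalePrice target; the feature list is a placeholder and will differ for your competition):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    features = ['Rooms', 'Bathroom', 'Landsize']          # placeholder feature list

    train = pd.read_csv('train.csv')
    model = RandomForestRegressor(random_state=1)
    model.fit(train[features], train['SalePrice'])        # fit on all of the training data

    test = pd.read_csv('test.csv')                        # the test data has no target column
    predictions = model.predict(test[features])

    # write the predictions in the format the competition expects
    pd.DataFrame({'Id': test['Id'], 'SalePrice': predictions}).to_csv('submission.csv', index=False)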
