A Conscientious, Complete Beginner's Tutorial for Using Sklearn

The complete .ipynb file can be downloaded from my OneDrive share: https://1drv.ms/u/s!Al86h1dThXMNxDtq_wkOF1PNARrl?e=WvRNaI

All the materials come from the Machine Learning class at PolyU, HK.

I use and share them only for learning and non-profit purposes.

from sklearn.datasets import load_iris

iris=load_iris()

X=iris.data

y=iris.target

#use logisticRegression in sklearn

from sklearn.linear_model import LogisticRegression

logreg=LogisticRegression()

logreg.fit(X,y)

#the prediction for the training data

y_pred=logreg.predict(X)

/home/jiading/.conda/envs/nn/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.

  FutureWarning)

/home/jiading/.conda/envs/nn/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.

  "this warning.", FutureWarning)

#analyze the result

from sklearn import metrics

#The sklearn.metrics module includes score functions, performance metrics and pairwise metrics and distance computations.

print(metrics.accuracy_score(y,y_pred))

0.96
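The FutureWarning messages above ask for an explicit solver. A minimal sketch of one way to do that, assuming the 'lbfgs' solver named in the warning is acceptable; max_iter=1000 is my own choice to give the solver room to converge, not a value from the original notebook:

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Naming the solver explicitly addresses the "default solver will change" warning.
# max_iter=1000 is an assumed value, chosen so lbfgs has room to converge.
logreg = LogisticRegression(solver="lbfgs", max_iter=1000)
logreg.fit(X, y)

# Accuracy on the training data, as in the cells above
train_accuracy = metrics.accuracy_score(y, logreg.predict(X))
print(train_accuracy)
```

The accuracy may differ slightly from the 0.96 printed above, because a different solver is used.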

#use knn in sklearn

from sklearn.neighbors import KNeighborsClassifier

knn=KNeighborsClassifier(n_neighbors=5)#set the K

knn.fit(X,y)#KNN has no real training step; fit() mainly stores the training data

y_pred=knn.predict(X)

print(metrics.accuracy_score(y,y_pred))

0.9666666666666667
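To see why KNN needs no real training, here is a minimal from-scratch sketch of the prediction step. It is only an illustration: the helper knn_predict and the toy points are hypothetical, and this is not how KNeighborsClassifier is implemented internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # The most frequent label among those neighbors wins
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
label_b = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(label_a, label_b)
```

All the work happens at prediction time; "fitting" amounts to keeping X_toy and y_toy around.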

test_size is the fraction of the data held out for testing.

The random_state parameter fixes the seed of the random number generator used to shuffle the data, so the split is reproducible and the evaluation is easier to compare between runs.

If we pass the same seed every time, we get exactly the same split.
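A small sketch of that reproducibility, using made-up toy arrays rather than the iris data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 10 samples, 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with the same random_state produce the same split
_, X_test_a, _, y_test_a = train_test_split(X_toy, y_toy, test_size=0.4, random_state=4)
_, X_test_b, _, y_test_b = train_test_split(X_toy, y_toy, test_size=0.4, random_state=4)
same_split = np.array_equal(y_test_a, y_test_b)

# test_size=0.4 holds out 40% of the 10 samples
print(len(y_test_a), same_split)
```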

#split the data into training data and test data

#the train_test_split method helps us to do this work

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4,random_state=4)

#we use LogisticRegression again

logreg=LogisticRegression()

logreg.fit(X_train,y_train)

y_pred=logreg.predict(X_test)

print(metrics.accuracy_score(y_test,y_pred))

0.9333333333333333

/home/jiading/.conda/envs/nn/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.

  FutureWarning)

/home/jiading/.conda/envs/nn/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.

  "this warning.", FutureWarning)

#we test the accuracy of knn for each k and look for the k that gives the highest accuracy

k_range=list(range(1,26))#[1,25]

scores=[]

for k in k_range:

    knn=KNeighborsClassifier(n_neighbors=k)

    knn.fit(X_train,y_train)

    y_pred=knn.predict(X_test)

    scores.append(metrics.accuracy_score(y_test,y_pred))

#we draw a graph to show the result

import matplotlib.pyplot as plt

#a magic function, which allows plots to appear within the notebook

%matplotlib inline

plt.plot(k_range,scores)

plt.xlabel('Value of K for KNN')

plt.ylabel('Testing Accuracy')

Text(0, 0.5, 'Testing Accuracy')
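The plot shows the trend, but the best k can also be read off directly from the scores. A minimal sketch that repeats the loop above and then picks the k with the highest test accuracy; the names best_index and best_k are my own, and np.argmax returns the first maximum when several values of k tie:

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

# Same loop as above: one test accuracy per value of k
k_range = list(range(1, 26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(metrics.accuracy_score(y_test, knn.predict(X_test)))

# Position of the highest accuracy (first one if there are ties)
best_index = int(np.argmax(scores))
best_k = k_range[best_index]
print(best_k, scores[best_index])
```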

The following experiment requires a file named "l3_data.csv".

You can download it from my OneDrive:

https://1drv.ms/u/s!Al86h1dThXMNugsNGgtFBFYmZpYt?e=QDU4c4

#use the data in a csv file named"l3_data.csv"

#use pandas now

import pandas as pd

/home/jiading/.conda/envs/nn/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject

  return f(*args, **kwds)

index_col : int, str, sequence of int / str, or False, default None

Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.

Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.

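1. index_col=None (the default): no column of the file is used as the index; pandas generates a new integer index.

2. index_col=False: likewise, pandas does not use the first column as the index and generates a new one.

3. index_col=0: the first column becomes the index.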
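A minimal sketch of the difference between the default and index_col=0. The CSV text is an in-memory stand-in built from the first two data rows shown below; that the real file has an unnamed first header column is an assumption on my part.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file (layout is an assumption)
csv_text = (
    ",TV,Radio,Newspaper,Sales\n"
    "1,230.1,37.8,69.2,22.1\n"
    "2,44.5,39.3,45.1,10.4\n"
)

# Default: the first column stays a regular column and pandas adds a 0-based index
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the row labels
df_first = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist(), df_default.index.tolist())
print(df_first.columns.tolist(), df_first.index.tolist())
```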


data=pd.read_csv('./l3_data.csv',index_col=0)# indicate the location where the file is being stored

#show first 5 rows in the file

data.head()


	TV	Radio	Newspaper	Sales
1	230.1	37.8	69.2	22.1
2	44.5	39.3	45.1	10.4
3	17.2	45.9	69.3	9.3
4	151.5	41.3	58.5	18.5
5	180.8	10.8	58.4	12.9

#show last 5 rows in the file

data.tail()


	TV	Radio	Newspaper	Sales
196	38.2	3.7	13.8	7.6
197	94.2	4.9	8.1	9.7
198	177.0	9.3	6.4	12.8
199	283.6	42.0	66.2	25.5
200	232.1	8.6	8.7	13.4

From “shape”, we know that there are 200 rows (observations) and 4 columns (3 features and 1 response). The three features are “TV”, “Radio” and “Newspaper”. The response is “Sales”.

The dataset shows the advertising dollars spent on different media (TV, Radio and Newspaper) and the corresponding Sales of a product in a given market. All figures are in thousands.

It is hard to tell the relationships between the response and the three features from the raw numbers. Plotting some graphs to visualize the relationships could be helpful.

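Seaborn is a higher-level API wrapper built on top of matplotlib that makes plotting easier. In most cases Seaborn alone is enough to produce attractive plots, while matplotlib lets you build plots with more custom features. Seaborn should be seen as a complement to matplotlib, not a replacement.

Seaborn expects the raw input data to be a pandas DataFrame or NumPy arrays. Plotting functions generally take one of the following forms:

sns.plot_name(x='x-axis column name', y='y-axis column name', data=df)

or

sns.plot_name(x='x-axis column name', y='y-axis column name', hue='grouping variable', data=df)

or

sns.plot_name(x=np.array, y=np.array[, ...])

hue means: variable in data to map plot aspects to different colors.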


import seaborn as sns

%matplotlib inline

/home/jiading/.conda/envs/nn/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject

  return f(*args, **kwds)

Plot pairwise relationships in a dataset.

By default, this function will create a grid of Axes such that each variable in data will be shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.

size sets the height (in inches) of each facet; seaborn 0.9 and later name this parameter height.

aspect: aspect * size gives the width (in inches) of each facet.

kind='reg' fits and draws a linear regression line on each scatterplot.

pairplot knows which variables go on the x-axis and which on the y-axis from the column names passed in x_vars and y_vars.

source:https://seaborn.pydata.org/generated/seaborn.pairplot.html#seaborn.pairplot

#visualize the relationship between the features and the response using scatterplots

sns.pairplot(data,x_vars=['TV','Radio','Newspaper'],y_vars='Sales',size=7,aspect=0.7,kind='reg')

<seaborn.axisgrid.PairGrid at 0x7eff2ed36710>

From the three graphs, it seems that there is a strong relationship between TV ads and Sales. Newspaper does not seem to affect Sales much. Later we will try to verify that observation.

Remember that Scikit-Learn needs the dataset in two parts: a feature dataset in matrix form and a response in vector form. That means we have to preprocess the dataset into the correct format before we can use it for the prediction task.

feature_cols=['TV','Radio','Newspaper']

# use the list to select a subset of the original DataFrame

X=data[feature_cols]#indexing with a list of column names returns a DataFrame with just those columns

#show the data in X

X.head()


	TV	Radio	Newspaper
1	230.1	37.8	69.2
2	44.5	39.3	45.1
3	17.2	45.9	69.3
4	151.5	41.3	58.5
5	180.8	10.8	58.4

#pay attention to the type of X

print(type(X))

<class 'pandas.core.frame.DataFrame'>

#create y another way: access the column as an attribute of the DataFrame object

y=data.Sales

y.head()

1    22.1

2    10.4

3     9.3

4    18.5

5    12.9

Name: Sales, dtype: float64

#pay attention to the type of y

print(type(y))

<class 'pandas.core.series.Series'>

#do the two ways of building y give the same type?

y=data['Sales']

print(type(y))

#the answer is yes

<class 'pandas.core.series.Series'>

#now we split the data into training data and testing data

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1)

Let's check the size and the type of the training data and testing data.

Since we do not pass test_size, the default split is used: 25% of the data goes to the test set.

print("Size:")

print(X_train.shape)

print(y_train.shape)

print(X_test.shape)

print(y_test.shape)

print("Type:")

print(type(X_train))

Size:

(150, 3)

(150,)

(50, 3)

(50,)

Type:

<class 'pandas.core.frame.DataFrame'>

#this time we use LinearRegression:

from sklearn.linear_model import LinearRegression

linreg=LinearRegression()

#sklearn models accept pandas DataFrame and Series objects directly

linreg.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

We can inspect the intercept (the b in the formula) and the coefficients (the w in the formula):

print(linreg.intercept_)

print(linreg.coef_)

2.8769666223179318

[0.04656457 0.17915812 0.00345046]

We can use Python's zip function to pair each feature name with its coefficient:

list(zip(feature_cols,linreg.coef_))

[('TV', 0.04656456787415029),

 ('Radio', 0.17915812245088839),

 ('Newspaper', 0.003450464711180378)]
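With the intercept and coefficients in hand, a prediction is just b plus the weighted sum of the features. A worked sketch, copying the fitted values printed above and using the first row of the dataset (TV=230.1, Radio=37.8, Newspaper=69.2) as the input; the variable names are my own:

```python
# Fitted values copied from the output above
intercept = 2.8769666223179318
coefficients = {
    "TV": 0.04656456787415029,
    "Radio": 0.17915812245088839,
    "Newspaper": 0.003450464711180378,
}

# Advertising spend from the first row of the dataset
spend = {"TV": 230.1, "Radio": 37.8, "Newspaper": 69.2}

# Sales = b + w_TV * TV + w_Radio * Radio + w_Newspaper * Newspaper
predicted_sales = intercept + sum(coefficients[name] * spend[name] for name in coefficients)
print(round(predicted_sales, 2))
```

This is the same arithmetic that linreg.predict performs for each row. Note how small the Newspaper term is compared with the TV and Radio terms.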

y_pred=linreg.predict(X_test)

Note that because Python is dynamically typed, we do not need to declare variables first, so we can keep reusing the same name y_pred from the very beginning.

Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.

There are three common ways to do this:

#first, use the MAE (Mean Absolute Error)

#MAE adds up the absolute error of every sample, then divides by the number of samples

print(metrics.mean_absolute_error(y_test,y_pred))

1.0668917082595206

#second, MSE (Mean Squared Error)

#unlike MAE, MSE squares the error of every sample first, then adds them up and divides by the number of samples

print(metrics.mean_squared_error(y_test,y_pred))

1.9730456202283368

#third, RMSE (Root Mean Squared Error)

#we need the sqrt function in numpy

#as you can see, RMSE is just the square root of MSE

import numpy as np

print(np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

1.404651423032895

These metrics each have advantages and disadvantages:

MAE is the easiest to understand, because it is just the average absolute error.
MSE is more popular than MAE, because MSE "punishes" larger errors.
RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

In this experiment, we use RMSE.
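To make the three definitions concrete, here is a sketch that computes them by hand with NumPy on made-up numbers (not the advertising data). Two sets of predictions have the same MAE, but the one with a single large error has a much higher MSE and RMSE:

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([10.0, 10.0, 10.0, 10.0])
y_even = np.array([15.0, 15.0, 15.0, 15.0])   # four errors of 5
y_spiky = np.array([10.0, 10.0, 10.0, 30.0])  # one error of 20


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) computed directly from their definitions."""
    errors = y_true - y_pred
    mae = float(np.mean(np.abs(errors)))
    mse = float(np.mean(errors ** 2))
    rmse = float(np.sqrt(mse))
    return mae, mse, rmse


even_errors = regression_errors(y_true, y_even)
spiky_errors = regression_errors(y_true, y_spiky)
print(even_errors)
print(spiky_errors)
```

RMSE returns to the units of y, which is why it is easier to interpret than MSE.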

Next, we remove “Newspaper” and re-run the Linear Regression model.

feature_cols=['TV','Radio']

X=data[feature_cols]

y=data.Sales

X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1)

linreg.fit(X_train,y_train)

y_pred=linreg.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

1.3879034699382886

The RMSE decreased when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, it is unlikely that this feature is useful for predicting Sales, and it should be removed from the model.

This suggests that Newspaper does not follow the same relationship as the other two features and the response. Either it is unrelated to Sales, or its relationship to Sales differs from the others and would need a different method, that is, a combined model. The second possibility is, of course, more complex to implement.
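The comparison above can be wrapped in a small helper so that any feature subset is evaluated the same way. This is only a sketch: rmse_for_features is a hypothetical helper, and the DataFrame is synthetic random data standing in for l3_data.csv, so its numbers will not match the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse_for_features(data, feature_cols, target="Sales", random_state=1):
    """Fit LinearRegression on the given columns and return the test RMSE."""
    X = data[feature_cols]
    y = data[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return float(np.sqrt(mean_squared_error(y_test, y_pred)))


# Synthetic stand-in data: Sales depends on TV and Radio only, plus noise
rng = np.random.default_rng(0)
synthetic = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
synthetic["Sales"] = (3 + 0.045 * synthetic["TV"] + 0.18 * synthetic["Radio"]
                      + rng.normal(0, 1.5, 200))

rmse_all = rmse_for_features(synthetic, ["TV", "Radio", "Newspaper"])
rmse_reduced = rmse_for_features(synthetic, ["TV", "Radio"])
print(rmse_all, rmse_reduced)
```

With the real data, passing the DataFrame loaded from l3_data.csv in place of synthetic should reproduce the two RMSE values printed above.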
