Machine Learning (4): Classification Algorithms - K-Nearest Neighbors, kNN (Part 1)

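1. K-Nearest Neighbors Basics

kNN: the K-Nearest Neighbors algorithm

The idea is extremely simple

It requires very little mathematics (close to none)

It works well in practice (drawbacks?)

It can illustrate many of the detail issues that come up when using machine learning algorithms

It gives a fairly complete picture of the machine learning application workflow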

import numpy as np

import matplotlib.pyplot as plt

Implementing our own kNN

Creating a simple test case

raw_data_X = [[3.393533211, 2.331273381],
              [3.110073483, 1.781539638],
              [1.343808831, 3.368360954],
              [3.582294042, 4.679179110],
              [2.280362439, 2.866990263],
              [7.423436942, 4.696522875],
              [5.745051997, 3.533989803],
              [9.172168622, 2.511101045],
              [7.792783481, 3.424088941],
              [7.939820817, 0.791637231]
             ]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)

X_train

array([[ 3.39353321,  2.33127338],
       [ 3.11007348,  1.78153964],
       [ 1.34380883,  3.36836095],
       [ 3.58229404,  4.67917911],
       [ 2.28036244,  2.86699026],
       [ 7.42343694,  4.69652288],
       [ 5.745052  ,  3.5339898 ],
       [ 9.17216862,  2.51110105],
       [ 7.79278348,  3.42408894],
       [ 7.93982082,  0.79163723]])

y_train

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

The kNN procedure

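from math import sqrt

# Query point to classify (assumed; this value reproduces the distances shown below)
x = np.array([8.093607318, 3.365731514])

distances = []
for x_train in X_train:
    # Euclidean distance between one training sample and x
    d = sqrt(np.sum((x_train - x)**2))
    distances.append(d)

distances

[4.812566907609877,
 5.229270827235305,
 6.749798999160064,
 4.6986266144110695,
 5.83460014556857,
 1.4900114024329525,
 2.354574897431513,
 1.3761132675144652,
 0.3064319992975,
 2.5786840957478887]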

distances = [sqrt(np.sum((x_train - x)**2))
             for x_train in X_train]

distances

[4.812566907609877,
 5.229270827235305,
 6.749798999160064,
 4.6986266144110695,
 5.83460014556857,
 1.4900114024329525,
 2.354574897431513,
 1.3761132675144652,
 0.3064319992975,
 2.5786840957478887]
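The loop and the list comprehension above compute one distance at a time. As a sketch, the same Euclidean distances can also be obtained in a single NumPy expression through broadcasting. The query point `x` used here is an assumption: it is the value that reproduces the distances printed above.

```python
import numpy as np

X_train = np.array([[3.393533211, 2.331273381],
                    [3.110073483, 1.781539638],
                    [1.343808831, 3.368360954],
                    [3.582294042, 4.679179110],
                    [2.280362439, 2.866990263],
                    [7.423436942, 4.696522875],
                    [5.745051997, 3.533989803],
                    [9.172168622, 2.511101045],
                    [7.792783481, 3.424088941],
                    [7.939820817, 0.791637231]])

# Assumed query point (consistent with the distances shown in the text)
x = np.array([8.093607318, 3.365731514])

# X_train - x broadcasts x across every row; the norm is then taken per row
distances = np.linalg.norm(X_train - x, axis=1)

# Reference: the per-sample computation used in the text
loop_distances = np.array([np.sqrt(np.sum((row - x) ** 2)) for row in X_train])

print(np.allclose(distances, loop_distances))
print(np.argsort(distances))
```

Both versions give the same numbers; the vectorized form simply moves the loop into NumPy.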

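# Indexes of the training samples, ordered from nearest to farthest
np.argsort(distances)

array([8, 7, 5, 6, 9, 3, 0, 1, 4, 2])

nearest = np.argsort(distances)
k = 6

# Labels of the k nearest samples
topK_y = [y_train[neighbor] for neighbor in nearest[:k]]

topK_y

[1, 1, 1, 1, 1, 0]

from collections import Counter

# Count the votes for each label
votes = Counter(topK_y)

votes

Counter({0: 1, 1: 5})

votes.most_common(1)

[(1, 5)]

# The label with the most votes is the prediction
predict_y = votes.most_common(1)[0][0]

predict_y

1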
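With k = 6 and two classes, a 3 to 3 tie is possible. As a small illustration (the label list below is made up for the example), `Counter.most_common` on Python 3.7+ returns elements with equal counts in the order they were first encountered, so when the labels are listed nearest first, a tie goes to the class of the nearest neighbor.

```python
from collections import Counter

# Hypothetical labels of the 6 nearest neighbors, ordered nearest first
tied_topK_y = [0, 1, 1, 0, 1, 0]

votes = Counter(tied_topK_y)

# Both labels have 3 votes; label 0 was encountered first, so it is returned
winner = votes.most_common(1)[0][0]

print(votes[0], votes[1])
print(winner)
```

Choosing an odd k is one simple way to avoid ties in a two-class problem.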

2. How scikit-learn wraps machine learning algorithms
KNN/KNNN.py

import numpy as np
from math import sqrt
from collections import Counter


class KNNClassifier:

    def __init__(self, k):
        """Initialize the kNN classifier."""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        """Train the kNN classifier on the training set X_train and y_train."""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k."

        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        """Given the data set X_predict, return the vector of predicted labels."""
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the feature number of X_predict must be equal to X_train"

        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """Given a single sample x, return its predicted label."""
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must be equal to X_train"

        distances = [sqrt(np.sum((x_train - x) ** 2))
                     for x_train in self._X_train]
        nearest = np.argsort(distances)

        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)

        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "KNN(k=%d)" % self.k
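The class above follows the same fit/predict pattern that scikit-learn uses. For comparison, here is a sketch of the same prediction with scikit-learn's own `KNeighborsClassifier`. It assumes scikit-learn is installed, and the query point `x` is an assumption chosen to match the distances shown earlier.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[3.393533211, 2.331273381],
                    [3.110073483, 1.781539638],
                    [1.343808831, 3.368360954],
                    [3.582294042, 4.679179110],
                    [2.280362439, 2.866990263],
                    [7.423436942, 4.696522875],
                    [5.745051997, 3.533989803],
                    [9.172168622, 2.511101045],
                    [7.792783481, 3.424088941],
                    [7.939820817, 0.791637231]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Assumed query point (consistent with the distances shown in the text)
x = np.array([8.093607318, 3.365731514])

knn_clf = KNeighborsClassifier(n_neighbors=6)
knn_clf.fit(X_train, y_train)

# predict expects a 2-D array with one row per sample
y_predict = knn_clf.predict(x.reshape(1, -1))
print(y_predict[0])
```

As with the hand-written class, `predict` takes a matrix of samples and returns one label per row.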

kNN_function/KNN.py

import numpy as np
from math import sqrt
from collections import Counter


def kNN_classify(k, X_train, y_train, x):
    assert 1 <= k <= X_train.shape[0], "k must be valid"
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must be equal to the size of y_train"
    assert X_train.shape[1] == x.shape[0], \
        "the feature number of x must be equal to X_train"

    # Distance from every training sample to x
    distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]
    nearest = np.argsort(distances)

    # Majority vote among the k nearest labels
    topK_y = [y_train[i] for i in nearest[:k]]
    votes = Counter(topK_y)

    return votes.most_common(1)[0][0]

3. Training and test data sets

Evaluating the performance of a machine learning algorithm

playML/KNN.py

import numpy as np
from math import sqrt
from collections import Counter


class KNNClassifier:

    def __init__(self, k):
        """Initialize the kNN classifier."""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        """Train the kNN classifier on the training set X_train and y_train."""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k."

        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        """Given the data set X_predict, return the vector of predicted labels."""
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the feature number of X_predict must be equal to X_train"

        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """Given a single sample x, return its predicted label."""
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must be equal to X_train"

        distances = [sqrt(np.sum((x_train - x) ** 2))
                     for x_train in self._X_train]
        nearest = np.argsort(distances)

        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)

        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "KNN(k=%d)" % self.k

playML/model_selection.py

import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Split X and y into X_train, X_test, y_train, y_test according to test_ratio."""
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ratio must be valid"

    # Compare against None so that seed=0 is not silently ignored
    if seed is not None:
        np.random.seed(seed)

    # Shuffle the indexes, then take the first test_size of them as the test set
    shuffled_indexes = np.random.permutation(len(X))

    test_size = int(len(X) * test_ratio)
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]

    X_train = X[train_indexes]
    y_train = y[train_indexes]

    X_test = X[test_indexes]
    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test
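A quick check of the split, as a sketch. The function is restated compactly here so the snippet runs on its own (it tests `seed is not None`, so a seed of 0 is honored too), and the toy arrays and the seed value are made up for illustration.

```python
import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Shuffle the indexes, then cut off the first test_ratio share as the test set."""
    if seed is not None:
        np.random.seed(seed)
    shuffled_indexes = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]
    return X[train_indexes], X[test_indexes], y[train_indexes], y[test_indexes]


# Made-up data: 10 samples with 2 features; each label equals its row number
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio=0.2, seed=666)

# 20% of 10 samples go to the test set, the remaining 80% to the training set
print(X_train.shape, X_test.shape)

# Every sample ends up in exactly one of the two sets
print(sorted(np.concatenate([y_train, y_test]).tolist()))
```

Because X and y are indexed with the same shuffled indexes, each row of X stays paired with its label after the split.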

playML/__init__.py

4. Classification accuracy
playML/metrics.py

import numpy as np


def accuracy_score(y_true, y_predict):
    """Compute the accuracy between y_true and y_predict."""
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"

    # Fraction of positions where the prediction matches the true label
    return np.sum(y_true == y_predict) / len(y_true)
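A small worked example of the accuracy computation, using made-up label arrays: four of the five predictions match, so the accuracy is 4 / 5.

```python
import numpy as np

# Made-up labels for illustration
y_true = np.array([0, 1, 1, 0, 1])
y_predict = np.array([0, 1, 0, 0, 1])

# Element-wise comparison gives a boolean array; True counts as 1 when summed
matches = (y_true == y_predict)
accuracy = np.sum(matches) / len(y_true)

print(matches)
print(accuracy)  # 0.8
```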

In playML/KNN.py, import accuracy_score and add the following method to the KNNClassifier class

from .metrics import accuracy_score

    def score(self, X_test, y_test):
        """Compute the accuracy of the current model on the test set X_test and y_test."""
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

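5. Hyperparameters
Hyperparameters: parameters that must be decided before the algorithm runs

Model parameters: parameters learned while the algorithm runs

The kNN algorithm has no model parameters

The k in kNN is a typical hyperparameter

Finding good hyperparameters:
domain knowledge, empirical values, experimental search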
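Experimental search can be sketched as a simple loop: try each candidate k, measure the accuracy on held-out data, and keep the best one. Everything in the snippet below is an illustrative assumption rather than part of the course code: the helper `knn_predict`, the synthetic two-blob data set, the 20% hold-out, and the range of k values.

```python
import numpy as np
from collections import Counter


def knn_predict(k, X_train, y_train, x):
    """Majority vote among the k training samples nearest to x (hypothetical helper)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)
    votes = Counter(y_train[nearest[:k]].tolist())
    return votes.most_common(1)[0][0]


# Made-up data: two Gaussian blobs, one per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Hold out 20% of the shuffled samples as the test set
indexes = rng.permutation(len(X))
test_idx, train_idx = indexes[:20], indexes[20:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Try k = 1..10 and remember the k with the highest test accuracy
best_k, best_score = -1, 0.0
for k in range(1, 11):
    y_predict = np.array([knn_predict(k, X_train, y_train, x) for x in X_test])
    score = np.sum(y_predict == y_test) / len(y_test)
    if score > best_score:
        best_k, best_score = k, score

print("best_k =", best_k)
print("best_score =", best_score)
```

Because the comparison uses a strict `>`, the smallest k wins when several values reach the same accuracy.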

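This article is only my own understanding and summary of what bobo teaches in his course, and it reflects only my own humble opinion. The course is published by imooc (慕课网). Everyone is welcome to study along.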
