机器学习(四) 分类算法--K近邻算法 KNN (上)

一、K近邻算法基础

KNN------- K近邻算法--------K-Nearest Neighbors

思想极度简单

应用数学知识少（近乎为零）

效果好（缺点？）

可以解释机器学习算法使用过程中很多细节问题

更完整的刻画机器学习应用的流程

import numpy as np

import matplotlib.pyplot as plt

实现我们自己的 kNN

创建简单测试用例

raw_data_X = [[3.393533211, 2.331273381],

              [3.110073483, 1.781539638],

              [1.343808831, 3.368360954],

              [3.582294042, 4.679179110],

              [2.280362439, 2.866990263],

              [7.423436942, 4.696522875],

              [5.745051997, 3.533989803],

              [9.172168622, 2.511101045],

              [7.792783481, 3.424088941],

              [7.939820817, 0.791637231]

             ]

raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X_train = np.array(raw_data_X)

y_train = np.array(raw_data_y)

X_train

array([[ 3.39353321,  2.33127338],

       [ 3.11007348,  1.78153964],

       [ 1.34380883,  3.36836095],

       [ 3.58229404,  4.67917911],

       [ 2.28036244,  2.86699026],

       [ 7.42343694,  4.69652288],

       [ 5.745052  ,  3.5339898 ],

       [ 9.17216862,  2.51110105],

       [ 7.79278348,  3.42408894],

       [ 7.93982082,  0.79163723]])

y_train

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

kNN的过程

from math import sqrt

distances = []

for x_train in X_train:

    d = sqrt(np.sum((x_train - x)**2))

    distances.append(d)

distances

[4.812566907609877,

 5.229270827235305,

 6.749798999160064,

 4.6986266144110695,

 5.83460014556857,

 1.4900114024329525,

 2.354574897431513,

 1.3761132675144652,

 0.3064319992975,

 2.5786840957478887]

distances = [sqrt(np.sum((x_train - x)**2))

             for x_train in X_train]

distances

[4.812566907609877,

 5.229270827235305,

 6.749798999160064,

 4.6986266144110695,

 5.83460014556857,

 1.4900114024329525,

 2.354574897431513,

 1.3761132675144652,

 0.3064319992975,

 2.5786840957478887]

np.argsort(distances)

array([8, 7, 5, 6, 9, 3, 0, 1, 4, 2])

nearest = np.argsort(distances)

k = 6

topK_y = [y_train[neighbor] for neighbor in nearest[:k]]

topK_y

[1, 1, 1, 1, 1, 0]

from collections import Counter

votes = Counter(topK_y)

votes

Counter({0: 1, 1: 5})

votes.most_common(1)

[(1, 5)]

predict_y = votes.most_common(1)[0][0]

predict_y

1

二、scikit-learn 中的机器学习算法封装
KNN/KNNN.py

import numpy as np

from math import sqrt

from collections import Counter

class KNNClassifier:

    def __init__(self, k):

        """初始化kNN分类器"""

        assert k >= 1, "k must be valid"

        self.k = k

        self._X_train = None

        self._y_train = None

    def fit(self, X_train, y_train):

        """根据训练数据集X_train和y_train训练kNN分类器"""

        assert X_train.shape[0] == y_train.shape[0], \

            "the size of X_train must be equal to the size of y_train"

        assert self.k <= X_train.shape[0], \

            "the size of X_train must be at least k."

        self._X_train = X_train

        self._y_train = y_train

        return self

    def predict(self, X_predict):

        """给定待预测数据集X_predict，返回表示X_predict的结果向量"""

        assert self._X_train is not None and self._y_train is not None, \

                "must fit before predict!"

        assert X_predict.shape[1] == self._X_train.shape[1], \

                "the feature number of X_predict must be equal to X_train"

        y_predict = [self._predict(x) for x in X_predict]

        return np.array(y_predict)

    def _predict(self, x):

        """给定单个待预测数据x，返回x的预测结果值"""

        assert x.shape[0] == self._X_train.shape[1], \

            "the feature number of x must be equal to X_train"

        distances = [sqrt(np.sum((x_train - x) ** 2))

                     for x_train in self._X_train]

        nearest = np.argsort(distances)

        topK_y = [self._y_train[i] for i in nearest[:self.k]]

        votes = Counter(topK_y)

        return votes.most_common(1)[0][0]

    def __repr__(self):

        return "KNN(k=%d)" % self.k

kNN_function/KNN.py

import numpy as np

from math import sqrt

from collections import Counter

def kNN_classify(k, X_train, y_train, x):

    assert 1 <= k <= X_train.shape[0], "k must be valid"

    assert X_train.shape[0] == y_train.shape[0], \

        "the size of X_train must equal to the size of y_train"

    assert X_train.shape[1] == x.shape[0], \

        "the feature number of x must be equal to X_train"

    distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]

    nearest = np.argsort(distances)

    topK_y = [y_train[i] for i in nearest[:k]]

    votes = Counter(topK_y)

    return votes.most_common(1)[0][0]

三、训练数据集、测试数据集

判断机器学习算法的性能

playML/KNN.py

import numpy as np

from math import sqrt

from collections import Counter

class KNNClassifier:

    def __init__(self, k):

        """初始化kNN分类器"""

        assert k >= 1, "k must be valid"

        self.k = k

        self._X_train = None

        self._y_train = None

    def fit(self, X_train, y_train):

        """根据训练数据集X_train和y_train训练kNN分类器"""

        assert X_train.shape[0] == y_train.shape[0], \

            "the size of X_train must be equal to the size of y_train"

        assert self.k <= X_train.shape[0], \

            "the size of X_train must be at least k."

        self._X_train = X_train

        self._y_train = y_train

        return self

    def predict(self, X_predict):

        """给定待预测数据集X_predict，返回表示X_predict的结果向量"""

        assert self._X_train is not None and self._y_train is not None, \

                "must fit before predict!"

        assert X_predict.shape[1] == self._X_train.shape[1], \

                "the feature number of X_predict must be equal to X_train"

        y_predict = [self._predict(x) for x in X_predict]

        return np.array(y_predict)

    def _predict(self, x):

        """给定单个待预测数据x，返回x的预测结果值"""

        assert x.shape[0] == self._X_train.shape[1], \

            "the feature number of x must be equal to X_train"

        distances = [sqrt(np.sum((x_train - x) ** 2))

                     for x_train in self._X_train]

        nearest = np.argsort(distances)

        topK_y = [self._y_train[i] for i in nearest[:self.k]]

        votes = Counter(topK_y)

        return votes.most_common(1)[0][0]

    def __repr__(self):

        return "KNN(k=%d)" % self.k

playML/model_selection.py

import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):

    """将数据 X 和 y 按照test_ratio分割成X_train, X_test, y_train, y_test"""

    assert X.shape[0] == y.shape[0], \

        "the size of X must be equal to the size of y"

    assert 0.0 <= test_ratio <= 1.0, \

        "test_ration must be valid"

    if seed:

        np.random.seed(seed)

    shuffled_indexes = np.random.permutation(len(X))

    test_size = int(len(X) * test_ratio)

    test_indexes = shuffled_indexes[:test_size]

    train_indexes = shuffled_indexes[test_size:]

    X_train = X[train_indexes]

    y_train = y[train_indexes]

    X_test = X[test_indexes]

    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test

playML/__init__.py

四、分类的准确度
playML/metrics.py

import numpy as np

def accuracy_score(y_true, y_predict):

    '''计算y_true和y_predict之间的准确率'''

    assert y_true.shape[0] == y_predict.shape[0], \

        "the size of y_true must be equal to the size of y_predict"

    return sum(y_true == y_predict) / len(y_true)

model_selection.py-->KNNClassifier 类里面添加这样一个方法

from .metrics import accuracy_score

    def score(self, X_test, y_test):

        """根据测试数据集 X_test 和 y_test 确定当前模型的准确度"""

        y_predict = self.predict(X_test)

        return accuracy_score(y_test, y_predict)

五、超参数
超参数：在算法运行前需要决定的参数

模型参数：算法过程中学习的参数

KNN算法没有模型参数

KNN算法中的 K 是典型的超参数

寻找好的超参数：
领域知识、经验数值、实验搜索

我写的文章只是我自己对bobo老师讲课内容的理解和整理，也只是我自己的弊见。bobo老师的课是慕课网出品的。欢迎大家一起学习。

机器学习(四) 分类算法--K近邻算法 KNN (上)的更多相关文章

第4章最基础的分类算法-k近邻算法
思想极度简单应用数学知识少效果好(缺点?) 可以解释机器学习算法使用过程中的很多细节问题更完整的刻画机器学习应用的流程 distances = [] for x_train in X_train ...
机器学习(四) 机器学习(四) 分类算法--K近邻算法 KNN (下)
六.网格搜索与 K 邻近算法中更多的超参数七.数据归一化 Feature Scaling 解决方案:将所有的数据映射到同一尺度八.scikit-learn 中的 Scaler preprocess ...
python 机器学习（二）分类算法-k近邻算法
一.什么是K近邻算法? 定义: 如果一个样本在特征空间中的k个最相似(即特征空间中最邻近)的样本中的大多数属于某一个类别,则该样本也属于这个类别. 来源: KNN算法最早是由Cover和Hart提 ...
分类算法----k近邻算法
K最近邻(k-Nearest Neighbor,KNN)分类算法,是一个理论上比较成熟的方法,也是最简单的机器学习算法之一.该方法的思路是:如果一个样本在特征空间中的k个最相似(即特征空间中最邻近)的 ...
机器学习（1）——K近邻算法
KNN的函数写法 import numpy as np from math import sqrt from collections import Counter def KNN_classify(k ...
SIGAI机器学习第七集 k近邻算法
讲授K近邻思想,kNN的预测算法,距离函数,距离度量学习,kNN算法的实际应用. KNN是有监督机器学习算法,K-means是一个聚类算法,都依赖于距离函数.没有训练过程,只有预测过程. 大纲: k近 ...
最基础的分类算法-k近邻算法 kNN简介及Jupyter基础实现及Python实现
k-Nearest Neighbors简介对于该图来说,x轴对应的是肿瘤的大小,y轴对应的是时间,蓝色样本表示恶性肿瘤,红色样本表示良性肿瘤,我们先假设k=3,这个k先不考虑怎么得到,先假设这个k是 ...
【学习笔记】分类算法-k近邻算法
k-近邻算法采用测量不同特征值之间的距离来进行分类. 优点:精度高.对异常值不敏感.无数据输入假定缺点:计算复杂度高.空间复杂度高使用数据范围:数值型和标称型用例子来理解k-近邻算法电影可以按 ...
机器学习03：K近邻算法
本文来自同步博客. P.S. 不知道怎么显示数学公式以及排版文章.所以如果觉得文章下面格式乱的话请自行跳转到上述链接.后续我将不再对数学公式进行截图,毕竟行内公式截图的话排版会很乱.看原博客地址会有更 ...

随机推荐

HTTP 文件共享服务器工具 - chfs
CuteHttpFileServer/chfs是一个免费的.HTTP协议的文件共享服务器,使用浏览器可以快速访问.它具有以下特点: 单个文件,整个软件只有一个可执行程序,无配置文件等其他文件跨平台运 ...
react添加右键点击事件
1.在HTML里面支持contextmenu事件(右键事件).所以需要在组建加载完时添加此事件,销毁组建时移除此事件. 2. 需要增加一个state,名称为visible,用来控制菜单是否显示.在_h ...
用Maven+IDEA+Eclipse组合获得最好的OpenJML体验
OpenJML+SMTSolver的形式化验证想必大家都已经尝试过了.大家或许体验的更多的是IDEA上命令行输出版本的OpenJML插件,但真正得到官方支持的完全版OpenJML是它的Eclipse版 ...
Qt之QImageReader
简述 QImageReader类为从文件或设备读取图像提供了一个独立的接口. 读取图像最常用的方法是通过构造QImage和QPixmap,或通过调用QImage::load()和QPixmap::lo ...
50个经典Sql语句
50个经典Sql语句 --1.学生表 Student(S,Sname,Sage,Ssex) --S 学生编号,Sname 学生姓名,Sage 出生年月,Ssex 学生性别 --2.课程表 Cours ...
UILite-MFC/WTL/DirectUI界面库
之前写了UILite库介绍: http://blog.csdn.net/zhangzq86/article/details/9093945 如今UILite库能够使用git訪问了: https://g ...
严重: 文档无效: 找不到语法。 at (null:2:19)
1.错误描写叙述严重: 文档无效: 找不到语法. at (null:2:19) org.xml.sax.SAXParseException; systemId: file:/D:/MyEclipse ...
一：Java之面向对象基本概念
1.面向对象面向对象(Object Oriented)是一种新兴的程序设计方法,或者是一种新的程序设计规范(paradigm).其基本思想是使用对象.类.继承.封装.多态等基本概念来进行程序设计.从 ...
聊聊高并发（四十四）解析java.util.concurrent各个组件（二十） Executors工厂类
Executor框架为了更方便使用,提供了Executors这个工厂类.通过一系列的静态工厂方法.能够高速地创建对应的Executor实例. 仅仅有一个nThreads參数的newFixedThrea ...
35.自己实现vector模板库myvector
myvector.h #pragma once //自己写的vector模板库 template <class T> class myvector { public: //构造 myvec ...

机器学习(四) 分类算法--K近邻算法 KNN (上)

一、K近邻算法基础

kNN的过程

机器学习(四) 分类算法--K近邻算法 KNN (上)的更多相关文章

随机推荐

热门专题