[comment]: # Unsupervised Classification - Sprawl Classification Algorithm

Idea

Points (data) in the same cluster are near each other, or are connected to each other through other points.

So:

  • For a distance d, every point in a cluster can always find some other point of the same cluster within d.
  • Distances between points in different clusters are greater than d.

    These conditions do not always hold: for example, when two clusters share common points, the second condition fails.

    So the approach needs some improvement.
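
One way to make the two conditions precise, assuming a distance function \(\operatorname{dist}\) (for example the Euclidean distance, which the text does not fix), is to treat two points \(x\) and \(y\) as belonging to the same cluster when a chain of points links them and every step of the chain is shorter than \(d\):

\[
x \sim y \iff \exists\, p_0 = x,\ p_1,\ \dots,\ p_k = y \ \text{ such that } \operatorname{dist}(p_{j-1}, p_j) < d \ \text{ for } j = 1, \dots, k.
\]

Under this reading the clusters are the connected components of the graph that joins every pair of points closer than \(d\), which also shows why two clusters that share common points end up merged into one.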

Sprawl Classification Algorithm

  • Input:

    • data: the training data
    • d: the minimum distance between clusters
    • minConnectedPoints: the minimum number of connected points
  • Output:
    • Result: an array of clusters, each holding the points classified into it
  • Logic:
Load data into TotalCache.
i = 0
while (TotalCache.size > 0)
{
Take any point A from TotalCache and put A into Cache2.
Remove A from TotalCache.
while (Cache2.size > 0)
{
In TotalCache, find the points 'nearPoints' that are less than d from any point in Cache2.
Put the Cache2 points into Cache1.
Clear Cache2.
Put nearPoints into Cache2.
Remove nearPoints from TotalCache.
}
Add the Cache1 points into Result[i].
Clear Cache1.
i++
}
Return Result
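
The steps above can be sketched in Python as follows. This is a minimal sketch under a few assumptions, not a definitive implementation: the name `sprawl_classify` is hypothetical, the distance is taken to be Euclidean (via `math.dist`), the expansion from Cache2 is assumed to repeat until no more near points are found, and `minConnectedPoints` is left out because the listed steps never use it.

```python
import math

def sprawl_classify(data, d):
    """Group points so that each cluster grows by absorbing points closer than d."""
    total_cache = list(data)   # TotalCache: points not yet assigned to a cluster
    result = []                # Result: one list of points per cluster
    while total_cache:
        frontier = [total_cache.pop()]   # Cache2: points whose neighbours are still unexplored
        cluster = []                     # Cache1: points already accepted into this cluster
        while frontier:
            near, rest = [], []
            for p in total_cache:
                # p joins the cluster if it is closer than d to any frontier point
                if any(math.dist(p, q) < d for q in frontier):
                    near.append(p)
                else:
                    rest.append(p)
            cluster.extend(frontier)     # move Cache2 into Cache1
            frontier = near              # nearPoints become the new Cache2
            total_cache = rest           # nearPoints are removed from TotalCache
        result.append(cluster)
    return result

# Two visibly separate groups of 2-D points (illustrative data only)
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0)]
clusters = sprawl_classify(points, d=0.6)
```

Note that (0.0, 0.0) and (1.0, 0.0) are farther apart than d, yet they land in the same cluster because (0.5, 0.0) connects them.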

Note: Because the algorithm calculates distances between points, the data may need to be normalized first so that each feature has the same weight.
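
As an illustration of the normalization step, the sketch below rescales every feature to the range [0, 1]. Min-max scaling is an assumption here, since the text does not say which normalization to use, and the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(data):
    """Rescale every feature (column) of the data to the range [0, 1]."""
    columns = list(zip(*data))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    # A constant feature has zero span; use 1.0 there to avoid dividing by zero.
    spans = [(hi - lo) or 1.0 for lo, hi in zip(lows, highs)]
    return [
        tuple((x - lo) / span for x, lo, span in zip(row, lows, spans))
        for row in data
    ]

# The second feature has a larger raw scale than the first (illustrative data only)
scaled = min_max_normalize([(0, 10), (5, 20), (10, 30)])
```

After scaling, d has to be chosen on the normalized scale rather than in the original units.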

Improvement

A big problem is that the method requires too many distance calculations between points. In the worst case the number of calculations is \(\frac{n(n - 1)}{2}\).

Improvement ideas:

  • Checking the distance on one feature first may be quicker.

    We do not need to calculate the real distance for each pair, because we only need to know whether the distance is less than \(d\).

    For two points x1 and x2, the distance is greater than or equal to \(d\) whenever there is a feature i with \(\vert x1[i] - x2[i] \vert \geqslant d\).
  • Split the data into multiple areas.

    For a dataset with n dimensions (features), we can split it into multiple smaller datasets, each contained in an n-dimensional cube of side length \(d\), that is, of size \(d^{n}\).

    We can imagine the small spaces as n-dimensional cubes that adjoin each other,

    so we only need to compare a point with the points in its own space and in the neighbouring spaces.
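
Both ideas can be sketched together in Python. This is only a sketch under stated assumptions: the distance is taken to be Euclidean, the cubes are aligned to a grid with side length d, and the names `within_d`, `build_grid` and `near_points` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def within_d(p, q, d):
    """Return True if the Euclidean distance between p and q is less than d."""
    # Cheap rejection: a gap of at least d on one feature already means distance >= d.
    if any(abs(a - b) >= d for a, b in zip(p, q)):
        return False
    return math.dist(p, q) < d

def build_grid(points, d):
    """Map each cube index (one integer per feature) to the points inside that cube."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(x / d) for x in p)].append(p)
    return grid

def near_points(p, grid, d):
    """Find points closer than d to p by scanning only p's cube and the adjacent cubes."""
    cell = tuple(math.floor(x / d) for x in p)
    found = []
    for offset in product((-1, 0, 1), repeat=len(cell)):
        neighbour = tuple(c + o for c, o in zip(cell, offset))
        for q in grid.get(neighbour, ()):
            # Skip p itself (this also skips exact duplicates of p).
            if q != p and within_d(p, q, d):
                found.append(q)
    return found

# Illustrative data only
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0)]
grid = build_grid(points, d=0.6)
neighbours = near_points((0.5, 0.0), grid, d=0.6)
```

The scan visits \(3^{n}\) cubes per point, so the grid helps most when the number of features is small and the points are spread over many cubes.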

Cons

  • Requires a large amount of computation.
  • Needs improvement to handle clusters that share common points.
