Unsupervised Classification - Sprawl Classification Algorithm
[comment]: # Unsupervised Classification - Sprawl Classification Algorithm
Idea
Points (data) in same cluster are near each others, or are connected by each others.
So:
- For a distance d,every points in a cluster always can find some points in the same cluster.
- Distances between points in difference clusters are bigger than the distance d.
The above condition maybe not correct totally, e.g. in the case of clusters which have common points, the condition will be incorrect.
So need some improvement.
Sprawl Classification Algorithm
- Input:
- data: Training Data
- d: The minimum distance between clusters
- minConnectedPoints: The minimum connected points:
- Output:
- Result: an array of classified data
- Logical:
Load data into TotalCache.
i = 0
while (TotalCache.size > 0)
{
Find a any point A from TotalCache, put A into Cache2.
Remove A from TotalCache
In TotalCache, find points 'nearPoints' less than d from any point in the Cache2.
Put Cache2 points into Cache1.
Clear Cache2.
Put nearPoints into Cache2.
Remove nearPoints from TotalCache.
if Cache2.size = 0, add Cache1 points into Result[i].
Clear Cache1.
i++
}
Return Result
Note: As the algorithm need to calculating the distances between points, maybe need to normalize data first to each feature has same weight.
Improvement
A big problem is the method need too much calculation for the distances between points. The max times is \(/frac{n * (n - 1)}{2}\).
Improvement ideas:
- Check distance for one feature first maybe quicker.
We need not to calculate the real distance for each pair, because we only need to make sure whether the distance is less than \(d\),
if points x1, x2, the distance will be bigger or equals to \(d\) when there is a $ \vert x1[i] - x2[i] \vert \geqslant d$. - Split data in multiple area
For n dimensions (features) dataset, we can split the dataset into multiple smaller datasets, each dataset is in a n dimension space whose size \(d^{n}\).
We can image that each small space is a n dimensions cube and adjoin each other.
so we only need to calculate points in the current space and neighbour spaces.
Cons
- Need a amount of calculating.
- Need to improve to handle clusters which have common points.
Unsupervised Classification - Sprawl Classification Algorithm的更多相关文章
- 微软亚洲实验室一篇超过人类识别率的论文:Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification ImageNet Classification
在该文章的两大创新点:一个是PReLU,一个是权值初始化的方法.下面我们分别一一来看. PReLU(paramter ReLU) 所谓的PRelu,即在 ReLU激活函数的基础上加入了一个参数,看一个 ...
- What are the advantages of different classification algorithms?
What are the advantages of different classification algorithms? For instance, if we have large train ...
- Classification / Recognition
转载 https://handong1587.github.io/deep_learning/2015/10/09/recognition.html#facenet Classification / ...
- sklearn中的metrics模块中的Classification metrics
metrics是sklearn用来做模型评估的重要模块,提供了各种评估度量,现在自己整理如下: 一.通用的用法:Common cases: predefined values 1.1 sklearn官 ...
- 机器学习-TensorFlow应用之classification和ROC curve
概述 前面几节讲的是linear regression的内容,这里咱们再讲一个非常常用的一种模型那就是classification,classification顾名思义就是分类的意思,在实际的情况是非 ...
- 学习笔记之k-nearest neighbors algorithm (k-NN)
k-nearest neighbors algorithm - Wikipedia https://en.wikipedia.org/wiki/K-nearest_neighbors_algorith ...
- Exploratory Undersampling for Class-Imbalance Learning
Abstract - Undersampling is a popular method in dealing with class-imbalance problems, which uses on ...
- arcmap Command
The information in this document is useful if you are trying to programmatically find a built-in com ...
- A Gentle Guide to Machine Learning
A Gentle Guide to Machine Learning Machine Learning is a subfield within Artificial Intelligence tha ...
随机推荐
- IIS 日志文件分析
先安装下文参考资料中的log parser studio 然后就可以针对日志文件进行sql语句的查询了. 各页面访问量排行 ) FROM '[LOGFILEPATH]' where cs-uri-st ...
- swift 附属脚本
附属脚本是访问对象,集合或序列的快捷方式 struct STest{ let constValue:Int subscript(count:Int)->Int{ return count*con ...
- 客户端配置文件tnsname.ora
ARP2 = (DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = 182.168.1.173)(PORT = 1521) ...
- vs.net 2005 C# WinForm GroupBOX 的BUG?尝试读取或写入受保护的内存。这通常指示其他内存已损坏
其实很久没有写程序了,国庆难得有空闲,写了个游戏辅助机器人,程序写好能用后本想把UI控件放到GroupBox里归下分类,美化下界面,结果一运行报“尝试读取或写入受保护的内存.这通常指示其他内存已损坏” ...
- 公网IP、私网IP
公网.内网是两种Internet的接入方式.公网接入方式:上网的计算机得到的IP地址是Internet上的非保留地址,公网的计算机和Internet上的其他计算机可随意互相访问. NAT(Networ ...
- Codeforces Round #195 A B C 三题合集 (Div. 2)
A 题 Vasily the Bear and Triangle 题目大意 一个等腰直角三角形 ABC,角 ACB 是直角,AC=BC,点 C 在原点,让确定 A 和 B 的坐标,使得三角形包含一个矩 ...
- 六款值得推荐的android(安卓)开源框架
1.volley 项目地址 https://github.com/smanikandan14/Volley-demo (1) JSON,图像等的异步下载: (2) 网络请求的排序(scheduli ...
- Maven更新子模块的版本号
mark! 已写成了另一篇,不要打我.
- 下载安装与配置 Java JDK 7
1. 去 Oracle 的官网下载 JDK,我下载的是:jdk-7u25-windows-x64.exe 大小为:90.6M 2. 双击它安装. 3. 安装完后,JDK 配置如下: 01 02 - ...
- DDD:Repository和UnitOfWork的生命周期问题
UnitOfWork UnitOfWork是一种有状态的.用例级别的对象.如果不采用ORM是不会使用UnitOfWork模式的, Repository Repository是一种特殊的领域服务,因此是 ...