Unsupervised Classification - Sprawl Classification Algorithm
Idea
Points (data) in the same cluster are near each other, or are connected to each other through other points of the cluster.
So:
- For a distance d, every point in a cluster can always find at least one other point of the same cluster within d.
- Distances between points in different clusters are greater than d.
These conditions do not always hold, e.g. they fail when clusters have common points.
So the approach needs some improvement.
Sprawl Classification Algorithm
- Input:
  - data: the training data
  - d: the minimum distance between clusters
  - minConnectedPoints: the minimum number of connected points
- Output:
  - Result: an array of classified data (one entry per cluster)
- Logic:
Load data into TotalCache.
i = 0
while (TotalCache.size > 0)
{
    Take any point A from TotalCache and put A into Cache2.
    Remove A from TotalCache.
    while (Cache2.size > 0)
    {
        In TotalCache, find the points 'nearPoints' that are less than d from any point in Cache2.
        Put the Cache2 points into Cache1.
        Clear Cache2.
        Put nearPoints into Cache2.
        Remove nearPoints from TotalCache.
    }
    Add the Cache1 points to Result[i].
    Clear Cache1.
    i++
}
Return Result
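The logic above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes the points are equal-length numeric tuples and that "distance" means Euclidean distance, and the names `sprawl_classify`, `frontier` and `cluster` are made up for the example. The `minConnectedPoints` input is left out because the logic above does not use it.

```python
import math


def sprawl_classify(data, d):
    """Group points into clusters by spreading out from a seed point.

    data: list of equal-length numeric tuples (assumed input format).
    d: two points closer than d are treated as connected.
    """
    total_cache = list(data)  # points not yet assigned to a cluster
    result = []
    while total_cache:
        frontier = [total_cache.pop()]  # Cache2: points found in the last step
        cluster = []                    # Cache1: points already expanded
        while frontier:
            near, rest = [], []
            for p in total_cache:
                # p joins the cluster if it is within d of any frontier point
                if any(math.dist(p, q) < d for q in frontier):
                    near.append(p)
                else:
                    rest.append(p)
            cluster.extend(frontier)
            frontier = near
            total_cache = rest
        result.append(cluster)
    return result


points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0)]
clusters = sprawl_classify(points, d=0.6)
```

In this example the three points on the x-axis are chained together through their neighbours even though the two end points are 1.0 apart, which is the "connected by each other" idea described above.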
Note: Because the algorithm calculates distances between points, the data may need to be normalized first so that each feature has the same weight.
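One possible way to normalize is min-max scaling, which maps every feature to the range [0, 1]. The sketch below is only an assumption about what "normalize" could mean here (the post does not name a method), and `min_max_normalize` is a hypothetical helper name.

```python
def min_max_normalize(data):
    """Rescale every feature (column) of data to the range [0, 1]."""
    columns = list(zip(*data))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        # a constant feature has span 0, so map it to 0.0 to avoid dividing by zero
        tuple((x - lo) / span if span else 0.0
              for x, lo, span in zip(row, lows, spans))
        for row in data
    ]


scaled = min_max_normalize([(0, 10), (5, 20), (10, 30)])
```

After scaling, d has to be chosen on the normalized scale rather than in the original units.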
Improvement
A big problem is that the method needs too many distance calculations between points. The maximum number of calculations is \(\frac{n (n - 1)}{2}\).
Improvement ideas:
- Checking the distance for one feature first may be quicker.
We do not need to calculate the real distance for each pair, because we only need to know whether the distance is less than \(d\):
for points x1 and x2, the distance is greater than or equal to \(d\) whenever there is a feature i with \( \vert x1[i] - x2[i] \vert \geqslant d \).
- Split the data into multiple areas.
For an n-dimensional (n features) dataset, we can split the dataset into multiple smaller datasets, each lying in an n-dimensional space of size \(d^{n}\).
We can imagine each small space as an n-dimensional cube that adjoins its neighbours,
so we only need to compare a point with the points in its own space and in the neighbouring spaces.
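Both ideas can be sketched together in Python. This is an illustration under assumptions, not the author's implementation: it assumes Euclidean distance and points given as numeric tuples, and the names `is_near`, `build_grid` and `near_points` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def is_near(p, q, d):
    """Return True if the Euclidean distance between p and q is below d."""
    total = 0.0
    for a, b in zip(p, q):
        diff = abs(a - b)
        if diff >= d:  # one feature alone already rules the pair out
            return False
        total += diff * diff
    return total < d * d


def build_grid(points, d):
    """Map each cell index (one integer per feature) to the points inside it."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(x / d) for x in p)].append(p)
    return grid


def near_points(p, grid, d):
    """Find points within d of p by scanning only p's cell and its neighbours."""
    cell = tuple(math.floor(x / d) for x in p)
    found = []
    for offset in product((-1, 0, 1), repeat=len(cell)):
        neighbour = tuple(c + o for c, o in zip(cell, offset))
        for q in grid.get(neighbour, ()):
            # points equal to p itself are skipped
            if q != p and is_near(p, q, d):
                found.append(q)
    return found


points = [(0.1, 0.1), (0.9, 0.1), (1.1, 0.1), (3.0, 3.0)]
grid = build_grid(points, d=1.0)
neighbours = near_points((0.9, 0.1), grid, d=1.0)
```

Because each cell has side d, two points closer than d can differ by at most one cell index per feature, so scanning the adjacent cells is enough. The number of cells scanned per point is \(3^{n}\), so this sketch is mainly attractive when the number of features is small.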
Cons
- Needs a large amount of calculation.
- Needs improvement to handle clusters that have common points.