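Classification looks like considerably more work than clustering and recommendation.

How classification differs from clustering and recommendation: it must produce a definite answer (a category), it must be supervised, and it is used mainly for prediction and detection.

Advantages of Mahout: the resource requirements of Mahout's classification algorithms grow no faster than the volume of training and test data, and the algorithms can be converted to run as distributed applications. (If the data set is not large enough, Mahout may perform worse than other kinds of systems.)

Glossary: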

| Key idea | Description |
| --- | --- |
| Model | A computer program that makes decisions; in classification, the output of the training algorithm is a model. |
| Training data | Subset of training examples labeled with the value of the target variable and used as input to the learning algorithm to produce the model. |
| Test data | Withheld portion of training examples given to the model without the value for the target variable (although the value is known) and used to evaluate the model. |
| Training | Learning process that uses training data to produce a model. That model can then compute estimates of the target variable given the predictor variables as inputs. |
| Training example | Entity with features that will be used as input for the learning algorithm. |
| Feature | A known characteristic of a training or new example; "feature" is equivalent to "characteristic". |
| Variable | In this context, a variable is equivalent to the value of a feature or a function of several features. This usage is somewhat different from a normal variable in a computer program. |
| Record | A container where an example is stored; such a record is composed of fields. |
| Field | Part of a record that contains the value of a feature (variable). |
| Predictor variable | Feature selected for use as input for a classification model. Not all features need be used. Some features may be algorithmic combinations of other features. |
| Target variable | Feature that the classification model is attempting to estimate: the target variable is categorical and its determination is the aim of the classification system. |

As a rule of thumb, 80-90% of the data is used as training data and the remainder is held back as test data.
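The split described above can be sketched in a few lines of Python. This is only an illustration, not part of Mahout: the function name `split_examples`, the fixed seed, and the toy `(features, label)` tuples are all assumptions made for the example.

```python
import random

def split_examples(examples, train_fraction=0.8, seed=42):
    """Shuffle labeled examples and split them into (training, test) lists.

    Hypothetical helper for illustration; Mahout does not ship this function.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(examples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy data: 40 examples, each a (feature value, label) pair.
examples = [(i, i % 2) for i in range(40)]
train, test = split_examples(examples, train_fraction=0.8)
print(len(train), len(test))
```

Shuffling before cutting matters when the file is sorted by label or by time; without it the test set may contain only one category.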

The four value types in Mahout classification

| Type of value | Description |
| --- | --- |
| Continuous | This type of value is a floating point value. This might be a price, a weight, a time, a value or anything else that has a numerical magnitude and where this magnitude is the key property of the value. |
| Categorical | A categorical value can have one of a pre-specified set of values. Typically the set of categorical values is relatively small and may be as small as two, although the set can be quite large. Boolean values are generally treated as categorical values. Another example might be a vendor ID. |
| Word-like | A word-like value is like a categorical value, but it has an open-ended set of possible values. |
| Text-like | A text-like value is a sequence of word-like values, all of the same kind. Text is the classic example of a text-like value, but a list of email addresses or URLs is also text-like. |

Example values for each type

| Name | Type | Value |
| --- | --- | --- |
| from-address | word-like | George <george@fumble-tech.com> |
| in-address-book? | categorical (TRUE, FALSE) | TRUE |
| non-spam-words | text-like | “Ted”, “Mahout”, “User”, “lunch” |
| spam-words | text-like | “available” |
| unknown-words | continuous | 0 |
| message-length | continuous | 31 |
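To show how values of these four types could end up in one fixed-size numeric vector, here is a minimal sketch of feature hashing applied to the example record above. It is an assumption-laden illustration, not Mahout's encoder: the names `hashed_index` and `encode`, the use of CRC32 as the hash, and the vector size of 20 are all chosen for this example.

```python
import zlib

def hashed_index(name, value, size):
    """Map a (variable name, value) pair to a slot in a vector of `size` entries.

    CRC32 is used because it is deterministic across runs (hypothetical choice).
    """
    return zlib.crc32(f"{name}:{value}".encode("utf-8")) % size

def encode(record, types, size=20):
    """Encode one record as a list of floats using the hashing trick."""
    vector = [0.0] * size
    for name, value in record.items():
        kind = types[name]
        if kind == "continuous":
            # The magnitude itself is stored in the slot chosen by the name.
            vector[hashed_index(name, "", size)] += float(value)
        elif kind in ("categorical", "word-like"):
            # One slot per distinct value; the slot is incremented by 1.
            vector[hashed_index(name, value, size)] += 1.0
        elif kind == "text-like":
            # A text-like value is a sequence of word-like values.
            for word in value:
                vector[hashed_index(name, word, size)] += 1.0
        else:
            raise ValueError(f"unknown type: {kind}")
    return vector

types = {
    "from-address": "word-like",
    "in-address-book?": "categorical",
    "non-spam-words": "text-like",
    "spam-words": "text-like",
    "unknown-words": "continuous",
    "message-length": "continuous",
}
record = {
    "from-address": "George <george@fumble-tech.com>",
    "in-address-book?": "TRUE",
    "non-spam-words": ["Ted", "Mahout", "User", "lunch"],
    "spam-words": ["available"],
    "unknown-words": 0,
    "message-length": 31,
}
vector = encode(record, types)
print(len(vector), sum(vector))
```

Different values may collide in the same slot; a larger vector makes collisions rarer at the cost of memory.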

Steps in applying classification

| Stage | Step |
| --- | --- |
| 1. Training the model | Define target variable |
| | Collect historical data |
| | Define predictor variables |
| | Select a learning algorithm |
| | Use learning algorithm to train model |
| 2. Evaluating the model | Run test data |
| | Adjust input (different predictor variables and/or algorithm) |
| 3. Using model in production | Input the new examples to estimate unknown target values |
| | Retrain model as needed |

Mahout's command-line tools

    $ $MAHOUT_HOME/bin/mahout
    An example program must be given as the first argument.
    Valid program names are:
    canopy: : Canopy clustering
    cat : Print a file or resource as the logistic regression models would see it
    ...
    runlogistic : Run a logistic regression model against CSV data
    ...
    trainlogistic : Train a logistic regression using stochastic gradient descent

Demo:

cat: view a file

    $ bin/mahout cat donut.csv
    "x","y","shape","color","k","k0","xx","xy","yy","a","b","c","bias"
    0.923307513352484,0.0135197141207755,21,2,4,8,0.852496764213146,...,1
    0.711011884035543,0.909141522599384,22,2,3,9,0.505537899239772,...,1
    ...
    0.67132937326096,0.571220482233912,23,1,5,2,0.450683127402953,...,1
    0.548616112209857,0.405350996181369,24,1,5,3,0.300979638576258,...,1
    0.677980388281867,0.993355110753328,25,2,3,9,0.459657406894831,...,1
    $

trainlogistic: train a model from the data

    $ $MAHOUT_HOME/bin/mahout trainlogistic --input donut.csv \
        --output ./model \
        --target color --categories 2 \
        --predictors x y --types numeric \
        --features 20 --passes 100 --rate 50
    ...
    color ~ -0.157*Intercept Term + -0.678*x + -0.416*y
    Intercept Term -0.15655
    x -0.67841
    y -0.41587
    ...
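The printed formula is a linear combination of the predictors; a logistic regression model passes that sum through the logistic function to obtain a score between 0 and 1. The sketch below applies the coefficients printed above to a point (x, y). Which of the two `color` categories the score refers to is not stated in the output, so that interpretation is left open here, and the function name `score` is hypothetical.

```python
import math

# Coefficients copied from the trainlogistic output above.
INTERCEPT = -0.15655
COEF_X = -0.67841
COEF_Y = -0.41587

def score(x, y):
    """Logistic score for one example, using the trained coefficients."""
    z = INTERCEPT + COEF_X * x + COEF_Y * y   # linear part of the model
    return 1.0 / (1.0 + math.exp(-z))         # logistic (sigmoid) function

# First data row shown by `mahout cat donut.csv`.
print(score(0.923307513352484, 0.0135197141207755))
```

Because both coefficients are negative, the score falls as either x or y grows.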

| Option | What it does |
| --- | --- |
| --quiet | Produce less status and progress output. |
| --input <file-or-resource> | Use the specified file or resource as input. |
| --output <file-for-model> | Put the model into the specified file. |
| --target <variable> | Use the specified variable as the target. |
| --categories <n> | How many categories does the target variable have? |
| --predictors <v1> ... <vn> | A list of the names of the predictor variables. |
| --types <t1> ... <tm> | A list of the types of the predictor variables. Each type should be one of numeric, word or text. Types can be abbreviated to their first letter. If too few types are given, the last one is used again as necessary. Use word for categorical variables. |
| --passes | The number of times the input data should be re-examined during training. Small input files may need to be examined dozens of times. Very large input files probably don't even need to be completely examined. |
| --lambda | Controls how much the algorithm tries to eliminate variables from the final model. A value of 0 indicates no effort is made. Typical values are on the order of 0.00001 or less. |
| --rate | The initial learning rate. This can be large if you have lots of data or use lots of passes because it is decreased progressively as data is examined. |
| --noBias | Do not use the built-in constant in the model (this eliminates the Intercept Term from the model). Occasionally this is a good idea, but generally it is not, since the SGD learning algorithm can usually eliminate the intercept term if warranted. |
| --features | The size of the internal feature vector to use in building the model. A larger value here can be helpful, especially with text-like input data. |
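To make the roles of `--passes`, `--rate` and `--lambda` more concrete, here is a minimal pure-Python sketch of logistic regression trained by stochastic gradient descent. It is not Mahout's implementation: the per-pass decay schedule `rate / (1 + pass_index)`, the simple L2-style shrinkage used for `lam`, and the function names are assumptions chosen to keep the example short.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(rows, passes=100, rate=1.0, lam=0.0, seed=1):
    """Train on rows of (features, label), label in {0, 1}.

    Returns weights; weights[0] is the intercept term.
    """
    rng = random.Random(seed)
    weights = [0.0] * (len(rows[0][0]) + 1)
    for pass_index in range(passes):           # one pass = one sweep over the data
        eta = rate / (1.0 + pass_index)        # assumed schedule: rate shrinks each pass
        order = list(rows)
        rng.shuffle(order)                     # visit examples in random order
        for features, label in order:
            x = [1.0] + list(features)         # constant 1.0 feeds the intercept
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for i, xi in enumerate(x):
                # Gradient step on the log-likelihood, with optional shrinkage.
                weights[i] += eta * ((label - p) * xi - lam * weights[i])
    return weights

def predict(weights, features):
    """Score a new example with trained weights."""
    x = [1.0] + list(features)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

# Toy data: one predictor in [0, 1]; the label is 1 when the value exceeds 0.5.
rows = [((i / 20.0,), 1 if i > 10 else 0) for i in range(21)]
weights = train_logistic(rows, passes=200, rate=1.0)
print(predict(weights, (0.0,)), predict(weights, (1.0,)))
```

A small file such as this toy set needs many passes before the weights settle, which matches the note on `--passes` in the table.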

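runlogistic: evaluate the model

    $ bin/mahout runlogistic --input donut.csv --model ./model \
        --auc --confusion
    AUC = 0.57
    confusion: [[27.0, 13.0], [0.0, 0.0]]

AUC and the confusion matrix both describe how well the model classifies. AUC (area under the ROC curve, printed after all the data has been read) is better the closer it is to 1; a value of 0.5 corresponds to random guessing, so the 0.57 above is only slightly better than chance. The confusion matrix counts the correct and incorrect classifications at the chosen threshold.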
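Both measures can be computed by hand from a list of scores and true labels. The sketch below is an illustration only: the function names are hypothetical, the toy scores are made up for the example, and the `matrix[actual][predicted]` layout is an assumption that may not match the orientation Mahout prints.

```python
def auc(scores, labels):
    """Probability that a random positive example outscores a random negative one."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5        # ties count as half a win
    return wins / (len(positives) * len(negatives))

def confusion(scores, labels, threshold=0.5):
    """Return a 2x2 matrix of counts indexed as matrix[actual][predicted]."""
    matrix = [[0, 0], [0, 0]]
    for s, l in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        matrix[l][predicted] += 1
    return matrix

# Made-up scores and labels, for illustration only.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
print(auc(scores, labels))         # 5 of the 6 positive/negative pairs are ordered correctly
print(confusion(scores, labels))   # [[2, 1], [1, 1]]
```

AUC does not depend on the threshold, whereas the confusion matrix changes whenever the threshold does.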

Options

| Option | What it does |
| --- | --- |
| --quiet | Produce less status and progress output. |
| --auc | Print out AUC score for model versus input data after reading data. |
| --scores | Print target variable value and scores for each input example. |
| --threshold <t> | Set the threshold for confusion matrix computation to t (default 0.5). |
| --confusion | Print out confusion matrix for a particular threshold (see --threshold). |
| --input <input> | Read data records from specified file or resource. |
| --model <model> | Read model from specified file. |

 

