Spark MLlib + Maven + Scala: Testing the Waters

A spam classifier using logistic regression trained with SGD

 package com.oreilly.learningsparkexamples.scala

 import org.apache.spark.{SparkConf, SparkContext}

 import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

 import org.apache.spark.mllib.feature.HashingTF

 import org.apache.spark.mllib.regression.LabeledPoint

 object MLlib {

   def main(args: Array[String]): Unit = {

     val conf = new SparkConf().setAppName("MLlib example")

     val sc = new SparkContext(conf)

     // Load 2 types of emails from text files: spam and ham (non-spam).

     // Each line has text from one email.

     val spam = sc.textFile("files/spam.txt")

     val ham = sc.textFile("files/ham.txt")

     // Create a HashingTF instance to map email text to vectors of 100 features.

     val tf = new HashingTF(numFeatures = 100)

     // Each email is split into words, and each word is mapped to one feature.

     val spamFeatures = spam.map(email => tf.transform(email.split(" ")))

     val hamFeatures = ham.map(email => tf.transform(email.split(" ")))

     // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.

     val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))

     val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))

     val trainingData = positiveExamples ++ negativeExamples

     trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.

     // Create a Logistic Regression learner which uses the SGD.

     val lrLearner = new LogisticRegressionWithSGD()

     // Run the actual learning algorithm on the training data.

     val model = lrLearner.run(trainingData)

     // Test on a positive example (spam) and a negative one (ham).

     // First apply the same HashingTF feature transformation used on the training data.

     val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))

     val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))

     // Now use the learned model to predict spam/ham for new emails.

     println(s"Prediction for positive test example: ${model.predict(posTestExample)}")

     println(s"Prediction for negative test example: ${model.predict(negTestExample)}")

     sc.stop()

   }

 }
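The feature step above relies on the hashing trick: each word is hashed into one of a fixed number of buckets, and the email becomes a vector of per-bucket counts. The following is a minimal plain-Scala sketch of that idea, not Spark's actual implementation; the object name `HashingSketch` and its helper functions are hypothetical and exist only for illustration.

```scala
object HashingSketch {
  // Map a term to a bucket in [0, numFeatures) using its hash code.
  // The correction keeps the index non-negative when hashCode is negative.
  def indexOf(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how many words of the text land in each bucket.
  // Only non-empty buckets are returned (a sparse representation).
  def termFrequencies(text: String, numFeatures: Int): Map[Int, Double] =
    text.split(" ")
      .groupBy(word => indexOf(word, numFeatures))
      .map { case (index, words) => index -> words.length.toDouble }

  def main(args: Array[String]): Unit = {
    // "cheap" appears twice, so its bucket holds a count of at least 2.0.
    val features = termFrequencies("Get cheap stuff cheap", numFeatures = 100)
    println(features)
  }
}
```

With only 100 buckets, different words can collide in the same bucket. That is the trade-off of the hashing trick: no vocabulary has to be built or stored, at the cost of some lost precision.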

spam.txt

Dear sir, I am a Prince in a far kingdom you have not heard of.  I want to send you money via wire transfer so please ...

Get Viagra real cheap!  Send money right away to ...

Oh my gosh you can be really strong too with these drugs found in the rainforest. Get them cheap right now ...

YOUR COMPUTER HAS BEEN INFECTED!  YOU MUST RESET YOUR PASSWORD.  Reply to this email with your password and SSN ...

THIS IS NOT A SCAM!  Send money and get access to awesome stuff really cheap and never have to ...

ham.txt

Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!  Check out videos of talks from the summit at ...

Hi Mom, Apologies for being late about emailing and forgetting to send you the package.  I hope you and bro have been ...

Wow, hey Fred, just heard about the Spark petabyte sort.  I think we need to take time to try it out immediately ...

Hi Spark user list, This is my first question to this list, so thanks in advance for your help!  I tried running ...

Thanks Tom for your email.  I need to refer you to Alice for this one.  I haven't yet figured out that part either ...

Good job yesterday!  I was attending your talk, and really enjoyed it.  I want to try out GraphX ...

Summit demo got whoops from audience!  Had to let you know. --Joe

Packaging the Scala program with Maven

├── pom.xml

├── README.md

├── src

│   └── main

│       └── scala

│           └── com

│               └── oreilly

│                   └── learningsparkexamples

│                       └── scala

│                           └── MLlib.scala

MLlib.scala is the Scala code listed above, and pom.xml is the configuration file Maven uses for the build:

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0"

   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>

    <groupId>my.demo</groupId>

    <artifactId>sparkdemo</artifactId>

    <version>1.0-SNAPSHOT</version>

    <properties>

  <!-- Java version used at compile time

  <maven.compiler.source>1.7</maven.compiler.source>

  <maven.compiler.target>1.7</maven.compiler.target>

  -->

  <encoding>UTF-8</encoding>

  <scala.tools.version>2.10</scala.tools.version>

  <!-- Put the Scala version of the cluster -->

  <scala.version>2.10.5</scala.version>

    </properties>

    <dependencies>

  <dependency> <!-- Spark dependency -->

      <groupId>org.apache.spark</groupId>

      <artifactId>spark-core_2.10</artifactId>

      <version>1.6.1</version>

      <scope>provided</scope>

  </dependency>

  <dependency> <!-- Spark dependency -->

      <groupId>org.apache.spark</groupId>

      <artifactId>spark-mllib_2.10</artifactId>

      <version>1.6.1</version>

      <scope>provided</scope>

  </dependency>

  <dependency>

      <groupId>org.scala-lang</groupId>

      <artifactId>scala-library</artifactId>

      <version>2.10.5</version>

  </dependency>

    </dependencies>

    <build>

  <pluginManagement>

      <plugins>

    <plugin>

        <!-- Compiles the Scala sources -->

        <groupId>net.alchim31.maven</groupId>

        <artifactId>scala-maven-plugin</artifactId>

        <version>3.1.5</version>

    </plugin>

       </plugins>

  </pluginManagement>

  <plugins>

      <plugin>

    <groupId>net.alchim31.maven</groupId>

    <artifactId>scala-maven-plugin</artifactId>

    <executions>

        <execution>

      <id>scala-compile-first</id>

      <phase>process-resources</phase>

      <goals>

          <goal>add-source</goal>

          <goal>compile</goal>

      </goals>

        </execution>

        <execution>

      <id>scala-test-compile</id>

      <phase>process-test-resources</phase>

      <goals>

          <goal>testCompile</goal>

      </goals>

        </execution>

    </executions>

      </plugin>

      </plugins>

    </build>

</project>

Here, the import

import org.apache.spark.{SparkConf, SparkContext}

requires this dependency:

  <dependency> <!-- Spark dependency -->

      <groupId>org.apache.spark</groupId>

      <artifactId>spark-core_2.10</artifactId>

      <version>1.6.1</version>

      <scope>provided</scope>

  </dependency>

Likewise, the imports

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

import org.apache.spark.mllib.feature.HashingTF

import org.apache.spark.mllib.regression.LabeledPoint

require this dependency:

 <dependency> <!-- Spark dependency -->

      <groupId>org.apache.spark</groupId>

      <artifactId>spark-mllib_2.10</artifactId>

      <version>1.6.1</version>

      <scope>provided</scope>

  </dependency>

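When writing the configuration, make sure the Spark and Scala versions match your installation. You can check both by starting spark-shell, whose startup banner prints the Spark version and the Scala version it was built with.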
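As a rough illustration of how the versions tie into the pom.xml, the sketch below derives the artifact suffix (for example `_2.10`) from a full Scala version string. The object `VersionCheck` and its helpers are hypothetical names introduced for this example; only `scala.util.Properties.versionNumberString` comes from the Scala standard library. Inside spark-shell, `sc.version` additionally reports the Spark version.

```scala
object VersionCheck {
  // Reduce a full Scala version such as "2.10.5" to its binary version "2.10".
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  // Build the artifactId Maven expects, e.g. "spark-mllib_2.10".
  def artifactId(module: String, scalaVersion: String): String =
    s"${module}_${binaryVersion(scalaVersion)}"

  def main(args: Array[String]): Unit = {
    // Version of the Scala runtime this code is running on.
    val scalaVersion = scala.util.Properties.versionNumberString
    println(s"Scala version: $scalaVersion")
    println(s"Matching artifactId: ${artifactId("spark-mllib", scalaVersion)}")
  }
}
```

The suffix of the `artifactId` and the `scala-library` version in the pom.xml should agree with what this reports on the cluster.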

Once the configuration is done, run the following in the directory that contains pom.xml:

mvn clean && mvn compile && mvn package

If Maven has trouble downloading dependencies, see section 3, "Build GitHub Spark Runnable Distribution", of this post: http://www.cnblogs.com/xiaoyesoso/p/5489822.html

Running the project with Spark

After Maven finishes compiling and packaging, a target folder appears in the directory that contains pom.xml:

├── target

│   ├── classes

│   │   └── com

│   │       └── oreilly

│   │           └── learningsparkexamples

│   │               └── scala

│   │                   ├── MLlib$$anonfun$1.class

│   │                   ├── MLlib$$anonfun$2.class

│   │                   ├── MLlib$$anonfun$3.class

│   │                   ├── MLlib$$anonfun$4.class

│   │                   ├── MLlib.class

│   │                   └── MLlib$.class

│   ├── classes.-475058802.timestamp

│   ├── maven-archiver

│   │   └── pom.properties

│   ├── maven-status

│   │   └── maven-compiler-plugin

│   │       └── compile

│   │           └── default-compile

│   │               ├── createdFiles.lst

│   │               └── inputFiles.lst

│   └── sparkdemo-1.0-SNAPSHOT.jar

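Finally, submit the job with spark-submit. Mind where the two text files are: the code loads them from the relative paths files/spam.txt and files/ham.txt. In the command below, ${package.name} and ${class.name} are placeholders rather than shell variables; for this example they are com.oreilly.learningsparkexamples.scala and MLlib:

${SPARK_HOME}/bin/spark-submit --class ${package.name}.${class.name} ${PROJECT_HOME}/target/*.jar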

Output:

caizhenwei@caizhenwei-Inspiron-:~/桌面/learning-spark$ vim mini-complete-example/src/main/scala/com/oreilly/learningsparkexamples/mini/scala/MLlib.scala

caizhenwei@caizhenwei-Inspiron-:~/桌面/learning-spark$ ../bin-spark-1.6./bin/spark-submit --class com.oreilly.learningsparkexamples.scala.MLlib ./mini-complete-example/target/sparkdemo-1.0-SNAPSHOT.jar

// :: WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

// :: WARN Utils: Your hostname, caizhenwei-Inspiron- resolves to a loopback address: 127.0.1.1; using 172.16.111.93 instead (on interface eth0)

// :: WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address

// :: WARN Utils: Service 'SparkUI' could not bind on port . Attempting port .

// :: WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS

// :: WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

Prediction for positive test example: 1.0

Prediction for negative test example: 0.0
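The 1.0 and 0.0 above are class labels, not probabilities. A logistic regression model computes a weighted sum of the features, passes it through the sigmoid function, and compares the result with a threshold (0.5 by default in MLlib). Below is a minimal plain-Scala sketch of that decision rule; the object `PredictSketch` and the weights are made up for illustration and are not taken from the trained model.

```scala
object PredictSketch {
  // Logistic (sigmoid) function: squashes any real number into (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Return 1.0 (spam) if the estimated probability exceeds the threshold, else 0.0 (ham).
  def predict(weights: Array[Double],
              intercept: Double,
              features: Array[Double],
              threshold: Double = 0.5): Double = {
    // Dot product of weights and features, plus the intercept.
    val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
    if (sigmoid(margin) > threshold) 1.0 else 0.0
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical weights: feature 0 hints at spam, feature 1 hints at ham.
    val weights = Array(2.0, -1.5)
    println(predict(weights, 0.0, Array(1.0, 0.0))) // sigmoid(2.0) is about 0.88, prints 1.0
    println(predict(weights, 0.0, Array(0.0, 1.0))) // sigmoid(-1.5) is about 0.18, prints 0.0
  }
}
```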
