Spark Machine Learning 9: Real-Time Machine Learning (Scala with sbt)
1 Online Learning
An online model updates itself incrementally as each new record arrives, rather than being retrained from scratch on the full dataset as in offline (batch) learning.
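To make the idea concrete, here is a minimal sketch (an illustration, not code from the book) of an online update: stochastic gradient descent for linear regression applied one example at a time, so the weights shift with every new observation. The function name and the stepSize value are assumptions for the sketch.
// Hypothetical online SGD update for linear regression (squared loss).
// Each incoming (x, y) example nudges the weights; no full retraining occurs.
def onlineUpdate(weights: Array[Double], x: Array[Double], y: Double, stepSize: Double): Array[Double] = {
  val prediction = weights.zip(x).map { case (w, xi) => w * xi }.sum
  val error = prediction - y
  // the gradient of the squared loss w.r.t. weight i is error * x(i)
  weights.zip(x).map { case (w, xi) => w - stepSize * error * xi }
}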
2 Spark Streaming
- Discretized streams (DStreams)
Input sources: Akka actors, message queues, Flume, Kafka, and more.
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Lineage: the chain of transformation and action operators applied to an RDD, which Spark replays to recompute lost partitions on failure; a short chain is sketched below.
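For illustration (not from the original notes), a short DStream chain, assuming an existing StreamingContext ssc like the ones created in the apps below:
// Each derived DStream records its parent and transformation -- this is its lineage.
val lines  = ssc.socketTextStream("localhost", 9999) // source
val words  = lines.flatMap(_.split(" "))             // derived from lines
val pairs  = words.map(word => (word, 1))            // derived from words
val counts = pairs.reduceByKey(_ + _)                // derived from pairs
counts.print()                                       // output operation triggers the chain each batch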
3 MLlib + Streaming Application
3.0 build.sbt
Depends on Spark MLlib and Spark Streaming:
name := "scala-spark-streaming-app"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.5.1"
To use a mirror repository in China, put the following in ~/.sbt/repositories:
[repositories]
local
osc: http://maven.oschina.net/content/groups/public/
typesafe: http://repo.typesafe.com/typesafe/ivy-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext], bootOnly
sonatype-oss-releases
maven-central
sonatype-oss-snapshots
3.1 Producing Messages
A producer that emits a random number of purchase events per second over a socket:
import java.io.PrintWriter
import java.net.ServerSocket
import scala.util.Random

object StreamingProducer {
def main(args: Array[String]) {
val random = new Random()
// Maximum number of events per second
val MaxEvents = 6
// Read the list of possible names
val namesResource = this.getClass.getResourceAsStream("/names.csv")
val names = scala.io.Source.fromInputStream(namesResource)
.getLines()
.toList
.head
.split(",")
.toSeq
// Generate a sequence of possible products
val products = Seq(
"iPhone Cover" -> 9.99,
"Headphones" -> 5.49,
"Samsung Galaxy Cover" -> 8.95,
"iPad Cover" -> 7.49
)
/** Generate a number of random product events */
def generateProductEvents(n: Int) = {
(1 to n).map { i =>
val (product, price) = products(random.nextInt(products.size))
val user = random.shuffle(names).head
(user, product, price)
}
}
// create a network producer
val listener = new ServerSocket(9999)
println("Listening on port: 9999")
while (true) {
val socket = listener.accept()
new Thread() {
override def run = {
println("Got client connected from: " + socket.getInetAddress)
val out = new PrintWriter(socket.getOutputStream(), true)
while (true) {
Thread.sleep(1000)
val num = random.nextInt(MaxEvents)
val productEvents = generateProductEvents(num)
productEvents.foreach{ event =>
out.write(event.productIterator.mkString(","))
out.write("\n")
}
out.flush()
println(s"Created $num events...")
}
socket.close()
}
}.start()
}
}
}
sbt run
Multiple main classes detected, select one to run:
[1] MonitoringStreamingModel
[2] SimpleStreamingApp
[3] SimpleStreamingModel
[4] StreamingAnalyticsApp
[5] StreamingModelProducer
[6] StreamingProducer
[7] StreamingStateApp
Enter number: 6
3.2 Printing Messages
With the producer from 3.1 running in another terminal, connect to it and print the first few elements of each batch:
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SimpleStreamingApp {
def main(args: Array[String]) {
val ssc = new StreamingContext("local[2]", "First Streaming App", Seconds(10))
val stream = ssc.socketTextStream("localhost", 9999)
// here we simply print out the first few elements of each batch
stream.print()
ssc.start()
ssc.awaitTermination()
}
}
sbt run
Enter number: 2
3.3 Streaming Analytics
Compute per-batch statistics (purchase count, unique users, total revenue, most popular product) with foreachRDD:
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingAnalyticsApp {
def main(args: Array[String]) {
val ssc = new StreamingContext("local[2]", "First Streaming App", Seconds(10))
val stream = ssc.socketTextStream("localhost", 9999)
// create stream of events from raw text elements
val events = stream.map { record =>
val event = record.split(",")
(event(0), event(1), event(2))
}
/*
We compute and print out stats for each batch.
Since each batch is an RDD, we call foreachRDD on the DStream, and apply the usual RDD functions
we used in Chapter 1.
*/
events.foreachRDD { (rdd, time) =>
val numPurchases = rdd.count()
val uniqueUsers = rdd.map { case (user, _, _) => user }.distinct().count()
val totalRevenue = rdd.map { case (_, _, price) => price.toDouble }.sum()
val productsByPopularity = rdd
.map { case (user, product, price) => (product, 1) }
.reduceByKey(_ + _)
.collect()
.sortBy(-_._2)
val mostPopular = productsByPopularity(0)
val formatter = new SimpleDateFormat
val dateStr = formatter.format(new Date(time.milliseconds))
println(s"== Batch start time: $dateStr ==")
println("Total purchases: " + numPurchases)
println("Unique users: " + uniqueUsers)
println("Total revenue: " + totalRevenue)
println("Most popular product: %s with %d purchases".format(mostPopular._1, mostPopular._2))
}
// start the context
ssc.start()
ssc.awaitTermination()
}
}
sbt run
Enter number: 4
3.4 Stateful Stream Processing
updateStateByKey maintains per-key state across batches; here it tracks each user's running purchase count and revenue:
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingStateApp {
import org.apache.spark.streaming.StreamingContext._
def updateState(prices: Seq[(String, Double)], currentTotal: Option[(Int, Double)]) = {
val currentRevenue = prices.map(_._2).sum
val currentNumberPurchases = prices.size
val state = currentTotal.getOrElse((0, 0.0))
Some((currentNumberPurchases + state._1, currentRevenue + state._2))
}
def main(args: Array[String]) {
val ssc = new StreamingContext("local[2]", "First Streaming App", Seconds(10))
// for stateful operations, we need to set a checkpoint location
ssc.checkpoint("/tmp/sparkstreaming/")
val stream = ssc.socketTextStream("localhost", 9999)
// create stream of events from raw text elements
val events = stream.map { record =>
val event = record.split(",")
(event(0), event(1), event(2).toDouble)
}
val users = events.map { case (user, product, price) => (user, (product, price)) }
val revenuePerUser = users.updateStateByKey(updateState)
revenuePerUser.print()
// start the context
ssc.start()
ssc.awaitTermination()
}
}
sbt run
Enter number: 7
4 Streaming Linear Regression
MLlib provides StreamingLinearRegressionWithSGD for online linear regression:
- trainOn: update the model on each incoming batch of labeled points
- predictOn: generate predictions for each incoming batch
Both are demonstrated in 4.2 below.
4.1 Stream Data Generator
Generates (label, features) pairs from a fixed random linear model and serves them over a socket:
import java.io.PrintWriter
import java.net.ServerSocket
import scala.util.Random

object StreamingModelProducer {
import breeze.linalg._
def main(args: Array[String]) {
// Maximum number of events per second
val MaxEvents = 100
val NumFeatures = 100
val random = new Random()
/** Function to generate a normally distributed dense vector */
def generateRandomArray(n: Int) = Array.tabulate(n)(_ => random.nextGaussian())
// Generate a fixed random model weight vector
val w = new DenseVector(generateRandomArray(NumFeatures))
val intercept = random.nextGaussian() * 10
/** Generate a number of random data points from the fixed linear model */
def generateNoisyData(n: Int) = {
(1 to n).map { i =>
val x = new DenseVector(generateRandomArray(NumFeatures))
val y: Double = w.dot(x)
val noisy = y + intercept //+ 0.1 * random.nextGaussian()
(noisy, x)
}
}
// create a network producer
val listener = new ServerSocket(9999)
println("Listening on port: 9999")
while (true) {
val socket = listener.accept()
new Thread() {
override def run = {
println("Got client connected from: " + socket.getInetAddress)
val out = new PrintWriter(socket.getOutputStream(), true)
while (true) {
Thread.sleep(1000)
val num = random.nextInt(MaxEvents)
val data = generateNoisyData(num)
data.foreach { case (y, x) =>
val xStr = x.data.mkString(",")
val eventStr = s"$y\t$xStr"
out.write(eventStr)
out.write("\n")
}
out.flush()
println(s"Created $num events...")
}
socket.close()
}
}.start()
}
}
}
sbt run
Enter number: 5
4.2 Streaming Regression Model
Train a StreamingLinearRegressionWithSGD model on the labeled stream:
import breeze.linalg.DenseVector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object SimpleStreamingModel {
def main(args: Array[String]) {
val ssc = new StreamingContext("local[2]", "First Streaming App", Seconds(10))
val stream = ssc.socketTextStream("localhost", 9999)
val NumFeatures = 100
val zeroVector = DenseVector.zeros[Double](NumFeatures)
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.dense(zeroVector.data))
.setNumIterations(1)
.setStepSize(0.01)
// create a stream of labeled points
val labeledStream: DStream[LabeledPoint] = stream.map { event =>
val split = event.split("\t")
val y = split(0).toDouble
val features: Array[Double] = split(1).split(",").map(_.toDouble)
LabeledPoint(label = y, features = Vectors.dense(features))
}
// train and test model on the stream, and print predictions for illustrative purposes
model.trainOn(labeledStream)
// model.predictOn(labeledStream.map(_.features)).print() // predictOn expects a DStream[Vector]
ssc.start()
ssc.awaitTermination()
}
}
sbt run
Enter number: 3
5 Streaming K-Means
- K-means clustering: StreamingKMeans; a minimal usage sketch follows.
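The original notes stop at the class name, so the following is a minimal sketch rather than code from the book. It assumes a socket source emitting comma-separated feature vectors; k = 3, the decay factor, and the 100-dimensional random initial centers are illustrative choices.
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKMeansApp {
  def main(args: Array[String]) {
    val ssc = new StreamingContext("local[2]", "Streaming KMeans App", Seconds(10))
    val stream = ssc.socketTextStream("localhost", 9999)
    // parse each comma-separated record into an MLlib vector
    val points = stream.map(record => Vectors.dense(record.split(",").map(_.toDouble)))
    val model = new StreamingKMeans()
      .setK(3)                      // illustrative cluster count
      .setDecayFactor(1.0)          // 1.0 = all batches weighted equally
      .setRandomCenters(100, 0.0)   // 100 features, zero initial weight
    model.trainOn(points)           // update cluster centers on each batch
    model.predictOn(points).print() // print cluster assignments per batch
    ssc.start()
    ssc.awaitTermination()
  }
}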
6 Evaluation
Train two models with different step sizes (0.01 vs. 1.0) on the same stream and compare their per-batch MSE and RMSE:
import breeze.linalg.DenseVector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MonitoringStreamingModel {
def main(args: Array[String]) {
val ssc = new StreamingContext("local[2]", "First Streaming App", Seconds(10))
val stream = ssc.socketTextStream("localhost", 9999)
val NumFeatures = 100
val zeroVector = DenseVector.zeros[Double](NumFeatures)
val model1 = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.dense(zeroVector.data))
.setNumIterations(1)
.setStepSize(0.01)
val model2 = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.dense(zeroVector.data))
.setNumIterations(1)
.setStepSize(1.0)
// create a stream of labeled points
val labeledStream = stream.map { event =>
val split = event.split("\t")
val y = split(0).toDouble
val features = split(1).split(",").map(_.toDouble)
LabeledPoint(label = y, features = Vectors.dense(features))
}
// train both models on the same stream
model1.trainOn(labeledStream)
model2.trainOn(labeledStream)
// use transform to create a stream with model error rates
val predsAndTrue = labeledStream.transform { rdd =>
val latest1 = model1.latestModel()
val latest2 = model2.latestModel()
rdd.map { point =>
val pred1 = latest1.predict(point.features)
val pred2 = latest2.predict(point.features)
(pred1 - point.label, pred2 - point.label)
}
}
// print out the MSE and RMSE metrics for each model per batch
predsAndTrue.foreachRDD { (rdd, time) =>
val mse1 = rdd.map { case (err1, err2) => err1 * err1 }.mean()
val rmse1 = math.sqrt(mse1)
val mse2 = rdd.map { case (err1, err2) => err2 * err2 }.mean()
val rmse2 = math.sqrt(mse2)
println(
s"""
|-------------------------------------------
|Time: $time
|-------------------------------------------
""".stripMargin)
println(s"MSE current batch: Model 1: $mse1; Model 2: $mse2")
println(s"RMSE current batch: Model 1: $rmse1; Model 2: $rmse2")
println("...\n")
}
ssc.start()
ssc.awaitTermination()
}
}
sbt run
Enter number: 1