Spark is a compelling multi-purpose platform for use cases that span investigative, as well as operational, analytics.

Data science is a broad church. I am a data scientist — or so I’ve been told — but what I do is actually quite different from what other “data scientists” do. For example, there are those practicing “investigative analytics” and those implementing “operational analytics.” (I’m in the second camp.)

Data scientists performing investigative analytics use interactive statistical environments like R to perform ad-hoc, exploratory analytics in order to answer questions and gain insights. By contrast, data scientists building operational analytics systems have more in common with engineers. They build software that creates and queries machine-learning models that operate at scale in real-time serving environments, using systems languages like C++ and Java, and often use several elements of an enterprise data hub, including the Apache Hadoop ecosystem.

And there are subgroups within these groups of data scientists. For example, some analysts who are proficient with R have never heard of Python or scikit-learn, or vice versa, even though both provide libraries of statistical functions that are accessible from a REPL (Read-Evaluate-Print Loop) environment.

A World of Tradeoffs

It would be wonderful to have one tool for everyone, and one architecture and language for investigative as well as operational analytics. If I primarily work in Java, should I really need to know a language like Python or R in order to be effective at exploring data? Coming from a conventional data analyst background, must I understand MapReduce in order to scale up computations? The array of tools available to data scientists tells a story of unfortunate tradeoffs:

  • R offers a rich environment for statistical analysis and machine learning, but it has some rough edges when performing many of the data processing and cleanup tasks that are required before the real analysis work can begin. As a language, it’s not similar to the mainstream languages developers know.
  • Python is a general purpose programming language with excellent libraries for data analysis like Pandas and scikit-learn. But like R, it’s still limited to working with an amount of data that can fit on one machine.
  • It’s possible to develop distributed machine learning algorithms on the classic MapReduce computation framework in Hadoop (see Apache Mahout). But MapReduce is notoriously low-level and difficult to express complex computations in.
  • Apache Crunch offers a simpler, idiomatic Java API for expressing MapReduce computations. But still, the nature of MapReduce makes it inefficient for iterative computations, and most machine learning algorithms have an iterative component.

And so on. There are both gaps and overlaps between these and other data science tools. Coming from a background in Java and Hadoop, I do wonder with envy sometimes: why can’t we have a nice REPL-like investigative analytics environment like the Python and R users have? That’s still scalable and distributed? And has the nice distributed-collection design of Crunch? And can equally be used in operational contexts?

Common Ground in Spark

These are the desires that make me excited about Apache Spark. Discussion of Spark for data science has mostly noted its ability to keep data resident in memory, which can speed up iterative machine-learning workloads compared to MapReduce, but to me that is perhaps not even the big news. Spark does not solve every problem for everyone. However, it has a number of features that make it a compelling crossover platform for investigative as well as operational analytics:

  • Spark comes with a machine-learning library, MLlib, albeit bare bones so far.
  • Being Scala-based, Spark embeds in any JVM-based operational system, but can also be used interactively in a REPL in a way that will feel familiar to R and Python users.
  • For Java programmers, Scala still presents a learning curve. But at least, any Java library can be used from within Scala.
  • Spark’s RDD (Resilient Distributed Dataset) abstraction resembles Crunch’s PCollection, which has proved a useful abstraction in Hadoop that will already be familiar to Crunch developers. (Crunch can even be used on top of Spark.)
  • Spark imitates Scala’s collections API and functional style, which is a boon to Java and Scala developers, but also somewhat familiar to developers coming from Python; a short sketch after this list shows the parallel. Scala is also a compelling choice for statistical computing.
  • Spark itself, and Scala underneath it, are not specific to machine learning. They provide APIs supporting related tasks, like data access, ETL, and integration. As with Python, the entire data science pipeline can be implemented within this paradigm, not just the model fitting and analysis.
  • Code that is implemented in the REPL environment can be used mostly as-is in an operational context.
  • Data operations are transparently distributed across the cluster, even as you type.
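
As a minimal sketch of that parallel, the same filter-and-map reads nearly identically over a local Scala collection and over an RDD (sc is the SparkContext that the Spark shell provides):

// A local Scala collection: keep the even numbers and square them
val localSquares = (1 to 10).filter(_ % 2 == 0).map(n => n * n)

// The same operations on an RDD, executed across the cluster
val distributedSquares = sc.parallelize(1 to 10).filter(_ % 2 == 0).map(n => n * n)
distributedSquares.collect()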

Spark, and MLlib in particular, still has a lot of growing to do. For example, the project needs optimizations, fixes, and deeper integration with YARN. It doesn’t yet provide nearly the depth of library functions that conventional data analysis tools do. But as a best-of-most-worlds platform, it is already sufficiently interesting for a data scientist of any denomination to look at seriously.

In Action: Tagging Stack Overflow Questions

A complete example will give a sense of using Spark as an environment for transforming data and building models on Hadoop. The following example uses a dump of data from the popular Stack Overflow Q&A site. On Stack Overflow, developers can ask and answer questions about software. Questions can be tagged with short strings like “java” or “sql”. This example will build a model that can suggest new tags to questions based on existing tags, using the alternating least squares (ALS) recommender algorithm; questions are “users” and tags are “items”.

Getting the Data

Stack Exchange provides complete dumps of all data, most recently from January 20, 2014. The data is provided as a torrent containing different types of data from Stack Overflow and many sister sites. Only the file stackoverflow.com-Posts.7z needs to be downloaded from the torrent.

Despite its .7z extension, this file is just a bzip2-compressed file. Spark, like Hadoop, can directly read and split some compressed files, but in this case it is necessary to uncompress a copy onto HDFS. In one step, that’s:

bzcat stackoverflow.com-Posts.7z | hdfs dfs -put - /user/srowen/Posts.xml

Uncompressed, it consumes about 24.4GB, and contains about 18 million posts, of which 2.1 million are questions. These questions have about 9.3 million tags from approximately 34,000 unique tags.

Set Up Spark

Given that Spark’s integration with Hadoop is relatively new, it can be time-consuming to get it working manually. Fortunately, CDH hides that complexity by integrating Spark and managing setup of its processes. Spark can be installed separately with CDH 4.6.0, and is included in CDH 5 Beta 2. This example uses an installation of CDH 5 Beta 2.

This example uses MLlib, which uses the jblas library for linear algebra, which in turn calls native code using LAPACK and Fortran. At the moment, it is necessary to manually install the Fortran library dependency to enable this. The package is called libgfortran or libgfortran3, and should be available from the standard package manager of major Linux distributions. For example, for RHEL 6, install it with:

sudo yum install libgfortran

This must be installed on all machines that have been designated as Spark workers.

Log in to the machine designated as the Spark master with ssh. It will be necessary, at the moment, to ask Spark to let its workers use a large amount of memory. The code in MLlib that is used in this example, in version 0.9.0, has a memory issue, one that is already fixed for the next release. To configure for more memory and launch the shell:

export SPARK_JAVA_OPTS="-Dspark.executor.memory=8g"
spark-shell

Interactive Processing in the Shell

The shell is the Scala REPL. It’s possible to execute lines of code, define methods, and in general access any Scala or Spark functionality in this environment, one line at a time. You can paste the following steps into the REPL, one by one.

First, get a handle on the Posts.xml file:

val postsXML = sc.textFile("hdfs:///user/srowen/Posts.xml")

In response, the REPL will print:

postsXML: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

The text file is an RDD (Resilient Distributed Dataset) of Strings, which are the lines of the file. You can query it by calling methods of the RDD class. For example, to count the lines:

postsXML.count

This command yields a great deal of output from Spark as it counts lines in a distributed way, and finally prints 18066983.

The next snippet transforms the lines of the XML file into a collection of (questionID,tag) tuples. This demonstrates Scala’s functional programming style, and other quirks. (Explaining them is out of scope here.) RDDs behave like Scala collections, and expose many of the same methods, like map:
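
A minimal sketch of such a transformation, assuming that each question row carries Id and Tags XML attributes, that individual tags appear HTML-escaped in the dump (like &lt;java&gt;), and that only questions with at least four tags are kept, might look like this:

val postIDTags = postsXML.flatMap { line =>
  // Pull out the Id and Tags attributes of a row, if both are present
  val idTagsRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  // Individual tags appear HTML-escaped, like &lt;java&gt;
  val tagRegex = "&lt;([^&]+)&gt;".r
  idTagsRegex.findFirstMatchIn(line) match {
    case None => Seq.empty
    case Some(m) =>
      val questionID = m.group(1).toInt
      val tags = tagRegex.findAllMatchIn(m.group(2)).map(_.group(1)).toList
      // Keep only questions with at least four tags
      if (tags.size >= 4) tags.map(tag => (questionID, tag)) else Seq.empty
  }
}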

(You can copy the source for the above from here.)

You will notice that this returns immediately, unlike previously. So far, nothing requires Spark to actually perform this transformation. It is possible to force Spark to perform the computation by, for example, calling a method like count. Or Spark can be told to compute and persist the result through checkpointing, for example.
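
As a minimal sketch, using cache as a simple way to keep the computed result around (rather than full checkpointing), either of the following would do that:

// An action like count forces Spark to run the transformation now
postIDTags.count

// cache marks the RDD to be kept in memory; it is computed and persisted
// the next time an action runs over it
postIDTags.cache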

The MLlib implementation of ALS operates on numeric IDs, not strings. The tags (“items”) in this data set are strings. It will be sufficient here to hash tags to a nonnegative integer value, use the integer values for the computation, and then use a reverse mapping to translate back to tag strings later. Here, a hash function is defined since it will be reused shortly:

def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
var tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag), tag))

Now, you can convert the tuples from before into the format that the ALS implementation expects, and the model can be computed:

import org.apache.spark.mllib.recommendation._
// Convert to Rating(Int,Int,Double) objects
val alsInput = postIDTags.map(t => Rating(t._1, nnHash(t._2), 1.0))
// Train model with 40 features, 10 iterations of ALS
val model = ALS.trainImplicit(alsInput, 40, 10)

This will take minutes or more, depending on the size of your cluster, and will spew a large amount of output from the workers. Take a moment to open the Spark master web UI, which can be reached from Cloudera Manager and runs by default at http://[master]:18080. There will be one running application. Click through, then click “Application Detail UI”. In this view it’s possible to monitor Spark’s distributed execution of lines of code in ALS.scala.

When it is complete, a factored matrix model is available in Spark. It can be used to predict question-tag associations by “recommending” tags to questions. At this early stage of MLlib’s life, there is not yet a proper recommend method that would give suggested tags for a question. However, it is easy to define one:

def recommend(questionID: Int, howMany: Int = 5): Array[(String, Double)] = {
  // Build list of one question and all items and predict value for all of them
  val predictions = model.predict(tagHashes.map(t => (questionID,t._1)))
  // Get top howMany recommendations ordered by prediction value
  val topN = predictions.top(howMany)(Ordering.by[Rating,Double](_.rating))
  // Translate back to tags from IDs
  topN.map(r => (tagHashes.lookup(r.product)(0), r.rating))
}

And to call it, pick any question with at least four tags, like “How to make substring-matching query work fast on a large table?” and get its ID from the URL. Here, that’s 7122697:

recommend(7122697).foreach(println)

This method will take a minute or more to complete, which is slow. The lookups in the last line are quite expensive since each requires a distributed search. It would be somewhat faster if this mapping were available in memory. It’s possible to tell Spark to do this:

tagHashes = tagHashes.cache

Because Scala closures capture variables rather than their values (and tagHashes was declared as a var), this reassignment does in fact affect the RDD used inside the recommend method just defined. Run the method call again and it will return faster. The result in both cases will be something similar to the following:

(sql,0.17745152481166354)
(database,0.13526622226672633)
(oracle,0.1079428707621154)
(ruby-on-rails,0.06067207312463499)
(postgresql,0.050933613169706474)

(Your result will not be identical, since ALS starts from a random solution and iterates.) The original question was tagged “postgresql”, “query-optimization”, “substring”, and “text-search”. It’s reasonable that the question might also be tagged “sql” and “database”. “oracle” makes sense in the context of questions about optimization and text search, and “ruby-on-rails” often comes up with PostgreSQL, even though these tags are not in fact related to this particular question.

Something for Everyone

Of course, this example could be more efficient and more general. But for the practicing data scientists out there — whether you came in as an R analyst, Python hacker, or Hadoop developer — hopefully you saw something familiar in different elements of the example, and have discovered a way to use Spark to access some benefits that the other tribes take for granted.

Learn more about Spark’s role in an EDH, and join the discussion in our brand-new Spark forum.

Sean is Director of Data Science for EMEA at Cloudera, helping customers build large-scale machine learning solutions on Hadoop. Previously, Sean founded Myrrix Ltd, producing a real-time recommender and clustering product evolved from Apache Mahout. Sean was primary author of recommender components in Mahout, and has been an active committer and PMC member for the project. He is co-author of Mahout in Action.

-----

FWD from post: http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
