Spark(3) - External Data Source
Introduction
Spark provides a unified runtime for big data. HDFS, Hadoop's filesystem, is the most commonly used storage platform for Spark, as it provides cost-effective storage for unstructured and semi-structured data on commodity hardware. Spark is not limited to HDFS, however, and can work with any Hadoop-supported storage.
Hadoop-supported storage means storage that works with Hadoop's InputFormat and OutputFormat interfaces. InputFormat is responsible for creating InputSplits from input data and dividing each split further into records. OutputFormat is responsible for writing records to storage.
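As a quick illustration, the built-in TextInputFormat can be driven explicitly through the new Hadoop API. This is a minimal sketch, not part of the recipes below; it assumes the same HDFS path and localhost:9000 namenode used later in this post:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
// read through the new-API TextInputFormat: each record is (byte offset, line)
val lines = sc.newAPIHadoopFile(
  "hdfs://localhost:9000/user/hduser/words",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])
// keep only the line text; sc.textFile does exactly this for you
lines.map(_._2.toString).take(5).foreach(println)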
Loading data from the local filesystem
Though the local filesystem is not a good fit for storing big data, due to disk size limits and the lack of a distributed nature, technically you can load data into a distributed system from the local filesystem. The file or directory you are accessing, however, has to be available on each node.
1. create the words directory
mkdir words
2. get into the words directory
cd words
3. create the sh.txt file
echo "to be or not to be" > sh.txt
4. start the spark shell
spark-shell
5. load the words directory as RDD
scala> val words = sc.textFile("file:///home/hduser/words")
6. count the number of lines
scala> words.count
7. divide the line (or lines) into multiple words
scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
8. convert word to (word,1)
scala> val wordsMap = wordsFlatMap.map( w => (w,1))
9. add the number of occurrences for each word
scala> val wordCount = wordsMap.reduceByKey( (a,b) => (a+b))
10. print the RDD
scala> wordCount.collect.foreach(println)
11. doing all in one step
scala> sc.textFile("file:///home/hduser/ words"). flatMap(_.split("\\W+")).map( w => (w,1)). reduceByKey( (a,b) => (a+b)).foreach(println)
Loading data from HDFS
HDFS is the most widely used big data storage system. One of the reasons for its wide adoption is schema-on-read: HDFS does not put any restrictions on data when it is being written, so any and all kinds of data are welcome and can be stored in raw form. This makes it an ideal store for raw unstructured and semi-structured data.
1. create the words directory
mkdir words
2. get into the words directory
cd words
3. create the sh.txt file and upload the words directory to HDFS
echo "to be or not to be" > sh.txt
cd ..
hdfs dfs -put words /user/hduser/words
4. start the spark shell
spark-shell
5. load the words directory as RDD
scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")
6. count the number of lines
scala> words.count
7. divide the line (or lines) into multiple words
scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
8. convert word to (word,1)
scala> val wordsMap = wordsFlatMap.map( w => (w,1))
9. add the number of occurrences for each word
scala> val wordCount = wordsMap.reduceByKey( (a,b) => (a+b))
10. print the RDD
scala> wordCount.collect.foreach(println)
11. doing all in one step
scala> sc.textFile("file:///home/hduser/ words"). flatMap(_.split("\\W+")).map( w => (w,1)). reduceByKey( (a,b) => (a+b)).foreach(println)
Loading data from HDFS using a custom InputFormat
Sometimes you need to load data in a specific format and TextInputFormat is not a good fit for that. Spark provides two methods for this purpose:
1. sparkContext.hadoopFile: This supports the old MapReduce API
2. sparkContext.newAPIHadoopFile: This supports the new MapReduce API
These two methods provide support for all of Hadoop's built-in InputFormat implementations as well as any custom InputFormat.
1. create the currency directory
mkdir currency
2. get into the currency directory
cd currency
3. create the na.txt file (the country and currency columns should be separated by a tab, since KeyValueTextInputFormat splits each record into a key and a value at the first tab) and upload the currency folder to HDFS
vi na.txt
United States of America US Dollar
Canada Canadian Dollar
Mexico Peso
cd ..
hdfs dfs -put currency /user/hduser/currency
4. start the spark shell and import statements
spark-shell
scala> import org.apache.hadoop.io.Text
scala> import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
5. load the currency directory as an RDD and convert it from a tuple of (Text, Text) to a tuple of (String, String)
scala> val currencyFile = sc.newAPIHadoopFile("hdfs://localhost:9000/user/hduser/currency", classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
scala> val currencyRDD = currencyFile.map(t => (t._1.toString, t._2.toString))
6. count the number of elements in the RDD
scala> currencyRDD.count
7. print the RDD
scala> currencyRDD.collect.foreach(println)
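The recipe above goes through the new API. For the older org.apache.hadoop.mapred API, sparkContext.hadoopFile plays the same role; a minimal sketch, assuming the same currency directory:
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.KeyValueTextInputFormat
// old-API route: same InputFormat name, but from the mapred package
val currencyOld = sc.hadoopFile(
  "hdfs://localhost:9000/user/hduser/currency",
  classOf[KeyValueTextInputFormat],
  classOf[Text],
  classOf[Text])
currencyOld.map { case (k, v) => (k.toString, v.toString) }.collect.foreach(println)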
Loading data from Amazon S3
Amazon Simple Storage Service (S3) provides developers and IT teams with a secure, durable, and scalable storage platform. The biggest advantage of Amazon S3 is that there is no up-front IT investment, and companies can build capacity as they need it, just by clicking a button.
Though Amazon S3 can be used with any compute platform, it integrates really well with Amazon's cloud services such as Amazon Elastic Compute Cloud (EC2) and Amazon Elastic Block Store (EBS). For this reason, companies that use Amazon Web Services (AWS) are likely to have a significant amount of data already stored on Amazon S3.
1. go to http://aws.amazon.com and log in with username and password
2. navigate to Storage & Content Delivery | S3 | Create Bucket
3. enter the bucket name - for example, com.infoobjects.wordcount
4. select Region, click on Create
5. click on Create Folder and enter words as the folder name
6. create sh.txt file on the local system
echo "to be or not to be" > sh.txt
7. navigate to Words | Upload | Add Files and choose sh.txt from the dialog box
8. click on Start Upload
9. select sh.txt and click on Properties
10. set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables
11. open the spark shell and load the words directory from s3 in the words RDD
scala> val words = sc.textFile("s3n://com.infoobjects.wordcount/words")
Loading data from Apache Cassandra
Apache Cassandra is a NoSQL database with a masterless ring cluster structure. While HDFS is a good fit for streaming data access, it does not work well with random access. For example, HDFS works well when your average file size is 100 MB and you want to read the whole file; if you frequently access the nth line in a file, or some other part of it as a record, HDFS is too slow.
Relational databases have traditionally provided a solution to this, offering low-latency random access, but they do not work well with big data. NoSQL databases such as Cassandra fill the gap by providing relational database-style access, but in a distributed architecture on commodity servers.
1. create a keyspace named people in Cassandra using the CQL shell
cqlsh> CREATE KEYSPACE people WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
2. create a column family (from CQL 3.0 onwards, it can also be called a table) named person
cqlsh> create columnfamily person(id int primary key, first_name varchar, last_name varchar);
3. insert a few records in the column family
cqlsh> insert into person(id,first_name,last_name) values(1,'Barack','Obama');
cqlsh> insert into person(id,first_name,last_name) values(2,'Joe','Smith');
4. add the Cassandra connector dependency to SBT
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0"
5. alternatively, add the Cassandra connector dependency to Maven
<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector_2.10</artifactId>
  <version>1.2.0</version>
</dependency>
6. start the spark shell with the Cassandra connector on the classpath (for example, the uber JAR built in the last section of this post)
spark-shell --jars thirdparty/sc-uber.jar
7. set the spark.cassandra.connection.host property; it must be in place before the SparkContext is created, so pass it when launching the shell (or put it in conf/spark-defaults.conf) rather than calling sc.getConf.set inside the shell, which has no effect
spark-shell --jars thirdparty/sc-uber.jar --conf spark.cassandra.connection.host=localhost
8. import Cassandra-specific libraries
scala> import com.datastax.spark.connector._
9. load the person column family as an RDD
scala> val personRDD = sc.cassandraTable("people", "person")
10. count the number of lines
scala> personRDD.count
11. print the RDD
scala> personRDD.collect.foreach(println)
12. retrieve the first row
scala> val firstRow = personRDD.first
13. get the column names
scala> firstRow.columnNames
14. access Cassandra through Spark SQL
scala> val cc = new org.apache.spark.sql.cassandra.CassandraSQLContext(sc)
15. load the person data as SchemaRDD
scala> val p = cc.sql("select * from people.person")
16. print the person data
scala> p.collect.foreach(println)
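The connector also works in the other direction. A hedged sketch of writing back to the same table, assuming the import from step 8 is in scope; the sample row is made up:
// build a tiny RDD and write it to the people.person table
val newPeople = sc.parallelize(Seq((3, "Jane", "Doe")))
newPeople.saveToCassandra("people", "person", SomeColumns("id", "first_name", "last_name"))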
Creating uber JARs with the sbt-assembly plugin
1. create the uber directory
mkdir uber
2. get into the uber directory
cd uber
3. open the SBT prompt
sbt
4. give the project a name sc-uber, save the session and exit
> set name := "sc-uber"
> session save
> exit
5. add the spark-cassandra-connector dependency and a merge strategy to build.sbt
vi build.sbt
name := "sc-uber"

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) =>
    // discard the metadata (manifests, signature files, and so on) that dependency JARs carry under META-INF
    (xs map {_.toLowerCase}) match {
      case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) => MergeStrategy.discard
      case _ => MergeStrategy.discard
    }
  // for every other conflicting path, keep the first copy found
  case _ => MergeStrategy.first
}
6. create plugins.sbt in the project folder
vi project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
7. build a JAR
sbt assembly
The uber JAR is now created in target/scala-2.10/sc-uber-assembly-0.1-SNAPSHOT.jar.
8. rename the JAR file and move it to a thirdparty folder
mkdir -p thirdparty
mv target/scala-2.10/sc-uber-assembly-0.1-SNAPSHOT.jar thirdparty/sc-uber.jar
9. load the spark shell with the uber JAR
spark-shell --jars thirdparty/sc-uber.jar
10. call spark-submit with the --jars option to make the connector available when submitting Scala code to a cluster
spark-submit --jars thirdparty/sc-uber.jar
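In practice spark-submit also needs the application JAR and its main class. A hedged sketch of a fuller invocation; the class name and application JAR below are made-up placeholders:
spark-submit --master local[*] --class com.example.WordCount --jars thirdparty/sc-uber.jar target/scala-2.10/myapp_2.10-0.1.jar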