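How to read and write Cassandra data in Spark ---- Distributed computing framework Spark, study notes part 6

My preprocessed data is all stored in Cassandra, so to analyze it with Spark I need to read the data from Cassandra and write the analysis results back to Cassandra as well. That means working out how Spark reads and writes Cassandra.

(This word is tiring to type, by the way.) Although I say Spark, what really matters is whether your development language has a matching driver.

Because Cassandra is DataStax's flagship product, the company also provides the matching driver for Spark, spark-cassandra-connector.

I followed its demo and tried it out in Scala.

1. The code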

//CassandraTest.scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector.cql.CassandraConnector

object CassandraTestApp {

  def main(args: Array[String]): Unit = {
    // Configure the Spark and Cassandra IPs; both are the local machine here
    val SparkMasterHost = "127.0.0.1"
    val CassandraHost = "127.0.0.1"

    // Tell Spark the address of one Cassandra node.
    // Note: the spark.cleaner.ttl value was lost from the original post;
    // "3600" (seconds) is an assumed placeholder.
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", CassandraHost)
      .set("spark.cleaner.ttl", "3600")
      .setMaster("local[12]")
      .setAppName("CassandraTestApp")

    // Connect to the Spark cluster:
    lazy val sc = new SparkContext(conf)

    // Setup script: these statements run as soon as the connection is made
    CassandraConnector(conf).withSessionDo { session =>
      session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
      session.execute("CREATE TABLE IF NOT EXISTS test.key_value (key INT PRIMARY KEY, value VARCHAR)")
      session.execute("TRUNCATE test.key_value")
      session.execute("INSERT INTO test.key_value(key, value) VALUES (1, 'first row')")
      session.execute("INSERT INTO test.key_value(key, value) VALUES (2, 'second row')")
      session.execute("INSERT INTO test.key_value(key, value) VALUES (3, 'third row')")
    }

    // Load the connector (adds cassandraTable and saveToCassandra)
    import com.datastax.spark.connector._

    // Read table test.key_value and print its contents:
    val rdd = sc.cassandraTable("test", "key_value").select("key", "value")
    rdd.collect().foreach(row => println(s"Existing Data: $row"))

    // Write two new rows to the test.key_value table:
    val col = sc.parallelize(Seq((4, "fourth row"), (5, "fifth row")))
    col.saveToCassandra("test", "key_value", SomeColumns("key", "value"))

    // Check that the collection holds the two new rows, then print them:
    assert(col.collect().length == 2)
    col.collect().foreach(row => println(s"New Data: $row"))

    println(s"Work completed, stopping the Spark context.")
    sc.stop()
  }
}
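
The program above reads rows as generic CassandraRow objects. As a variation, here is a minimal sketch of reading the same table into a case class and filtering on the primary key. It is not part of the demo I followed: KeyValue, TypedReadSketch and printRows are hypothetical names, and it assumes the connector's cassandraTable[T] and where(...) calls behave as I understand them, with Spark and a local Cassandra holding test.key_value already running.

```scala
import org.apache.spark.SparkContext
import com.datastax.spark.connector._

// Hypothetical case class mirroring the columns of test.key_value
case class KeyValue(key: Int, value: String)

object TypedReadSketch {

  // Assumes `sc` was built from a SparkConf with
  // spark.cassandra.connection.host set, as in CassandraTestApp
  def printRows(sc: SparkContext): Unit = {
    // Map each Cassandra row onto the case class by column name
    val typed = sc.cassandraTable[KeyValue]("test", "key_value")
    typed.collect().foreach(kv => println(s"Typed row: ${kv.key} -> ${kv.value}"))

    // Push a primary-key filter down to Cassandra instead of filtering in Spark
    val one = sc.cassandraTable("test", "key_value").where("key = ?", 1)
    one.collect().foreach(row => println(s"Filtered row: ${row.getString("value")}"))
  }
}
```

The case class is declared at the top level of the file rather than inside a method, which should keep the connector's reflection-based column mapping straightforward.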

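2. Directory structure

Because the build tool is sbt, the directory layout has to follow sbt conventions: mainly build.sbt and the src directory. The other directories are generated automatically.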

qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $ll
total
drwxr-xr-x    qpzhang  staff     : ./
drwxr-xr-x   qpzhang  staff     : ../
-rw-r--r--   1 qpzhang  staff  460 11 26 10:11 build.sbt
drwxr-xr-x    qpzhang  staff     : project/
drwxr-xr-x   3 qpzhang  staff  102 11 25 17:32 src/
drwxr-xr-x    qpzhang  staff     : target/

qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $tree src/
src/
└── main
    └── scala
        └── CassandraTest.scala

qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $cat build.sbt

name := "CassandraTest"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2"

assemblyMergeStrategy in assembly := {
  case PathList(ps @ _*) if ps.last endsWith ".properties" => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

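Note that the sbt I installed was the latest version at the time, 0.13, with the assembly plugin added. (One complaint about sbt: it downloads piles of jars, so it is best to have a proxy that gets past the firewall, otherwise the downloads take a very long time.)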

qpzhang@qpzhangdeMac-mini:~/scala_code/CassandraTest $cat ~/.sbt/0.13/plugins/plugins.sbt

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.5.0")

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
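
The plugins above are installed globally under ~/.sbt, which is what I did. A possible alternative, shown only as a sketch and not something this project uses, is to declare the assembly plugin inside the project in a file named project/plugins.sbt, so the build does not depend on each machine's global sbt setup. The version is simply copied from the global file above.

```scala
// project/plugins.sbt (hypothetical per-project location; same plugin line as the global file)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
```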

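3. Compiling and packaging with sbt

In the directory that contains build.sbt, start sbt with the sbt command.

Then use the compile command to compile and the assembly command to package.

Along the way I ran into the sbt-assembly deduplicate error, which is why build.sbt above sets assemblyMergeStrategy.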

> compile
[success] Total time:  s, completed -- ::

> assembly
[info] Including from cache: slf4j-api-1.7..jar
[info] Including from cache: metrics-core-3.0..jar
[info] Including from cache: netty-codec-4.0..Final.jar
[info] Including from cache: netty-handler-4.0..Final.jar
[info] Including from cache: netty-common-4.0..Final.jar
[info] Including from cache: joda-time-2.3.jar
[info] Including from cache: netty-buffer-4.0..Final.jar
[info] Including from cache: commons-lang3-3.3..jar
[info] Including from cache: jsr166e-1.1..jar
[info] Including from cache: cassandra-clientutil-2.1..jar
[info] Including from cache: joda-convert-1.2.jar
[info] Including from cache: netty-transport-4.0..Final.jar
[info] Including from cache: guava-16.0..jar
[info] Including from cache: spark-cassandra-connector_2.-1.5.-M2.jar
[info] Including from cache: cassandra-driver-core-2.2.-rc3.jar
[info] Including from cache: scala-reflect-2.10..jar
[info] Including from cache: scala-library-2.10..jar
[info] Checking every *.class/*.jar file's SHA-1.
[info] Merging files...
[warn] Merging 'META-INF/INDEX.LIST' with strategy 'discard'
[warn] Merging 'META-INF/MANIFEST.MF' with strategy 'discard'
[warn] Merging 'META-INF/io.netty.versions.properties' with strategy 'first'
[warn] Merging 'META-INF/maven/com.codahale.metrics/metrics-core/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/com.datastax.cassandra/cassandra-driver-core/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/com.google.guava/guava/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/com.twitter/jsr166e/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-buffer/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-codec/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-common/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-handler/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-transport/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/joda-time/joda-time/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.apache.commons/commons-lang3/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.joda/joda-convert/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.slf4j/slf4j-api/pom.xml' with strategy 'discard'
[warn] Strategy 'discard' was applied to 15 files
[warn] Strategy 'first' was applied to a file
[info] SHA-1: d2cb403e090e6a3ae36b08c860b258c79120fc90
[info] Packaging /Users/qpzhang/scala_code/CassandraTest/target/scala-2.10/CassandraTest-assembly-1.0.jar ...
[info] Done packaging.
[success] Total time: 19 s, completed 2015-11-26 10:12:22

4. Submitting to Spark and the results

qpzhang@qpzhangdeMac-mini:~/project/spark-1.5.-bin-hadoop2. $./bin/spark-submit --class "CassandraTestApp" --master local[] ~/scala_code/CassandraTest/target/scala-2.10/CassandraTest-assembly-1.0.jar
//...........................
// :: INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID , localhost, NODE_LOCAL,  bytes)
// :: INFO Executor: Running task 0.0 in stage 0.0 (TID )
// :: INFO Executor: Fetching http://10.60.215.42:57683/jars/CassandraTest-assembly-1.0.jar with timestamp 1448509221160
// :: INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
// :: INFO Utils: Fetching http://10.60.215.42:57683/jars/CassandraTest-assembly-1.0.jar to /private/var/folders/2l/195zcc1n0sn2wjfjwf9hl9d80000gn/T/spark-4030cadf-8489-4540-976e-e98eedf50412/userFiles-63085bda-aa04-4906-9621-c1cedd98c163/fetchFileTemp7487594.tmp
// :: INFO Executor: Adding file:/private/var/folders/2l/195zcc1n0sn2wjfjwf9hl9d80000gn/T/spark-4030cadf---976e-e98eedf50412/userFiles-63085bda-aa04---c1cedd98c163/CassandraTest-assembly-1.0.jar to class loader
// :: INFO Cluster: New Cassandra host localhost/127.0.0.1: added
// :: INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
// :: INFO Executor: Finished task 0.0 in stage 0.0 (TID ).  bytes result sent to driver
// :: INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID ) in  ms on localhost (/)
// :: INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: ResultStage  (collect at CassandraTest.scala:) finished in 2.481 s
// :: INFO DAGScheduler: Job  finished: collect at CassandraTest.scala:, took 2.940601 s
Existing Data: CassandraRow{key: 1, value: first row}
Existing Data: CassandraRow{key: 2, value: second row}
Existing Data: CassandraRow{key: 3, value: third row}
//....................
// :: INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: ResultStage  (collect at CassandraTest.scala:) finished in 0.032 s
// :: INFO DAGScheduler: Job  finished: collect at CassandraTest.scala:, took 0.046502 s
New Data: (4,fourth row)
New Data: (5,fifth row)
Work completed, stopping the Spark context.

The data in Cassandra

cqlsh:test> select * from key_value ;

 key | value
-----+------------
   5 |  fifth row
   1 |  first row
   2 | second row
   4 | fourth row
   3 |  third row

(5 rows)

Up to this point things went fairly smoothly. Apart from the assembly duplicate-file problem, everything was OK.
