es-09: Spark integration
Integrating Elasticsearch with Spark is fairly simple: the elasticsearch-hadoop connector wraps everything needed, so you mostly just call its built-in methods.
Version compatibility notes:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/requirements.html
Maven dependency / installation notes:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html
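Two usage styles appear throughout the examples below: the implicit methods that "import org.elasticsearch.spark._" adds to SparkContext/RDD, and the explicit helper objects (EsSpark, EsSparkStreaming, EsSparkSQL). A minimal sketch of both, assuming a local cluster on 127.0.0.1 and a hypothetical demo/doc index:

import org.apache.spark.{SparkConf, SparkContext}

object EsSparkSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("es-spark-sketch").setMaster("local[*]")
      .set("es.nodes", "127.0.0.1")   // placeholder address
      .set("es.port", "9200")
    val sc = new SparkContext(conf)

    // style 1: implicit methods added by the import
    import org.elasticsearch.spark._
    sc.makeRDD(Seq(Map("msg" -> "hello"))).saveToEs("demo/doc")   // hypothetical index/type
    sc.esRDD("demo/doc").take(5).foreach(println)

    // style 2: the explicit helper object
    import org.elasticsearch.spark.rdd.EsSpark
    EsSpark.saveToEs(sc.makeRDD(Seq(Map("msg" -> "world"))), "demo/doc")

    sc.stop()
  }
}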
1. Maven configuration:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>xiaoniubigdata</artifactId>
        <groupId>com.wenbronk</groupId>
        <version>1.0</version>
    </parent>

    <modelVersion>4.0.0</modelVersion>

    <artifactId>spark06-es</artifactId>

    <properties>
        <spark.version>2.3.</spark.version>
        <spark.scala.version>2.11</spark.scala.version>
        <scala.version>2.11.</scala.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${spark.scala.version}</artifactId>
            <version>${spark.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${spark.scala.version}</artifactId>
            <version>${spark.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${spark.scala.version}</artifactId>
            <version>${spark.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch-spark-20_2.11</artifactId>
            <version>6.3.</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-deploy-plugin</artifactId>
                <version>2.8.</version>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
2. Using the RDD API
1) read
package com.wenbronk.spark.es.rdd

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Read data from ES
 */
object ReadMain {

  def main(args: Array[String]): Unit = {
    // val sparkconf = new SparkConf().setAppName("read-es").setMaster("local[4]")
    // val spark = new SparkContext(sparkconf)
    val sparkSession = SparkSession.builder()
      .appName("read-es-rdd")
      .master("local[4]")
      .config("es.index.auto.create", true)
      .config("es.nodes", "10.124.147.22")
      .config("es.port", "9200")
      .getOrCreate()

    val spark = sparkSession.sparkContext

    // the ES import adds the implicit esRDD method and lets us pass a custom query
    import org.elasticsearch.spark._

    // read each hit as an (id, field map) pair
    val esreadRdd: RDD[(String, collection.Map[String, AnyRef])] = spark.esRDD("macsearch_fileds/mac",
      """
        |{
        |  "query": {
        |    "match_all": {}
        |  }
        |}
      """.stripMargin)

    // count documents per mac value
    val value: RDD[(Option[AnyRef], Int)] = esreadRdd.map(_._2.get("mac"))
      .map(mac => (mac, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2)

    val tuples: Array[(Option[AnyRef], Int)] = value.collect()
    tuples.foreach(println)

    esreadRdd.saveAsTextFile("/Users/bronkwen/work/IdeaProjects/xiaoniubigdata/spark06-es/target/json")

    sparkSession.close()
  }
}
2) readJson
package com.wenbronk.spark.es.rdd

import org.apache.spark.sql.SparkSession

object ReadJsonMain {

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .appName("read-es-rdd")
      .master("local[4]")
      .config("es.index.auto.create", true)
      .config("es.nodes", "10.124.147.22")
      .config("es.port", "9200")
      .getOrCreate()

    val spark = sparkSession.sparkContext

    // read each hit as a raw JSON string, with a custom query
    import org.elasticsearch.spark._
    val esJsonRdd = spark.esJsonRDD("macsearch_fileds/mac",
      """
      {
        "query": {
          "match_all": {}
        }
      }
      """.stripMargin)

    esJsonRdd.map(_._2).saveAsTextFile("/Users/bronkwen/work/IdeaProjects/xiaoniubigdata/spark06-es/target/json")

    sparkSession.close()
  }
}
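Since each record comes back as a raw JSON string, the strings can be handed straight to Spark SQL for schema inference. A small sketch of my own (not in the original code) that would go right before the sparkSession.close() call, reusing the esJsonRdd from above:

    // turn the (id, json-string) pairs into a DataFrame; spark.read.json infers the schema
    import sparkSession.implicits._
    val jsonDf = sparkSession.read.json(esJsonRdd.map(_._2).toDS())
    jsonDf.printSchema()
    jsonDf.show(10)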
3) write
package com.wenbronk.spark.es.rdd

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.rdd.EsSpark

object WriteMain {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("write-spark-es")
      .config("es.index.auto.create", true)
      .config("es.nodes", "10.124.147.22")
      .config("es.port", "9200")
      .getOrCreate()

    // the text file written by the readJson example: one JSON document per line
    val df: RDD[String] = spark.sparkContext.textFile("/Users/bronkwen/work/IdeaProjects/xiaoniubigdata/spark06-es/target/json")

    import org.elasticsearch.spark._
    // df.saveToEs("spark/docs")
    // EsSpark.saveToEs(df, "spark/docs")
    EsSpark.saveJsonToEs(df, "spark/json")

    spark.close()
  }
}
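Besides raw JSON, the connector can serialize Maps and case classes directly. A sketch of my own (the Trip case class and the spark/trips resource are hypothetical) showing EsSpark.saveToEs with es.mapping.id so a field becomes the document _id:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark.rdd.EsSpark

// hypothetical record type for this sketch
case class Trip(id: Int, departure: String, arrival: String)

object WriteCaseClassSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("write-case-class").setMaster("local[4]")
      .set("es.nodes", "10.124.147.22")
    val sc = new SparkContext(conf)

    val trips = sc.makeRDD(Seq(Trip(1, "OTP", "SFO"), Trip(2, "MUC", "OTP")))
    // es.mapping.id tells the connector which field to use as the document _id
    EsSpark.saveToEs(trips, "spark/trips", Map("es.mapping.id" -> "id"))

    sc.stop()
  }
}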
4) Writing to multiple indices
package com.wenbronk.spark.es.rdd

import org.apache.spark.sql.SparkSession

object WriteMultiIndex {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("es-spark-multiindex")
      .config("es.index.auto.create", true)
      .config("es.nodes", "10.124.147.22")
      .config("es.port", "9200")
      .getOrCreate()

    val sc = spark.sparkContext

    val game = Map("media_type" -> "game", "title" -> "FF VI", "year" -> "")
    val book = Map("media_type" -> "book", "title" -> "Harry Potter", "year" -> "")
    val cd = Map("media_type" -> "music", "title" -> "Surfing With The Alien")

    import org.elasticsearch.spark._
    // custom metadata can be attached per document; here only the _id (the key of each pair),
    // while {media_type} in the resource pattern routes each document to its own index
    sc.makeRDD(Seq((1, game), (2, book), (3, cd))).saveToEsWithMeta("my-collection-{media_type}/doc")

    spark.close()
  }
}
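To check what actually landed in Elasticsearch, the es.read.metadata option makes the connector return each document's metadata (_id, _index, ...) alongside its fields. A sketch of my own, continuing with the sc from above and the my-collection-game index produced by the {media_type} pattern:

    import org.elasticsearch.spark._
    // es.read.metadata adds a _metadata entry (id, index, type, ...) to every returned document
    val withMeta = sc.esRDD("my-collection-game/doc", Map("es.read.metadata" -> "true"))
    withMeta.take(10).foreach(println)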
3. Streaming
1) write
package com.wenbronk.spark.es.stream

import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.elasticsearch.spark.streaming.EsSparkStreaming

import scala.collection.mutable

object WriteStreamingMain {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("es-spark-streaming-write").setMaster("local[4]")
    conf.set("es.index.auto.create", "true")
    conf.set("es.nodes", "10.124.147.22")
    // es.port defaults to 9200; SparkConf values are strings, so set it as conf.set("es.port", "9200")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(1))

    val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
    val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")

    val rdd = sc.makeRDD(Seq(numbers, airports))
    val microbatches = mutable.Queue(rdd)
    val dstream: InputDStream[Map[String, Any]] = ssc.queueStream(microbatches)

    // import org.elasticsearch.spark.streaming._
    // dstream.saveToEs("sparkstreaming/doc")
    // with a document id taken from a field:
    // EsSparkStreaming.saveToEs(dstream, "spark/docs", Map("es.mapping.id" -> "id"))
    // as JSON (only for streams whose elements are already JSON strings):
    // EsSparkStreaming.saveJsonToEs(dstream, "sparkstreaming/json")
    EsSparkStreaming.saveToEs(dstream, "sparkstreaming/doc")

    ssc.start()
    ssc.awaitTermination()
  }
}
2) Writing with metadata (the same approach works for plain RDDs)
package com.wenbronk.spark.es.stream

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable

object WriteStreamMeta {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("es-spark-streaming-write").setMaster("local[4]")
    conf.set("es.index.auto.create", "true")
    conf.set("es.nodes", "10.124.147.22")
    // es.port defaults to 9200; set it as a string if needed: conf.set("es.port", "9200")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(1))

    val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
    val muc = Map("iata" -> "MUC", "name" -> "Munich")
    val sfo = Map("iata" -> "SFO", "name" -> "San Fran")

    // the key of each pair is used as the document _id
    val airportsRDD = sc.makeRDD(Seq((1, otp), (2, muc), (3, sfo)))
    val microbatches = mutable.Queue(airportsRDD)

    import org.elasticsearch.spark.streaming._
    ssc.queueStream(microbatches).saveToEsWithMeta("airports/2015")

    ssc.start()
    ssc.awaitTermination()
  }

  /**
   * Using several kinds of metadata per document
   */
  def main1(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("es-spark-streaming-write").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(1))

    val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
    val muc = Map("iata" -> "MUC", "name" -> "Munich")
    val sfo = Map("iata" -> "SFO", "name" -> "San Fran")

    // metadata keys come from the connector's Metadata enum,
    // and the maps do not need the same entries for every document
    import org.elasticsearch.spark.rdd.Metadata._
    val otpMeta = Map(ID -> 1, TTL -> "3h")
    val mucMeta = Map(ID -> 2, VERSION -> "23")
    val sfoMeta = Map(ID -> 3)

    val airportsRDD = sc.makeRDD(Seq((otpMeta, otp), (mucMeta, muc), (sfoMeta, sfo)))
    val microbatches = mutable.Queue(airportsRDD)

    import org.elasticsearch.spark.streaming._
    ssc.queueStream(microbatches).saveToEsWithMeta("airports/2015")

    ssc.start()
    ssc.awaitTermination()
  }
}
4. Using Spark SQL
1) read
package com.wenbronk.spark.es.sql

import org.apache.spark.sql.{DataFrame, SparkSession}

object ESSqlReadMain {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("es-sql-read")
      .config("es.index.auto.create", true)
      // let the connector translate SQL filters into ES query DSL
      .config("pushdown", true)
      .config("es.nodes", "10.124.147.22")
      .config("es.port", "9200")
      .getOrCreate()

    // full read without a query:
    // val df: DataFrame = spark.read.format("es").load("macsearch_fileds/mac")

    import org.elasticsearch.spark.sql._
    val df = spark.esDF("macsearch_fileds/mac",
      """
        |{
        |  "query": {
        |    "match_all": {}
        |  }
        |}
      """.stripMargin)

    // inspect the schema and run SQL on a temp view
    df.printSchema()
    df.createOrReplaceTempView("macsearch_fileds")

    val dfSql: DataFrame = spark.sql(
      """
        select
          mac,
          count(mac) con
        from macsearch_fileds
        group by mac
        order by con desc
      """.stripMargin)
    dfSql.show()

    // save to a local file
    df.write.json("/Users/bronkwen/work/IdeaProjects/xiaoniubigdata/spark06-es/target/sql/json")

    spark.stop()
  }
}
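Pushdown only kicks in for filters expressed through the DataFrame/SQL API, which the connector rewrites into query DSL. A sketch of my own (same cluster settings as above) using the options-based read path plus a filter the connector can push down:

    // alternative read path: pass the es.* settings as reader options instead of session config
    val dfOpt = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "10.124.147.22")
      .option("pushdown", "true")
      .load("macsearch_fileds/mac")

    // this filter is translated into an Elasticsearch query instead of being evaluated in Spark
    dfOpt.filter(dfOpt("mac").isNotNull).show(10)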
2) write
package com.wenbronk.spark.es.sql

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.elasticsearch.spark.sql.EsSparkSQL

object ESSqlWriteMain {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("es-sql-write")
      .config("es.index.auto.create", true)
      .config("es.nodes", "10.124.147.22")
      .config("es.port", "9200")
      .getOrCreate()

    val df: DataFrame = spark.read.format("json").load("/Users/bronkwen/work/IdeaProjects/xiaoniubigdata/spark06-es/target/sql/json")
    df.show()

    // either the implicit method or the helper object writes the DataFrame directly
    // import org.elasticsearch.spark.sql._
    // df.saveToEs("spark/people")
    EsSparkSQL.saveToEs(df, "spark/people")

    spark.close()
  }
}
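EsSparkSQL.saveToEs also accepts a settings map, so a stable document _id can be taken from a column. A one-line sketch assuming the DataFrame has an id column (hypothetical here):

    // hypothetical: reuse the "id" column as the Elasticsearch document _id
    EsSparkSQL.saveToEs(df, "spark/people", Map("es.mapping.id" -> "id"))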
5. Structured Streaming
I am not yet very familiar with Structured Streaming, so this part is only a first attempt; I'll revisit it later.
package com.wenbronk.spark.es.structstream

import org.apache.spark.sql.SparkSession

object StructStreamWriteMain {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structstream-es-write")
      .master("local[4]")
      .config("es.index.auto.create", true)
      .config("es.nodes", "10.124.147.22")
      .config("es.port", "9200")
      .getOrCreate()

    // note: a file-based streaming source needs an explicit schema (see the sketch below)
    val df = spark.readStream
      .format("json")
      .load("/Users/bronkwen/work/IdeaProjects/xiaoniubigdata/spark06-es/target/json")

    df.writeStream
      .option("checkpointLocation", "/save/location")
      .format("es")
      .start()

    spark.close()
  }
}
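As written, the example above will not run: a file-based streaming source requires an explicit schema, the es sink needs a target resource, and closing the session right after start() kills the query. A corrected sketch under those assumptions (the single mac field and the sparkstructured/json resource are guesses on my part):

package com.wenbronk.spark.es.structstream

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object StructStreamWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structstream-es-write-sketch")
      .master("local[4]")
      .config("es.index.auto.create", true)
      .config("es.nodes", "10.124.147.22")
      .config("es.port", "9200")
      .getOrCreate()

    // file sources require a schema up front; "mac" is an assumed field name
    val schema = StructType(Seq(StructField("mac", StringType)))

    val df = spark.readStream
      .schema(schema)
      .json("/Users/bronkwen/work/IdeaProjects/xiaoniubigdata/spark06-es/target/json")

    val query = df.writeStream
      .option("checkpointLocation", "/save/location")
      .format("es")
      .start("sparkstructured/json")   // hypothetical index/type

    query.awaitTermination()
  }
}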