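In day-to-day work we sometimes need to read MySQL data into a DataFrame as the data source for later Spark processing. Spark ships with built-in methods for this: when reading from MySQL it uses the table's own schema directly, so there is no need to define each field by hand.
Here is how I implemented it.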

1. MySQL settings:

I keep the MySQL settings in an external configuration file, which makes it easy to add more configuration later.


 // Example configuration file:
 [hdfs@iptve2e03 tmp_lillcol]$ cat job.properties
 # MySQL database configuration
 mysql.driver=com.mysql.jdbc.Driver
 mysql.url=jdbc:mysql://127.0.0.1:3306/database1?useSSL=false&autoReconnect=true&failOverReadOnly=false&rewriteBatchedStatements=true
 mysql.username=user
 mysql.password=123456
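Because `Properties.getProperty` returns `null` for a missing key, a typo in the configuration file only surfaces later, when the JDBC options are built. The following is a minimal sketch of a fail-fast check, assuming the four keys shown above are all required. The object name `MysqlConfigCheck` and its methods are hypothetical helpers for illustration, and a `StringReader` stands in for the real file.

```scala
import java.io.StringReader
import java.util.Properties

object MysqlConfigCheck {
  // Keys taken from the job.properties example above.
  val requiredKeys: Seq[String] =
    Seq("mysql.driver", "mysql.url", "mysql.username", "mysql.password")

  /** Parses properties text; a StringReader stands in for the real file here. */
  def parse(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  /** Returns the required keys that are absent or blank in `props`. */
  def missingKeys(props: Properties): Seq[String] =
    requiredKeys.filter { key =>
      val value = props.getProperty(key)
      value == null || value.trim.isEmpty
    }
}
```

A job could call `missingKeys` right after loading the file and stop with a clear message if the returned list is not empty.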

2. Required jar dependencies

These are in sbt form; adapt them accordingly for Maven.

 libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0-cdh5.7.2"

 libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.0-cdh5.7.2"

 libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.6.0-cdh5.7.2"

 libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.2.0-cdh5.7.2"

 libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.2.0-cdh5.7.2"

 libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.2.0-cdh5.7.2"

 libraryDependencies += "org.apache.hbase" % "hbase-protocol" % "1.2.0-cdh5.7.2"

 libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.38"

 libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0-cdh5.7.2"

 libraryDependencies += "com.yammer.metrics" % "metrics-core" % "2.2.0"
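The list above appears to come from a larger project. The code in this article only imports Spark core, Spark SQL and the Hive context, and loads the MySQL driver by class name, so a smaller build definition may be enough for this example alone. The following is a sketch under several assumptions that are not part of the original: the Scala version, the `provided` scope (the cluster supplies Spark at runtime), and the Cloudera repository as the resolver for the `-cdh5.7.2` artifacts.

```scala
// build.sbt (sketch; scalaVersion, resolver and "provided" scope are assumptions)
scalaVersion := "2.10.5"

// Cloudera builds such as 1.6.0-cdh5.7.2 are served from Cloudera's own repository.
resolvers += "cloudera-repos" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.6.0-cdh5.7.2" % "provided",
  "org.apache.spark" % "spark-sql_2.10"  % "1.6.0-cdh5.7.2" % "provided",
  "org.apache.spark" % "spark-hive_2.10" % "1.6.0-cdh5.7.2" % "provided",
  // The JDBC driver must be on the classpath of the driver and the executors.
  "mysql" % "mysql-connector-java" % "5.1.38"
)
```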

3. Complete implementation

 import java.io.FileInputStream
 import java.util.Properties

 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.{DataFrame, SQLContext}
 import org.apache.spark.{SparkConf, SparkContext}

 /**
   * @author Administrator
   *         2018/10/16-9:18
   */
 object TestReadMysql {
   var hdfsPath: String = ""
   var proPath: String = ""
   var DATE: String = ""

   val sparkConf: SparkConf = new SparkConf().setAppName(getClass.getSimpleName)
   val sc: SparkContext = new SparkContext(sparkConf)
   val sqlContext: SQLContext = new HiveContext(sc)

   def main(args: Array[String]): Unit = {
     hdfsPath = args(0)
     proPath = args(1)

     // Read without a filter
     val dim_sys_city_dict: DataFrame = readMysqlTable(sqlContext, "TestMysqlTble1", proPath)
     dim_sys_city_dict.show(10)

     // Read with a filter
     val dim_sys_city_dict1: DataFrame = readMysqlTable(sqlContext, "TestMysqlTble1", "city_id=240", proPath)
     dim_sys_city_dict1.show(10)
   }

   /**
     * Reads a MySQL table.
     *
     * @param sqlContext SQLContext used to read the table
     * @param tableName  name of the MySQL table to read
     * @param proPath    path of the configuration file
     * @return the MySQL table as a DataFrame
     */
   def readMysqlTable(sqlContext: SQLContext, tableName: String, proPath: String): DataFrame = {
     val properties: Properties = getProPerties(proPath)
     sqlContext
       .read
       .format("jdbc")
       .option("url", properties.getProperty("mysql.url"))
       .option("driver", properties.getProperty("mysql.driver"))
       .option("user", properties.getProperty("mysql.username"))
       .option("password", properties.getProperty("mysql.password"))
       //        .option("dbtable", tableName.toUpperCase)
       .option("dbtable", tableName)
       .load()
   }

   /**
     * Reads a MySQL table with a filter condition.
     *
     * @param sqlContext      SQLContext used to read the table
     * @param table           name of the MySQL table to read
     * @param filterCondition filter condition (the body of a SQL WHERE clause)
     * @param proPath         path of the configuration file
     * @return the filtered MySQL table as a DataFrame
     */
   def readMysqlTable(sqlContext: SQLContext, table: String, filterCondition: String, proPath: String): DataFrame = {
     val properties: Properties = getProPerties(proPath)
     // The "dbtable" option also accepts a parenthesised subquery with an alias,
     // so the filter is evaluated by MySQL rather than by Spark.
     val tableName = "(select * from " + table + " where " + filterCondition + " ) as t1"
     sqlContext
       .read
       .format("jdbc")
       .option("url", properties.getProperty("mysql.url"))
       .option("driver", properties.getProperty("mysql.driver"))
       .option("user", properties.getProperty("mysql.username"))
       .option("password", properties.getProperty("mysql.password"))
       .option("dbtable", tableName)
       .load()
   }

   /**
     * Loads the configuration file.
     *
     * @param proPath path of the configuration file
     * @return the loaded Properties
     */
   def getProPerties(proPath: String): Properties = {
     val properties: Properties = new Properties()
     val in = new FileInputStream(proPath)
     // Close the stream even if loading fails.
     try properties.load(in) finally in.close()
     properties
   }
 }
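The two `readMysqlTable` overloads above differ only in the value passed to the `dbtable` option: a plain table name, or a parenthesised subquery with an alias. As a minimal sketch, that difference can be isolated in one small function. The name `DbtableBuilder.dbtable` is a hypothetical helper, not part of the original code. Note that the filter is concatenated into the SQL text as-is, so it should only come from trusted code, never from user input.

```scala
object DbtableBuilder {
  /**
    * Builds the value for the JDBC "dbtable" option.
    * Without a filter (or with a blank one) the plain table name is returned;
    * with a filter, the query is wrapped in an aliased subquery.
    */
  def dbtable(table: String, filterCondition: Option[String]): String =
    filterCondition.map(_.trim).filter(_.nonEmpty) match {
      case Some(cond) => s"(select * from $table where $cond) as t1"
      case None       => table
    }
}
```

With such a helper, a single read method could take an `Option[String]` filter and pass the result to `.option("dbtable", ...)`, instead of repeating the whole option chain twice.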

4. Test

 def main(args: Array[String]): Unit = {
   hdfsPath = args(0)
   proPath = args(1)

   // Read without a filter
   val dim_sys_city_dict: DataFrame = readMysqlTable(sqlContext, "TestMysqlTble1", proPath)
   dim_sys_city_dict.show(10)

   // Read with a filter
   val dim_sys_city_dict1: DataFrame = readMysqlTable(sqlContext, "TestMysqlTble1", "city_id=240", proPath)
   dim_sys_city_dict1.show(10)
 }
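The `main` method reads `args(0)` and `args(1)` directly, so launching the job with fewer than two arguments fails with an `ArrayIndexOutOfBoundsException`. The following is a small sketch of an explicit check. `JobArgs`, `Parsed` and the usage message are hypothetical names introduced for illustration.

```scala
object JobArgs {
  /** The two positional arguments used by main: args(0) and args(1). */
  final case class Parsed(hdfsPath: String, proPath: String)

  /** Returns the parsed arguments, or a usage message if too few were given. */
  def parse(args: Array[String]): Either[String, Parsed] =
    if (args.length < 2)
      Left("usage: TestReadMysql <hdfsPath> <proPath>")
    else
      Right(Parsed(args(0), args(1)))
}
```

In `main`, a `Left` result could be printed before exiting, so a missing configuration path is reported clearly rather than as an index error.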

5. Results

The data has been masked for confidentiality reasons.

 // Result of the unfiltered read
 +-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+
 |dict_id|city_id|city_name|city_code|group_id|group_name|area_code|           bureau_id|sort|bureau_name|
 +-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+
 |      1|    249|       **|    **_ab|     100|      **按时|    **-查到|           xcaasd...|  21|       张三公司|
 |      2|    240|       **|    **_ab|     300|      **按时|    **-查到|           xcaasd...|  21|       张三公司|
 |      3|    240|       **|    **_ab|     100|      **按时|    **-查到|           xcaasd...|  21|       张三公司|
 |      4|    242|       **|    **_ab|     300|      **按时|    **-查到|           xcaasd...|  01|       张三公司|
 |      5|    246|       **|    **_ab|     100|      **按时|    **-查到|           xcaasd...|  01|       张三公司|
 |      6|    246|       **|    **_ab|     300|      **按时|    **-查到|           xcaasd...|  01|       张三公司|
 |      7|    248|       **|    **_ab|     200|      **按时|    **-查到|           xcaasd...|  01|       张三公司|
 |      8|    242|       **|    **_ab|     400|      **按时|    **-查到|           xcaasd...|  01|       张三公司|
 |      9|    247|       **|    **_ab|     200|      **按时|    **-查到|           xcaasd...|  01|       张三公司|
 |      0|    243|       **|    **_ab|     400|      **按时|    **-查到|           xcaasd...|  01|       张三公司|
 +-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+

 // Result of the filtered read
 +-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+
 |dict_id|city_id|city_name|city_code|group_id|group_name|area_code|           bureau_id|sort|bureau_name|
 +-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+
 |      2|    240|       **|    **_JM|     300|      **按时|    **-查到|           xcaasd...|  21|       张三公司|
 |      3|    240|       **|    **_ZS|     100|      **按时|    **-查到|           xcaasd...|  21|       张三公司|
 |      6|    240|       **|    **_JY|     400|      **按时|    **-查到|           xcaasd...|  01|       张三公司|
 +-------+-------+---------+---------+--------+----------+---------+--------------------+----+-----------+

6. Summary

Reading MySQL is not hard; it mostly comes down to configuring a few parameters.
I am recording it here for future reference.
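One group of parameters worth knowing beyond the five used above controls parallelism. Without it, the JDBC source reads the whole table through a single partition. Spark's JDBC source accepts `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions`, which must be given together and require a numeric column. The following sketch builds the option map as plain Scala. `JdbcReadOptions` and its methods are hypothetical helpers, and using `dict_id` as the partition column in the usage note is an assumption based on the sample output.

```scala
object JdbcReadOptions {
  /** Base options matching the ones passed one by one in readMysqlTable. */
  def base(url: String, driver: String, user: String,
           password: String, dbtable: String): Map[String, String] =
    Map("url" -> url, "driver" -> driver, "user" -> user,
      "password" -> password, "dbtable" -> dbtable)

  /** Adds the four partitioning options; Spark expects all four together. */
  def partitioned(opts: Map[String, String], column: String,
                  lower: Long, upper: Long, numPartitions: Int): Map[String, String] = {
    require(numPartitions > 0, "numPartitions must be positive")
    require(lower < upper, "lowerBound must be less than upperBound")
    opts ++ Map(
      "partitionColumn" -> column,
      "lowerBound" -> lower.toString,
      "upperBound" -> upper.toString,
      "numPartitions" -> numPartitions.toString)
  }
}
```

The resulting map can be handed to the reader in one call, for example `sqlContext.read.format("jdbc").options(JdbcReadOptions.partitioned(baseOpts, "dict_id", 1L, 1000L, 4)).load()`. The bounds only decide how the column's range is split across partitions; they do not filter rows.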

This article is a summary of my day-to-day work. Please credit the source when reposting!
