Scala code for parsing unstructured data with Spark and storing it in Hive

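// Submit the packaged jar:

// /usr/local/spark/bin$  spark-submit --class "getkv" /data/chun/sparktes.jar

// (--class must name the application's main object; the object defined below is `split`.)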

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object split {

  def main(args: Array[String]): Unit = {

    // A master set in code takes precedence over the --master option of spark-submit.
    val cf = new SparkConf().setAppName("ass").setMaster("local")
    val sc = new SparkContext(cf)

    // HiveContext extends SQLContext, so a single context both builds the DataFrames and writes the Hive tables.
    val hc = new HiveContext(sc)

    // Date directory to read. The millisecond offset subtracted here is elided (****) in the original post.
    val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
    val date = format.format(new java.util.Date().getTime-****)

    // Every gzipped log file under the date directory; each line is one record.
    val lg = sc.textFile("hdfs://master:9000/data/" + date + "/*/*.gz")

    // Each field is cut out with split: take the text after the last occurrence of the key,
    // then keep everything up to the next quote (string values) or comma (numeric values).
    // Caveat: if the key is absent, split(...).last is the whole line, so the column receives
    // the start of the line instead of an empty value.

    // Table test.f1: device and playback-page fields.
    val filed1 = lg.map(l => (l.split("android_id\":\"").last.split("\"").head,
      l.split("anylst_ver\":").last.split(",").head,
      l.split("area\":\"").last.split("\"").head,
      l.split("build_CPU_ABI\":\"").last.split("\"").head,
      l.split("build_board\":\"").last.split("\"").head,
      l.split("build_model\":\"").last.split("\"").head,
      l.split("\"city\":\"").last.split("\"").head,
      l.split("country\":\"").last.split("\"").head,
      l.split("cpuCount\":").last.split(",").head,
      l.split("cpuName\":\"").last.split("\"").head,
      l.split("custom_uuid\":\"").last.split("\"").head,
      l.split("cid\":\"").last.split("\"").head,
      l.split("definition\":\"").last.split("\"").head,
      l.split("firstTitle\":\"").last.split("\"").head,
      l.split("modeType\":\"").last.split("\"").head,
      l.split("pageName\":\"").last.split("\"").head,
      l.split("playIndex\":\"").last.split("\"").head,
      l.split("rectime\":").last.split(",").head,
      l.split("time\":\"").last.split("\"").head))

    val scoreDataFrame1 = hc.createDataFrame(filed1).toDF(
      "android_id", "anylst_ver", "area", "build_CPU_ABI", "build_board", "build_model",
      "city", "country", "cpuCount", "cpuName", "custom_uuid", "cid", "definition",
      "firstTitle", "modeType", "pageName", "playIndex", "rectime", "time")

    // Append to the Hive table (it is created on the first run).
    scoreDataFrame1.write.mode(SaveMode.Append).saveAsTable("test.f1")

    // Table test.f2: playback, network and memory fields.
    val filed2 = lg.map(l => (l.split("custom_uuid\":\"").last.split("\"").head,
      l.split("playType\":\"").last.split("\"").head,
      l.split("prevName\":\"").last.split("\"").head,
      l.split("prevue\":").last.split(",").head,
      l.split("siteName\":\"").last.split("\"").head,
      l.split("title\":\"").last.split("\"").head,
      l.split("uuid\":\"").last.split("\"").head,
      l.split("vod_seek\":\"").last.split("\"").head,
      l.split("device_id\":\"").last.split("\"").head,
      l.split("device_name\":\"").last.split("\"").head,
      l.split("dpi\":").last.split(",").head,
      l.split("eth0_mac\":\"").last.split("\"").head,
      l.split("ip\":\"").last.split("\"").head,
      l.split("ipaddr\":\"").last.split("\"").head,
      l.split("isp\":\"").last.split("\"").head,
      l.split("largeMem\":").last.split(",").head,
      l.split("limitMem\":").last.split(",").head,
      l.split("packageName\":\"").last.split("\"").head,
      l.split("rectime\":").last.split(",").head,
      l.split("time\":\"").last.split("\"").head))

    val scoreDataFrame2 = hc.createDataFrame(filed2).toDF(
      "custom_uuid", "playType", "prevName", "prevue", "siteName", "title", "uuid",
      "vod_seek", "device_id", "device_name", "dpi", "eth0_mac", "ip", "ipaddr", "isp",
      "largeMem", "limitMem", "packageName", "rectime", "time")

    scoreDataFrame2.write.mode(SaveMode.Append).saveAsTable("test.f2")

    // Table test.f3: screen, version and channel fields.
    val filed3 = lg.map(l => (l.split("custom_uuid\":\"").last.split("\"").head,
      l.split("region\":\"").last.split("\"").head,
      l.split("screenHeight\":").last.split(",").head,
      l.split("screenWidth\":").last.split(",").head,
      l.split("serial_number\":\"").last.split("\"").head,
      l.split("touchMode\":").last.split(",").head,
      l.split("umengChannel\":\"").last.split("\"").head,
      l.split("vercode\":").last.split(",").head,
      l.split("vername\":\"").last.split("\"").head,
      l.split("wlan0_mac\":\"").last.split("\"").head,
      l.split("rectime\":").last.split(",").head,
      l.split("time\":\"").last.split("\"").head))

    val scoreDataFrame3 = hc.createDataFrame(filed3).toDF(
      "custom_uuid", "region", "screenHeight", "screenWidth", "serial_number", "touchMode",
      "umengChannel", "vercode", "vername", "wlan0_mac", "rectime", "time")

    scoreDataFrame3.write.mode(SaveMode.Append).saveAsTable("test.f3")

  }

}
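
The split-based extraction above silently returns the beginning of the line when a key is missing. As a possible alternative, the sketch below pulls a field out with a regular expression and returns None when the key is absent. The object FieldExtract, its two helpers, and the sample log line are all hypothetical and not part of the original post. The post does not show a real log line, so the JSON-like shape used here is an assumption. Unlike the original patterns, these helpers also require the opening quote of the key, so "cid" would no longer match a longer key that merely ends in "cid".

```scala
import java.util.regex.Pattern

// Hypothetical helpers, not from the original post.
object FieldExtract {

  // Value of a quoted string field such as "city":"...", or None if the key is absent.
  def stringField(line: String, key: String): Option[String] = {
    val pattern = ("\"" + Pattern.quote(key) + "\":\"([^\"]*)\"").r
    // Keep the last match, mirroring the `.last` of the split-based code.
    pattern.findAllMatchIn(line).map(_.group(1)).toList.lastOption
  }

  // Value of an unquoted field such as "cpuCount":4, read up to the next comma, brace or space.
  def numberField(line: String, key: String): Option[String] = {
    val pattern = ("\"" + Pattern.quote(key) + "\":([^,}\\s]+)").r
    pattern.findAllMatchIn(line).map(_.group(1)).toList.lastOption
  }
}

object FieldExtractDemo {
  def main(args: Array[String]): Unit = {
    // Assumed record shape, for illustration only.
    val line = """{"android_id":"a1b2","cpuCount":4,"city":"Beijing","rectime":1500000000}"""
    println(FieldExtract.stringField(line, "city"))
    println(FieldExtract.numberField(line, "cpuCount"))
    println(FieldExtract.stringField(line, "isp"))
  }
}
```

Inside the Spark job, each tuple element could then be written as FieldExtract.stringField(l, "city").getOrElse(""), so that a missing key yields an empty column rather than a fragment of the line.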
