Reading HBase Data with Spark

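import java.text.{DecimalFormat, SimpleDateFormat}

import scala.collection.mutable.HashMap

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

// Assumes an existing SparkContext named `sc` (for example, in spark-shell).
val conf = HBaseConfiguration.create()
// Load the cluster's HBase and Hadoop settings (paths are specific to this CDH 5.4.4 install).
conf.addResource(new Path("/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hbase/conf/hbase-site.xml"))
conf.addResource(new Path("/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/etc/hadoop/core-site.xml"))
// Table to scan.
conf.set(TableInputFormat.INPUT_TABLE, "FLOW")

// Optional server-side filter, left disabled: keep rows whose basic:age is 18 or greater.
// Set the filter on the Scan before serializing it into the configuration.
// convertScanToString is a helper that is not shown here.
// val scan = new Scan()
// scan.setFilter(new SingleColumnValueFilter("basic".getBytes, "age".getBytes,
//   CompareOp.GREATER_OR_EQUAL, Bytes.toBytes(18)))
// conf.set(TableInputFormat.SCAN, convertScanToString(scan))

// Each element of the RDD is a (row key, Result) pair.
val usersRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])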

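val data1 = usersRDD.count()
println("data length:" + data1)

// Timestamp format of the STIME and LTIME columns.
val sf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSS")

// ip_port_protocol -> (time window -> DBYTES samples)
val map = HashMap[String, HashMap[String, collection.mutable.ArrayBuffer[Double]]]()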

// collect() pulls every row to the driver, so this only suits tables that fit in driver memory.
usersRDD.collect().foreach {
  case (_, result) =>
    val key = Bytes.toInt(result.getRow)
    println("Key:" + key)
    val ip = Bytes.toString(result.getValue("F".getBytes, "SADDR".getBytes))
    val port = Bytes.toString(result.getValue("F".getBytes, "SPORT".getBytes))
    val startTimeLong = Bytes.toString(result.getValue("F".getBytes, "STIME".getBytes))
    val endTimeLong = Bytes.toString(result.getValue("F".getBytes, "LTIME".getBytes))
    val protocol = Bytes.toString(result.getValue("F".getBytes, "PROTO".getBytes))
    val sumTime = Bytes.toString(result.getValue("F".getBytes, "DUR".getBytes))
    val sum = Bytes.toString(result.getValue("F".getBytes, "DBYTES".getBytes)).toDouble
    println("ip:" + ip + ",port:" + port + ",startTime:" + startTimeLong + ",endTime:" + endTimeLong + ",protocol:" + protocol + ",sum:" + sum)

    // Samples are grouped first by ip_port_protocol, then by time window, for example:
    //   ip_port_udp -> 14:2_14:7 -> list of DBYTES values
    //   ip_port_tcp -> 15:2_15:7 -> list of DBYTES values
    val startTimeDate = sf.parse(startTimeLong)
    val endTimeLongDate = sf.parse(endTimeLong)
    val startHours = startTimeDate.getHours
    val startMinutes = startTimeDate.getMinutes
    val endHours = endTimeLongDate.getHours
    val endMinutes = endTimeLongDate.getMinutes

    val key1 = ip + "_" + port + "_" + protocol
    println("key1:" + key1)
    val key2 = startHours + ":" + startMinutes + "_" + endHours + ":" + endMinutes
    println("key2:" + key2)

    // Create the inner map and the sample buffer on first use, so that a new
    // ip_port_protocol key and a new time window under a known key are both recorded.
    val secondMap = map.getOrElseUpdate(key1, HashMap[String, collection.mutable.ArrayBuffer[Double]]())
    val sumArray = secondMap.getOrElseUpdate(key2, collection.mutable.ArrayBuffer[Double]())
    sumArray += sum
    println("map size-----------------:" + map.size)
}

println("map size:" + map.size)

map.foreach { case (resultKey1, resultVal1) =>
  println("--------------------Statistics start --------------------")
  println("resultKey1:" + resultKey1)
  resultVal1.foreach { case (resultKey2, resultVal2) =>
    println("resultKey2:" + resultKey2)
    println("-----------------resultVal2:" + resultVal2.length)
    resultVal2.foreach(v => println("------------------------f:" + v))

    // Wrap each sample in a one-element vector so colStats can summarize the single column.
    val dataArray = resultVal2.map(v => Vectors.dense(v))
    val summary: MultivariateStatisticalSummary = Statistics.colStats(sc.parallelize(dataArray))
    val mean = summary.mean(0)
    val variance = summary.variance(0)
    println("--------------------mean:" + mean + " --------------------")
    println("--------------------variance:" + variance + " --------------------")

    // Baseline band: mean plus or minus 1.96 standard deviations
    // (about 95% of values if the samples are normally distributed).
    val upbase = mean + 1.960 * math.sqrt(variance)
    val downbase = mean - 1.960 * math.sqrt(variance)
    println("------------------- " + upbase + " ---------- " + downbase)

    // Report the bounds rounded to at most two decimal places.
    val df = new DecimalFormat("0.##")
    println("ip port:" + resultKey1 + ",time:" + resultKey2 + ",upbase:" + df.format(upbase) + ",downbase:" + df.format(downbase))
  }
}

println("--------------------baseLine end --------------------")
sc.stop()
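
The grouping and baseline arithmetic in the listing above can be tried without a cluster. The following is a minimal sketch in plain Scala, not the code from this post: the `Flow` case class, the `BaselineSketch` object, and the sample values are all hypothetical and exist only for illustration. It divides by n - 1 to match the sample variance that `Statistics.colStats` reports, and it treats a group with a single sample as having zero variance, which is an assumption of this sketch.

```scala
// Hypothetical record type for this sketch; the field names are assumptions.
final case class Flow(ip: String, port: String, protocol: String, window: String, bytes: Double)

object BaselineSketch {
  /** Mean and sample variance (n - 1 denominator); variance is 0.0 for fewer than two samples. */
  def meanAndVariance(xs: Seq[Double]): (Double, Double) = {
    require(xs.nonEmpty, "at least one sample is required")
    val mean = xs.sum / xs.size
    val variance =
      if (xs.size < 2) 0.0
      else xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1)
    (mean, variance)
  }

  /** (lower, upper) bounds: mean minus and plus 1.96 standard deviations. */
  def band(xs: Seq[Double]): (Double, Double) = {
    val (mean, variance) = meanAndVariance(xs)
    val halfWidth = 1.96 * math.sqrt(variance)
    (mean - halfWidth, mean + halfWidth)
  }

  /** Groups flows by (ip_port_protocol, time window) and computes one band per group. */
  def baselines(flows: Seq[Flow]): Map[(String, String), (Double, Double)] =
    flows
      .groupBy(f => (s"${f.ip}_${f.port}_${f.protocol}", f.window))
      .map { case (key, group) => key -> band(group.map(_.bytes)) }

  def main(args: Array[String]): Unit = {
    // Made-up sample data, only to exercise the functions.
    val flows = Seq(
      Flow("10.0.0.1", "80", "tcp", "14:2_14:7", 2.0),
      Flow("10.0.0.1", "80", "tcp", "14:2_14:7", 4.0),
      Flow("10.0.0.1", "80", "tcp", "14:2_14:7", 6.0),
      Flow("10.0.0.2", "53", "udp", "15:2_15:7", 5.0)
    )
    baselines(flows).foreach { case ((key1, key2), (downbase, upbase)) =>
      println(s"ip port:$key1,time:$key2,upbase:$upbase,downbase:$downbase")
    }
  }
}
```

Grouping with `groupBy` builds the same two-level structure as the nested mutable maps, but as an immutable result keyed by a tuple, which avoids the missing-key branches entirely.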
