Hive Project in Practice: Using Hive to Analyze the Logic Behind How "Yu'e Bao" Makes Big Money Lying Down

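I. Project Background

A couple of years ago Alipay launched "Yu'e Bao", which grabbed a great deal of attention and pulled in a large volume of small deposits. Yu'e Bao raised the return on users' spare cash to an annualized yield of about 4.0%, far above the roughly 0.3% that banks pay on demand deposits, and it is shaking the banks' position of earning money without lifting a finger.

In the financial markets, an annualized return of around 4%-5% is not hard to get: a "reverse repo" delivers the same. When money is tight (banks are short of cash), the overnight repo rate can spike as high as 50%. Then we can sit happily at home counting our money!

A reverse repo, put simply, means that you (A) lend money to someone else (B), and at maturity B pays you (A) back the principal plus interest at the agreed rate. A reverse repo is generally regarded as close to risk-free (it works much like a bank savings deposit). The interest paid by the much-hyped Yu'e Bao from Ali Finance is on par with reverse repo rates, so we can guess that Yu'e Bao's funds are also placed in reverse repos, which keeps the money liquid while still earning steady interest.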
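To make these rates concrete, here is a minimal sketch of the interest arithmetic. It assumes simple interest, rates quoted as annualized, a 365-day year and no fees; the class name RepoInterest and the 100,000 principal are made up for illustration and do not come from the dataset.

```java
/** Illustrative interest arithmetic; all names and amounts are hypothetical. */
final class RepoInterest {

    /** Simple interest on `principal` at `annualRate` for `days`, assuming a 365-day year. */
    static double interest(double principal, double annualRate, int days) {
        return principal * annualRate * days / 365.0;
    }

    public static void main(String[] args) {
        double principal = 100_000.0; // hypothetical amount
        System.out.println("reverse repo, 4% for 1 day:    " + interest(principal, 0.04, 1));
        System.out.println("reverse repo, 50% for 1 day:   " + interest(principal, 0.50, 1));
        System.out.println("demand deposit, 0.3% for 1 day: " + interest(principal, 0.003, 1));
    }
}
```

Under these assumptions, lending 100,000 overnight earns roughly 10.96 at 4% and roughly 136.99 at 50%, against roughly 0.82 on a 0.3% demand deposit, which is why catching the intraday high matters.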

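II. Requirements Analysis

By analyzing historical data, find the trend pattern, identify the intraday high, and place the reverse repo at that point to earn the highest interest.

III. Project Dataset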

Download the dataset from this link.

The data format is as follows:

  tradedate: trade date
  tradetime: trade time
  stockid: stock ID
  buyprice: buy price
  buysize: buy quantity
  sellprice: sell price
  sellsize: sell quantity
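As a rough illustration of this row layout, the sketch below parses one comma-separated line into a typed object. The class name Tick and the sample line are hypothetical: the real date and time formats are not shown in this post, so the values here are placeholders only.

```java
/** One row of the dataset, in the field order listed above. Hypothetical helper. */
final class Tick {
    final String tradeDate;
    final String tradeTime;
    final String stockId;
    final double buyPrice;
    final int buySize;
    final double sellPrice;
    final int sellSize;

    private Tick(String[] f) {
        tradeDate = f[0].trim();
        tradeTime = f[1].trim();
        stockId = f[2].trim();
        buyPrice = Double.parseDouble(f[3].trim());
        buySize = Integer.parseInt(f[4].trim());
        sellPrice = Double.parseDouble(f[5].trim());
        sellSize = Integer.parseInt(f[6].trim());
    }

    /** Parses one comma-separated line; rejects lines that do not have exactly 7 fields. */
    static Tick parse(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 7) {
            throw new IllegalArgumentException("expected 7 fields, got " + fields.length);
        }
        return new Tick(fields);
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from the real dataset.
        Tick t = Tick.parse("20130902,093000,204001,4.05,100,4.06,200");
        System.out.println(t.stockId + " " + t.tradeDate + " " + t.tradeTime
                + " buy=" + t.buyPrice + " sell=" + t.sellPrice);
    }
}
```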

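IV. Approach

Given the requirements, Hive is a good fit for the analysis.

1. First import the dataset total.csv into Hive, using the date as the partition key of a partitioned table.

2. Pick a stock code (stockid) and compute that product's highest and lowest price for each day.

3. Using one minute as the smallest unit, compute the per-minute average price of the chosen stock for each day.

V. Step-by-Step Walkthrough

Step 1: Import the data into Hive

In Hive, create the stock table.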

hive> create table if not exists stock(tradedate string, tradetime string, stockid string, buyprice double, buysize int, sellprice string, sellsize int) row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 0.207 seconds

hive> desc stock;
OK
tradedate               string
tradetime               string
stockid                 string
buyprice                double
buysize                 int
sellprice               string
sellsize                int
Time taken: 0.147 seconds, Fetched: 7 row(s)

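Upload the historical stock data to the server (here with rz) and load it into Hive from the local file system. Note that load data local inpath reads a local path, not an HDFS path.

[hadoop@master bin]$ cd /home/hadoop/test/
[hadoop@master test]$ sudo rz

hive> load data local inpath '/home/hadoop/test/stock.csv' into table stock;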

Create the partitioned table stock_partition, using the date as the partition key.

hive> create table if not exists stock_partition(tradetime string, stockid string, buyprice double, buysize int, sellprice string, sellsize int) partitioned by (tradedate string) row format delimited fields terminated by ',';
OK
Time taken: 0.112 seconds

hive> desc stock_partition;
OK
tradetime               string
stockid                 string
buyprice                double
buysize                 int
sellprice               string
sellsize                int
tradedate               string

# Partition Information
# col_name                data_type               comment

tradedate               string

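To use dynamic partitioning, first run:

hive> set hive.exec.dynamic.partition.mode=nonstrict;

Then use a dynamic-partition insert to load the data from the stock table into stock_partition. The partition column tradedate must come last in the select list.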

hive> insert overwrite table stock_partition partition(tradedate) select tradetime, stockid, buyprice, buysize, sellprice, sellsize, tradedate from stock distribute by tradedate;

Query ID = hadoop_20180524122020_f7a1b61a-84ed--a37e-64ef9c3abc5f
Total jobs =
Launching Job  out of
Number of reduce tasks not specified. Estimated from input data size:
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1527103938304_0002, Tracking URL = http://master:8088/proxy/application_1527103938304_0002/
Kill Command = /opt/modules/hadoop-2.6./bin/hadoop job  -kill job_1527103938304_0002
Hadoop job information for Stage-: number of mappers: ; number of reducers:
-- ::, Stage- map = %,  reduce = %
-- ::, Stage- map = %,  reduce = %, Cumulative CPU 2.19 sec
-- ::, Stage- map = %,  reduce = %, Cumulative CPU 5.87 sec
MapReduce Total cumulative CPU time:  seconds  msec
Ended Job = job_1527103938304_0002
Loading data to table default.stock_partition partition (tradedate=null)
     Time taken for load dynamic partitions :
    Loading partition {tradedate=}
    Loading partition {tradedate=}
    Loading partition {tradedate=}
    Loading partition {tradedate=}
    Loading partition {tradedate=}
     Time taken for adding to write entity :
Partition default.stock_partition{tradedate=} stats: [numFiles=, numRows=, totalSize=, rawDataSize=]
Partition default.stock_partition{tradedate=} stats: [numFiles=, numRows=, totalSize=, rawDataSize=]
Partition default.stock_partition{tradedate=} stats: [numFiles=, numRows=, totalSize=, rawDataSize=]
Partition default.stock_partition{tradedate=} stats: [numFiles=, numRows=, totalSize=, rawDataSize=]
Partition default.stock_partition{tradedate=} stats: [numFiles=, numRows=, totalSize=, rawDataSize=]
MapReduce Jobs Launched:
Stage-Stage-: Map:   Reduce:    Cumulative CPU: 5.87 sec   HDFS Read:  HDFS Write:  SUCCESS
Total MapReduce CPU Time Spent:  seconds  msec
OK
Time taken: 39.826 seconds
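Conceptually, the dynamic-partition insert routes every row into a bucket named after its tradedate value and stores the remaining columns inside that bucket. The plain-Java sketch below mimics only that routing, to show why the job reports one "Loading partition" line per distinct date. It is not how Hive is implemented; the class name DynamicPartitionSketch and the sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of partition routing: first CSV field = tradedate. Hypothetical helper. */
final class DynamicPartitionSketch {

    /** Returns partition name (tradedate=VALUE) -> rows with the tradedate field removed. */
    static Map<String, List<String>> partition(List<String> csvLines) {
        Map<String, List<String>> partitions = new TreeMap<>();
        for (String line : csvLines) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                throw new IllegalArgumentException("not a CSV row: " + line);
            }
            String tradeDate = line.substring(0, comma);
            String otherColumns = line.substring(comma + 1);
            // One bucket per distinct date, created on first sight of that date.
            partitions.computeIfAbsent("tradedate=" + tradeDate, k -> new ArrayList<>())
                      .add(otherColumns);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Made-up rows, not taken from the real dataset.
        List<String> rows = Arrays.asList(
                "20130902,093000,204001,4.05,100,4.06,200",
                "20130903,093000,204001,4.40,100,4.48,200",
                "20130902,093100,204001,4.00,100,4.02,200");
        partition(rows).forEach((name, data) -> System.out.println(name + " -> " + data));
    }
}
```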

Step 2: Write custom Hive UDFs and compute the daily high and low of stock 204001

A custom Hive UDF, Max, that returns the larger of two values.

package zimo.hadoop.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Custom UDF that returns the larger of two values.
 *
 * @author Zimo
 */
public class Max extends UDF {
    // A NULL argument is treated as 0.0.
    public Double evaluate(Double a, Double b) {
        if (a == null)
            a = 0.0;
        if (b == null)
            b = 0.0;
        if (a >= b) {
            return a;
        } else {
            return b;
        }
    }
}

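A custom Hive UDF, Min, that returns the smaller of two values.

package zimo.hadoop.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Custom UDF that returns the smaller of two values.
 *
 * @author Zimo
 */
public class Min extends UDF {
    // A NULL argument is treated as 0.0, so a missing price yields a minimum of 0.0.
    public Double evaluate(Double a, Double b) {
        if (a == null)
            a = 0.0;
        if (b == null)
            b = 0.0;
        if (a >= b) {
            return b;
        } else {
            return a;
        }
    }
}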

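Package the custom Max and Min classes as maxUDF.jar and minUDF.jar, upload them to a jar/ directory under $HIVE_HOME (here /opt/modules/hive1.0.0/jar), and add them to Hive as custom UDFs.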

[hadoop@master ~]$ cd $HIVE_HOME
[hadoop@master hive1.0.0]$ sudo mkdir jar/
[hadoop@master hive1.0.0]$ ll
total
drwxr-xr-x  hadoop hadoop    May  : bin
drwxr-xr-x  hadoop hadoop    May  : conf
drwxr-xr-x  hadoop hadoop    May  : examples
drwxr-xr-x  hadoop hadoop    May  : hcatalog
drwxrwxr-x  hadoop hadoop    May  : iotmp
drwxr-xr-x  root   root      May  : jar
drwxr-xr-x  hadoop hadoop    May  : lib
-rw-r--r--  hadoop hadoop   Jan    LICENSE
drwxr-xr-x  hadoop hadoop    May  : logs
-rw-r--r--  hadoop hadoop     Jan    NOTICE
-rw-r--r--  hadoop hadoop    Jan    README.txt
-rw-r--r--  hadoop hadoop  Jan    RELEASE_NOTES.txt
drwxr-xr-x  hadoop hadoop    May  : scripts
[hadoop@master hive1.0.0]$ cd jar/
[hadoop@master jar]$ sudo rz
[hadoop@master jar]$ ll
total
-rw-r--r--  root root  May    maxUDF.jar
-rw-r--r--  root root  May    minUDF.jar

hive> add jar /opt/modules/hive1.0.0/jar/maxUDF.jar;
Added [/opt/modules/hive1.0.0/jar/maxUDF.jar] to class path
Added resources: [/opt/modules/hive1.0.0/jar/maxUDF.jar]

hive> add jar /opt/modules/hive1.0.0/jar/minUDF.jar;
Added [/opt/modules/hive1.0.0/jar/minUDF.jar] to class path
Added resources: [/opt/modules/hive1.0.0/jar/minUDF.jar]

Create the temporary Hive functions maxprice and minprice from the custom classes.

hive> create temporary function maxprice as 'zimo.hadoop.hive.Max';
OK
Time taken: 0.009 seconds

hive> create temporary function minprice as 'zimo.hadoop.hive.Min';
OK
Time taken: 0.004 seconds

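Compute the highest and lowest price of stock 204001 for each day. Every non-aggregated column in the select list must also appear in group by, so the query groups by both stockid and tradedate.

hive> select stockid, tradedate, max(maxprice(buyprice,sellprice)), min(minprice(buyprice,sellprice)) from stock_partition where stockid='204001' group by stockid, tradedate;

                              4.05    0.0
                              4.48    2.2
                              4.65    2.205
                              11.9    8.7
                              12.3    5.2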
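The 0.0 low in the first result row may simply be a side effect of the UDFs mapping NULL to 0.0 rather than a real traded price; we cannot tell from the output alone. If missing prices should be ignored instead, one option is to return the other argument when one side is NULL. The sketch below shows only that comparison logic in plain Java, without the Hive UDF base class, so it can be run on its own; NullSafeMinMax is a hypothetical name. Since Hive's built-in min() and max() aggregates skip NULL values, returning null when both inputs are missing would keep such rows out of the daily result.

```java
/** Null-ignoring pairwise min/max; a hypothetical alternative to the UDF logic above. */
final class NullSafeMinMax {

    /** Smaller of a and b, ignoring a null side; null only if both are null. */
    static Double min(Double a, Double b) {
        if (a == null) return b;
        if (b == null) return a;
        return Math.min(a, b);
    }

    /** Larger of a and b, ignoring a null side; null only if both are null. */
    static Double max(Double a, Double b) {
        if (a == null) return b;
        if (b == null) return a;
        return Math.max(a, b);
    }

    public static void main(String[] args) {
        System.out.println(min(null, 2.2));   // the missing side is ignored
        System.out.println(min(4.05, 4.0));
        System.out.println(max(4.05, null));
        System.out.println(min(null, null));  // stays null
    }
}
```

To use this inside Hive, the same two-branch check would replace the "treat NULL as 0.0" lines in the evaluate methods of Max and Min.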

Step 3: Compute the per-minute average price

Compute the average price of stock 204001 for every minute of every day. The first four characters of tradetime serve as the minute key; Hive string positions start at 1, so the substring starts at position 1.

hive> select stockid, tradedate, substring(tradetime,1,4), sum(buyprice+sellprice)/(count(*)*2) from stock_partition where stockid='204001' group by stockid, tradedate, substring(tradetime,1,4);

                                  9.94375
                                  9.959999999999999
                                  10.046666666666667
                                  10.111041666666667
                                  10.132500000000002
                                  10.181458333333333
                                  10.180625
                                  10.20340909090909
                                  10.287291666666667
                                  10.331041666666668
                                  10.342500000000001
                                  10.344375
                                  10.385
                                  10.532083333333333
                                  10.621041666666667
                                  10.697291666666667
                                  10.702916666666667
                                  10.78
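The per-minute query boils down to bucketing rows by the first four characters of tradetime and dividing the sum of buy and sell prices by twice the row count. The sketch below replays that arithmetic in plain Java for a single stock and day, and then picks the minute with the highest average, which is what the requirements section is after. It assumes tradetime starts with a four-character hour-and-minute prefix; the class name MinuteAverage and the sample values are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Per-minute average of (buy + sell) / 2 for one stock and one day. Hypothetical helper. */
final class MinuteAverage {
    // minute key -> { sum of buy + sell prices, number of prices summed }
    private final Map<String, double[]> sums = new TreeMap<>();

    /** Adds one row; tradeTime must have at least 4 characters (assumed HHmm prefix). */
    void add(String tradeTime, double buyPrice, double sellPrice) {
        String minute = tradeTime.substring(0, 4);
        double[] s = sums.computeIfAbsent(minute, k -> new double[2]);
        s[0] += buyPrice + sellPrice;
        s[1] += 2; // two prices per row, matching count(*) * 2 in the query
    }

    /** Returns minute key -> average price, sorted by minute. */
    Map<String, Double> averages() {
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, double[]> e : sums.entrySet()) {
            result.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return result;
    }

    /** Returns the minute with the highest average, or null if there is no data. */
    static String peakMinute(Map<String, Double> averages) {
        String best = null;
        for (Map.Entry<String, Double> e : averages.entrySet()) {
            if (best == null || e.getValue() > averages.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MinuteAverage m = new MinuteAverage();
        // Made-up rows, not taken from the real dataset.
        m.add("093000", 10.0, 10.2);
        m.add("093030", 10.2, 10.4);
        m.add("093100", 10.4, 10.6);
        Map<String, Double> avg = m.averages();
        System.out.println(avg);
        System.out.println("peak minute: " + peakMinute(avg));
    }
}
```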

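That covers the main content of this section. It is all drawn from my own learning process, and I hope it offers some guidance. If you found it useful, a like would be appreciated; if not, please bear with me, and do point out any mistakes. Follow the blog to get new posts as soon as they are published. Thanks!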