STREAMING HIVE流过滤官网例子注意中间用的py脚本

梓沂 2024-08-23 15:19:10 原文

Simple Example Use Cases

MovieLens User Ratings

First, create a table with tab-delimited text file format:

CREATE TABLE u_data (

  userid INT,

  movieid INT,

  rating INT,

  unixtime STRING)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t'

STORED AS TEXTFILE;

Then, download the data files from MovieLens 100k on the GroupLens datasets page (which also has a README.txt file and index of unzipped files):

wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

or:

curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip

Note: If the link to GroupLens datasets does not work, please report it on HIVE-5341 or send a message to the user@hive.apache.org mailing list.

Unzip the data files:

unzip ml-100k.zip

And load u.data into the table that was just created:

LOAD DATA LOCAL INPATH '<path>/u.data'

OVERWRITE INTO TABLE u_data;

Count the number of rows in table u_data:

SELECT COUNT(*) FROM u_data;

Note that for older versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

Now we can do some complex data analysis on the table u_data:

Create weekday_mapper.py:

import sys

import datetime

for line in sys.stdin:

  line = line.strip()

  userid, movieid, rating, unixtime = line.split('\t')

  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()

  print '\t'.join([userid, movieid, rating, str(weekday)])

https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DDLOperations

Use the mapper script:

CREATE TABLE u_data_new (

  userid INT,

  movieid INT,

  rating INT,

  weekday INT)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new

SELECT

  TRANSFORM (userid, movieid, rating, unixtime)

  USING 'python weekday_mapper.py'

  AS (userid, movieid, rating, weekday)

FROM u_data;

SELECT weekday, COUNT(*)

FROM u_data_new

GROUP BY weekday;

STREAMING HIVE流过滤官网例子注意中间用的py脚本的更多相关文章

OpenLayers 官网例子的中文详解
https://segmentfault.com/a/1190000009679800?utm_source=tag-newest 当你希望实现某种功能的时候,即使你对 openlayers 几乎一窍 ...
针对Openlayer3官网例子的简介
网址:http://openlayers.org/en/latest/examples/ 如果大家想了解ol3能做什么,或者说已提供的API有什么,又闲一个个翻例子跟API累的话,就看看这个吧. 1. ...
Vue组件化应用构建官网例子 Unknown custom element: <todo-item>
[博客园cnblogs笔者m-yb原创,转载请加本文博客链接,笔者github: https://github.com/mayangbo666,公众号aandb7,QQ群927113708] htt ...
【转】一个lucene的官网例子
创建索引: import java.io.BufferedReader; import java.io.File; import java.io.FileInputStream; import jav ...
导航条且手机版.html——仿照官网例子
<!doctype html> <html> <head> <meta charset="utf-8"> <title> ...
官网例子，mt-field password获取不到
新尝试了Mint-UI,在使用表单组件Field时, 直接从demo中拷贝了如下代码: <mt-field label="username" placeholder=&quo ...
three.js的wave特效（ivew官网首页波浪特效实现）
查看效果请访问:https://521lbx.github.io/Web3D/index.html公司的好几个vue项目都是用ivew作为UI框架,所以ivew官网时不时就得逛一圈.每一次进首页都会被 ...
Java微信扫描支付模式二Demo ,整合官网直接运行版本
概述场景介绍用户使用微信“扫一扫”扫描二维码后,获取商品支付信息,引导用户完成支付. 详细代码下载:http://www.demodashi.com/demo/13880.html 一.相关配置 ...
【慕课网实战】Spark Streaming实时流处理项目实战笔记十四之铭文升级版
铭文一级: 第11章 Spark Streaming整合Flume&Kafka打造通用流处理基础 streaming.conf agent1.sources=avro-sourceagent1 ...

随机推荐

zookeep服务启动命令
zookeeper的安装目录:/usr/local/zookeeper-3.4.6/bin/zkServer.sh; 配置文件路径:../conf/zoo.cfg 端口 :2181: ZooKeepe ...
unity发布的WebGL部署到IIS
一.创建WebGL代码在win7下,Unity3D中发布WebGL,然后部署到IIS,只要代码是对,关键是添加mime类型二.为网站添加mime类型 .json text/json .unity3 ...
【easy】111. Minimum Depth of Binary Tree求二叉树的最小深度
求二叉树的最小深度: /** * Definition for a binary tree node. * struct TreeNode { * int val; * TreeNode *left; ...
【原创】大数据基础之Flink（1）简介、安装、使用
Flink 1.7 官方:https://flink.apache.org/ 一简介 Apache Flink is an open source platform for distributed ...
Visual Studio 2015中没有“安装和部署”的解决方法
使用Visual Studio 2015 Community新建项目,在已安装模板中的“其它项目类型”下未找到“安装和部署”选项.在微软官网下载 Microsoft Visual Studio 201 ...
Vue项目中使用webpack配置了别名，引入的时候报错
chainWebpack(config) { config.resolve.alias .set('@', resolve('src')) .set('assets', resolve('src/as ...
配置php5.6.4 + Apache2.4.10
一.下载并安装apache 下载地址:www.apachelounge.com 解压后:执行以下命令: #httpd.exe –k install #httpd.exe -k start 在执行过程中 ...
python---哈希算法实现
# coding = utf-8 class Array: def __init__(self, size=32, init=None): self._size = size self._items ...
RxJS操作符（一）
一.创建类操作符创建类操作符是连接传统编程和响应式编程的强梁 from: 可以把数组.Promise.以及Iterable转化为Observable. fromEvent: 可以把事件转化为Obse ...
HTML 中的预留字符(如标签的小于号 < )必须被替换为字符实体( < )。不间断空格( )
1. 参考 HTML 字符实体 Python处理HTML转义字符比方说一个从网页中抓到的字符串 html = '<abc>' 用Python可以这样处理: import HTMLPars ...