pig简单的代码实例:报表统计行业中的点击和曝光量
注意:pig中用run或者exec 运行脚本。除了cd和ls,其他命令不用。在本代码中用rm和mv命令做例子,容易出错。
另外,pig只有在store或dump时候才会真正加载数据,否则,只是加载代码,不具体操作数据。所以在rm操作时必须注意该文件是否已经生成。如果rm的文件为生成,可以第三文件,进行mv改名操作
SET job.name 'test_age_reporth_istorical';-- 定义任务名字,在http://172.XX.XX.XX:50030/jobtracker.jsp中查看任务状态,失败成功。
SET job.priority HIGH;--优先级
--注册jar包,用于读取sequence file和输出分析结果文件
REGISTER piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader(); --读取二进制文件,函数名定义
%default Cleaned_Log /user/C/data/XXX/cleaned/$date/*/part* --$date是外部传入参数
%default AD_Data /user/XXX/data/xxx/metadata/ad/part*
%default Campaign_Data /user/xxx/data/xxx/metadata/campaign/part*
%default Social_Data /user/xxx/data/report/socialdata/part*
--所有的输出文件路径:
%default Industry_Path $file_path/report/historical/age/$year/industry
%default Industry_SUM $file_path/report/historical/age/$year/industry_sum
%default Industry_TMP $file_path/report/historical/age/$year/industry_tmp
%default Industry_Brand_Path $file_path/report/historical/age/$year/industry_brand
%default Industry_Brand_SUM $file_path/report/historical/age/$year/industry_brand_sum
%default Industry_Brand_TMP $file_path/report/historical/age/$year/industry_brand_tmp
%default ALL_Path $file_path/report/historical/age/$year/all
%default ALL_SUM $file_path/report/historical/age/$year/all_sum
%default ALL_TMP $file_path/report/historical/age/$year/all_tmp
%default output_path /user/xxx/tmp/result
origin_cleaned_data = LOAD '$Cleaned_Log' USING PigStorage(',') --读取日志文件
AS (ad_network_id:chararray,
xxx_ad_id:chararray,
guid:chararray,
id:chararray,
create_time:chararray,
action_time:chararray,
log_type:chararray,
ad_id:chararray,
positioning_method:chararray,
location_accuracy:chararray,
lat:chararray,
lon:chararray,
cell_id:chararray,
lac:chararray,
mcc:chararray,
mnc:chararray,
ip:chararray,
connection_type:chararray,
android_id:chararray,
android_advertising_id:chararray,
openudid:chararray,
mac_address:chararray,
uid:chararray,
density:chararray,
screen_height:chararray,
screen_width:chararray,
user_agent:chararray,
app_id:chararray,
app_category_id:chararray,
device_model_id:chararray,
carrier_id:chararray,
os_id:chararray,
device_type:chararray,
os_version:chararray,
country_region_id:chararray,
province_region_id:chararray,
city_region_id:chararray,
ip_lat:chararray,
ip_lon:chararray,
quadkey:chararray);
--loading metadata/ad(adId,campaignId)
metadata_ad = LOAD '$AD_Data' USING PigStorage(',') AS (adId:chararray, campaignId:chararray);
--loading metadata/campaignæ•°æ®(campaignId, industryId, brandId)
metadata_campaign = LOAD '$Campaign_Data' USING PigStorage(',') AS (campaignId:chararray, industryId:chararray, brandId:chararray);
--ad and campaign for inner join
joinAdCampaignByCampaignId = JOIN metadata_ad BY campaignId,metadata_campaign BY campaignId;--(adId,campaignId,campaignId,industryId,brandId)
--filtering out redundant column of joinAdCampaignByCampaignId
joined_ad_campaign_data = FOREACH joinAdCampaignByCampaignId GENERATE $0 AS adId,$3 AS industryId,$4 AS brandId; --(adId,industryId,brandId)
--extract column for analyzing
origin_historical_age = FOREACH origin_cleaned_data GENERATE xxx_ad_id,guid,log_type;--(xxx_ad_id,guid,log_type)
--distinct
distinct_origin_historical_age = DISTINCT origin_historical_age;--(xxx_ad_id,guid,log_type)
--loading metadata_region(guid_social, sex, age, income, edu, hobby)
metadata_social = LOAD '$Social_Data' USING PigStorage(',') AS (guid_social:chararray, sex:chararray, age:chararray, income:chararray, edu:chararray, hobby:chararray);
--extract needed column in metadata_social
social_age = FOREACH metadata_social GENERATE guid_social,age;
--join socialData(metadata_social) and logData(distinct_origin_historical_age):
joinedByGUID = JOIN social_age BY guid_social, distinct_origin_historical_age BY guid;
--(guid_social, age; xxx_ad_id,guid,log_type)
--generating analyzing age data
joined_orgin_age_data = FOREACH joinedByGUID GENERATE xxx_ad_id,guid,log_type,age;
joinedByAdId = JOIN joined_ad_campaign_data BY adId, joined_orgin_age_data BY xxx_ad_id; --(adId,industryId,brandId,xxx_ad_id,guid,log_type,age)
--filtering
all_current_data = FOREACH joinedByAdId GENERATE guid,log_type,industryId,brandId,age; --(guid,log_type,industryId,brandId,age)
--for industry analyzing
industry_current_data = FOREACH all_current_data GENERATE industryId,guid,age,log_type; --(industryId,guid,age,log_type)
--load all in the path "industry"
industry_existed_Data = LOAD '$Industry_Path' USING PigStorage(',') AS (industryId:chararray,guid:chararray,age:chararray,log_type:chararray);
--merge with history data
union_Industry = UNION industry_existed_Data, industry_current_data;
distict_union_industry = DISTINCT union_Industry;
group_industry = GROUP distict_union_industry BY ($2,$0,$3);
count_guid_for_industry = FOREACH group_industry GENERATE FLATTEN(group),COUNT($1.$1);
rm $Industry_SUM;
STORE count_guid_for_industry INTO '$Industry_SUM' USING PigStorage(',');
--storing union industry data(current and history)
STORE distict_union_industry INTO '$Industry_TMP' USING PigStorage(',');
rm $Industry_Path
mv $Industry_TMP $Industry_Path
--counting guid for industry and brand
industry_brand_current = FOREACH all_current_data GENERATE age,industryId,brandId,log_type,guid;
--(age,industryId,brandId,log_type,guid)
--load history data of industry_brand
industry_brand_history = LOAD '$Industry_Brand_Path' USING PigStorage(',') AS(age:chararray, industryId:chararray, brandId:chararray, log_type:chararray, guid:chararray);
--union all data of industry_brand
union_industry_brand = UNION industry_brand_current,industry_brand_history;
unique_industry_brand = DISTINCT union_industry_brand;
--(age,industryId,brandId,log_type,guid)
--counting users' number for industry and brand
group_industry_brand = GROUP unique_industry_brand BY ($0,$1,$2,$3);
count_guid_for_industry_brand = FOREACH group_industry_brand GENERATE FLATTEN(group),COUNT($1.$4);
rm $Industry_Brand_SUM;
STORE count_guid_for_industry_brand INTO '$Industry_Brand_SUM' USING PigStorage(',');
STORE unique_industry_brand INTO '$Industry_Brand_TMP' USING PigStorage(',');
rm $Industry_Brand_Path;
mv $Industry_Brand_TMP $Industry_Brand_Path
--counting user number for age and logtype
current_data = FOREACH all_current_data GENERATE age,log_type,guid;--(age,log_type,guid)
--load history data of age and logtype
history_data = LOAD '$ALL_Path' USING PigStorage(',') AS(age:chararray,log_type:chararray,guid:chararray);
--union current and history data
union_all_data = UNION history_data, current_data;
unique_all_data = DISTINCT union_all_data;
--count users' number
group_all_data = GROUP unique_all_data BY ($0,$1);
count_guid_for_age_logtype = FOREACH group_all_data GENERATE FLATTEN(group),COUNT($1.$2);
rm $ALL_SUM;
STORE count_guid_for_age_logtype INTO '$ALL_SUM' USING PigStorage(',');
STORE unique_all_data INTO '$ALL_TMP' USING PigStorage(',');
rm $ALL_Path
mv $ALL_TMP $ALL_Path
pig简单的代码实例:报表统计行业中的点击和曝光量的更多相关文章
- JS代码实用代码实例(输入框监听,点击显示点击其他地方消失,文件本地预览上传)
前段时间写前端,遇到一些模块非常有用,总结以备后用 一.input框字数监听 <!DOCTYPE html> <html lang="en"> <he ...
- Redis:安装、配置、操作和简单代码实例(C语言Client端)
Redis:安装.配置.操作和简单代码实例(C语言Client端) - hj19870806的专栏 - 博客频道 - CSDN.NET Redis:安装.配置.操作和简单代码实例(C语言Client端 ...
- Python模拟登陆淘宝并统计淘宝消费情况的代码实例分享
Python模拟登陆淘宝并统计淘宝消费情况的代码实例分享 支付宝十年账单上的数字有点吓人,但它统计的项目太多,只是想看看到底单纯在淘宝上支出了多少,于是写了段脚本,统计任意时间段淘宝订单的消费情况,看 ...
- 使用ssm(spring+springMVC+mybatis)创建一个简单的查询实例(二)(代码篇)
这篇是上一篇的延续: 用ssm(spring+springMVC+mybatis)创建一个简单的查询实例(一) 源代码在github上可以下载,地址:https://github.com/guoxia ...
- c#之GDI简单实现代码及其实例
作业:文档形式 3到5页理解 1.理解 2.源代码解释(1到2页) 3.实现效果 项目地址: https://github.com/zhiyishou/polyer Demo:https://zhiy ...
- 审核流(3)低调奢华,简单不凡,实例演示-SNF.WorkFlow--SNF快速开发平台3.1
下面我们就从什么都没有,结合审核流进行演示实例.从无到有如何快速完美的实现,然而如此简单.低调而奢华,简单而不凡. 从只有数据表通过SNF.CodeGenerator代码生成器快速生成单据并与审核流进 ...
- bzoj P1058 [ZJOI2007]报表统计——solution
1058: [ZJOI2007]报表统计 Time Limit: 15 Sec Memory Limit: 162 MB Submit: 4099 Solved: 1390 [Submit][St ...
- input文本框实现宽度自适应代码实例
代码实例如下: <!DOCTYPE html> <html><head><meta charset="utf-8"><meta ...
- jQuery实现的鼠标滑过切换图片代码实例
jQuery实现的鼠标滑过切换图片代码实例:有时候网页需要这样的简单效果,那就是当鼠标滑过默认图片的时候,能够实现图片的切换,可能在实际应用中,往往没有这么简单,不过大家可以自行扩展一下,下面简单介绍 ...
随机推荐
- js数组求差集
var arr1 = [2,3,5,88,99,444,66];var arr2 = [2,88,66]; arr_dive(arr1,arr2); function arr_dive(aArr,bA ...
- iOS核心面试题
1,请简述你对协议的理解? protocol无论是在那个领域都是一种约束,规范.在OC中的协议主要用于在各个类之间进行回调传值. 协议有 委托方,代理方, 委托方是协议的制定者,需要声明协议的方 ...
- IOS JavaScriptCore介绍
本文主要转自:https://www.jianshu.com/p/cdaf9bc3d65d http://blog.csdn.net/u011993697/article/details/515772 ...
- 小程序敏感信息解密-java
/** * AES解密 * @param content 密文 * @return * @throws InvalidAlgorithmParameterException * @throws NoS ...
- day04 Java Web 开发入门
day04 Java Web 开发入门 1. web 开发相关介绍 2. web 服务器 3. Tomcat服务器启动的问题 4. Tomcat目录结构 5. Web应用程序(虚拟目录映射,缺省web ...
- 关于熊猫认证软件IOS安装步骤教程(适用于其他软件)
IOS运行企业版应用教程 1.扫描二维码之后微信进入界面,如下图所示:点击右上角三个点 2.弹出分享界面,如图所示:点击苹果自带浏览器(sarfari) 3.进入苹果自带浏览器后如图所示, ...
- Python中的数据类型
计算机顾名思义就是可以做数学计算的机器,因此,计算机程序理所当然地可以处理各种数值.但是,计算机能处理的远不止数值,还可以处理文本.图形.音频.视频.网页等各种各样的数据,不同的数据,需要定义不同的数 ...
- css 中calc无效属性值问题
width:calc(50%-20px); 这样书写是无效的因为calc中计算的两个因子同运算符号之间必须存在空格:
- Android签名与权限的安全问题(3)
签名和权限的作用 Android签名中使用到的一些加密技术有: 公/私钥, SHA1(CERT.SF,MANIFEST.MF), RSA(CERT.RSA), 消息摘要, 移动平台中的主流签名作用: ...
- Tomcat怎么实现异步Servlet
有时Servlet在生成响应报文前必须等待某些耗时的操作,比如在等待一个可用的JDBC连接或等待一个远程Web服务的响应.对于这种情况servlet规范中定义了异步处理方式,由于Servlet中等待阻 ...