Using shell for ETL data validation

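The approach is as follows:

  First sort out the types of checks; each type needs a different configuration entry.

    1: Data-increment check: set the table name and the increment field.

    2: Illegal-value check: set the table name, the condition, the field to check, and the range of legal/illegal values.

    3: Custom check: set the table name, the check name, and a custom SQL statement.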
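The entry layout described above (check type first, table name second, everything else last) can be split with a single `read`. This is a minimal sketch, not the author's code: `parse_conf_line` is a hypothetical helper name, and the sample lines only imitate the configuration format.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): split one
# configuration line into check type, table name and the remainder.
parse_conf_line() {
    local line="$1"
    local check_type table_name rest
    # read splits on whitespace; the last variable receives whatever is left,
    # so a partition field or a whole SQL statement ends up in "rest".
    read -r check_type table_name rest <<< "$line"
    printf '%s|%s|%s\n' "$check_type" "$table_name" "$rest"
}

# A record entry with a partition field, one without, and a custom SQL entry.
parse_conf_line "record dm_index index_date"
parse_conf_line "record dm_event_summary"
parse_conf_line "primary_check dm_x select id as row_val from dm_x where dt='_##dt##_'"
```

Compared with calling `cut` three times per line, one `read` keeps the spaces inside the SQL statement intact and leaves the third field empty when an entry has no partition field.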

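  Parameter parsing:

    Wrap each parameter in a special-character prefix and suffix so that the script can detect and replace it easily. In the script below the prefix is _## and the suffix is ##_, so the date parameter is written as _##dt##_.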
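The substitution itself can be sketched with bash pattern replacement. The prefix and suffix values match the ones the script defines; `fill_placeholder` is a hypothetical helper name used only for illustration, and the SQL text is a made-up sample.

```shell
#!/bin/bash
args_prefix="_##"
args_suffix="##_"

# Hypothetical helper: replace every occurrence of <prefix>name<suffix>
# in a template with the given value.
fill_placeholder() {
    local template="$1" name="$2" value="$3"
    local token="${args_prefix}${name}${args_suffix}"
    # Quoting "$token" makes bash match it literally instead of as a glob pattern.
    printf '%s\n' "${template//"$token"/$value}"
}

fill_placeholder "select count(1) from t where index_date='_##dt##_'" dt 2020-01-01
```

Because the token is unlikely to appear in ordinary SQL, a template without the placeholder passes through unchanged.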

  The resulting files are as follows:

  Configuration file:

dm_monitor_list.conf

 record dm_box_office_summary index_date

 record dm_channel_index index_date

 record dm_comment_emotion_summary

 record dm_comment_keyword_summary

 record dm_comment_meterial dt

 record dm_event_meterial index_date

 record dm_event_meterial_comment dt

 record dm_event_summary

 record dm_index index_date

 record dm_main_actor_index index_date

 record dm_movie_comment_summary index_date

 record dm_movie_wish_rating_summary dt

 record dm_voice_meterial dt

 record dm_index_date

 record dm_comment_keyword_base

 record dm_index_base

 record dm_event_meterial_base

 primary_check dm_box_office_summary select concat(movie_id,":",rating_type,":",count(1)) as row_val from dm_box_office_summary where index_date='_##dt##_' group by movie_id,rating_type having count(1) >1

 primary_check dm_channel_index select concat(movie_id,":",count(1)) as row_val from dm_channel_index where datediff('_##dt##_',index_date)=1 group by movie_id having count(1)>1

 primary_check dm_box_office_summary select concat(movie_id,":",index_date,":",value) as row_val from dm_box_office_summary  where index_date='_##dt##_' and value<=0

 primary_check dm_channel_index select concat(movie_id,":",count(1)) as row_val from dm_channel_index where datediff('_##dt##_',index_date)=1 group by movie_id having count(1)>1

 primary_check dm_comment_emotion_summary select concat(movie_id,":",mood_type,":",platform_id,":",channel_id,":",index_date,":",count(1)) as row_val from dm_comment_emotion_summary group by movie_id,mood_type,platform_id,channel_id,index_date having count(1)>1

 primary_check dm_comment_keyword_summary select concat(movie_id,":",mood_type,":",keyword,":",platform_id,":",channel_id,":",index_date,":",count(1)) as row_val from dm_comment_keyword_summary group by movie_id,mood_type,keyword,platform_id,channel_id,index_date having count(1)>1

 primary_check dm_comment_meterial select concat(comment_id,":",count(1)) as row_val from dm_comment_meterial where dt="_##dt##_" group by comment_id having count(1)>1

 primary_check dm_event_meterial select concat(material_url,":",count(1)) as row_val from dm_event_meterial where index_date='_##dt##_' and index_type=1 group by material_url having count(1)>1

 primary_check dm_event_meterial_comment select concat(comment_id,":",count(1)) as row_val from dm_event_meterial_comment where dt='_##dt##_' group by comment_id having count(1)>1

 primary_check dm_event_summary select concat(event_id,":",platform_id,":",channel_id,":",index_date,":",count(1)) as row_val from dm_event_summary group by event_id,platform_id,channel_id,index_date having count(1)>1

  Script file: monitor.sh

 #!/bin/bash

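 # Analyze the data-volume status of each table

 # . Data uniqueness

 #   movie id must be unique

 # . Indicator correctness

 #   increments must not be less than 0; full-load tables should not get smaller

 # . Basic status

 ##  number of records processed, number of movies, storage space

 # Log format:

 ## tablename    dt  check_type  value   insert_date

 ## check_type :

 ### record: record count;

 ### movie_num: number of movies;

 ### space: storage space used

 ### diff: difference in movies between yesterday and today; 01 means present today but not yesterday, 10 means present yesterday but not today

 ### movie_rep: number of duplicated movies

 ### index-*: the increment of some indicator is negative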

 basepath=$(cd "$(dirname "$0")"; pwd)

 cd $basepath

 source /etc/profile

 source ../../etc/env.ini

 if [[ ! -f "$basepath/monitor_list.conf" ]]; then

     echo "check monitor list file not exists. system exit."

     exit

 fi

 #config

 # partition (defaults to yesterday; an explicit date may be passed as the first argument)

 dt=$(date -d "-1 day" "+%Y-%m-%d")

 if [[ $# -eq 1 ]]; then

     dt=$(date -d "$1" "+%Y-%m-%d")

 fi

 insert_date=$(date "+%Y-%m-%d")

 file_path=$OPERATION_LOG_FILE_PATH

 log_name=monitor_data.log

 log=${file_path}/${log_name}

 cat $basepath/monitor_list.conf | while read line

 do

     check_type=`echo $line | cut -d " " -f 1`

     table_name=`echo $line |cut -d " " -f 2`

     profix=$table_name"\t"$insert_date"\t"

     if [[ $check_type == 'dw' ]];then

         DB=$HIVE_DB_DW

         hdfs_path=$HADOOP_DW_DATA_DESC

     elif [[ $check_type == 'ods' ]];then

         DB=$HIVE_DB_ODS_S

         hdfs_path=$HADOOP_ODS_S_DATA_DESC

     fi

     #record

     record=$(spark-sql -e "select count(1) from $DB.$table_name where dt = '$dt';")

     echo -e $profix"record\t"$record >> $log

     #movie_num

     if [[ $table_name == 'dw_weibo_materials' ]];then

         mtime_id="movie_id"

     elif [[ $table_name == 'g4_weibo_materiel_post' ]];then

         mtime_id='x_movie_id'

     else

         mtime_id="mtime_id"

     fi

     if [[ $table_name == 'dw_weibo_user' ]];then

         movie_num=$(hive -e "select count(1) from

             (select mtime_actor_id from $DB.$table_name where dt = '$dt' and source = 'govwb' group by mtime_actor_id) a")

     else

         movie_num=$(spark-sql -e "select count(1) from (select $mtime_id from $DB.$table_name where dt = '$dt' group by $mtime_id) a")

     fi

     echo -e $profix"movie_num\t"$movie_num >> $log

     #space

     if [[ $check_type == 'ods' ]];then

         space=$(hadoop fs -du $hdfs_path/$table_name/$dt)

     else

         space=$(hadoop fs -du $hdfs_path/$table_name/dt=$dt)

     fi

     echo -e $profix"space\t"$space>> $log

     #diff

     if [[ $table_name != 'dw_weibo_user' ]];then

         yesterday=$(date -d "-1 day $dt" "+%Y-%m-%d")

         diff=$(spark-sql -e "

             select concat_ws('|',collect_set(flag)) from (

                 select 'gf' as gf, concat_ws('=',flag,cast(count(1) as string)) as flag  from (

                     select concat(if(y.$mtime_id is null, 0, 1),if(t.$mtime_id is null, 0, 1)) as flag

                     from (select distinct $mtime_id from $DB.$table_name where dt='$dt') t

                     full outer join (select distinct $mtime_id from $DB.$table_name where dt='$yesterday') y

                     on  t.$mtime_id = y.$mtime_id

                 ) a group by flag

             ) b group by gf;")

         echo -e $profix"diff\t"$diff>> $log

     fi

     #movie_rep

     if [[ $check_type == 'dw' ]];then

         movie_rep=$(spark-sql -e "

             select concat_ws('|',collect_set(v)) from (

                 select 'k' as k ,concat_ws('=',id,cast(count(1) as string)) as v

                 from $DB.$table_name where dt = '$dt'

                 group by id

                 having count(1)>1

             )a  group by k;")

         echo -e $profix"movie_rep\t"$movie_rep>> $log

     fi

     #index-*

     if [[ $table_name == 'dw_comment_statistics' ]];then

         up_day=$(spark-sql -e "select concat('<0:',count(1)) from $DB.$table_name

             where dt = '$dt' and

                 (cast(up_day as int) < 0

                 or cast(down_day as int) < 0

                 or cast(vv_day as int ) < 0

                 or cast(cmts_day as int) < 0

                 );")

         echo -e $profix"index_day\t"$up_day >> $log

     fi

 done

 #dm

 args_prefix="_##"

 args_suffix="##_"

 cat $basepath/dm_monitor_list.conf | while read line

 do

     check_type=`echo $line | cut -d " " -f 1`

     table_name=`echo $line |cut -d " " -f 2`

     echo "Table: "$table_name

     if [[ $check_type == 'record' ]]; then

         dt_str=`echo $line |cut -d " " -f 3`

         echo "Record-count check, partition field: "$dt_str

     else

         custom_sql=`echo $line |cut -d " " -f 3-`

         echo "Custom check: "$check_type

     fi

     fi

     profix=$table_name"\t"$insert_date"\t"

     DB=$HIVE_DB_DW

     hdfs_path=$HADOOP_DW_DATA_DESC

     if [[ $check_type == 'record' ]]; then

         record_sql="select count(1) from $DB.$table_name"

         if [[ -n $dt_str ]]; then

             # if [[ $table_name == 'dm_channel_index' ]]; then

                 #     record_sql=$record_sql" where datediff('$dt',$dt_str)=1;"

             # else

             #     record_sql=$record_sql" where $dt_str = '$dt';"

             # fi

             record_sql=$record_sql" where $dt_str = '$dt';"

         else

             record_sql=$record_sql";"

         fi

         echo "Statement to execute: "$record_sql

         #record

         record=$(hive -e "set hive.mapred.mode = nonstrict;$record_sql")

         #record=$(spark-sql -e "$record_sql")

         echo -e $profix"$check_type\t"$record >> $log

     else

         #custom_sql

         custom_sql=${custom_sql//$args_prefix"dt"$args_suffix/$dt}

         echo "Statement to execute: "$custom_sql

         invalid_records=$(hive -e "set hive.mapred.mode = nonstrict;use $DB;select concat_ws(\" | \",collect_set(row_val)) from ( $custom_sql ) tmp;")

         echo $invalid_records

         if [[ ! -n $invalid_records || $invalid_records == ''  ]]; then

                 invalid_records=""

         fi

         echo -e $profix"$check_type\t"$invalid_records >> $log

     fi

 done

 # insert hive

 hadoop fs -rm -r $HADOOP_ODS_CONFIG_DATA_DESC/yq_monitor_data_log/dt=$dt

 if [ -f "${file_path}/$log_name" ]; then

     hive -e "

     ALTER TABLE $HIVE_DB_MONITOR.yq_monitor_data_log DROP IF EXISTS PARTITION (dt = '$dt');

     alter table $HIVE_DB_MONITOR.yq_monitor_data_log add partition (dt = '$dt');

     "

 fi

 cd $file_path

 hadoop fs -put $log_name $HADOOP_ODS_CONFIG_DATA_DESC/yq_monitor_data_log/dt=$dt

 mv -f $log_name /home/trash
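The script builds each log row with `echo -e` and unquoted variables, which collapses whitespace in query output and interprets backslashes inside values. A possible alternative is sketched below with `printf`; `format_log_row` is a hypothetical helper, and the field order (table name, insert date, check type, value) follows what the script actually writes rather than the order listed in its header comment.

```shell
#!/bin/bash
# Hypothetical helper: build one tab-separated log row in the field order
# the script writes: table name, insert date, check type, value.
format_log_row() {
    local table_name="$1" insert_date="$2" check_type="$3" value="$4"
    # %s prints each value verbatim; only the format string carries the tabs.
    printf '%s\t%s\t%s\t%s\n' "$table_name" "$insert_date" "$check_type" "$value"
}

# Sample values for illustration only.
format_log_row dm_index 2020-01-02 record 42
```

In the script this would be used as `format_log_row "$table_name" "$insert_date" record "$record" >> "$log"`, assuming the surrounding variables are set as in the original loop.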
