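Using Spark to Check Multiple Columns of a Hive Table for Duplicates

This article covers the following scenario: given the data in a Hive table, check several of its columns for duplicate values.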

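1. Resolve the dependencies first: all the Spark-related packages, declared in pom.xml

spark-hive is the key dependency; it is what lets Spark work with Hive tables.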

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.6.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.10</artifactId>
        <version>1.6.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>1.6.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.19</version>
    </dependency>
</dependencies>

2. The Spark client

package com.xiaoju.kangaroo.duplicate;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.hive.HiveContext;

import java.io.Serializable;

// Thin wrapper that builds the SparkConf and JavaSparkContext and hands out SQL contexts.
public class SparkClient implements Serializable {

    private SparkConf sparkConf;
    private JavaSparkContext javaSparkContext;

    public SparkClient() {
        initSparkConf();
        javaSparkContext = new JavaSparkContext(sparkConf);
    }

    // Plain Spark SQL context.
    public SQLContext getSQLContext() {
        return new SQLContext(javaSparkContext);
    }

    // Hive-aware context; this is the one used to query Hive tables.
    public HiveContext getHiveContext() {
        return new HiveContext(javaSparkContext);
    }

    private void initSparkConf() {
        String warehouseLocation = System.getProperty("user.dir");
        // spark.sql.warehouse.dir is a Spark 2.x setting; it is kept from the
        // original and presumably has no effect with the 1.6.0 dependencies.
        sparkConf = new SparkConf()
                .setAppName("duplicate")
                .set("spark.sql.warehouse.dir", warehouseLocation)
                .setMaster("yarn-client");
    }
}

3. The duplicate-check flow

package com.xiaoju.kangaroo.duplicate;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.hive.HiveContext;
import scala.Tuple2;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SparkDuplicate implements Serializable {

    // The Spark handles are transient: the anonymous functions below capture
    // `this`, and these fields must not be serialized and shipped to executors.
    private transient SparkClient sparkClient;
    private transient HiveContext hiveContext;
    private String db;
    private String tb;
    private String pt;
    private String cols;   // comma-separated column names, e.g. "area,type"

    public SparkDuplicate(String db, String tb, String pt, String cols) {
        this.db = db;
        this.tb = tb;
        this.pt = pt;
        this.cols = cols;
        this.sparkClient = new SparkClient();
        this.hiveContext = sparkClient.getHiveContext();
    }

    public void duplicate() {
        String partition = formatPartition(pt);
        String query = String.format("select * from %s.%s where %s", db, tb, partition);
        System.out.println(query);
        DataFrame rows = hiveContext.sql(query);
        JavaRDD<Row> rdd = rows.toJavaRDD();

        Map<String, Integer> repeatRetMap = rdd.flatMap(new FlatMapFunction<Row, String>() {
            // Emit one "column@value" key per checked column of each row.
            public Iterable<String> call(Row row) throws Exception {
                HashMap<String, Object> rowMap = formatRowMap(row);
                List<String> sList = new ArrayList<String>();
                for (String col : cols.split(",")) {
                    sList.add(col + "@" + rowMap.get(col));
                }
                return sList;
            }
        }).mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<String, Integer>(s, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            // Sum the occurrences of each key.
            public Integer call(Integer a, Integer b) throws Exception {
                return a + b;
            }
        }).filter(new Function<Tuple2<String, Integer>, Boolean>() {
            // Keep only keys that occur more than once.
            public Boolean call(Tuple2<String, Integer> kv) throws Exception {
                return kv._2 > 1;
            }
        }).collectAsMap();
        // filter + collectAsMap replaces the original map + reduce, which
        // fails when the RDD is empty because reduce() needs at least one element.

        for (Map.Entry<String, Integer> entry : repeatRetMap.entrySet()) {
            System.out.println("Duplicate value: " + entry.getKey() + ", count: " + entry.getValue());
        }
    }

    // Turns the partition argument into a WHERE condition:
    // "pt=..." or "dt=..." becomes pt='...' / dt='...'; otherwise the value is
    // split on "/" into year/month/day/hour, or year/week when it contains "w=".
    private String formatPartition(String partition) {
        StringBuilder format = new StringBuilder();
        if (partition.startsWith("pt") || partition.startsWith("dt")) {
            String[] items = partition.split("=");
            for (String item : items) {
                if (item.equals("pt") || item.equals("dt")) {
                    format.append(item);
                } else {
                    format.append("='").append(item).append("'");
                }
            }
        } else {
            String[] keys;
            if (partition.contains("w=")) {
                keys = new String[] {"year", "week"};
                partition = partition.replace("w=", "");
            } else {
                keys = new String[] {"year", "month", "day", "hour"};
            }
            String[] items = partition.split("/");
            for (int i = 0; i < items.length; i++) {
                if (i > 0) {
                    format.append(" and ");
                }
                format.append(keys[i]).append("='").append(items[i]).append("'");
            }
        }
        return format.toString();
    }

    // Copies a Row into a column-name -> value map.
    private HashMap<String, Object> formatRowMap(Row row) {
        HashMap<String, Object> rowMap = new HashMap<String, Object>();
        try {
            for (int i = 0; i < row.schema().fields().length; i++) {
                String colName = row.schema().fields()[i].name();
                Object colValue = row.get(i);
                rowMap.put(colName, colValue);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return rowMap;
    }

    public static void main(String[] args) {
        if (args.length < 4) {
            System.err.println("Usage: SparkDuplicate <db> <table> <partition> <cols>");
            System.exit(1);
        }
        String db = args[0];
        String tb = args[1];
        String pt = args[2];
        String cols = args[3];
        SparkDuplicate sparkDuplicate = new SparkDuplicate(db, tb, pt, cols);
        sparkDuplicate.duplicate();
    }
}
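To make the counting logic easier to follow, here is a minimal sketch of the same idea without Spark: build a "column@value" key for every checked column of every row, count the keys, and keep those seen more than once. The class name DuplicateCountSketch, the row helper, and the sample rows are made up for illustration; only the key shape is taken from the flatMap step above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DuplicateCountSketch {

    // Counts every "column@value" key across all rows and returns only the
    // keys that occur more than once (sorted by key for stable output).
    public static Map<String, Integer> findDuplicates(List<Map<String, Object>> rows, String cols) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Map<String, Object> row : rows) {
            for (String col : cols.split(",")) {
                String key = col + "@" + row.get(col);
                Integer old = counts.get(key);
                counts.put(key, old == null ? 1 : old + 1);
            }
        }
        Map<String, Integer> duplicates = new TreeMap<String, Integer>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                duplicates.put(e.getKey(), e.getValue());
            }
        }
        return duplicates;
    }

    // Hypothetical helper: builds one sample row with two columns.
    public static Map<String, Object> row(Object area, Object type) {
        Map<String, Object> m = new LinkedHashMap<String, Object>();
        m.put("area", area);
        m.put("type", type);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<Map<String, Object>>();
        rows.add(row(1, "a"));
        rows.add(row(1, "b"));
        rows.add(row(2, "b"));
        rows.add(row(3, "b"));
        // prints {area@1=2, type@b=3}
        System.out.println(findDuplicates(rows, "area,type"));
    }
}
```

Note that each column is checked on its own: a key counts how often one value appears in one column, not how often a combination of columns repeats.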

4. How to run

Job submission script (run.sh):

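#!/bin/bash
source /etc/profile
source ~/.bash_profile

# Positional arguments: database, table, partition spec, comma-separated columns
db=$1
table=$2
partition=$3
cols=$4

# Note: --num-executors has no value here; set it to suit your cluster.
spark-submit \
    --queue=root.zhiliangbu_prod_datamonitor \
    --driver-memory 500M \
    --executor-memory 13G \
    --num-executors  \
    spark-duplicate-1.0-SNAPSHOT-jar-with-dependencies.jar "${db}" "${table}" "${partition}" "${cols}"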

Run:

sh run.sh gulfstream_ods g_order // area,type
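The third argument is the partition spec, which formatPartition turns into the WHERE condition of the query. As a rough illustration, the standalone sketch below repeats that conversion outside Spark. The class name PartitionFormatSketch and all three sample partition strings are made up for this example; they are not the values used in the run above.

```java
public class PartitionFormatSketch {

    // Same conversion rules as formatPartition in SparkDuplicate:
    // "pt=..."/"dt=..." -> pt='...'/dt='...'; otherwise split on "/" into
    // year/month/day/hour, or year/week when the value contains "w=".
    public static String formatPartition(String partition) {
        StringBuilder format = new StringBuilder();
        if (partition.startsWith("pt") || partition.startsWith("dt")) {
            for (String item : partition.split("=")) {
                if (item.equals("pt") || item.equals("dt")) {
                    format.append(item);
                } else {
                    format.append("='").append(item).append("'");
                }
            }
        } else {
            String[] keys;
            if (partition.contains("w=")) {
                keys = new String[] {"year", "week"};
                partition = partition.replace("w=", "");
            } else {
                keys = new String[] {"year", "month", "day", "hour"};
            }
            String[] items = partition.split("/");
            for (int i = 0; i < items.length; i++) {
                if (i > 0) {
                    format.append(" and ");
                }
                format.append(keys[i]).append("='").append(items[i]).append("'");
            }
        }
        return format.toString();
    }

    public static void main(String[] args) {
        // Hypothetical inputs, one per supported shape.
        System.out.println(formatPartition("dt=2017-07-11"));  // dt='2017-07-11'
        System.out.println(formatPartition("2017/07/11"));     // year='2017' and month='07' and day='11'
        System.out.println(formatPartition("2017/w=28"));      // year='2017' and week='28'
    }
}
```

A slash-separated value may stop early (for example, without the hour); only as many keys as there are items are used.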

Result:

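Duplicate value: area@, count: 225
Duplicate value: area@, count: 7398
Duplicate value: area@, count: 69823
Duplicate value: area@, count: 98317
Duplicate value: area@, count: 91775
Duplicate value: area@, count: 72053
Duplicate value: area@, count: 2362
Duplicate value: area@, count: 264487
Duplicate value: area@, count: 2927
Duplicate value: area@, count: 230484
Duplicate value: area@, count: 87527
Duplicate value: area@, count: 74987
Duplicate value: area@, count: 130297
Duplicate value: area@, count: 24463
Duplicate value: area@, count: 15699
Duplicate value: area@, count: 13517
Duplicate value: area@, count: 4774
Duplicate value: area@, count: 5022
Duplicate value: area@, count: 6737
Duplicate value: area@, count: 12705
Duplicate value: area@, count: 18961
Duplicate value: area@, count: 20715
Duplicate value: area@, count: 15179
Duplicate value: area@, count: 1276
Duplicate value: area@, count: 31664
Duplicate value: area@, count: 61261
Duplicate value: area@, count: 32496
Duplicate value: area@, count: 55877
Duplicate value: area@, count: 40933
Duplicate value: area@, count: 32564
Duplicate value: area@, count: 300
Duplicate value: area@, count: 21405
Duplicate value: area@, count: 37696
Duplicate value: area@, count: 212
Duplicate value: area@, count: 12442
Duplicate value: area@, count: 2526
Duplicate value: area@, count: 17456
Duplicate value: area@, count: 12688
Duplicate value: area@, count: 17285
Duplicate value: area@, count: 11511
Duplicate value: area@, count: 6622
Duplicate value: area@, count: 9573
Duplicate value: area@, count: 2416
Duplicate value: area@, count: 8109
Duplicate value: area@, count: 27915
Duplicate value: area@, count: 58942
Duplicate value: area@, count: 18842
Duplicate value: area@, count: 3482
Duplicate value: area@, count: 31452
Duplicate value: area@, count: 11436
Duplicate value: area@, count: 656
Duplicate value: area@, count: 31557
Duplicate value: area@, count: 1726
Duplicate value: type@, count: 288479
Duplicate value: type@, count: 21067365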
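With this many duplicated values, every result is collected into a single map on the driver. One possible alternative, which is not what the code above does, is to let Spark SQL do the counting with one GROUP BY ... HAVING query per column. The sketch below only builds the query strings; the class name DuplicateQuerySketch and the dt value in the WHERE condition are assumptions for illustration, while the database, table, and column names are taken from the run above.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateQuerySketch {

    // Builds one query per column that returns only the values occurring
    // more than once in that column, together with their counts.
    public static List<String> buildQueries(String db, String tb, String where, String cols) {
        List<String> queries = new ArrayList<String>();
        for (String col : cols.split(",")) {
            String c = col.trim();
            queries.add(String.format(
                    "select %s, count(*) as cnt from %s.%s where %s group by %s having count(*) > 1",
                    c, db, tb, where, c));
        }
        return queries;
    }

    public static void main(String[] args) {
        // The WHERE condition here is a hypothetical partition filter.
        for (String q : buildQueries("gulfstream_ods", "g_order", "dt='2017-07-11'", "area,type")) {
            System.out.println(q);
        }
    }
}
```

Each string could then be passed to hiveContext.sql(...), the same call used in duplicate(), so that only the duplicated values and their counts come back as a DataFrame.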
