Spark: Splitting a Delimited Column into Multiple Rows

Example Java code:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.split;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TestSparkSqlSplit {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate();

        // One input row whose id column packs two "id,name" pairs before the '|'.
        List<MyEntity> items = new ArrayList<MyEntity>();
        MyEntity myEntity = new MyEntity();
        myEntity.setId("scene_id1,scene_name1;scene_id2,scene_name2|id1");
        myEntity.setName("name");
        myEntity.setFields("other");
        items.add(myEntity);

        sparkSession.createDataFrame(items, MyEntity.class).createOrReplaceTempView("test");
        Dataset<Row> rows = sparkSession.sql("select * from test");

        // Keep the part before '|', split it on ';' and emit one row per element.
        rows = rows.withColumn("id", explode(split(split(col("id"), "\\|").getItem(0), ";")));

        // Split each "id,name" pair into two temporary columns.
        rows = rows.withColumn("id1", split(rows.col("id"), ",").getItem(0))
                .withColumn("name1", split(rows.col("id"), ",").getItem(1));

        // Overwrite the original columns and drop the temporaries.
        rows = rows.withColumn("id", rows.col("id1"))
                .withColumn("name", rows.col("name1"));
        rows = rows.drop("id1", "name1");

        rows.show();
        sparkSession.stop();
    }
}

MyEntity.java:

import java.io.Serializable;

public class MyEntity implements Serializable {
    private String id;
    private String name;
    private String fields;

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getFields() {
        return fields;
    }

    public void setFields(String fields) {
        this.fields = fields;
    }
}

Output:

+------+---------+-----------+
|fields|       id|       name|
+------+---------+-----------+
| other|scene_id1|scene_name1|
| other|scene_id2|scene_name2|
+------+---------+-----------+
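
The three string operations in the Java example (take the part before '|', split on ';', split each piece on ',') can be traced without Spark. The following is only a plain-Java sketch of that logic for illustration; the class name ExplodeSketch and the method explodePairs are hypothetical and not part of Spark or of the original code.

```java
import java.util.ArrayList;
import java.util.List;

final class ExplodeSketch {

    /** Returns one {id, name} pair per ';'-separated item found before the first '|'. */
    static List<String[]> explodePairs(String raw) {
        List<String[]> rows = new ArrayList<>();
        // Counterpart of split(col, "\\|").getItem(0): keep only the part before '|'.
        // Limit -1 keeps trailing empty strings, so index 0 always exists.
        String head = raw.split("\\|", -1)[0];
        // Counterpart of explode(split(head, ";")): one output row per item.
        for (String item : head.split(";", -1)) {
            String[] parts = item.split(",", -1);
            String id = parts[0];
            // Assumption: a missing second element is reported as null.
            String name = parts.length > 1 ? parts[1] : null;
            rows.add(new String[] {id, name});
        }
        return rows;
    }

    public static void main(String[] args) {
        String raw = "scene_id1,scene_name1;scene_id2,scene_name2|id1";
        for (String[] row : explodePairs(raw)) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```

Each element of the returned list corresponds to one row of the table printed by rows.show() above.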

Scala implementation:

[boco@CDH- ~]$ spark2-shell
Setting default log level to "WARN".
...
Spark context available as 'sc' (master = yarn, app id = application_1552012317155_0189).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2..cloudera1
      /_/

Using Scala version 2.11. (Java HotSpot(TM) -Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(
     |   (1, "scene_id1,scene_name1;scene_id2,scene_name2",""),
     |   (2, "scene_id1,scene_name1;scene_id2,scene_name2;scene_id3,scene_name3",""),
     |   (3, "scene_id4,scene_name4;scene_id2,scene_name2",""),
     |   (4, "scene_id6,scene_name6;scene_id5,scene_name5","")
     | ).toDF("id", "int_id","name");
df: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> df.show;
+---+--------------------+----+
| id|              int_id|name|
+---+--------------------+----+
|  1|scene_id1,scene_n...|    |
|  2|scene_id1,scene_n...|    |
|  3|scene_id4,scene_n...|    |
|  4|scene_id6,scene_n...|    |
+---+--------------------+----+

scala> df.withColumn("int_id", explode(split(col("int_id"), ";")));
res1: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> res1.show();
+---+--------------------+----+
| id|              int_id|name|
+---+--------------------+----+
|  1|scene_id1,scene_n...|    |
|  1|scene_id2,scene_n...|    |
|  2|scene_id1,scene_n...|    |
|  2|scene_id2,scene_n...|    |
|  2|scene_id3,scene_n...|    |
|  3|scene_id4,scene_n...|    |
|  3|scene_id2,scene_n...|    |
|  4|scene_id6,scene_n...|    |
|  4|scene_id5,scene_n...|    |
+---+--------------------+----+

scala> res1.withColumn("int_id", split(col("int_id"), ",")(0)).withColumn("name", split(col("int_id"), ",")(1));
res5: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> res5.show
+---+---------+----+
| id|   int_id|name|
+---+---------+----+
|  1|scene_id1|null|
|  1|scene_id2|null|
|  2|scene_id1|null|
|  2|scene_id2|null|
|  2|scene_id3|null|
|  3|scene_id4|null|
|  3|scene_id2|null|
|  4|scene_id6|null|
|  4|scene_id5|null|
+---+---------+----+

Here name is null because the first withColumn has already replaced int_id with its first element, so the second split no longer finds a comma. Deriving name before overwriting int_id gives the intended result:

scala> res1.withColumn("name", split(col("int_id"), ",")(1)).withColumn("int_id", split(col("int_id"), ",")(0));
res7: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> res7.show
+---+---------+-----------+
| id|   int_id|       name|
+---+---------+-----------+
|  1|scene_id1|scene_name1|
|  1|scene_id2|scene_name2|
|  2|scene_id1|scene_name1|
|  2|scene_id2|scene_name2|
|  2|scene_id3|scene_name3|
|  3|scene_id4|scene_name4|
|  3|scene_id2|scene_name2|
|  4|scene_id6|scene_name6|
|  4|scene_id5|scene_name5|
+---+---------+-----------+

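Filtering when int_id (a string column) contains null: do not write the condition as col("int_id").notEqual(null). Under SQL three-valued logic, any comparison with null evaluates to null rather than true, and filter keeps only rows whose predicate is true, so every row is dropped. The null is not converted to an empty string: comparing against "" with notEqual("") also evaluates to null for the null row, which is why that row disappears in the second query. To test for null explicitly, use col("int_id").isNotNull.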

scala> val df = Seq(
     |             (1, null,""),
     |             (2, "-1",""),
     |             (3, "scene_id4,scene_name4;scene_id2,scene_name2",""),
     |             (4, "scene_id6,scene_name6;scene_id5,scene_name5","")
     |           ).toDF("id", "int_id","name");
df: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> df.filter(col("int_id").notEqual(null).and(col("int_id").notEqual("-1")));
res5: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, int_id: string ... 1 more field]

scala> res5.show;
+---+------+----+
| id|int_id|name|
+---+------+----+
+---+------+----+

scala> df.filter(col("int_id").notEqual("").and(col("int_id").notEqual("-1")));
res7: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, int_id: string ... 1 more field]

scala> res7.show;
+---+--------------------+----+
| id|              int_id|name|
+---+--------------------+----+
|  3|scene_id4,scene_n...|    |
|  4|scene_id6,scene_n...|    |
+---+--------------------+----+
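
Both filter results in the transcript above follow from SQL three-valued logic: a comparison involving null is neither true nor false, and filter keeps a row only when the predicate is true. The plain-Java sketch below models that rule with a nullable Boolean to show why the first filter returns nothing and the second returns two rows. It is an illustration only, not Spark code; NullFilterSketch, sqlNotEqual, sqlAnd and keep are hypothetical names. In Spark itself the direct way to test for null is Column.isNotNull().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NullFilterSketch {

    /** SQL-style "a <> b": the result is null (unknown) when either side is null. */
    static Boolean sqlNotEqual(String a, String b) {
        if (a == null || b == null) {
            return null;
        }
        return !a.equals(b);
    }

    /** SQL-style AND: false wins, otherwise any unknown operand makes the result unknown. */
    static Boolean sqlAnd(Boolean x, Boolean y) {
        if (Boolean.FALSE.equals(x) || Boolean.FALSE.equals(y)) {
            return false;
        }
        if (x == null || y == null) {
            return null;
        }
        return true;
    }

    /** Keeps a value only when (value <> comparedWith AND value <> "-1") is exactly true. */
    static List<String> keep(List<String> values, String comparedWith) {
        List<String> kept = new ArrayList<>();
        for (String v : values) {
            Boolean predicate = sqlAnd(sqlNotEqual(v, comparedWith), sqlNotEqual(v, "-1"));
            if (Boolean.TRUE.equals(predicate)) {
                kept.add(v);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> intIds = Arrays.asList(
                null,
                "-1",
                "scene_id4,scene_name4;scene_id2,scene_name2",
                "scene_id6,scene_name6;scene_id5,scene_name5");
        System.out.println(keep(intIds, null).size()); // compared with null: nothing survives
        System.out.println(keep(intIds, "").size());   // compared with "": the null row still drops
    }
}
```

Modelling the predicate as a nullable Boolean makes the unknown state explicit, which a primitive boolean cannot express.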

If int_id does not contain the delimiter, splitting does not drop the row; the missing element simply comes back as null:

scala> val df = Seq(
     | (1, null,""),
     | (2, "-1",""),
     | (3, "scene_id4,scene_name4;scene_id2,scene_name2",""),
     | (4, "scene_id6,scene_name6;scene_id5,scene_name5","")
     | ).toDF("id", "int_id","name");
df: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> df.withColumn("name", split(col("int_id"), ",")(1)).withColumn("int_id", split(col("int_id"), ",")(0));
res0: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> res0.show;
+---+---------+--------------------+
| id|   int_id|                name|
+---+---------+--------------------+
|  1|     null|                null|
|  2|       -1|                null|
|  3|scene_id4|scene_name4;scene...|
|  4|scene_id6|scene_name6;scene...|
+---+---------+--------------------+
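
The table above shows two cases where indexing into the split result gives null: the input itself is null, or the requested index is past the end of the array. The plain-Java sketch below reproduces that behaviour for illustration, assuming the lenient semantics seen in this transcript (null instead of an error). GetItemSketch and itemAt are hypothetical names, not Spark APIs.

```java
final class GetItemSketch {

    /**
     * Splits value on delimiterRegex and returns the element at index.
     * Returns null when value is null or the index is out of range.
     */
    static String itemAt(String value, String delimiterRegex, int index) {
        if (value == null) {
            return null;
        }
        // Limit -1 keeps trailing empty strings instead of discarding them.
        String[] parts = value.split(delimiterRegex, -1);
        return (index >= 0 && index < parts.length) ? parts[index] : null;
    }

    public static void main(String[] args) {
        String[] intIds = {
            null,
            "-1",
            "scene_id4,scene_name4;scene_id2,scene_name2"
        };
        for (String intId : intIds) {
            // Same column order as the transcript: element 0, then element 1.
            System.out.println(itemAt(intId, ",", 0) + " | " + itemAt(intId, ",", 1));
        }
    }
}
```

Because the third value was not exploded on ';' first, its element 1 still contains the start of the next pair, which matches the truncated name column in the table.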
