Hive Study Notes: Column-to-Row Conversion with collect_list/collect_set/concat_ws

In Hive, to group by one field and merge the values of another field, use collect_list or collect_set.

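Both return the values of a column within each group as an array. The difference is:

collect_list -- keeps duplicates

collect_set -- removes duplicates

This is much like the difference between a list and a set in Python.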
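The difference between the two functions can be sketched in Python, the language the comparison above borrows from. This only illustrates the semantics, not how Hive executes a query. The sample rows and the helper names `collect_list` and `collect_set` are hypothetical, and the set version keeps first-seen order purely to make the result predictable (Hive does not promise any particular element order).

```python
from collections import defaultdict

# Hypothetical (id, classes) rows; not taken from the original data file.
rows = [
    ("1", "math"),
    ("1", "art"),
    ("1", "math"),
    ("2", "music"),
]


def collect_list(rows):
    """Group by id and keep every value, duplicates included."""
    groups = defaultdict(list)
    for row_id, value in rows:
        groups[row_id].append(value)
    return dict(groups)


def collect_set(rows):
    """Group by id and keep each distinct value once."""
    groups = defaultdict(dict)
    for row_id, value in rows:
        # A dict is used as an ordered set so the result is deterministic.
        groups[row_id][value] = None
    return {row_id: list(values) for row_id, values in groups.items()}


as_list = collect_list(rows)
as_set = collect_set(rows)
```

With these rows, id "1" keeps "math" twice in `as_list` but only once in `as_set`.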

1. Create a test table

create table table_tmp(
    id string,
    classes string
) partitioned by (month string)
row format delimited fields terminated by ',';
2. Local file
3. Load the data into the Hive table
load data local inpath '/root/data/id.data' into table table_tmp partition (month='202201');
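As a rough illustration of what the load produces, the sketch below mimics how one comma-delimited text line maps onto the two columns of table_tmp, with the partition value coming from the load statement rather than from the file. The sample lines and the function `parse_line` are hypothetical; the actual contents of /root/data/id.data are not shown in these notes.

```python
def parse_line(line, month):
    """Split one comma-delimited line into (id, classes, month).

    Assumption: as with Hive's delimited text format, a missing
    trailing field becomes None and extra fields are ignored.
    """
    fields = line.rstrip("\n").split(",")
    fields += [None] * (2 - len(fields))  # pad if a field is missing
    row_id, classes = fields[:2]
    return row_id, classes, month


# Hypothetical lines, only to show the shape of the data.
sample_lines = ["1,math\n", "1,art\n", "2,music\n"]
loaded = [parse_line(line, month="202201") for line in sample_lines]
```

Every row in `loaded` carries month '202201', even though the file itself never mentions it.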
4. collect_list: merge each group's values into an array
select id,
       collect_list(classes) as col_001
from table_tmp
group by id;
5. concat_ws + collect_list: merge without deduplication
-- join each id's classes values with '-', keeping duplicates
select id,
       concat_ws('-', collect_list(cast(classes as string))) as col_concat
from table_tmp
group by id;
6. concat_ws + collect_set: merge with deduplication
-- join each id's distinct classes values with '-'
select id,
       concat_ws('-', collect_set(cast(classes as string))) as col_concat
from table_tmp
group by id;
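Steps 5 and 6 differ only in whether duplicates survive before the values are joined. A minimal Python sketch of that behaviour follows, with hypothetical sample rows and a hypothetical helper `merge_classes`. First-seen order is used in the distinct case only to keep the result predictable; Hive makes no such ordering promise.

```python
from collections import defaultdict


def merge_classes(rows, sep="-", distinct=False):
    """Group (id, classes) rows by id and join each group's values."""
    groups = defaultdict(list)
    for row_id, value in rows:
        if distinct and value in groups[row_id]:
            continue  # collect_set-style: drop repeated values
        groups[row_id].append(value)
    return {row_id: sep.join(values) for row_id, values in groups.items()}


# Hypothetical rows, not from the original data file.
rows = [("1", "math"), ("1", "art"), ("1", "math"), ("2", "music")]

with_duplicates = merge_classes(rows)                    # like step 5
without_duplicates = merge_classes(rows, distinct=True)  # like step 6
```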
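1. Working around the group by restriction
collect can be used to work around a restriction of group by: in a grouped query, every column that appears after select must be a grouping column (or be wrapped in an aggregate function).
Sometimes, though, we want to group by one column and pick a single, arbitrary value from another column. That can be done as follows:
select id,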
       collect_list(classes)[0] as col_001
from table_tmp
group by id;
This feels much like indexing into a list in Python.
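Spelled out in Python, the idea is to build the list for each group and then index into it. The rows and the helper name `first_per_group` are hypothetical. One caveat: Hive does not guarantee the order of elements inside the collected array, so `[0]` yields an arbitrary value rather than a truly random one or a well-defined first one.

```python
from collections import defaultdict


def first_per_group(rows):
    """Return one value per id: element 0 of that id's collected list."""
    groups = defaultdict(list)
    for row_id, value in rows:
        groups[row_id].append(value)
    # Equivalent of collect_list(classes)[0] for every group.
    return {row_id: values[0] for row_id, values in groups.items()}


# Hypothetical rows, not from the original data file.
rows = [("1", "math"), ("1", "art"), ("2", "music")]
picked = first_per_group(rows)
```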
2. concat_ws syntax
concat_ws(separator, str1, str2, ...)
concat_ws(separator, [str1, str2, ...])
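Both call shapes can be modelled with one small Python function. This is a sketch of the behaviour as I understand it, not Hive's implementation: the function name mirrors the Hive one for readability, and the NULL handling (a NULL separator gives NULL, NULL arguments are skipped) is an assumption that should be checked against the Hive version in use.

```python
def concat_ws(separator, *args):
    """Join strings and/or lists of strings with a separator."""
    if separator is None:
        return None  # assumed: a NULL separator yields NULL
    parts = []
    for arg in args:
        if arg is None:
            continue  # assumed: NULL arguments are skipped
        if isinstance(arg, (list, tuple)):
            # Array form: flatten the elements, skipping NULLs.
            parts.extend(item for item in arg if item is not None)
        else:
            parts.append(arg)
    return separator.join(parts)


from_strings = concat_ws("-", "a", "b", "c")  # concat_ws(sep, str1, str2, ...)
from_array = concat_ws("-", ["a", "b", "c"])  # concat_ws(sep, array)
```

The array form is what makes concat_ws compose directly with collect_list and collect_set, since both return arrays.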
Reference: Merging multiple rows in Hive: the collect_set & collect_list functions
Reference: Hive notes on collect_list/collect_set (column to row)
