1. partition by vs. group by: differences and relationship
1) group by is a grouping clause, while partition by belongs to analytic (window) functions; functions such as sum() are aggregate functions.
2) In execution order, partition by is applied after the other keywords: the windowing happens over the result set produced by the select. group by, by contrast, follows the usual SQL evaluation order (from > where > group by > having > order by).
3) Compared with group by, partition by keeps every row of the data while grouping and ordering on the chosen columns; group by keeps only the grouping columns and the aggregate results.
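The contrast above can be sketched in plain Spark SQL. This is only an illustration, assuming an existing SparkSession `spark` and a hypothetical registered view `orders(user_id, amount)` (neither appears in the post):

```scala
// Hypothetical table `orders(user_id, amount)`; `spark` is an existing SparkSession.

// GROUP BY collapses each group to a single output row:
spark.sql("""
  SELECT user_id, SUM(amount) AS total
  FROM orders
  GROUP BY user_id
""").show()

// PARTITION BY keeps every row and attaches the per-group aggregate to each one:
spark.sql("""
  SELECT user_id, amount,
         SUM(amount) OVER (PARTITION BY user_id) AS total
  FROM orders
""").show()
```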
2. Using group by in Spark SQL
// assumes a SparkSession `spark` is in scope
import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq(
  ("ABC", "2019-02-10", "411626"),
  ("ABC", "2019-02-10", "411627"),
  ("BCD", "2020-04-01", "411626"),
  ("BCD", "2020-04-01", "411627"),
  ("BCD", "2020-04-02", "411626"),
  ("BCD", "2020-04-02", "411627"),
  ("DEF", "2019-01-09", "411626"))
  .toDF("user_id", "start_time", "end_time")
df.groupBy(col("user_id"), col("start_time"))
.agg(count(col("end_time")), sum(col("end_time")))
.show()
+-------+----------+---------------+-------------+
|user_id|start_time|count(end_time)|sum(end_time)|
+-------+----------+---------------+-------------+
| BCD|2020-04-02| 2| 823253.0|
| ABC|2019-02-10| 2| 823253.0|
| BCD|2020-04-01| 2| 823253.0|
| DEF|2019-01-09| 1| 411626.0|
+-------+----------+---------------+-------------+
3. Using partition by in Spark SQL
import org.apache.spark.sql.expressions.Window

df.withColumn("rank", row_number().over(
    Window.partitionBy(col("user_id"), col("start_time")).orderBy(col("end_time"))))
.show()
+-------+----------+--------+----+
|user_id|start_time|end_time|rank|
+-------+----------+--------+----+
| BCD|2020-04-02| 411626| 1|
| BCD|2020-04-02| 411627| 2|
| ABC|2019-02-10| 411626| 1|
| ABC|2019-02-10| 411627| 2|
| BCD|2020-04-01| 411626| 1|
| BCD|2020-04-01| 411627| 2|
| DEF|2019-01-09| 411626| 1|
+-------+----------+--------+----+
partition by returns all of the data columns.
4. Returning all data columns with group by
df.groupBy(col("user_id"), col("start_time"))
.agg(count(col("end_time")), sum(col("end_time")), collect_set(col("end_time"))(0).as("end_time"))
.show()
+-------+----------+---------------+-------------+--------+
|user_id|start_time|count(end_time)|sum(end_time)|end_time|
+-------+----------+---------------+-------------+--------+
| BCD|2020-04-02| 2| 823253.0| 411627|
| ABC|2019-02-10| 2| 823253.0| 411627|
| BCD|2020-04-01| 2| 823253.0| 411627|
| DEF|2019-01-09| 1| 411626.0| 411626|
+-------+----------+---------------+-------------+--------+
Taking an element of collect_set (which deduplicates) is one way to carry an extra column through a group by, though it keeps only a single, arbitrary value per group rather than every row.
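When every row of the group is actually needed alongside the aggregates, a window aggregate over the same `df` avoids the collect_set workaround entirely. This is a sketch of that alternative, not the post's original approach:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Same grouping keys as the groupBy above, expressed as a window spec.
val w = Window.partitionBy(col("user_id"), col("start_time"))

// count/sum are computed per group but attached to every row,
// so end_time survives unchanged.
df.withColumn("cnt", count(col("end_time")).over(w))
  .withColumn("sum", sum(col("end_time")).over(w))
  .show()
```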
I don't produce knowledge; I am merely a porter of knowledge.