I want to drop consecutive duplicate rows, based on a subset of columns, from a DataFrame. I found a solution here for how to do it, but it only works for a single column.
Given a DataFrame like this one:
test_df = spark.createDataFrame([
(2,3.0,"a", "2020-01-01"),
(2,6.0,"a", "2020-01-02"),
(3,2.0,"a", "2020-01-02"),
(4,1.0,"b", "2020-01-04"),
(4,9.0,"b", "2020-01-05"),
(4,7.0,"b", "2020-01-05"),
(2,3.0,"a", "2020-01-08"),
(4,7.0,"b", "2020-01-09")
], ("id", "num","st", "date"))
##############
id  num  st  date
2   3.0  a   2020-01-01
2   6.0  a   2020-01-02
3   2.0  a   2020-01-02
4   1.0  b   2020-01-04
4   9.0  b   2020-01-05
4   7.0  b   2020-01-05
2   3.0  a   2020-01-08
4   7.0  b   2020-01-09
I want to drop consecutive duplicates over a particular subset of columns, [id, st], keeping the first record (ordered by date) whenever a consecutive run occurs. If two rows fall on the same date and cannot be ordered unambiguously, either one may be chosen at random. The result would look like this:
##############
id  num  st  date
2   3.0  a   2020-01-01
3   2.0  a   2020-01-02
4   1.0  b   2020-01-04
2   3.0  a   2020-01-08
4   7.0  b   2020-01-09
How can I do this?
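To make the intended behavior concrete, here is a minimal plain-Python sketch of the logic on the sample rows above (not a Spark answer): sort by date, then collapse each consecutive run with the same (id, st), keeping the run's first row.

```python
from itertools import groupby

# Sample rows: (id, num, st, date)
rows = [
    (2, 3.0, "a", "2020-01-01"),
    (2, 6.0, "a", "2020-01-02"),
    (3, 2.0, "a", "2020-01-02"),
    (4, 1.0, "b", "2020-01-04"),
    (4, 9.0, "b", "2020-01-05"),
    (4, 7.0, "b", "2020-01-05"),
    (2, 3.0, "a", "2020-01-08"),
    (4, 7.0, "b", "2020-01-09"),
]

# Sort by date (stable sort, so same-date ties keep input order),
# then keep only the first row of each consecutive (id, st) run.
rows.sort(key=lambda r: r[3])
deduped = [next(run) for _, run in groupby(rows, key=lambda r: (r[0], r[2]))]

for r in deduped:
    print(r)
# → keeps 5 rows: the first of each consecutive (id, st) run
```

In Spark the same idea would typically be expressed with `F.lag("id")` and `F.lag("st")` over a `Window.orderBy("date")`, filtering out rows whose lagged values match the current ones; the sketch above is only meant to pin down the expected result.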