Regex to replace multiple occurrences of a string in a Spark DataFrame column using Scala

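I have a column in which a particular string occurs multiple times in a row. The number of occurrences is not fixed; the string can be repeated any number of times.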

The description column has the following data:

The account account has been cancelled for the account account account and with the account

Here, I want to replace each run of repeated "account" with a single occurrence.

Expected output:

The account has been cancelled for the account and with the account
    
scala
apache-spark
apache-spark-sql
user12976942
Posted on 2021-04-08
1 answer
mck
Posted on 2021-04-08
Accepted

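You can use a regex pattern (source: Java regular expression to remove duplicate words) together with regexp_replace to collapse consecutive repeated words into one.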
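To see what the pattern does before applying it to a DataFrame, here is a minimal plain-Scala sketch that runs the same regex through `String.replaceAll`. The object name `DedupWords` and the helper `dedupe` are hypothetical names introduced only for this illustration. It assumes Spark's `regexp_replace` follows Java regex semantics for this pattern.

```scala
object DedupWords {
  // Same pattern as in the answer:
  //   \b(\w+)            captures one whole word as group 1
  //   (\b\W+\b\1\b)*     then consumes any immediately following repeats of
  //                      that word, separated by non-word characters
  val pattern: String = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*"

  // Hypothetical helper: replaces each run of a repeated word with group 1
  def dedupe(s: String): String = s.replaceAll(pattern, "$1")

  def main(args: Array[String]): Unit = {
    val input =
      "The account account has been cancelled for the account account account and with the account"
    println(dedupe(input))
    // prints: The account has been cancelled for the account and with the account
  }
}
```

Note that the match is case-sensitive and only collapses adjacent repeats, so "The the" or "account and account" would be left unchanged.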

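import org.apache.spark.sql.functions.{col, regexp_replace}

// Build a one-row DataFrame holding the sample text in a column named "col"
val df = spark.sql("select 'The account account has been cancelled for the account account account and with the account' col")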
df.show(false)
+-------------------------------------------------------------------------------------------+
|col                                                                                        |
+-------------------------------------------------------------------------------------------+
|The account account has been cancelled for the account account account and with the account|
+-------------------------------------------------------------------------------------------+
// Capture a word as group 1, match any consecutive repeats of it, and keep a single copy ($1)
val df2 = df.withColumn("col", regexp_replace(col("col"), "\\b(\\w+)(\\b\\W+\\b\\1\\b)*", "$1"))
df2.show(false)
+-------------------------------------------------------------------+