Regex to replace multiple occurrences of a string in a Spark DataFrame column using Scala

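I have a column in which a particular string occurs multiple times. The number of occurrences is not fixed; the string may be repeated any number of times.

The description column contains the following data: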

The account account has been cancelled for the account account account and with the account
Here, I want to replace each run of repeated "account" with a single occurrence.
Expected output:
The account has been cancelled for the account and with the account


         
Tags: scala, apache-spark, apache-spark-sql


        
         
          
          
user12976942, posted on 2021-04-08


          
Accepted answer


          
           
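You can use a regex pattern (source: "Java regex to remove duplicate words") together with regexp_replace to collapse each run of repeated words into a single word.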
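To see what the pattern does before involving Spark, here is a minimal plain-Scala sketch. It relies on the fact that Spark's regexp_replace uses Java regular expressions, so String.replaceAll behaves the same way for this pattern. The object name CollapseRepeats and the helpers collapseRepeats and collapseWord are hypothetical names introduced only for illustration; collapseWord is an assumed variant, not part of the original answer, for the case where only one specific word should be collapsed.

```scala
import java.util.regex.Pattern

// Hypothetical helper object for illustration; not a Spark API.
object CollapseRepeats {
  // \b(\w+)          capture one whole word
  // (\b\W+\b\1\b)*   followed by zero or more "separator + same word" repeats
  private val anyWord = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*"

  // Collapses every run of an identical, case-sensitive repeated word.
  def collapseRepeats(s: String): String = s.replaceAll(anyWord, "$1")

  // Assumed variant: collapses runs of one given word only.
  def collapseWord(s: String, word: String): String =
    s.replaceAll("\\b(" + Pattern.quote(word) + ")(\\W+\\1\\b)+", "$1")

  def main(args: Array[String]): Unit = {
    val in = "The account account has been cancelled for the account account account and with the account"
    println(collapseRepeats(in))
    println(collapseWord("that that account account", "account"))
  }
}
```

Note that the general pattern collapses any repeated word, including legitimate repetitions such as "that that", and that the match is case-sensitive. In Spark, the same pattern is passed to regexp_replace: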
           
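import org.apache.spark.sql.functions.{col, regexp_replace}

// Build a one-row DataFrame holding the sample text (run in spark-shell, where `spark` is predefined)
val df = spark.sql("select 'The account account has been cancelled for the account account account and with the account' col")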
df.show(false)
+-------------------------------------------------------------------------------------------+
|col                                                                                        |
+-------------------------------------------------------------------------------------------+
|The account account has been cancelled for the account account account and with the account|
+-------------------------------------------------------------------------------------------+
// Match a word followed by any number of repeats of itself, and keep only the first copy ($1)
val df2 = df.withColumn("col", regexp_replace(col("col"), "\\b(\\w+)(\\b\\W+\\b\\1\\b)*", "$1"))
df2.show(false)
+-------------------------------------------------------------------+