Spark conditional sum function returns null when a column name is used


My understanding is that Spark's sum function works with string column names. However, I see different results depending on whether I pass a column name or a Column object.

from pyspark.sql import functions as f

schema = ["department", "employee", "knwos_ops", "developer"]
data = [("frontend", "john", 0, 1,), ("frontend", "jenny", 1, 1,), ("frontend", "michael", 0, 1,)]
input_df = spark.createDataFrame(data, schema=schema)
input_df.show(5, False)
+----------+--------+---------+---------+
|department|employee|knwos_ops|developer|
+----------+--------+---------+---------+
|frontend  |john    |0        |1        |
|frontend  |jenny   |1        |1        |
|frontend  |michael |0        |1        |
+----------+--------+---------+---------+
input_df \
    .groupBy(*["department"]) \
    .agg( \
            f.sum("developer").alias("dev"), \
            f.sum(f.when(f.col("knwos_ops") == 1, "developer")).alias("devops"), \
            f.sum("knwos_ops").alias("ops"),
    ).show(5, False)
+----------+---+------+---+
|department|dev|devops|ops|
+----------+---+------+---+
|frontend  |3  |null  |1  |
+----------+---+------+---+
input_df \
    .groupBy(*["department"]) \
    .agg( \
            f.sum("developer").alias("dev"), \
            f.sum(f.when(f.col("knwos_ops") == 1, f.col("developer"))).alias("devops"), \
            f.sum("knwos_ops").alias("ops"),
    ).show(5, False)
+----------+---+------+---+
|department|dev|devops|ops|
+----------+---+------+---+
|frontend  |3  |1     |1  |
+----------+---+------+---+
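My understanding of the functions sum and when is as follows.
The function when returns the given value if the condition is met, and null otherwise.
The function sum aggregates a column that is passed either as a string column name or as a Column object.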
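Based on this, in the first aggregation example the function when should return the column name developer as a string whenever the condition holds, and the function sum should use that string to aggregate and return 1. Instead it returns null.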
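Why does Spark not recognize developer as a column of the DataFrame? Can anyone help me understand the documentation behind this?
Thank you for the answers. As the second aggregation shows, I already have a workaround. What I am looking for is an explanation of this behavior; someone pointed me to sum.
Let me rephrase. If the function sum receives a string argument, it tries to find a column with that name in the DataFrame.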
#### sum function receives string as argument, and finds the column and does the sum
input_df.agg(f.sum("developer")).show(5, False)
+--------------+
|sum(developer)|
+--------------+
|3             |
+--------------+
#### sum function receives string as argument, and finds the column and does the sum. Field type is string so it returns null
input_df.agg(f.sum("employee")).show(5, False)
+--------------+
|sum(employee) |
+--------------+
|null          |
+--------------+
#### sum function receives string as argument, and does not find the column and throws error
input_df.agg(f.sum("manager")).show(5, False)
Py4JJavaError: An error occurred while calling o839.agg.
: org.apache.spark.sql.AnalysisException: cannot resolve '`manager`' given input columns: [department, employee, knwos_ops, developer];
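Based on the snippets above, I expected the function when to return the string developer, and
I expected the function sum to use that string to resolve the column of that name and aggregate it.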


         
Tags: apache-spark, pyspark, apache-spark-sql


          
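Accepted answer

when is a bit different from other Spark SQL functions. If you specify a string in its then / otherwise clause, it is interpreted as a string literal rather than a column.

For example, one possible use case for string literals is:

F.when(F.col('size') > 10, 'large').otherwise('small')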
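Here Spark interprets large and small as string literals, not as columns.
So in your use case you are summing the string 'developer', and that returns null because a string cannot be summed.
Because of this ambiguity, you have to use F.col to make clear that you want a column as the result of the then/otherwise clause.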


           
            
             
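As I understand it, the function sum is only expected to return null when the column given as its argument (passed as a string or as a Column) is not of a numeric type. What am I missing here?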


           
            
             
              
when returns a string column holding the literal value developer. It does not return a string.


          
           
            
What you probably want to do is the following instead:

input_df \
    .groupBy(*["department"]) \
    .agg( \