A Special Case of Aggregation in Spark

When doing statistical analysis, if we aggregate a measure while the select list also contains dimension columns, we are in principle required to group by those dimension columns; when the aggregate is sum, the grouped query amounts to a classified subtotal. A special case arises when we apply sum to the measure and keep dimension columns in the select list, yet do not want to group by them. For example:

select name, role, sum(income) from employee
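
For contrast, the conventional grouped form, which yields one subtotal per (name, role) pair, would be:

select name, role, sum(income) from employee group by name, role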

The result of the ungrouped query is admittedly odd: the sum actually runs over the income column of the entire table, independent of name and role, and the name and role that come back are simply taken from whichever row happens to be read first (formally, the choice is nondeterministic). Still, at least in MySQL, this SQL is accepted, provided the ONLY_FULL_GROUP_BY SQL mode is disabled, as it was by default before MySQL 5.7.
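
As a hedged illustration (assuming MySQL 5.7+, where ONLY_FULL_GROUP_BY is on by default), the mode can be inspected and relaxed per session so the ungrouped query is accepted:

-- Inspect the active SQL modes for the current session.
SELECT @@session.sql_mode;
-- Strip ONLY_FULL_GROUP_BY for this session only.
SET SESSION sql_mode = (SELECT REPLACE(@@session.sql_mode, 'ONLY_FULL_GROUP_BY', ''));
select name, role, sum(income) from employee;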

In Spark, however, the same statement fails with an org.apache.spark.sql.AnalysisException:

org.apache.spark.sql.AnalysisException: expression 'employee.`name`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.
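
The message itself points at the workaround: wrap each non-aggregated column in first() (or first_value), so that every projected expression is an aggregate:

select first(name), first(role), sum(income) from employee

Spark then sums income over the whole table and picks name and role from an arbitrary row, mirroring MySQL's loose behavior. But why does Spark reject the original statement outright where MySQL does not?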

The reason is that before executing a SQL statement, Spark SQL calls the checkAnalysis method of CheckAnalysis to validate the resolved LogicalPlan; the relevant branches (abridged from the Spark source) are:

def checkAnalysis(plan: LogicalPlan): Unit = {
  plan.foreachUp {
    // ... (other checks elided)
    case Aggregate(groupingExprs, aggregateExprs, child) =>
      def checkValidAggregateExpression(expr: Expression): Unit = expr match {
        // ...
        // No GROUP BY clause at all: every projected attribute must be
        // wrapped in an aggregate function.
        case e: Attribute if groupingExprs.isEmpty =>
          // Collect all [[AggregateExpression]]s.
          val aggExprs = aggregateExprs.filter(_.collect {
            case a: AggregateExpression => a
          }.nonEmpty)
          failAnalysis(
            s"grouping expressions sequence is empty, " +
              s"and '${e.sql}' is not an aggregate function. " +
              s"Wrap '${aggExprs.map(_.sql).mkString("(", ", ", ")")}' in windowing " +
              s"function(s) or wrap '${e.sql}' in first() (or first_value) " +
              s"if you don't care which value you get.")
        // A GROUP BY exists, but this attribute is neither part of it
        // nor inside an aggregate function.
        case e: Attribute if !groupingExprs.exists(_.semanticEquals(e)) =>
          failAnalysis(
            s"expression '${e.sql}' is neither present in the group by, " +
              s"nor is it an aggregate function. " +
              "Add to group by or wrap in first() (or first_value) if you don't care " +
              "which value you get.")
        // ...
      }
      aggregateExprs.foreach(checkValidAggregateExpression)
    // ... (remaining operators elided)
  }
}