In statistical analysis, if we run an aggregate over a measure while the selected columns also include dimensions, we should in principle group by those dimension columns as well. If the aggregate is the sum function, grouping then amounts to a classic subtotal. There is, however, a special case: we apply a sum aggregate to a measure, the select list also contains dimension columns, but we do not want to group by them. For example:
select name, role, sum(income) from employee
The result is admittedly odd: the sum is actually computed over income for the entire table, independent of name and role, and the name and role in the result come from an arbitrary row (in practice often the first record). Still, MySQL accepts this syntax, at least when the ONLY_FULL_GROUP_BY SQL mode is disabled (as it was by default before MySQL 5.7).
In Spark, however, executing the same SQL statement throws an org.apache.spark.sql.AnalysisException:
org.apache.spark.sql.AnalysisException: expression 'employee.`name`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.
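Both fixes the message suggests can be written directly in Spark SQL. A sketch of the two options, assuming a SparkSession named `spark` with the employee data registered as a temporary view (both names are assumptions, not from the original text):

```scala
// Assumes `spark` is an active SparkSession and the employee DataFrame has
// been registered via createOrReplaceTempView("employee").

// Option 1: wrap the dimension columns in first() -- mirrors MySQL's
// "pick an arbitrary row" behavior, but makes that choice explicit.
spark.sql("SELECT first(name), first(role), sum(income) FROM employee").show()

// Option 2: group by the dimension columns -- conventional subtotal semantics.
spark.sql("SELECT name, role, sum(income) FROM employee GROUP BY name, role").show()
```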
This happens because, before executing a SQL statement, Spark SQL first calls the checkAnalysis method of CheckAnalysis to validate the LogicalPlan:
def checkAnalysis(plan: LogicalPlan): Unit = {
  plan.foreachUp {
    // ... other checks elided ...
    case Aggregate(groupingExprs, aggregateExprs, child) =>
      def checkValidAggregateExpression(expr: Expression): Unit = expr match {
        // ... other cases elided ...
        case e: Attribute if groupingExprs.isEmpty =>
          // Collect all [[AggregateExpression]]s.
          val aggExprs = aggregateExprs.filter(_.collect {
            case a: AggregateExpression => a
          }.nonEmpty)
          failAnalysis(
            s"grouping expressions sequence is empty, " +
              s"and '${e.sql}' is not an aggregate function. " +
              s"Wrap '${aggExprs.map(_.sql).mkString("(", ", ", ")")}' in windowing " +
              s"function(s) or wrap '${e.sql}' in first() (or first_value) " +
              s"if you don't care which value you get.")
        case e: Attribute if !groupingExprs.exists(_.semanticEquals(e)) =>
          failAnalysis(
            s"expression '${e.sql}' is neither present in the group by, " +
              s"nor is it an aggregate function. " +
              "Add to group by or wrap in first() (or first_value) if you don't care " +
              "which value you get.")
        // ...
      }
    // ...
  }
}
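Stripped of Catalyst's expression machinery, the rule reduces to a simple predicate over the output expressions of an Aggregate node. A toy model of that decision (names and signatures are mine, not Spark's):

```scala
// Toy model of the check: an output column of an aggregate is valid only if
// it is itself an aggregate call, or matches one of the grouping expressions.
def isValidAggregateOutput(expr: String,
                           groupingExprs: Set[String],
                           isAggregateCall: String => Boolean): Boolean =
  isAggregateCall(expr) || groupingExprs.contains(expr)

// With group by absent, plain attributes like name and role are rejected:
val agg: String => Boolean = _.startsWith("sum(")
isValidAggregateOutput("sum(income)", Set.empty, agg)  // true
isValidAggregateOutput("name", Set.empty, agg)         // false -> failAnalysis
```

The second pattern match in the real code implements the same idea via `semanticEquals`, which compares expressions up to cosmetic differences (such as case or qualifiers) rather than by raw string equality.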