I have a streaming Dataset with columns bag_id and ball_color. I want to find the most popular color for each bag. So, I tried:
dataset.groupBy("bag_id", "color")          // 1st aggregation
  .agg(count("color").as("color_count"))
  .groupBy("bag_id")                        // 2nd aggregation
  .agg(max("color_count"))
But I got an error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Multiple streaming aggregations are not supported with streaming
DataFrames/Datasets;;
Can I build the right query with only one aggregation function?
There is an open Jira ticket addressing this issue, SPARK-26655; as of now, multiple aggregations on streaming data are not supported.
One workaround is to perform the first aggregation, write the result back to a sink such as Kafka, and then read it from Kafka again as a new stream to perform the second aggregation.
In other words, we can run only one aggregation per streaming job: save its output to HDFS/Hive/HBase/Kafka, then fetch it in a separate job to perform the additional aggregation.
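To make the "separate jobs" workaround concrete, here is a minimal plain-Python sketch of the pattern (no Spark required): job 1 performs the first aggregation on each arriving micro-batch and persists the (bag_id, color) counts to an intermediate store standing in for Kafka/HDFS, and a separate job 2 reads that store and performs the second aggregation. The function and variable names (`store`, `job1_aggregate_batch`, `job2_most_popular`) are hypothetical, chosen just for this illustration.

```python
# Sketch of the two-job workaround; `store` stands in for the Kafka/HDFS sink.
from collections import defaultdict

# Intermediate store: (bag_id, color) -> running count.
store = defaultdict(int)

def job1_aggregate_batch(batch):
    """Job 1, first aggregation: count colors per bag, persist to the store."""
    for bag_id, color in batch:
        store[(bag_id, color)] += 1

def job2_most_popular():
    """Job 2, second aggregation: read the store, keep the max count per bag."""
    best = {}
    for (bag_id, color), n in store.items():
        if bag_id not in best or n > best[bag_id][1]:
            best[bag_id] = (color, n)
    return best

# Two micro-batches arriving over time.
job1_aggregate_batch([(1, "red"), (1, "red"), (2, "blue")])
job1_aggregate_batch([(1, "green"), (2, "blue"), (2, "red")])

print(job2_most_popular())  # {1: ('red', 2), 2: ('blue', 2)}
```

In the real pipeline each job would be its own Spark application, with the store replaced by the chosen sink/source.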
Yes, Spark 2.4.4 (the latest at the time of writing) does NOT yet support multiple streaming aggregations. But as a workaround you can use the .foreachBatch() method:
from pyspark.sql.functions import count, max

def foreach_batch_function(df, epoch_id):
    df.groupBy("bag_id", "color") \
      .agg(count("color").alias("color_count")) \
      .groupBy("bag_id") \
      .agg(max("color_count")) \
      .show()  # .show() is a dummy action

streamingDF.writeStream.foreachBatch(foreach_batch_function).start()
Inside .foreachBatch() the df is not a streaming DataFrame, so you can do anything you want with it.
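As a sanity check of what each micro-batch computes, here is a plain-Python equivalent of the two chained aggregations (no Spark needed); `foreach_batch_equiv` is a hypothetical helper name used only for this sketch. Note that, like the original query, it returns only the maximum count per bag, not the winning color itself.

```python
from collections import Counter

def foreach_batch_equiv(rows):
    # 1st aggregation: count rows per (bag_id, color) pair.
    color_count = Counter(rows)
    # 2nd aggregation: max(color_count) per bag_id.
    result = {}
    for (bag_id, _color), n in color_count.items():
        result[bag_id] = max(result.get(bag_id, 0), n)
    return result

print(foreach_batch_equiv([(1, "red"), (1, "red"), (1, "green"), (2, "blue")]))
# {1: 2, 2: 1}
```

Both groupBys are legal here because the batch handed to foreachBatch is ordinary, non-streaming data.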