In Apache Spark, shuffle is one of costliest operation. Effective parallelising of this operation gives good performing for spark jobs.
Shuffle Partitions in Spark SQL
Shuffle partitions are the partitions in spark dataframe, which is created using a grouped or join operation. Number of partitions in this dataframe is different than the original dataframe partitions.
The above code prints
200
. The
2
partition increased to
200
.
This is because the parameter
spark.sql.shuffle.partitions
which controls number of shuffle partitions is set to
200
by default.
Challenges with Default Shuffle Partitions
The number of shuffle partitions in spark is static. It doesn’t change with different data size. This will lead into below issues
For smaller data, 200 is a overkill which often leads to slower processing because of scheduling overheads.
For large data, 200 is small and doesn’t effectively use the all resources in the cluster.
To over come the issues mentioned above, we need to control shuffle partitions dynamically.
Dynamically Setting the Shuffle Partitions
Spark allows changing the configuration of spark sql using
conf
method on the
sparkSession
. Using this method, we can set wide variety of configurations dynamically.
So if we need to reduce the number of shuffle partitions for a given dataset, we can do that by below code
The above code will print
100
. This shows how to set the number partitions dynamically.
The exact logic for coming up with number of shuffle partitions depends on actual analysis. You can typically set it to be 1.5 or 2 times of the initial partitions.