The degree of parallelism (DOP) specifies the number of tasks that are executed concurrently. It also determines the number of data blocks (partitions) produced by a shuffle operation. Tune the DOP so that the cluster's CPU and memory resources are fully used.
Check the CPU and memory usage of each node. If tasks and data are not evenly distributed among the nodes, increase the DOP so that the work is split into more, smaller tasks. As a rule of thumb, set the DOP to two to three times the total number of CPU cores in the cluster.
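The rule of thumb above is simple arithmetic, and a minimal sketch may make it concrete. The object DopEstimate and the helper suggestedDop are hypothetical names introduced only for this illustration, and the cluster size (3 nodes with 4 cores each) is an assumed example, not a recommendation.

```scala
object DopEstimate {
  // Hypothetical helper: returns the (low, high) DOP range suggested by the
  // "two to three times the total CPU cores" rule of thumb.
  def suggestedDop(nodes: Int, coresPerNode: Int): (Int, Int) = {
    require(nodes > 0 && coresPerNode > 0, "nodes and coresPerNode must be positive")
    val totalCores = nodes * coresPerNode
    (2 * totalCores, 3 * totalCores)
  }

  def main(args: Array[String]): Unit = {
    // Assumed example cluster: 3 nodes with 4 cores each, 12 cores in total.
    val (low, high) = suggestedDop(nodes = 3, coresPerNode = 4)
    println(s"Suggested DOP range: $low to $high") // prints "Suggested DOP range: 24 to 36"
  }
}
```

Treat the result as a starting point only. The final value still depends on the memory, data volume, and application logic of the actual workload.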
Configure the DOP parameter using one of the following methods based on the actual memory, CPU, data, and application logic conditions:
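Pass the DOP as an argument to the shuffle operation. The value applies only to that operation:
testRDD.groupByKey(24)
Set spark.default.parallelism in the application code. SparkConf.set takes the value as a string:
val conf = new SparkConf()
conf.set("spark.default.parallelism", "24")
Set spark.default.parallelism in the spark-defaults.conf file:
spark.default.parallelism 24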
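To see how the application-wide setting and the per-operation argument interact, the following sketch may help. It assumes a local Spark installation on the classpath. The local[4] master, the sample data, the value 48, and the names ParallelismDemo and partitionCounts are all illustrative assumptions, not part of the original configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismDemo {
  // Hypothetical helper: returns the partition counts of two shuffles,
  // one relying on spark.default.parallelism and one with an explicit DOP.
  def partitionCounts(): (Int, Int) = {
    val conf = new SparkConf()
      .setAppName("ParallelismDemo")
      .setMaster("local[4]")                  // assumption: local test run
      .set("spark.default.parallelism", "24") // application-wide default DOP
    val sc = new SparkContext(conf)
    try {
      // Sample key-value data; the contents are arbitrary.
      val testRDD = sc.parallelize(1 to 1000).map(i => (i % 10, i))

      // No argument: the shuffle falls back to spark.default.parallelism.
      val byDefault = testRDD.groupByKey().getNumPartitions

      // Explicit argument: this value is used for this operation only.
      val explicit = testRDD.groupByKey(48).getNumPartitions

      (byDefault, explicit)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (byDefault, explicit) = partitionCounts()
    println(s"groupByKey():   $byDefault partitions")
    println(s"groupByKey(48): $explicit partitions")
  }
}
```

Under these assumptions, the first count is expected to follow spark.default.parallelism and the second to follow the explicit argument. This is why a per-operation value is useful for a single heavy shuffle without changing the default for the rest of the application.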