Broadcast distributes a data set to every node so that Spark tasks can read it locally whenever they need it. Without broadcast, the data set is serialized and shipped with every task that uses it, which takes time and makes each task larger.
Add the following code to broadcast the testArr data to each node:
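import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

def main(args: Array[String]): Unit = {
  ...
  // Data set to be shared with every node.
  val testArr: Array[Long] = new Array[Long](200)
  // Broadcast the array once so that each node keeps a local copy.
  val testBroadcast: Broadcast[Array[Long]] = sc.broadcast(testArr)
  // Pass the broadcast handle, not the array itself, into the task.
  val resultRdd: RDD[Long] = inputRdd.map(input => handleData(testBroadcast, input))
  ...
}

def handleData(broadcast: Broadcast[Array[Long]], input: String): Long = {
  // Read the array from the node-local broadcast copy.
  val value = broadcast.value
  ...
}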
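The snippet above leaves out the SparkContext setup and the body of handleData. The following is a minimal, self-contained sketch of the same pattern that can run in local mode. The object name BroadcastSketch, the run helper, the sample input values, the local[2] master, and the index-lookup logic inside handleData are all assumptions made for illustration; they are not part of the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object BroadcastSketch {
  // Hypothetical task logic: treats each input string as a non-negative
  // index and looks it up in the broadcast array.
  def handleData(broadcast: Broadcast[Array[Long]], input: String): Long = {
    val table = broadcast.value // read from the node-local copy
    table(input.trim.toInt % table.length)
  }

  // Hypothetical helper so the logic can be exercised with any SparkContext.
  def run(sc: SparkContext): Array[Long] = {
    // Sample data (assumption): element i holds i * 10.
    val testArr: Array[Long] = Array.tabulate(200)(i => i.toLong * 10L)
    val testBroadcast: Broadcast[Array[Long]] = sc.broadcast(testArr)
    val inputRdd: RDD[String] = sc.parallelize(Seq("0", "1", "199"))
    val result = inputRdd.map(input => handleData(testBroadcast, input)).collect()
    // Drop the cached copies on the executors once they are no longer needed.
    testBroadcast.unpersist()
    result
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for trying this out without a cluster.
    val conf = new SparkConf().setAppName("BroadcastSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc).mkString(",")) // prints 0,10,1990
    finally sc.stop()
  }
}
```

Only the small broadcast handle is captured by the map closure here, so the 200-element array is not re-serialized for every task.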