The divide of tasks can be optimized by optimizing the partitioning method. If data skew occurs in a certain task, the whole execution process is delayed. Therefore, when designing the partitioning method, ensure that partitions are evenly assigned.
Partitioning methods are as follows:
dataStream.shuffle();
dataStream.rebalance();
dataStream.rescale();
dataStream.broadcast();
The following is an example:
// fromElements builds simple Tuple2 stream DataStream<Tuple2<String, Integer>> dataStream = env.fromElements(Tuple2.of("hello",1), Tuple2.of("test",2), Tuple2.of("world",100)); // Defines the key value used for partitioning. Adding one to the value equals to the id. Partitioner<Tuple2<String, Integer>> strPartitioner = new Partitioner<Tuple2<String, Integer>>() { @Override public int partition(Tuple2<String, Integer> key, int numPartitions) { return (key.f0.length() + key.f1) % numPartitions; } }; // The Tuple2 data is used as the basis for partitioning. dataStream.partitionCustom(strPartitioner, new KeySelector<Tuple2<String, Integer>, Tuple2<String, Integer>>() { @Override public Tuple2<String, Integer> getKey(Tuple2<String, Integer> value) throws Exception { return value; } }).print();