data processing
data skewness
Data skew refers to a non-uniform distribution of keys in a dataset. When a shuffle by key is used, all records with the same key go to the same task, so the task that receives a heavily populated key takes much longer than the other tasks. The exact location of the skew can be found in two ways:
1) check the shuffle operators in the code: groupByKey, countByKey, reduceByKey, join
2) check the logs to see which stage takes the most time
Solution
Aggregate at the data source
- aggregate the values for each key, so that each key maps to only one value
- aggregate the data to a coarser granularity
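The first bullet can be illustrated with a minimal sketch in plain Python (no Spark involved). The function name `pre_aggregate` and the sample records are made up for illustration; the point is only that after pre-aggregation a hot key contributes one record to the later shuffle instead of many.

```python
from collections import defaultdict

def pre_aggregate(records):
    """Collapse (key, value) records so that each key maps to one summed value."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)

# Hypothetical input: key "a" is hot and appears five times.
records = [("a", 1)] * 5 + [("b", 1), ("c", 1)]
aggregated = pre_aggregate(records)
print(aggregated)  # one entry per key, so "a" is no heavier than "b" or "c"
```

This only applies when the downstream job needs the aggregated value rather than the individual rows.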
Increase the reduce-side parallelism of shuffle operations
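A rough sketch of why more reduce tasks can help, and where it stops helping. This is plain Python that imitates hash partitioning with `key % num_partitions`; the helper `partition_sizes` and the workloads are assumptions for illustration. In Spark the equivalent knob would typically be the `numPartitions` argument of a shuffle operator such as `reduceByKey`.

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count records per partition under simple hash partitioning (key mod partitions)."""
    sizes = Counter(key % num_partitions for key in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]

# Hypothetical workload: 12 distinct integer keys with 10 records each.
keys = [k for k in range(12) for _ in range(10)]
largest_with_3 = max(partition_sizes(keys, 3))
largest_with_6 = max(partition_sizes(keys, 6))
print(largest_with_3, largest_with_6)  # the largest task shrinks as parallelism grows

# A single hot key is not helped: all of its records still land in one partition.
hot = [7] * 100
print(max(partition_sizes(hot, 3)), max(partition_sizes(hot, 6)))
```

So this approach relieves skew caused by many keys sharing a task, but not skew caused by one dominant key.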
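Use random key prefixes to realize two-stage aggregation
add a random prefix to each key and aggregate on the prefixed key, then remove the prefix and aggregate again on the original key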
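A minimal sketch of the random-key idea, written in plain Python rather than Spark. The names `sum_by_key`, `two_stage_sum`, and `num_salts` are hypothetical. It assumes the aggregation is a sum; the technique relies on the aggregation being associative and commutative, so that partial results can be combined.

```python
import random
from collections import defaultdict

def sum_by_key(pairs):
    """Plain single-stage aggregation: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def two_stage_sum(pairs, num_salts=4, seed=0):
    rng = random.Random(seed)
    # Stage 1: attach a random salt to every key, then aggregate on (salt, key).
    # A hot key is split into up to num_salts smaller groups.
    salted = [((rng.randrange(num_salts), key), value) for key, value in pairs]
    partial = sum_by_key(salted)
    # Stage 2: drop the salt and aggregate the partial sums on the original key.
    unsalted = [(key, value) for (_salt, key), value in partial.items()]
    return sum_by_key(unsalted)

# Hypothetical skewed input: "hot" has 1000 records, "cold" has 3.
pairs = [("hot", 1)] * 1000 + [("cold", 1)] * 3
result = two_stage_sum(pairs)
print(result == sum_by_key(pairs))  # both approaches agree on the final totals
```

The second stage sees at most `num_salts` records per original key, so it is cheap even for the hot key.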
Change reduce join to map join
joining on the map side against a broadcast copy of the small RDD avoids the shuffle, which eliminates the data skew that a join operation may cause; this applies when one RDD is large and the other is small enough to fit in memory
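The map join can be sketched in plain Python as a hash lookup against the small side. The function `map_side_join` and the sample data are illustrative assumptions; in Spark the small side would typically be collected and shared with `SparkContext.broadcast`, and the lookup would run inside a map over the large RDD.

```python
def map_side_join(large, small):
    """Inner-join two lists of (key, value) pairs without grouping the large side by key."""
    # Build an in-memory lookup table from the small side (the "broadcast" copy).
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    # Each large-side record is joined where it sits, so no key piles up in one task.
    return [(key, (lv, sv)) for key, lv in large for sv in lookup.get(key, [])]

# Hypothetical data: key 1 is frequent on the large side, key 3 has no match.
large = [(1, "order-a"), (1, "order-b"), (2, "order-c"), (3, "order-d")]
small = [(1, "alice"), (2, "bob")]
joined = map_side_join(large, small)
print(joined)
```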
Sample the skewed keys
sample the data to find the skewed keys, then take the records with those keys out into a separate RDD and process them separately
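A small sketch of the sampling step and the split, in plain Python. The helpers `find_skewed_keys` and `split_by_skew`, the sample size, and the `top_n` cutoff are all assumptions for illustration; in Spark the sample would typically come from `RDD.sample` and the split from two `filter` calls.

```python
import random
from collections import Counter

def find_skewed_keys(pairs, sample_size=100, top_n=1, seed=0):
    """Estimate the heaviest keys from a random sample instead of a full count."""
    rng = random.Random(seed)
    sample = rng.sample(pairs, min(sample_size, len(pairs)))
    counts = Counter(key for key, _ in sample)
    return {key for key, _ in counts.most_common(top_n)}

def split_by_skew(pairs, skewed_keys):
    """Separate the records of skewed keys from the rest."""
    skewed = [p for p in pairs if p[0] in skewed_keys]
    normal = [p for p in pairs if p[0] not in skewed_keys]
    return skewed, normal

# Hypothetical input: "hot" holds 900 of 1000 records, the other 100 keys hold one each.
pairs = [("hot", i) for i in range(900)] + [(f"k{i}", i) for i in range(100)]
skewed_keys = find_skewed_keys(pairs)
skewed, normal = split_by_skew(pairs, skewed_keys)
print(skewed_keys, len(skewed), len(normal))
```

The separated part can then be handled with a skew-specific technique (for example random key prefixes) while the rest goes through the normal path, and the two results are combined at the end.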