Spark

data processing

Data skew

Data skew refers to a non-uniform distribution of records across the keys of a dataset. When an operation shuffles data by key, all records with the same key are sent to the same task, so the task that receives a heavily populated key takes much longer than the other tasks. The exact location of the skew can be found in two ways:

1) Check the shuffle operators: groupByKey, countByKey, reduceByKey, join.
2) Check the logs to see which stage takes the most time.
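Once a suspicious shuffle operator is found, counting records per key shows which key is responsible. The following is a minimal plain-Python sketch of that check, not Spark code; the records and the key names are made up for illustration.

```python
from collections import Counter

# Hypothetical (key, value) records; "a" plays the role of the hot key.
records = [("a", 1)] * 8 + [("b", 1), ("c", 1)]

# Count records per key, the same information countByKey gives for an RDD.
key_counts = Counter(key for key, _ in records)

# The key holding the largest share of the records is the skew candidate.
hottest_key, hottest_count = key_counts.most_common(1)[0]
share = hottest_count / len(records)
```

A key that holds most of the records, as "a" does here, is the one the solutions below have to deal with.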

Solutions

Aggregate the data source

Aggregate the values of each key, so that one key maps to only one value.
Aggregate the data to generate coarser-grained data.
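The first point can be sketched in plain Python, assuming the values can be combined by summation (an assumption made for illustration; the sample records are hypothetical). After this step each key carries a single value, so a later shuffle moves one record per key.

```python
from collections import defaultdict

# Hypothetical raw records with several values per key.
raw = [("a", 3), ("a", 5), ("b", 2), ("a", 1)]

# Collapse all values of a key into one value before the data reaches the job.
totals = defaultdict(int)
for key, value in raw:
    totals[key] += value

aggregated = dict(totals)
```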

Increase the reduce parallelism of shuffle operations
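In Spark this is usually done by passing a larger partition count to the shuffle operator, for example the numPartitions argument of reduceByKey. The plain-Python sketch below only models the effect, assuming integer keys and simple modulo hash partitioning; the function name and the data are hypothetical.

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count records per partition under simple modulo hash partitioning."""
    sizes = Counter(key % num_partitions for key in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]

# Hypothetical data: 12 distinct integer keys with 10 records each.
keys = [k for k in range(12) for _ in range(10)]

few = partition_sizes(keys, 2)    # each reduce task handles 6 keys
many = partition_sizes(keys, 6)   # each reduce task handles 2 keys

# A single hot key still lands in one partition, whatever the partition count.
hot_only = partition_sizes([0] * 100, 6)
```

More partitions mean fewer keys per task, which helps when several moderately large keys share a task. It does not help when one key alone is the problem, as the last line shows.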

Use random keys to realize double aggregation
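The idea is to add a random prefix to each key so that the records of a hot key spread over several tasks, aggregate once per prefixed key, then remove the prefix and aggregate again. Below is a minimal plain-Python sketch, assuming the aggregation is a sum (the approach relies on the aggregation being associative and commutative). The function names, the salt count, and the data are hypothetical.

```python
import random
from collections import defaultdict

def sum_by_key(pairs):
    """Sum the values of each key (stand-in for one aggregation stage)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def double_aggregate(pairs, num_salts=4, seed=0):
    rng = random.Random(seed)
    # Stage 1: attach a random salt to every key and aggregate per salted key,
    # so one hot key is split into up to num_salts smaller groups.
    salted = [((rng.randrange(num_salts), key), value) for key, value in pairs]
    partial = sum_by_key(salted)
    # Stage 2: drop the salt and aggregate the partial results per original key.
    unsalted = [(key, value) for (_, key), value in partial.items()]
    return sum_by_key(unsalted)

# Hypothetical skewed data: one hot key and one small key.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
result = double_aggregate(records)
```

The final result is the same as a single aggregation; only the intermediate grouping changes.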

Change reduce join to map join

A map join sends the small RDD to every task, so the large RDD is joined without a shuffle. This eliminates the data skew that the join operation may cause, especially when one RDD is large and the other is small.
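A minimal plain-Python sketch of a map join follows. The small side is held as an in-memory lookup table, which in Spark would typically be a broadcast variable, and each record of the large side is joined locally. The data and the function name are hypothetical, and an inner join is assumed.

```python
# Small side: fits in memory, so it is kept as a lookup table.
small = {"a": "Alice", "b": "Bob"}

# Large side: stays where it is and is never grouped by key.
large = [("a", 10), ("b", 20), ("a", 30), ("c", 40)]

def map_join(large_records, lookup):
    """Inner join done record by record against the in-memory lookup table."""
    return [(key, (value, lookup[key]))
            for key, value in large_records
            if key in lookup]

joined = map_join(large, small)
```

Because no record of the large side is moved by key, a hot key cannot overload a single task.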

Sample the skewed keys

Take the records of the skewed keys out into a separate RDD.
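The two steps, sampling to find the skewed key and then splitting it off, can be sketched in plain Python as follows. This is an illustration under assumed data, not Spark code; the function name, the sample size, and the records are hypothetical, and only the single most frequent key is treated as skewed.

```python
import random
from collections import Counter

def split_skewed(pairs, sample_size=100, seed=0):
    rng = random.Random(seed)
    # Sample a subset of the records instead of counting the whole dataset.
    sample = rng.sample(pairs, min(sample_size, len(pairs)))
    # Treat the most frequent key in the sample as the skewed key.
    skewed_key = Counter(key for key, _ in sample).most_common(1)[0][0]
    # Split the data: skewed records in one collection, the rest in another.
    skewed = [(k, v) for k, v in pairs if k == skewed_key]
    rest = [(k, v) for k, v in pairs if k != skewed_key]
    return skewed_key, skewed, rest

# Hypothetical data: 950 records for one key, 50 keys with one record each.
records = [("hot", i) for i in range(950)] + [("k%d" % i, i) for i in range(50)]
skewed_key, skewed, rest = split_skewed(records)
```

The two parts can then be processed separately, for example with random keys for the skewed part and the normal path for the rest.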