spark

Posted by neverset on September 5, 2020

data processing

data skewness

Data skew primarily refers to a non uniform distribution in a dataset. this causes one task takes much more time than other tasks when shuffling by key is used. Exact location of data skewness can be found with: 1) check shuffle operator: groupByKey、countByKey、reduceByKey、join 2) check logs: check which stage takes more time

Solution

Aggregate data source
  • aggregate value in the key, so that one key only maps to one value
  • aggregate data to generate coarse-grained data
increase reduce parallelism of shuffle operations
Random key to realizes double aggregation
change reduce join to map join

eliminate the problem of data skew that may be caused by the join operation, especially when one rdd is large and the other is small

sampling skewed key

take out the key of skewed data to a seperated rdd

using random numbers and extended rdd to join