Spark shuffle – Case #3 – using salt in repartition

In the previous blog entry we saw how a skew in a processed dataset is affecting performance of Spark jobs. We resolved the problem by repartitioning the dataset by a column which naturally splits the data into reasonably sized chunks. But what if we don’t have such columns in our dataset? Or what if you would […]