In the previous blog entry we reviewed a Spark scenario where calling the partitionBy method made each task create as many files as there were days of events in the dataset, which was too many and caused problems. We fixed that by calling the repartition method. But will repartitioning your dataset always be enough? […]
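
To make the file-count arithmetic in this scenario concrete, here is a toy model in plain Python. It is not Spark code and not the code from the previous entry: the task count, day count, and helper names are all hypothetical, and it assumes the repartition was done on the same day column that partitionBy writes by. It only counts how many files a day-partitioned write would produce when each task emits one file per distinct day it holds.

```python
# Toy model in plain Python (not Spark): count the files of a day-partitioned write.
# All names and numbers below are hypothetical, chosen for illustration only.
NUM_TASKS = 7          # assumed number of write tasks
NUM_DAYS = 30          # assumed number of distinct event days
EVENTS_PER_DAY = 100   # assumed number of events per day

# Each event is represented only by its day index; days are interleaved.
events = [day for _ in range(EVENTS_PER_DAY) for day in range(NUM_DAYS)]


def split_round_robin(events, num_tasks):
    """Spread events over tasks without looking at the day (no repartition)."""
    tasks = [[] for _ in range(num_tasks)]
    for i, day in enumerate(events):
        tasks[i % num_tasks].append(day)
    return tasks


def split_by_day(events, num_tasks):
    """Send every event of a given day to the same task.

    The modulo is a simple stand-in for a hash partitioner on the day column.
    """
    tasks = [[] for _ in range(num_tasks)]
    for day in events:
        tasks[day % num_tasks].append(day)
    return tasks


def count_files(tasks):
    """Each task writes one file per distinct day it holds."""
    return sum(len(set(days)) for days in tasks)


files_before = count_files(split_round_robin(events, NUM_TASKS))
files_after = count_files(split_by_day(events, NUM_TASKS))
print(files_before, files_after)  # 210 30
```

Under these assumptions every task sees all 30 days before the repartition, so the write produces 7 × 30 = 210 files. Once each day is routed to a single task, the total drops to 30, one file per day.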

Spark shuffle – Case #1 – partitionBy and repartition
This is the first in a series of articles explaining how the shuffle operation works in Spark and how to use that knowledge in your daily work as a data engineer or data scientist. It will be a case-by-case explanation, so I will start by showing you a code example which does […]