In the previous blog entry we saw how a skew in a processed dataset is affecting performance of Spark jobs. We resolved the problem by repartitioning the dataset by a column which naturally splits the data into reasonably sized chunks. But what if we don’t have such columns in our dataset? Or what if you would […]
Blog
Spark shuffle – Case #2 – repartitioning skewed data
In the previous blog entry we reviewed a Spark scenario where calling the partitionBy method resulted in each task creating as many files as you had days of events in your dataset (which was too much and caused problems). We fixed that by calling the repartition method. But will repartitioning your dataset always be enough? […]
Spark shuffle – Case #1 – partitionBy and repartition
This is the first of a series of articles explaining the idea of how the shuffle operation works in Spark and how to use this knowledge in your daily job as a data engineer or data scientist. It will be a case-by-case explanation, so I will start with showing you a code example which does […]
Sqoop and support for custom types
In this post you will become familiar with some more advanced Sqoop options. We will be discussing a very specific usecase of Postgres database and a schema containing UUID and WKB types. You might find them useful even if you are working with different database or some other data types.