In the previous blog entry we reviewed a Spark scenario in which calling the partitionBy method made each task create as many files as there were days of events in the dataset (far too many, which caused problems). We fixed that by calling the repartition method. But will repartitioning your dataset always be enough? […]

Author: Marcin
Sqoop and support for custom types
In this post you will become familiar with some of the more advanced Sqoop options. We will discuss a very specific use case: a Postgres database with a schema containing UUID and WKB types. You might find these options useful even if you are working with a different database or other data types.
Spark shuffle – Case #1 – partitionBy and repartition
This is the first in a series of articles explaining how the shuffle operation works in Spark and how to use this knowledge in your daily job as a data engineer or data scientist. It will be a case-by-case explanation, so I will start by showing you a code example which does […]