Spark shuffle – Case #2 – repartitioning skewed data

In the previous blog entry we reviewed a Spark scenario where calling the partitionBy method resulted in each task creating as many output files as there were days of events in the dataset, which was far too many and caused problems. We fixed that by calling the repartition method before the write. But will repartitioning your dataset always be enough? […]
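The fix from the previous post boils down to repartitioning by the same column you later pass to partitionBy, so that each task holds the data for a single day and writes a single file per day. A minimal Scala sketch, assuming a hypothetical event_date column and hypothetical input and output paths:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("repartition-example").getOrCreate()

    // Hypothetical input path and column name, for illustration only.
    val events = spark.read.parquet("/data/events")

    events
      .repartition(col("event_date"))    // co-locate each day's rows in one task
      .write
      .partitionBy("event_date")         // one directory per day, one file per task
      .parquet("/data/events_by_day")    // hypothetical output path

Note that repartition hashes rows by the column value, so if one day holds far more events than the others, that day's single task (and single output file) grows accordingly, which is exactly the skew question the post title raises.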

Sqoop and support for custom types

In this post you will become familiar with some of Sqoop's more advanced options. We will discuss a very specific use case: a Postgres database with a schema containing UUID and WKB columns. You might find these options useful even if you are working with a different database or other data types.
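For custom types like Postgres UUID, Sqoop's default JDBC type mapping tends to fail, and the usual workaround is to override it per column with the --map-column-java option. A minimal sketch of such an import, with hypothetical connection details, table, and column names (binary types such as WKB may additionally need a free-form query that casts them on the database side):

    # Force the Postgres UUID column "id" to be imported as a Java String,
    # overriding Sqoop's default type mapping.
    sqoop import \
      --connect jdbc:postgresql://db.example.com/events \
      --username sqoop \
      --table measurements \
      --map-column-java id=String \
      --target-dir /data/measurements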