Blog

SmartData conference 2021

The SmartData online conference, organized by the JUG.ru committee from Russia, took place on Oct 11-14. It was full of interesting presentations and discussions running in 3 parallel tracks. I'd like to briefly summarize some of the topics presented. Apache Airflow 2.3 and beyond: What comes next? By Ash Berlin-Taylor from Astronomer.io Ash has […]

How to waste money in cloud

Cost optimization is often the main reason for migrating from on-premises infrastructure to the cloud. The combination of pay-as-you-go pricing and flexible provisioning reduces the problem of overestimated and overprovisioned compute resources. However, to actually reduce infrastructure bills, one has to fully understand the cloud pricing model. Otherwise, the invoice total may come as a huge surprise. […]

Spark shuffle – Case #3 – using salt in repartition

In the previous blog entry we saw how skew in a processed dataset affects the performance of Spark jobs. We resolved the problem by repartitioning the dataset by a column which naturally splits the data into reasonably sized chunks. But what if we don’t have such a column in our dataset? Or what if you would […]

Spark shuffle – Case #2 – repartitioning skewed data

In the previous blog entry we reviewed a Spark scenario where calling the partitionBy method resulted in each task creating as many files as there were days of events in the dataset (which was too many and caused problems). We fixed that by calling the repartition method. But will repartitioning your dataset always be enough? […]

Sqoop and support for custom types

In this post you will become familiar with some of the more advanced Sqoop options. We will discuss a very specific use case: a Postgres database with a schema containing UUID and WKB types. You might find these options useful even if you are working with a different database or other data types.