SmartData online conference organized by the JUG.ru committee from Russia took place on Oct 11-14. It was full of interesting presentations and discussions going in 3 threads in parallel. I?d like to briefly state here some of the topics presented. Apache Airflow 2.3 and beyond: What comes next? By Ash Berlin-Taylor from Astronomer.io Ash has […]
Expense optimization is often the main reason for migrating from on-premise to the cloud. The combination of pay-as-you-go and flexible provisioning reduces the problem of overestimated and overprovisioned compute resources. However, in order to actually reduce infrastructure bills, one has to fully understand the cloud pricing model. Otherwise, invoice total may be a huge surprise. […]
In the previous blog entry we saw how a skew in a processed dataset is affecting performance of Spark jobs. We resolved the problem by repartitioning the dataset by a column which naturally splits the data into reasonably sized chunks. But what if we don’t have such columns in our dataset? Or what if you would […]
In the previous blog entry we reviewed a Spark scenario where calling the partitionBy method resulted in each task creating as many files as you had days of events in your dataset (which was too much and caused problems). We fixed that by calling the repartition method. But will repartitioning your dataset always be enough? […]
This is the first of a series of articles explaining the idea of how the shuffle operation works in Spark and how to use this knowledge in your daily job as a data engineer or data scientist. It will be a case-by-case explanation, so I will start with showing you a code example which does […]
In this post you will become familiar with some more advanced Sqoop options. We will be discussing a very specific usecase of Postgres database and a schema containing UUID and WKB types. You might find them useful even if you are working with different database or some other data types.