The technicalities

NPM & N on MacOSX

Most programming languages have their SDK available for non-system-wide installation. Python offers venv, while Java Development Kit needs only the JAVA_HOME variable to be set,…

Apache Airflow 15 min read

Managing inter-DAG dependencies in Airflow

In the real world, data pipelines rarely come as a completely independent sequence of operations. Usually, they share dependencies on one another, occasionally easy…

The technicalities 15 min read

Big cartesian join in Big Query

When working on more advanced analytics (or reports), you may stumble upon the problem of doing a self-join. Be it all pairs of available products,…

Big Data ouTDo 10 min read

GDPR, the forgotten done right

In the previous article, we covered anonymization and pseudonymization – techniques used in the context of ensuring data privacy, and more specifically, in the…

The technicalities 10 min read

How does storage organisation affect query performance?

Despite great efforts to separate the interface from the implementation (as SQL does), the pesky details always turn out to matter when deploying to production, either when performance…

The technicalities 10 min read

Storage organisation vs query performance – examples

The article "How does storage organisation affect query performance?" described a number of principles on how to model data in Amazon…

The technicalities 10 min read

Obtaining value from GDPR with solutions that work for your bottom line.

Compliance with the GDPR regulations can be profitable when done right. Apart from saving on legal fees and avoiding customer attrition, you can also…

The technicalities 10 min read

How to waste money in the cloud

Expense optimization is often the main reason for migrating from on-premise to the cloud. The combination of pay-as-you-go and flexible provisioning reduces the problem of…

The technicalities 10 min read

Spark shuffle – Case #3 – using salt in repartition

Why use salt in repartition? In the previous blog entry we saw how skew in a processed dataset affects the performance of Spark…

The technicalities 10 min read

Spark shuffle – Case #2 – repartitioning skewed data

In the previous blog entry we reviewed a Spark scenario where calling the partitionBy method resulted in each task creating as many files as you had days…

The technicalities 10 min read

Spark shuffle – Case #1 – partitionBy and repartition

This is the first in a series of articles explaining how the shuffle operation works in Spark and how to use…

The technicalities 10 min read

Sqoop and support for custom types

In this post you will become familiar with some more advanced Sqoop options. We will be discussing a very specific use case of Postgres database…