Most programming languages offer an SDK that can be installed without touching the system-wide setup. Python offers venv, while the Java Development Kit needs only the JAVA_HOME variable to be set,…
In the real world, data pipelines seldom come as a completely independent sequence of operations. Usually, they depend on one another, occasionally easy…
When working on more advanced analytics (or reports), you may stumble upon the problem of doing a self-join. Be it all pairs of available products,…
In the previous article, we covered anonymization and pseudonymization – techniques used in the context of ensuring data privacy, and more specifically, in the…
Despite great efforts to separate the interface from the implementation (as SQL does), the pesky details always turn out to matter when deploying to production, either when performance…
The article "How does storage organisation affect query performance" described a number of principles on how to model data in Amazon…
Compliance with the GDPR can be profitable when done right. Apart from saving on legal fees and avoiding customer attrition, you can also…
Expense optimization is often the main reason for migrating from on-premises to the cloud. The combination of pay-as-you-go and flexible provisioning reduces the problem of…
Why use salt in repartition? In the previous blog entry, we saw how skew in a processed dataset affects the performance of Spark…
In the previous blog entry we reviewed a Spark scenario where calling the partitionBy method resulted in each task creating as many files as you had days…
This is the first in a series of articles explaining how the shuffle operation works in Spark and how to use…
In this post you will become familiar with some more advanced Sqoop options. We will be discussing a very specific use case of Postgres database…