Sharing the magician’s toolkit.

Case study

Empowering the client’s team with the knowledge to future-proof their solutions.

Challenge

An Apache Spark job had become unstable and started to fail consistently. Because the job was part of a production data pipeline, the failures could have a significant business impact for the client. The cause was an incorrect partitioning configuration. To prevent such occurrences, we needed to show the team how to maintain the solution and fix issues that might arise in the future.

Solutions

We fixed the root issue so that the job succeeds on every run, eliminating the errors and the need to restart the pipeline manually. Future-proofing involved showing the maintenance team what we had done and running a dedicated know-how sharing session.

Technology & Tools

Client

A Nordic telecom company, one of the leading global contenders.

Opportunity

Apache Spark failures affected the production pipeline, which was part of a data delivery engagement. Failing to deliver this data on time means a missed SLA and, subsequently, fines, angry customers and reputational damage. On top of that, a permanent fix would save the time spent restarting failed data jobs. It would reduce maintenance, improve reliability and solve the problem of failures occurring over weekends.

Delivery

We identified the problem as a performance issue in an Apache Spark job, caused by an incorrect partitioning configuration. We fixed the job, keeping in mind that the root cause is something that can be prevented with the appropriate know-how. We therefore made sure to be transparent about the case and about our work on fixing the issue. We led a knowledge-sharing session and created a knowledge base (KB) article for the maintenance team, so that they would be able to fix such issues in other jobs by themselves and better understand the applications they own.

Our session covered:

A brief explanation of Spark partitioning: how to tune memory and how partitioning relates to the job itself.
A description of how the issue happened: what exactly Spark did on the cluster and why.
A short note on how to mitigate such problems in the future and how to anticipate them during development, including ‘further reading’ with more elaborate articles on the topic.
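
To make the link between partitioning and memory concrete, here is a minimal sketch of a common rule of thumb, not the actual fix applied in this case. The case study does not state which setting was wrong, so the data volume, the core count, the 128 MiB target and the helper names are all illustrative assumptions. The only Spark-specific fact used is that `spark.sql.shuffle.partitions` defaults to 200.

```python
import math

MIB = 1024 * 1024


def per_partition_mib(total_bytes, partitions):
    """Average amount of data (in MiB) each partition has to hold."""
    return total_bytes / partitions / MIB


def suggest_partition_count(total_bytes, target_partition_bytes=128 * MIB, total_cores=1):
    """Rule-of-thumb partition count (hypothetical helper, not a Spark API).

    Picks enough partitions to keep each one near the target size, then
    rounds up to a multiple of the core count so every core gets work.
    """
    if total_bytes <= 0:
        return total_cores
    by_size = math.ceil(total_bytes / target_partition_bytes)
    return max(total_cores, math.ceil(by_size / total_cores) * total_cores)


# Assumed workload: 500 GiB shuffled on a cluster with 48 executor cores.
shuffle_bytes = 500 * 1024 * MIB

# With the default of 200 shuffle partitions, each partition is very large,
# which is a typical source of spills and out-of-memory failures.
default_size = per_partition_mib(shuffle_bytes, 200)

# A partition count sized from the data volume instead of the default.
suggested = suggest_partition_count(shuffle_bytes, total_cores=48)

print(f"default: {default_size:.0f} MiB per partition")
print(f"suggested partitions: {suggested}")
```

A value derived this way would typically be passed to the job as `spark.sql.shuffle.partitions` or used in a `repartition` call. The right target size depends on executor memory and on the job itself, so it should be validated against the real workload.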

Effect

We fixed the issue, ensuring no more missed SLAs and saving manual labour and other maintenance costs. By sharing the relevant knowledge with the team, we helped ensure that future occurrences of such issues will not have a negative impact.