Sharing the magician’s toolkit.
Technology & Tools
A Nordic telecom company, one of the leading global contenders.
Apache Spark failures affected a production pipeline that was part of a data delivery engagement. Failing to deliver the data on time meant a missed SLA and, subsequently, fines, angry customers, and reputational damage. On top of that, a permanent fix would save the time spent restarting failed data jobs, reduce maintenance effort, improve reliability, and put an end to failures occurring over the weekends.
We identified the problem as a performance issue in an Apache Spark job, caused by an incorrect partitioning configuration. We fixed the job, keeping in mind that its root cause is something that can be prevented with the appropriate know-how. We therefore made sure to be transparent about the case and our work on fixing the issue: we led a knowledge-sharing session and created a KB article for the maintenance team, ensuring they would be able to fix such issues in other jobs by themselves and better understand the applications they own.
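For context, a partitioning misconfiguration of this kind is often corrected by overriding Spark's shuffle-partition count and executor memory at submit time. A minimal sketch follows; the specific values, config keys chosen, and the job filename are illustrative assumptions, not the client's actual configuration:

```shell
# Illustrative only: override the default shuffle partition count
# (Spark's default of 200 is often wrong for a given data volume)
# and give executors more headroom.
spark-submit \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.executor.memory=8g \
  the_job.py
```

The same settings can equally be applied in code via `SparkSession.builder.config(...)`, or per-stage with an explicit `repartition()` call.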
Our session covered:
- A brief explanation of Spark partitioning – how to tune memory and how partitioning relates to the job itself.
- A description of how the issue happened – what exactly Spark did on the cluster and why.
- A short note on how to mitigate such problems in the future and how to anticipate them during development, including 'further reading' pointers to more elaborate articles on the topic.
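To illustrate the kind of tuning covered in the session: a common rule of thumb is to size Spark partitions at roughly 100–200 MB each, while keeping at least one partition per core so the cluster stays busy. A minimal sketch of that heuristic in plain Python, where the 128 MB target and the helper name are our illustrative assumptions rather than the values used in the actual fix:

```python
# Rule-of-thumb partition sizing: aim for ~128 MB per partition,
# bounded below by the cluster's total core count so every core has work.
# All numbers here are illustrative, not the client's configuration.

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # ~128 MB per partition

def suggest_partitions(input_bytes: int, total_cores: int) -> int:
    """Suggest a shuffle partition count for a Spark job."""
    by_size = -(-input_bytes // TARGET_PARTITION_BYTES)  # ceiling division
    return max(by_size, total_cores)

# Example: a 10 GB shuffle on a 32-core cluster.
# Here the size-based count (80 partitions) dominates the core count.
n = suggest_partitions(10 * 1024**3, 32)
```

The suggested value would then feed into `spark.sql.shuffle.partitions` or an explicit `repartition()` call, and be revisited as data volumes grow.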
By fixing the issue we ensured no more missed SLAs, saving manual labour and other maintenance costs. And by sharing the relevant knowledge with the team, we ensured that future occurrences of such issues will not have the same negative impact.