22 February 2022

SmartData conference 2021

Author
Maryna Dubavets

Senior Software Engineer


The SmartData online conference, organized by the JUG.ru team from Russia, took place on October 11-14. It was full of interesting presentations and discussions running in three parallel tracks. I'd like to briefly cover some of the topics presented.

Apache Airflow 2.3 and beyond: What comes next? by Ash Berlin-Taylor from Astronomer.io

Ash has been a contributor to Airflow for almost four years. He was the Release Manager for the majority of the 1.10 release series, and he also rewrote a lot of the Scheduler internals to make it highly available and increase performance by an order of magnitude. Outside of Airflow, he is the Director of Airflow Engineering at Astronomer.io, where he runs a team of developers contributing to the open-source Airflow project.

Ash started by mentioning Airflow's recent achievements, like the 10-100x performance speed-up of Airflow 2.0, released in December last year. Airflow 2.2, which is about to be released, contains the following updates:

i. AIP-39: Run DAGs on customisable schedules. The confusing "execution_date" is deprecated; "logical_date", "data_interval_start" and "data_interval_end" are introduced instead (a minimal sketch follows this list).

ii. AIP-40: Any operator can "defer" itself. Deferrable (async) operators are a generalization of the smart sensors introduced earlier and help avoid wasting resources while waiting for an external dependency to complete (see the second sketch below).
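To make item i a bit more concrete, here is a minimal sketch (assuming Airflow 2.2; the DAG and task names are made up for illustration) of a TaskFlow task reading the new data-interval fields from the run context instead of the deprecated execution_date:

```python
# Minimal sketch: a TaskFlow task reading the AIP-39 data-interval fields
# that Airflow 2.2+ injects into the task context.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval="@daily", start_date=datetime(2021, 10, 1), catchup=False)
def interval_demo():
    @task
    def show_interval(logical_date=None, data_interval_start=None, data_interval_end=None):
        # Airflow fills these parameters from the run's context at execution time.
        print(f"run {logical_date}: data from {data_interval_start} to {data_interval_end}")

    show_interval()

interval_demo_dag = interval_demo()
```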
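And for item ii, a hedged sketch of a deferrable operator (again assuming Airflow 2.2; the operator itself is hypothetical): the task releases its worker slot by deferring to a trigger and resumes once the trigger fires.

```python
# Sketch of a deferrable (async) operator: the worker slot is freed
# while waiting, and the triggerer process wakes the task up later.
from datetime import timedelta
from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger

class WaitAnHourOperator(BaseOperator):
    """Hypothetical operator: waits an hour without occupying a worker slot."""

    def execute(self, context):
        # Hand control to the triggerer; the worker slot is released meanwhile.
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(hours=1)),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Invoked on a worker again once the trigger fires.
        self.log.info("External wait finished, continuing work")
```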

The roadmap for version 2.3 is not yet finalised, and Ash presented his own vision of Airflow's possible future.

1. DAGs should be a joy to write and read, which makes Airflow the orchestrator of choice for any workflow. Make it easier to operate confidently.

2. Dynamic DAGs. The mapped-task concept could make it possible to launch as many copies of an operator as needed, like map tasks, to process all files in parallel. A parametrized DAG defined once could be run with different parameters (a rough sketch of what this could look like follows this list).

3. Get rid of the UNIQUE constraint on the execution_date value to allow multiple runs of the same DAG to be scheduled at the same time.

4. Introduce airflowctl: a CLI over the REST API.

5. Solve the "untrusted worker" problem: set access control per connection.

6. Make it easier to assign lifecycle hooks and DAG notifications, like sending a Slack notification on failure.

7. A better cross-DAG story: event-triggered DAGs and the introduction of a Data object concept. It should be possible to bind a storage folder to a Data object reference and assign a hook that triggers task execution when the content changes. When the DAG run completes, all temporary files could be cleaned up automatically.
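Regarding item 2, here is a rough sketch of what the mapped-task idea could look like in DAG code, written against the expand()-style dynamic task mapping that later shipped in Airflow 2.3 (the file list and task names are invented for illustration):

```python
# Sketch of dynamic task mapping (the API that shipped in Airflow 2.3):
# one copy of the processing task is launched per file, all in parallel.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=datetime(2022, 1, 1), catchup=False)
def mapped_files_demo():
    @task
    def list_files():
        # A real DAG would list the files in a bucket or folder here.
        return ["a.parquet", "b.parquet", "c.parquet"]

    @task
    def process(path):
        print(f"processing {path}")

    # expand() creates one mapped task instance per returned file path.
    process.expand(path=list_files())

mapped_files_dag = mapped_files_demo()
```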

Among even more distant future ideas, the following were presented:

1. Add DAG versioning and reflect the historical state of a DAG in the UI. Introducing versioning would help simplify DAG deployments.

2. Streaming support and better support for micro-batch and long-running batch jobs.

3. Support for and integration with machine learning tools: model hosting and model comparison.

Big Data Tools presentation by Oleg Chiruhin from JetBrains

Big Data Tools is a powerful plugin for IntelliJ IDEA that lets you work with Zeppelin notebooks, monitor Spark and Hadoop applications, and explore cloud storage systems.

I. File storage browser. Possible options are local FS, HDFS, Google Storage, S3, Azure, MinIO and some more. When local and cloud storages are configured, it is possible to transfer files in either direction. It is also possible to preview a sample of a remote file with different view options: plain text (for CSV) or table view:

Table view of .parquet file in Big Data Tools

II. A Zeppelin notebook editor with SSH tunnelling for accessing remote notebooks within a restricted network.

SSH tunneling configuration for Zeppelin connection

BDT adds the rich Scala autocompletion that is missing in the Zeppelin web interface. There is also the ability to add external modules and JARs, not only the standard library, for autocompletion and in-place documentation.

Autocompletion in BDT

Navigation to code declarations and other IDE features are supported.

III. Markdown support

IV. Easy imports

V. Intermediate type visibility

Returning types in grey

VI. Plot execution results

Result chart

VII. Ability to monitor job execution

Hadoop monitoring in BDT

Hadoop 3: Erasure coding catastrophe by Denis Efarov from OK (Odnoklassniki)

Denis Efarov, a lead developer at Odnoklassniki, has been working with big data since 2013. Since 2018, he has been designing and developing the data storage and processing platform for the Odnoklassniki project.

Denis told an extremely exciting and dramatic story about a migration from Hadoop 2.7.3 to 3.1.4 and back, which cost the company more than a year of wasted work and around 115 TB of lost production data.

Hadoop 3 introduced an erasure coding mechanism as an alternative to replication for protecting against data loss. With the RS(6,3) scheme, 3 parity blocks of Reed-Solomon codes are stored for every 6 blocks of original data. This leads to only 50% disk redundancy, compared to 200% for replication with a 3x factor, with a similar reliability guarantee. Of course, the encoding/decoding process slows down read/write performance, but saving space was the priority, and at OK's scale that meant tens of petabytes.
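A quick illustration of the storage-overhead arithmetic behind that trade-off (a toy calculation, not OK's actual tooling):

```python
# Storage overhead: extra bytes stored per byte of original data.
def replication_overhead(factor):
    # 3x replication keeps (factor - 1) extra copies of every block.
    return (factor - 1) * 100.0

def erasure_coding_overhead(data_blocks, parity_blocks):
    # RS(6,3) stores 3 parity blocks per 6 data blocks.
    return parity_blocks / data_blocks * 100.0

print(replication_overhead(3))        # 200.0 -> 200% redundancy
print(erasure_coding_overhead(6, 3))  # 50.0  -> 50% redundancy
```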

The whole migration was expected to take 8-10 months, but after half a year they noticed that some Parquet files were broken. At first glance the damage seemed to appear randomly, but after deep analysis they traced it to an issue with parity block cleanup (see https://issues.apache.org/jira/browse/HDFS-14768 and https://issues.apache.org/jira/browse/HDFS-15186). By that time 2 out of 3 redundant clusters had already been migrated and 4,000,000 files were broken. 90% of them were restored from the remaining backup cluster. To restore the rest, they developed a tricky decoding tool that tried to guess the original data from all possible byte combinations, guided by the Parquet file structure. This way about another 9.9% of the files were restored, and approximately 40,000 production files (115 TB) were lost forever.

