Managing a Databricks Spark Environment with Ansible
Bringing configuration management to Big Data #
Apache Spark is an absolutely amazing tool for working with big data. It abstracts away the complexity that used to come with hand-written Hadoop MapReduce jobs, and reduces most coding to simple Scala, Python, or SQL statements.
Databricks takes it a step further by drastically reducing the DevOps complexity involved in using Spark. They do this by providing native implementations of notebooks, jobs, easy-to-use cluster configurations (optimized for Spark), scheduling, and monitoring. Sure, all of these can be done natively in AWS, but it ends up being a pain. Instead, you can just set up your jobs via the Databricks UI.
Easy, right? Yes, but that kind of ease comes with a trade-off: it becomes much more difficult to track versions of your configuration, especially when your Spark job is part of a larger pipeline.
Thankfully, Databricks also provides an excellent REST API for managing deployments yourself. While our team at First was building a new pipeline, we took the opportunity to develop an Ansible Galaxy role to manage parts of our Databricks environment.
What is Ansible? #
Ansible is an open-source configuration management and deployment tool. Written in Python and configured via YAML, its main purpose is managing multiple servers that require similar configurations. A prime example of Ansible usage is SSH'ing into many servers simultaneously to apply configuration updates. The main benefit of Ansible is its focus on idempotency: only the actions required to reach a specified end state are executed.
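For a sense of what idempotency looks like in practice, here is a minimal, hypothetical task (unrelated to the Databricks role itself) that only makes a change when the file on the server actually differs from the source, so the play can be re-run safely:

```yaml
# Hypothetical example of an idempotent task: the copy module compares the
# file on the target with the local source and only reports "changed"
# (and rewrites the file) when the two differ.
- name: Ensure application config is present
  copy:
    src: files/app.conf
    dest: /etc/myapp/app.conf
    mode: "0644"
```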
At First, we use Ansible for our application DevOps to manage application servers and AWS resources such as RDS and ElastiCache.
Some of the benefits of Ansible, specifically as they relate to our use with Databricks (the first two are sketched just after this list):
- Manage multiple environments via inventory files
- Secret management with Ansible Vault
- Re-use application variables across non-Spark components
- Easily repeatable across multiple developer machines
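As a rough sketch of the first two points, a multi-environment setup might look like the following. The file layout and variable names here are illustrative only, not the role's actual interface:

```yaml
# Hypothetical layout: one inventory per environment, with workspace URLs and
# API tokens kept out of source control via Ansible Vault.
#
# inventories/
#   staging/hosts.yml
#   production/hosts.yml
#
# inventories/staging/hosts.yml:
all:
  hosts:
    localhost:
      ansible_connection: local
  vars:
    databricks_host: "https://<staging-workspace>.cloud.databricks.com"
    # Encrypted with `ansible-vault encrypt_string`, or kept in a vaulted vars file
    databricks_token: "{{ vault_databricks_token }}"
```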
Ansible + Databricks deployment flow #
Our deployment flow is rather simple. We run deployments from our developer machines, in the context of a playbook defined for the rest of our application.
Most of this is done via the Databricks CLI, with some extra tasks that parse JSON output from the REST API so that our playbooks stay idempotent.
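Here is a rough sketch of that pattern. The job name, settings file, and JSON paths are illustrative, and the actual role tasks differ in the details:

```yaml
# Illustrative sketch: list existing jobs with the Databricks CLI, then only
# create the job if a job with that name is not already registered.
- name: List existing Databricks jobs
  command: databricks jobs list --output JSON
  register: existing_jobs
  changed_when: false

- name: Create the job only if it does not exist yet
  command: databricks jobs create --json-file job_settings.json
  when: >
    (existing_jobs.stdout | from_json).jobs | default([])
    | selectattr('settings.name', 'equalto', 'nightly-etl')
    | list | length == 0
```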
Try it! #
In order to efficiently re-use our Ansible tasks, we bundled them into a Galaxy role and open-sourced it: ansible_databricks on Ansible Galaxy
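Pulling the role into your own playbook follows the usual Galaxy workflow. The Galaxy namespace below is assumed from the GitHub account, and the role's variables (API tokens, job definitions, and so on) are described in its README rather than shown here:

```yaml
# Install the role first (namespace assumed from the GitHub account):
#   ansible-galaxy install colemanja91.ansible_databricks
#
# Then reference it from a playbook; role variables come from your inventory
# or vaulted vars files.
- hosts: localhost
  connection: local
  roles:
    - colemanja91.ansible_databricks
```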
Right now, it only supports the following:
- Existence checks for DBFS mount points
- Secrets
- Libraries stored on DBFS
- Jobs
The REST API and CLI support many more functions, and I hope to add them as we branch into those areas. Of particular interest are interactive cluster management and notebook management.
As with any open source project, please feel free to contribute on GitHub by opening issues or PRs!
Feel free to connect with me!
- www.linkedin.com/in/jeremiah-coleman-product
- https://twitter.com/nerds_s
- jeremiah.coleman@daasnerds.com
- https://github.com/colemanja91