Spark Summit 2017 - Day 2 Takeaways
Day 2 was a bustle of activity at Spark Summit 2017. I started the day catching up with a bit of work at Workshop Cafe in the Financial District, then made my way down to Moscone for the opening keynotes and a day of sessions.
The official beginning of Spark Summit, billed as Developer Day, had a lot of great content to digest. Most striking were the trends that surfaced independently at different companies, together signaling a clear direction for the immediate future of Spark and big data processing.
## Compute-as-a-service is the future of data
Running a Spark cluster in production (or a cluster for any other data engineering framework) requires a lot of up-front investment, and the costs don't end there. Once usage catches on, you quickly have to deal with multi-tenancy, data security, differing environment needs across teams, and continual maintenance and improvement to make data usage more efficient. Once you consider these factors, it makes much more sense to have one infrastructure team absorb those costs up-front and let the rest of the organization benefit.
Prabhu Kasinathan from PayPal gave a great walkthrough of their in-house solution, which leverages YARN and Apache Livy (a REST service for submitting Spark jobs) to meet their internal customers' needs. Jim Dowling of the KTH Royal Institute of Technology presented Hopsworks, an open-source platform they built to provide Spark Streaming + Kafka as a service to data analyst groups in Sweden. And, of course, the biggest example is Databricks, founded by the creators of Apache Spark (and organizer of Spark Summit), which provides Spark clusters as a service to its customers (and demoed its new auto-scaling clusters in the day 2 keynote).
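To make the as-a-service model concrete, here is a minimal sketch of submitting a Spark job through Livy's REST batch API. The Livy host, jar path, and class name are hypothetical placeholders of my own, not PayPal's actual setup:

```scala
import java.net.{HttpURLConnection, URL}

object LivySubmitSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical Livy endpoint; a real deployment would sit behind
    // the infrastructure team's gateway with auth, quotas, etc.
    val url = new URL("http://livy.internal:8998/batches")

    // Livy's batch API accepts a JSON payload describing the job.
    // The jar path and class name below are placeholders.
    val payload =
      """{
        |  "file": "hdfs:///jobs/etl-example.jar",
        |  "className": "com.example.EtlJob",
        |  "executorCores": 2,
        |  "numExecutors": 4
        |}""".stripMargin

    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    out.write(payload.getBytes("UTF-8"))
    out.close()

    // A 201 response means Livy accepted the batch and will run it on
    // the cluster; callers can then poll /batches/{id} for status.
    println(s"Livy responded: ${conn.getResponseCode}")
    conn.disconnect()
  }
}
```

The appeal for internal customers is that nothing here requires knowing how the cluster is configured; submitting work is just an HTTP call.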
In short, big data processing is becoming a commodity within organizations, which is opening the door to spending less time configuring and more time doing.
## Streaming is ready for primetime
Structured Streaming is a great feature of Spark that has been in the experimental stage for the last year or so. A few months ago, when I first started experimenting with Spark, my first application was actually built on Structured Streaming, to demonstrate processing of our team's common data. The APIs were so intuitive that I could follow along with, and customize, examples from the Databricks engineering blog my first time out.
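To give a flavor of how approachable those APIs are, here is a minimal sketch in the style of the standard examples: a streaming word count over lines read from a socket. The host and port are placeholders, not our team's actual data source:

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredStreamingSketch")
      .getOrCreate()
    import spark.implicits._

    // Read a stream of lines from a socket (placeholder source;
    // production jobs would more likely read from Kafka).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same DataFrame operations used in batch jobs apply here.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Continuously print the updated counts to the console.
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

The key point is that the streaming query is expressed with the same DataFrame vocabulary as a batch query; only the read and write endpoints change.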
Now, in the upcoming Spark 2.2 release, Structured Streaming is shedding its "experimental" tag. Tag aside, the many sessions on real-life uses show that it has already been pulling its weight in enterprise environments.
As mentioned earlier, Jim Dowling gave a great overview of their streaming-as-a-service infrastructure, which groups across Sweden have been able to leverage. Michael Armbrust (Databricks) presented a deep dive into the problems streaming architectures present, how Structured Streaming tackles those problems, and common design patterns for making the most of the functionality. And, of course, the Databricks keynote demo showed a machine learning model applied to streaming image data at ~1ms latencies (far lower than practically necessary, considering the network latencies present in all video feeds, and roughly 10x faster than even 60 FPS video requires, but still an extremely impressive technical feat).
## Bonus: Zeppelin for advanced-user visualizations
Choosing the right data visualization tool depends very much on the target users within your business, so it makes sense that you could end up with multiple solutions. In Marketing Operations at Red Hat, we've used QlikView dashboards for years to bring high-level performance data to the masses. In recent months, Qlik Sense has allowed our more advanced data users to construct their own reports and dashboards.
Zeppelin is not a new tool, but it was new to me. Within a few minutes of a simple docker pull, I had a new custom queryable dashboard set up on top of our JBoss Data Virtualization (JDV) environment. While we aim to eventually connect Qlik Sense to JDV, Zeppelin could be more immediately helpful for data engineers (such as myself) who need to explore data. And as we begin moving to a Spark-based infrastructure, it could be a valuable tool for analyzing incoming data streams.
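For a sense of what that exploration looks like, below is a sketch of the kind of paragraph you might run in Zeppelin's Spark interpreter to query JDV over JDBC. The URL, VDB name, table, and credentials are hypothetical; JDV exposes a Teiid JDBC endpoint, but your driver version and port may differ, and the Teiid driver jar must be on the interpreter's classpath:

```scala
// Zeppelin's Spark interpreter provides a SparkSession as `spark`.
// Hypothetical JDV (Teiid) connection details; adjust the VDB name,
// host, port, and credentials to your environment.
val jdvUrl = "jdbc:teiid:MarketingVDB@mm://jdv.internal:31000"

val campaigns = spark.read
  .format("jdbc")
  .option("url", jdvUrl)
  .option("driver", "org.teiid.jdbc.TeiidDriver")
  .option("dbtable", "campaign_performance") // placeholder view name
  .option("user", "analyst")
  .option("password", sys.env.getOrElse("JDV_PASSWORD", ""))
  .load()

// From here the data can be explored with ordinary DataFrame
// operations, and rendered with Zeppelin's built-in charts.
campaigns.groupBy("channel").count().show()
```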
Feel free to connect with me!
- www.linkedin.com/in/jeremiah-coleman-product
- https://twitter.com/nerds_s
- jeremiah.coleman@daasnerds.com
- https://github.com/colemanja91