Spark Summit 2017 - Day 1 Takeaways

This blog is going away soon! :( Check out my new site where you can read the latest and subscribe for updates!

I arrived in San Francisco for the first time Sunday evening to attend Spark Summit 2017. Spark is the hip-tool-on-the-block for data engineering and data science, so it’s really exciting to see how other companies are using it and what use cases we might be able to adapt for my team.

I’m going to try sharing some of my key takeaways from each day (though with a 1–2 day delay to give the concepts some time to gel).

Day 1 was “Training Day” - I attended the session Architecting a Data Platform, given by the top-notch team at Silicon Valley Data Science.

Build a Lab and a Factory #

This one stuck out because it is a philosophy we’ve already adopted. The most frequent analogy we use to describe “what we do” is that we are a data factory. Not only do we focus on automation, but we have to warrant the quality of the data we create (basically, a promise that data will meet a certain standard, and that we will monitor and troubleshoot as needed to prevent impact to the business).

We also have a lab in our department: a group we actually call the “Labs” team. In theory, this is the group that stays on the cutting edge, always trying to develop new and trendy solutions, determine their potential ROI, and pass them on to engineering for productionalization.

Here is a great write-up on the relationship between lab and factory: https://hbr.org/2013/04/two-departments-for-data-succe

Data Architecture Best Practices #

The team had a lot of great advice here, but there are a few that were particularly noteworthy:

  1. Track data transformations - this is particularly important in my role. For internal users to trust any data or data process, they want to see the logic behind it (ultimately, this is not a scalable approach, but that is a different problem). It’s also crucial for the eventual troubleshooting that will happen.
  2. Keep a copy of the raw data - I’ve slowly been gravitating here for a while now. Code changes are probably one of the biggest risks to an established pipeline, and keeping the raw data provides the ability to re-compute in the case that buggy code gets deployed.
  3. Be careful in evaluating on-ingest vs. on-use transforms - using streaming processing for everything is becoming very trendy, but can easily be a bigger operational burden. It’s important to use it only when absolutely needed, and stick with more traditional methods otherwise.
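The first two points above can be sketched in a few lines of plain Python (not Spark-specific). All the names here are illustrative, not from any particular library: an immutable raw copy, a simple transformation log, and a re-computation after a code fix.

```python
# Sketch of best practices 1 and 2: keep an immutable copy of the raw
# data, record each transformation, and re-compute derived data from
# the raw copy when buggy code gets replaced.

raw_orders = [  # the "raw zone" copy - written once, never overwritten
    {"id": 1, "amount_cents": 1999},
    {"id": 2, "amount_cents": 500},
]

lineage = []  # a simple transformation log internal users can inspect


def to_dollars(records, code_version):
    """Derive a cleaned dataset from raw records, logging the step."""
    lineage.append({"step": "to_dollars", "code_version": code_version})
    return [{"id": r["id"], "amount": r["amount_cents"] / 100}
            for r in records]


# Suppose a buggy version of the transform gets deployed...
derived = to_dollars(raw_orders, code_version="v1-buggy")

# ...because the raw copy still exists, the fixed version can simply
# re-compute the derived dataset from scratch instead of attempting
# to patch already-transformed data.
derived = to_dollars(raw_orders, code_version="v2-fixed")

print(derived[0]["amount"])  # 19.99
print(len(lineage))          # 2 - both runs are visible in the log
```

The same shape carries over to a real pipeline: the raw zone lives in cheap storage (S3, HDFS), and the lineage log becomes whatever metadata store your platform already uses.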

Check back in later this week - I’ll be posting more takeaways from each day!


Feel free to connect with me!
