Making Marketing Infrastructure Robust

This blog is going away soon! :( Check out my new site where you can read the latest and subscribe for updates!

Lessons from Site Reliability Engineering: Part 1

About 17 months ago, our marketing data infrastructure went through a significant change. Red Hat Summit 2016 was quickly approaching, and we were concerned about the volume of event data that would flood our system. With multiple steps in our data flow, spanning both vendors and internal processes, small mistakes could quickly propagate into issues that would take months to fix. So, we took a big step forward.

Taking a page from Google’s Site Reliability Engineering practices, we decided to automate our monitoring and add alerts for potential data issues. We identified Service-Level Objectives (SLOs) for our data, set thresholds, and got to work. We implemented monitoring and alerts using Eloqua’s APIs, Prometheus (an open-source monitoring tool), and some Python scripting.
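To make the idea concrete, here is a minimal sketch of what one such check could look like. It is not our actual script: the metric name, the 98% threshold, and the record counts are all made up for illustration, and the counts would normally come from an API query rather than being hard-coded. It computes a Service-Level Indicator (SLI) as a simple ratio, compares it to the SLO threshold, and renders the value in the plain-text format that Prometheus scrapes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Slo:
    """A service-level objective for one data-quality indicator (hypothetical)."""
    name: str         # metric name, e.g. the share of contacts with a valid email
    threshold: float  # minimum acceptable ratio, between 0 and 1


def sli_ratio(good: int, total: int) -> float:
    """Return the fraction of good records.

    Assumption: an empty batch is reported as 1.0 here; a real check
    might prefer to alert when no records arrive at all.
    """
    return 1.0 if total == 0 else good / total


def check(slo: Slo, good: int, total: int) -> Tuple[float, bool]:
    """Compute the SLI and report whether it falls below the SLO threshold."""
    value = sli_ratio(good, total)
    return value, value < slo.threshold


def to_prometheus(slo: Slo, value: float) -> str:
    """Render the SLI as a gauge in the Prometheus text exposition format."""
    return f"# TYPE {slo.name} gauge\n{slo.name} {value}\n"


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    slo = Slo(name="contacts_valid_email_ratio", threshold=0.98)
    value, breached = check(slo, good=9_650, total=10_000)
    print(to_prometheus(slo, value), end="")
    if breached:
        print(f"ALERT: {slo.name} = {value:.3f} is below {slo.threshold}")
```

In a real setup the alerting decision would more likely live in Prometheus alerting rules than in the script itself; the inline check is only there to show the threshold comparison.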

The result was amazing: we caught four potentially major data issues before anything flowed downstream. This saved us weeks of analysis and remediation work. Now we have extensive monitoring in place on all of our major data processes, and the Google SRE book is standard reading for new Infrastructure team members.

Taking a Site Reliability Engineering viewpoint can be extremely beneficial to marketing automation experts, but getting there can take a fair amount of change. And, to be fair, while the Google SRE book is a great resource, it isn’t written to be directly consumable by non-engineers. So, I have picked out a few points that can hopefully set you on the right track.

Start Small

Be Open

Design for Monitoring

Some of our SLIs

That is a lot of good theory. What about some solid examples? Ask no more:

Summary

Site Reliability Engineering is not just for Google programmers. Applying the same concepts to marketing automation can save you time and money in the long run by catching issues early. Begin with simple SLIs, be transparent with your team, and think about monitoring when creating new automations. Go forth and make your marketing robust!


Feel free to connect with me!

