Making Marketing Infrastructure Robust
Lessons from Site Reliability Engineering: Part 1 #
About 17 months ago, our marketing data infrastructure went through a significant change. Red Hat Summit 2016 was quickly approaching, and we were concerned about the volume of event data that would be flooding our system. With multiple steps in our data flow, spanning both vendors and internal processes, small mistakes could quickly propagate into issues that would take months to fix. So, we took a big step forward.
Taking a page from Google’s Site Reliability Engineering practices, we decided to automate our monitoring, and add alerts for potential data issues. We identified Service-Level Objectives (SLOs) for our data, set thresholds, and got to work. We implemented monitoring and alerts through the use of Eloqua’s APIs, Prometheus (an open-source monitoring tool), and some Python scripting.
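To make that concrete, here is a minimal sketch of how one of these checks might be wired up in Python with the prometheus_client library. It assumes a Prometheus Pushgateway at a placeholder address, and it leaves out the Eloqua API query itself (the endpoint and authentication depend entirely on your setup):

```python
# Minimal sketch: push an SLI value to Prometheus from a Python script.
# Assumes a Prometheus Pushgateway is reachable at PUSHGATEWAY_URL (placeholder
# address) and that you obtain the contact count from Eloqua's API separately.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY_URL = "pushgateway.example.com:9091"  # assumption: your Pushgateway address


def push_contact_sli(contacts_created_last_day: int) -> None:
    """Record one SLI sample and push it to the Pushgateway."""
    registry = CollectorRegistry()
    gauge = Gauge(
        "eloqua_contacts_created_last_day",
        "Number of Eloqua contacts created in the last 24 hours",
        registry=registry,
    )
    gauge.set(contacts_created_last_day)
    # "job" is the grouping label Prometheus attaches to this metric.
    push_to_gateway(PUSHGATEWAY_URL, job="marketing_sli", registry=registry)


if __name__ == "__main__":
    # Replace this literal with the count returned by your Eloqua API query.
    push_contact_sli(1234)
```

Run on a schedule (cron, a CI job, or similar), a script like this gives Prometheus a time series you can alert on once you have picked an SLO threshold.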
The result was amazing: we caught four potentially major data issues before they flowed downstream, saving weeks of analysis and remediation work. Now we have extensive monitoring in place on all of our major data processes, and the Google SRE book is standard reading for new Infrastructure team members.
Taking a Site Reliability Engineering viewpoint can be extremely beneficial to marketing automation experts, but getting there can take a fair amount of change. And, to be fair, while the Google SRE book is a great resource, it isn’t intended to be directly consumable for non-engineers. So, I have picked out a few points that can hopefully set you on the right track.
Start Small #
- Start with Service-Level Indicators (SLIs): What can we measure? Why is this number important? If there’s an issue with our data (or our automation), how will this number tell us?
- Next, pick a Service-Level Objective (SLO): What is “normal” for our SLI? What threshold should the SLI reach before we spend time investigating?
- Avoid absolutes in your SLOs; 0% or 100% are often objectives that require more investment than they will return.
- Don’t worry about automation initially. Pick metrics you can manually pull once a week or so, and move to automation once the benefit is proven (see the sketch after this list).
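To make the SLI/SLO distinction concrete, here is a minimal sketch of a manual weekly check, assuming you pull the number by hand from your marketing platform. The threshold values are purely illustrative, and, per the point above, they define a “normal” range rather than an absolute:

```python
# Minimal sketch of a manual SLI/SLO check. Assumes you pull the weekly
# form-submission count from your marketing platform by hand; the SLO range
# below is purely illustrative, not a recommendation.

SLO_MIN = 200    # assumed lower bound for a "normal" week
SLO_MAX = 5000   # assumed upper bound; a spike may mean spam or a bad integration


def check_form_submission_sli(weekly_submissions: int) -> str:
    """Compare the SLI against the SLO range and describe the result."""
    if weekly_submissions < SLO_MIN:
        return f"Below SLO ({weekly_submissions} < {SLO_MIN}): check forms and integrations"
    if weekly_submissions > SLO_MAX:
        return f"Above SLO ({weekly_submissions} > {SLO_MAX}): possible spam or duplicate data"
    return f"Within SLO: {weekly_submissions} submissions this week"


if __name__ == "__main__":
    # Type in the number you pulled manually this week.
    print(check_form_submission_sli(int(input("Weekly form submissions: "))))
```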
Be Open #
- An SRE approach usually starts with one person; to avoid the appearance of becoming “data police,” try to involve others early.
- Consider using the Open Decision Framework to encourage transparency and collaboration.
- Don’t get attached - revisit your SLIs regularly, asking if they are still relevant and worth monitoring (or if they need adjustment).
Design for Monitoring #
- When designing a new automation, such as Program Canvas in Eloqua, ask yourself “How would I monitor this?”
- Sketch out some rough SLIs/SLOs when evaluating and designing new vendor integrations.
Some of our SLIs #
This is a lot of good theory, but what about some solid examples? Ask no more:
- Number of contacts created in the last day: helps us catch potential spam, bad integrations, or marketers uploading ancient email lists.
- Number of form submissions in the last hour: also good for catching spam or bad integrations.
- Number of contacts that do not have associated Salesforce Leads: helps us find potential people/process issues in campaign configuration or uploads, and spot integration or processing issues early (a rough sketch of this check follows).
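For that last SLI, here is a minimal sketch of the comparison itself, assuming you have exported an Eloqua contact list and a Salesforce Lead list to CSV files (the filenames and the “email” column are assumptions about your exports, not a prescribed format):

```python
# Minimal sketch: count contacts with no associated Salesforce Lead.
# Assumes two CSV exports produced by whatever means you prefer:
#   eloqua_contacts.csv  with an "email" column
#   salesforce_leads.csv with an "email" column
import csv


def load_emails(path: str) -> set[str]:
    """Read a CSV export and return the set of lowercased email addresses."""
    with open(path, newline="") as f:
        return {row["email"].strip().lower() for row in csv.DictReader(f) if row.get("email")}


def unmatched_contacts(contacts_csv: str, leads_csv: str) -> set[str]:
    """Contacts present in Eloqua but with no matching Salesforce Lead email."""
    return load_emails(contacts_csv) - load_emails(leads_csv)


if __name__ == "__main__":
    missing = unmatched_contacts("eloqua_contacts.csv", "salesforce_leads.csv")
    print(f"{len(missing)} contacts have no associated Salesforce Lead")
```

The resulting count could be pushed to Prometheus with the same pattern shown earlier; the comparison itself is just a set difference.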
Summary #
Site Reliability Engineering is not just for Google programmers - applying the same concepts to marketing automation can save you time and money in the long run by catching issues early. Begin with simple SLIs, be transparent with your team, and think about monitoring when creating new automations. Go forth and make your marketing robust!
Feel free to connect with me!
- www.linkedin.com/in/jeremiah-coleman-product
- https://twitter.com/nerds_s
- jeremiah.coleman@daasnerds.com
- https://github.com/colemanja91