Making Marketing Infrastructure Robust
This blog is going away soon! :( Check out my new site where you can read the latest and subscribe for updates!
Lessons from Site Reliability Engineering: Part 1
About 17 months ago, our marketing data infrastructure went through a significant change. Red Hat Summit 2016 was quickly approaching, and we were concerned about the volume of event data which would be flooding our system. With multiple steps in our data flow, both vendors and internal processes, small mistakes could quickly propagate to issues which would take months to fix. So, we took a big step forward.
Taking a page from Google’s Site Reliability Engineering practices, we decided to automate our monitoring, and add alerts for potential data issues. We identified Service-Level Objectives (SLOs) for our data, set thresholds, and got to work. We implemented monitoring and alerts through the use of Eloqua’s APIs, Prometheus...