How-To: Loading Eloqua Activity Data into Google BigQuery
Integrating Marketing Automation and Google Cloud Platform #
Previously, we discussed various approaches to data pipelines and how they integrate with Marketing Automation Platforms (MAPs) (see: Scaling Marketing Data Pipelines). The final state outlined there uses a Big Data platform as the center for data management, with the MAP becoming a source of email activity data and a sink for contact/segmentation data, as shown below:
This also sets the stage for future enhancements, such as advanced data science analyses, including machine learning.
In this article, we’ll walk through an example of importing email send data from Eloqua into Google BigQuery (part of Google Cloud Platform). We’ll use an Apache Beam pipeline deployed on Google Cloud Dataflow to make this happen, along with the PyEloqua Python package.
Why BigQuery? #
GCP offers several cloud-native database solutions, so let’s quickly dig into why we’ve chosen BigQuery:
- Cloud SQL is a managed MySQL/PostgreSQL service and Cloud Spanner is a globally distributed relational database; both are best suited for transactional (OLTP) workloads
- Bigtable is a NoSQL offering that supports high-throughput inserts with millisecond latency (more than we need, especially if we run ETL once a day). Because of its key-value design, queries are also much more limited and have to be planned in advance
- BigQuery is a serverless SQL engine, which supports high-throughput inserts with a latency of seconds; it stores data in a columnar format and is very efficient at querying high volumes of data and performing complex transformations
Looking at these options, BigQuery is the closest fit for our needs.
Set up Google Cloud Platform project #
If you do not already have one, you’ll need to set up an account and project on Google Cloud Platform. At the time of this writing, Google offers a $300 GCP free trial valid for up to 12 months, which is more than enough for this example.
You’ll also need to enable the APIs for Dataflow, BigQuery, Compute Engine, and Google Cloud Storage. Next, create a new Google Cloud Storage bucket.
Clone example repo and install dependencies #
You can do this either locally (if you have the gcloud CLI installed and configured) or via Google Cloud Shell:
git clone https://github.com/colemanja91/eloqua-gcp
cd ./eloqua-gcp/etl-activity-email-sends
sudo bash ./install_packages.sh
# Activate resulting virtualenv
source venv/bin/activate
At this point, you may need to authenticate if you have not already:
gcloud auth application-default login
Set up BigQuery dataset and table #
Navigate to the BigQuery Console. On the left, next to your project name, click the dropdown, then select “Create new dataset.” Enter the “Dataset ID” as “eloqua” and leave the other parameters at their defaults.
Now open the file ./bq-schema/email_send.json and copy its contents. Back in the BigQuery Console, next to eloqua, click the “+” sign to create a new table. Enter activity_email_send as the table name. Under “Source data,” select “Create empty table.” Under “Schema,” click “Edit as Text” and paste the JSON you copied earlier. Click “Create Table” and you now have a destination table for our ETL!
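If you are curious what a BigQuery schema definition looks like before opening the repo, here is an illustrative fragment with hypothetical Eloqua field names; the real definitions live in ./bq-schema/email_send.json, which uses exactly this JSON structure:
import json

# Illustrative only: a BigQuery table schema is a JSON array of field objects.
# The field names below are hypothetical; use the ones in ./bq-schema/email_send.json.
example_schema = [
    {"name": "ContactId", "type": "STRING", "mode": "NULLABLE"},
    {"name": "EmailAddress", "type": "STRING", "mode": "NULLABLE"},
    {"name": "ActivityDate", "type": "TIMESTAMP", "mode": "NULLABLE"},
]
print(json.dumps(example_schema, indent=2))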
Configure scripts #
Open the file etl_elq_activity_email_send.py, locate the section shown below, and replace the constant definitions with the appropriate values:
PROJECT = 'gcp-project-name'
...
ELOQUA_COMPANY = 'company'
ELOQUA_USER = 'user'
ELOQUA_PASSWORD = 'password'
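One optional tweak, not part of the example repo: rather than hard-coding credentials (especially the password), you could read them from environment variables. A minimal sketch, assuming you export variables with these names before running the script:
import os

# Read Eloqua credentials from the environment instead of hard-coding them.
# The variable names here are assumptions; use whatever convention you prefer.
ELOQUA_COMPANY = os.environ['ELOQUA_COMPANY']
ELOQUA_USER = os.environ['ELOQUA_USER']
ELOQUA_PASSWORD = os.environ['ELOQUA_PASSWORD']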
Further down, the following line sets the time range we will be exporting:
'EmailSend', '2017-10-01 00:00:00', '2017-10-02 00:00:00'
This shows we will be exporting all Email Send data for October 1, 2017. You may modify this as you like, but keep in mind that very large send volumes will need to be broken down into several smaller data pulls (one way to generate those windows is sketched below).
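For example, one simple way to split a longer range into day-sized pulls is to generate the start/end strings in a loop. This is just a sketch; the daily_windows helper is not part of the repo:
from datetime import datetime, timedelta

def daily_windows(start, end):
    """Yield (start, end) timestamp strings covering one day each."""
    fmt = '%Y-%m-%d %H:%M:%S'
    current = datetime.strptime(start, fmt)
    stop = datetime.strptime(end, fmt)
    while current < stop:
        nxt = min(current + timedelta(days=1), stop)
        yield current.strftime(fmt), nxt.strftime(fmt)
        current = nxt

# Four one-day pulls covering October 1-4, 2017
for window_start, window_end in daily_windows('2017-10-01 00:00:00', '2017-10-05 00:00:00'):
    print('EmailSend', window_start, window_end)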
Examine and run Dataflow pipeline #
Looking at the run() function defined in the Python script, we can see exactly how the pipeline is configured. It runs directly on Cloud Dataflow (as indicated by --runner=DataFlowRunner) and ships a requirements.txt file that tells the Dataflow workers to install PyEloqua.
The pipeline itself has only two components (a rough sketch of how they fit together follows the list):
- GetEloquaActivityEmailSend - creates an in-memory input to our pipeline; the code itself calls a helper function that runs the export from Eloqua via the Bulk API
- WriteToBQ - writes the output to the BigQuery table we just created
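To make that concrete, here is a rough sketch of what a two-step Beam pipeline like this can look like. It is not the repo’s exact code: the bucket paths and field names are placeholders, and get_email_sends() stands in for the PyEloqua helper that pulls the export:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def get_email_sends():
    """Placeholder for the PyEloqua Bulk API export; returns one dict per activity row."""
    return [{'ContactId': '12345', 'ActivityDate': '2017-10-01 08:00:00'}]

def run():
    # Dataflow options: project, GCS staging/temp paths, and the requirements
    # file that installs PyEloqua on the workers
    options = PipelineOptions([
        '--project=gcp-project-name',
        '--runner=DataflowRunner',
        '--staging_location=gs://your-bucket/staging',
        '--temp_location=gs://your-bucket/temp',
        '--requirements_file=requirements.txt',
    ])

    rows = get_email_sends()

    with beam.Pipeline(options=options) as p:
        (p
         | 'GetEloquaActivityEmailSend' >> beam.Create(rows)
         | 'WriteToBQ' >> beam.io.WriteToBigQuery(
             table='activity_email_send',
             dataset='eloqua',
             project='gcp-project-name',
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

if __name__ == '__main__':
    run()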
Now we can run our pipeline!
cd ./etl-activity-email-sends
python etl_elq_activity_email_send.py
In the GCP console, navigate to Dataflow and you should see the following:
It will take a few minutes for the workers to spin up and the pipeline to run.
View data in BigQuery #
Navigate back to the BigQuery console. Run the following query to view a subset of your data:
SELECT * FROM [<PROJECT ID>:eloqua.activity_email_send] LIMIT 10
Once you do a few back-loads of data, you’ll be able to examine your entire email history, and run more advanced SQL queries backed by the magic of BigQuery!
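For example, you could also query the table programmatically. Here is a small sketch using the google-cloud-bigquery Python client with standard SQL (ActivityDate is an assumed column name; check it against the schema in the repo):
from google.cloud import bigquery

# Count email sends per day; ActivityDate is an assumed TIMESTAMP column
client = bigquery.Client(project='gcp-project-name')
sql = """
    SELECT DATE(ActivityDate) AS send_date, COUNT(*) AS sends
    FROM `gcp-project-name.eloqua.activity_email_send`
    GROUP BY send_date
    ORDER BY send_date
"""
for row in client.query(sql).result():
    print(row.send_date, row.sends)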
Next Steps #
There are still a few pieces missing that we’ll want to address later:
- Full back-loads of data to fill out our data warehouse
- Scheduled ETL of Email Send data to keep tables current
- Build out pipelines for other activity types
Once those are in place, we can get to the fun part:
- Run exploratory analyses using notebooks in Cloud Datalab
- Build dashboards in Data Studio
- Create predictive models using TensorFlow and Cloud ML
But for now, we have a good start!
Feel free to connect with me!
- www.linkedin.com/in/jeremiah-coleman-product
- https://twitter.com/nerds_s
- jeremiah.coleman@daasnerds.com
- https://github.com/colemanja91