Extract Data from Google Analytics using StreamSets Data Collector
Angel Alvarado is a senior software engineer at One Degree, a San Francisco-based non-profit, and also helps run the Molanco data engineering community. Angel previously contributed a Fun Example of Streaming Data into Minecraft; this time he gets serious with the Google Analytics API. Many thanks to Angel for his kind permission to adapt this article from his original.
Back in January 2017, when I was working with version 22.214.171.124 of StreamSets Data Collector, it didn't have the OAuth 2.0 integration needed to connect to the Google Analytics API. To work around this issue I created a custom StreamSets origin for Google Analytics – it worked really well to solve a simple use case!
Soon afterwards, Data Collector 126.96.36.199 integrated OAuth 2.0 into the built-in HTTP Client origin. This article gives you a quick recipe for configuring the HTTP Client origin to work with the Google Analytics Core Reporting API v3.
Configuring HTTP Origin — HTTP
Here are the settings I used:
As in the configuration above, you should use runtime resources or a credential store for sensitive values such as the Google Analytics account id, rather than including them in the pipeline. You should also use relative dates, such as 2daysAgo, so the pipeline can pull batches of data on a regular basis. Ideally, of course, we would stream data instead of loading it in batches.
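To make the relative-date idea concrete, here is a minimal Python sketch of the kind of request URL such a pipeline might call. The endpoint and the parameter names (ids, start-date, end-date, metrics, dimensions, max-results) are those of the Core Reporting API v3. The function name, the default metric and dimension, and the placeholder table id are assumptions for illustration only and are not taken from the pipeline configuration.

```python
from urllib.parse import urlencode

# Core Reporting API v3 data endpoint
GA_V3_ENDPOINT = "https://www.googleapis.com/analytics/v3/data/ga"


def build_report_url(table_id, start_date="2daysAgo", end_date="yesterday",
                     metrics=("ga:sessions",), dimensions=("ga:date",),
                     max_results=1000):
    """Build a v3 query URL (hypothetical helper, not part of Data Collector).

    table_id is the value for the `ids` parameter, e.g. "ga:XXXXXXXX".
    Relative dates keep the same URL valid from one scheduled run to the next.
    """
    params = {
        "ids": table_id,
        "start-date": start_date,
        "end-date": end_date,
        "metrics": ",".join(metrics),
        "dimensions": ",".join(dimensions),
        "max-results": max_results,
    }
    return f"{GA_V3_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder id; in a real pipeline this value would come from a
    # runtime resource or credential store rather than from source code.
    print(build_report_url("ga:XXXXXXXX"))
```

Because the dates are relative, nothing in the URL needs to change between runs.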
I used version 3 of the Google Analytics API because it works with the HTTP Client origin’s pagination feature. Version 4 expects each request as a JSON object inside an HTTP POST request, and that object would have to be updated for every page in order to make the pagination work. In contrast, version 3 handles pagination through HTTP parameters: for instance, the nextLink field in the response contains a new URL whose query parameters carry the pagination information.
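The difference between the two API versions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Data Collector works internally: the function names are hypothetical, the v4 helper handles only a single report request, and the field names (nextLink for v3; reportRequests, pageToken, reports, and nextPageToken for v4) follow Google's published request and response formats.

```python
import copy


def next_v3_url(response):
    """v3: the next request is simply the URL in `nextLink`, if present."""
    return response.get("nextLink")


def next_v4_body(body, response):
    """v4: the POST body itself has to be rewritten with the page token.

    Assumes one report request per call. Returns None on the last page.
    """
    token = response["reports"][0].get("nextPageToken")
    if token is None:
        return None
    new_body = copy.deepcopy(body)  # leave the caller's body untouched
    new_body["reportRequests"][0]["pageToken"] = token
    return new_body


if __name__ == "__main__":
    body = {"reportRequests": [{"viewId": "XXXXXXXX", "pageSize": 1000}]}
    print(next_v4_body(body, {"reports": [{"nextPageToken": "1000"}]}))
    print(next_v3_url({"rows": [], "nextLink": "https://example.invalid/next"}))
```

With v3 the client only has to follow a link; with v4 it has to modify and resend a JSON document, which is the part that does not fit a link-based pagination mode.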
Configuring HTTP Origin — Pagination
Use the following or similar settings:
As mentioned above, Google Analytics API v3 and the HTTP Client origin let you drive pagination from the nextLink field in the response.
We can stop the pagination loop by using an Expression Language conditional: once the API returns the last page of data, the /nextLink result field no longer appears.
We don't want to include the pagination data in the pipeline or have to exclude it in later stages, so we choose /rows for Result Field Path. The HTTP Client origin then filters out all other fields and sends only this field downstream.
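The behaviour described in this section (follow nextLink until it disappears, and pass only the contents of rows downstream) can be sketched in plain Python. This is a minimal illustration, not the origin's actual implementation: iter_rows and FAKE_PAGES are hypothetical names, and the fetch function is injected so the loop can be tried with canned pages instead of real, authenticated API calls.

```python
from typing import Callable, Iterator, List, Optional


def iter_rows(first_url: str, fetch: Callable[[str], dict]) -> Iterator[List[str]]:
    """Yield every row from every page, following `nextLink` between pages.

    `fetch` takes a URL and returns the decoded JSON response as a dict.
    """
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        # Only the rows go downstream; pagination fields are dropped here.
        yield from page.get("rows", [])
        # The last page has no nextLink, which ends the loop.
        url = page.get("nextLink")


# Canned pages standing in for real API responses (made-up values).
FAKE_PAGES = {
    "page-1": {"rows": [["20170101", "10"], ["20170102", "12"]], "nextLink": "page-2"},
    "page-2": {"rows": [["20170103", "7"]]},
}

if __name__ == "__main__":
    for row in iter_rows("page-1", FAKE_PAGES.__getitem__):
        print(row)
```

The loop's exit test plays the role of the stop condition, and yielding only page["rows"] plays the role of the Result Field Path.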
Configuring HTTP Origin — OAuth2
Again, here are the settings I used:
The JWT key comes from the JSON key file you get when you generate the Google credentials in the Google Console. Locate the private_key field in the file, which contains a string version of the key. Place this string into a file and replace all \n literals with new lines. Again, use a runtime resource or credential store to load the file that contains the JWT key.
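If you would rather not edit the key by hand, the extraction step can be scripted. The sketch below assumes only that the JSON key file has a private_key field, as described above; the function name and file paths are hypothetical. Note that a JSON parser already turns the \n escapes into real line breaks, so the explicit replace is only a safety net for strings that were copied out as raw text.

```python
import json
from pathlib import Path


def extract_private_key(key_file, out_file):
    """Write the private_key field of a JSON key file to out_file.

    Hypothetical helper: returns the key text that was written.
    """
    data = json.loads(Path(key_file).read_text(encoding="utf-8"))
    pem = data["private_key"]
    # json.loads has already decoded \n escapes; this handles any
    # leftover literal backslash-n sequences from manual copying.
    pem = pem.replace("\\n", "\n")
    Path(out_file).write_text(pem, encoding="utf-8")
    return pem


if __name__ == "__main__":
    # Hypothetical file names; keep the output file out of version control.
    extract_private_key("service-account.json", "jwt-key.pem")
```

The resulting file is what you would then expose to the pipeline through a runtime resource or credential store.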
Thanks, Angel, for a great example of the power and versatility of the HTTP Client origin! Have you used the HTTP Client origin in an innovative way? Let us know in the comments!