Angel Alvarado is a senior software engineer at One Degree, a San Francisco-based non-profit, and also helps run the Molanco data engineering community. Angel previously contributed a Fun Example of Streaming Data into Minecraft; this time he gets serious with the Google Analytics API. Many thanks to Angel for his kind permission to adapt this article from his original.
Back in January 2017, when I was working with version 2.2.0.0 of StreamSets Data Collector, it didn’t have the OAuth 2.0 integration needed to connect to the Google Analytics API. To work around this issue I created a custom StreamSets origin for Google Analytics – it worked really well to solve a simple use case!
Soon afterwards, Data Collector 2.3.0.0 integrated OAuth 2.0 into the built-in HTTP Client origin. This article gives you a quick recipe for configuring the HTTP Client origin with the Google Analytics Core Reporting API v3.
Configuring HTTP Origin — HTTP
Here are the settings I used:
As in the configuration above, you should use runtime resources or a credential store for sensitive values such as the Google Analytics account id, rather than including them in the pipeline. You should also use relative dates, such as 2daysAgo, so the pipeline can pull batches of data on a regular basis. Ideally, of course, we would stream data instead of loading it in batches.
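To make the relative-date idea concrete, here is a minimal Python sketch of the kind of request URL the origin ends up calling. The endpoint and parameter names are those of the Core Reporting API v3; the view ID, metrics, dimensions, and page size are placeholder assumptions, not the values from the original pipeline.

```python
from urllib.parse import urlencode

# Core Reporting API v3 query endpoint.
GA_V3_ENDPOINT = "https://www.googleapis.com/analytics/v3/data/ga"


def build_report_url(view_id, start_date="2daysAgo", end_date="yesterday",
                     metrics="ga:sessions", dimensions="ga:date",
                     max_results=1000):
    """Build a v3 query URL. All default values are illustrative only."""
    params = {
        "ids": f"ga:{view_id}",     # hypothetical view (profile) ID
        "start-date": start_date,   # relative dates keep the query reusable
        "end-date": end_date,
        "metrics": metrics,
        "dimensions": dimensions,
        "start-index": 1,           # v3 result indexes start at 1
        "max-results": max_results,
    }
    return f"{GA_V3_ENDPOINT}?{urlencode(params)}"


url = build_report_url("12345678")
print(url)
```

Because the dates are relative, the same URL can be requested on every scheduled run without editing the pipeline.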
I used version 3 of the Google Analytics API because it is compatible with the HTTP Client origin’s pagination feature. Version 4 is harder to integrate: requests are sent as a JSON object in the body of an HTTP POST, and that body has to be updated for every page to make pagination work. Version 3, in contrast, drives pagination through HTTP query parameters: the nextLink field in the response contains the URL of the next page, with the position encoded in the start-index and max-results parameters.
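As a rough illustration of why this suits URL-based pagination, the sketch below pulls the paging parameters out of a nextLink value using only the Python standard library. The example link is made up for illustration; only the start-index and max-results parameter names come from the API.

```python
from urllib.parse import urlsplit, parse_qs


def paging_params(next_link):
    """Return (start_index, max_results) encoded in a v3 nextLink URL."""
    query = parse_qs(urlsplit(next_link).query)
    return int(query["start-index"][0]), int(query["max-results"][0])


# Illustrative nextLink; the view ID, dates, and metric are assumptions.
example_next_link = (
    "https://www.googleapis.com/analytics/v3/data/ga"
    "?ids=ga:12345678&start-date=2daysAgo&end-date=yesterday"
    "&metrics=ga:sessions&start-index=1001&max-results=1000"
)

print(paging_params(example_next_link))
```

Everything needed to fetch the next page is in the URL itself, so a client only has to follow the link rather than rebuild a request body.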
Configuring HTTP Origin — Pagination
Use the following or similar settings:
As mentioned above, Google Analytics API v3 and the HTTP Client origin let you work with pagination using the /nextLink field.
We can stop the pagination loop with an Expression Language conditional: once the API returns the last page of data, the /nextLink field no longer appears in the response.
We don’t want to carry the pagination data through the pipeline or have to strip it out in later stages. By choosing /rows for Result Field Path, the HTTP Client origin filters out all other fields and sends only this field downstream.
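The behaviour described in this section can be sketched in a few lines of Python. This is not how Data Collector is implemented, just a model of the same logic: follow nextLink until it is absent, and pass only the contents of rows downstream. The fetch_page callable and the two fake pages are hypothetical stand-ins for real HTTP calls.

```python
def iter_rows(fetch_page, first_url):
    """Follow nextLink until it disappears, yielding only entries of 'rows'."""
    url = first_url
    while url is not None:
        page = fetch_page(url)
        # Counterpart of Result Field Path = /rows: ignore every other field.
        yield from page.get("rows", [])
        # Stop condition: the last page carries no nextLink field.
        url = page.get("nextLink")


# Fake two-page API keyed by URL (made-up data for illustration).
FAKE_PAGES = {
    "page1": {"rows": [["20170101", "10"], ["20170102", "12"]],
              "nextLink": "page2"},
    "page2": {"rows": [["20170103", "7"]]},
}

rows = list(iter_rows(FAKE_PAGES.__getitem__, "page1"))
print(rows)
```

Downstream code sees a flat sequence of rows and never has to know that the data arrived in pages.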
Configuring HTTP Origin — OAuth2
Again, here are the settings I used:
The JWT key comes from the JSON key file you download when you generate the Google credentials under Google Console. Locate the private_key field in the file, which contains a string version of the key. Place this string into a file and replace all \n literals with new lines. Again, use a runtime resource or credential store to load the file that contains the JWT key.
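If you would rather script the key conversion than edit it by hand, something like the following Python sketch should work. It assumes the key file is the usual service-account JSON with a private_key field; the sample text below is a dummy placeholder, not a real key, and the helper names are my own.

```python
import json


def private_key_from_json(key_file_text):
    # Parsing the JSON already turns the escaped newlines into real ones.
    return json.loads(key_file_text)["private_key"]


def unescape_newlines(copied_value):
    # For a value copied by hand: replace each literal backslash-n pair
    # with an actual line break.
    return copied_value.replace("\\n", "\n")


# Dummy stand-in for the contents of a key file (not a real key).
sample = (
    '{"private_key": '
    '"-----BEGIN PRIVATE KEY-----\\nABC\\n-----END PRIVATE KEY-----\\n"}'
)

key = private_key_from_json(sample)
print(key)
```

Both routes give the same multi-line text, which you can then write to the file that the runtime resource or credential store points at.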
Thanks, Angel, for a great example of the power and versatility of the HTTP Client origin! Have you used the HTTP Client origin in an innovative way? Let us know in the comments!