skip to Main Content

The DataOps Blog

Where Change Is Welcome

Clickstream Analysis on Amazon EMR, Amazon Redshift and Elasticsearch

By Posted in Engineering April 23, 2020

Learn how quickly you can start ingesting and aggregating Clickstream logs on Amazon EMR using StreamSets Transformer Engine, and see how the data is analyzed in Elasticsearch, Kibana, and Amazon Redshift.

What is Clickstream Analysis?

Clickstream analysis by definition is the process of collecting, analyzing, and reporting aggregate information about webpage visits. In this blog, we will review how StreamSets Transformer Engine, a Spark ETL engine, running on Amazon EMR can help ingest and aggregate Clickstream logs.

Pipeline Overview

Here are the details of the dataset and data pipeline components:

Clickstream Analysis on Amazon EMR, Amazon Redshift and Elasticsearch

  • Dataset and Data Source: Clickstream logs read from Amazon S3
  • Transformations: Include aggregations, such as:
    • Number of views for each session with respect to action for a specific URL
    • The total number of sessions for each client IP address
    • Number of events captured for each brand of products
  • Destinations:
    • Aggregations are stored in Amazon Redshift tables. (Note: if the tables don’t already exist, the destination can be configured for the tables to be auto-created.)
    • All the logs are sent to Elasticsearch for searching and quick visualizations in Kibana. (Note: if the index doesn’t already exist in Elasticsearch, the destination can be configured for the index to be auto-created.)

Aggregations

Here are the aggregated stats being collected and stored in Amazon Redshift.

Clickstream Analysis on Amazon EMR, Amazon Redshift and Elasticsearch

Clickstream Analysis on Amazon EMR, Amazon Redshift and Elasticsearch

Clickstream Analysis on Amazon EMR, Amazon Redshift and Elasticsearch

Elasticsearch and Kibana

Once the logs were available in Elasticsearch, I created an index pattern called clickstream_data with all the attributes of the logs.

Clickstream Analysis on Amazon EMR, Amazon Redshift and Elasticsearch

Using the clickstream_data index pattern as source, I then created a dashboard with different visualizations in Kibana.

  • Session Wise Analysis — Number of views for each session with respect to action for a specific URL
  • Client Wise Analysis — The total number of sessions for each client IP address.
  • Brand Analysis — Number of events captured for each brand of products
  • HTTP Response Analysis — Number of events captured with a response status such as Successful, Request Lost, No Response, Error Response, etc.

Clickstream Analysis on Amazon EMR, Amazon Redshift and Elasticsearch

Querying Amazon Redshift

Aggregations are stored in number_of_views_per_session, number_of_sessions_per_ip, and number_of_events_per_http_response Redshift tables for faster-querying. For example, query to see the top 5 IP addresses from where the HTTP sessions were initiated.

SELECT distinct(clientip),total_sessions from number_of_sessions_per_ip order by total_sessions desc limit 5

Clickstream Analysis on Amazon EMR, Amazon Redshift and Elasticsearch

Watch Demo Video

Summary

While there are different ways to dissect and analyze data, hopefully this blog and demo video gives you ideas on how to use some of these tools you might have at your disposal in order to make better, data-driven decisions, faster.

Learn more about StreamSets For AWS and StreamSets Transformer Engine. Also checkout getting started resources to jumpstart designing your Spark ETL pipelines.

Back To Top

We use cookies to improve your experience with our website. Click Allow All to consent and continue to our site. Privacy Policy