StreamSets Cloud
Unlocking Insights: Amazon S3 to Snowflake

StreamSets Cloud is a cloud service for designing, deploying, and operating smart data pipelines, combining ease of use and scalability with the flexibility to execute pipelines anywhere: on-premises, or in a private or public cloud. It provides an integrated user interface to design, deploy, operate, and monitor the smart data pipelines managed by the service.

In this step-by-step tutorial blog, you’ll learn how to get started and accelerate your journey with StreamSets Cloud.

Cloud Data Warehouse

Data warehouses are a critical component of modern data architecture in enterprises that leverage massive amounts of data to drive the quality of their products and services. A data warehouse is an OLAP (Online Analytical Processing) database that collects data from transactional systems such as billing, CRM, ERP, etc. and provides a layer on top built specifically for fast analytics, reporting, and data mining.

Cloud data warehouses are a new breed that bring the added advantage of cost-effectiveness and scalability with pay-as-you-go pricing models, a serverless approach, and on-demand resources made possible by separating compute and storage.

While the benefits of moving from on-prem data warehouses to the cloud might outweigh the disadvantages, there are some important challenges that companies need to be aware of. One of them is data movement: solutions like Amazon Snowball are great for the initial large-scale data transfer (aka lift-and-shift), but the real challenge is seamlessly ingesting and synchronizing data into the cloud data warehouse as companies continue to gather more data. Add to this the complexity of transforming and enriching data in-flight before storing it for further, advanced analytics.

Unlocking Insights: HTTP Web Server Logs

As you can imagine, there’s a wealth of information that can be extracted from HTTP web application server logs to potentially provide better services, products, and/or improve user experiences.

Here’s what a typical log line looks like:

150.47.54.136 - - [14/Jun/2014:10:30:19 -0400] "GET /department/outdoors/category/kids'%20golf%20clubs/product/Polar%20Loop%20Activity%20Tracker HTTP/1.1" 200 1026 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

Some of the attributes in the log entries include client IP, HTTP method (GET, POST, etc.), HTTP response code, request URL, request timestamp, client browser, and platform. However, before these attributes can be leveraged for further analysis, they must be ingested and stored in a structured format for better and faster queryability.
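
To get a feel for the structure we want to extract, here’s a minimal, illustrative Python sketch (not part of StreamSets itself) that parses a combined-log-format line; the group names are simply chosen to mirror the field names used by the pipeline below.

import re
from urllib.parse import unquote

# Regex for the combined log format shown above; group names mirror the pipeline's field names.
LOG_PATTERN = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>\S+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('150.47.54.136 - - [14/Jun/2014:10:30:19 -0400] '
        '"GET /department/outdoors/category/kids\'%20golf%20clubs'
        '/product/Polar%20Loop%20Activity%20Tracker HTTP/1.1" 200 1026 "-" '
        '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"')

record = LOG_PATTERN.match(line).groupdict()
record["request"] = unquote(record["request"])  # decode %20 and friends
print(record["request"], record["response"], record["timestamp"])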

Pipeline Overview

Before we dive into details, here’s the data pipeline overview.

  • Amazon S3 is the main source of input data and is used to load the HTTP web server logs.
  • The Field Type Converter processor converts attributes like the HTTP response code, request timestamp, HTTP version, etc. from strings to numeric values.
  • The Expression Evaluator decodes the request URL (UTF-8) so it’s human readable, and also parses product names from the request URL, which will enable us to use product names in SQL WHERE clauses to perform joins and apply aggregate functions in Snowflake.
  • For the Snowflake destination, we’ll need to provide account credentials, warehouse, database, schema, and table name. (Note: if the table doesn’t exist in Snowflake, it will be created automatically.)

If you’d like to create a similar data pipeline, follow along the instructions outlined below.

Step-by-step Guide

Prerequisites: You will need access to Amazon S3 and Snowflake. You will also need to download the web_logs file and upload it to your S3 bucket.
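
If you’d rather script the upload, here’s a minimal sketch using boto3; the bucket name and key are placeholders, and it assumes your AWS credentials are already configured locally.

import boto3  # pip install boto3; assumes AWS credentials are configured locally

s3 = boto3.client("s3")
# Placeholder bucket name and key -- substitute your own bucket and preferred prefix.
s3.upload_file("web_logs", "my-streamsets-demo-bucket", "weblogs/web_logs")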

Step 1—Sign up and Login

To log in or sign up for the StreamSets Cloud service, click here. Note that you can log in using your existing Google or Microsoft account.

Step 2—Create Pipeline

Click Pipelines in the left-hand nav menu and then click the Create button on the right to create a new pipeline.

Then, within the pipeline designer, select the Amazon S3 origin, the Field Type Converter and Expression Evaluator processors, and the Snowflake destination from the panel on the right.

Step 3—Pipeline Configuration
Amazon S3

On the Data Format tab, select Log for Data Format.

On the Amazon S3 tab, provide credentials and other required details.

Field Type Converter

On the Conversions tab, switch to Bulk Edit Mode and copy and paste the following.

[
	{
		"fields": [
			"/response"
		],
		"targetType": "INTEGER",
		"treatInputFieldAsDate": false,
		"dataLocale": "en,US",
		"scale": -1,
		"decimalScaleRoundingStrategy": "ROUND_UNNECESSARY",
		"dateFormat": "YYYY_MM_DD_HH_MM_SS",
		"zonedDateTimeFormat": "ISO_ZONED_DATE_TIME",
		"encoding": "UTF-8"
	},
	{
		"fields": [
			"/timestamp"
		],
		"targetType": "DATETIME",
		"treatInputFieldAsDate": false,
		"dataLocale": "en,US",
		"scale": -1,
		"decimalScaleRoundingStrategy": "ROUND_UNNECESSARY",
		"dateFormat": "OTHER",
		"zonedDateTimeFormat": "OTHER",
		"encoding": "UTF-8",
		"otherDateFormat": "dd/MMM/yyyy:HH:mm:ss Z",
		"otherZonedDateTimeFormat": "dd/MM/yyyy:HH:mm:ss"
	},
	{
		"fields": [
			"/bytes"
		],
		"targetType": "INTEGER",
		"treatInputFieldAsDate": false,
		"dataLocale": "en,US",
		"scale": -1,
		"decimalScaleRoundingStrategy": "ROUND_UNNECESSARY",
		"dateFormat": "YYYY_MM_DD",
		"zonedDateTimeFormat": "ISO_ZONED_DATE_TIME",
		"encoding": "UTF-8"
	},
	{
		"fields": [
			"/httpversion"
		],
		"targetType": "DOUBLE",
		"treatInputFieldAsDate": false,
		"dataLocale": "en,US",
		"scale": -1,
		"decimalScaleRoundingStrategy": "ROUND_UNNECESSARY",
		"dateFormat": "YYYY_MM_DD",
		"zonedDateTimeFormat": "ISO_ZONED_DATE_TIME",
		"encoding": "UTF-8"
	}
]
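
To sanity-check what these conversions do, here’s a rough Python equivalent. This is illustrative only, not how the processor is implemented; the strptime pattern is simply the Python spelling of the dd/MMM/yyyy:HH:mm:ss Z format configured above.

from datetime import datetime

def convert_fields(record: dict) -> dict:
    # Mirrors the conversions configured above on an already-parsed log record.
    record["/response"] = int(record["/response"])
    record["/bytes"] = int(record["/bytes"])
    record["/httpversion"] = float(record["/httpversion"])
    # dd/MMM/yyyy:HH:mm:ss Z (Java-style format) corresponds to %d/%b/%Y:%H:%M:%S %z in Python.
    record["/timestamp"] = datetime.strptime(record["/timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    return record

print(convert_fields({
    "/response": "200",
    "/bytes": "1026",
    "/httpversion": "1.1",
    "/timestamp": "14/Jun/2014:10:30:19 -0400",
}))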
Expression Evaluator

On the Expressions tab, switch to Bulk Edit Mode and copy and paste the following.

[
	{
		"fieldToSet": "/request",
		"expression": "${str:urlDecode(record:value('/request'), 'UTF-8')}"
	},
	{
		"fieldToSet": "/is_product_page_view",
		"expression": "${str:contains(record:value(\"/request\"), \"department\") && str:contains(record:value(\"/request\"), \"product\") ? 1 : 0}"
	},
	{
		"fieldToSet": "/product_name",
		"expression": "${str:replace(str:regExCapture(record:value('/request'),\"(/product/(.*))\",2),'/add_to_cart','')}"
	}
]
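
For reference, here’s a rough Python equivalent of the three expressions above (illustrative only; the field names come from the config, and the regex mirrors the regExCapture pattern).

import re
from urllib.parse import unquote

def enrich(record: dict) -> dict:
    # 1. URL-decode the request path so it is human readable.
    record["/request"] = unquote(record["/request"], encoding="utf-8")
    req = record["/request"]
    # 2. Flag product page views (request mentions both a department and a product).
    record["/is_product_page_view"] = 1 if ("department" in req and "product" in req) else 0
    # 3. Capture everything after /product/ and strip a trailing /add_to_cart.
    match = re.search(r"/product/(.*)", req)
    record["/product_name"] = match.group(1).replace("/add_to_cart", "") if match else None
    return record

print(enrich({"/request": "/department/outdoors/category/kids'%20golf%20clubs"
                          "/product/Polar%20Loop%20Activity%20Tracker"}))
# product_name -> "Polar Loop Activity Tracker"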
Snowflake

On the Snowflake Connection Info tab, provide the required details.

On the Snowflake tab, provide the required details.

Note: If the table (WEB_LOGS) doesn’t exist in Snowflake, enabling Table Auto Create will create it automatically.

On the Staging tab, provide the required details.

Step 4—Pipeline Preview

Now we’re ready to preview the data flow to get a true sense of the data transformations at every stage. This is useful not just to ensure data integrity and data quality, but also for debugging purposes.

Click the Preview icon at the top and leave the default values.

Click the Confirm button and you should see a preview of the data flowing through the pipeline, as shown below.

Note: By default, preview data is not written to the destination, so you should not see rows inserted into WEB_LOGS in Snowflake just yet.

Step 5—Run Pipeline

Clicking the Run icon will first provision the necessary resources; once that’s done, you should see web logs being migrated from Amazon S3 to Snowflake.

Step 6—Run Queries In Snowflake

Now we’re ready to run queries in Snowflake. At this point, here's what the WEB_LOGS table structure looks like.

Query 1: Top 10 most viewed products

Recall that using Expression Evaluator we were able to parse product names from the request URL—that makes writing such queries pretty straightforward.

SELECT product_name, COUNT(*) AS product_page_view_count
FROM web_logs
WHERE product_name IS NOT NULL
GROUP BY product_name
ORDER BY product_page_view_count DESC
LIMIT 10;

Query 2: Failed vs. Successful HTTP requests

SELECT response, COUNT(*) AS http_response_code_count
FROM web_logs
GROUP BY response;
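
If you’d like to run these queries programmatically instead of in the Snowflake worksheet, here’s a short sketch using the Snowflake Connector for Python; the connection parameters are placeholders to replace with your own account details.

import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials -- substitute your own account, warehouse, and database.
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="YOUR_WAREHOUSE",
    database="YOUR_DATABASE",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    cur.execute("""
        SELECT product_name, COUNT(*) AS product_page_view_count
        FROM web_logs
        WHERE product_name IS NOT NULL
        GROUP BY product_name
        ORDER BY product_page_view_count DESC
        LIMIT 10
    """)
    for product_name, views in cur.fetchall():
        print(f"{product_name}: {views}")
finally:
    conn.close()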

Summary

In this blog, you learned how quickly you can create pipelines for ingesting data to and from cloud data platforms using the StreamSets Cloud service.

For detailed documentation on how to build, run and monitor pipelines, click here. You can also ask questions in the #streamsets-cloud channel on our community Slack team—sign up here for access.

To sign up for the StreamSets Cloud service, click here.
