

What are Grok Patterns?

Posted in Engineering on August 10, 2020

Grok builds on the regular expression language, letting you name existing patterns and combine them into more complex ones. Because grok is based on regular expressions, any valid regular expression (regexp) is also valid in grok.

In StreamSets Data Collector, a fast data ingestion engine, you can use a single grok pattern, compose several patterns, or create a custom pattern to define the structure of log data by configuring the following properties.

Grok Pattern Definition

Use this property to define one or more grok patterns, one per line:

<PATTERN NAME> <grok pattern>
<PATTERN NAME2> <grok pattern>

For example:

MYHOSTTIMESTAMP %{CISCOTIMESTAMP:timestamp} %{HOST:host}
MYCUSTOMPATTERN %{MYHOSTTIMESTAMP} %{WORD:program}%{NOTSPACE} %{NOTSPACE}
DURATIONLOG %{NUMBER:duration}%{NOTSPACE} %{GREEDYDATA:kernel_logs}
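
Under the hood, each %{NAME} token is just an alias that expands, possibly recursively, into the regular expression it names. The following minimal Python sketch illustrates that expansion mechanism (the tiny pattern library and expand() helper are simplifications for illustration, not StreamSets code):

import re

# Simplified pattern library; real grok libraries ship hundreds of these.
PATTERNS = {
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
    "NOTSPACE": r"\S+",
    "GREEDYDATA": r".*",
    "DURATIONLOG": "%{NUMBER:duration}%{NOTSPACE} %{GREEDYDATA:kernel_logs}",
}

def expand(pattern):
    """Recursively replace %{NAME} / %{NAME:field} with (named) regex groups."""
    def sub(m):
        name, _, field = m.group(1).partition(":")
        body = expand(PATTERNS[name])
        return f"(?P<{field}>{body})" if field else f"(?:{body})"
    return re.sub(r"%\{([^}]+)\}", sub, pattern)

m = re.match(expand("%{DURATIONLOG}"), "0.35ms kernel: usb 1-1: new device")
print(m.groupdict())
# {'duration': '0.35', 'kernel_logs': 'kernel: usb 1-1: new device'}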

Grok Pattern

Use this property to define the grok pattern that evaluates the data. Options are:

  • A predefined grok pattern, such as %{COMMONAPACHELOG}
  • A custom grok pattern (see Custom Grok Patterns below)
  • Patterns defined in Grok Pattern Definition

For example, you can use the patterns outlined in Grok Pattern Definition above to configure Grok Pattern as follows:

%{MYCUSTOMPATTERN} %{DURATIONLOG}
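
To check such a composition outside Data Collector, here is a minimal sketch using the open-source pygrok package (pip install pygrok); the custom_patterns argument and the sample log line are assumptions for illustration, not from the post:

from pygrok import Grok

custom = {
    "MYHOSTTIMESTAMP": "%{CISCOTIMESTAMP:timestamp} %{HOST:host}",
    "MYCUSTOMPATTERN": "%{MYHOSTTIMESTAMP} %{WORD:program}%{NOTSPACE} %{NOTSPACE}",
    "DURATIONLOG": "%{NUMBER:duration}%{NOTSPACE} %{GREEDYDATA:kernel_logs}",
}

grok = Grok("%{MYCUSTOMPATTERN} %{DURATIONLOG}", custom_patterns=custom)

# Hypothetical log line shaped to fit the pattern above
line = "Jun 14 10:30:13 web01 sshd[1234]: accepted 0.35ms connection handled ok"
print(grok.match(line))  # returns a dict of captures, or None on no match
# {'timestamp': 'Jun 14 10:30:13', 'host': 'web01', 'program': 'sshd',
#  'duration': '0.35', 'kernel_logs': 'connection handled ok'}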

Reusing Grok Patterns

Here is the syntax for referencing a grok pattern: %{SYNTAX}, %{SYNTAX:SEMANTIC}, or %{SYNTAX:SEMANTIC:TYPE}, where:

  • SYNTAX is the name of the pattern that matches the text
  • SEMANTIC is the identifier given to the piece of text being matched
  • TYPE is the data type to cast the named field to (see the sketch below)
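
In regex terms, SEMANTIC becomes a named capture group and TYPE becomes a cast applied to the captured string. Here is a minimal Python sketch of what %{NUMBER:duration:float} amounts to (the expansion of NUMBER is simplified for illustration):

import re

line = "request took 0.042 seconds"

# %{NUMBER:duration:float}: NUMBER supplies the regex (SYNTAX), "duration"
# names the capture group (SEMANTIC), and float is the cast (TYPE).
m = re.search(r"(?P<duration>[+-]?\d+(?:\.\d+)?)", line)
if m:
    duration = float(m.group("duration"))  # the :float cast
    print(duration)  # 0.042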

Custom Grok Patterns

You can also create custom grok patterns, like MYHOSTTIMESTAMP and DURATIONLOG above, to define the structure of your logs.

Using Grok Patterns In Amazon S3

For instance, let’s look at how you can use a grok pattern in the Amazon S3 origin to parse Apache web logs ingested in the following format.

79.133.215.123 - - [14/Jun/2014:10:30:13 -0400] "GET /home HTTP/1.1" 200 1671 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
162.235.161.200 - - [14/Jun/2014:10:30:13 -0400] "GET /department/apparel/category/featured%20shops/product/adidas%20Kids'%20RG%20III%20Mid%20Football%20Cleat HTTP/1.1" 200 1175 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.76.4 (KHTML, like Gecko) Version/7.0.4 Safari/537.76.4"
39.244.91.133 - - [14/Jun/2014:10:30:14 -0400] "GET /department/fitness HTTP/1.1" 200 1435 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"


On the Amazon S3 origin’s Data Format tab:

  • Data Format is set to Log
  • Log Format is set to Grok Pattern
  • Grok Pattern is set to %{COMMONAPACHELOG}
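
As a quick way to sanity-check the pattern outside the pipeline, here is a minimal sketch using the open-source pygrok package, which ships the same standard pattern library (using it here is an assumption; Data Collector itself is configured through the UI). Note that the sample lines above are in combined format: %{COMMONAPACHELOG} matches their leading portion, while %{COMBINEDAPACHELOG} also captures the referrer and user agent.

from pygrok import Grok

line = ('79.133.215.123 - - [14/Jun/2014:10:30:13 -0400] "GET /home HTTP/1.1" '
        '200 1671 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"')

fields = Grok("%{COMBINEDAPACHELOG}").match(line)  # None if the line doesn't match
print(fields["clientip"], fields["verb"], fields["request"], fields["response"])
# 79.133.215.123 GET /home 200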


In the pipeline (a rough Python sketch of the first two steps follows the list):

  • The Field Type Converter processor converts fields such as response, timestamp, and httpversion from strings to their respective data types
  • The Expression Evaluator processor decodes the request field (the HTTP URL) to UTF-8 and extracts the product name from the URL using regExCapture()
  • The Geo IP processor enriches the data by adding city, latitude, longitude, and country fields based on the clientip field in the web logs
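
Sketched in plain Python, the first two steps look roughly like this (illustrative only; in Data Collector these are UI-configured processors, and the /product/ capture regex is an assumption):

import re
from urllib.parse import unquote

record = {
    "response": "200",
    "request": "/department/apparel/category/featured%20shops/product/"
               "adidas%20Kids'%20RG%20III%20Mid%20Football%20Cleat",
}

# Field Type Converter: cast the string field to its real data type
record["response"] = int(record["response"])

# Expression Evaluator: URL-decode the request, then capture the product
# name, much like regExCapture() in Data Collector's expression language
record["request"] = unquote(record["request"])
m = re.search(r"/product/([^/]+)$", record["request"])
record["product"] = m.group(1) if m else None

# The Geo IP step would look up record["clientip"] in a MaxMind database
print(record["product"])  # adidas Kids' RG III Mid Football Cleat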

Snowflake

The parsed Apache web logs are stored in the Snowflake cloud data warehouse for further analytics.


Query: Top 10 most viewed products
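
As a hedged sketch, such a query might look like this from Python, assuming the parsed logs landed in a table named web_logs with a product column (table, column, and connection details are illustrative, not from the post):

import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()
cur.execute(
    "SELECT product, COUNT(*) AS views "
    "FROM web_logs WHERE product IS NOT NULL "
    "GROUP BY product ORDER BY views DESC LIMIT 10"
)
for product, views in cur:
    print(product, views)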


Query: Failed (403, 500, etc.) vs. successful (200) HTTP requests


Query: Total requests by client IP


Log Data Format Support In StreamSets Data Collector

For a comprehensive list of the origins, processors, and destinations in StreamSets Data Collector that support the Log data format, where grok patterns can be used, visit our documentation.

Try Grok Patterns in Your Data Pipeline

To take advantage of this and other features, get started today by deploying data ingestion execution engines in your favorite cloud to design, deploy, and operate smart data pipelines.
