Google BigQuery

The Google BigQuery origin executes a query job and reads the result from Google BigQuery.

The origin submits the query that you define, and Google BigQuery runs it as an interactive query. When the query is complete, the origin reads the query results to generate records. The origin runs the query once, and the pipeline stops after the origin finishes reading all query results. If you start the pipeline again, the origin submits the query again.

When you configure the origin, you define the query to run using valid BigQuery standard SQL or legacy SQL syntax. By default, BigQuery writes all query results to a temporary, cached results table. You can choose to disable retrieving cached results and force BigQuery to compute the query result.
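
As a rough illustration of the options described above, the following Python sketch builds a request body in the shape of the BigQuery REST API jobs.insert call. The field names (useLegacySql, useQueryCache, priority) come from the BigQuery REST API. The function name build_query_job is hypothetical, and whether the origin issues the job through this exact API is an assumption.

```python
def build_query_job(query, use_legacy_sql=False, use_query_cache=True):
    """Return a jobs.insert request body for an interactive query (illustrative only)."""
    # The SQL dialect is chosen with a setting, not with a prefix in the query text.
    if query.lstrip().lower().startswith(("#legacysql", "#standardsql")):
        raise ValueError("set the dialect with use_legacy_sql, not a query prefix")
    return {
        "configuration": {
            "query": {
                "query": query,
                "useLegacySql": use_legacy_sql,    # False means standard SQL
                "useQueryCache": use_query_cache,  # False forces BigQuery to compute the result
                "priority": "INTERACTIVE",         # interactive rather than batch
            }
        }
    }


# Example: standard SQL with cached results disabled.
job = build_query_job("SELECT 1", use_query_cache=False)
```

Setting use_query_cache to False corresponds to disabling the retrieval of cached results, so BigQuery computes the query result again.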

You also define the project and the credentials to use to connect to Google BigQuery.

The origin can generate events for an event stream.


Credentials

To execute a query job and read the result, the Google BigQuery origin must pass credentials to Google BigQuery using a Google Cloud service account credentials file.

Complete the following steps to configure the origin to use a service account credentials file:

  1. Generate a service account credentials file in JSON format.

    Use the Google Cloud Platform Console or the gcloud command-line tool to generate and download the credentials file. For more information about generating a credentials file, see the Google Cloud Platform documentation.

  2. On the Credentials tab for the stage, select Service Account Credentials (JSON) for the Credentials Provider property, then paste the full contents of the credentials file into the Credentials File Content (JSON) property.
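
Before pasting the file contents into the stage, it can help to confirm that the text is a complete service account key. The following Python sketch is one way to do that, assuming the usual keys found in a Google Cloud service account JSON file (type, project_id, private_key, client_email). The function name check_credentials and the choice of required keys are illustrative, not part of the product.

```python
import json

# Keys assumed to be present in a service account credentials file.
REQUIRED_KEYS = ("type", "project_id", "private_key", "client_email")


def check_credentials(content):
    """Return a list of problems found in the credentials JSON text."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value must be an object"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]
    # A key file for a service account declares its type explicitly.
    if data.get("type") not in (None, "service_account"):
        problems.append('"type" should be "service_account"')
    return problems
```

An empty list means the text passed these basic checks. It does not prove that the key is still active or that the account has access to the project.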

BigQuery Data Types

The Google BigQuery origin converts the Google BigQuery data types to StreamSets Cloud data types.

The following table lists the data types that the Google BigQuery origin supports and the StreamSets Cloud data types that the origin converts them to:
BigQuery Data Type    StreamSets Cloud Data Type
Boolean               Boolean
Bytes                 Byte Array
Date                  Date
Datetime              Datetime
Float                 Double
Integer               Long
Numeric               Decimal
String                String
Time                  Datetime
Timestamp             Datetime

Datetime Conversion

In Google BigQuery, the Datetime, Time, and Timestamp data types have microsecond precision, but the corresponding Datetime data type in StreamSets Cloud has millisecond precision. The conversion between data types results in some precision loss.

To preserve the precision that would otherwise be lost during data type conversion, the Google BigQuery origin generates the bq.fullValue field attribute, which stores the original value as a string with microsecond precision. You can use the record:fieldAttribute or record:fieldAttributeOrDefault functions to access the information in the attribute.

Generated Field Attribute    Description
bq.fullValue                 Provides the original precision for Datetime, Time, and Timestamp fields.

For more information about field attributes, see Field Attributes.
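
The precision loss and the role of bq.fullValue can be illustrated with a small Python sketch. It assumes that the conversion truncates rather than rounds and that the value arrives as a string in the format shown. Both are assumptions, and the function name to_record_field is hypothetical.

```python
from datetime import datetime


def to_record_field(bq_timestamp):
    """Convert a microsecond-precision string to a millisecond value plus attribute."""
    parsed = datetime.strptime(bq_timestamp, "%Y-%m-%d %H:%M:%S.%f")
    # Drop the last three fractional digits to mimic millisecond precision.
    truncated = parsed.replace(microsecond=parsed.microsecond // 1000 * 1000)
    # The attribute keeps the original string, so no precision is lost there.
    return {"value": truncated, "attributes": {"bq.fullValue": bq_timestamp}}


field = to_record_field("2019-06-01 12:34:56.123456")
print(field["value"].microsecond)          # millisecond-precision value
print(field["attributes"]["bq.fullValue"])  # original microsecond-precision string
```

In this sketch the field value keeps only .123 seconds, while the attribute still holds the full .123456.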

Event Generation

The Google BigQuery origin generates an event when a query completes successfully.

A Google BigQuery event can be used with a destination to store information about completed queries. For an example, see Preserving an Audit Trail of Events.

For more information about dataflow triggers and the event framework, see Overview.

Event Record

Event records generated by the Google BigQuery origin have the following event-related record header attributes:

Record Header Attribute         Description
sdc.event.type                  Event type. Uses the following type:
                                • big-query-success - Generated when the origin successfully completes a query.
sdc.event.version               Integer that indicates the version of the event record type.
sdc.event.creation_timestamp    Epoch timestamp when the stage created the event.

The origin can generate the following type of event record:

Query success
The origin generates a query success event record when it completes processing the data returned from a query.

The query success event records have the sdc.event.type record header attribute set to big-query-success and include the following fields:

Field            Description
query            Query that completed successfully.
timestamp        Timestamp when the query completed.
row-count        Number of processed rows.
source-offset    Offset after the query completed.
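
To make the layout of a query success event concrete, the following Python sketch assembles one as a plain dictionary. The attribute and field names come from the tables above. The dictionary layout, the version value of 1, the millisecond epoch unit, and the sample offset string are assumptions made for illustration, and the function name make_query_success_event is hypothetical.

```python
import time


def make_query_success_event(query, row_count, source_offset, version=1):
    """Build an illustrative query success event record as a dictionary."""
    now_ms = int(time.time() * 1000)  # assumed unit: milliseconds since the epoch
    return {
        "header": {
            "sdc.event.type": "big-query-success",
            "sdc.event.version": version,
            "sdc.event.creation_timestamp": now_ms,
        },
        "fields": {
            "query": query,
            "timestamp": now_ms,
            "row-count": row_count,
            "source-offset": source_offset,
        },
    }


# Hypothetical values, for illustration only.
event = make_query_success_event("SELECT 1", 10, "example-offset")
```

A destination in an event stream could write the fields portion of such a record to keep a trail of completed queries.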

Configuring a Google BigQuery Origin

Configure a Google BigQuery origin to execute a query job and read the result from Google BigQuery.

  1. In the properties pane, on the General tab, configure the following properties:
    General Property    Description
    Name                Stage name.
    Description         Optional description.
    Produce Events      Generates event records when events occur. Use for event handling.
    On Record Error     Error record handling for the stage:
                        • Discard - Discards the record.
                        • Send to Error - Sends the record to the pipeline for error handling.
                        • Stop Pipeline - Stops the pipeline.
  2. On the BigQuery tab, configure the following properties:
    BigQuery Property           Description
    Query                       SQL query to use for the query job. Write the query using valid BigQuery standard SQL or legacy SQL syntax.

                                Do not include the #legacySql or #standardSql prefix in the query. Instead, select or clear the Use Legacy SQL property to specify the SQL syntax type.

    Use Legacy SQL              Specifies whether the query uses standard SQL or legacy SQL syntax.

                                Clear to use standard SQL. Select to use legacy SQL.

    Use Query Cache             Determines whether Google BigQuery retrieves cached results if they are present.

                                Select to retrieve cached results. Clear to disable retrieving cached results.

    Query Timeout (sec)         Maximum number of seconds to wait for the query to finish. If the query fails to complete within the timeout, the origin aborts the query and the pipeline fails.

                                Enter a time in seconds or use the MINUTES or HOURS constant in an expression to define the time increment.

                                Default is five minutes, defined as follows: ${5 * MINUTES}.

    Max Batch Size (records)    Maximum number of records processed at one time. Honors values up to the maximum batch size of 1000.

                                Default is 1000.

  3. On the Credentials tab, configure the following properties:
    Credentials Property               Description
    Project ID                         Google BigQuery project ID to connect to.
    Credentials Provider               Select Service Account Credentials (JSON).

                                       The remaining options are not supported at this time.

    Credentials File Content (JSON)    Google Cloud service account credentials file in JSON format used to connect to Google BigQuery.
                                       Use a text editor to open the credentials file, and then copy and paste the full contents of the file into the property.
                                       Note: When you enter secrets such as the credentials file content, the stage encrypts the secret values.
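
The Query Timeout and Max Batch Size settings in step 2 can be read with a little arithmetic. The Python sketch below assumes that the MINUTES and HOURS constants stand for a number of seconds, which is consistent with the five-minute default of ${5 * MINUTES}. It also assumes that a requested batch size above 1000 is reduced to 1000. The function name effective_batch_size is hypothetical.

```python
# Assumed values of the time constants, expressed in seconds.
MINUTES = 60
HOURS = 60 * MINUTES

# Maximum batch size that the origin honors, per the property description.
MAX_BATCH_SIZE = 1000


def effective_batch_size(requested):
    """Return the batch size actually used for a requested value (assumed behavior)."""
    return min(requested, MAX_BATCH_SIZE)


default_timeout = 5 * MINUTES  # the default Query Timeout, in seconds
two_hour_timeout = 2 * HOURS   # an example of a longer timeout
```

Under these assumptions the default timeout is 300 seconds, and a requested batch size of 5000 results in batches of at most 1000 records.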