Google BigQuery

Supported pipeline types:
  • Data Collector

The Google BigQuery destination streams data into Google BigQuery. You can use other destinations to write to Google Bigtable, Google Cloud Storage, and Google Pub/Sub.

When you configure the destination, you define the existing BigQuery dataset and table to stream data into. The destination streams each record into a row in the BigQuery table. You can optionally define an expression that evaluates to the insert ID for each record. The insert ID is a unique ID for each row; BigQuery uses it to deduplicate rows when streaming inserts are retried. If you do not specify an insert ID, the destination inserts each record into a new row.

The destination maps fields from records to BigQuery columns based on matching names and compatible data types. You can configure the destination to ignore invalid columns when the destination cannot map a field to an existing BigQuery column. You can also configure the table cache size for the destination.

You also define the project and credentials provider to use to connect to Google BigQuery. The destination can retrieve credentials from the Google Application Default Credentials or from a Google Cloud service account credentials file.

For more information about streaming data into Google BigQuery, see the Google BigQuery documentation.
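
Under the hood, each streamed record corresponds to a row in a BigQuery tabledata.insertAll call. The following sketch shows a single-row streaming insert with an insert ID using the Google Cloud BigQuery Java client library; the dataset, table, field names, and insert ID are placeholders, and the sketch illustrates the API rather than Data Collector's internal code:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.InsertAllResponse;
    import com.google.cloud.bigquery.TableId;
    import java.util.HashMap;
    import java.util.Map;

    public class StreamingInsertSketch {
        public static void main(String[] args) {
            // Client authenticated through Application Default Credentials.
            BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

            // Existing dataset and table, as in the Dataset and Table Name properties.
            TableId tableId = TableId.of("my_dataset", "my_table");

            // One record becomes one row; keys must match existing column names.
            Map<String, Object> row = new HashMap<>();
            row.put("id", 42L);
            row.put("name", "example");

            // The optional insert ID lets BigQuery deduplicate retried rows.
            // Calling addRow(row) without an ID always inserts a new row.
            InsertAllRequest request = InsertAllRequest.newBuilder(tableId)
                .addRow("unique-insert-id-42", row)
                .build();

            InsertAllResponse response = bigquery.insertAll(request);
            if (response.hasErrors()) {
                // Per-row errors, such as a field that maps to no column.
                response.getInsertErrors().forEach((index, errors) ->
                    System.err.println("Row " + index + ": " + errors));
            }
        }
    }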

BigQuery Data Types

The Google BigQuery destination maps fields from records to BigQuery columns in existing tables based on matching names and compatible data types. If needed, the destination converts Data Collector data types to BigQuery data types.

The following table lists the Data Collector data types and the BigQuery data types that the destination converts them to:

Data Collector Data Type    BigQuery Data Type
Boolean                     Boolean
Byte Array                  Bytes
Date                        Date
Datetime                    Datetime or Timestamp
Double                      Float
Float                       Float
Integer                     Integer
List                        Array
List-Map                    Record with repeated fields
Long                        Integer
Map                         Record
Short                       Integer
String                      String
Time                        Time

The Google BigQuery destination cannot convert the following Data Collector data types:
  • Byte
  • Character
  • Decimal
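
To make the container mappings in the table above concrete, the sketch below shows how a Map field (written as a BigQuery Record) and a List field (written as an Array) are shaped as row content for a streaming insert with the Java client library; all column names are hypothetical. A field of an unsupported type, such as Decimal, would have no valid representation in this payload:

    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.TableId;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TypeMappingSketch {
        public static void main(String[] args) {
            Map<String, Object> row = new HashMap<>();
            row.put("active", true);        // Boolean -> Boolean
            row.put("count", 7L);           // Long    -> Integer
            row.put("score", 3.14d);        // Double  -> Float
            row.put("name", "example");     // String  -> String

            // List -> Array: a REPEATED column.
            List<String> tags = Arrays.asList("alpha", "beta");
            row.put("tags", tags);

            // Map -> Record: a STRUCT column with nested fields.
            Map<String, Object> address = new HashMap<>();
            address.put("city", "Berlin");
            address.put("zip", "10115");
            row.put("address", address);

            InsertAllRequest request = InsertAllRequest
                .newBuilder(TableId.of("my_dataset", "my_table"))
                .addRow(row)
                .build();
            // The request would then be passed to bigquery.insertAll(request).
        }
    }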

Credentials

When the Google BigQuery destination streams data into Google BigQuery, it must pass credentials to BigQuery. Configure the destination to retrieve the credentials from the Google Application Default Credentials or from a Google Cloud service account credentials file.

Default Credentials Provider

When configured to use the Google Application Default Credentials, the destination checks for the credentials file defined in the GOOGLE_APPLICATION_CREDENTIALS environment variable. If the environment variable doesn't exist and Data Collector is running on a virtual machine (VM) in Google Cloud Platform (GCP), the destination uses the built-in service account associated with the virtual machine instance.

For more information about the default credentials, see Google Application Default Credentials in the Google Developer documentation.
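
The same lookup order can be exercised directly with the Google auth library that the Google Cloud client libraries use. This minimal sketch assumes the google-auth-library-java dependency is on the classpath:

    import com.google.auth.oauth2.GoogleCredentials;
    import java.io.IOException;

    public class AdcSketch {
        public static void main(String[] args) throws IOException {
            // Resolves credentials in order: the file named by the
            // GOOGLE_APPLICATION_CREDENTIALS environment variable, gcloud's
            // well-known credentials file, and finally the built-in service
            // account of a GCP VM (via the metadata server).
            GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();
            System.out.println("Resolved Application Default Credentials: " + credentials);
        }
    }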

Complete the following steps to define the credentials file in the environment variable:

  1. Use the Google Cloud Platform Console or the gcloud command-line tool to create a Google service account and have your application use it for API access.
    For example, to use the command-line tool, run the following commands:
    gcloud iam service-accounts create my-account
    gcloud iam service-accounts keys create key.json --iam-account=my-account@my-project.iam.gserviceaccount.com
  2. Store the generated credentials file on the Data Collector machine.
  3. Add the GOOGLE_APPLICATION_CREDENTIALS environment variable to the appropriate file and point it to the credentials file.

    Modify environment variables using the method required by your installation type.

    Set the environment variable as follows:

    export GOOGLE_APPLICATION_CREDENTIALS="/var/lib/sdc-resources/keyfile.json"
  4. Restart Data Collector to enable the changes.
  5. On the Credentials tab for the stage, select Default Credentials Provider for the credentials provider.

Service Account Credentials File (JSON)

When configured to use the Google Cloud service account credentials file, the destination checks for the file defined in the destination properties.

Complete the following steps to use the service account credentials file:

  1. Generate a service account credentials file in JSON format.

    Use the Google Cloud Platform Console or the gcloud command-line tool to generate and download the credentials file. For more information, see generating a service account credential in the Google Cloud Platform documentation.

  2. Store the generated credentials file on the Data Collector machine.

    As a best practice, store the file in the Data Collector resources directory, $SDC_RESOURCES.

  3. On the Credentials tab for the stage, select Service Account Credentials File for the credentials provider and enter the path to the credentials file.
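
For comparison, this is roughly how an explicit service account key file is loaded with the Google Cloud Java client library; the key path and project ID are placeholders, and the sketch illustrates the credentials flow rather than the destination's actual implementation:

    import com.google.auth.oauth2.ServiceAccountCredentials;
    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class KeyFileSketch {
        public static void main(String[] args) throws IOException {
            // JSON key file stored on the Data Collector machine,
            // for example under $SDC_RESOURCES.
            String keyPath = "/var/lib/sdc-resources/keyfile.json";

            BigQuery bigquery = BigQueryOptions.newBuilder()
                .setCredentials(ServiceAccountCredentials.fromStream(
                    new FileInputStream(keyPath)))
                .setProjectId("my-project")
                .build()
                .getService();

            System.out.println("Connected to project: "
                + bigquery.getOptions().getProjectId());
        }
    }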

Configuring a Google BigQuery Destination

Configure a Google BigQuery destination to stream data into Google BigQuery.

  1. In the Properties panel, on the General tab, configure the following properties:

    General Property    Description
    Name                Stage name.
    Description         Optional description.
    Required Fields     Fields that must include data for the record to be
                        passed into the stage.
                        Tip: You might include fields that the stage uses.
                        Records that do not include all required fields are
                        processed based on the error handling configured for
                        the pipeline.
    Preconditions       Conditions that must evaluate to TRUE to allow a
                        record to enter the stage for processing. Click Add to
                        create additional preconditions.
                        Records that do not meet all preconditions are
                        processed based on the error handling configured for
                        the stage.
    On Record Error     Error record handling for the stage:
                        • Discard - Discards the record.
                        • Send to Error - Sends the record to the pipeline for
                          error handling.
                        • Stop Pipeline - Stops the pipeline.
  2. On the BigQuery tab, configure the following properties:

    BigQuery Property      Description
    Dataset                BigQuery dataset to write to.
                           Enter the name of an existing dataset or an
                           expression that evaluates to the name of an
                           existing dataset. For example, if the dataset name
                           is stored in the 'dataset' record attribute, enter
                           the following expression:
                           ${record:attribute('dataset')}
    Table Name             Name of the BigQuery table to write to.
                           Enter the name of an existing table or an
                           expression that evaluates to the name of an
                           existing table. For example, if the table name is
                           stored in the 'table' record attribute, enter the
                           following expression:
                           ${record:attribute('table')}
    Insert ID Expression   Expression that evaluates to the BigQuery insert ID
                           to insert or update. The insert ID is a unique ID
                           for each row. Leave blank to insert each record
                           into a new row.
                           For more information about the insert ID property
                           used to stream data into BigQuery, see the Google
                           BigQuery documentation.
    Ignore Invalid Column  Ignores invalid columns.
                           If selected and the destination encounters a field
                           path that cannot be mapped to a BigQuery column
                           with the same name and a compatible data type, the
                           destination ignores the invalid column and writes
                           the remaining fields in the record to BigQuery. If
                           cleared and the destination encounters an invalid
                           column, the record is sent to the stage for error
                           handling.
    Table Cache Size       Maximum number of table ID entries to cache
                           locally.
                           When the destination evaluates the dataset and
                           table name to write to, it checks that the table ID
                           exists in BigQuery and then caches the table ID.
                           When possible, the destination uses the cache to
                           avoid unnecessary retrievals from BigQuery. When
                           the cache reaches the maximum size, the oldest
                           cached entries are evicted to allow for new data. A
                           sketch of this eviction behavior follows these
                           steps.
                           Default is -1, an unlimited cache size.

  3. On the Credentials tab, configure the following properties:

    Credentials Property          Description
    Project ID                    Google BigQuery project ID to connect to.
    Credentials Provider          Credentials provider to use to connect to
                                  Google BigQuery:
                                  • Default credentials provider
                                  • Service account credentials file (JSON)
    Credentials File Path (JSON)  When using a Google Cloud service account
                                  credentials file, path to the file that the
                                  destination uses to connect to Google
                                  BigQuery. The credentials file must be a
                                  JSON file.
                                  Enter a path relative to the Data Collector
                                  resources directory, $SDC_RESOURCES, or
                                  enter an absolute path.
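
The table cache described in step 2 behaves like a bounded map of table IDs with oldest-entry eviction. The destination's actual implementation is not documented here; the following sketch only illustrates that eviction behavior, using an access-ordered LinkedHashMap and hypothetical table IDs:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class TableIdCacheSketch {
        public static void main(String[] args) {
            // Table Cache Size; a value of -1 in the stage means unbounded.
            final int maxSize = 3;

            // Access-ordered map that drops the oldest entry once full.
            Map<String, Boolean> tableExists =
                new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                        return size() > maxSize;
                    }
                };

            // Cache the result of "does this dataset.table exist in BigQuery?"
            tableExists.put("sales.orders", true);
            tableExists.put("sales.items", true);
            tableExists.put("ops.logs", true);
            tableExists.put("ops.metrics", true); // evicts sales.orders, the oldest entry

            System.out.println(tableExists.keySet()); // [sales.items, ops.logs, ops.metrics]
        }
    }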