
Ingest Salesforce Data Into Amazon S3 Data Lake

Posted in Data Integration on November 5, 2020

In this blog, you will learn how to ingest Salesforce data using the Bulk API (optimized to process large sets of data) and store it in an Amazon Simple Storage Service (Amazon S3) data lake using StreamSets Data Collector, a fast data ingestion engine. The primary AWS service used in our data pipeline is Amazon S3, which provides cost-effective storage and archival to underpin the data lake.

Consider the use case where a data engineer is tasked with archiving all Salesforce contacts along with some of their account information in Amazon S3. To demonstrate an approach of connecting Salesforce and AWS, I have created a data pipeline that is specifically designed to facilitate seamless, secure, and real-time flow of data between Salesforce and Amazon S3.

Pipeline Overview And Implementation

Let’s dive deeper into the data pipeline implementation.

Salesforce origin


    • You can configure the Salesforce origin to read existing data using the Bulk or SOAP API and provide the SOQL query, offset field, and optional initial offset to use. When using the Bulk API, you can enable PK Chunking to efficiently process very large volumes of data.
    • The Salesforce origin is also capable of performing a full or incremental read at specified intervals.
    • The origin can also be configured to subscribe to notifications to process PushTopic events, platform events, or Change Data Capture events.
    • In our case, the origin is configured to ingest existing contact information using Salesforce Object Query Language (SOQL) in Bulk API mode.
    • SOQL used to retrieve contacts: "SELECT Id, AccountId, FirstName, LastName, LeadSource, Email FROM Contact WHERE Id > '${OFFSET}' ORDER BY Id"
    • For details on additional configuration, refer to the documentation.
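To make the origin's behavior concrete, here is a minimal Python sketch of the same offset-based incremental read, using the SOQL template from above. The `run_bulk_query` function is a hypothetical stand-in for a real Bulk API client call (e.g., one job submission per query), not part of Data Collector:

```python
# Sketch of the offset-based incremental read performed by the Salesforce origin.
# run_bulk_query is a hypothetical stand-in for a Bulk API client call.

SOQL_TEMPLATE = (
    "SELECT Id, AccountId, FirstName, LastName, LeadSource, Email "
    "FROM Contact WHERE Id > '{offset}' ORDER BY Id"
)

def build_query(offset: str) -> str:
    """Substitute the last-seen record Id into the SOQL template."""
    return SOQL_TEMPLATE.format(offset=offset)

def read_incrementally(run_bulk_query, initial_offset: str = "000000000000000"):
    """Repeatedly query for records past the current offset until none remain."""
    offset = initial_offset
    while True:
        batch = run_bulk_query(build_query(offset))
        if not batch:
            break
        yield from batch
        offset = batch[-1]["Id"]  # safe because records are ordered by Id
```

Because the query orders by Id and filters on `Id > offset`, each pass picks up exactly where the previous one left off, which is what makes the origin's incremental mode restartable.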

Salesforce Lookup processor


    • This processor is configured to perform a lookup against Salesforce to retrieve additional information and enrich data before storing it in Amazon S3.
    • In particular, based on the AccountId associated with the contact, it retrieves AnnualRevenue, AccountSource, and Rating for that account.
    • For details on additional configuration, refer to the documentation.
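The enrichment step can be sketched as a small pure function: given a contact record and some way to fetch account fields by Id, merge the three account fields into the record. Here `fetch_account` is a hypothetical stand-in for the Salesforce lookup (conceptually, "SELECT AnnualRevenue, AccountSource, Rating FROM Account WHERE Id = ..."):

```python
# Sketch of the Salesforce Lookup processor's enrichment step.
# fetch_account is a hypothetical stand-in for a per-AccountId Salesforce query.

ACCOUNT_FIELDS = ("AnnualRevenue", "AccountSource", "Rating")

def enrich_contact(contact: dict, fetch_account) -> dict:
    """Return a copy of the contact with account fields merged in.

    Missing accounts yield None values rather than dropping the record,
    mirroring a lookup configured to pass records through on no match.
    """
    account = fetch_account(contact["AccountId"]) or {}
    enriched = dict(contact)
    for field in ACCOUNT_FIELDS:
        enriched[field] = account.get(field)
    return enriched
```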

Field Masker processor


    • This processor is configured to mask PII (contact’s email address) before storing the data in Amazon S3.
    • For details on additional configuration, refer to the documentation.
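As a rough illustration of what masking an email address looks like, here is one possible scheme in Python: replace every character of the local part while keeping the domain. This is only a sketch; the Field Masker processor supports several mask types (fixed-length, variable-length, custom patterns), so the exact output depends on configuration:

```python
def mask_email(email: str, mask_char: str = "x") -> str:
    """Mask the local part of an email address, keeping only the domain.

    One possible masking scheme, shown for illustration; not the only
    mask type the Field Masker processor offers.
    """
    local, sep, domain = email.partition("@")
    if not sep:
        return mask_char * len(email)  # not an address; mask everything
    return mask_char * len(local) + "@" + domain
```

Keeping the domain intact preserves some analytical value (e.g., grouping contacts by company domain) while removing the personally identifying part.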

Schema Generator processor


    • This processor is configured to automatically generate an Avro schema based on the structure of the contact records.
    • This enables writing data in compressed Avro format for cost-effective storage in Amazon S3.
    • For details on additional configuration, refer to the documentation.
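Schema generation can be sketched as inferring an Avro record schema from the Python types in a sample record. This is a deliberately minimal illustration; the real Schema Generator processor handles far more cases (nested fields, dates, decimals, schema caching):

```python
# Minimal sketch of inferring an Avro record schema from one sample record.
# Only flat records with a few primitive types are handled here.

AVRO_TYPES = {str: "string", int: "long", float: "double", bool: "boolean"}

def infer_avro_schema(name: str, record: dict) -> dict:
    """Build an Avro record schema (as a dict) from a sample record."""
    fields = []
    for key, value in record.items():
        avro_type = AVRO_TYPES.get(type(value), "string")
        # Make every field nullable, as generated schemas commonly do.
        fields.append({"name": key, "type": ["null", avro_type]})
    return {"type": "record", "name": name, "fields": fields}
```

The resulting dict can be serialized with `json.dumps` to produce the schema document an Avro writer expects.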

Amazon S3 destination


    • The Amazon S3 destination is configured to store the contacts data in compressed Avro format.
    • It is also configured to use AWS server-side encryption (SSE) to protect the contacts data written to Amazon S3.
    • For details on additional configuration, refer to the documentation.
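Outside of Data Collector, the equivalent SSE-enabled write could be expressed as a parameter set for S3's PutObject API, passed to boto3's `s3_client.put_object(**params)`. The bucket and key names below are illustrative, not from the pipeline:

```python
def s3_put_params(bucket: str, key: str, body: bytes) -> dict:
    """Build parameters for an S3 PutObject call with server-side encryption.

    Illustrative sketch only; bucket/key values are placeholders.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "AES256",  # SSE-S3; use "aws:kms" for SSE-KMS
        "ContentType": "avro/binary",
    }
```

With `ServerSideEncryption` set, S3 encrypts each object at rest on write, which is what the highlighted encryption attribute in the pipeline output reflects.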

Pipeline Run


After the pipeline runs successfully, you should see output similar to the one shown below. Notice the highlighted AWS encryption and data format of the object stored on Amazon S3.

(Screenshot: object stored in Amazon S3, with the AWS encryption and Avro data format highlighted.)

Summary

In this post, you learned the value companies can realize by leveraging and integrating data between AWS and Salesforce using StreamSets Data Collector. Closer integration between AWS and Salesforce opens up plenty of opportunities for enterprises to develop new and unique ways of accessing, analyzing, and storing their data.


For any other questions and inquiries, please contact us.
