DNB is Norway’s largest financial services group, and has a reputation as a trusted financial institution throughout the region. In this guest post, the DNB Data Engineering Centre of Practice team–Saleem Pothiwala, Operations Lead – Customer Insights, Jones Mabea Agwata, Software Engineer, and Bikram Rout, Data Engineer–share their data engineering best practices for harnessing the power of data for digital transformation.
At DNB, we rely on data engineering best practices to make clean, usable and timely data available in a reliable manner. We treat this as a major criterion of success for Customer Insights and Data Analytics initiatives. We use data to generate reports, insights and dashboards, to feed downstream systems, and to support machine learning.
The Data Engineering field can be thought of as an umbrella for BI and data warehousing that brings in more elements from software engineering.
At DNB, we place StreamSets in the Elastic Data Processor category: a data processor that grows and shrinks with the workload and gives us both ETL and ELT capabilities through the Data Collector Engine and the Transformer Engine, with everything orchestrated from Control Hub.
Data engineers are responsible for taking into account a broad set of dependencies and requirements as they design and build their data pipelines.
Here’s our high-level architecture diagram.
Here Are Our 13 Data Engineering Best Practices
#1 Follow a design pattern if it exists
We have created data patterns for Data Engineering across DNB. They give us the best tools, processes, techniques and frameworks to use. If you find a pattern that suits your case perfectly, use it. If not, pick an existing one, enhance it for your use case and publish it for others to follow. Patterns help us keep track of the types and counts of pipelines across DNB. They will also help us move to another vendor, platform or technology: we pick an existing pattern, change it and communicate the change to all the users of that pattern.
#2 Implement idempotence
We encourage using upserts rather than inserts. This allows pipelines to be rerun without creating duplicate records in the destination and without failing on primary key or unique constraint violations.
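A minimal sketch of the idea, assuming a hypothetical `customers` table keyed on `customer_id`. SQLite is used here only because it ships with Python; it is not one of the destinations described in this post, and the table and function names are illustrative.

```python
import sqlite3


def upsert_customers(conn, rows):
    """Insert new rows, or update the name when customer_id already exists."""
    # ON CONFLICT ... DO UPDATE needs SQLite 3.24 or newer.
    conn.executemany(
        """
        INSERT INTO customers (customer_id, name)
        VALUES (?, ?)
        ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name
        """,
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)"
)

batch = [(1, "Kari"), (2, "Ola")]
upsert_customers(conn, batch)
upsert_customers(conn, batch)  # rerun of the same batch, no error raised

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(row_count)  # still 2 rows after the rerun
```

With a plain `INSERT`, the second call would fail on the primary key; with the upsert, the rerun leaves the table unchanged.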
#3 Use Streaming instead of Batch whenever possible
Many of our data sources allow us to use streaming mode or micro-batch mode in addition to batch execution. We encourage our data engineers to use streaming mode wherever possible. The downstream pipeline can still run as often as the requirement dictates, but this approach always leaves us the option of running it more frequently than once a day, up to near real time.
#4 Encourage collaboration
The pipelines designed by a Data Engineer in StreamSets are shared with the Product Team. All the pipelines created by the Data Engineers within a product are shared to allow collaboration and peer review, and to increase efficiency by learning from existing pipelines. This allows us to reuse pipeline assets to create new pipelines without spending too much time figuring out the most efficient set of configuration values.
#5 User defined Processors
#6 Use Credential Stores/IAM Roles/Runtime file parameters
We strongly discourage plain-text credentials in our pipelines for security reasons. All our pipelines use a credential store (AWS Secrets Manager in our case) to hold credentials. We make extensive use of runtime file parameters for database connection strings, Schema Registry URLs, Kafka broker URIs and any other value that changes with the execution environment. This lets us build a pipeline once and then create jobs in different environments without changing it. It also protects against using values from the wrong environment. No secret IDs or secret keys are used in any of our pipelines; we use IAM roles to access AWS resources.
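The runtime-parameter idea can be sketched in a few lines of plain Python. This is not the StreamSets mechanism itself: the property names, host names and the `resolve_config` helper are all hypothetical, and credentials are deliberately left out because they would come from the credential store rather than from a file.

```python
from string import Template

# Hypothetical per-environment runtime files, inlined as text for the sketch.
RUNTIME_FILES = {
    "dev": (
        "# dev values\n"
        "kafka_brokers=dev-broker.example:9092\n"
        "schema_registry_url=http://dev-registry.example:8081\n"
    ),
    "prod": (
        "# prod values\n"
        "kafka_brokers=prod-broker.example:9092\n"
        "schema_registry_url=http://prod-registry.example:8081\n"
    ),
}

# The pipeline is defined once, with placeholders instead of literal values.
PIPELINE_CONFIG = {
    "kafka_brokers": "${kafka_brokers}",
    "schema_registry_url": "${schema_registry_url}",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def resolve_config(pipeline_config, environment):
    """Fill each placeholder from the chosen environment's runtime values.

    Template.substitute raises KeyError when a placeholder has no value,
    so a job cannot silently start with a missing setting.
    """
    props = parse_properties(RUNTIME_FILES[environment])
    return {
        key: Template(value).substitute(props)
        for key, value in pipeline_config.items()
    }


dev_job = resolve_config(PIPELINE_CONFIG, "dev")
prod_job = resolve_config(PIPELINE_CONFIG, "prod")
print(dev_job["kafka_brokers"])
print(prod_job["kafka_brokers"])
```

The same `PIPELINE_CONFIG` produces a dev job and a prod job; only the runtime values differ.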
#7 Parameterize your pipelines
We try to create our pipelines in such a way that they can be used for multiple data sources and targets. This allows us to reuse, scale, standardise and create efficient pipelines across DNB. An example is a pipeline that reads a topic from Kafka, adds tracking attributes and writes to a Delta Lake table. Once the pipeline is created, we can use it again and again in multiple jobs to read from several topics and create several Delta Lake tables.
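As a rough illustration of one pipeline serving many jobs, the sketch below keeps the logic generic and moves everything job-specific into parameters. It does not talk to Kafka or Delta Lake; in-memory lists stand in for both, and the names (`JobParameters`, `run_pipeline`, the topic and table names, the tracking attribute keys) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobParameters:
    """Values that differ per job; the pipeline logic itself never changes."""
    source_topic: str
    target_table: str


def run_pipeline(params, records, ingested_at):
    """Generic pipeline step: add tracking attributes to every record."""
    return [
        {
            **record,
            "_source_topic": params.source_topic,
            "_target_table": params.target_table,
            "_ingested_at": ingested_at,
        }
        for record in records
    ]


# Two jobs reuse the same pipeline with different parameters.
jobs = [
    JobParameters("customer-events", "customer_events"),
    JobParameters("payment-events", "payment_events"),
]

for job in jobs:
    output = run_pipeline(job, [{"id": 1}], "2024-01-01T00:00:00+00:00")
    print(job.target_table, output)
```

Adding a third topic then means adding a third set of parameters, not a third pipeline.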
#8 Use Labels and Tags
Several StreamSets components allow the use of tags and labels. We make sure to use them everywhere and to make them as descriptive as possible. This helps us monitor, group and search components, and grant access accordingly.
- Execution Environment (Sandbox, Staging, Prod)
- Provider (AWS, Azure, GCP, On Prem)
- Pipeline Environment (Dev, Test, Prod)
For Pipelines and Jobs
- Test Indicator
- PoC Indicator
#9 Role Based Access Control implementation
We have a multi-tenant setup for StreamSets at DNB. We do not want Data Citizens from one Product getting access to objects from other Products. Roles also determine which environments users have access to.
- Data Engineers – Dev
- Testers – Dev and Test
- Operators – Production
- Operations – Production
- Administrators – All
To control this, we have set up groups for
- Environment – Dev, Stage, Production
- Role – Data Engineers, Operators, Operations, Admin
- Products – Product A, Product B, Product C
Each user is then configured to have one or more groups from the above categories.
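The effect of combining groups from the three categories can be sketched as a simple membership check. This is only an illustration of the rule, not how Control Hub evaluates permissions; the `User` class, the `can_access` function and the sample user are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """A user with group memberships in each of the three categories."""
    name: str
    environments: frozenset
    roles: frozenset
    products: frozenset


def can_access(user, product, environment):
    """Allow access only when the user is in both the product group
    and the environment group of the object being requested."""
    return product in user.products and environment in user.environments


engineer = User(
    name="example-engineer",
    environments=frozenset({"Dev"}),
    roles=frozenset({"Data Engineers"}),
    products=frozenset({"Product A"}),
)

print(can_access(engineer, "Product A", "Dev"))
print(can_access(engineer, "Product B", "Dev"))
print(can_access(engineer, "Product A", "Production"))
```

Only the first check passes: the other two fail on the product group and the environment group respectively.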
#10 Document and comment on commits
All code commits in pipeline versions need to be descriptive and mention the details of what has been worked on. Provide Jira ticket references for any feature release or bug fix. This helps us create a Release Document that lists feature upgrades and bug fixes for each release.
#11 Naming Standards for Pipelines, Jobs and Topologies
Because StreamSets is multi-tenant across DNB, it was very important to make sure that each pipeline can be tied to a Project, Source and Destination, and that each Job can be tied to an Environment, Project and Pipeline.
- The rule for naming a pipeline is: Product-Source-Destination-(Generic/Specific)
- The rule for naming a job is: Environment-Product-Source-Target-(Table/Topic Name)
- The rule for naming a topology is: Environment-Product-(Details)
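The three rules above can be expressed as small helper functions so that names are built rather than typed by hand. This is a sketch under assumptions: the function names and sample values are illustrative, and rejecting parts that contain a hyphen is our own addition to keep the names unambiguous.

```python
def _join(*parts):
    """Join name parts with '-', rejecting empty parts and embedded hyphens."""
    for part in parts:
        if not part or "-" in part:
            raise ValueError(f"invalid name part: {part!r}")
    return "-".join(parts)


def pipeline_name(product, source, destination, scope):
    """Product-Source-Destination-(Generic/Specific)"""
    if scope not in ("Generic", "Specific"):
        raise ValueError("scope must be 'Generic' or 'Specific'")
    return _join(product, source, destination, scope)


def job_name(environment, product, source, target, object_name):
    """Environment-Product-Source-Target-(Table/Topic Name)"""
    return _join(environment, product, source, target, object_name)


def topology_name(environment, product, details):
    """Environment-Product-(Details)"""
    return _join(environment, product, details)


print(pipeline_name("ProductA", "Kafka", "DeltaLake", "Generic"))
print(job_name("Dev", "ProductA", "Kafka", "DeltaLake", "customer_events"))
print(topology_name("Dev", "ProductA", "Ingestion"))
```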
#12 Use Fragments and Templates
StreamSets Control Hub supports pipeline fragments and templates (now called demo pipelines). A pipeline fragment can be created from one or more processors. Fragments can cover origins, intermediate processing or destinations, and they can be used to package up business logic. One example is setting values for tracking attributes in Avro files: we packaged four different processors in a fragment and then used it in multiple pipelines. You can also create a template pipeline with certain origins, processors and destinations. Base your new pipeline on this template and it comes with all the components in place; just configure and use.
Note that fragments are linked to the pipelines that use them, so if you change a fragment you have the option to update all or some of those pipelines with the changes. Templates are not linked, so you can base new pipelines on new templates without changing the existing ones.
#13 Provide explanatory descriptions on pipelines, jobs and components
Pipelines, Jobs, Topologies, Origins, Destinations, Processors and Executors all allow descriptions to be added. Descriptions let others, and your future self, read why these components were created. Within a pipeline, adding descriptions to processors, origins, destinations and executors also makes maintenance and debugging much easier, and provides in-place documentation of the objects.
Summary of Data Engineering Best Practices
When we build data pipelines, efficiency in data processing is not enough. We need to make sure that the operations team can easily manage the pipeline.
When you follow these data engineering best practices and make your data pipelines consistent, robust, scalable, reliable, reusable and production ready, data consumers such as data scientists can focus on science instead of worrying about data management.