I am always eager to learn about new architectures and big data best practices. Recently I came across a paper from Trifacta discussing the role of data preparation, and it got me thinking about the complementary nature of data ingestion and data preparation.
Data preparation, more colorfully known as data wrangling, is the activity performed by data-driven professionals, such as data or business analysts, to explore, clean, transform and blend data of all varieties to make it trustworthy for analysis or predictive modeling. This form of data manipulation has traditionally been done in Excel or, for more technically advanced end users, in languages such as R, SAS or Python. But with the rise of enormous and dynamic data sets in Hadoop, these approaches are no longer feasible. Trifacta took the lead in creating a self-service, web-based solution that enables business users to access and manipulate data stored in Hadoop without needing programming skills.
The Rise of Enormous and Dynamic Data Sets
Data preparation gives end users the ability to explore the wealth of data in the lake, use trial and error to validate hypotheses, blend different data sets on demand, and apply various data manipulation and cleansing rules. They can prepare data as a one-off to get their answers right away, or operationalize a preparation “recipe” to deliver results on a recurring basis. This is a huge benefit to analysts and data scientists, as these preparation tasks have been estimated to take 60% to 80% of their time.
In thinking about where data preparation tools like Trifacta sit in the data life cycle, you can consider them consumption-side tools: they pull data at rest from the data lake and output the prepared data to analytic tools. This is complementary to StreamSets Data Collector, which populates the data lake and operates on data in motion.
So how do these technologies work together?
When business users discover and manipulate data with Trifacta, they apply various rules to make the data consistent and accurate to the level required for their analysis. They would prefer not to define and validate the data every time they access it, but rather to have the data already clean for them and their coworkers. To get the data to that level, it is better to apply basic “sanitization” functions upfront, at the ingestion phase, so that everyone accessing the data store can benefit.
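As a rough illustration of what basic sanitization at ingest might look like, here is a minimal Python sketch. It is not Trifacta or StreamSets code; the function name `sanitize_record`, the placeholder values treated as missing, and the sample field names are all assumptions made for the example.

```python
from typing import Any, Dict

# Placeholder strings treated as "missing" (an assumption for this sketch).
MISSING_MARKERS = {"", "N/A", "NULL"}


def sanitize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Apply basic, analysis-agnostic cleanup to one incoming record."""
    clean: Dict[str, Any] = {}
    for key, value in record.items():
        # Normalize field names: trim, lowercase, spaces to underscores.
        name = key.strip().lower().replace(" ", "_")
        if isinstance(value, str):
            # Trim stray whitespace and map placeholder strings to None.
            value = value.strip()
            if value.upper() in MISSING_MARKERS:
                value = None
        clean[name] = value
    return clean


if __name__ == "__main__":
    raw = {" Customer Name ": "  Ada ", "Age": "N/A", "score": 3}
    print(sanitize_record(raw))
```

Rules like these do not depend on any particular analysis, which is why applying them once during ingestion can spare every downstream user from repeating them.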
This is where Trifacta and StreamSets work together. First, Trifacta data preparation recipes (the series of steps that prepare a data set for consumption) are shareable and, when they apply to data being ingested, can be implemented in StreamSets pipelines at ingest time. StreamSets can also raise alerts when it detects that incoming data has drifted outside expected ranges, which may indicate a quality issue. This is a huge benefit in making sure the data store is fed with trustworthy data and does not become a data dump.
A sample Trifacta data preparation recipe.
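To make the idea of alerting on data that drifts outside expected ranges more concrete, here is a small Python sketch. It does not use the StreamSets API; the field names, the ranges in `EXPECTED_RANGES`, and the 5% threshold are hypothetical values chosen only for illustration.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical expected ranges per numeric field (assumed for this sketch).
EXPECTED_RANGES: Dict[str, Tuple[float, float]] = {
    "temperature_c": (-40.0, 60.0),
    "humidity_pct": (0.0, 100.0),
}


def out_of_range_rate(records: List[Dict[str, Any]], field: str,
                      low: float, high: float) -> float:
    """Return the fraction of non-missing values of `field` outside [low, high]."""
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return 0.0
    outside = sum(1 for v in values if not (low <= v <= high))
    return outside / len(values)


def drift_alerts(records: Iterable[Dict[str, Any]],
                 threshold: float = 0.05) -> List[str]:
    """List an alert for each field whose out-of-range rate exceeds `threshold`."""
    batch = list(records)
    alerts: List[str] = []
    for field, (low, high) in EXPECTED_RANGES.items():
        rate = out_of_range_rate(batch, field, low, high)
        if rate > threshold:
            alerts.append(f"{field}: {rate:.0%} of values outside [{low}, {high}]")
    return alerts


if __name__ == "__main__":
    batch = [
        {"temperature_c": 20.0, "humidity_pct": 50.0},
        {"temperature_c": 95.0, "humidity_pct": 55.0},
        {"temperature_c": 21.0, "humidity_pct": None},
    ]
    for alert in drift_alerts(batch):
        print(alert)
```

Checking a rate over a batch, rather than rejecting individual records, is one possible design choice: a single odd value is tolerated, while a sustained shift triggers an alert that someone can investigate.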
Second, some Trifacta data preparation work may need to be operationalized so that the resulting analyses are delivered automatically on a regular basis. In this case, the IT team takes on the request to automate the process and makes sure it follows internal IT policies.
Again, StreamSets can incorporate Trifacta preparation recipes into the ingestion process to automate the cleansing of data up front.
Ultimately, both sanitization upon ingest and data preparation are important components of a successful data lake or Hadoop implementation, helping ensure rapid adoption of this new paradigm by business users.