I am always eager to learn about new architectures and best big data practices. Recently I came across a paper from Trifacta discussing the role of data preparation and it got me thinking about the complementary nature of data ingestion and data preparation.
Data preparation, more colorfully known as data wrangling, is the activity performed by data-driven professionals, such as data or business analysts, to explore, clean, transform and blend data of all varieties to make it trustworthy for analysis or predictive modeling. A form of data manipulation that has traditionally been achieved using Excel or, for more technically-advanced end users, languages such as R, SAS or Python. But with the rise of enormous and dynamic data sets in Hadoop, these approaches are no longer feasible. Trifacta took the lead in creating a self-service web-based solution that enables business users to access and manipulate data stored in Hadoop without needing programming skills.