Dataflow Performance Blog

Embrace Diversity in Your Data Architecture

Many Roads Lead to Rome

Over the last ten years, the data management landscape has changed dramatically — on that, I think we can all agree. The rise of big data and the new data management ecosystem has created an abundance of new patterns and tools, each of which is more specialized than the last. With each new iteration, engineers and architects face pressure from all sides to simplify and consolidate.

But counter-intuitively, the best data architects embrace infrastructure diversity rather than fight it. The reality is that all of these tools and patterns have important uses in enterprise data architecture, and that today Kafka is no more the cure for all that ails than MapReduce was five years ago. The most sophisticated enterprises enable each business unit to use best-of-breed technology to succeed while facilitating seamless integration between them, creating agility and avoiding the chaos that often arises across legacy environments.

When designing applications, there are many business and technology requirements that need to be considered. Rather than try to force an application into a singular architecture, it is important to ask the right questions to determine which architecture best fits a particular application.

How is the data received?

If you have data that arrives in a stream, a streaming architecture may make perfect sense. Data coming in over REST interfaces, or log data and events being written to files, are perfect candidates for stream processing. Treating a relational table as a stream of rows is an elegant way to map an otherwise batch dataset onto a streaming application architecture, and in fact, this is how StreamSets processes relational data.
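To make the "table as a stream of rows" idea concrete, here is a minimal sketch in Python. It is not how StreamSets implements it; it simply assumes the table has a monotonically increasing offset column (a hypothetical `id` column here) and uses an in-memory SQLite database as a stand-in for a real relational source. The function and table names are illustrative only.

```python
import sqlite3
from typing import Iterator, Tuple


def stream_rows(conn: sqlite3.Connection, table: str, offset_column: str,
                last_offset: int = 0, batch_size: int = 2) -> Iterator[Tuple]:
    """Yield rows one at a time, in offset order, starting after last_offset."""
    # Table and column names cannot be bound as SQL parameters, so this sketch
    # assumes they come from trusted configuration rather than user input.
    query = (f"SELECT {offset_column}, * FROM {table} "
             f"WHERE {offset_column} > ? ORDER BY {offset_column} LIMIT ?")
    while True:
        batch = conn.execute(query, (last_offset, batch_size)).fetchall()
        if not batch:
            return  # caught up; a long-running pipeline would wait and poll again
        for row in batch:
            yield row[1:]  # drop the extra leading offset value
        # Remember the highest offset seen so the next query resumes after it.
        last_offset = batch[-1][0]


def demo(last_offset: int = 0) -> list:
    """Build a tiny example table and read it as a stream."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    conn.executemany("INSERT INTO orders (id, item) VALUES (?, ?)",
                     [(1, "apple"), (2, "pear"), (3, "plum")])
    return list(stream_rows(conn, "orders", "id", last_offset=last_offset))
```

Because the consumer only ever tracks the last offset it has seen, a downstream application can treat the table like any other ordered stream and resume where it left off after a restart.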

However, not all data is available in this way. Businesses often rely on data sourced from third parties, partners, or customers. If that data is not made available as a stream, or if new data only becomes available intermittently, a streaming architecture may not make sense. If data arrives infrequently, the choice is between an always-on streaming pipeline that sits idle for most of the day and a batch job that is started only when necessary and stopped as soon as its work is finished. That choice comes down to resource management and economics, and a streaming architecture may not always be feasible.

How urgently does the data need to be consumed?

The urgency of data availability is a major driver for continuous data movement and streaming ingest within the business. Users have come to expect immediate gratification and businesses know that they’re missing opportunities to improve “customer experience” if they don’t deal with data as it becomes available. This is true with both external and internal customers. Externally, an application might provide a “next best action” for a customer on a website, outside a storefront, or in front of a shelf of products. Internally, it might be a cybersecurity or operational insight application, whose customers want to be able to iterate quickly on the questions they’re able to ask and the data they’re able to analyze. In particular, data urgency requires you to answer the question “how will I manage my data operations?” For instance, the definition and continuous enforcement of SLAs for data latency will play a role in the technology you choose.
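As a rough illustration of what enforcing a data latency SLA can mean in practice, the sketch below flags records that took longer than an agreed threshold to become available. The `Record` type, its field names, and the `sla_breaches` function are hypothetical, invented for this example; a real deployment would pull these timestamps from its own pipeline metrics.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Record(NamedTuple):
    record_id: str
    created_at: datetime    # when the event happened at the source
    available_at: datetime  # when it became usable at the destination


def sla_breaches(records: List[Record], sla: timedelta) -> List[str]:
    """Return the IDs of records whose end-to-end latency exceeded the SLA."""
    return [r.record_id for r in records
            if r.available_at - r.created_at > sla]
```

Running a check like this continuously, rather than once at design time, is what turns a latency target into an enforced SLA.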

Am I working with events?

If your data is event-based, applications can often fit nicely into a stream processing paradigm, especially if the messages are small. Unfortunately, stream processing systems like Kafka and Amazon Kinesis are not designed to handle large messages (megabytes or more). For applications that deal with large unstructured data like images, audio or documents, it probably won’t be technically feasible to process the data as a stream. In that case, taking a continuous data movement approach enables unstructured objects to be moved to a location where processing is practical, while metadata is passed in parallel to other locations like search indices.
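One way to picture that split between objects and metadata is the sketch below. It is only an illustration: the 1 MB threshold is an assumed cutoff, a plain dict stands in for an object store, a list stands in for the message stream, and the `route` function and its message fields are made up for this example.

```python
import hashlib
from typing import Dict, List

MAX_INLINE_BYTES = 1_000_000  # assumed cutoff of roughly 1 MB; tune per system


def route(key: str, payload: bytes, object_store: Dict[str, bytes],
          stream: List[dict], max_inline_bytes: int = MAX_INLINE_BYTES) -> dict:
    """Send small payloads through the stream; park large ones in storage."""
    if len(payload) <= max_inline_bytes:
        # Small event: the stream carries the body itself.
        message = {"key": key, "inline": True, "body": payload}
    else:
        # Large object: store it once, and stream only metadata that lets
        # consumers (for example, a search indexer) find and verify it.
        object_store[key] = payload
        message = {
            "key": key,
            "inline": False,
            "size": len(payload),
            "sha256": hashlib.sha256(payload).hexdigest(),
            "location": key,
        }
    stream.append(message)
    return message
```

The stream stays fast because it only ever carries small messages, while the large object is still reachable through the location recorded in its metadata.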

What tools are available for the job?

The open source movement within data management has spawned an abundance of powerful, purpose-built technologies. Within the Apache Software Foundation alone, there are 38 different projects labelled big data, cloud, database, or Hadoop.

This is both a blessing and a curse. It enables organizations to craft extremely specialized solutions, but even though they may leverage the same data, these systems often don’t run in the same place on the same hardware, and may not be managed by the same teams. Over time this can create an incomprehensible sprawl of data storage and systems, plus an internal cottage industry for maintaining a complex web of interconnections.

For this reason, it is critical to adopt a best-of-breed technology for data movement, ideally one on which the organization can standardize. In short, it’s impossible to leverage the breadth of technology available without first being able to move the data to where it needs to be.

How do economics play into the equation?

The choice between batch and stream processing is not just a technical decision but an economic one as well. For instance, a data synchronization use case may run perfectly well in a batch environment, and re-architecting the process to run as a streaming application may be prohibitively expensive. Writing custom code, regardless of whether the application is streaming or batch, is often more expensive to implement and maintain than using a packaged toolset. Custom code also has a tendency to accrue technical debt, which can be expensive to pay off later, especially as the original developers move on to new projects. In the end, adopting a data movement tool that retains flexibility at its core can be a very sound economic decision.

We built StreamSets specifically to help data professionals succeed in exactly this environment of ever-accelerating diversity. The combination of a platform-agnostic core, a wide library of connectors, and management and discovery tools gives data engineers and architects the capabilities needed to craft applications composed of the best-in-class systems from every category.

At the end of the day, application architectures are just a means to an end, and what really matters is being able to present the right information, in the right way, at the right time, to the right people, at an acceptable cost. Whether the information is consumed through a graphical UI or a programmatic API, by people or machines, the choice of approach is simply the means to that end. Forcing everything into one paradigm for the sake of architectural purity robs the organization of the adaptability and effectiveness afforded by a hybrid architecture. The real world requires flexibility, and your tools should take you there.

Jonathan Natkins