This week we are happy to host Keith Gaputis. Keith is the Data Engineering Lead at Axis Group. Axis Group is a longtime StreamSets partner and friend to the field. Keith joins us to talk about cloud, containers, and working on functional teams across engineering and data science. Be sure to check out Keith’s talk at DataOps Summit, Modern Data Architecture: Lessons from the Trenches, on September 29th at 12:45 PM PDT.
Sean Anderson 0:38
Hello, hello, and welcome back to a new and exciting episode of Sources and Destinations Podcast, where we talk about all things data, data engineering, data science, and machine learning. I’m one of your hosts, Sean Anderson. And as always, I’m joined by my co-host, Dash Desai. And this week, we’re happy to host our guest, Keith Gaputis. He’s the data engineering lead at Axis Group. And Axis Group is a longtime StreamSets partner, a friend to the field, and a great partner in solutioning in the data management and data integration space. Axis Group has long been known as an expert in things like big data, data management, and solutions built around the various workloads in those ecosystems. So I’m really happy to get somebody like Keith on the podcast, with extensive experience in these data projects, and not only that, in working with data engineering teams and data engineers directly. So welcome, Keith. Thanks for joining our podcast this week.
Keith Gaputis 1:44
Thanks a lot, Sean. It’s an honor to be included in the podcast. Hopefully, I can share some thoughts and experiences that will be useful to people.
Sean Anderson 1:54
I think there’s no doubt about that, Keith. Without further ado, I’d like to hand it over to my co-host Dash to kick us off with the first question.
Metadata-driven Data Flows
Dash Desai 2:04
Hey, welcome, Keith. And thanks for joining us today. So when we first spoke, you talked a lot about metadata-driven data flows. Can you share experiences around that with our listeners today?
Keith Gaputis 2:17
Yeah, sure. That’s definitely something that comes up a lot on bigger projects. Often, there are a lot of different data sources or tables that we want to ingest. There are many, many different jobs that we want to run on a particular platform. There could be complex dependencies, with downstream jobs that run when ingestion jobs are done, and you start getting up into numbers like hundreds or thousands of jobs. It really becomes a question of how we can use metadata. That could be rows in a database table or a CSV inventory file in version control. It could be lots of things. But how can we use this metadata to automate the creation of jobs, the management of jobs, the execution of jobs, maybe even chaining jobs together in the orchestration of jobs? It’s just something where it becomes really, really difficult, if not impossible, to manage these things manually. That’s kind of what it means to me. I would say my recommendation is always to try to start simple if you’re not taking a metadata-driven approach right now to loading or processing data, and really focus on just automating the stuff that people are doing manually. Certain tools are more conducive to this. In general, we like to look for tools that have good APIs or SDKs, because oftentimes what we’re doing is writing a little bit of custom code to do this automation. One example of these tools could be StreamSets. For example, the Python SDK makes it easy to automate the creation or execution of pipelines or even dynamically generate pipelines based on metadata. Regarding orchestration software, many people use something like Apache Airflow, which tends to be kind of static. We’ve been using a tool recently called Prefect as an alternative, which is more dynamic. In general, we’re trying to find tooling in our data platforms that’s more conducive to being controlled via metadata.
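[Editor’s illustration, not part of the conversation] A minimal sketch of the metadata-driven idea Keith describes, assuming a CSV inventory with one row per job. The column names, the job names, and the `run_job` callback are hypothetical; in a real platform that callback might call a pipeline tool’s SDK or an orchestrator, which is not shown here. The standard-library `graphlib` module (Python 3.9+) is used only to order jobs by their dependencies.

```python
import csv
import io
from graphlib import TopologicalSorter

# Hypothetical inventory: one row per job; "depends_on" lists upstream
# job names separated by ";" (empty means no upstream jobs).
INVENTORY_CSV = """job_name,source_table,target_table,depends_on
ingest_orders,erp.orders,raw.orders,
ingest_customers,erp.customers,raw.customers,
build_sales_mart,raw.orders,mart.sales,ingest_orders;ingest_customers
"""


def load_inventory(text):
    """Parse the inventory text into a dict of job_name -> job definition."""
    jobs = {}
    for row in csv.DictReader(io.StringIO(text)):
        deps = [d for d in row["depends_on"].split(";") if d]
        jobs[row["job_name"]] = {
            "source": row["source_table"],
            "target": row["target_table"],
            "depends_on": deps,
        }
    return jobs


def execution_order(jobs):
    """Return job names so that every job comes after its upstream jobs."""
    graph = {name: set(job["depends_on"]) for name, job in jobs.items()}
    return list(TopologicalSorter(graph).static_order())


def run_all(jobs, run_job):
    """Call run_job(name, definition) for each job in dependency order."""
    for name in execution_order(jobs):
        run_job(name, jobs[name])


if __name__ == "__main__":
    inventory = load_inventory(INVENTORY_CSV)
    # Stand-in for real job creation or execution: just print the plan.
    run_all(inventory, lambda name, job: print(name, job["source"], "->", job["target"]))
```

Adding a new table to ingest then means adding a row to the inventory rather than hand-building another job, which is the "start simple" step described above.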
Sean Anderson 4:56
Yeah, I think that’s really interesting. I know, as the data engineering lead, and specifically since you’re working with end clients that are kind of all over the place, you see everything from big data to cloud to data science, machine learning, all of those things. As you’re working closely across this broad array of requests, how do you leverage the collective expertise of the people on your team? You mentioned things like StreamSets and Airflow. How do you really bring that together to cover such a broad surface area? And are there things about that that are reusable or good best practices?
Keith Gaputis 5:36
That’s a good question, Sean. Ideally, if it’s a greenfield use case and we can make recommendations on tooling, we do try to stick with certain technologies that we think are best in class. As a company, we’re tool agnostic. We often work with what clients have in their existing environments and try to cater solutions to what clients will be successful with. I would say that, in general, it’s difficult, if not impossible, for a data engineer to know everything and be good at everything. And it always helps just to break the problem up into pieces. It could be something like data ingestion, data transformation, cloud infrastructure, development, DevOps, things like that. And then having people on the team who are a bit more specialized in one or the other. As a company, we focus a lot on training to ensure that people have a strong foundation in Linux, at least one cloud platform, things like Docker, Kubernetes, Python, some sort of modern data movement tool like StreamSets, and a platform like Snowflake. If people have a pretty good representative set of experience, it can be pretty easy to translate that into a different cloud platform or a similar tool in the same class. It is something where, obviously, not everyone can know everything or be great at everything. So we try to set up lots of internal communication channels to the different subject matter experts that we have in the company, so they can offer advice or jump in and look at a particular problem. I mean, this is almost a parallel to a company having a COE program to gather and foster that knowledge and experience and to promote collaboration. We also try to build out examples of solutions, things like reference architectures in a lab environment, so that people can do more than just training.
I’ve certainly seen companies not investing enough in training, but actually having a place to go in and play around with the types of things that you’re wanting to learn or ramp up on really helps quite a bit. As a consulting company, in general, we get to work with a lot of different clients. Every client, every use case is different. But there are also many similarities in how we approach those use cases and in the clients’ tech stacks. Seeing that kind of helps you realize, like, yeah, if you know this, it’s pretty easy to learn that. That’s a benefit, I would say. But in general, this is a tough problem for everyone. There is a very broad surface area. There can be many moving parts in a modern data platform, and it just involves constant learning, I would say.
Dash Desai 9:07
Yeah, so Keith, you mentioned cloud platforms. In your mind, what does the cloud-native technologies landscape look like today?
Keith Gaputis 9:18
When I hear cloud-native, I think of Kubernetes and sort of the Cloud Native Computing Foundation ecosystem. I think you could take a broader view and think of cloud-native as things like Snowflake and different SaaS products, sort of cloud vendor offerings. But specifically in terms of Kubernetes, obviously, it’s a rich ecosystem. Many different layers are being built up on top of the core ability to run containers in the cloud. But I do feel like the data world is still playing catch-up. We have more and more of our biggest customers asking for data platforms that run workloads in Kubernetes as a big part of the platform. And they do tend to have a lot of in-house proficiency. When they’re asking for it, we usually find that it’s more upstream of the data layer. They’re using it for transactional systems, for custom apps, things like that. But the cloud-native technologies haven’t made it yet into the data analytics realm that much in these companies. This is interesting, and, you know, I’m really big on cloud-native technologies and trying to advocate for them. I will say that, depending on where someone’s at, it can be difficult to jump directly into Kubernetes and get to a point where you have parity with how you’ve been operating for many years with existing data platforms and systems and jobs. There’s a lot of initial upfront investment. I have found that once you have a way to do each thing, whether it’s deploying something, or making sure there’s a good way of managing and exposing credentials, or sending traffic in, or monitoring apps, or running jobs a certain way, really whatever it is, you can come up with a standard way of doing it across the board. I’ve been doing a lot of software development for most of my career, and you’re always trying to do the whole “don’t repeat yourself” thing.
The more that you can take work and put it into a container, the better: you know how it’s going to run, it’s very predictable, it’s very repeatable, it’s stateless. Then you come up with a way to do all the things that you need to do using that broader container ecosystem in the cloud. There are just huge benefits. But it is certainly a journey.
Dash Desai 12:25
Yeah, so just a quick follow-up on that. How easy or difficult are these cloud companies making it to adopt these tech stacks?
Keith Gaputis 12:34
If you use a cloud vendor offering, something like AKS, EKS, or Google Kubernetes Engine, that’s about the easiest way to start using Kubernetes as an organization. That being said, there is still a pretty big upfront investment, as I mentioned, to get to a point where you’re comfortable putting things into production. Oftentimes, it can help to take an intermediate step, something like Azure Container Instances or Fargate, where you’re starting to run workloads in Docker but without some of the complexity of Kubernetes. But really, I think that the biggest challenge isn’t how easy it is to adopt it from a cloud vendor perspective, but more just getting used to a lot of new concepts: needing to rethink how your apps are constructed and deployed, how your jobs are packaged up, and really just trying to get to parity with how you’ve been doing things for quite a while with a more traditional approach. Once you can get one thing that’s not trivial into production, and you’re comfortable with it, and everyone’s comfortable with how to operate it and maintain it over time, it becomes a lot easier to start moving more and more workloads into this cloud-native approach, and the benefits that you’re going to get start to become more clear. So really, it’s about taking it just one step at a time and almost viewing it as part of a long-term roadmap to switch into using more cloud-native technologies.
Sean Anderson 14:57
Yeah, that’s really great. And you know, something we hear pretty often on the podcast is really that struggle to get things into production, that struggle to move from a legacy way of working with data and the processes around that. When you’re speaking with these end clients, how do you best describe to them the importance of moving to some of these more modern data practices? You mentioned things like containers and various cloud technologies. How do you convince a customer or a client that they need to rethink the way that their processes are structured?
Keith Gaputis 15:45
It’s a good question. I would say that, ideally, people are asking us for it, in which case we don’t really have to make that case as much. If you’re building a new platform, it’s an opportunity to rethink how you’re doing things. It can be pretty difficult in a legacy architecture and environment to really begin to adopt these things, which is why I think that the data world is still playing catch-up. But really, it’s trying to get to a point where everything can be code, where it’s incredibly easy to deploy things, where you know how it’s going to work. You can easily scale it up, you can scale it down, and it can easily be turned off when you don’t need it. There are all these benefits. And I think that for many people, it’s when they start to hit a wall, or begin to get new requirements that they can’t easily meet, or the cost of maintaining existing systems gets too high. Things like that could probably make the case easier to sell. As you’re trying to move to a new platform, it’s just trying to embrace all the new technologies that are out there that will make your life easier in the long run. And potentially finding a partner who can help you in that journey if you don’t have the right knowledge in-house.
Data Modeling and Visualization
Sean Anderson 17:33
I think what you said there is really impactful, about having a partner that has that expertise. And I know that specifically with Axis Group, you know, we’ve talked a lot about the data architecture and things like modernization and cloud, but a lot of times you guys are also being brought in at the actual use case level. And you have, you know, people that are actually modeling data and doing the analytics side of that. So can you give us a characterization of how you guys define the requirements across infrastructure and across the types of data modeling and visualizations that have to happen? And then, how do you work with those teams?
Keith Gaputis 18:16
Yeah, on a big project, we do typically tend to use cross-functional teams. We do have different roles. We typically do more of the data engineering work, which is taking a process view of the world where we’re moving data, integrating tools, trying to get that sort of everything-as-code thing that I just mentioned, really embracing the principles of DataOps or MLOps. But we do have a totally separate group that we call data architecture, which is really more focused on the form of the data and how the data sets are organized into storage tiers and within the storage tiers. And they’re really big on agile data modeling. So we do a lot of agile software development, but they do agile data modeling based on the concept of prioritized business events. So it really is about understanding how the company works: what are the business events in that company? And then trying to represent that in the form of data. So it’s something where we work together on projects with them, and, you know, they tend to tackle it from more of a use case perspective. We tend to tackle it from more of a system perspective. And then there are other teams and groups involved. It could be data governance, or as you said, our enterprise analytics team, which does a lot more of the BI or reporting, data visualization type stuff. Also predictive analytics, which does a lot of the data science and machine learning. But then, more broadly, we have a group called analytics enablement, which really helps companies set up COE programs to ensure adoption of their platforms and their analytics and to enable employees through training programs and certain custom tools that we’ve developed.
And then even things like data strategy or digital transformation, which are more top-down, trying to help companies build data strategies that align with their business strategies and modernize their business processes, things like that. So there are a lot of different people working on some of these problems. And it really is the kind of thing where you need to take a bunch of different perspectives, I think, to arrive at the best outcome. Because as engineers, we like to build cool, complicated stuff, but we need to make sure that what we’re building aligns with what people actually need, and that it’s going to help them be successful and get an ROI on their investments.
Modular Data Mesh
Dash Desai 21:29
Yeah, absolutely. So when we first spoke, you also talked about modular data mesh. Can you elaborate on that for our listeners today?
Keith Gaputis 21:40
When I hear data mesh, I think there are a couple of different threads that are converging into this concept. I’ve been approaching it more from the angle where we’re building a single data platform that is almost a multi-tenant platform, so it could be a federated data warehouse, or a multi-tenant data lake, something like that. There’s this common theme where we want certain data sets and certain processes to be independently deployable, and we want the system to be modular, and things like that. And then there’s a whole other camp, which is the data fabric, where it could be multiple data platforms all serving up their own data sets and connected to each other with a data fabric. And data mesh is this idea of having data products that have contracts, having them be shared and connected to each other, but then creating ownership of them. And so I think, regardless of how you get to a data mesh, you know, it’s really this idea of having these modular, independently deployable sets of data and all the corresponding processes behind them. You wind up, you know, with the common pipelines and a common architecture, but just repeating that and having multiple instances of it. And obviously, certain tooling makes it easier to do this than other tools. Particularly from the data storage perspective, something like Snowflake is probably our go-to option for something like this. But it is definitely an increasingly common ask, although I would say many people don’t use the term data mesh specifically yet.
Dash Desai 24:09
Yeah, it makes sense. It almost sounds like it’s a mashup of different technologies and applications.
Keith Gaputis 24:14
It can be, and I think that’s really that other view, where it’s a bunch of different systems that have their own data products, and there is a contract for how you would communicate with them, how the data would be represented, something like that. It can be a bunch of totally independent systems that are independently maintained. But in the first case that I was describing, it’s more like one big system where we just have a bunch of different people that really have their own stuff. But it can work either way. It is ultimately a mashup.
That One Weird Thing
Dash Desai 24:57
Yeah, thanks for sharing that. That brings us to our last segment of the podcast, called That One Weird Thing. So Keith, as we’re working on these data engineering projects, big or small, you always run into these weird things, right? Like, one of the simplest examples is time zones. But then there are also things like SQL Boolean expressions that involve null values. And there are others, like when you’re working with CSV files, sometimes the empty values are read in as empty quotes and sometimes they’re null. So is there anything that you want to share with our audience today?
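[Editor’s illustration, not part of the conversation] A small sketch of the two quirks Dash mentions, using only Python’s standard library. SQLite stands in for "SQL" here, and other databases can differ in details. The sample CSV and the `empty_as_none` option are hypothetical choices made for the example.

```python
import csv
import io
import sqlite3


def null_logic_demo():
    """Evaluate a few Boolean expressions involving NULL in SQLite.

    SQL uses three-valued logic: comparing NULL to anything yields NULL
    (returned to Python as None), while AND/OR can still resolve when the
    other operand decides the result.
    """
    conn = sqlite3.connect(":memory:")
    row = conn.execute(
        "SELECT NULL = NULL, NULL AND 0, NULL OR 1, NOT NULL"
    ).fetchone()
    conn.close()
    return row


# Hypothetical sample: row 1 has an unquoted empty name, row 2 a quoted one.
SAMPLE_CSV = 'id,name\n1,\n2,""\n3,bob\n'


def read_rows(text, empty_as_none=True):
    """Read CSV rows, optionally normalizing empty strings to None.

    Python's default csv reader returns '' for both an unquoted empty field
    and a quoted empty field, so the choice of "empty means null" has to be
    made explicitly by the loading code.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append(
            {k: (None if (empty_as_none and v == "") else v) for k, v in record.items()}
        )
    return rows


if __name__ == "__main__":
    print(null_logic_demo())
    print(read_rows(SAMPLE_CSV))
```

A practical consequence of the first part: a filter such as `WHERE name <> 'bob'` silently drops rows where `name` is NULL, because the comparison evaluates to NULL rather than true.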
Keith Gaputis 25:36
Yeah, and I think from the sounds of the question, I’m assuming it’s more of a weird technical thing, because if it’s a nontechnical thing, I would say it’s some of the agile implementations that you see in different places where it’s not really agile, but they call it agile. Specifically, if we’re talking technical, I would maybe say more generally that every data source, every destination has its own nuances. You could be great at writing Spark jobs or great at developing StreamSets pipelines, and you’ve worked with a bunch of different data sources. And when you’re on a project working in a production-style environment, at scale, there are always just these weird things that come up. You know, assuming that the connection works in the first place, it could be security-related or performance-related. And it tends to be these weird, obscure things. You really have to understand that system to get to the bottom of it. One thing that comes to mind is that recently I was trying to load a very large table out of SQL Server. And for some reason, the JDBC driver couldn’t page through the result set. It was getting out-of-memory errors. And it turned out that someone at the client had unexpectedly added a clustered columnstore index in SQL Server, which had huge implications for how the JDBC driver could page through the result set. It couldn’t stream the results anymore, and it was just this really obscure issue that took a combination of JVM profiling and looking at SQL Server execution plans and trying to understand how the JDBC driver works and stuff like that. It’s one of those things where it almost reminds me of going out into the wilderness and, you know, not really appreciating or maybe respecting the things that can happen, that can go wrong, like you get eaten by a bear, or you’re in a boat and the boat capsizes, or something like that.
But in this case, it’s more the dangers of specific data sources or targets, or just maybe distributed systems in general. There are just these things that happen.
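[Editor’s illustration, not part of the conversation] Keith’s example involved a JDBC driver and SQL Server. The sketch below is only a Python DB-API analogue of the general idea of paging through a large result set in fixed-size batches instead of loading it all at once. An in-memory SQLite table with a hypothetical name (`big_table`) stands in for the source. Whether a real driver actually streams rows or buffers them internally depends on the driver and the server, which is exactly the kind of surprise described above.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of rows of at most batch_size from an executed cursor."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


def copy_in_batches(conn, batch_size=4):
    """Read big_table batch by batch and return the size of each batch.

    Recording the batch size is a stand-in for writing the batch to a
    destination; only one batch is held in this function's memory at a time.
    """
    cursor = conn.execute("SELECT id, payload FROM big_table ORDER BY id")
    return [len(batch) for batch in iter_batches(cursor, batch_size)]


def demo():
    """Build a 10-row sample table and page through it 4 rows at a time."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO big_table VALUES (?, ?)",
        [(i, f"row-{i}") for i in range(10)],
    )
    sizes = copy_in_batches(conn, batch_size=4)
    conn.close()
    return sizes


if __name__ == "__main__":
    print(demo())
```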
Sean Anderson 28:18
So Keith, I really loved that analogy there on the wilderness. I mean, there are just so many unknowns when working with data. And it definitely doesn’t help that the data is changing all the time: the structure of the data, the semantics of the data. And as we transfer it across these multiple solutions and systems, the evolving nature of data doesn’t make that an easy value proposition. So I think we’ve had a great conversation here today. Keith, can you tell our listeners real quick how they can find out a little bit more about you and about Axis Group?
Keith Gaputis 28:49
Yeah, sure. So if you want to find out more about me, you can just search for me on LinkedIn. It’s Keith Gaputis. Or you can go to axisgroup.com. As Sean mentioned, we’re big advocates of using StreamSets and of trying to solve some of the more challenging problems in the world of data, and we would love to talk.
Sean Anderson 29:21
Yeah, speaking of Keith, Axis Group, and StreamSets: Keith will be joining us at DataOps Summit on September 28 through the 30th of this year. The DataOps Summit is open to everyone. If you’d like to find out more about Keith and his talk and all the great talks that we have at the DataOps Summit this year, please visit us at https://www.dataopssummit-sf.com/about. It’s free to register for the event. We’ll be doing a special long episode of the podcast with a bunch of our guest speakers highlighted on the podcast as well. So come join us for the event and come join us for the very special podcast, where you can hear more from me. So until next time, we’ll see you on the next episode of the Sources and Destinations Podcast. Thank you, Keith, for your time today. And as always, thanks to my co-host, Dash.