Who actually owns the schema? I’ve been struggling with that. But I don’t know if it’s weird.
Our Guest This Week
This week’s guest is Naveen Siddereddy. Naveen is a Data Solutions Architect at Sony PlayStation. His work is aimed at helping studios across the world plan better for the next game release. His role requires using the right data frameworks to enable analysts, and coordinating with business line planners to make wise financial decisions. Naveen chats with Sean and Dash about how he empowers each studio to do data discovery and data integration guided by the schema. Naveen’s thought process was forever changed when he read “Making Sense of Stream Processing” by Martin Kleppmann, which showed him that a schema registry and Kafka could forever change the way he approaches data integration. This led him to a domain-specific data integration approach for his company, and led him to choose StreamSets as a key tool for modern data engineering designs. Using StreamSets and other tools, he empowers each studio to discover data based on clues within the schema. This helps the collective team execute on the key tenets of DataOps.
Here is the play by play….
Sean Anderson 0:38
Hello, Hello. And welcome back to the Sources and Destinations podcast, where we talk about all things data and data engineering. And this week, we’re happy to be with you with a brand new guest for the conversation. I want to remind our listeners that you can listen to us not only on our native homes on SoundCloud and Spotify, but also now through Google Podcasts and platforms like Breaker. You can also check us out on the StreamSets website under the community section, where you can stream all of these episodes directly through the website. And this week, I’m really excited about our shiny new logo. We hope that you guys like the design as well. And we’re inviting data engineers from all around the country to join in these conversations with us to really dig deep into data engineering and DataOps. And as always, I’m Sean Anderson, your host, and I’m joined by my faithful co-host Dash Desai. Will you tell us a little bit about our guest this week?
Dash Desai 1:36
Absolutely. Thanks, Sean. And thanks everyone for listening in today. So this week, we’re super excited to host Naveen. He’s a data Solutions Architect at Sony PlayStation. Naveen’s work is aimed at helping studios across the world plan better for the next game release. His role requires using the right data frameworks to enable analysts and coordinating with business line planners to make wise financial decisions. So hey, Naveen. Welcome to the sources and destinations podcast. Would you like to say a couple of words about your background in data engineering?
Naveen Siddereddy 2:18
Thanks for inviting me, Dash and Sean. I’ve been working with Sony PlayStation for the last seven years, and most of my work has been helping the financial analysts make buying and planning decisions. My focus has always been on ETL, mostly using tools to deliver those solutions. So I’m excited to be on this podcast specifically, you know, designed for data engineers. Thank you.
Apache Kafka Revolution
Dash Desai 2:45
Thank you for joining us today. So the first time we spoke, you mentioned that Kafka changed your view of how data management worked. Can you please elaborate on that for us?
Naveen Siddereddy 2:57
Once this whole big data world started exploding, there were too many technologies in that space. And coming from the Informatica world, which was my go-to tool for a lot of data processing, I pretty much got confused. I ended up with this book called “Making Sense of Stream Processing” by Martin Kleppmann. Out of all the technologies that I played around with up until then, this book changed my mindset. I didn’t even realize it was about Kafka; I got it at a conference. So when I was reading through it, it gave me so much confidence as a data engineer that, okay, they’re not just simply throwing away the old technologies, they’re saying that, you know, the whole concept is actually evolving from the older world. So that gave me confidence in saying that, yeah, they’re making sense. They’re not just saying that we’re inventing something new.
So, the key things that I liked in that book are the things that worked in the old days, right? One is transaction logs, which is the core thing that exists in every database. You pretty much can rebuild the whole database using transaction logs. They adopted that concept. And the second one is Unix piping, the concept of actually pushing stuff down from one command to another. Unix piping has been there for ages, up until the invention of message queues. So it’s a blend of all these three, and that is what I understood as Kafka, and that actually, you know, made it very clear for me that, okay, this makes sense. Then I started to map things into my world, which is databases and ETL tools, right? And I started to notice that the database basically consists of the transaction log, the storage, and a compute element. There are schemas to organize the whole data. So if you basically split these out, you have all these newer technologies in the Big Data world. To take an example, Kafka would be something that does storage, right? It has transaction logs embedded into it, so you pretty much covered the storage piece. Take an Oracle database, so storage and transaction log are covered there. And then the compute: in the Kafka world, you pretty much can now spin up a Kubernetes cluster just to process the data that sits in Kafka. You already took the compute out of it. And then Kafka has this, not just Kafka, Confluent has this concept called a schema registry.
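The transaction-log idea Naveen describes, that you can rebuild the whole database by replaying the log, can be sketched in a few lines. This is a toy illustration of the concept, not Kafka itself; the keys and values are made up:

```python
# A transaction log is an append-only list of changes. The current state of
# a "database" can always be reconstructed by replaying that log in order.

log = []  # the append-only transaction log

def append(key, value):
    """Record a change in the log instead of mutating state directly."""
    log.append((key, value))

def rebuild_state(entries):
    """Replay the whole log to reconstruct the current state."""
    state = {}
    for key, value in entries:
        state[key] = value  # later entries win, like log compaction
    return state

append("player:1", {"name": "Aloy"})
append("player:2", {"name": "Kratos"})
append("player:1", {"name": "Aloy", "level": 2})  # an update is just another log entry

print(rebuild_state(log))
# {'player:1': {'name': 'Aloy', 'level': 2}, 'player:2': {'name': 'Kratos'}}
```

Because the log is the source of truth, any consumer that replays it arrives at the same state, which is exactly what makes it a useful integration primitive.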
I think that is a key thing that is making a big difference. The schema registry is basically how your data is defined in the Kafka world. In Oracle, we used to have whole table structures, the tables, the constraints, and all these things, right? So you could think of that as a schema registry. Where I’m going with all this is that we are pretty much ripping the database into smaller modules, and we have tools for them, and Kafka sits right in there. Which gives me a good advantage when I’m looking for newer tools, right? Because all these data processing modules are broken out, I can pick and choose which one fits where. Just take the example of the schema registry, which I thought was the only one of its kind in the market. In the last four years, we have like six tools: DataHub from LinkedIn, which actually manages all the schemas wherever the data is, and there is Nemo from Facebook. There’s a**************, which I really like, because they actually use a graph database on the backend, which I think is very neat. And so, yeah, that’s what changed my perspective on data management. And I give credit to Kafka for that.
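The schema registry role Naveen describes can be sketched as a toy, in-memory version: store versioned schemas per subject so producers and consumers agree on the shape of the data. The class and its methods here are a hypothetical illustration, not Confluent’s actual client API:

```python
# A minimal sketch of what a schema registry does: keep an ordered history
# of schema versions per "subject" (roughly, per topic or dataset).

class SchemaRegistry:
    def __init__(self):
        self._subjects = {}  # subject -> list of schema versions

    def register(self, subject, schema):
        """Append a new schema version for a subject; return its version number."""
        versions = self._subjects.setdefault(subject, [])
        versions.append(schema)
        return len(versions)  # versions are 1-based

    def latest(self, subject):
        """Fetch the most recent schema for a subject."""
        return self._subjects[subject][-1]

registry = SchemaRegistry()
v1 = registry.register("sales", {"fields": ["sku", "units"]})
v2 = registry.register("sales", {"fields": ["sku", "units", "region"]})

print(v1, v2, registry.latest("sales"))
# 1 2 {'fields': ['sku', 'units', 'region']}
```

Because anyone can query the registry, it doubles as a discovery surface, which is the point Naveen makes next about Informatica’s sources and targets.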
Sean Anderson 7:13
That’s really interesting, Naveen, and you pointed out some good things there that I wanted to drill into. So obviously, you came from the world of kind of monstrous legacy platforms like Informatica, which do everything soup to nuts across the data value chain, from master data management to governance to ETL. Now you’re looking at tools like Apache Kafka that are still enterprise but are open source and very modular, in the sense that Kafka does exactly what it does and doesn’t try to do all of the other things. Do you see that as kind of a shift? Or are they part and parcel of the same thing?
Naveen Siddereddy 8:02
Oh, I would say it’s a continuation, because I would think that Informatica was the first to actually take compute out of the database, right? They really were saying, don’t do the compute for me, Oracle, I know how to do it. So I think it’s a continuation. I think we are hitting the peak of that, breaking everything apart and having one specific tool to do each thing. To take another example, I think that Informatica even covered the schema registry, to be honest, because if you think about it, for all the data in the company, you were supposed to create a source and a target inside Informatica. Like a definition, right? That pretty much is a schema registry. It’s just that they didn’t envision that to be a separate module, really. Anybody could go and query those sources and targets for data discovery. So I think it’s a continuation from there.
Domain Driven Architectures
Dash Desai 8:57
That’s awesome. We talked a little bit about technologies and tools like Informatica. What are your thoughts on domain-driven architectures with regards to data engineering?
Naveen Siddereddy 9:08
Yeah, so domain driven, I think there are many words for this, but what I think of as domain driven is a decentralized way of working. The best analogy I have is a world map, right? The current world we live in. Think of all the countries as domains, and these domains are organic, in the sense that they might come and go, right? Finance and marketing are pretty standard domains in an enterprise, but you could have new domains come and go. So let me go back to the world map scenario. What I think of as domain driven is that all these domains have their own way of doing things. The US has its own legal system, you know, the way that things have been done, and take India, it has its own stuff. But all these are connected through actual events. I think that’s the key thing in this whole world of data processing frameworks. They are going to end up in this mode of domain-driven design, in the sense that all these domains do their own stuff and then they throw out events that the other domains can consume.
Going back to the analogy here: the COVID thing happened in China; that’s an event in one domain, and that event is thrown out. Any other domain could react to it, and they could ignore it too. Some of them ignored it, and that had an effect on that domain. Others reacted to it in a different way. In the enterprise world, it’s going to be the same thing. All these domains will have their own toolsets. Not like the old days, where there is one tool that is going to serve all the domains and everyone will use it. I think that’s not going to help, because each domain has its own assets, and they can deliver better value if they have their own tools. I think this whole domain-driven design will actually make tools available for that domain; the analysts will be empowered with the right tools. So coming to the data engineering part of it, obviously the analyst would want to have end-to-end control over it, like DataOps.
The tool needs to have a nice user experience. That’s where I think the industry is heading. Taking some examples, StreamSets is one where we have a great user experience to actually collect the data. And similarly, what I’ve seen in the Kafka world is that there is a tool called Landoop, where those guys have provided the user with an end-to-end framework. On the back end, there are so many moving parts, but that doesn’t matter, because open source and all these tools have made this into broken-down modules that you can plug in and plug out inside that user interface. So I think ultimately this domain-driven design is going to make analysts more powerful. And it may take out the IT department, to be honest, because data engineers and analysts will sit inside those domains, supporting purely that domain.
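The decoupling Naveen describes, domains emitting events that other domains may react to or ignore, is essentially publish/subscribe. A minimal sketch, with made-up domain names and event types (nothing here is from an actual PlayStation system):

```python
# Each domain publishes events without knowing who consumes them; other
# domains subscribe only to the events they care about.

from collections import defaultdict

subscribers = defaultdict(list)  # event type -> list of handler callbacks

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def publish(event_type, payload):
    """The emitting domain doesn't know (or care) who reacts to the event."""
    for handler in subscribers[event_type]:
        handler(payload)

reactions = []

# The finance domain reacts to release events; other domains simply
# never subscribe, which is how "ignoring" an event works here.
subscribe("game_released", lambda e: reactions.append(f"finance: re-forecast {e['title']}"))

publish("game_released", {"title": "Horizon"})
print(reactions)  # ['finance: re-forecast Horizon']
```

In practice the event bus would be something durable like Kafka rather than an in-process dictionary, but the coupling model is the same: publishers and subscribers only share the event contract.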
Dash Desai 12:28
Yeah, thanks. That makes sense. So does that also mean that each team is using the right tool for the right job? And not only that, but different teams have to work with each other, right? If the answer to that question is yes, and you can correct me if I’m wrong, then can you talk about the team dynamics in your organization? Like, how many senior data engineers do you have? How do you work with other teams, and things like that?
Naveen Siddereddy 12:58
The communication between domains is going to happen through events or data discovery tools. I think that will be at the enterprise level, if you’re talking about how these domains communicate. Inside the domain, we would have the data engineers, analysts, and actually the data scientists that support that business function. And to the next question: my team actually has two senior data engineers and one data scientist. We interact directly with the financial analysts. That’s how we are set up. When we communicate with other teams, it’s mostly very decoupled in our case, so we either get data as files themselves, CSV files, or we get it through a standard database.
Sean Anderson 13:57
Excellent. So you talk a lot here about distribution and decoupling. I’m curious, how does the collective team look at things like repeatability and creating artifacts to save developer time? You talked a lot about self-service and being able to do data discovery. But what mechanisms do you guys have in place to make sure that the best practices one team is learning get distributed out to the other teams?
Naveen Siddereddy 14:28
Yeah, that’s a challenge, obviously. But going back to the data discovery stuff, there are initiatives inside our company that actually make all the data sources available for discovery in that new application. I think Atlas is the one that we’re trying to get on board. It’s ultimately about putting all of them at a discovery level. There could be a schema registry or ****** way of doing things, showing all the lineage for the data; everything could be part of that discovery module. So that part is taken care of by Atlas, I think. But when it comes to the actual standards, I think it’s going to be with the domains, especially if you’re moving towards domain-driven designs. What they do inside one domain doesn’t really translate to helping another domain, because the needs are different and the tools are so fit for use and specific to that need. I think it’s up to the domain to decide what it wants to do. The key thing is the communication between each of these domains, and that should happen through data discovery tools, showing all the lineage and everything, alongside an event hub, which stores all the relevant events from all these domains in one place. I think that’s what would solve the problem.
Sean Anderson 16:04
Excellent. So it sounds like creating best practices is focused around empowering your teams to do that discovery, and then also reading all the schema clues. Would you characterize that as something unique about PlayStation? Or do you feel like that’s a model that the industry is going to have to adopt?
Naveen Siddereddy 16:28
No, I think it’s specific to PlayStation, I feel, because at previous companies I worked for, the model wasn’t like this. I was always leaning towards centralized models, saying that we can support every other use case. But at PlayStation, at least in my world of PlayStation studios, that’s not how it’s been. So I would say it’s specific to PlayStation. I’ll just add, Sean: empowering the user, I think you got that right. That’s what matters ultimately. And the way you empower them is actually through the user experience. Especially in the data engineering world, user experience means giving the business user trust that… okay, I know where this is coming from, and I can trust this guy. And that only happens through giving them a transparent view of the whole operation of how the data moves from the real source to where they can do the analysis.
DataOps at PlayStation
Sean Anderson 17:29
That’s great. I think that’s a good segue into the next question that I have. How do you see DataOps put into practice at Sony PlayStation?
Naveen Siddereddy 17:37
Yeah. I think the key to that is… the way I would use DataOps is to bring trust to the business user who’s actually using the data. Most of the time, their concerns are: Are we collecting the data correctly? Are we able to support the critical workflows they have? And how are we able to anticipate what is coming next, right? These are usually the common business user scenarios. That is where DataOps has played a big role in supporting all those things. Specific to some of the use cases that I’ve worked with, it actually helps them understand where the data came from. Some examples: I’ve used a couple of ETL tools, and I think some of them have a granular way of showing how the data moves, and some don’t. For example, StreamSets actually has a bunch of things that I think are very valuable, like data alerts, and metric-based alerts, you know, and then the statistics of how much data has been processed. So I think those are the ones that we really use in our current pipelines. I think they’ll add a lot more value going forward.
That One Weird Thing
Dash Desai 19:03
Yeah, that makes sense. So now we’re down to the final segment of the podcast, which is one of my favorites. It’s called That One Weird Thing. As we work on data engineering projects, big or small, we all run into these weird things, right? A good example that I can give you right now is working with different date formats across platforms, databases, sources and destinations. Get it? So what is something that stands out for you that you want to share with the audience? Another good example I’ll give you is timezones, right? So can you think of something that you can share?
Naveen Siddereddy 19:43
When we were talking about schemas, one debate I always had was: who is the schema owner? Is it the end user who’s consuming it? Who should define the schema? I don’t know why I think it’s weird. There are tools in AWS, like Glue, and other tools that use AI to decipher the schema. So that’s one way to do it. The other one is actually like the schema registry, where you have the owner defining the schema. I’d say it’s the developer who actually owns the schema. I’ve been struggling with that. I don’t know if it’s weird in the data engineering space, but that’s my take on the weirdness.
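Naveen’s first option, letting a tool decipher the schema from the data itself, can be sketched with a toy inference function. AWS Glue crawlers are far more sophisticated; this just illustrates the idea, and the sample records are made up:

```python
# Infer a rough schema (field -> type name) by sampling records. Fields
# that appear in only some records are still picked up, which is exactly
# the kind of drift a schema owner would otherwise have to declare by hand.

def infer_schema(records):
    """Infer a field -> type-name mapping from a sample of dict records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema

sample = [
    {"sku": "PS5-001", "units": 12},
    {"sku": "PS5-002", "units": 7, "region": "EU"},
]
print(infer_schema(sample))
# {'sku': 'str', 'units': 'int', 'region': 'str'}
```

The trade-off Naveen raises is visible even here: inference answers “what does the data look like?” but not “who promised it would stay that way?”, which is what an owner-defined registry gives you.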
Dash Desai 20:30
No, that’s absolutely valid. I can think of an example as well, where you have APIs, right? And you have a contract: people developing APIs versus people consuming APIs. I get it. Yeah, thanks a lot. That was a good unique one, for sure.
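Dash’s examples from the start of this segment, date formats and timezones, are easy to demonstrate. A minimal sketch, where the specific formats and the fixed -8 offset are illustrative:

```python
# The same instant, exported by two different systems in two different
# conventions. Normalizing both to UTC shows they agree.

from datetime import datetime, timezone, timedelta

us_style = "03/01/2021 14:30"            # month/day/year, US Pacific local time
iso_style = "2021-03-01T22:30:00+00:00"  # ISO 8601, already in UTC

pacific = timezone(timedelta(hours=-8))  # fixed PST offset for this example
parsed_us = datetime.strptime(us_style, "%m/%d/%Y %H:%M").replace(tzinfo=pacific)
parsed_iso = datetime.fromisoformat(iso_style)

# 14:30 at UTC-8 is 22:30 UTC, so the two timestamps are the same moment.
print(parsed_us.astimezone(timezone.utc) == parsed_iso)  # True
```

The weirdness starts when the exporting system omits the timezone entirely, or when “03/01” means March 1 on one platform and January 3 on another, which is why normalizing to one format and one zone at ingestion is a common DataOps habit.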
Sean Anderson 20:48
So Naveen, I really want to thank you for being our guest on Sources and Destinations this time. We really enjoyed your insight on how you operate as a team and how you evaluated data engineering tools, and obviously your experience brings a ton of value to the conversation. I hope everybody out there enjoys this week’s conversation. We’ll be back with Joseph Arriola from Yalo Chat, really this time, so please join us in two weeks for our next episode.
Join Us Next Time
We can’t thank Naveen enough for sharing his insights with us. Want to learn more about The Sources and Destinations Podcast? Follow us on SoundCloud or Spotify for future episodes. Do you want to be a guest on our podcast? Drop us a note!
Find out more about our guests!