About Data Story Series
Join The Data Story podcast, where two veterans of the Data & Analytics industry cut to the chase and bring you the most relevant technology trends transforming the industry. James Serra and Khalil Sheikh have helped transform several Fortune 100 companies into data-driven enterprises. This fortnightly podcast will equip you with the best practices, tools, and frameworks to help you spearhead your business insights journey. Stick around for each new topic discussion and subscribe to this channel.
Guests on this episode
James Serra is a Data Platform Architecture Lead at EY, and previously was a big data and data warehousing solution architect at Microsoft for seven years. He is a thought leader in the use and application of Big Data and advanced analytics, including solutions involving hybrid technologies of relational and non-relational data, Hadoop, MPP, IoT,
Khalil Sheikh is the Executive Vice President of Saxon Global. Under his leadership, Saxon is transforming from an IT Staffing and services organization into a new age digital transformation partner and a strong brand. Khalil has extensive experience in the IT services industry and in turning around businesses by promoting growth and profitability.
Season: 01 Episode: 02
You can also think of data as a product: each of those domains follows a contract and makes its data available. The big change, too, is that the data engineering teams no longer sit in central IT, doing everything and becoming the bottleneck.
You can think of it as a mini IT department in each of the domains. The skill set is transferred from central IT to each business domain, because they are the ones now taking that operational and analytical data and following the contract to make it available to all the other domains. So each of those business domains has the engineering skills to do all that. The idea with data mesh is that it is more scalable: instead of having IT do everything, you have a lot more people scaling out that capacity to prevent the bottleneck. The argument for the data mesh is that a lot of companies are struggling to collect all this big data, so let's make the business owners, who know the data better than central IT, the gatekeepers of that data and have them make it presentable for everybody else to consume. Those are the key points that differentiate data mesh. And I didn't talk about technology, because data mesh is more of a process change.
Khalil Sheikh: [2:13] Thank you, James. And what about governance? If you're going to give domain ownership to groups like supply chain, finance, HR, sales, and customer service, yes, it's easier for them to own the data because they are closest to it. But how do you make sure there is federated computational governance? What are the good practices there? How do you manage that?
James Serra: [2:42] Yeah, the idea is that you have what data mesh calls data infrastructure as a platform, and that you are centralizing the contract or the guidelines for data governance. So you are telling each of those domains: hey, we want you to be part of the data mesh, and to do that, you need to follow these governance rules that we have. Now, that could be just a simple set of guidelines. Or it could be: here's the code that you should use to clean the data. Instead of everybody inventing their own solutions, let's give them the code they should use and the procedures for it. Now, this is very challenging, because each of the domains may be using different technologies, so a set of code that works for one may not work for the others. But we're going to have to have some central group create the guidelines. Think of something as simple as this: I'm in HR, and we abbreviate states, like TX, while another domain says it will use the full state name. You need to have some conformity, so somebody in IT will have to come up with the guidelines to follow. This is what we mean by federated data governance. But that's very challenging, because everybody has their own way of doing things. This is one of the concerns about the data mesh: in a lot of cases you're going to have to ask all the domains to change the way they're doing things, and it's going to take them more time and effort. So how are you going to convince them? What's the incentive? It may be that you're doing this for the greater good and that you're going to get more insights from the data, but that may not be enough to persuade people to take on additional work. It's a different approach from before, when central IT said: give me all the data, then we'll clean it.
And we'll do it according to one standard. Now you're pushing those things out to the domains. And there's no real technology to do that yet, where you can just open up a box of data governance and say, hey, follow all of this. I'm hoping that day comes, but I think we're a long way from it.
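A minimal sketch of the kind of shared cleaning code described in this answer, using the state-abbreviation example. Everything here is an assumption for illustration: the function name `normalize_state`, the small lookup table, and the sample HR and sales records are hypothetical and not taken from any real governance toolkit.

```python
# Hypothetical shared governance rule: every domain publishes US states
# as two-letter abbreviations, whatever form it stores internally.
STATE_ABBREVIATIONS = {
    "texas": "TX",
    "california": "CA",
    "new york": "NY",
}


def normalize_state(value: str) -> str:
    """Return the two-letter abbreviation for a state given in either form."""
    cleaned = value.strip()
    # Already an abbreviation (in any letter case): just normalize the case.
    if cleaned.upper() in STATE_ABBREVIATIONS.values():
        return cleaned.upper()
    # Full state name: look it up case-insensitively.
    key = cleaned.lower()
    if key in STATE_ABBREVIATIONS:
        return STATE_ABBREVIATIONS[key]
    raise ValueError(f"Unrecognized state value: {value!r}")


# Two domains that store the same fact differently (illustrative data).
hr_records = [{"employee": "A", "state": "TX"}]
sales_records = [{"customer": "B", "state": "Texas"}]

# Applying the shared rule makes both domains conform before publishing.
for record in hr_records + sales_records:
    record["state"] = normalize_state(record["state"])

print(hr_records[0]["state"], sales_records[0]["state"])
```

Raising an error on unknown values, rather than passing them through, is one possible design choice: it forces each domain to fix non-conforming data before it is published to others.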
Khalil Sheikh: [4:55] Great, thank you, James. So, as you know, most people think of data mesh as a self-serve data platform. But there is complexity in setting up the data products: the schema, lineage, compute, and locality of the data. Conceptually, it looks like it can create a lot of differentiation for enterprises. But how difficult is it to set up a self-serve data platform, considering all of the above: locality, lineage, and the rest?
James Serra: [5:33] Yeah. When I talk to people about data mesh, it's a very popular topic in various user groups and presentations, and certainly at EY, where I work. There's a very large technology consulting group there, and we're constantly having conversations with that group and their customers about the data mesh. There is a lot of time involved in building a data mesh, much more than people think. It's not just the technology; it's a cultural change. And there is no one way to build out a data mesh. Part of that, to answer your question, is: we want to keep track of lineage. So again, you have to go to each of the domains and come up with a contract that says, you will keep track of the data lineage, and maybe we have some technology that will help you. But we need to make sure that when you come out and say, here is the data we have for HR, as an example, people can not only access the data, but if they ask where a particular field came from, you in HR should be able to show a lineage: where the source was, maybe some ERP system on their end, and all the steps it took to be cleaned and arrive at that presentable, consumable spot. Now, this is where I see a lot of variation. The data mesh in theory says this governance will be pushed out to all the individual domains. But then I find some companies that centralise certain things, compared to the way the data mesh says every domain takes care of it. What they will say is: we're going to track all the data lineage centrally. Maybe we're going to use scanning technology to build the data lineage, and we're going to centralise that. Maybe we're going to centralise where all the data is stored.
That makes it easier to do the lineage. Maybe we're going to set up a centralised data lake, divided up by folders for each of the domains, instead of having each domain keep its own data; that makes it easier to govern the data and to tell the lineage of the data. So there's usually some balance, and because every company is so different, a solution that works great for one may not work great for others. We're seeing a lot of exceptions to what I would call the traditional data mesh. And certainly a lot depends on the skill set of the people and the technology they're currently using. Have they done this kind of data setup before, or is this a brand new concept to them? Those are going to be challenges to look at when you're trying to decide whether to do it.
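A minimal sketch of the field-level lineage described in this answer, where a domain such as HR can show which source a field came from and every step it went through on the way to the consumable layer. The class name `FieldLineage`, the field `annual_salary`, the source name `hr_erp`, and the step descriptions are all hypothetical; a real lineage or scanning tool would capture this automatically rather than by hand.

```python
from dataclasses import dataclass, field


@dataclass
class FieldLineage:
    """Records where one published field came from and how it was transformed."""

    field_name: str
    source_system: str
    steps: list[str] = field(default_factory=list)

    def add_step(self, description: str) -> None:
        """Append one transformation step, in the order it was applied."""
        self.steps.append(description)

    def describe(self) -> str:
        """Render the full path from source system to the published field."""
        path = " -> ".join([self.source_system, *self.steps])
        return f"{self.field_name}: {path}"


# Illustrative lineage for one HR field (all names are assumptions).
salary = FieldLineage(field_name="annual_salary", source_system="hr_erp")
salary.add_step("raw extract")
salary.add_step("currency normalized")
salary.add_step("published to presentable layer")

print(salary.describe())
```

Whether each domain keeps records like this itself or a central team collects them is exactly the balance discussed above; the record structure can be the same either way.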
Khalil Sheikh: [8:29] James, what do you see across industries in modern enterprises today? What percentage are adopting this concept of data mesh and enabling their functional groups, like supply chain, manufacturing, sales, and marketing? What is the adoption rate at this point in time, based on your experience?
James Serra: [8:55] It's like opening a can of worms. It's interesting: there are a couple dozen posted videos you can go and look at of companies that have implemented data mesh at an early stage. Most of them have been doing it for over a year. Some have been doing it for two years, even before data mesh became an official buzzword. I've seen a lot of banks, like Saxo Bank and JP Morgan; they all have videos out there where you can see what they've done, and there are other big companies like Intuit and HelloFresh, and DBG Media is another one. What they have in common is that they have a lot of big pain points and a culture that can change and is willing to go down the data mesh route. Some of these companies have 50 or 100 domains, and it's a challenge just to figure out what the domains are, because one domain may be made up of multiple groups within your company. You have to go out to all these domains and get them to buy into the data mesh. All it takes is one to say no, and then you have a group that's outside of your data mesh, in isolation, and that can be a big problem. So those videos are good; you can see how they're all trying to solve that. But there's a lot of variation in the technical implementations, in their understanding of the data mesh, and in how they handle things like master data management and data cleaning. It's good that they are talking about this, but the one thing I take away from all of it is: wow, this is a lot of work. You really have to have a number of pain points to go down this route, and you have to have a company culture open to this change. There are other things I can talk about that I think are key to a successful data mesh.
But there are not a lot of companies out there that have successfully implemented one. I would say there are probably a few dozen, and even the data mesh experts will say that this is very early stage and kind of experimental. Don't go in thinking that this is something that is definitely going to work. We're really early in this process, so that's a big caution.
Khalil Sheikh: [11:54] Yeah, absolutely. Thank you, James. The way I see it, if you can enable this, there is a lot of value in self-service BI across different departments, and they don't have to depend on IT as much as they do today. There's also a lot of confusion about data mesh versus data fabric versus data lakehouse. How do you differentiate them? What are the major differences between these three terms?
James Serra: [12:10] Yeah, I try to come up with common definitions of what all those are. And even though some of those terms have been around a little longer, there's still a lot of confusion. I know Gartner calls data mesh and data fabric the same thing, but I would say they're different. I think we know what a data warehouse is. When we go to the next step, a modern data warehouse, what is modern? Well, it usually means it's in the cloud, and it usually means you can handle data no matter its size, type, and speed. It can be real time or batch; it can be semi-structured, like JSON or CSV files; and it can be a bunch of small files or really large files. That's what I think makes it modern. When we get to the data fabric, I think it's kind of a glorified modern data warehouse that adds a bunch of components to it. Now we're talking about master data management; we're talking about maybe having API calls to access the data; we're talking about maybe breaking the data fabric up into building blocks you can reuse; and maybe we build in extra security, like ABAC and not just RBAC. So I feel data fabric is just a data warehouse handled at a much grander scale, but still centralised, and that is the big difference between it and the data mesh. Where it gets blurry is: well, can I use a data fabric with something like virtualization to access the data? Even in that case, you may say: we're going to keep data in China, for example, and we're not going to centralise it because we're not allowed to, so we'll virtualize it and pull it when it's needed. But you're not going down the path of the data mesh, where you're talking about ownership. You still might say IT owns that data, even though you can't centralise it. So the data fabric is not so much concerned with who owns the data.
The idea there is that IT owns the data, whereas in a data mesh the domains own it, and they have a contract for how they will present that data. So there's a big jump in going from the data fabric to the data mesh. The big challenge, and some people will say a source of more confusion, is the claim that a data fabric is the technological solution for a data mesh, because data mesh is not a technology; it's a theory. OK, that's something you can possibly say, but I would say you're still missing that differentiation of domain ownership. Where the technology is really challenged is this: I have all this data sitting in a data mesh, it's not centralised, and I want to bring it together. How do I combine it all? I can use virtualization software. Or you may see some of the newer things being talked about with the data mesh: other domains that combine that data. Which is interesting, because now you're talking about copying the data over into other domains, which kind of goes against what they first said about the data mesh. But now they're coming up with these other solutions, because it's such a challenge: there are 100 domains out there, and somebody says, I want to combine data from 10 different domains. Virtualization software could help, but it could really affect performance. So a better idea is: let's pull this data out, combine and aggregate it, and then make that another domain that's available to others. That's maybe one difference between the original data mesh and where it's heading, because they keep running into these technology challenges and can't overcome them. So you have to think of different approaches, like creating other domains that are aggregations of the data, or that present the data in a consumable format that is much different from the source format.
This is where you're seeing multiple iterations of copies of the data in the data mesh.
Khalil Sheikh: [16:40] Thank you, James. And where do you see Databricks fitting into all that? I'm even hearing about Databricks and Talend. Because ultimately it is about building a platform that is self-serve, right? So where do you see Databricks fitting in, and who are the major players besides Databricks?
James Serra: [16:59] Yeah, one of the things I warn customers about when they're looking at data mesh is: don't listen to all the vendors. What I'm finding is that a lot of data vendors are now saying, hey, we support the data mesh, use our product, it's almost like a data mesh in a box. That's not what they should be saying, because data mesh is a business concept, an idea, and the technology is a small part of it. You can't create a data mesh with technology alone; you have to have that cultural change and that domain ownership. Now, each of those vendors could have technology that could be used in a data mesh. You could look at something like Databricks, which has a data lakehouse component, the idea being that you don't have to use both a data lake and a relational database.
James Serra: [17:57] With a data lakehouse, you put those relational database attributes inside a data lake. They do that with Delta Lake, to give you, say, ACID compliance, better performance, and time travel, so you can see changes in the data and do things you normally would in a relational database. So in some use cases you can say: we don't need a relational database; put it in a data lakehouse. Now, how does that fit within a data mesh? Well, you could say each of the individual domains will have its own data lakehouse, and that's what it will use to store all its data, clean it, and then present it. That satisfies the contract that says, make your data available. For example, they can say: we'll have different layers in the data lakehouse, raw, clean, and presentable, and the presentable layer will be what others use when they need to access the data. And we don't have to have a relational database in that solution.
But there are, and I have talked about this previously, issues with having just a data lakehouse. You'll never get the performance you get out of a relational database. Maybe it's acceptable, but you will not get millisecond response times. You won't have certain security features, like row-level security, and you won't have certain other features that have been available in relational databases for many years. There is additional complexity when you try to make a file-and-folder structure relational, because the metadata is separate from the data, and that can be very challenging for people to access. Maybe you can solve that by saying we'll have API calls on top of all of that in our data mesh. Great, go for that. But there are not a lot of examples of great solutions people have already built that way. Time will tell as we see people build these out; maybe we'll start seeing some great solutions that get documented, and then you can say, OK, this is the way to go, this is the blueprint for another company to build a data mesh. Then, when you look at Microsoft, they have Synapse, and Synapse can be its own data lakehouse, because you can do a lot within Synapse with data sitting in a data lake: you can use T-SQL against that data and query it with a serverless component, so you only pay per query. You could also build a data lakehouse within Synapse and have a relational database component as an option as part of it. This is where it's a grey area: does a data lakehouse mean no relational database? With Databricks, yes.
Microsoft is more like: well, no, it's about having one source, and that's where Synapse has its single pane of glass over all this data. If you go to the Synapse dashboard, you can work with data sitting in a data lake or in a relational database, and you can very easily say, I want to move this into a relational database, using the same T-SQL. So it becomes a federated option. It's a bit of a grey area, and Microsoft is coming up with its own data mesh blueprints for building a data mesh using Synapse.
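A minimal sketch of the raw, clean, and presentable layers mentioned earlier in this answer, laid out as folders per domain in a data lake. The root `/datalake`, the domain `hr`, the dataset `employees`, and both function names are hypothetical; the sketch only builds path names and makes no claim about how Databricks or Synapse organise storage.

```python
from pathlib import PurePosixPath

# The three layers named in the discussion; only the last one is meant
# to be read by other domains.
LAYERS = ("raw", "clean", "presentable")


def layer_path(lake_root: str, domain: str, layer: str, dataset: str) -> PurePosixPath:
    """Build the folder path for one dataset in one layer of one domain."""
    if layer not in LAYERS:
        raise ValueError(f"Unknown layer: {layer!r}")
    return PurePosixPath(lake_root) / domain / layer / dataset


def published_path(lake_root: str, domain: str, dataset: str) -> PurePosixPath:
    """Path that other domains are pointed at: always the presentable layer."""
    return layer_path(lake_root, domain, "presentable", dataset)


# Illustrative use for an HR domain (all names are assumptions).
hr_raw = layer_path("/datalake", "hr", "raw", "employees")
hr_public = published_path("/datalake", "hr", "employees")

print(hr_raw)
print(hr_public)
```

Keeping the published location behind one function is a small way of expressing the contract: consumers never need to know about a domain's raw or clean folders.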
Khalil Sheikh: [21:14] So, for lack of a better term, you mentioned data mesh in a box, say for supply chain versus manufacturing. Is Microsoft going in that direction? You have Synapse, which gives you the raw structure; you have the data lakehouse; you have an EDW; you have analytics on top of it. Are they going to go beyond the blueprint and build a sort of box for, say, marketing, supply chain, manufacturing, or customer service? Are they going to bring in that concept of process, compliance, and governance and say, this is for marketing data, and this is how it gets consumed by different audiences? Are they going to take it beyond a process, or is it going to remain a process?
James Serra: [22:00] I think it's going to evolve into some tools that will help automate it. I'm no longer at Microsoft, so this is just speaking from what I've observed. The idea is: here's our idea of a data mesh, and they're going to call it something different. These are the guidelines and the way you want to think of a data mesh as far as organisational change. Then their idea may be that you have a centralised data lake divided by folders, kind of Microsoft's take on it, if you will. And then on top of all this they'll have: here is some code you can run that will create this data lake for you. It'll ask you some questions, and then it will say, all right, we've now fired up all the components you can use to start building your data mesh. I think they'll go more and more down that line of trying to automate it a bit, because a data mesh may have a data lake, it may have Synapse, it may have Data Factory and all the other components. If they can make it easier to fire this all up so you can get started quicker, great. But every data mesh solution is going to have a lot of its own customisation, and this is why you can't say there's a product out of the box. We're going to have to figure out a way to put these products together and make them specific to our use case. The challenge with customers is that they have all these domains, and they may all use different technologies. This is what I've seen a lot recently with customers. Say I'm a beer distributor, and I just bought all these individual plants (I don't know if that's the right word), and they're all over the world, all doing their own thing with different technologies.
If I want to create a data mesh, I have to work within all of those different technologies. Maybe the goal down the road will be for all of them to use one product, like Synapse, but it's going to take a long time to get there. In the meantime, we need to get results and pull things together. So can we create some commonality between them and tell them: make the data presentable; I don't care what technologies you use, but this is the format, maybe create some APIs on top. But they're all doing it with different technologies. Then maybe we can say we will centralise some of that data. That's what we were talking about before with the different kinds of domains. I forgot for a moment what they call them, but there is source-aligned domain data. Then the idea is that I can create aggregate domain data, which takes multiple domains and aggregates them together, and maybe that is one unified product we use for that. And then we also have consumer-aligned domain data, where we take all this data and make it presentable in a way that others can consume. That's the detail level. That layer can use the same technology, but you're going to have trouble with all the different source domains having their own technologies.
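A minimal sketch of the aggregate domain idea described in this answer: the presentable outputs of several source-aligned domains are pulled out, combined, and published as a new copy for others to consume. The two domains, the `plant` key, the sample figures, and the function `build_aggregate_domain` are hypothetical and chosen only to make the idea concrete.

```python
from collections import defaultdict

# Hypothetical presentable outputs of two source-aligned domains.
sales_domain = [
    {"plant": "Austin", "revenue": 120.0},
    {"plant": "Leeds", "revenue": 80.0},
]
supply_chain_domain = [
    {"plant": "Austin", "shipments": 12},
    {"plant": "Leeds", "shipments": 7},
]


def build_aggregate_domain(*domain_outputs):
    """Merge records from several domains by plant into one materialized copy."""
    combined = defaultdict(dict)
    for records in domain_outputs:
        for record in records:
            # Copy each domain's fields into the combined row for that plant;
            # the source records themselves are left unchanged.
            combined[record["plant"]].update(record)
    return sorted(combined.values(), key=lambda row: row["plant"])


# The result is a new data product, not a live view over the sources.
plant_overview = build_aggregate_domain(sales_domain, supply_chain_domain)

for row in plant_overview:
    print(row)
```

Because the aggregate is a copy, it trades freshness for query performance, which is the trade-off against virtualization raised earlier in the conversation.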
Khalil Sheikh: [25:37] Right, thank you, James. And how important is the Data Vault 2.0 architecture as an organising concept for data mesh? There's so much going on in that space. Ultimately, if it's not an out-of-the-box solution, we've got to come up with a logical, abstract architecture. What does that mean to you?
James Serra: [25:53] Yeah, you can look at data vault as more of a way of tracking changes in the data. So for each of those domains, that could be part of the contract in a data mesh: you will track all the changes to the data, and we would love for you to use data vault to do that. Can you get their buy-in for that? Well, hopefully. But the challenge with data vault, which is a great concept that's been around for a while, is that not many people are using it, so they may not have the skill set to use it. And it may not be needed in every domain; maybe they don't need that sophistication. But it can be a tool in the toolbox that central IT can take out to help other domains get to the point where they can create a data vault. To me, this is where the data mesh has these problems: all your domains are going to be doing it differently, so how do you get them all to agree on the same set of standards? That's the real challenge. In the end, if you tell them, whatever you're doing, you need to improve it, you need to follow this contract, and you need to make this data available a certain way to others, a lot of domains will say: why should I do that? All I care about is my domain. If you want to use the data for others, fine, take a copy of it, but I'm not going to do that. That's when we have to get the buy-in. To me, that's the biggest challenge of data mesh: telling them to do it this different way. It's extra work. Again, what's their incentive? Hopefully you convince them it's for the greater good, and that combining their data with others' will give them insights they don't have. But they may be too busy for that. And they may say, we're not going to wait a year or two for data mesh to come into play; we need results now. So how do you handle that?
That'll be a challenge.
Khalil Sheikh: [27:53] Right. So, James, it seems like there's a lot of complexity to it: complexity from the perspective of tools, compliance, governance, and ownership. So what size of organisation should adopt it, and what size should not? In our world, we have a lot of midsize customers as well, besides the Fortune 100 customers. How do we find the right audience to have this discussion with? It seems like it's only suited to very large organisations that have the governance and process maturity, as well as the internal buy-in, to say, OK, I'm going to have this self-serve architecture available to me. What have you seen in your experience? Who is the right audience for it? A midsize organisation may not have the tools and skills required to implement this across domains; it's a lot of work. Where do you see this concept being useful? Is it still viable for midsize organisations? If so, why and how?
James Serra: [29:00] I'll say right off the bat: you have to be among maybe 1% of companies for data mesh to be helpful. This is not for small or midsize companies. It's for really, really large companies that have current pain points. I've seen this most often with really large banks that are struggling to collect all this data and make better business decisions with it. Things may be a bit of a mess now, and that makes it easier to get buy-in, because each of the domains is saying, yeah, this is a nightmare. So if you don't have those pain points, or if you're a new company,
I would say data mesh may not be the best route to take. To start off with, you can go down the modern data warehouse route, though that's probably not going to solve everything you need. But say you're an established company and you're having all these issues.
You're seeing, for example, that IT has backlogs of months and isn't able to ingest other datasets because it can't scale the people and the technology, and you're saying: look, no matter what, we have to get to new technology here, because what we've chosen is not scaling and we can't get the people. You have to have a long list of problems already, I believe, to look at data mesh as a solution. I have two problems with the way people say the data mesh is coming to the rescue. Number one, it's going to take a lot longer to build a data mesh than a modern data warehouse because of that culture change. So it's not a quick fix; it's a slower fix. Number two, the people pushing the data mesh say: look, the current solutions are not scaling, and all these companies are having their projects fail because the technology is not there. Well, I strongly disagree with that. The projects I've seen fail did not fail because of the technology; it was the people and the process. And they'll say data mesh is going to help solve that. OK, good. But you're solving it in a way where the technology is not there yet. So how is this going to reduce failures if you don't have the technology at this point, and nobody can even define a data mesh consistently? The other thing is, when people say all these big data solutions are failing or not scaling, I disagree with them. I was at Microsoft for seven years; we introduced these technologies, massively parallel processing and now Synapse. While I was at Microsoft, I dealt with many companies that had petabytes of data and were building full solutions. A great example is Microsoft itself: I've seen their internal solution using a data lake and Synapse to handle petabytes of data. You can imagine the data they get through Xbox.
So it's being done. The technology is not stalled or falling behind; it's there. It's about using the right technology for your use cases and having experienced people to build it out. So don't think that the data mesh is going to be able to scale where current solutions can't. You can use current solutions, but data mesh could be, in a small number of use cases, a better solution. That's if you have a lot of domains that are working on their own now and having pain points with scaling people and technology and relying on IT too much. If I start seeing those, that's when I may say a data mesh could be a solution. But would your company be able to change? Do you have somebody who can drive all that change? I wouldn't want to take that on across 100 domains. But some companies have done that, and some have been successful with it. There are just not a lot of use cases so far.
Khalil Sheikh: [33:21] Thanks. So ultimately, the complexity of the data sets and the multiple data sources may drive the need for data mesh. Take an industry like healthcare, which is highly governed. You have hospitals, nursing care, hospices, formularies, pharmacy integration, care coordination, and IoT. There are a lot of different data sets, including compliance data sets that may come from the federal government, like nanda and Macmillan guidelines and others. Which one or two industries do you see benefiting from this concept? If they have some kind of governance and contract available, as you talked about, then cross-pollination also has to happen. Data is not in silos, where the pharmacy uses a set of data and is done with it. How do I use that data to identify, say, which of my patients are being prescribed very expensive drugs? So which industry do you see, healthcare versus financial versus any other, being the first adopter of this very loose but highly optimised system of choice? From a self-serve data architecture perspective, who do you see as the first one or two adopters? It seems like there are still a lot of conceptual implementations; the platform and the tools may be there, but the assembly of those tools to deliver the right value may not exist yet. So who do you see as the early adopters?
James Serra: [35:12] So far, as I've mentioned, it's been banks and financial institutions, and I can't think of having seen any healthcare companies. Now, I think your point was that healthcare could probably be the easiest fit and have the most need for a data mesh, which I would agree with. I think you haven't seen it in healthcare because healthcare is usually not on the leading edge like banks are. Healthcare companies are usually more conservative and a little behind on new technologies, because they have a lot of statutes, rules, laws, and HIPAA requirements. So I think data mesh could be really helpful to the healthcare industry, more than any other, but they're just going to be a little more hesitant about getting started. But if you look at the requirements, it could be easier to build a data mesh there, because all these particular domains are probably already following all these guidelines. There may be a lot of similarities, because they can get into a lot of trouble if they don't follow the HIPAA requirements and such. So maybe there's a lot more conformity already between those domains. Maybe you can go to those domains and say: the incentive for you is that you're going to make sure you're following all the laws, governance, and requirements of the healthcare industry, and we will help you do that. This is where I can see you getting better buy-in from the healthcare industry than from others. And then maybe in security, you can say: we're going to give you requirements and blueprints, and maybe some tools, to make your data more secure. And they're going to say, that's great, we're willing to put in extra time and effort to follow those requirements. So I would say healthcare could be the most promising for data mesh. But I can't think of any healthcare companies I've seen that have gone down that route, though I've talked to a couple that are very early in investigating it.
To your point, there's more value to get out of data, and more benefit and return on investment, in healthcare than anywhere else I can think of. Imagine the cost savings you can get from a machine learning model that reduces readmissions, or helps with the type of treatment you're going to get, or helps detect claim fraud, which is all stuff where I've seen healthcare companies save millions of dollars a month. So if data mesh is going to make those solutions more valuable, then yeah, let's go down that data mesh route. It's going to be interesting to see how this all turns out, because I think we're really, really early in this: in having full implementations, in customers creating technology that can be used for data mesh, and in having vendors come up with solutions for, say, unified data governance that are not just taking what they have and rebranding it, but actually building something that could be used in a data mesh.