Data Lakehouse - Debunking The Hype | The Second Episode
Podcast

The Data Story

Data Lakehouse - Debunking the hype

Season: 01 Episode: 02


About Data Story Series

Join The Data Story podcast, where two veterans of the Data & Analytics industry cut to the chase and bring you the most relevant technology trends transforming the industry. James Serra and Khalil Sheikh have helped transform several Fortune 100 enterprises into data-driven enterprises. This fortnightly podcast will equip you with the best practices, tools, and frameworks available to help you spearhead your business insights journey. Stick around for each new topic discussion and subscribe to this channel.

Guests on this episode

James Serra

James Serra is a Data Platform Architecture Lead at EY, and previously was a big data and data warehousing solution architect at Microsoft for seven years. He is a thought leader in the use and application of Big Data and advanced analytics, including solutions involving hybrid technologies of relational and non-relational data, Hadoop, MPP, IoT,

Khalil Sheikh

Khalil Sheikh is the Executive Vice President of Saxon Global. Under his leadership, Saxon is transforming from an IT Staffing and services organization into a new age digital transformation partner and a strong brand. Khalil has extensive experience in the IT services industry and in turning around businesses by promoting growth and profitability.

EPISODE TRANSCRIPT

Khalil Sheikh: [0:00] Okay, so let's get started with understanding what a data lakehouse is. So, my first question, James, is: what are the major differences between a data mart, an EDW, and a data lake, and how would an organization make a choice?

James Serra: [0:28] Sure. There's no one clear-cut definition of all these. I'm giving my definition, which conforms with, I would say, most people's understanding, but there could be opposing views to some of the things I'm going to talk about today. When we think of a data lake, it's schema-on-read: I can dump all my data in there. It's a glorified file folder. The idea is that it's just storage, and I can put compute on top of that. Once the data is in a data lake, I can then transform it, query it, report on it, and so on. When we talk about an enterprise data warehouse, we're talking about a relational database. So, it could be SQL Server, SQL Database, Oracle, anything that puts the data into third normal form and/or a star schema. It's all relational, in that the tables are related to one another. And in most cases, like I talked about in the last episode, that enterprise data warehouse would have both data analytics and a relational database as part of it. If you look at a data mart, usually that means you're taking that relational database and breaking it into smaller components, where each data mart covers a particular subject, like HR or finance. So the data warehouse is all the subjects, and a data mart is an individual subject. Maybe that's for performance reasons, or for somebody who wants to modify the data without touching the main data warehouse. Data lakehouse is a fairly new term; I think it first came out a couple of years ago. It takes a data lake and a data warehouse and combines them into one, hence the term data lakehouse. The idea is that I can have one storage location where I have all my data, and I don't need a separate data lake and data warehouse.
Now, if you look at the way Databricks defines it, the lakehouse adds a component on top of the data lake called Delta Lake that gives you additional features that make it relational in nature, as far as feature support. So, Delta Lake has things like ACID transactions; time travel, so you can have data versioning for rollbacks and audit trails; schema enforcement; support for upserts and inserts; and performance improvements. So, the idea is one location to do everything when it comes to data warehousing. It's not for OLTP; rather, I want to ingest all this data and make decisions by querying that data. The concern I have is that the data lakehouse reminds me of 10 years ago, when data lakes first came out. The idea was, let's just put everything in it, and that's all we need; we don't need a relational database anymore. And that failed miserably back then. I have a lot of horror stories of customers that tried to do that when I was working at Microsoft. I always said, you need both a data lake and a relational data warehouse. Even with this concept of the data lakehouse, where you're adding additional features, I still think in most cases you'll need a relational database. There are some use cases now where I can see that you can get away with just a data lakehouse, but there are certain things you have to sacrifice to do that, which I can talk about. A little bit of a variation is when we look at Microsoft. They have Azure Synapse Analytics, and you could think of that as a data lakehouse in that, behind the scenes in that product, you can have a data lake, and you can use regular T-SQL over that data and, in some use cases, have just that data lake as your single source of data. But also in Synapse, it's easy to copy that data to a relational database and use that same T-SQL over that data.
So, you can think that in Microsoft's world, the definition of a data lakehouse is a little bit broader: it's a data lake and a relational database, but you have an interface over both of them that makes it very easy to bounce between the two. So that's kind of the way I look at it. And in most conversations I have with customers, they view it in a similar way.
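
Editor's sketch (not part of the recording): the schema-on-read versus relational distinction James describes can be illustrated in a few lines of Python. Everything here is an assumption made for illustration: the file contents, the `employee` table, and the function names are hypothetical, and SQLite merely stands in for a relational warehouse. The point is only where the bad value gets caught: at query time for the lake-style file, at load time for the table with a declared schema.

```python
import csv
import io
import sqlite3

# Hypothetical raw file as it might land in a data lake: no schema is declared.
RAW_FILE = "emp_id,dept,salary\n1,HR,52000\n2,Finance,not_available\n"


def read_with_schema(raw_text):
    """Schema-on-read: types are applied only when the file is queried."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        try:
            salary = int(record["salary"])
        except ValueError:
            salary = None  # the bad value surfaces at read time, not at load time
        rows.append((int(record["emp_id"]), record["dept"], salary))
    return rows


def load_with_schema(rows):
    """Schema-on-write: the table definition rejects bad rows at load time."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee ("
        "emp_id INTEGER PRIMARY KEY, dept TEXT NOT NULL, salary INTEGER NOT NULL)"
    )
    rejected = []
    for row in rows:
        try:
            conn.execute("INSERT INTO employee VALUES (?, ?, ?)", row)
        except sqlite3.IntegrityError:
            rejected.append(row)  # NOT NULL constraint refused the row
    conn.commit()
    return conn, rejected


lake_rows = read_with_schema(RAW_FILE)
warehouse, rejected_rows = load_with_schema(lake_rows)
accepted = warehouse.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(accepted, rejected_rows)
```

The lake side accepts the file as-is and leaves interpretation to whoever reads it; the relational side pushes that work to load time, which is the trade-off discussed in the answer above.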

Khalil Sheikh: [5:18] James, the processing time, right, when it comes to data warehouse versus data lake is very slight. Can you talk about that? What is the benefit of a data lake in terms of processing, the advantages of using a data lake with respect to processing time or positioning of a skill? How do you see that? Because that's a major decision point, right? Because the data has to be available and pre-structured for a data warehouse versus a data lake, right? Can you elaborate on what the benefit is there?

James Serra: [6:02] Yeah, there are many reasons to have a data lake; I could probably spend a whole hour just talking about that. One of the big benefits is that you can think of the data lake as the staging area: what was normally in a relational database is now separated out and put into a data lake. You can have any type of compute on top of that, and if you're in the cloud, you can scale up and have unlimited compute. So, you can quickly transform, clean, master, and do all those things to that data in a data lake and not affect your relational database. Users can continue to use that relational database without a maintenance window where you kick them off in order to ingest data and do all the transformations. And schema-on-read allows you to very quickly put data in there, and someone who's skilled at querying data in a file folder format can get value out of that data right away. The challenge is that they have to have some advanced skill sets, because the metadata is not stored along with the data in the data lake, like it is in a relational database. But if they can get through that, and maybe use some technologies that are a little more advanced, then they can quickly get value out of that data, or data scientists can use machine learning on top of that data. So, you have that other benefit of being quick to get results. You could also think of the data lake as a way to hoard data: I can put as much data as I want in there, the cost is very cheap now, and I can keep it forever. I can decide, maybe I'll need this data, maybe I won't, but let's put it in there and keep it and decide later. That's something you wouldn't want to do in a relational database, because it could be costly, and it can affect the performance of the database. So those are just a couple of the big benefits of a data lake.
And then you can say, well, if there are all those benefits, and I put something like Delta Lake on top of that, do I need a relational database? There are a couple of things I'll point out that may be challenging if you just have a data lake. One is the speed. You'll never get the query performance of a relational database from a data lake, because the data lake doesn't have things like statistics and query plans, and there's also caching and materialized views, and on and on. There are exceptions, and in some cases you can get some of those things within a data lake, but it's usually costly, and there's a lot of complexity to it. Relational databases have been around forever, and they're really tuned to give you maximum performance, millisecond performance. So, if you need that millisecond response, you're just not going to get it in a data lake, especially when you think of technologies like massively parallel processing, where I can take the data and split it out, and I can have things like replicated tables and distributed tables. So, I can get millisecond response time, or maybe a few seconds, when I'm querying billions of rows of data, something that just can't be done in a data lake. Another is security. With a file folder structure, you have security at the file level and the folder level. Well, that can be problematic if you have a file that's got a lot of rows in it and maybe isn't split by department. I want people to only access the departments they can see; well, it's all or nothing with a data lake. So, you'd have to go the extra step of splitting that file out into multiple files, and that can be a pain. There is no idea of row-level security as there is in a relational database.
So many times you'll see companies saying, well, we use the data lake for certain things, but anytime we need to have end users look at the data, we're going to put it into a relational database, where we can do things like row-level security, column-level security, dynamic data masking, and data discovery and classification. All these things that are built into a lot of relational database products are not in the data lake.
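
Editor's sketch (not part of the recording): the security point above contrasts two ways of restricting users to their own department's rows. The Python below is a minimal illustration under stated assumptions: the file contents, the `USER_DEPARTMENTS` mapping, and both function names are hypothetical, and the filter in `rows_visible_to` only imitates what a database's row-level security does for you.

```python
import csv
import io
from collections import defaultdict

# Hypothetical lake file holding rows for several departments.
SALES_FILE = "dept,amount\nHR,100\nFinance,250\nHR,75\n"

# Hypothetical mapping of users to the departments they may see.
USER_DEPARTMENTS = {"alice": {"HR"}, "bob": {"HR", "Finance"}}


def split_by_department(raw_text):
    """Lake-style workaround: produce one 'file' per department so that
    file-level permissions can stand in for row-level security."""
    files = defaultdict(list)
    for record in csv.DictReader(io.StringIO(raw_text)):
        files[record["dept"]].append(record)
    return dict(files)


def rows_visible_to(user, raw_text):
    """Database-style row-level security: one shared data set, filtered per user."""
    allowed = USER_DEPARTMENTS.get(user, set())
    return [r for r in csv.DictReader(io.StringIO(raw_text)) if r["dept"] in allowed]


per_department_files = split_by_department(SALES_FILE)
print(sorted(per_department_files))
print([r["amount"] for r in rows_visible_to("alice", SALES_FILE)])
```

With the split approach, every new department or regrouping means rewriting files and permissions; with the filter approach, the data stays in one place and only the access rule changes.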

James Serra: [10:23] And then there are also the missing features that you may be used to in a relational database and that are not in a data lake: things like referential integrity, data caching, and workload management. Again, some of those things are part of Delta Lake, but they add extra complexity on top of it. And then finally, I would say the complexity: if you have things in a file folder structure and you start adding a lot of data sources to it, you're adding additional folders and files, and it can be a mess to try to navigate that, because the metadata is stored someplace else. So, you have to match it up, and you can have inconsistency where somebody doesn't look at the metadata, or changes the metadata in a different location, and you have different metadata pointing to the same file. You don't have that in a relational database, where the metadata is stored along with the data, so it's one consistent view. So, all those things you have to take into consideration when you're saying, hey, can we do this with just the data lake? It may be that, for the reasons I explained, you also want to copy that data to a relational database.

Khalil Sheikh: [11:26] Excellent, excellent. Thank you. So James, modern enterprises are mostly moving from the traditional data warehouse to the cloud. And in the cloud, they're also thinking in terms of diversity of data, structured versus unstructured. So, the data lake market has started to get a lot of momentum recently. Who are the major players in this space? And how do you see Synapse, the Microsoft analytics platform, where ADF is blended into it and you can store structured as well as unstructured data? How do AWS (Amazon Web Services), Cloudera, data lakes, or some of the other players like Snowflake stack up against Synapse? So, a two-part question: who do you see as the major players in this particular space, and how is Synapse playing out in this very competitive data lake space?

James Serra: [12:35] Yeah. The data lake, because it stores unstructured and semi-structured data, gives you that benefit of not having to do upfront work. But you have to pay the price somewhere: somewhere along the line, somebody has to put structure on top of that data. You just have to be aware of that when you go with a data lake. When we look at the major players, I have looked at a lot of other solutions out there. You've talked about Snowflake, Amazon, Google, Cloudera, and Databricks. They all have their own unique approaches to what they call a data lakehouse, or enterprise data warehouse, or data fabric, or data mesh. These things can have different definitions, and each definition is going to fit around the technology that they built. If we look at Synapse, I think what differentiates that product is that it has this single pane of glass that, underneath the covers, has the ability to use different storage. You can have relational database storage; you can have a data lake, in their case Azure Data Lake Storage Gen2; you can have Spark tables; and there's Cosmos DB. All of them are possible storage for your data. And then on top of that, you can have multiple compute options. You can have a provisioned pool, which is the ability to use compute on top of relational data, and then you have the on-demand pools, which are the ability to query data in a data lake. Both of those use regular T-SQL. So, the benefit is that I can use T-SQL against data in a data lake. I create a view on top of that data, and it will appear to the user that they're using a relational database, but really, it's sitting on a data lake. I can then move that data into a relational database if I'm not getting the performance I need, and use that same T-SQL on it. Synapse also has Apache Spark pools, so as another option I can use a Spark notebook on top of data sitting in a data lake or in a relational database.
And then I can use Power BI on top of that to query all that data and put it in a reporting format. Also within Synapse is Azure Data Factory, so I have the ETL tool to move that data from sources into a data lake and/or a relational database. So, it's all under that single pane of glass, which makes it much easier. If we look at other products, like Snowflake, they have a little different setup. They don't have a separate data lake; they kind of treat their relational database as the data lake, in that they import data and keep it in that format in their relational database. So they have a little bit of a twist on that. And they don't have an ETL tool or reporting; those are all other products outside of that. So it's not so much a single pane of glass as it is within Synapse. They do have some benefits, like multi-cluster setups, where you can have multiple compute on top of one database, which Synapse doesn't have yet; that's going to come, I think, hopefully soon. But that's one advantage of Snowflake. And they have some other things, like a marketplace and such, that make them stand out a bit. And then we look at others, like Cloudera. They're kind of outside the relational database world for the most part, so I don't see them used too much.

James Serra: [16:34] Even in the cloud, they could be used as storage. But there are a lot of other components outside of the Cloudera stack that you can use. You're dealing in the open-source world, which not everybody wants. But you could meld a lot of these things together; you can use any type of products. When I was at Microsoft, with a lot of customers, I'd like to say they used all Microsoft products, but maybe they were using some ETL tool like Informatica and would continue using that. So a lot depends on the skill set of your current coworkers. I would never recommend some product that goes against what they're all experts at, unless it's an outdated technology. It's usually a combination of the two. But we're seeing a lot of customers look at Synapse because it can also result in a lot of cost savings. When you look at the on-demand pools, it's a pay-per-query model. And that's the ultimate differentiator from other competitors: I can use serverless and just query anything in a data lake, and I can do it very cheaply. That may be a way to go that gives me the performance I need. And if that works, great. If it gets to a point where it's not giving me the performance I need, or maybe I need more security, then I can just move the data into a relational database, because we're doing it all within the same environment.

Khalil Sheikh: [17:58] Thank you, James. Can you talk about the AI component of Synapse as well, like the integrated AI, machine learning, and Azure Cognitive Services? Of course, Power BI is there. But let's talk from an AI perspective: how do you see Synapse differentiating compared to other platforms that are available? Because it's one platform that gives you the data pipelines, data warehousing, data lake, and Power BI. Can you elaborate on the AI component and how it differentiates, based on your experience?

James Serra: [18:35] Yeah, there are a couple of ways to go. Power BI has automated ML that you can use. You have to be careful with that, because if you're not a data scientist, it could be junk in, junk out. But I've seen customers use it for certain use cases, and it works very well. With machine learning, you have to train the model. So if you have the data sitting in Synapse, in a relational database and/or a data lake, there are services you can use on top of that to quickly use that data to train the model. So in addition to Power BI, Microsoft has Azure ML, and I can use that product and point it to those data sources within Synapse, such as that relational database. And they're starting to integrate some of that into the dedicated pools, so you have the ability to quickly train and execute a model inside of regular T-SQL. In most cases, the data scientists honestly go, I don't want to use anything inside of Synapse; I want my own special tool that I've used for years, just give me access to the data. Well, that's fine, because if you're in Synapse and you store that data in a data lake and/or relational database, you can still access it outside of Synapse. With just connection strings and the right permissions, you can do whatever you want to access that data. So, I see a lot of customers doing their own thing with machine learning. And others are in the Microsoft world of Azure ML, and they're using that to point to the data sitting in something that was created within Synapse.

Khalil Sheikh: [20:21] Thank you, James. So, one of the areas, since we have lots of technical people on this call, is the choice of languages. When it comes to Synapse, we can use any language: T-SQL, Python, Scala, Spark, or .NET. How do you see that from an adoption perspective for an organisation, because you have so many choices of languages that you can use to implement it? Based on your experience, how do you compare that wealth of choices, with respect to customization and configuration, to other platforms?

James Serra: [21:00] Yeah, that's the big question. It's interesting to see how things have changed over the past four or five years. T-SQL has been the language of choice for many, many years. DBAs became very familiar with T-SQL over many years, end users wanted a simple language, and T-SQL can be pretty simple to learn. And then you have all the products that use T-SQL interfaces to data. Then Microsoft said, well, we're going to create a new type of SQL called U-SQL, which was in Azure Data Lake Analytics; that's since pretty much gone away. They thought, well, it's just a little bit different, but it's much more focused on big data. Well, in the end, it failed because it was too different from T-SQL, and there was a lot of pushback. And when you look at Hive and Spark, even though they're ANSI compliant, there's a lot of differentiation. If you're using T-SQL, you want to stay in a T-SQL world. So, a lot of those products suffered. Even today, if you're using some product outside of Microsoft and T-SQL, it's hard to get a lot of adoption for it. End users are used to T-SQL. So, I think what Microsoft really got right with Synapse was that they made T-SQL the de facto choice, whether you're hitting data in a relational database or data in the data lake. I found that one feature really opened the doors for customers wanting to use Synapse, because they thought: we don't have to retrain people; all the stored procedures that we've created before, we can continue using; if we created views and things like that in Power BI, we can still use them. And then you have tools that only interface with T-SQL; well, those tools are going to work against that too. So that was a really good choice by Microsoft. People are resistant to change; they don't like to learn something new.
And if you can keep them in that same comfort zone, and make it easy to migrate something they have into Synapse because the T-SQL doesn't have to change for the most part, you're going to get bigger adoption. And that's what I think happened. Synapse does have the Spark pools in there, and it was kind of 50-50 with customers. If I would mention Spark, sometimes their eyes glazed over and they looked at me all confused; they wanted nothing to do with Spark. And then there are other customers who have been using it for a while, so they're okay with that. But there's a large majority of customers who don't want to use Spark, and they want it much easier. So let them use T-SQL, or something like the data flows in Data Factory, which is a visual interface for transforming the data, and they're happy with that. If you go to Spark, I tell them, there's a notebook with a blinking cursor, and off you go, and you have to be ready for that. And if you're not, then try to use those other options: T-SQL and something like data flows in Data Factory.

Khalil Sheikh: [24:14] Thank you, James. Can you also talk about the operational data stores that come with Synapse, like Azure Cosmos DB, for sentiment and other analytics? What is the value of this bundle being available in Synapse, and how do you see people using it in enterprises?

James Serra: [24:36] Yeah. They have a thing called Synapse Link within Synapse, and it allows you to link right to Cosmos DB. Cosmos DB is an OLTP solution, but there's an option you can turn on that replicates that data instantaneously to a copy that is more for analytical purposes, and that's what Synapse ties into. So you're not hitting and hammering that OLTP database. This makes it much easier to access that data within Synapse without having to copy it over into a data lake or relational database. I can then join data that's sitting in a relational database, in the data lake, in a Spark table, and in Cosmos DB, in a single T-SQL SELECT statement, which I've done; I haven't blogged about this. And it's crazy: you can query all these four different data sources and get pretty good performance on it. That can be a great solution when you get into the federated query approach. Maybe I want to query this data just to see if it has value, or I want to do a one-time report on it, or I can use it as the ETL, actually using T-SQL to pull the data in and then put it somewhere. It allows you to do that federated query approach. I think of it as a data virtualization solution, or maybe a form of one. And I see this getting popular down the road, because down the line Microsoft will increase the storage options in there. So next, maybe a SQL database, or maybe even third parties like Oracle or Teradata; I don't know for sure. But if they expand this Synapse Link, which now also works against some of the Dynamics databases (Dataverse, as they renamed it), if they open it up, now I can go within Synapse and use an on-demand pool to query all the different data sources and just pay for the query. I can even have a view on top of that, and then use Power BI. So, I can very quickly use that data without having ETL to copy it in. Now, there are tradeoffs with that, obviously: the data is not staged, it hasn't been cleaned, it hasn't been mastered. There are a lot of things to think about. But it could be a very quick, easy, and cost-effective way to do federated queries.
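
Editor's sketch (not part of the recording): the federated query idea, one SELECT joining data that lives in separate stores, with a view on top for reporting, can be imitated locally. This is only an analogy under stated assumptions: two in-memory SQLite databases play the roles of the relational database and the data lake, the table names and rows are invented, and SQLite's `ATTACH DATABASE` is not how Synapse connects to its sources.

```python
import sqlite3

# Two separate in-memory SQLite databases stand in for two storage systems.
conn = sqlite3.connect(":memory:")                   # plays the "relational database"
conn.execute("ATTACH DATABASE ':memory:' AS lake")   # plays the "data lake"

conn.execute("CREATE TABLE main.customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE lake.clicks (customer_id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO main.customer VALUES (?, ?)",
    [(1, "Contoso"), (2, "Fabrikam")],
)
conn.executemany(
    "INSERT INTO lake.clicks VALUES (?, ?)",
    [(1, "/home"), (1, "/pricing"), (2, "/home")],
)

# A view hides where the data lives; a reporting tool would query the view.
# TEMP is used because an ordinary SQLite view cannot reference a table
# in another attached database.
conn.execute(
    "CREATE TEMP VIEW customer_clicks AS "
    "SELECT c.name AS name, COUNT(*) AS clicks "
    "FROM main.customer AS c JOIN lake.clicks AS k ON k.customer_id = c.id "
    "GROUP BY c.name"
)

result = conn.execute(
    "SELECT name, clicks FROM customer_clicks ORDER BY name"
).fetchall()
print(result)
```

No data is copied between the two stores; the join runs at query time, which is also why the tradeoffs mentioned above apply: nothing has been staged, cleaned, or mastered along the way.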

Khalil Sheikh: [27:08] Thank you, James. We come across a lot of legacy customers who have an on-prem EDW. What do you think it takes to move them to either Azure SQL or Synapse, from a skills, cost, and timeline perspective? Because if you think about it, it's totally changing the mindset. You have different tools: you have ADF versus SSIS, versus an MS SQL data warehouse. It's a mindset change, and the tools change. So, what do you think it takes for somebody who was sitting on prem to move to Synapse, in your experience?

James Serra: [27:54] Yeah. What I told customers when I was at Microsoft was, hey, if you haven't done this before, then look to consulting companies with experience, because at the very least they'll guide you in the right direction. Maybe on your first project you can use them and then get the experience that you need to do it on your own. So, you have to upskill yourself. There are challenges with migration: you want to minimize the downtime, and you want to make sure you're choosing the right products. I spent a lot of time with customers on discovery: what are you trying to migrate, what's your current skill set, and all these questions that would guide what the architecture would be. But I'd also show them the art of the possible, because it is a mind shift. There's a different way of doing things in the cloud. You don't want to just think of the way you did it on prem and do the same thing in the cloud. There's so much extra benefit you get, and it's not just cost; that's usually lower on the priority list. It's time to market, it's having high availability, disaster recovery, unlimited scalability, tons of different things you get. So, we tried to get customers to shy away from just a lift and shift: don't just go and create a VM, do an IaaS solution, and run everything in there. Try to think beyond that. It may be that you've had this data warehouse for many years, and now it's time to modernize it. So instead of doing a lift and shift, it's a lift and modernize, and you look at maybe rewriting parts of it. A lot of that depends on timeline, budget, and those things. But if you're going to move, there's going to be effort to do that, so why not make it a little bit more effort to modernize what you've built and take advantage of some of the cloud capabilities.
The benefit is that if you are on premises and using something like SQL Server, you're used to a lot of the T-SQL and the way things are done, and that carries over if you move into Synapse. Now, if you have a small data warehouse, you may not need Synapse; it may be overkill if you're talking about gigabytes of data. Maybe you go with something like SQL Database, if you know that it's not going to grow much, because at some point, if you get to, say, four terabytes, that's where SQL Database can't handle your data anymore. And you also just may not get the performance you need, because if you put something in Synapse with MPP technology, it is, of course, going to be 20 to 100 times faster. So, you've got to think of your future. Going from something like SQL Server to MPP, there are some differences in the way the data is laid out and in the technology, and not everything is supported, like cursors. So, there are some assessment tools you can run that look at all the code that you've written, like the stored procedures, and tell you if there's anything that needs to be changed. I always say customers should go through that, so they're well aware of what they have to change, if anything; sometimes it can be pretty simple. And then understand the cost benefits of having something like an on-demand pool that you don't have on prem. And then think differently about the way you collect the data that you need. One of the challenges is this: if the source data is on prem and your data warehouse is on prem, well, maybe I have two servers right next to each other, and I can quickly copy the data over. But if the warehouse is now going to move to the cloud, what's your bottleneck? I would tell customers that you're only going to be able to transfer data as fast as your pipeline from your data center to the cloud. So maybe you need to increase that and use something like ExpressRoute that Microsoft has.
Or maybe you just need to do something as simple as, instead of doing everything at night, uploading data every hour, so you can have it in the cloud. So that's the extra challenge. But now you open up the door: if you have data sources in the cloud, it's very easy to move them to other spots in the cloud. What I found with customers is that when they're moving from SQL Server to something like Synapse, usually there's some training for a couple of days for the DBAs and some of the developers to learn the differences between SQL Server and Synapse, but it wasn't any huge effort. Obviously, if it's a completely different world, like Oracle or Teradata, there are more challenges, but there are a lot of tools now that Microsoft has to migrate that over into Synapse easily. So, you're not having to spend a lot of cycles trying to get that data and code up into the cloud.
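
Editor's sketch (not part of the recording): the bottleneck James describes, transferring data only as fast as the link from the data center to the cloud, comes down to simple arithmetic. The function below is a hypothetical helper; the data volume, link speeds, and the 0.8 efficiency factor are assumptions chosen for illustration, not figures from the episode.

```python
def transfer_hours(data_gb, link_mbps, efficiency=0.8):
    """Estimate the hours needed to move data_gb gigabytes over a link rated
    at link_mbps megabits per second.

    efficiency is an assumed fraction of the rated bandwidth actually achieved.
    """
    if data_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("data_gb must be >= 0, link_mbps > 0, 0 < efficiency <= 1")
    megabits = data_gb * 8 * 1000  # 1 GB = 8,000 megabits in decimal units
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


# Assumed scenario: a 1,000 GB nightly load over a 100 Mbps link versus a 1 Gbps link.
nightly_slow = transfer_hours(1000, 100)
nightly_fast = transfer_hours(1000, 1000)
# The same daily volume split into 24 hourly batches over the slow link.
hourly_batch = transfer_hours(1000 / 24, 100)
print(round(nightly_slow, 1), round(nightly_fast, 1), round(hourly_batch, 2))
```

Under these assumptions the nightly load takes roughly 27.8 hours on the slower link, so it cannot finish overnight, while each hourly batch takes a little over an hour. That is why the answer above suggests either a faster link or spreading the uploads across the day.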

Khalil Sheikh: [32:30] Thank you, James. What do you think it takes to do a discovery for an environment like that? Moving a legacy EDW to Synapse as a lift and shift is one thing. But if you're really transforming and modernizing the EDW, and even introducing the data lake concept, it's altogether very different. If you're moving it to Synapse and want to make use of a data lake and ask, okay, what kind of reporting, dashboarding, or KPI scorecards do I want to have at this point in time, I think lift and shift does not create a value proposition. So what do you think a discovery may take, from your experience, where you can understand the data structure, especially if you have multiple sub-databases or data warehouses in a given environment, and move it? Is it a three-month discovery engagement? A six-week engagement? What do you think it takes to understand the data and come up with a strategy to migrate properly? Because you also have SSIS, which is on-prem ETL. So, if you're going to move to ADF, and if you're going to adopt Synapse and get the real value of it, is it complex? What have you seen in these migration or modernization efforts?

James Serra: [33:56] Yeah, it's a good point in that you have to do a proper assessment. And again, sometimes I'd say you'd do well to go to a consulting company that may have a set of tools and has done this many times. Because the biggest mistake I've seen is that they sometimes underestimate the movement of this data, not only the time to physically move it, but also understanding the proper location to move it to, and whether to modernize by creating a data lake if you're not currently using one. Do we move all the data from a relational database into a data lake and then back into a relational database, or do we skip the data lake? In some cases, that could be a viable option to start with. You could even say you can use your SSIS code in Data Factory. Instead of switching over to Data Factory entirely, you can run SSIS as a feature of Data Factory, so maybe you want to do that at first and just rewrite some of the slower SSIS packages. But if you have 1,000 SSIS packages, I would say move most of those over, run them as they are, just change the destination to something in Synapse, and then later go on and convert everything, because you'll have to train everybody in this new skill set. And then there's the optimization that you can do. If you are on-prem, you probably have a lot of servers and a lot of databases. You need to go and find what's not being used anymore. Maybe you can get rid of these; maybe you consolidate the databases. So instead of having them on different servers, and maybe as small databases, there are ways to optimize that. If you're using stuff outside of Synapse, you can say, here's elastic pools, and other avenues that you can take for those databases.
Because some of those databases may be OLTP databases, and that's a whole separate thing compared to Synapse: okay, which product am I moving to in Azure? Because with Azure SQL Database there are managed instances, there are elastic pools, there's Hyperscale. And so, you need somebody who understands all of those, so you make sure you slot into the right one and save the most cost, and also have the fewest migration headaches. So, it needs an understanding of the whole landscape, and the experience of finding those cost efficiencies, because there's no easy button to say, go scan my source systems and then automatically upgrade and slot them into the right tool. There are assessment tools that will help you in your choice, but they may be wrong. And fortunately, in most cases, if you put it into a product at a certain tier and it's not enough, you can quickly and easily scale up within a few minutes. But it's not as easy to move it to a whole different product. So, you have to be really cognizant of what you have now, as well as what you're going to have in the future. Because now that you have the cloud, usually you're going to have all these extra requests or data sources. Maybe you want to ingest Twitter data to find company sentiment, what people think about your company and your product; maybe you want IoT devices, stuff you could never do on-prem. Well, make sure you have a product, like Synapse, that can handle that data no matter the size, speed, or type. So, you're not migrating over into some solution and then six months down the road saying, this is not handling what we need, we've got to start from scratch. And this is where I spend time with customers. They don't know what they don't know. So let me explain all the products to you, or at least understand your use case, and then dive into the products that are right for your use case. 
But also keep in mind where you want to go, and ask questions to make sure that product can last you a number of years.

Khalil Sheikh: [37:54] Thank you, James. Can you talk a little bit about some of your EY experience, a case study where you have moved a legacy platform, such as an EDW or multiple legacy data warehouses, into Synapse? Any case study that you can think of?

James Serra: [38:18] It's the data fabric we're building out. And this was built out previously using Cloudera and a lot of open-source tools. And we decided to rebuild it in Azure. And this is why I want to bring those higher in there. And so the migration of that is a little bit challenging, because you have to move data from one system to another. Or else it could be not as challenging, because you can just say, let's go get the data from the source systems again, and not actually have to migrate the data in the Cloudera data warehouse. And then what this allows, when you standardize on a cloud platform like Azure, is not just Synapse; you have this whole other landscape of tools that you can use. In our case, we're using Databricks and Delta Lake in addition to Synapse. We're using some third-party tools for additional features that are not available in the Microsoft cloud or are limited, like ABAC (attribute-based access control), data virtualization, or master data management. So, we plug third-party products into this data fabric. And this data fabric is trying to collect data at a large scale, many petabytes of data. And again, it differs in size, speed, and type: it could be batch or real time, it could be JSON and CSV files. And so that's working very well. And there are similar things within EY; it's a huge company, 100,000 employees, so you can imagine there are a lot of projects where they're moving from another company, another cloud, or it could be on-prem. And they could be doing it for EY, or they could be doing it for customers, because there's a large technology consulting practice within EY. And what I've seen is that, fortunately, Microsoft has a lot of these migration tools that are either automated or tell you step by step, so they've made it a lot easier to move it over. 
The big challenge, as before, is whether you want to modernize and not just do a lift and shift. But there's been a lot of success. I've seen it at EY and at Microsoft with those migrations. There are not as many on-prem anymore; at Microsoft, you know, six, seven years ago, there was a lot of that, but most customers are already in the cloud, at least the medium and large-sized customers. Smaller companies still have those challenges of getting there. But a lot of customers are getting away from the data center and are closing them. And Microsoft has its own teams that can help if you are a very large account, or you can go to a consulting company that can help with the migration.

EPISODE 3

Data Mesh - Driving Insights At Scale

7th September 2021 10.00 AM CST