Data Engineering Podcast

Tobias Macey

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Unlocking The Potential Of Streaming Data Applications Without The Operational Headache At Grainite
2d ago
Summary The promise of streaming data is that it allows you to react to new information as it happens, rather than introducing latency by batching records together. The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be. After experiencing this unfortunate reality for themselves, Abhishek Chauhan and Ashish Kumar founded Grainite so that you don't have to suffer the same pain. In this episode they explain why streaming architectures are so challenging, how they have designed Grainite to be robust and scalable, and how you can start using it today to build your streaming data applications without all of the operational headache. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) today to learn more Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. 
By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today. Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today Your host is Tobias Macey and today I'm interviewing Ashish Kumar and Abhishek Chauhan about Grainite, a platform designed to give you a single place to build streaming data applications Interview Introduction How did you get involved in the area of data management? Can you describe what Grainite is and the story behind it? What are the personas that you are focused on addressing with Grainite? What are some of the most complex aspects of building streaming data applications in the absence of something like Grainite? How does Grainite work to reduce that complexity? What are some of the commonalities that you see in the teams/organizations that find their way to Grainite? What are some of the higher-order projects that teams are able to build when they are using Grainite as a starting point vs. 
where they would be spending effort on a fully managed streaming architecture? Can you describe how Grainite is architected? How have the design and goals of the platform changed/evolved since you first started working on it? What does your internal build vs. buy process look like for identifying where to spend your engineering resources? What is the process for getting Grainite set up and integrated into an organization's technical environment? What is your process for determining which elements of the platform to expose as end-user features and customization options vs. keeping internal to the operational aspects of the product? Once Grainite is running, can you describe the day 0 workflow of building an application or data flow? What are the day 2 - N capabilities that Grainite offers for ongoing maintenance/operation/evolution of those applications? What are the most interesting, innovative, or unexpected ways that you have seen Grainite used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Grainite? When is Grainite the wrong choice? What do you have planned for the future of Grainite? Contact Info Ashish LinkedIn (https://www.linkedin.com/in/ashishkumarprofile/) Abhishek LinkedIn (https://www.linkedin.com/in/abhishekchauhan/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. 
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Grainite (https://www.grainite.com/) Blog about the challenges of streaming architectures (https://www.grainite.com/blog/there-was-an-old-lady-who-swallowed-a-fly) Getting Started Docs (https://gitbook.grainite.com/developers/getting-started) BigTable (https://research.google/pubs/pub27898/) Spanner (https://research.google/pubs/pub39966/) Firestore (https://cloud.google.com/firestore) OpenCensus (https://opencensus.io/) Citrix (https://www.citrix.com/) NetScaler (https://www.citrix.com/blogs/2022/10/03/netscaler-is-back/) J2EE (https://www.oracle.com/java/technologies/appmodel.html) RocksDB (https://rocksdb.org/) Pulsar (https://pulsar.apache.org/) SQL Server (https://en.wikipedia.org/wiki/Microsoft_SQL_Server) MySQL (https://www.mysql.com/) RAFT Protocol (https://raft.github.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Aligning Data Security With Business Productivity To Deploy Analytics Safely And At Speed
19-03-2023
Summary As with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today RudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state of the art pipelines to collect all of your data, build a complete view of your customer and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudder (https://www.dataengineeringpodcast.com/rudder) Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. 
Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today. Your host is Tobias Macey and today I'm interviewing Yoav Cohen about the challenges that data teams face in securing their data platforms and how that impacts the productivity and adoption of data in the organization Interview Introduction How did you get involved in the area of data management? Data security is a very broad term. Can you start by enumerating some of the different concerns that are involved? How has the scope and complexity of implementing security controls on data systems changed in recent years? In your experience, what is a typical number of data locations that an organization is trying to manage access/permissions within? What are some of the main challenges that data/compliance teams face in establishing and maintaining security controls? How much of the problem is technical vs. procedural/organizational? As a vendor in the space, how do you think about the broad categories/boundary lines for the different elements of data security? (e.g. masking vs. RBAC, etc.) What are the different layers that are best suited to managing each of those categories? (e.g. masking and encryption in storage layer, RBAC in warehouse, etc.) 
What are some of the ways that data security and organizational productivity are at odds with each other? What are some of the shortcuts that you see teams and individuals taking to address the productivity hit from security controls? What are some of the methods that you have found to be most effective at mitigating or even improving productivity impacts through security controls? How does up-front design of the security layers improve the final outcome vs. trying to bolt on security after the platform is already in use? How can education about the motivations for different security practices improve compliance and user experience? What are the most interesting, innovative, or unexpected ways that you have seen data teams align data security and productivity? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data security technology? What are the areas of data security that still need improvements? Contact Info Yoav Cohen (https://www.linkedin.com/in/yoav-cohen-7a4ba23/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Satori (https://satoricyber.com) Podcast Episode (https://www.dataengineeringpodcast.com/satori-cloud-data-governance-episode-165) Data Masking (https://en.wikipedia.org/wiki/Data_masking) RBAC == Role Based Access Control (https://en.wikipedia.org/wiki/Role-based_access_control) ABAC == Attribute Based Access Control (https://en.wikipedia.org/wiki/Attribute-based_access_control) Gartner Data Security Platform Report (https://www.gartner.com/en/documents/4006252) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Use Your Data Warehouse To Power Your Product Analytics With NetSpring
10-03-2023
Summary With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today! RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. 
Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder (https://www.dataengineeringpodcast.com/rudder) Your host is Tobias Macey and today I'm interviewing Priyendra Deshwal about how NetSpring is using the data warehouse to deliver a more flexible and detailed view of your product analytics Interview Introduction How did you get involved in the area of data management? Can you describe what NetSpring is and the story behind it? What are the activities that constitute "product analytics" and what are the roles/teams involved in those activities? When teams first come to you, what are the common challenges that they are facing and what are the solutions that they have attempted to employ? Can you describe some of the challenges involved in bringing product analytics into enterprise or highly regulated environments/industries? How does a warehouse-native approach simplify that effort? There are many different players (both commercial and open source) in the product analytics space. Can you share your view on the role that NetSpring plays in that ecosystem? How is the NetSpring platform implemented to be able to best take advantage of modern warehouse technologies and the associated data stacks? What are the pre-requisites for an organization's infrastructure/data maturity for being able to benefit from NetSpring? How have the goals and implementation of the NetSpring platform evolved from when you first started working on it? Can you describe the steps involved in integrating NetSpring with an organization's existing warehouse? What are the signals that NetSpring uses to understand the customer journeys of different organizations? How do you manage the variance of the data models in the warehouse while providing a consistent experience for your users? 
Given that you are a product organization, how are you using NetSpring to power NetSpring? What are the most interesting, innovative, or unexpected ways that you have seen NetSpring used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on NetSpring? When is NetSpring the wrong choice? What do you have planned for the future of NetSpring? Contact Info LinkedIn (https://www.linkedin.com/in/priyendra-deshwal/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links NetSpring (https://www.netspring.io/) ThoughtSpot (https://www.thoughtspot.com/) Product Analytics (https://theproductmanager.com/topics/product-analytics-guide/) Amplitude (https://amplitude.com/) Mixpanel (https://mixpanel.com/) Customer Data Platform (https://blog.hubspot.com/service/customer-data-platform-guide) GDPR (https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) CCPA (https://en.wikipedia.org/wiki/California_Consumer_Privacy_Act) Segment (https://segment.com/) Podcast Episode (https://www.dataengineeringpodcast.com/segment-customer-analytics-episode-72/) Rudderstack (https://www.rudderstack.com/) Podcast Episode (https://www.dataengineeringpodcast.com/rudderstack-open-source-customer-data-platform-episode-263/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Exploring The Nuances Of Building An Intentional Data Culture
06-03-2023
Summary The ecosystem for data professionals has matured to the point that there are a large and growing number of distinct roles. With the scope and importance of data steadily increasing it is important for organizations to ensure that everyone is aligned and operating in a positive environment. To help facilitate the nascent conversation about what constitutes an effective and productive data culture, the team at Data Council have dedicated an entire conference track to the subject. In this episode Pete Soderling and Maggie Hays join the show to explore this topic and their experience preparing for the upcoming conference. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today. 
Your host is Tobias Macey and today I'm interviewing Pete Soderling and Maggie Hays about the growing importance of establishing and investing in an organization's data culture and their experience forming an entire conference track around this topic Interview Introduction How did you get involved in the area of data management? Can you describe what your working definition of "Data Culture" is? In what ways is a data culture distinct from an organization's corporate culture? How are they interdependent? What are the elements that are most impactful in forming the data culture of an organization? What are some of the motivations that teams/companies might have in fighting against the creation and support of an explicit data culture? Are there any strategies that you have found helpful in counteracting those tendencies? In terms of the conference, what are the factors that you consider when deciding how to group the different presentations into tracks or themes? What are the experiences that you have had personally and in community interactions that led you to elevate data culture to be its own track? What are the broad challenges that practitioners are facing as they develop their own understanding of what constitutes a healthy and productive data culture? What are some of the risks that you considered when forming this track and evaluating proposals? What are your criteria for determining whether this track is successful? What are the most interesting, innovative, or unexpected aspects of data culture that you have encountered through developing this track? What are the most interesting, unexpected, or challenging lessons that you have learned while working on selecting presentations for this year's event? What do you have planned for the future of this topic at Data Council events? 
Contact Info Pete @petesoder (https://twitter.com/petesoder) on Twitter LinkedIn (https://www.linkedin.com/in/petesoder) Maggie LinkedIn (https://www.linkedin.com/in/maggie-hays) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Data Council (https://datacouncil.ai/austin) Podcast Episode (https://www.dataengineeringpodcast.com/data-council-data-professional-community-episode-96) Data Community Fund (https://www.datacommunity.fund) DataHub (https://datahubproject.io/) Podcast Episode (https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230/) Database Design For Mere Mortals (https://amzn.to/3ZFV6dU) by Michael J. 
Hernandez (affiliate link) SOAP (https://en.wikipedia.org/wiki/SOAP) REST (https://en.wikipedia.org/wiki/Representational_state_transfer) Econometrics (https://en.wikipedia.org/wiki/Econometrics) DBA == Database Administrator (https://www.careerexplorer.com/careers/database-administrator/) Conway's Law (https://en.wikipedia.org/wiki/Conway%27s_law) dbt (https://www.getdbt.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Building A Data Mesh Platform At PayPal
27-02-2023
Summary There has been a lot of discussion about the practical application of data mesh and how to implement it in an organization. Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. In this episode he shares that journey and the combination of technical and organizational challenges that he encountered in the process. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today. Your host is Tobias Macey and today I'm interviewing Jean-Georges Perrin about his work at PayPal to implement a data mesh and the role of data contracts in making it work Interview Introduction How did you get involved in the area of data management? Can you start by describing the goals and scope of your work at PayPal to implement a data mesh? What are the core problems that you were addressing with this project? Is a data mesh ever "done"? 
What was your experience engaging at the organizational level to identify the granularity and ownership of the data products that were needed in the initial iteration? What was the impact of leading multiple teams on the design of how to implement communication/contracts throughout the mesh? What are the technical systems that you are relying on to power the different data domains? What is your philosophy on enforcing uniformity in technical systems vs. relying on interface definitions as the unit of consistency? What are the biggest challenges (technical and procedural) that you have encountered during your implementation? How are you managing visibility/auditability across the different data domains? (e.g. observability, data quality, etc.) What are the most interesting, innovative, or unexpected ways that you have seen PayPal's data mesh used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh? When is a data mesh the wrong choice? What do you have planned for the future of your data mesh at PayPal? Contact Info LinkedIn (https://www.linkedin.com/in/jgperrin/) Blog (https://jgp.ai/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Data Mesh (https://www.thoughtworks.com/en-us/what-we-do/data-and-ai/data-mesh) O'Reilly Book (https://amzn.to/3Z5nC8T) (affiliate link) The next generation of Data Platforms is the Data Mesh (https://medium.com/paypal-tech/the-next-generation-of-data-platforms-is-the-data-mesh-b7df4b825522) PayPal (https://about.pypl.com/about-us/default.aspx) Conway's Law (https://en.wikipedia.org/wiki/Conway%27s_law) Data Mesh For All Ages - US (https://amzn.to/3YzVRop), Data Mesh For All Ages - UK (https://amzn.to/3YzVRop) Data Mesh Radio (https://daappod.com/data-mesh-radio/) Data Mesh Community (https://datameshlearning.com/) Data Mesh In Action (http://jgp.ai/dmia) Great Expectations (https://greatexpectations.io/) Podcast Episode (https://www.dataengineeringpodcast.com/great-expectations-technical-debt-data-pipeline-episode-117/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse
19-02-2023
Summary Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data they constrain the possibilities of what data you can store and how it can be used. Projects like Apache Iceberg provide a viable alternative in the form of data lakehouses that provide the scalability and flexibility of data lakes, combined with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business Tabular to make it even easier to implement and maintain. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today. 
Your host is Tobias Macey and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular Interview Introduction How did you get involved in the area of data management? Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem? Since it is fundamentally a specification, how do you manage compatibility and consistency across implementations? What are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation in October of 2018? Around the time that Iceberg was first created at Netflix a number of alternative table formats were also being developed. What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects? Given the constant evolution of the various table formats it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons? For someone who wants to manage their data in Iceberg tables, what does the implementation look like? How does that change based on the type of query/processing engine being used? Once a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance? What are the most interesting, innovative, or unexpected ways that you have seen Iceberg used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular? When is Iceberg/Tabular the wrong choice? What do you have planned for the future of Iceberg/Tabular? Contact Info LinkedIn (https://www.linkedin.com/in/rdblue/) rdblue (https://github.com/rdblue) on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Hadoop (https://hadoop.apache.org/) Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/) ACID == Atomic, Consistent, Isolated, Durable (https://en.wikipedia.org/wiki/ACID) Apache Hive (https://hive.apache.org/) Apache Impala (https://impala.apache.org/) Bodo (https://www.bodo.ai/) Podcast Episode (https://www.dataengineeringpodcast.com/bodo-parallel-data-processing-python-episode-223/) StarRocks (https://www.starrocks.io/) Dremio (https://www.dremio.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dremio-open-data-lakehouse-episode-333/) DDL == Data Definition Language (https://en.wikipedia.org/wiki/Data_definition_language) Trino (https://trino.io/) PrestoDB (https://prestodb.io/) Apache Hudi (https://hudi.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209/) dbt (https://www.getdbt.com/) Apache Flink (https://flink.apache.org/)
TileDB (https://tiledb.com/) Podcast Episode (https://www.dataengineeringpodcast.com/tiledb-universal-data-engine-episode-146/) CDC == Change Data Capture (https://en.wikipedia.org/wiki/Change_data_capture) Substrait (https://substrait.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Let The Whole Team Participate In Data With The Quilt Versioned Data Hub
11-02-2023
Summary Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions! Your host is Tobias Macey and today I'm interviewing Aneesh Karve about how Quilt Data helps you bring order to your chaotic data in S3 with transactional versioning and data discovery built in Interview Introduction How did you get involved in the area of data management? Can you describe what Quilt is and the story behind it? How have the goals and features of the Quilt platform changed since I spoke with Kevin in June of 2018? 
What are the main problems that users are trying to solve when they find Quilt? What are some of the alternative approaches/products that they are coming from? How does Quilt compare with options such as LakeFS, Unstruk, Pachyderm, etc.? Can you describe how Quilt is implemented? What are the types of tools and systems that Quilt gets integrated with? How do you manage the tension between supporting the lowest common denominator, while providing options for more advanced capabilities? What is a typical workflow for a team that is using Quilt to manage their data? What are the most interesting, innovative, or unexpected ways that you have seen Quilt used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Quilt? When is Quilt the wrong choice? What do you have planned for the future of Quilt? Contact Info LinkedIn (https://www.linkedin.com/in/aneeshkarve/) @akarve (https://twitter.com/akarve) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Quilt Data (https://quiltdata.com/) Podcast Episode (https://www.dataengineeringpodcast.com/quilt-data-with-kevin-moore-episode-37/) UW Madison (https://www.wisc.edu/) Docker Swarm (https://docs.docker.com/engine/swarm/) Kaggle (https://www.kaggle.com/) open.quiltdata.com (https://open.quiltdata.com/) FinOS Perspective (https://perspective.finos.org/) LakeFS (https://lakefs.io/) Podcast Episode (https://www.dataengineeringpodcast.com/lakefs-data-lake-versioning-episode-157/) Pachyderm (https://www.pachyderm.com/) Podcast Episode (https://www.dataengineeringpodcast.com/pachyderm-data-lineage-episode-82) Unstruk (https://www.unstruk.com/) Podcast Episode (https://www.dataengineeringpodcast.com/unstruk-unstructured-data-warehouse-episode-196/) Parquet (https://parquet.apache.org/) Avro (https://avro.apache.org/) ORC (https://orc.apache.org/) Cloudformation (https://aws.amazon.com/cloudformation/) Troposphere (https://github.com/cloudtools/troposphere) CDK == Cloud Development Kit (https://aws.amazon.com/cdk/) Shadow IT (https://en.wikipedia.org/wiki/Shadow_IT) Podcast Episode (https://www.dataengineeringpodcast.com/shadow-it-data-analytics-episode-121) Delta Lake (https://delta.io/) Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/) Apache Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Datasette (https://datasette.io/) Frictionless (https://frictionlessdata.io/) DVC (https://dvc.org/) Podcast.__init__ Episode (https://www.pythonpodcast.com/data-version-control-episode-206/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra 
(http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Reflecting On The Past 6 Years Of Data Engineering
06-02-2023
Summary This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Your host is Tobias Macey and today I'm reflecting on the major trends in data engineering over the past 6 years Interview Introduction 6 years of running the Data Engineering Podcast Around the first time that data engineering was discussed as a role Followed on from hype about "data science" Hadoop era Streaming Lambda and Kappa architectures Not really referenced anymore "Big Data" era of capture everything has shifted to focusing on data that presents value Regulatory environment increases risk, better tools introduce more capability to understand what data is useful Data catalogs Amundsen and Alation Orchestration engine Oozie, etc. -> Airflow and Luigi -> Dagster, Prefect, Lyft, etc. Orchestration is now a part of most vertical tools Cloud data warehouses Data lakes DataOps and MLOps Data quality to data observability Metadata for everything Data catalog -> data discovery -> active metadata Business intelligence Read only reports to metric/semantic layers Embedded analytics and data APIs Rise of ELT dbt Corresponding introduction of reverse ETL What are the most interesting, unexpected, or challenging lessons that you have learned while working on running the podcast? What do you have planned for the future of the podcast? Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. 
Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Let Your Business Intelligence Platform Build The Models Automatically With Omni Analytics
30-01-2023
Summary Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that are used by the business to understand and direct the business, but the process is very labor and time intensive. The team at Omni have taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions! Your host is Tobias Macey and today I'm interviewing Chris Merrick about the Omni Analytics platform and how they are adding automatic data modeling to your business intelligence Interview Introduction How did you get involved in the area of data management? Can you describe what Omni Analytics is and the story behind it? What are the core goals that you are trying to achieve with building Omni? Business intelligence has gone through many evolutions. 
What are the unique capabilities that Omni Analytics offers over other players in the market? What are the technical and organizational anti-patterns that typically grow up around BI systems? What are the elements that contribute to BI being such a difficult product to use effectively in an organization? Can you describe how you have implemented the Omni platform? How have the design/scope/goals of the product changed since you first started working on it? What does the workflow for a team using Omni look like? What are some of the developments in the broader ecosystem that have made your work possible? What are some of the positive and negative inspirations that you have drawn from the experience that you and your team-mates have gained in previous businesses? What are the most interesting, innovative, or unexpected ways that you have seen Omni used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Omni? When is Omni the wrong choice? What do you have planned for the future of Omni? Contact Info LinkedIn (https://www.linkedin.com/in/merrickchristopher/) @cmerrick (https://twitter.com/cmerrick) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Omni Analytics (https://www.exploreomni.com/) Stitch (https://www.stitchdata.com/) RJ Metrics (https://en.wikipedia.org/wiki/RJMetrics) Looker (https://www.looker.com/) Podcast Episode (https://www.dataengineeringpodcast.com/looker-with-daniel-mintz-episode-55/) Singer (https://www.singer.io/) dbt (https://www.getdbt.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/) Teradata (https://www.teradata.com/) Fivetran (https://www.fivetran.com/) Apache Arrow (https://arrow.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/) DuckDB (https://duckdb.org/) Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/) BigQuery (https://cloud.google.com/bigquery) Snowflake (https://www.snowflake.com/en/) Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI
22-01-2023
Summary The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions! Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. 
Go to dataengineeringpodcast.com/gartnerda (https://www.dataengineeringpodcast.com/gartnerda) today to find out more. Your host is Tobias Macey and today I'm interviewing Adam Kamor about Tonic, a service for generating data sets that are safe for development, analytics, and machine learning Interview Introduction How did you get involved in the area of data management? Can you describe what Tonic is and the story behind it? What are the core problems that you are trying to solve? What are some of the ways that fake or obfuscated data is used in development and analytics workflows? Challenges of reliably subsetting data. Impact of ORMs and bad habits developers get into with database modeling. Can you describe how Tonic is implemented? What are the units of composition that you are building to allow for evolution and expansion of your product? How have the design and goals of the platform evolved since you started working on it? Can you describe some of the different workflows that customers build on top of your various tools? What are the most interesting, innovative, or unexpected ways that you have seen Tonic used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Tonic? When is Tonic the wrong choice? What do you have planned for the future of Tonic? Contact Info LinkedIn (https://www.linkedin.com/in/adam-kamor-85720b48/) @AdamKamor (https://twitter.com/adamkamor) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Tonic (https://hubs.la/Q01yX4qN0) Djinn (https://hubs.la/Q01yX4FL0) Django (https://www.djangoproject.com/) Ruby on Rails (https://rubyonrails.org/) C# (https://learn.microsoft.com/en-us/dotnet/csharp/tour-of-csharp/) Entity Framework (https://learn.microsoft.com/en-us/dotnet/csharp/tour-of-csharp/) PostgreSQL (https://www.postgresql.org/) MySQL (https://www.mysql.com/) Oracle DB (https://www.oracle.com/database/) MongoDB (https://www.mongodb.com/) Parquet (https://parquet.apache.org/) Databricks (https://www.databricks.com/) Mockaroo (https://www.mockaroo.com/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Building Applications With Data As Code On The DataOS
16-01-2023
Summary The modern data stack has made it more economical to use enterprise grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions! Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. 
Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more. Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda (https://www.dataengineeringpodcast.com/gartnerda) today to find out more. Your host is Tobias Macey and today I'm interviewing Srujan Akula about DataOS, a pre-integrated and managed data platform built by The Modern Data Company Interview Introduction How did you get involved in the area of data management? Can you describe what your mission at The Modern Data Company is and the story behind it? Your flagship (only?) product is a platform that you're calling DataOS. What is the scope and goal of that platform? Who is the target audience? On your site you refer to the idea of "data as software". What are the principles and ways of thinking that are encompassed by that concept? What are the platform capabilities that are required to make it possible? There are 11 "Key Features" listed on your site for the DataOS. What was your process for identifying the "must have" vs "nice to have" features for launching the platform? Can you describe the technical architecture that powers your DataOS product? 
What are the core principles that you are optimizing for in the design of your platform? How have the design and goals of the system changed or evolved since you started working on DataOS? Can you describe the workflow for the different practitioners and stakeholders working on an installation of DataOS? What are the interfaces and escape hatches that are available for integrating with and extending the operation of the DataOS? What are the features or capabilities that you are expressly choosing not to implement? (e.g. ML pipelines, data sharing, etc.) What are the design elements that you are focused on to make DataOS approachable and understandable by different members of an organization? What are the most interesting, innovative, or unexpected ways that you have seen DataOS used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on DataOS? When is DataOS the wrong choice? What do you have planned for the future of DataOS? Contact Info LinkedIn (https://www.linkedin.com/in/srujanakula/) @srujanakula (https://twitter.com/srujanakula) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Modern Data Company (https://themoderndatacompany.com/) Alation (https://www.alation.com/) Airbyte (https://airbyte.com/) Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/) Fivetran (https://www.fivetran.com/) Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/) Airflow (https://airflow.apache.org/) Dremio (https://www.dremio.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dremio-with-tomer-shiran-episode-58/) PrestoDB (https://prestodb.io/) GraphQL (https://graphql.org/) Cypher (https://neo4j.com/developer/cypher/) graph query language Gremlin (https://en.wikipedia.org/wiki/Gremlin_(query_language)) graph query language The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Automate Your Pipeline Creation For Streaming Data Transformations With SQLake
08-01-2023
Summary Managing end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grows. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transformations in a unified SQL interface. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda (https://www.dataengineeringpodcast.com/gartnerda) today to find out more. Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions! Struggling with broken pipelines? Stale dashboards? Missing data?
If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more. Your host is Tobias Macey and today I'm interviewing Ori Rafael about the SQLake feature for the Upsolver platform that automatically generates pipelines from your queries Interview Introduction How did you get involved in the area of data management? Can you describe what the SQLake product is and the story behind it? What is the core problem that you are trying to solve? What are some of the anti-patterns that you have seen teams adopt when designing and implementing DAGs in a tool such as Airflow? What are the benefits of merging the logic for transformation and orchestration into the same interface and dialect (SQL)? Can you describe the technical implementation of the SQLake feature? What does the workflow look like for designing and deploying pipelines in SQLake? What are the opportunities for using utilities such as dbt for managing logical complexity as the number of pipelines scales? SQL has traditionally been challenging to compose. How did that factor into your design process for how to structure the dialect extensions for job scheduling?
What are some of the complexities that you have had to address in your orchestration system to be able to manage timeliness of operations as volume and complexity of the data scales? What are some of the edge cases that you have had to provide escape hatches for? What are the most interesting, innovative, or unexpected ways that you have seen SQLake used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLake? When is SQLake the wrong choice? What do you have planned for the future of SQLake? Contact Info LinkedIn (https://www.linkedin.com/in/ori-rafael-91723344/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Upsolver (https://www.upsolver.com/) Podcast Episode (https://www.dataengineeringpodcast.com/upsolver-streaming-data-integration-episode-240/) SQLake (https://docs.upsolver.com/sqlake/) Airflow (https://airflow.apache.org/) Dagster (https://dagster.io/) Podcast Episode (https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309/) Prefect (https://www.prefect.io/) Podcast Episode (https://www.dataengineeringpodcast.com/prefect-workflow-engine-episode-86/) Flyte (https://flyte.org/) Podcast Episode (https://www.dataengineeringpodcast.com/flyte-data-orchestration-machine-learning-episode-291/) GitHub Actions (https://github.com/features/actions) dbt (https://www.getdbt.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/) PartiQL (https://partiql.org/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI
29-12-2022
Summary Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increases the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show! Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. 
Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan (https://www.dataengineeringpodcast.com/atlan) today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos. Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more. Your host is Tobias Macey and today I'm interviewing Rehgan Avon about her work at AlignAI to help organizations standardize their technical and procedural approaches to working with data Interview Introduction How did you get involved in the area of data management? Can you describe what AlignAI is and the story behind it? What are the core problems that you are focused on addressing? What are the tactical ways that you are working to solve those problems? What are some of the common and avoidable ways that analytics/AI projects go wrong? 
What are some of the ways that organizational scale and complexity impacts their ability to execute on data and AI projects? What are the ways that incomplete/unevenly distributed knowledge manifests in project design and execution? Can you describe the design and implementation of the AlignAI platform? How have the goals and implementation of the product changed since you first started working on it? What is the workflow at the individual and organizational level for businesses that are using AlignAI? One of the perennial challenges with knowledge sharing in an organization is managing incentives to engage with the available material. What are some of the ways that you are working to integrate the creation and distribution of institutional knowledge into employees' day-to-day work? What are the most interesting, innovative, or unexpected ways that you have seen AlignAI used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on AlignAI? When is AlignAI the wrong choice? What do you have planned for the future of AlignAI? Contact Info LinkedIn (https://www.linkedin.com/in/rehganavon/) @RehganAvon (https://twitter.com/RehganAvon) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! 
Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links AlignAI (https://www.getalignai.com/) Sharepoint (https://en.wikipedia.org/wiki/SharePoint) Confluence (https://en.wikipedia.org/wiki/Confluence_(software)) GitHub (https://github.com/) Canva (https://www.canva.com/) Instructional Design (https://en.wikipedia.org/wiki/Instructional_design) Notion (https://www.notion.so/) Coda (https://coda.io/) Waterfall Design (https://en.wikipedia.org/wiki/Waterfall_model) dbt (https://www.getdbt.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/) Alteryx (https://www.alteryx.com/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams
29-12-2022
Summary With all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term improvements in your productivity that it provides. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show! Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. 
No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) today to book a demo with Datafold. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder (https://www.dataengineeringpodcast.com/rudder) Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill.
For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver (https://www.dataengineeringpodcast.com/upsolver) today and see for yourself how to avoid DAG hell. Your host is Tobias Macey and today I'm interviewing Vishal Singh about his experience building data products at Starburst Interview Introduction How did you get involved in the area of data management? Can you describe what your definition of a "data product" is? What are some of the different contexts in which the idea of a data product is applicable? How do the parameters of a data product change across those different contexts/consumers? What are some of the ways that you see the conversation around the purpose and practice of building data products getting overloaded by conflicting objectives? What do you see as common challenges in data teams around how to approach product thinking in their day-to-day work? What are some of the tactical ways that product-oriented work on data problems differs from what has become common practice in data teams? What are some of the features that you are building at Starburst that contribute to the efforts of data teams to build full-featured product experiences for their data? What are the most interesting, innovative, or unexpected ways that you have seen Starburst used in the context of data products? What are the most interesting, unexpected, or challenging lessons that you have learned while working at Starburst? When is a data product the wrong choice? What do you have planned for the future of support for data product development at Starburst? Contact Info LinkedIn (https://www.linkedin.com/in/singhsvishal/) @vishal_singh (https://twitter.com/vishal_singh) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. 
Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Starburst (https://www.starburst.io/) Podcast Episode (https://www.dataengineeringpodcast.com/starburst-lakehouse-modern-data-architecture-episode-304/) Geophysics (https://en.wikipedia.org/wiki/Geophysics) Product-Led Growth (https://www.productled.org/foundations/what-is-product-led-growth) Trino (https://trino.io/) DataNova (https://www.starburst.io/datanova/) Starburst Galaxy (https://www.starburst.io/platform/starburst-galaxy/) Tableau (https://www.tableau.com/) PowerBI (https://powerbi.microsoft.com/en-us/) Podcast Episode (https://www.dataengineeringpodcast.com/power-bi-business-intelligence-episode-154/) Metabase (https://www.metabase.com/) Podcast Episode (https://www.dataengineeringpodcast.com/metabase-with-sameer-al-sakran-episode-29/) Great Expectations (https://greatexpectations.io/) Podcast Episode (https://www.dataengineeringpodcast.com/great-expectations-technical-debt-data-pipeline-episode-117/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Simple And Scalable Encryption Of Data In Use For Analytics And Machine Learning With Opaque Systems
26-12-2022
Summary Encryption and security are critical elements in data analytics and machine learning applications. We have well developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show! Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. 
Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) today to book a demo with Datafold. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder (https://www.dataengineeringpodcast.com/rudder) Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free.
That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver (https://www.dataengineeringpodcast.com/upsolver) today and see for yourself how to avoid DAG hell. Your host is Tobias Macey and today I'm interviewing Rishabh Poddar about his work at Opaque Systems to enable secure analysis and machine learning on encrypted data Interview Introduction How did you get involved in the area of data management? Can you describe what you are building at Opaque Systems and the story behind it? What are the core problems related to security/privacy in data analytics and ML that organizations are struggling with? What do you see as the balance of internal vs. cross-organization applications for the solutions you are creating? Comparison with homomorphic encryption; validation and ongoing testing of security/privacy guarantees; performance impact of encryption overhead and how to mitigate it; UX aspects of not being able to view the underlying data; risks of information leakage from schema/meta information. Can you describe how the Opaque Systems platform is implemented? How have the design and scope of the product changed since you started working on it? Can you describe a typical workflow for a team or teams building an analytical process or ML project with your platform? What are some of the constraints in terms of data format/volume/variety that are introduced by working with it in the Opaque platform? How are you approaching the balance of maintaining the MC2 project against the product needs of the Opaque platform? What are the most interesting, innovative, or unexpected ways that you have seen the Opaque platform used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Opaque Systems/MC2? When is Opaque the wrong choice?
What do you have planned for the future of the Opaque platform? Contact Info LinkedIn (https://www.linkedin.com/in/rishabh-poddar/) Website (https://rishabhpoddar.com/) @Podcastinator (https://twitter.com/podcastinator) on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Opaque Systems (https://opaque.co/) UC Berkeley RISE Lab (https://rise.cs.berkeley.edu/) TLS (https://en.wikipedia.org/wiki/Transport_Layer_Security) MC² (https://mc2-project.github.io/) Homomorphic Encryption (https://en.wikipedia.org/wiki/Homomorphic_encryption) Secure Multi-Party Computation (https://en.wikipedia.org/wiki/Secure_multi-party_computation) Secure Enclaves (https://opaque.co/blog/what-are-secure-enclaves/) Differential Privacy (https://en.wikipedia.org/wiki/Differential_privacy) Data Obfuscation (https://en.wikipedia.org/wiki/Data_masking) AES == Advanced Encryption Standard (https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) Intel SGX (Software Guard Extensions) (https://www.intel.com/content/www/us/en/developer/tools/software-guard-extensions/overview.html) Intel TDX (Trust Domain Extensions) (https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html) TPC-H Benchmark (https://www.tpc.org/tpch/) Spark (https://spark.apache.org/) Trino (https://trino.io/) PyTorch (https://pytorch.org/) Tensorflow (https://www.tensorflow.org/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
An Exploration Of Tobias' Experience In Building A Data Lakehouse From Scratch
26-12-2022
Summary Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show! Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. 
Go to dataengineeringpodcast.com/atlan (https://www.dataengineeringpodcast.com/atlan) today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos. Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more. Your host is Tobias Macey and today I'm being interviewed by Scott Hirleman about my work on the podcasts and my experience building a data platform Interview Introduction How did you get involved in the area of data management? Data platform building journey Why are you building, who are the users/use cases How to focus on doing what matters over cool tools How to build a good UX Anything surprising or did you discover anything you didn't expect at the start How to build so it's modular and can be improved in the future General build vs buy and vendor selection process Obviously have a good BS detector - how can others build theirs So many tools, where do you start - capability need, vendor suite offering, etc. 
Anything surprising in doing much of this at once How do you think about TCO in build versus buy Any advice Guest call out Be brave, believe you are good enough to be on the show Look at past episodes and don't pitch the same as what's been on recently And vendors, be smart, work with your customers to come up with a good pitch for them as guests... Tobias' advice and learnings from building out a data platform: Advice: when considering a tool, start from what you are actually trying to do. Yes, everyone has tools they want to use because they are cool (or some resume-driven development). Once you have a potential tool, is the capability you want to use an unloved feature or a main part of the product? If it's a feature, will they give it the care and attention it needs? Advice: lean heavily on open source. You can fix things yourself and better direct the community's work than just filing a ticket and hoping with a vendor. Learning: there are likely going to be some painful pieces missing, especially around metadata, as you build out your platform. Advice: build in a modular way and think of what is my escape hatch? Yes, you have to lock yourself in a bit but build with the possibility of a vendor or a tool going away - whether that is your choice (e.g. too expensive) or it literally disappears (anyone remember FoundationDB?). Learning: be prepared for tools to connect with each other but the connection to not be as robust as you want. Again, be prepared to have metadata challenges especially. Advice: build your foundation to be strong. This will limit pain as things evolve and change. You can't build a large building on a bad foundation - or at least it's a BAD idea... Advice: spend the time to work with your data consumers to figure out what questions they want to answer. Then abstract that to build to general challenges instead of point solutions. Learning: it's easy to put data in S3 but it can be painfully difficult to query it. 
There's a missing piece as to how to store it for easy querying, not just the metadata issues. Advice: it's okay to pay a vendor to lessen pain. But becoming wholly reliant on them can put you in a bad spot. Advice: look to create paved path / easy path approaches. If someone wants to follow the preset path, it's easy for them. If they want to go their own way, more power to them, but not the data platform team's problem if it isn't working well. Learning: there will be places you didn't expect to bend - again, that metadata layer for Tobias - to get things done sooner. It's okay to not have the end platform built at launch, move forward and get something going. Advice: "one of the perennial problems in technology is the bias towards speed and action without necessarily understanding the destination." Really consider the path and if you are creating a scalable and maintainable solution instead of pushing for speed to deliver something. Advice: consider building a buffer layer between upstream sources so if there are changes, it doesn't automatically break things downstream. Tobias' data platform components: data lakehouse paradigm, Airbyte for data integration (chosen over Meltano), Trino/Starburst Galaxy for distributed querying, AWS S3 for the storage layer, AWS Glue for very basic metadata cataloguing, Dagster as the crucial orchestration layer, dbt Contact Info LinkedIn (https://www.linkedin.com/in/scotthirleman/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. 
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Data Mesh Community (https://datameshlearning.com/community/) Podcast (https://www.linkedin.com/company/80887002/admin/) OSI Model (https://en.wikipedia.org/wiki/OSI_model) Schemata (https://schemata.app/) Podcast Episode (https://www.dataengineeringpodcast.com/schemata-schema-compatibility-utility-episode-324/) Atlan (https://atlan.com/) Podcast Episode (https://www.dataengineeringpodcast.com/atlan-data-team-collaboration-episode-179/) OpenMetadata (https://open-metadata.org/) Podcast Episode (https://www.dataengineeringpodcast.com/openmetadata-universal-metadata-layer-episode-237/) Chris Riccomini (https://daappod.com/data-mesh-radio/devops-for-data-mesh-chris-riccomini/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
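The advice in this episode about placing a buffer layer between upstream sources and downstream consumers can be pictured with a small sketch. This is only an illustration under stated assumptions, not how Tobias' platform is built: the field names, the `FIELD_MAP` table, and the `to_stable` helper are all hypothetical. The idea is that downstream code only ever sees a stable set of names, so an upstream rename is absorbed in one place.

```python
from typing import Any

# Hypothetical mapping from upstream field names (which may change without
# notice) to the stable names that downstream consumers rely on.
FIELD_MAP = {
    "cust_id": "customer_id",
    "customerId": "customer_id",
    "amt": "amount",
    "amount": "amount",
}

# Hypothetical set of fields every stable record must carry.
REQUIRED_FIELDS = ("customer_id", "amount")


def to_stable(raw: dict[str, Any]) -> dict[str, Any]:
    """Translate one upstream record into the stable downstream shape."""
    stable: dict[str, Any] = {}
    for key, value in raw.items():
        target = FIELD_MAP.get(key)
        if target is not None:  # unknown upstream fields are dropped here
            stable[target] = value
    missing = [name for name in REQUIRED_FIELDS if name not in stable]
    if missing:
        # Fail at the buffer layer instead of somewhere downstream.
        raise ValueError(f"upstream record is missing fields: {missing}")
    return stable


old_shape = to_stable({"cust_id": 7, "amt": 3.5, "debug_flag": True})
new_shape = to_stable({"customerId": 7, "amount": 3.5})
```

Both calls yield the same stable record even though the upstream field names differ, which is the property the buffer layer is meant to provide.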
Making Sense Of The Technical And Organizational Considerations Of Data Contracts
19-12-2022
Summary One of the reasons that data work is so challenging is because no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing these constraints to your data workflows. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show! Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. 
Go to dataengineeringpodcast.com/atlan (https://www.dataengineeringpodcast.com/atlan) today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos. Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more. Your host is Tobias Macey and today I'm interviewing Abe Gong about the technical and organizational implementation of data contracts Interview Introduction How did you get involved in the area of data management? Can you describe what your conception of a data contract is? What are some of the ways that you have seen them implemented? How has your work on Great Expectations influenced your thinking on the strategic and tactical aspects of adopting/implementing data contracts in a given team/organization? What does the negotiation process look like for identifying what needs to be included in a contract? What are the interfaces/integration points where data contracts are most useful/necessary? 
What are the discussions that need to happen when deciding when/whether a contract "violation" is a blocking action vs. issuing a notification? At what level of detail/granularity are contracts most helpful? At the technical level, what does the implementation/integration/deployment of a contract look like? What are the most interesting, innovative, or unexpected ways that you have seen data contracts used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts/Great Expectations? When are data contracts the wrong choice? What do you have planned for the future of data contracts in Great Expectations? Contact Info LinkedIn (https://www.linkedin.com/in/abe-gong-8a77034/) @AbeGong (https://twitter.com/AbeGong) on Twitter Website (https://www.abegong.com/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Great Expectations (https://www.abegong.com/) Podcast Episode (https://www.dataengineeringpodcast.com/great-expectations-technical-debt-data-pipeline-episode-117/) Progressive Typing (https://en.wikipedia.org/wiki/Gradual_typing) Pioneers, Settlers, Town Planners (https://blog.gardeviance.org/2015/03/on-pioneers-settlers-town-planners-and.html) Pydantic (https://pydantic-docs.helpmanual.io/) Podcast.__init__ Episode (https://www.pythonpodcast.com/pydantic-data-validation-episode-263/) Typescript (https://www.typescriptlang.org/) Duck Typing (https://en.wikipedia.org/wiki/Duck_typing) Flyte (https://flyte.org/) Podcast Episode (https://www.dataengineeringpodcast.com/flyte-data-orchestration-machine-learning-episode-291/) Dagster (https://dagster.io/) Podcast Episode (https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309) Trino (https://trino.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
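One question raised in this episode, whether a contract "violation" should be a blocking action or only issue a notification, can be made concrete with a short sketch. It uses only the Python standard library and does not represent the Great Expectations API; the `Rule` class, the `enforce` function, and the example rules are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Rule:
    """One clause of a hypothetical data contract."""
    name: str
    check: Callable[[dict[str, Any]], bool]
    blocking: bool  # True: stop the pipeline; False: only notify


class ContractViolation(Exception):
    """Raised when a blocking rule fails."""


def enforce(record: dict[str, Any], rules: list[Rule]) -> list[str]:
    """Apply every rule; raise on blocking failures, collect the rest."""
    notifications: list[str] = []
    for rule in rules:
        if rule.check(record):
            continue
        if rule.blocking:
            raise ContractViolation(f"{rule.name} (record: {record!r})")
        notifications.append(rule.name)
    return notifications


# Hypothetical contract for an "orders" feed.
RULES = [
    Rule("order_id is present",
         lambda r: r.get("order_id") is not None, blocking=True),
    Rule("amount is a non-negative number",
         lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
         blocking=True),
    Rule("currency is a 3-letter code",
         lambda r: isinstance(r.get("currency"), str) and len(r["currency"]) == 3,
         blocking=False),
]

clean = enforce({"order_id": 1, "amount": 5.0, "currency": "USD"}, RULES)
soft_failure = enforce({"order_id": 1, "amount": 5.0, "currency": "dollars"}, RULES)
```

Which rules are marked `blocking` is the outcome of the negotiation between producer and consumer that the interview discusses; the code only records that decision.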
Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle
19-12-2022
Summary The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show! Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! 
Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) today to book a demo with Datafold. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder (https://www.dataengineeringpodcast.com/rudder) Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. 
For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver (https://www.dataengineeringpodcast.com/upsolver) today and see for yourself how to avoid DAG hell. Your host is Tobias Macey and today I'm interviewing Juan Sequeda and Tim Gasper about their views on the role of the data mesh paradigm for driving re-assessment of the foundational principles of data systems Interview Introduction How did you get involved in the area of data management? What are the areas of the data ecosystem where you see the most turmoil and confusion? The past couple of years have brought a lot of attention to the idea of the "modern data stack". How has that influenced the ways that your and your customers' teams think about what skills they need to be effective? The other topic that is introducing a lot of confusion and uncertainty is the "data mesh". How has that changed the ways that teams think about who is involved in the technical and design conversations around data in an organization? Now that we, as an industry, have reached a new generational inflection about how data is generated, processed, and used, what are some of the foundational principles that have proven their worth? What are some of the new lessons that are showing the greatest promise? data modeling data platform/infrastructure data collaboration data governance/security/privacy How does your work at data.world support these foundational practices? What are some of the ways that you work with your teams and customers to help them stay informed on industry practices? What is your process for understanding the balance between hype and reality as you encounter new ideas/technologies? What are some of the notable changes that have happened in the data.world product and market since I last had Bryon on the show in 2017? What are the most interesting, innovative, or unexpected ways that you have seen data.world used? 
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data.world? When is data.world the wrong choice? What do you have planned for the future of data.world? Contact Info Juan LinkedIn (https://www.linkedin.com/in/juansequeda/) @juansequeda (https://twitter.com/juansequeda) on Twitter Website (https://www.juansequeda.com/) Tim LinkedIn (https://www.linkedin.com/in/timgasper/) @TimGasper (https://twitter.com/TimGasper) on Twitter Website (https://www.timgasper.com/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links data.world (https://data.world/) Podcast Episode (https://www.dataengineeringpodcast.com/data-dot-world-with-bryon-jacob-episode-9/) Gartner Hype Cycle (https://www.gartner.com/en/information-technology/glossary/hype-cycle) Data Mesh (https://www.thoughtworks.com/en-us/what-we-do/data-and-ai/data-mesh) Modern Data Stack (https://tanay.substack.com/p/understanding-the-modern-data-stack) DataOps (https://en.wikipedia.org/wiki/DataOps) Data Observability (https://www.montecarlodata.com/blog-what-is-data-observability/) Data & AI Landscape (https://mattturck.com/data2021/) DataDog (https://www.datadoghq.com/) RDF == Resource Description Framework (https://en.wikipedia.org/wiki/Resource_Description_Framework) SPARQL (https://en.wikipedia.org/wiki/SPARQL) Moshe Vardi (https://en.wikipedia.org/wiki/Moshe_Vardi) Star Schema (https://en.wikipedia.org/wiki/Star_schema) Data Vault (https://en.wikipedia.org/wiki/Data_vault_modeling) Podcast Episode (https://www.dataengineeringpodcast.com/data-vault-data-modeling-episode-119/) BPMN == Business Process Modeling Notation (https://en.wikipedia.org/wiki/Business_Process_Model_and_Notation) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Convert Your Unstructured Data To Embedding Vectors For More Efficient Machine Learning With Towhee
12-12-2022
Preamble This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning. Summary Data is one of the core ingredients for machine learning, but the format in which it is understandable to humans is not a useful representation for models. Embedding vectors are a way to structure data in a way that is native to how models interpret and manipulate information. In this episode Frank Liu shares how the Towhee library simplifies the work of translating your unstructured data assets (e.g. images, audio, video, etc.) into embeddings that you can use efficiently for machine learning, and how it fits into your workflow for model development. Announcements Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery. Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. 
Go to themachinelearningpodcast.com/deepchecks today to get started! Your host is Tobias Macey and today I’m interviewing Frank Liu about how to use vector embeddings in your ML projects and how Towhee can reduce the effort involved Interview Introduction How did you get involved in machine learning? Can you describe what Towhee is and the story behind it? What is the problem that Towhee is aimed at solving? What are the elements of generating vector embeddings that pose the greatest challenge or require the most effort? Once you have an embedding, what are some of the ways that it might be used in a machine learning project? Are there any design considerations that need to be addressed in the form that an embedding takes and how it impacts the resultant model that relies on it? (whether for training or inference) Can you describe how the Towhee framework is implemented? What are some of the interesting engineering challenges that needed to be addressed? How have the design/goals/scope of the project shifted since it began? What is the workflow for someone using Towhee in the context of an ML project? What are some of the types of optimizations that you have incorporated into Towhee? What are some of the scaling considerations that users need to be aware of as they increase the volume or complexity of data that they are processing? What are some of the ways that using Towhee impacts the way a data scientist or ML engineer approaches the design and development of their model code? What are the interfaces available for integrating with and extending Towhee? What are the most interesting, innovative, or unexpected ways that you have seen Towhee used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Towhee? When is Towhee the wrong choice? What do you have planned for the future of Towhee? 
Contact Info LinkedIn fzliu on GitHub Website @frankzliu on Twitter Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Links Towhee Zilliz Milvus Data Engineering Podcast Episode Computer Vision Tensor Autoencoder Latent Space Diffusion Model HSL == Hue, Saturation, Lightness Weights and Biases The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
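The episode asks what can be done with an embedding once you have one. A common use is similarity search: items whose vectors point in similar directions are treated as similar. The sketch below illustrates that idea with cosine similarity in plain Python. It does not use or represent the Towhee API, and the three-dimensional vectors and file names are made up for illustration; real embeddings come from a model and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written embeddings standing in for model output.
EMBEDDINGS = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "kitten.jpg": [0.8, 0.2, 0.1],
    "truck.jpg": [0.0, 0.1, 0.9],
}


def most_similar(query: str, embeddings: dict[str, list[float]]) -> str:
    """Return the name of the item closest to `query`, excluding itself."""
    candidates = (name for name in embeddings if name != query)
    return max(
        candidates,
        key=lambda name: cosine_similarity(embeddings[query], embeddings[name]),
    )


nearest_to_cat = most_similar("cat.jpg", EMBEDDINGS)
```

A linear scan like this is fine for a handful of items; at larger volumes a vector database such as Milvus, listed in the links above, is the usual tool for the lookup step.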
Run Your Applications Worldwide Without Worrying About The Database With Planetscale
12-12-2022
Summary One of the most critical aspects of software projects is managing their data. Managing the operational concerns for your database can be complex and expensive, especially if you need to scale to large volumes of data, high traffic, or geographically distributed usage. Planetscale is a serverless option for your MySQL workloads that lets you focus on your applications without having to worry about managing the database or fight with differences between development and production. In this episode Nick van Wiggeren explains how the Planetscale platform is implemented, their strategies for balancing maintenance and improvements of the underlying Vitess project with their business goals, and how you can start using it today to free up the time you spend on database administration. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show! Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. 
Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. 
For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell. Your host is Tobias Macey and today I’m interviewing Nick van Wiggeren about Planetscale, a serverless and globally distributed MySQL database as a service. Interview Introduction How did you get involved in the area of data management? Can you describe what Planetscale is and the story behind it? What are the core problems that you are solving with the Planetscale platform? How might an engineering team address those challenges in the absence of Planetscale/Vitess? Can you describe how Planetscale is implemented? What are some of the addons that you have had to build on top of Vitess to make Planetscale What are the impacts that a serverless database has on the way teams approach their application/platform design and development? metrics exposed to help users optimize their usage What is your policy/philosophy for determining what capabilities to include in Vitess and what belongs in the Planetscale platform? What are the most interesting, innovative, or unexpected ways that you have seen Planetscale/Vitess used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Planetscale? When is Planetscale the wrong choice? What do you have planned for the future of Planetscale? Contact Info @nickvanwig on Twitter, LinkedIn, nickvanw on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers. Links Planetscale, Vitess, CNCF == Cloud Native Computing Foundation, Hadoop, OLTP == Online Transaction Processing, Galera, Yugabyte DB (Podcast Episode), CitusDB, MariaDB SkySQL (Podcast Episode), CockroachDB (Podcast Episode), NewSQL, AWS PrivateLink, Planetscale Connect, Segment (Podcast Episode), BigQuery. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA