One gigabyte of data per grocery bag. That is what you get with robotic deliveries. It is a lot of data, especially when repeated over a million times, as we have.
But the rabbit hole goes deeper. The data is also incredibly diverse: robot sensor and image data, user interactions with apps, transaction data from orders, and more. The use cases are equally diverse, from training deep neural networks to creating sophisticated visualizations for our merchant partners, and everything in between.
So far, our centralized data team has handled all this complexity, but our exponential growth has prompted us to explore new ways of working to keep pace.
We found the data mesh paradigm to be the best fit. Below I’ll describe Starship’s take on the data mesh, but first a quick overview of the approach and why we chose to adopt it.
What is a data mesh?
The data mesh framework was originally formulated by Zhamak Dehghani. The paradigm is built on the core concepts of data products, data domains, data platforms, and data governance.
The main purpose of the data mesh framework is to help large organizations remove data engineering bottlenecks and deal with complexity. As such, it addresses many of the concerns of enterprise environments, from data quality, architecture, and security to governance and organizational structure. So far, only a few companies have publicly announced their adoption of the data mesh paradigm, all of them large, multi-billion-dollar companies. Nonetheless, we believe it can be successfully applied to smaller businesses as well.
Starship data mesh
Do data work in close proximity to the people who produce or consume the information.
Operating a hyperlocal robotic delivery marketplace around the world requires turning a wide variety of data into valuable products. Data comes in from robots (e.g. telemetry, routing decisions, ETAs), merchants and customers (apps, orders, offerings, etc.), and all operational aspects of the business (from simple remote operator tasks to the global logistics of spare parts and robots).
The diversity of use cases is the main reason we were attracted to the data mesh approach. We want to do data work in close proximity to the people who produce or consume the information. By following the data mesh principles, we aim to meet the diverse data needs of our teams while keeping central oversight to a moderate level.
Starship is not yet enterprise-scale, so implementing every aspect of the data mesh would be impractical. Instead, we settled on a simplified approach that makes sense for us today and puts us on the right path for the future.
Define what a data product is — each includes an owner, an interface, and a user
Applying product thinking to data is the foundation of the whole approach. Anything that exposes data to other users or processes is considered a data product. Data can be published in any format, including BI dashboards, Kafka topics, data warehouse views, and responses from predictive microservices.
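To make one of these formats concrete, a predictive microservice can be viewed as a data product whose interface is its request/response contract. The sketch below is purely illustrative: the endpoint, the feature names, and the version scheme are my assumptions, not Starship's actual code.

```python
import json

# Hypothetical ETA prediction "microservice" handler: the data product's
# interface is the JSON request/response contract that consumers depend on.
MODEL_VERSION = "eta-v3"  # invented versioning scheme, for illustration

def predict_eta(request_body: str) -> str:
    """Accept a JSON request with route features, return a JSON ETA response."""
    features = json.loads(request_body)
    # Placeholder model: a linear estimate from distance and average robot speed.
    distance_m = features["distance_m"]
    avg_speed_mps = features.get("avg_speed_mps", 1.5)
    eta_seconds = distance_m / avg_speed_mps
    return json.dumps({
        "eta_seconds": round(eta_seconds, 1),
        # The version field makes interface changes visible to consumers.
        "model_version": MODEL_VERSION,
    })
```

The point is not the model (here a trivial formula) but the contract: as long as the request and response shapes stay stable, the owner can evolve the internals freely.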
A simple example of Starship’s data product is a BI dashboard for site leads to track site transaction volume. A more complex example is a self-service pipeline for robot software engineers to send all kinds of driving information from robots to data lakes.
In any case, we don’t treat the data warehouse (actually a Databricks Lakehouse) as a single product, but as a platform supporting many interconnected products. These granular products are typically owned by the data scientists/engineers who build and maintain them rather than by dedicated product managers.
The product owner is expected to know who the users are and what needs they are solving with the product, and based on that, to define and meet the product’s quality expectations. Perhaps as a result, we put extra forethought into interfaces, the components that are critical for usability but cumbersome to change.
Most importantly, understanding your users and the value each product brings to them makes prioritizing ideas much easier. This is important in a startup context where you need to move quickly and don’t have time to make everything perfect.
Group data products into domains to reflect your company’s organizational structure
Before coming across the data mesh model, we had for some time been successfully using a lightly embedded data scientist setup at Starship. In effect, some key teams had data team members working with them part-time, whatever that meant for that particular team.
I proceeded to define the data domains according to the organizational structure, this time taking care to cover all parts of the company. After mapping data products to domains, we assigned a data team member to manage each domain. This person is responsible for curating the entire set of data products within the domain: some of them they own themselves, some are owned by other engineers on the domain team, and some by other data team members (for resourcing reasons, for example).
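To picture this mapping, here is a purely illustrative sketch of a domain registry; the domain names, product names, and owners are all invented, not Starship's actual setup.

```python
# Illustrative registry mapping each data domain to its responsible data
# team member and the data products grouped under it (all names invented).
DOMAINS = {
    "robot-telemetry": {
        "owner": "data-scientist-a",
        "products": ["driving-data-pipeline", "telemetry-warehouse-views"],
    },
    "merchant-analytics": {
        "owner": "data-scientist-b",
        "products": ["site-transaction-dashboard"],
    },
}

def owner_of(product: str) -> str:
    """Find the data team member responsible for a given data product."""
    for domain in DOMAINS.values():
        if product in domain["products"]:
            return domain["owner"]
    raise KeyError(f"no domain owns product {product!r}")
```

Even kept in something as simple as a shared document, this kind of mapping answers the recurring question "who owns this?" for every product in the company.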
There are a few things I like about our domain setup. First and foremost, every area of the company now has someone responsible for its data architecture. Given the nuances inherent in each domain, this is only possible because of the division of labor.
Creating structure around our data products and interfaces has also helped us better understand our data landscape. For example, with more domains than data team members (currently 19 versus 7), it has helped us ensure that each person works on a set of interrelated topics. We also realized that to alleviate growing pains, we need to minimize the number of interfaces used across domain boundaries.
Finally, a more subtle bonus of using data domains: they give us a recipe for tackling all sorts of new situations. Whenever a new initiative comes along, it is clearer to everyone where it belongs and who needs to carry it out.
There are also some open questions. Some domains naturally lean towards exposing most of their source data, others towards consuming and transforming it, and others have a fair amount of both. Should we split a domain if it grows too big, or nest subdomains within a larger one? Those are decisions we will have to make in the future.
Empower those who build data products by standardizing instead of centralizing
The goal of Starship’s data platform is simple: allow a single data person (usually a data scientist) to handle a domain end-to-end, without tying up a central data platform team in day-to-day work. To do that, we need to provide our domain engineers and data scientists with great tools and standard building blocks for their data products.
Does the data mesh approach require a full data platform team? Not necessarily. Our data platform team consists of one data platform engineer, who spends half of their time doing domain work in parallel. The main reason we can keep data platform engineering so lean is our choice of Spark + Databricks as the core of the platform. Given the diversity of our data domains, a traditional data warehouse architecture would impose significant data engineering overhead.
We have found it useful to draw a clear line in the data stack between components that are part of the platform and everything else. Examples of what we provide domain teams as part of our data platform:
- Databricks+Spark as a working environment and general-purpose computing platform.
- One-liner functions for ingesting data from Mongo collections, Kafka topics, etc.
- An Airflow instance for scheduling data pipelines.
- A template for building and deploying predictive models as microservices.
- Data product cost tracking.
- BI & visualization tools.
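As an illustration of the one-liner ingestion functions, a platform helper might look like the sketch below. The function name, connector format string, and options are assumptions on my part, not Starship's actual API.

```python
# Hypothetical platform helper: ingest a Mongo collection into a Delta table
# so that a domain data scientist only has to write a single line.
# The "mongodb" format string matches the MongoDB Spark Connector (v10+);
# everything else here is illustrative.
def ingest_mongo_collection(spark, database: str, collection: str, target_table: str) -> None:
    """Read one Mongo collection via the Spark connector and overwrite a Delta table."""
    (
        spark.read.format("mongodb")
        .option("database", database)
        .option("collection", collection)
        .load()
        .write.format("delta")
        .mode("overwrite")
        .saveAsTable(target_table)
    )
```

With helpers like this, a domain pipeline reduces to calls such as `ingest_mongo_collection(spark, "orders", "transactions", "raw.orders_transactions")`, and the platform team owns the connector details in one place.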
As a general approach, we aim to standardize as much as makes sense in the current context. As long as it makes us more productive today without centralizing any part of the process, we’re happy. For example, tools for data quality assurance, data discovery, and data lineage are things we leave for the future.
Strong personal ownership supported by feedback loops
Having fewer people and teams is actually an asset in some aspects of governance; for example, it makes decision-making much easier. On the other hand, our main governance issues are also a direct result of our size. With one data person per domain, that person cannot be expected to be an expert on every potential technical aspect, yet they are the only one who understands the domain in detail. How do we maximize their chances of making good choices within their domain?
Our answer is a culture of ownership, discussion, and feedback within the team. We have borrowed generously from Netflix’s culture philosophy and cultivate:
- Individual responsibility for results (of one’s own product and domain).
- Seeking a variety of opinions before making decisions, especially ones that affect other domains.
- Treating feedback and code reviews as both a quality mechanism and an opportunity for personal growth.
We also made some specific agreements on how to approach quality, best practices (including naming conventions), and more. However, we believe that the right feedback loop is a key factor in making the guidelines a reality.
These principles apply beyond the “building” work of the data team that is the focus of this blog post: how data scientists create value within a company clearly goes beyond just delivering data products.
A final thought on governance: we iterate on the way we work, knowing that there is no single “best” way to do things and that we will have to adapt over time.
The last word
That’s it! These are the four core data mesh concepts applied to Starship. As you can see, we’ve found an approach to the data mesh that works well for our stage of rapid growth. If that sounds appealing in your context, we hope reading about our experience has been helpful.
If you would like to join us in our work, please see our careers page for a list of open positions. Or check out our YouTube channel to learn more about the world’s leading robotic delivery service.
If you have any questions or comments, please contact me. Let’s learn from each other!