The Data Opportunity: the Emergence of a New Layer in Modern Enterprise
Data is changing the way businesses are built and run. Most CEOs and business leaders agree that future success increasingly relies on their ability to turn data into a differentiating advantage. Yet more than 70% of enterprises still lag in their ability to create value from data.
At the same time, extracting value from data is more important and more challenging than ever. Over the past decades, enterprise data has evolved from limited data, monolithic systems, expensive storage and constrained use cases to big data, increasingly decentralised systems, and cheaper, larger and faster storage, while “data intelligence” and BI have expanded across organisations. We are now transitioning towards a wave of ML/AI-driven intelligence and automation, underpinned by higher-level, more complex data needs such as real-time actionability and stricter privacy oversight.
In this new wave, it is the data — rather than code — and its workflows that drive the system’s output and performance. For enterprises to maximise insights and value from their data — whether it’s traditional software or AI/ML software — the underlying data infrastructure and tooling will need to evolve. We believe this evolution will fuel the rise of a new generation of data infrastructure and tooling companies that enable enterprises to create value with their data. We refer to this as “the new data layer.”
We see this new data layer as a major shift in the modern enterprise, with the potential to outgrow “code” by orders of magnitude and create several multi-billion dollar categories over the next decade.
As a Europe-focused investor, we believe this region’s focus on data offers fertile ground for innovation in the new data layer and an opportunity to build on the emerging successes of companies like Collibra, Dataiku and Privitar.
What is driving the rise of the data layer?
The modern data stack is not new, but the past years have accelerated its development, underpinned by tailwinds that will continue to play out over years to come:
- Growth in data volume, velocity and complexity, driven by increasing data collection and generation across multiple data types, channels and devices (embedded, IoT); increased digitisation across more industries and adoption of AI/ML; and an increasingly heterogeneous user base, including “data citizen” users expanding data inputs. It is estimated that 175 zettabytes of data will be generated annually by 2025.
- A growing role of enterprises as ‘data stewards’ as businesses centralise customer and business data: it’s estimated that in 2019 more data was stored in the enterprise core than in all the world’s existing endpoints.
- Continued mass migration to the cloud, and knowledge centralisation with the cloud as the new data “core”
- Increased penetration of AI/ML, including outside the realm of ‘AI-native’ companies
- Increasing business imperatives for low latency data insights as enterprises want systems to be always “on,” tracking, monitoring, listening, watching, learning, enabling real-time actionability
- Increase in DataOps, with 70%+ of companies planning to grow their DataOps teams
- Scarcity of data talent, including data engineers, data scientists and ML/AI engineers, and evolution of data team organisational structures; demand for data engineers increased by 50% and for data scientists by 32% in 2019.
- Data leaders entering the C-suite; 70% of Fortune 500 had a Chief Data Officer in 2018, up from only 12% in 2012.
- Democratisation of data and its use cases to a broader set of users and “data citizens” across the business
- Data privacy risks, as breaches continue to rise (4 billion records exposed in H1 2019 alone) and GDPR-like regulation is adopted (e.g. CCPA in California, LGPD in Brazil); 50% of the world’s population is expected to be covered by this type of regulation by 2022.
The evolving data stack
These rapidly evolving needs have fuelled a proliferation and fragmentation of data tooling and infrastructure across all layers (collection, storage, transformation, analysis etc). It is a dynamic, yet relatively nascent (and thus messy) space. Category boundaries are blurred or only being defined, most tools haven’t reached critical mass (often not even in a niche), and the use cases and overlaps between solutions (what the tools do and what they don’t do) are not yet obvious.
This fragmentation creates challenges for everyone involved: users and enterprises choosing tooling for their data stack, founders deciding on positioning, go-to-market and product evolution, and investors trying to assess which tool will become a breakout company.
For more detailed data architectures, a16z has a great piece here.
Where does it hurt: a user view
We invited CTOs, data engineers, ML engineers and data scientists from Atomico portfolio companies to a series of data roundtables to get insights on their challenges. Participant companies ranged from scaleups to early-stage companies, and from BI use cases to AI/ML software.
A few high-level themes emerged from these discussions:
- Data is increasingly seen as a strategic asset. Cloud data storage allows for more standardization and collaboration and could become a system of record.
- The infrastructure necessary for cloud migration creates an opportunity (and often an imperative) to rethink the architecture and leverage it as a strategic capability.
- A proliferation of tools creates confusion about what each tool can and cannot do and how they work together, and forces manual tooling to link them.
- There is a proliferation of data stacks even within a single company as different users, business units, etc build their own stacks.
- There is often data segregation between the “customer facing” teams and the “data platform” teams.
- “Build” versus “buy” has a number of trade-offs, but the former leads to a number of DIY solutions that don’t scale.
Below is more detail on the challenges raised during these conversations.
These issues hinder companies’ ability to rapidly extract value from data. There’s no point in data just being stored; time-to-insight is key, and these issues extend time-to-insight significantly.
Opportunity is everywhere
There are exciting areas of opportunity throughout the new data layer. Here are a few (non-exhaustive) ones we have been thinking of; we know that there will be many others that we haven’t even contemplated today.
- Data “worker” productivity: 70–80% of a data scientist’s time is spent on tasks like collecting, cleaning and labelling data, or other engineering-type work. Most teams need to build many tools internally and then maintain them. Models are often built locally and offline, and the process of translating them to scale and online is painful. The tooling to standardise, govern and collaborate around ML data is nascent. It’s a multifaceted problem and thus an active area for building new solutions: from labelling and annotation platforms (e.g. Superannotate, V7 Labs) to data scientist productivity and collaboration platforms (e.g. Deepnote), to experiment automation and optimisation, to low-code app development (e.g. Streamlit).
- Data collaboration and democratisation (engineering <-> analytics): Given talent shortages and the need to integrate domain knowledge that sits with business users, we see an increasing need for solutions that provide interface-enabling collaboration between various data ‘actors’ in different areas of the stack and create a bridge to data consumers (e.g. Dataiku, Abacus.ai, Graphy).
- Low-latency analytics: A key promise of data, translating insights into relevant business decisions at the right time, relies on the ability to access and analyse data with low latency. Emerging approaches to enable this range from higher-performance databases to analytics outside the warehouse via a single point of access to data regardless of where it sits (e.g. Starburst).
- Privacy-enabling compute: Trade-offs between data usability and privacy, and restrictions that limit data sharing and the amount of data available to use, have led to several approaches to address these tensions, including differential privacy, encryption, secure multi-party computation, federated learning, synthetic data, and processing in secure environments (e.g. Inpher, MostlyAI, Evervault). This space is still early, and users tend to sit at the more sophisticated end of the data needs spectrum.
- Cloud data marketplaces and exchanges: Every company is a data company, and more players will soon aim to unlock this untapped value and find ways to monetise it. With collection and cleaning a significant pain point, there is also an opportunity for third parties to address the space. Companies like Snowflake are already positioning themselves to lay out the piping and become an obvious intermediary.
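To make one of the privacy-enabling techniques above concrete, here is a minimal, illustrative sketch of differential privacy’s Laplace mechanism applied to a counting query. The function names and data are our own for illustration, not drawn from any of the vendors mentioned:

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two independent Exponential(mean=scale) draws
    # follows a Laplace(0, scale) distribution.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon: float) -> float:
    # A counting query has sensitivity 1: adding or removing one record
    # changes the count by at most 1, so Laplace noise with scale
    # 1/epsilon gives epsilon-differential privacy for this query.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Example: publish an approximate count of users over 40 without
# any single individual's record being identifiable from the output.
ages = [23, 35, 41, 52, 29, 64, 47, 38]
noisy_count = dp_count(ages, lambda age: age > 40, epsilon=0.5)
```

A smaller epsilon means more noise and stronger privacy; the trade-off between usability and privacy mentioned above shows up directly in that one parameter.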
Pipeline and orchestration related
- Data quality & observability: While tools exist to monitor code and infra/apps (e.g. Datadog), data workflows are dealt with manually or with DIY solutions. Quality issues stem from across the stack: data sources, inconsistencies in unification and integration (e.g. database mergers, cloud integrations) and lack of coordination between users (e.g. different definitions). A new set of startups are trying to solve this problem in what is still a relatively white space (e.g. Soda, Monte Carlo, Datafold).
- Data lineage: Linked to quality, being able to trace data back through its ingestion, transformation and processing journey across its various use cases is key to ensuring the reliability, trust and therefore business value of data, as well as to GDPR compliance and effective changes, upgrades or system migrations. We expect to see more advanced automation tools in the space as the complexity and volume of data increase.
- Automation and operational applications: The data tools and stack today are used with an analytical objective — to generate insights — that informs human decision-making and execution. Increased standardization of data and use cases are also enabling automation of repetitive analytical tasks. As an extension of this, we believe a next stage opportunity lies in using the data and workflows with an operational objective: to generate an action by pushing it directly into operational systems and automating decisions and execution.
- End-to-end AI/ML platforms: As we enter the deployment phase of AI/ML in enterprise, we expect to see more AI platforms that help productize and operate ML systems in the data layer (e.g. Tecton).
- Customer data privacy tools: Frameworks like GDPR and CCPA give consumers the right to access or delete data collected on them, but companies do not have fast tools to meet these requests given the multitude of vendors, databases, and 3rd-party platforms. While early, some solutions are emerging here to manage consent and automate such requests (e.g. Transcend, Mine).
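As an illustration of the kind of workflow a data quality and observability tool automates, here is a minimal sketch of declarative, row-level checks over a feed; the field names and the check definitions are hypothetical:

```python
from datetime import datetime

def _parses_iso(ts):
    # True if the value is a parseable ISO-8601 timestamp.
    try:
        datetime.fromisoformat(ts)
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical checks for an "orders" feed, named so failures are readable.
CHECKS = {
    "amount is non-negative": lambda row: row.get("amount", -1) >= 0,
    "currency is present": lambda row: bool(row.get("currency")),
    "timestamp parses": lambda row: _parses_iso(row.get("ts")),
}

def run_checks(rows):
    # Returns failure counts per check: the raw material for alerting on
    # validity drift instead of discovering bad data in a dashboard.
    failures = {name: 0 for name in CHECKS}
    for row in rows:
        for name, check in CHECKS.items():
            if not check(row):
                failures[name] += 1
    return failures
```

Commercial tools in the space go well beyond this (freshness, volume and schema monitoring, anomaly detection, lineage-aware alerting), but the core loop of declaring expectations and counting violations is the same idea.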
As is natural for an evolving space, there are many active debates around future developments within the data layer: How will data warehouses and data lakes evolve, or more broadly, compute, storage and data? Will BI and ML workflows converge? What possibilities does complex data open up? How do we deal with the tension between democratisation of data and managing access, security and governance? How do we evolve towards the operationalisation of data into automated decisions and execution?
We look forward to participating in these debates and to seeing the opportunities they unlock as they play out over the next years.
Considerations for company builders in the data layer
To wrap up, we thought it might be helpful to share some of the specific themes we hear from founders building and scaling products in data.
1. Buyer personas and customer data “modernness”
Companies are still building out their organisational structures around data; the buying decision process and personas vary between companies and evolve constantly. Buyers of data tooling include CDOs, Heads of Data, Heads of Data Engineering, CTO / VP Engineering or even C level roles outside the data or engineering realms (compliance/governance, business heads etc).
Customer data “modernness” has a direct impact on buyer persona, product and go-to-market choices, and ultimately customer economics. Most data tools tend to initially target “tech-driven” enterprises with larger data teams, a modern data stack, and engineering-led decision-making. These companies are more likely to be early adopters, have shorter sales cycles, and have the data capabilities to understand and implement new tools. On the other side are the more “traditional” enterprises, with a lot of data and data use cases but nascent data teams and legacy stacks. They are often large companies and therefore can have high value per customer, but they require a top-down, multi-stakeholder go-to-market with longer sales cycles, backed by more service, customisation and integration with legacy systems, with potential impact on margin and speed of scaling.
2. User personas
At a very high level, users in the space fall into two categories: engineering users (e.g. data engineers, VP eng) and analytics users / data consumers (e.g. data scientists, data and BI analysts, even business users). Which of these two founders and teams choose to focus on will have implications for product development, GTM strategy and ACV potential. For instance, being able to tap into a bigger group of data ‘consumers,’ especially business users, can expand the value per customer, and/or be critical to building the product in cases where domain knowledge is necessary to deliver the solution, such as data quality. It has, however, implications for UI/UX, targeting and positioning.
3. Community and bottom-up discovery via OS
With the proliferation of data tools, a key challenge for new tools is discoverability and awareness. This implies a bottom-up, community-led approach, often via an open-source strategy, and is particularly relevant when the product is geared towards an engineering buyer persona. While some successful players in the space like Snowflake and Collibra relied on top-down sales (partly because of their buyer persona), many new data tooling plays have an OS strategy and maintain a Slack community. If community is an important lever, it requires setting up a deliberate strategy and dedicated capabilities early.
4. Community to commercialisation
A bottom-up go-to-market approach brings the known challenge of balancing community with monetisation: building enough differentiation and value-add into the commercial, enterprise-grade product versus the OS product, while ensuring transparency and the right expectation-setting for the community. And while growing the community is a natural initial objective, adoption needs to translate over time into lead conversion.
5. Market readiness and customer sweet spot
Most enterprises are still early in their data journey and focused on the bottom of the “pyramid” of data needs. Fewer companies are ready to adopt (and pay for) the more sophisticated end of data tooling, which requires a highly skilled data team to understand and implement. It’s worth understanding early the type and size of the customer segment that is ready to adopt, not only to develop the right value proposition and GTM, but also to plan scaling and capital requirements accordingly.
6. Product expansion and differentiation
The increasingly competitive and rapidly evolving stack makes it non-trivial for founders to decide where to expand the initial product from its “entry point.” Continued exploration and an in-depth understanding of customers’ pain points and willingness to pay (versus willingness to experiment) help in this expansion journey, combined with an understanding of where one’s team can build an unfair advantage, be it industry connectivity and insight, community, or differentiated technical know-how.
We are incredibly excited to see the next generation of data founders take on the opportunities ahead in the data layer. If you are building a company in this space, we would love to hear from you at [email protected] or [email protected].
Finally, thank you to Karim Fanous from Kheiron Medical for his input and leadership of our roundtables (he writes about data and AI here) and to all our other data roundtable contributors from Beekeeper, CloudNC, Healx, Kheiron Medical, Klarna, LabGenius, Peakon, PsiQuantum, Scandit, Teralytics and Varjo.