Looking ahead to the future of computing and data infrastructure
Friday, January 1st 2021
As an early stage investor, I spend a lot of time speaking with founders, their target customers, employees at larger companies, academics, and my investor peers. While the purpose of these discussions varies, they seem to inevitably drift into some version of “what’s happening now, and what is likely to happen over the course of the next few years?” My ongoing fascination with how advancements in computing and data infrastructure impact the way we work tends to be the focus of these conversations. This post intends to provide a synthesis of the most impactful ideas from these conversations over the past year, and their influence on my go-forward thinking as an investor. I look forward to your thoughts, and hope you enjoy reading it.
I also want to thank Arjun Narayan, Ev Kontsevoy, Dave Cole, Mark Curphey, Manu Sharma, Justin Gage, and Matt Biilmann, among the many others whose ideas, feedback, and ongoing work contributed to this post. Now, without further ado, here is what I’m thinking about heading into ‘21.
The cloud data warehouse becomes the cloud data “platform”
BigQuery, Snowflake, Fivetran, and DBT have become the reference architecture for analytics. There is no simpler way to join disparate silos of data together for analysis today. However, I see things taking a different shape over the next few years. Now that all critical business data is centralized and readily accessible via SQL, the logical next step is to use this foundation to build full-featured applications that both read and write to the warehouse. This pattern is already playing out within the most forward-leaning companies, and I suspect it will accelerate into a more common architectural pattern over the next 12 months. We will also see more start-ups attempting to reinvent existing categories like marketing automation, user event analytics, and customer support by building directly on top of the warehouse.
There are two interesting consequences of this trend. The first is the need for cloud data warehouses to ingest and process streaming data. Most analytics pipelines today are batch-based, running once a day, rather than real-time. As this data starts being used to power user-facing experiences, the value of reducing latency increases dramatically. While stale internal dashboards can be acceptable, stale customer-facing data is not. This will drive investment across all stages of the data pipeline to make them real-time. The second consequence is a democratization of data - moving from specialized tools and languages to a simple SQL query layer means that any employee who can write a SQL query will also be able to build full-fledged applications and customer-facing experiences. I don’t see this trend reversing, and after some questions about its applicability to the “big data” era, SQL is firmly in place as the lingua franca for data for the next decade.
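To make the pattern concrete, here is a toy sketch of an application reading from the warehouse with nothing but SQL. SQLite stands in for a cloud warehouse like BigQuery or Snowflake, and the table and column names are invented for illustration:

```python
import sqlite3

# Toy sketch: SQLite stands in for a cloud data warehouse; in practice
# the same query would run against BigQuery or Snowflake.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders  (customer_id INTEGER, amount REAL);
    CREATE TABLE tickets (customer_id INTEGER, status TEXT);
    INSERT INTO orders  VALUES (1, 120.0), (1, 80.0), (2, 45.0);
    INSERT INTO tickets VALUES (1, 'open'), (2, 'closed');
""")

# One SQL query joins two formerly siloed datasets to power a
# customer-facing "account health" view -- no specialized tooling needed.
row = conn.execute("""
    SELECT o.customer_id, SUM(o.amount) AS total_spend, t.status
    FROM orders o JOIN tickets t ON o.customer_id = t.customer_id
    WHERE o.customer_id = 1
    GROUP BY o.customer_id, t.status
""").fetchone()
print(row)  # (1, 200.0, 'open')
```

Anyone who can write the query above can, with the warehouse as the backend, ship the experience it powers.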
The cloud goes serverless
Cloud providers are correctly viewed as the single biggest shift in the history of computing infrastructure. AWS spent the last decade pitching the benefits of their infrastructure relative to on-premise infrastructure. The next decade of infrastructure will be built from the ground up to take full advantage of the architectural benefits of the cloud, and will start to pose a strategic threat to the cloud providers themselves. These services will “rent” scale from the providers, and completely abstract away the complexity of managing infrastructure from their customers. In doing so, they will also benefit from the ability to allocate more engineering resources to higher-level technical differentiation and UX. The best of these services will be serverless (defined as costing zero when not being used), making it incredibly cheap for customers to get started without paying high upfront costs.
Snowflake serves as a mature example. They built the first cloud data warehouse to scale storage and compute independently, taking advantage of the possibilities of AWS EC2 and S3 (and their equivalents in Azure and GCP). Netlify* is a serverless hosting platform built on top of AWS CloudFront, S3, and their equivalents. Instead of building and operating content delivery networks and globally-distributed object stores, they differentiate on developer ergonomics and multi-cloud availability.
These companies pose a real threat to the margins of the cloud providers themselves, as they compete directly against high-margin cloud provider products (just as Snowflake competes with AWS Redshift). At the same time, these companies fundamentally increase cloud adoption! Carefully navigating this frenemy relationship will be key to building successful businesses, and as Snowflake has shown, can be category-defining at scale.
I am excited about the potential of this space, and its ability to yield large, category-defining companies. **Serverless infrastructure players will continue to compete with the cloud providers on the dimension where the cloud providers are weakest (UX and ergonomics), are uniquely positioned to enable “multi-cloud,” and can allocate the majority of their R&D resources toward differentiating features and technology.** However, this model is not without challenges. While fending off stiff competition from the cloud providers, they must also ensure their unit economics remain healthy, net of the “rent” they must pay in exchange for building on their APIs. Threading this needle is not easy. It requires each to maintain a differentiated technical position against three larger and better-funded adversaries, build vibrant user communities that translate to mindshare, and provide a superior customer experience relative to what the cloud providers can offer. While this may seem like an uphill climb for a startup, companies like Snowflake are proof that it is possible. Now that employees from these household names are leaving to join and found new companies that will undoubtedly benefit from this experience, I believe we are still “rounding first base” in terms of our understanding of what kinds of businesses can be built on top of the cloud platforms.
Data becomes the new endpoint in security
More companies are software companies than ever before, and this has pushed the value of data to new heights. Unfortunately, the rate of data breaches has trended similarly, with a record 2,935 such incidents reported through Q3’20. At the same time, governments around the world are passing and enforcing regulations meant to incentivize companies to enact more rigorous data security controls. If our data is so valuable, and the stakes are so high, why are we struggling so mightily to protect it? Historical attempts at securing data have focused on protecting the infrastructure that persists and provides access to it — endpoints. This made sense when companies physically owned and managed each server and laptop in use. In this construct, endpoints are a much easier “asset” to reason about than data. However, this approach becomes problematic as infrastructure shifts to cloud services. Endpoints are highly ephemeral in the cloud, making it difficult to instrument good security controls around them. This is exacerbated by the need to give developers and data scientists near-unfettered access to data to ensure their productivity. When an attacker gets inside a corporate network or gains control of an employee’s laptop, they are often looking for data. The breaches showing up in the news on a near-weekly basis all seem to be rooted in the same problem - a lack of awareness of what data an organization has, where it is, and who has access to it.
Protecting endpoints must evolve to protecting data. In the past decade, incumbent vendors have mostly attempted to extend their endpoint- and network-based approaches to the cloud. This decade will be defined by security products that are architected from the ground up around identity and data-awareness. In practice, this is the ability to authenticate a user or service’s identity, and authorize access to the right data assets based on that identity. While companies like Okta, Duo, and Auth0 offer a glimpse into what identity-aware security looks like, the canonical approach to data security in the cloud-era is still being defined. Regardless of where we land, at the core of the “winning” approach will be the ability to discover and classify sensitive data. Recent improvements in NLP model performance and scalability have made it feasible to do this from both a scale and cost standpoint. Historically, this kind of visibility could only be achieved by deploying agents on each endpoint. Today, this data is available via API calls to SaaS and cloud infrastructure providers, among other equally “lightweight” techniques. This has paved the way for the holy grail of “agentless security,” and should accelerate our move to a world in which we have the tools at our disposal to focus entirely on securing our most precious resource — data.
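As a rough illustration of the discover-and-classify step, here is a deliberately simplified sketch. Real products use NLP models rather than the toy regexes below; the patterns and labels are assumptions for illustration only:

```python
import re

# Toy data classifier: scan text fields for patterns that look like
# sensitive data. Production systems use trained NLP models; these
# regexes are a deliberately simplified stand-in.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify(text):
    """Return the set of sensitive-data labels found in a text field."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}

print(classify("Contact jane@example.com, SSN 123-45-6789"))  # contains 'email' and 'ssn'
```

Run over records pulled via a SaaS provider’s API, a pass like this yields the inventory of what sensitive data exists and where - without deploying an agent anywhere.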
More business processes will be written as code, and treated as such
On-demand access to virtual infrastructure has opened up all kinds of new possibilities to developers. Instead of a monolithic application running on a single physical server, we now build applications that consist of microservices distributed across many virtual servers. Instead of writing our own plumbing and middleware code, we turn to 3rd party APIs and SaaS. While I could go on with examples, the point is the software we are developing is more distributed and heterogeneous than ever. As a result, stitching it together so each piece can be orchestrated in concert has become critically important, and far more difficult. Our historic reliance on static scripts, wikis, and configuration files as the source of truth for how this orchestration happens has run its course. Given the nature of modern applications and cloud infrastructure, it’s simply too difficult to keep everything in sync this way.
In response, we have seen the most repetitive, manual, and error-prone engineering workflows such as infrastructure provisioning (Terraform), data pipeline orchestration (Airflow, Dagster), database migrations (Prisma*), and security configuration management (Vault, Open Policy Agent) go the way of “code.” In this context, “code” means replacing scripts and configuration files with declarative business logic for how a task is carried out. Instead of requiring the owner of the script to have an understanding of the underlying infrastructure upon which the task is being executed, these frameworks and technologies abstract that complexity from the end-user, and maintain the necessary integrations. Instead of forcing the user to update scripts manually when a workflow changes, the framework updates them dynamically and keeps them in-sync. Instead of relying on a failure to signal a workflow is broken, these frameworks run continuous unit tests. The logic behind these workflows can now be versioned, tested, shared, secured, and executed “as code.” This unlocks massive gains in efficiency, scalability, and reliability for each orchestration task.
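A minimal, framework-agnostic sketch of the pattern (not any particular tool’s API): the workflow is declared as plain data, so it can be versioned, diffed in code review, and validated before anything runs:

```python
# "As code" in miniature: the workflow is declarative data, not an
# imperative script. Step names here are invented for illustration.
WORKFLOW = [
    {"name": "extract", "depends_on": []},
    {"name": "transform", "depends_on": ["extract"]},
    {"name": "load", "depends_on": ["transform"]},
]

def validate(workflow):
    """Fail fast if a step depends on something that doesn't exist --
    the kind of check CI can run on every proposed change."""
    names = {step["name"] for step in workflow}
    for step in workflow:
        for dep in step["depends_on"]:
            if dep not in names:
                raise ValueError(f"{step['name']} depends on unknown step {dep!r}")
    return True

assert validate(WORKFLOW)
```

Because the declaration is just code, a broken dependency is caught at review time rather than discovered as a 3am pipeline failure.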
Looking ahead, what I find most exciting about this development is the clear potential to apply “as code” to workflows in non-technical business functions. An example that comes to mind in marketing is how leads are captured on a company’s website. Once a visitor provides information, a series of tasks is triggered in response. Add to Salesforce, email white paper, update lead nurture list, etc. Updates to such a website could be happening daily, if not more often. If its main purpose is to serve as a vehicle for capturing leads, what happens if a change to the website “breaks” the lead capture flow, and no one notices until the next day, when the pipeline report shows that new leads have plummeted? Now, suppose this hypothetical lead capture workflow moved from cron, or something more sophisticated like Zapier, to the “as code” construct. Upon deploying a new version of the company’s site, a unit test would have flagged the broken workflow, and prevented the deploy from happening until the test passed. This is one of many similar examples that come to mind in areas like compliance, sales, customer support, and finance. Modern companies are software companies, and it seems both logical and inevitable that the best practices of software development will permeate the rest of the business in this sense.
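Sketching that hypothetical lead-capture workflow as code, with a unit test a CI pipeline could run on every site deploy (all names below are invented stand-ins, not real APIs):

```python
# Hypothetical lead-capture workflow "as code". The steps stand in for
# real integrations (add to Salesforce, email white paper, etc.).
def capture_lead(form_fields, steps):
    """Validate a submitted form, then run each workflow step against it."""
    if "email" not in form_fields:
        raise ValueError("lead form no longer collects an email address")
    return [step(form_fields) for step in steps]

def test_capture_lead():
    """Run on every deploy: if a website change breaks the lead form,
    the deploy fails here instead of in tomorrow's pipeline report."""
    fake_form = {"email": "jane@example.com", "company": "Acme"}
    steps = [lambda lead: f"added {lead['email']} to CRM"]
    assert capture_lead(fake_form, steps) == ["added jane@example.com to CRM"]

test_capture_lead()
```

The marketing team’s workflow now gets the same safety net as the application code it depends on.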
The ML infrastructure value-chain will undergo a phase of consolidation
A quick glance at one of the many machine-learning infrastructure tooling landscapes published in the past year such as this one leaves one with a feeling that is best described as 🤯. Training data management, model training, feature stores, feature serving pipelines, model deployment, model performance management, and model explainability are being positioned as distinct new categories by startups whose commercial ambitions require it. Borrowing from Moore’s “Crossing the Chasm” framework, I believe we are in the midst of the “early majority” adoption phase in machine learning infrastructure. In contrast to the “innovator” and “early adopter” groups, Moore describes the early majority as pragmatists in search of a comprehensive and economical solution to their problem, and most interested in buying that solution from a market leader. Unlike innovators and early adopters, the early majority is not interested in adopting technologies because they are “new,” nor do they care to stomach the risk of being first.
The biggest issue I see in the machine learning infrastructure value chain today is the growing supply-demand imbalance that exists between the “early majority” group of ML users, and the lack of end-to-end solutions that conform to their needs. Said another way, it appears unlikely to me that the growing ecosystem of point solutions will all be meaningfully adopted by the customer segment that tends to anoint the market leader. While this may not bode well for the explosion of venture-backed startups operating in this space, it also presents a clear opportunity for consolidation. I believe the “winners” in machine learning infrastructure have already been founded. This group of companies chose to focus on a part of the value chain that is strategic enough to be a source of advantage as they broaden their focus to give the “early majority” segment of the market the form-factor of solution they are accustomed to buying. These products will be able to support the entirety of the machine learning lifecycle, from training data collection and annotation, to deploying and operating models in production, to integrating them with the applications through which we benefit from their predictions.
So who is best positioned to do this? Is it a large company capable of M&A? Or is it a startup whose initial focus is considered foundational enough to the early majority that they will be receptive to additional functionality being layered in over time? My bet is on the latter. More specifically, my bet is on the companies who chose to focus on data-intensive, rather than compute-intensive problems. Data-intensive areas are tasks like training data management and feature-serving. Solving such problems requires a customer to persist significant amounts of data with the product they are using, which I have learned to be an extremely “sticky” position to be in as a startup. This comes back to the notion of data gravity. Would I prefer to send my training data to a third party to train my model, or train my model within the same platform upon which I store the data? Would I prefer to deploy my model from the same platform upon which I will be serving it features, or a separate one? Data has gravity, making it ideal to carry out computations on data as close to where it lives as possible. Thus, I am excited about ML infrastructure companies whose products benefit from data gravity, and have an advantaged path to delivering the end-to-end solution that the market is ultimately ripe for.
If you are thinking about or building products that touch on any of these ideas, I would love to hear from you.
*indicates KP portfolio company