1. What it is
A data flywheel exists when usage of the product produces proprietary data that materially improves the model that powers the product, and the improvement is observable enough that more usage follows. The benefit is a quality gap that compounds with deployment time, not with capital. A challenger entering at the leader's model quality finds the leader has already moved further by the time the challenger ships.
2. How it works
The mechanism has a sharp threshold — the good-enough gap. Below it, the flywheel is fragile and a competitor with sufficient capital can match the leader. Above it, the leader's feedback velocity outruns any plausible challenger's capital deployment. Most "data moats" never cross the threshold.
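The threshold dynamic can be sketched as a toy simulation. Everything here is an illustrative assumption, not a measured model: the leader's quality gain is proportional to usage, and usage only compounds once quality clears the good-enough bar, while the challenger's gain is a flat capital-funded rate.

```python
def simulate(leader_q: float, challenger_q: float, threshold: float,
             steps: int = 40) -> list[float]:
    """Toy dynamics, all coefficients assumed for illustration:
    above the threshold, usage scales with quality and the leader's
    gain scales with usage; the challenger improves at a flat,
    capital-funded rate regardless of quality."""
    gap = []
    for _ in range(steps):
        # usage only compounds once quality clears the good-enough bar
        usage = leader_q if leader_q >= threshold else 0.1
        leader_q += 0.05 * usage      # feedback-driven improvement
        challenger_q += 0.04          # capital-funded improvement, constant
        gap.append(leader_q - challenger_q)
    return gap

below = simulate(0.5, 0.5, threshold=1.0)   # leader stuck under the bar
above = simulate(1.2, 1.2, threshold=1.0)   # leader past the bar
```

Run below the threshold, the capital-funded challenger closes the gap every step; run above it, the leader's usage-coupled gain compounds and pulls away — which is the whole argument of the section in two parameter settings.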
The load-bearing input is privileged access: the leader's data must not also be accessible to the foundation-model layer that any competitor can buy from. When the underlying model layer is shared and the flywheel data is the only privileged input, the moat survives. When the foundation-model layer absorbs comparable data through its own training pipeline, the flywheel evaporates from above.
Five operating moves separate the flywheels that compound from the ones that stall:
- Pick a beachhead with high feedback velocity. Cursor turns the loop in seconds (accept/reject completions); Tesla in milliseconds (perception); construction-progress flywheels in days or weeks.
- Instrument the loop end-to-end. The data has to flow from the customer's use back to the training pipeline reliably enough that no actionable signal is dropped. Most flywheel pitches lose at the instrumentation step.
- Avoid the model-layer trap. If the moat depends on a foundation-model improvement the foundation-model lab will also ship, the firm has a leased capability, not a flywheel.
- Cross the good-enough gap deliberately. Capital is best spent reaching the threshold and defending it; spreading below the threshold across many use cases is how flywheels die in the cradle.
- Keep the data privileged. Customer-data terms, on-prem deployments, contractual restrictions on outbound training — all are levers to keep the corpus private to the leader.
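The instrumentation and privileged-data moves above can be made concrete. A minimal sketch, with all names hypothetical: every feedback event either reaches the training queue or is counted as dropped, so signal loss is measurable end-to-end, and the customer's contractual opt-in gates whether the corpus flows back to the model at all.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FeedbackEvent:
    """One turn of the loop: what the model proposed, what the user did."""
    suggestion_id: str
    accepted: bool
    context_hash: str   # dedup key only, never raw customer content
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class FeedbackBuffer:
    """End-to-end instrumentation: every event is either queued for
    training or counted as dropped, so nothing disappears silently."""
    def __init__(self, allow_training: bool):
        self.allow_training = allow_training  # customer's contractual opt-in
        self.queue: list[FeedbackEvent] = []
        self.dropped = 0

    def record(self, event: FeedbackEvent) -> None:
        if self.allow_training:
            self.queue.append(event)
        else:
            self.dropped += 1  # privacy-gated: a corpus, not a flywheel

opted_in = FeedbackBuffer(allow_training=True)
opted_in.record(FeedbackEvent("s1", accepted=True, context_hash="h1"))

gated = FeedbackBuffer(allow_training=False)
gated.record(FeedbackEvent("s2", accepted=False, context_hash="h2"))
```

The design choice worth noting is the `dropped` counter: a flywheel claim should be auditable, and counting the events that never reach the pipeline is what makes "no actionable signal is dropped" testable rather than aspirational.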
3. Canonical examples
- Tesla Autopilot — the canonical horizontal case. Real-world fleet data — edge cases, near-misses, novel road geometries — trains models that improve perception and planning.
- Glean — the enterprise knowledge graph matures over 12–18 months of real usage in a customer environment. Permission-aware retrieval quality compounds with corpus depth and query history.
- OpenSpace in AEC — a 75,000+ project corpus of visual-to-BIM comparisons. The vision model is replicable; the comparison corpus is not.
- Higharc in residential AEC — per-home design and pricing decisions captured as a byproduct of every builder's configurator session. The corpus is privileged because no foundation-model lab ever sees it.
4. How it fails
- Marginal data goes stale. The first 10,000 customers contributed novel signal; the next 100,000 contribute redundant signal. The curve flattens.
- Foundation-layer absorption. The frontier lab releases a model that closes the quality gap from below, and the leader's privileged corpus stops mattering for the user-visible quality bar.
- Threshold never crossed. The leader stalls below the good-enough gap, capital exhausts, a challenger with comparable capital catches up, and the segment commoditizes.
- Privacy or regulatory friction. Customer or regulator forbids the data flow back to the model. Without the loop, the moat is just a corpus — valuable, but not compounding.
- Distribution incumbent eats the flywheel. A capital-rich incumbent with embedded distribution (Microsoft Copilot inside M365) ships a “good-enough” version that prevents the flywheel from reaching scale.
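The staleness failure above is just diminishing returns on novel signal. A back-of-envelope sketch, assuming purely for illustration that the n-th customer contributes roughly 1/n of new signal, so cumulative novelty grows logarithmically:

```python
import math

def cumulative_novel_signal(customers: int) -> float:
    """Assumed harmonic model: the n-th customer adds ~1/n novel
    signal, so the running total grows like log(n)."""
    return math.log1p(customers)

first_10k = cumulative_novel_signal(10_000)
next_90k = cumulative_novel_signal(100_000) - cumulative_novel_signal(10_000)
```

Under this assumption the next 90,000 customers add roughly a quarter of the novel signal the first 10,000 did, even as the corpus grows 10x — the flattening curve the bullet describes.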
5. Key insights — the AEC platform-incumbent pattern
The two AEC software incumbents most often named as data-flywheel candidates — Autodesk and Procore — show, in mirror image, why the privileged-access assumption is what breaks the mechanism, and why most AEC data-flywheel claims are conditional rather than load-bearing.
Autodesk: the design-side artifact without the mechanism
Autodesk has the corpus by accumulation: decades of Revit, AutoCAD, and Civil 3D telemetry, an AECO segment that closed FY26 at $3.58B (+22% YoY), and Forma as the unified industry cloud after absorbing Autodesk Construction Cloud in March 2026. Every flywheel-shaped artifact has shipped or been announced: Project Bernini, Neural CAD foundation models for Buildings and Manufacturing, Tandem Insights, the APS metering reset, the $200M strategic stake in World Labs. On the surface, the flywheel is spinning.
The threshold has not been crossed because privileged access is unresolved. Autodesk has the corpus but does not have legal training rights at scale. The December 2025 ToS revision moved a half-step — clarifying that customers may train on their own data — but the symmetric question (can Autodesk train its own foundation models on customer cloud files by default?) is unresolved in the FY26 10-K and standard subscription terms. Until that question resolves in Autodesk's favor — opt-in contribution tiers, a curated public-good corpus partnership, or a procurement re-negotiation enterprise customers accept rather than litigate — the corpus sits on the leader's storage but does not flow back to the model.
Procore: execution-side artifact, more aggressive kit, same threshold problem
Procore holds the symmetric position. Where Autodesk owns design files, Procore owns the construction-execution data graph — daily logs, RFIs, submittals, observations, financial transactions, contract trails — across 17,623 organic customers and millions of projects. For an agentic AI predicting project risk, the execution corpus is structurally richer than Autodesk's design corpus.
Procore has assembled a more aggressive moat-construction kit on a tighter window: the September 2025 Developer Policy bans bulk export for commercial training; the Managed Marketplace replaces the open developer ecosystem; the Agentic API went GA in late March 2026 on consumption pricing; and the $168M Datagrid acquisition (Toric Labs) provides the agentic reasoning engine. The November 2025 operator-CEO transition (Tooey Courtemanche to Ajei Gopal, ex-Ansys) installs a public-software-CEO playbook for platform monetization.
And the threshold has still not been crossed, for the same reason. The Developer Policy bans third-party training but does not, in public-facing language, explicitly grant Procore training rights over its own customers' data. The Q3 2026 earnings call is the visible test — AI revenue contribution, and customer-built-agent adoption at Groundbreak 2026 (October), will tell us whether Procore crosses or stalls.
The pattern
Both incumbents have flywheel-shaped artifacts. Both have tightened the API perimeter within months of each other (Autodesk APS December 2025, Procore September 2025). Both face the same load-bearing assumption: customer-data training rights at scale. Neither has resolved it. The diagnostic question for any AEC data-flywheel claim is identical: show the privileged-access mechanism, dated, with the contractual or technical evidence. Without that, the flywheel is conditional — an artifact, not a moat. Most AEC vendors that look like they have data flywheels (OpenSpace, Buildots, Trunk Tools) are running the same artifact-without-mechanism play.
Visual: the threshold curve
Cross-references
Data Flywheel pairs with Agentic Workflow Lock-in (ch. 11) when the agent's learned context is itself one of the data inputs to the flywheel, and with Evaluator Judgment Power (ch. 12) when the flywheel is what funds the judgment-bearing entity's pricing power. Without an evaluator surface, a flywheel is leverage you cannot monetize in proportion to the value being created.
Sources: a16z, "The Empty Promise of Data Moats" (Casado/Bornstein, 2020) · a16z, "Trading Margin for Moat: Services-Led Growth" (2024) · Sequoia, "AI 50: AI Agents Move Beyond Chat" (2025) · Glean Series F coverage (June 2025) · OpenSpace customer-corpus claims (2024–2025 press).