Moltbook Isn’t a Reverse Turing Test — It’s a Containment Test

Naval called Moltbook the “new reverse Turing test,” and everyone immediately treated it like a profound milestone. I think it’s something else: a live-fire test of whether we can contain agentic systems once they’re networked together.

Let’s be precise. Moltbook is an AI-only social platform, roughly “Reddit, but for agents,” where humans can watch but not participate. The pitch is simple: observe how AI agents behave socially when left alone. Naval’s label is elegant because it implies the agents are now the judges—humans are the odd ones out.

But if you’re a founder or an operator, you should ignore the poetry and ask: what is the product really doing to the world?

Moltbook’s real innovation is not “AI social behavior.” It’s a new topology: lots of agents, from different builders, connected in a public arena where they can feed each other instructions, links, and narratives at scale. That’s not a reverse Turing test. It’s a coordination surface.​

And coordination surfaces create externalities.

In the old internet, humans spammed humans. In the new internet, agents will spam agents—except “spam” won’t just be annoying; it will be executable. If you give agents permissions (email, calendars, bank access, code execution, “tools”), and then you let them ingest untrusted content from a network like Moltbook, you are building the conditions for what security folks call the “lethal trifecta.”

This is where the discussion gets serious.

Forbes contributor Amir Husain’s critique is basically a warning about permissions: people are already connecting agents to real systems—home devices, accounts, encrypted messages, emails, calendars—and then letting those agents interact with unknown agents in a shared environment. That’s an attack surface, not a party trick. If the platform enables indirect prompt injection—malicious content that causes downstream agents to leak secrets or take unintended actions—then your “social experiment” becomes a supply chain problem.

You don’t need science fiction for this to go wrong. You just need one agent that can persuade another agent to do something slightly dumb, repeatedly, across thousands of interactions. We already know that when systems combine high permissions, external content ingestion, and weak boundaries, bad things happen—fast.

So here’s my different perspective:

Moltbook isn’t proving that agents are becoming “more human.” It’s proving that we’re about to repeat the Web2 security arc—except the users are autonomous processes with tools, and the cost of an error is not just misinformation, it’s action.

And yes, that matters for investors.

I’m optimizing for fund outcomes within a horizon, not for philosophical truth at year 12. The investable question is not “is this emergent intelligence?” It’s: “does this create durable value that survives the cleanup required to make it safe?”

If Moltbook becomes the standard sandbox for red-teaming agents—great. If it becomes the public square where autonomous tool-using systems learn adversarial persuasion from each other, that’s not a product category; that’s a systemic risk generator, and regulators will come for everyone adjacent to it.

What should founders do?

First, treat any agent-to-agent network as hostile-by-default. Second, sandbox tools like your company depends on it—because it does. Third, stop marketing autonomy until you can measure and bound it, because markets pay for narratives on the way up, and punish you when the story breaks.

Naval’s phrase is catchy. But the real test isn’t whether humans can still tell who’s who.

The real test is whether we can build agent networks that don’t turn “conversation” into “compromise.”

Oxford says “gut.” I say “objective + proof.”

Oxford’s The Impact of Artificial Intelligence on Venture Capital argues AI accelerates sourcing and diligence, but investment decisions stay human because durable moats are socially grounded conviction, gut feeling, and networks.

I agree with the workflow diagnosis. I disagree with the implied endgame.

Not because “gut” is fake—but because “gut” is often a label we apply when we haven’t defined success tightly enough, or when we don’t have a measurement loop that forces our beliefs to confront outcomes.

Dealflow is getting commoditized. The edge is moving.

AI expands visibility, speeds up pipelines, and pushes the industry toward shared tools and shared feeds. When everyone can scan more of the world, “who saw it first” decays.

But convergence of inputs does not imply convergence of results. The edge moves from access to learning rate.

The outlier problem isn’t mystical. It’s an evaluation problem.

Oxford’s strongest point is that the power-law outliers are indistinguishable from “just bad” in the moment, and that humans use conviction to step into ambiguity.

I accept that premise and I still think the conclusion is wrong.

Because “conviction” is not a supernatural faculty. It’s a policy under uncertainty. And policies can be evaluated.

If your decision rule can’t be backtested, it’s not conviction. It’s narrative.

Don’t try to read souls. Build signals you can audit.

Some firms try to extract psychology from language data. Sometimes it works as a cue; often it’s noisy. And founders adapt as soon as they sense the scoring system.

So the goal isn’t “measure personality with high accuracy.” The goal is: build signals that are legible, repeatable, falsifiable and then combine them with a process that forces updates when reality disagrees.

Verification beats vibes.

If founders optimize public narratives, then naive text scoring collapses into a Goodhart trap.

The difference between toy AI and investable AI is verification: triangulate claims, anchor them in time, reject numbers that can’t be sourced, and penalize inconsistency across evidence.

That’s how you turn unstructured noise into features you can actually test.

Status is a market feature—not a human moat.

Networks and brand matter because markets respond to them—follow-on capital, recruiting pull, distribution, acquisition gravity.

So yes: status belongs in the model.

But modeling status is not the same thing as needing a human network as the enduring edge. One is an input signal. The other is a claim about irreducible advantage.

If an effect is systematic, it’s modelable.

Objective function: I’m optimizing for fund outcomes.

A lot of debates about “AI can’t do VC” hide an objective mismatch.

If your target is “eventual truth at year 12,” you’ll privilege a certain kind of human judgment. If your target is “realizable outcomes within a fund horizon,” you’ll build a different machine.

I’m comfortable modeling hype—not because fundamentals don’t matter, but because time and liquidity are part of the label. Markets pay for narratives before they pay for final verdicts, and funds get paid on the path, not just the destination.

The punchline

Oxford is right about current practice: AI reshapes the funnel, while humans still own the final decision and accountability.

My reaction is that this is not a permanent moat. It’s a temporary equilibrium.

Define success precisely. Build signals that survive verification. Backtest honestly. Update fast.

That’s not gut.

That’s an investing operating system.

2026 is the year we stop confusing scaling with solving

I called neuro-symbolic AI a 600% growth area back when I analyzed 20,000+ NEURIPS papers. I wrote that world models would unlock the $100T bet because spatial intelligence beats text prediction. I predicted AGI would expose average VCs because LLMs struggle with complex planning and causal reasoning.

Now Ilya Sutskever—co-founder of OpenAI, the guy who built the thing everyone thought would lead to AGI—just said it out loud: "We are moving from the age of scaling to the age of research".

That's not a dip. That's a ceiling.

Here's what the math actually says:

Meta, Amazon, Microsoft, Google, and Tesla have spent $560 billion on AI capex since early 2024. They've generated $35 billion in AI revenue. That's a 16:1 spend-to-revenue ratio. AI-related spending now accounts for 50% of U.S. GDP growth. White House AI Czar David Sacks admitted that a reversal would risk recession.

The 2000 dot-com crash was contained because telecom was one sector. AI isn't. This is systemic exposure dressed up as innovation.

The paradigm that just died:

The Kaplan scaling laws promised a simple formula: 10x the parameters, 10x the data, 10x the compute = 10x better AI. It worked from GPT-3 to GPT-4. It doesn't work anymore. Sutskever's exact words: these models "generalize dramatically worse than people".

Translation: we hit the data wall. Pre-training has consumed the internet's high-quality text. Going 100x bigger now yields marginal, not breakthrough, gains. When your icon of deep learning says that, you're not in a correction—you're at the end of an era.

The five directions I've been tracking—now validated:

The shift isn't abandoning AI. It's abandoning the lazy idea that "bigger solves everything." Here's where the research-to-market gap is closing faster than most realize:

1. Neuro-symbolic AI (the 600% growth area I flagged)

I wrote that neuro-symbolic was the highest-growth niche with massive commercial gaps. Now it's in Gartner's 2025 Hype Cycle. Why? Because LLMs hallucinate, can't explain reasoning, and break on causal logic. Neuro-symbolic systems don't. Drug discovery teams are deploying them because transparent, testable explanations matter when lives are on the line. MIT-IBM frames it as layered architecture: neural networks as sensory layer, symbolic systems as cognitive layer. That separation—learning vs. reasoning—is what LLMs never had.

2. Test-time compute (the paradigm I missed, but now understand)

OpenAI's o1/o3 flipped the script: spend compute at inference, not just training. Stanford's s1 model—trained on 1,000 examples with budget forcing—beat o1-preview by 27% on competition math. That's proof that intelligent compute allocation beats brute scale. But there's a limit: test-time works when refining existing knowledge, not generating fundamentally new capabilities. It's a multiplier on what you already have, not a foundation for AGI.

3. Small language models (the efficiency play enterprises actually need)

Microsoft's Phi-4-Mini, Mistral-7B, and others with 1-10B parameters are matching GPT-4 in narrow domains. They run on-device, preserve privacy, cost 10x less, and don't require hyperscale infrastructure. Enterprises are deploying hybrid strategies: SLMs for routine tasks, large models for multi-domain complexity. That's not compromise—that's architecture that works at production scale.

4. World models (the $100T bet I wrote about)

I argued that world models—systems that build mental maps of reality, not just predict text—would define the next era. They're now pulling $2B+ in funding across robotics, autonomous vehicles, and gaming. Fei-Fei Li's World Labs hit unicorn status at $230M raised. Skild AI secured $1.5B for robotic world models. And of course Yann Lecun's new startup. This isn't hype—it's the shift from language to spatial intelligence I predicted.

5. Agentic AI (the microservices moment for AI)

Gartner reports a 1,445% surge in multi-agent inquiries from Q1 2024 to Q2 2025. By end of 2026, 40% of enterprise apps will embed AI agents, up from under 5% in 2025. Anthropic's Model Context Protocol (MCP) and Google's A2A are creating HTTP-equivalent standards for agent orchestration. The agentic AI market: $7.8B today, projected $52B by 2030. This is exactly the shift I described in AGI VCs—unbundling monolithic intelligence into specialized, composable systems.

What kills most AI deployments (and what I've been saying):

I wrote that the gap isn't technology—it's misaligned expectations, disconnected business goals, and unclear ROI measurement. Nearly 95% of AI pilots generate no return (MIT study). The ones that work have three things: clear kill-switch metrics, tight integration loops, and evidence-first culture.

Enterprise spending in 2026 is consolidating, not expanding. While 68% of CEOs plan to increase AI investment, they're concentrating budgets on fewer vendors and proven solutions. Rob Biederman of Asymmetric Capital Partners: "Budgets will increase for a narrow set of AI products that clearly deliver results and will decline sharply for everything else".

That's the bifurcation I predicted: a few winners capturing disproportionate value, and a long tail struggling to justify continued investment.

The punchline:

The scaling era gave us ChatGPT. The research era will determine whether we build systems that genuinely reason, plan, and generalize—or just burn a trillion dollars discovering the limits of gradient descent.

My bet: the teams that win are the ones who stop optimizing for benchmark leaderboards and start solving actual constraints—data scarcity, energy consumption, reasoning depth, and trust. The ones who recognized early that neuro-symbolic, world models, and agentic systems weren't academic curiosities but the actual path forward.

I've been tracking these shifts for two years. Sutskever's admission isn't news to anyone reading this blog—it's confirmation that the research-to-market timeline just accelerated.

Ego last, evidence first. The founders who internalized that are already building what comes next.

AGI Will Replace Average VCs. The Best Ones? Different Game.

The performance gap between tier-1 human VCs and current AI on startup selection isn't what you think. VCBench: a new standardized benchmark where both humans and LLMs evaluate 9,000 anonymized founder profiles, shows top VCs achieving 5.6% precision. GPT-4o hit 29.1%. DeepSeek-V3 reached 59.1% (though with brutal 3% recall, meaning it almost never said "yes").[1]​

That's not a rounding error. It's a 5-10x gap in precision, the metric that matters most in VC, where false positives (bad investments) are far costlier than false negatives (missed deals).[1]​

But here's what the paper doesn't solve: VCBench inflated the success rate from real-world 1.9% to 9% for statistical stability, and precision doesn't scale linearly when you drop the base rate back down. The benchmark also can't test sourcing, founder relationships, or board-level value-add, all critical to real fund performance. And there's a subtle time-travel problem: models might be exploiting macro trend knowledge (e.g., "crypto founder 2020-2022 = likely exit") rather than true founder quality signals.[2]​

Still, the directional message is clear: there is measurable, extractable signal in structured founder data that LLMs capture better than human intuition. The narrative that "AI will augment but never replace VCs" is comforting and wrong. The question isn't if AGI venture capitalists will exist—it's when they cross 15-20% unicorn hit rates in live portfolios (double the best human benchmark) and what that phase transition does to the rest of us.​

The math is brutal for average funds

Firebolt Ventures has been cited as leading the pack at a 10.1% unicorn hit rate—13 unicorns from 129 investments since 2020. (Stanford GSB VCI-backed analysis, as shared publicly) Andreessen Horowitz sits at 5.5% on that same "since 2020" hit-rate framing, albeit at far larger volume. And importantly: Sequoia fell just below the 5% cutoff on that ranking—less because of a lack of wins and more because high volume dilutes hit rate.[3]​

The 2017 vintage—now mature enough to score—shows top-decile funds hitting 4.22x TVPI. Median? 1.72x. Most venture outcomes are random noise dressed up as strategy.​

Here's the punchline: PitchBook's 20-year LP study has been summarized as finding that even highly skilled manager selectors (those with 40%+ hit rates at picking top-quartile funds) generate only ~0.61% additional annual returns, and that skilled selection beats random portfolios ~98.1% of the time in VC (vs. ~99.9% in buyouts). (PitchBook analysis, as summarized).​

If the best fund pickers in the world can barely separate signal from noise, what does that say about VC selection itself?​

AGI VCs won't need warm intros

Current ML research suggests models can identify systematic misallocation even within the set of companies VCs already fund. In "Venture Capital (Mis)Allocation in the Age of AI," the median VC-backed company ranks at the 83rd percentile of model-predicted exit probability—meaning VCs are directionally good, but still leave money on the table. (Lyonnet & Stern, 2022). Within the same industries and locations, the authors estimate that reallocating toward the model's top picks would increase VCs' imputed MOIC by ~50%.​

That alpha exists because human VCs are bottlenecked by:

Information processing limits. Partners evaluate ~200-500 companies/year. An AGI system can scan orders of magnitude more continuously.​

Network constraints. You can't invest in founders you never meet. AGI doesn't need warm intros—it can surface weak signals from GitHub velocity, hiring patterns, or web/social-traffic deltas before the traditional network even sees the deck.​

Cognitive biases. We over-index on storytelling, pedigree, and pattern-matching to our last winner. Algorithms don't care if the founder went to Stanford or speaks confidently. They care about predictors of tail outcomes.​

Bessemer's famous Anti-Portfolio—the deals they passed on Google, PayPal, eBay, Coinbase is proof that even elite judgment systematically misfires. If the misses are predictable in hindsight, they're predictable in foresight given the right model.​

The five gaps closing faster than expected

AGI isn't here yet because five bottlenecks remain:

Continual learning. Current models largely freeze after training. A real VC learns from every pitch, every exit, every pivot. Research directions like "Nested Learning" have been proposed as pathways toward continual learning, but it's still not a solved, production-default capability.​

Visual perception. Evaluating pitch decks, product demos, team dynamics from video requires true multimodal understanding. Progress is real, but "human-level" is not the default baseline yet.​

Hallucination reduction. For VC diligence—where one wrong fact about IP or founder background kills the deal—today's hallucination profile is still too risky. Instead of claiming a universal "96% reduction," the defensible claim is that retrieval-augmented generation plus verification/guardrails can sharply reduce hallucinations in practice, with the magnitude depending on corpus quality and evaluation method. ​

Complex planning. Apple's research suggests reasoning models can collapse beyond certain complexity thresholds; venture investing is a 7-10 year planning problem through pivots, rounds, and market shifts.​

Causal reasoning. Correlation doesn't answer "If we invest $2M vs. $1M, what happens?" Causal forests and double ML estimate treatment effects while controlling for confounders. The infrastructure exists; it's not yet integrated into frontier LLMs. Give it 18 months.​

Unlike the theoretical barriers to general AGI (which may require paradigm shifts), the barriers to an AGI VC are engineering problems with known solutions.​

The phase transition nobody's pricing in

Hugo Duminil-Copin won the Fields Medal for proving how percolation works: below a critical threshold, clusters stay small. Above it, a giant component suddenly dominates. That's not a metaphor—it's a rigorous model of network effects.​

Hypothesis (not settled fact): once AGI-allocated capital crosses something like 15-25% of total VC AUM, network effects could create nonlinear disadvantage for human-only VCs in deal flow access and selection quality. Why? Because:​

Algorithmic funds identify high-signal companies before they hit the traditional fundraising circuit. If you're a founder and a fund can produce a high-conviction term sheet on a dramatically shorter clock—with clear, inspectable reasoning—you take the meeting.​

Network effects compound. The AGI with the best proprietary outcome data (rejected deals, partner notes, failed pivots) trains better models. That attracts better founders. Which generates better data. Repeat.​

LPs will demand quantitative benchmarks. "Show me your out-of-sample precision vs. the AGI baseline" becomes table stakes. Funds that can't answer get cut.​

The first AGI VC to hit 15% unicorn rates and 6-8x TVPI will trigger the cascade. My estimate: 2028-2029 for narrow domains (B2B SaaS seed deals), 2030-2032 for generalist funds. That's not decades—it's one fund cycle.​

What survives: relationship alpha and judgment at the edge

The AGI VC will systematically crush humans on sourcing, diligence, and statistical selection. What it won't replace—at least initially:

Founder trust and warm intros. Reputation still opens doors. An algorithm can't build years of relationship capital overnight.​

Strategic support and crisis management. Board-level judgment calls, operational firefighting, ego management in founder conflicts—those require human nuance.​

Novel situations outside the training distribution. Unprecedented technologies, regulatory black swans, geopolitical shocks. When there's no historical pattern to learn from, you need human synthesis.​

VCs will bifurcate: algorithmic funds competing on data/modeling edge and speed, versus relationship boutiques offering founder services and accepting lower returns. The middle—firms that do neither exceptionally—will get squeezed out.​

Operating system for the transition

If you're building or managing a fund today, three moves matter:

1. Build proprietary outcome data now. The best training set isn't Crunchbase—it's your rejected deal flow with notes, your portfolio pivots, your failed companies' post-mortems. That's the moat external models can't replicate. Track every pitch, every IC decision, every update. Structure it for ML ingestion.​

2. Instrument your decision process. Precommit to hypotheses ("We think founder X will succeed because Y"). Log the reasoning. Compare predicted vs. actual outcomes quarterly. This builds the feedback loop that lets you detect when your mental model is miscalibrated—and when an algorithm beats you.​

3. Segment where you add unique value vs. where you're replaceable. If your edge is "I know this space and can move fast," you're exposed. If it's "founders trust me in a crisis and I've navigated three pivots with them," you're defensible. Be honest about which deals came from relationship alpha versus statistical pattern-matching. Double down on the former; automate the latter.​

The real test

In three years, when an AGI fund publishes live performance data showing 12-15% unicorn rates and 5-6x TVPI, the LP conversation changes overnight. Not because the technology is elegant—because the returns are real and the process is transparent.​

That's the moment VCs have to answer: What alpha do we generate that a model can't? For many funds, the answer will be uncomfortable. For the best ones—the ones who've always known that determination, speed, and earned insight compound faster than credentials—it'll be clarifying.​

The AGI VC era doesn't kill venture capital. It kills the pretense that average judgment plus a warm network equals outperformance. What's left is a smaller, sharper game where human edge has to be provable, not performative.​

And if you can't articulate your edge in a sentence—quantifiably, with evidence—you're not competing with other humans anymore. You're competing with an algorithm that already sees your blind spots better than you do.​

  1. https://arxiv.org/pdf/2509.14448.pdf
  2. https://www.reddit.com/r/learnmachinelearning/comments/1no8xji/vcbench_new_benchmark_shows_llms_can_predict/
  3. https://www.linkedin.com/posts/ilyavcandpe_top-unicorn-investors-by-hit-rate-since-2020-activity-7362200145880367104-7zTv