The Software Factory Has Arrived: What AI Engineer World's Fair 2026 Tells Us About Where AI Is Going

The AI Engineer World’s Fair opens in San Francisco on June 28. It has more than 6,000 attendees, 300 speakers and 29 tracks. That scale matters because this is not a conference about what AI might become. It is a conference about what AI has already become, and what now breaks when companies try to run it at scale.

The people attending are not tourists. They are engineers, founders, product leaders and AI operators who have to ship working systems on Monday morning. Read across the program and one thing becomes clear. The industry has crossed a threshold. The question is no longer whether AI can generate code, answer questions or perform tasks. The question is whether we can verify it, govern it, secure it and trust it enough to let it act.

The biggest bottleneck has moved from generation to verification. AI can now produce more output than humans can review. Uber is running 1,800 AI-generated code changes per week. Greptile has analyzed 1 million AI-generated pull requests. GitHub Copilot, Factory, Qodo and others are all dealing with the same second-order problem. Once machines write faster, humans become the constraint. The next major layer in AI may not be another model company. It may be the verification layer that tells us whether the output is correct, safe and usable.

This is especially clear in software. GitHub CEO Thomas Dohmke is asking what the future of the software development lifecycle looks like. That question alone tells you where we are. Software is being reorganized around AI agents that write code, open pull requests, run tests, respond to failures and iterate. Humans move from authors to supervisors. The companies that figure this out will not just have better tools. They will have a different cost structure.

A new discipline is also emerging around the model. The conference calls it harness engineering. It is the scaffolding, state management, orchestration, memory, tool calls, retries and error recovery that make agents work in production. This matters because many agent failures are not model failures. The model did what it was asked to do. The system around it gave the wrong context, the wrong tool or the wrong recovery path. The model is becoming one part of a larger system. The harness is where a lot of the moat will live.

Prompt engineering is also giving way to context engineering. Prompt engineering asks what instruction to give the model. Context engineering asks what the model should know right now. For long-running agents, that is the operating system. Context compaction, memory offloading and cost accounting are becoming core production primitives. A weaker model with excellent context management may beat a stronger model with poor context discipline.

MCP is another signal of where the industry is going. The Model Context Protocol is moving from experiment to infrastructure. Figma, Docker and Microsoft are all building around it. But the same protocol that lets agents connect to tools and data also expands the blast radius. An agent that can read a codebase, call APIs and access internal systems is not a chatbot. It is an actor inside the company. That creates a new security and governance problem.

The same issue appears in regulated industries. Two Sigma is presenting on AI agents that assume employee identity and access internal tools. PayPal is presenting on agent-initiated payments across ChatGPT and Google AI Mode. These are serious trust boundaries. What is an agent allowed to do? Whose identity does it act under? Who is liable when it makes a mistake? AI capability has moved faster than enterprise governance. The next 12 to 24 months will be about closing that gap.

Voice is another major surface. Text agents fail quietly. Voice agents fail in public. They interrupt, lag, misunderstand and talk over people. Most voice systems are still pipelines that turn speech into text, send it to an LLM and then turn the answer back into speech. Native speech systems should eventually collapse that stack, but we are still in the difficult gap between possible and reliable.

The open-weights ecosystem has also become credible. Hugging Face is hosting millions of models and serving large enterprises. Open models are no longer just research tools. They are becoming part of the production stack. That changes cost, control, privacy and deployment strategy.

The bigger point is this. Coding agents were the proof of concept. Knowledge work agents are the business. Legal analysis, financial research, medical reasoning and investment work will need domain-specific tools, workflows and verification systems. You cannot simply point a coding agent at a new domain and expect it to work.

The AI Engineer World’s Fair 2026 is not about AI potential. It is about AI operations. How do you run agents at scale? How do you verify them? How do you govern them? How do you secure them? The generation problem has been substantially solved for a large class of tasks. The operations problem is now the center of gravity. The companies that understand this are building for the next five years. The companies still focused only on generation are solving yesterday’s problem.

The conference runs June 28–July 2, 2026, at Moscone West Convention Center, San Francisco. Details: https://www.ai.engineer/worldsfair/2026

Synthetic Dissent: Your Agentic Investment Committee Needs a Correlation Audit

The multi-agent investment committee is becoming the new toy in venture.

Five AI agents read the same memo. One is the market-sizing agent. One is the technical diligence agent. One is the financial model agent. One is the founder-pattern agent. One is the skeptic. They debate, vote, and produce a dashboard with confidence scores.

It looks rigorous.

It may also be fake rigor.

The problem is not that the agents are useless. The problem is that many of these systems treat five agents as five independent opinions when, in reality, they are often one opinion with five different costumes.

If the agents run on the same foundation model, read the same memo, retrieve from the same data room, and inherit the same model priors, their votes are not independent. They are correlated. And once the votes are correlated, the committee math breaks.

This is the first failure mode of the agentic investment committee: it confuses headcount with epistemic diversity.

Five agents do not mean five independent views.

Sometimes it means one view, sampled five times.

The Clones Are Voting

Here is what happens in many first-generation agentic IC systems.

A fund gives the same deal memo to multiple agents. Each agent is assigned a persona. The market agent asks whether the TAM is large enough. The product agent asks whether the product is defensible. The founder agent asks whether the founder has the right background. The finance agent checks burn, margin, and revenue quality. The skeptic writes the bear case.

On the surface, the outputs look different. One agent worries about go-to-market. Another worries about technical depth. Another worries about customer concentration.

But the underlying model may still be the same. The data may still be the same. The retrieved context may still be the same. The priors may still be the same.

So what looks like disagreement may only be formatting variance.

This matters because a committee is valuable only if its members bring genuinely different models of the world. A former biotech operator and a consumer marketplace investor may disagree for real reasons. They have different scars, different pattern libraries, different examples in their heads, and different mistakes they are trying not to repeat.

AI personas do not automatically have that. A prompt that says “you are a contrarian growth investor” does not create a contrarian growth investor. It creates a language model role-playing one.

That distinction is not philosophical. It is mathematical.

The Missing Math: Effective Committee Size

If five IC agents are independent, five votes can add real signal.

But if their errors are correlated, the effective number of independent opinions collapses.

A simple approximation is:

Effective independent agents = n / [1 + (n - 1)ρ]

Where:

n = number of agents
ρ = average pairwise correlation between their errors (assuming each agent's errors have similar variance)

Now apply this to a five-agent IC.

If correlation is zero:

5 / [1 + 4(0)] = 5.0

You really have five independent agents.

If correlation is 0.5:

5 / [1 + 4(0.5)] = 1.67

Your five-agent IC is closer to 1.7 independent agents.

If correlation is 0.8:

5 / [1 + 4(0.8)] = 1.19

Your five-agent IC is basically one agent with a panel discussion.

This is the uncomfortable truth: a unanimous “strong invest” from five highly correlated agents may not be five signals. It may be one signal echoing through five prompts.

The dashboard says 5–0.

The math says 1.2–0.
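
The arithmetic above is easy to reproduce. The following is a minimal sketch of the same approximation; the function name effective_committee_size is hypothetical, and the restriction of ρ to the range 0 to 1 is an assumption made for this illustration, not something the formula itself requires.

```python
def effective_committee_size(n: int, rho: float) -> float:
    """Approximate number of independent opinions: n / [1 + (n - 1) * rho]."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        # Assumption for this sketch: only non-negative average correlation.
        raise ValueError("rho is assumed to lie between 0 and 1")
    return n / (1 + (n - 1) * rho)


# Reproduce the three five-agent cases worked through in the text.
for rho in (0.0, 0.5, 0.8):
    size = effective_committee_size(5, rho)
    print(f"rho = {rho:.1f}: effective agents = {size:.2f}")
```

Sweeping ρ this way makes the shape of the problem visible: most of the collapse happens well before correlation gets close to 1.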

Why Consensus Is the Wrong Target in Venture

The instinct to build toward consensus is understandable.

IC meetings are expensive. Partner disagreement is uncomfortable. LPs like process. A system that produces clean, confident outputs feels mature. It feels institutional. It feels like judgment has been made legible.

But venture returns are not produced by maximizing consensus.

They are produced by being right on a small number of non-obvious companies before the market agrees.

That is why venture is such a strange asset class. In many portfolios, most companies do not matter to the final return. A small number of outliers matter enormously. The fund is often made by the investment that looked weird, early, overvalued, too small, too messy, or too soon.

This creates a problem for agentic IC design.

If the system is optimized to reduce disagreement, it will become very good at killing strange companies. It will push decisions back toward the historical average. It will reward companies that look like prior winners and penalize companies whose best feature is that they do not fit an existing pattern yet.

That is dangerous because the average of historical venture data is not the source of venture returns. The outlier is.

A venture IC should not ask only, “Do we agree?”

It should ask, “Where do we disagree, and is the disagreement informative?”

The best IC output is not a confidence score. It is the unresolved argument that forces the partner to think.

The Real Job of a Devil’s-Advocate Bot

The devil’s-advocate bot should not be a decorative skeptic.

It should not produce a polite paragraph saying, “Risks include competition, execution, and fundraising environment.”

That is not dissent. That is memo garnish.

A real devil’s-advocate bot has one job: construct the strongest possible case that this investment should not happen.

It should try to kill the deal.

Not because the fund should always listen to it. Because if the strongest kill memo is weak, that is information. If the strongest kill memo is strong but answerable, that is information. If the strongest kill memo is strong and no one can rebut it, that is also information.

The purpose of synthetic dissent is not negativity. It is compression. It compresses the hardest objections into a form the human IC cannot avoid.

The bear case should not be hidden below a final recommendation. It should be the main artifact.

The system should produce three things:

  1. The strongest invest case.

  2. The strongest kill case.

  3. The unresolved disagreement between them.

The third output is the most important.

Build for Disagreement, Not Theater

If funds want agentic ICs to matter, they need to stop measuring how often agents agree and start measuring whether agent disagreement is useful.

That requires different architecture.

First, use decorrelated retrieval.

Do not let every agent read the same memo and the same data-room summary. Give agents different information sets. One agent reads only founder background and references. One reads only customer calls. One reads only technical diligence. One reads only competitive data. One reads only the financial model.

This creates real information asymmetry. When agents disagree, the disagreement now means something. It may reflect a tension between customer love and weak margins, or between founder strength and market timing, or between technical elegance and weak distribution.
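
One way to enforce that separation is to route documents by type before any agent sees them. The sketch below is illustrative only: the agent names, the document "kind" labels, and the helpers check_disjoint and build_context are all hypothetical, and a real data room would need its own tagging scheme.

```python
from itertools import combinations

# Hypothetical mapping from agent persona to the document kinds it may read.
AGENT_SOURCES = {
    "founder": {"founder_background", "references"},
    "customer": {"customer_calls"},
    "technical": {"technical_diligence"},
    "competitive": {"competitive_data"},
    "finance": {"financial_model"},
}


def check_disjoint(agent_sources):
    """Raise if any two agents are allowed to read the same kind of document."""
    for (a, kinds_a), (b, kinds_b) in combinations(agent_sources.items(), 2):
        shared = kinds_a & kinds_b
        if shared:
            raise ValueError(f"{a} and {b} share sources: {sorted(shared)}")


def build_context(agent, documents):
    """Return only the documents this agent is allowed to see."""
    allowed = AGENT_SOURCES[agent]
    return [doc for doc in documents if doc["kind"] in allowed]


# Placeholder data room; the text fields are stand-ins, not real content.
documents = [
    {"kind": "founder_background", "text": "placeholder"},
    {"kind": "references", "text": "placeholder"},
    {"kind": "customer_calls", "text": "placeholder"},
    {"kind": "technical_diligence", "text": "placeholder"},
    {"kind": "competitive_data", "text": "placeholder"},
    {"kind": "financial_model", "text": "placeholder"},
]

check_disjoint(AGENT_SOURCES)
for agent in AGENT_SOURCES:
    kinds = [doc["kind"] for doc in build_context(agent, documents)]
    print(f"{agent}: {kinds}")
```

The disjointness check is the point of the design: if two agents can read the same source, part of their agreement is guaranteed before either of them reasons about the deal.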

Second, use heterogeneous models where possible.

Different prompts on the same model are not enough. Use different model families, different fine-tunes, different retrieval strategies, and different evaluation rubrics. You may not eliminate correlation, but you can reduce it.

Third, make the skeptic structurally independent.

The kill agent should not be asked to “be balanced.” Balance is the synthesizer’s job. The kill agent should be adversarial. It should have permission to be harsh, specific, and one-sided.

Fourth, track calibration by agent, not just by committee.

Most systems will track whether the final recommendation was right. That is useful but incomplete. You should also track which agent dissented, when it dissented, and whether that dissent would have improved the decision.

The most valuable agent may not be the one that agrees with the final vote most often.

It may be the one that was uncomfortable for the right reasons.
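
Tracking dissent by agent can start as a simple ledger. The sketch below assumes a hypothetical DealRecord with a hindsight label for whether the deal turned out well; how that label is assigned, and how long a fund must wait before assigning it, is left open. A dissent counts as useful here when the agent's vote, not the committee's, matched the outcome.

```python
from collections import defaultdict
from typing import NamedTuple


class DealRecord(NamedTuple):
    deal_id: str
    agent: str
    agent_vote: bool      # True = invest
    committee_vote: bool  # True = invest
    good_outcome: bool    # hypothetical hindsight label: True = worth investing


def dissent_report(records):
    """Per agent: how often it dissented and how often that dissent was right.

    Agents that never dissented do not appear in the report.
    """
    stats = defaultdict(lambda: {"dissents": 0, "useful": 0})
    for r in records:
        if r.agent_vote != r.committee_vote:
            stats[r.agent]["dissents"] += 1
            if r.agent_vote == r.good_outcome:
                stats[r.agent]["useful"] += 1
    return {
        agent: {**s, "useful_rate": s["useful"] / s["dissents"]}
        for agent, s in stats.items()
    }


# Illustrative records only; none of these are real deals.
records = [
    DealRecord("deal-1", "skeptic", False, True, False),  # dissent, and right
    DealRecord("deal-2", "skeptic", False, True, True),   # dissent, and wrong
    DealRecord("deal-1", "market", True, True, False),    # agreed with committee
    DealRecord("deal-3", "finance", False, True, False),  # dissent, and right
]

for agent, s in dissent_report(records).items():
    print(f"{agent}: {s['dissents']} dissents, useful rate {s['useful_rate']:.2f}")
```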

A Better Agentic IC Dashboard

The current dashboard often looks like this:

Market agent: Invest
Product agent: Invest
Founder agent: Invest
Finance agent: Invest
Skeptic agent: Invest with risks
Final recommendation: Strong invest
Confidence: 87%

That looks decisive.

But it may be dangerously overconfident.

A better dashboard would look like this:

Nominal agent vote: 5–0
Estimated error correlation: 0.72
Effective independent vote count: 1.29
Strongest invest argument: Founder-market fit is unusually strong, customer pull is early but real, and the market may inflect faster than incumbents expect.
Strongest kill argument: The current traction may be services-led, gross margin is not yet proven, and the buyer may lack budget ownership.
Unresolved disagreement: High
Human decision required: Yes

That dashboard is less comforting.

It is also more honest.

The goal is not to make the IC quieter. The goal is to make the unresolved judgment visible.
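
The correlation and effective vote lines on such a dashboard have to come from somewhere. The sketch below shows one possible derivation, assuming the fund keeps a history of each agent's signed errors on past deals (for example, predicted score minus realized score). The data layout, the function names, and the choice to floor negative average correlation at zero are all assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_error_correlation(errors):
    """Mean pairwise correlation across all agents' historical error series."""
    pairs = list(combinations(errors, 2))
    return sum(pearson(errors[a], errors[b]) for a, b in pairs) / len(pairs)


def dashboard_lines(votes, errors):
    """Build the vote, correlation, and effective-vote lines of the dashboard."""
    n = len(votes)
    yes = sum(votes.values())
    # Assumption: treat negative average correlation as zero in this sketch.
    rho = max(average_error_correlation(errors), 0.0)
    effective = n / (1 + (n - 1) * rho)
    return [
        f"Nominal agent vote: {yes}-{n - yes}",
        f"Estimated error correlation: {rho:.2f}",
        f"Effective independent vote count: {effective:.2f}",
    ]


# Illustrative history: signed errors on four past deals per agent.
errors = {
    "market": [1, -1, 1, -1],
    "product": [1, -1, 1, -1],   # moves in lockstep with the market agent
    "skeptic": [1, 1, -1, -1],   # uncorrelated with the other two
}
votes = {"market": True, "product": True, "skeptic": True}

for line in dashboard_lines(votes, errors):
    print(line)
```

A handful of past deals gives a very noisy correlation estimate, so the number is better read as a warning light than as a measurement.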

The Human Is Still the Committee

The strongest version of agentic IC does not replace the partner.

It makes the partner less lazy.

A bad system gives the human a recommendation to endorse. A good system gives the human an argument to overcome.

That distinction matters. Venture judgment is not only about pattern recognition. It is about knowing when to violate the pattern. It is about asking whether the thing that looks like a flaw is actually the source of the opportunity.

A consensus machine will struggle with that.

A dissent machine might help.

The right question is not, “Can AI agents vote like an investment committee?”

The better question is, “Can AI agents expose the disagreement that a real investment committee would otherwise avoid?”

If your agentic IC is producing clean, confident, unanimous outputs on most deals, something is probably wrong. Either your deals are unusually obvious, or your system has been engineered to launder uncertainty into consensus.

Consensus among correlated agents is not conviction.

It is noise wearing a suit.

Build the devil’s advocate. But more importantly, measure the correlation.

Because in venture, the danger is not that your agents disagree.

The danger is that they all agree for the same wrong reason.

The Most Dangerous Slide in a Southeast Asian Startup Deck

The most dangerous slide in a Southeast Asian startup deck is not the TAM slide.

It is the U.S. pilot slide.

Every founder expanding into America wants that one recognizable logo. A U.S. enterprise agrees to test the product. Someone senior sounds excited. The company runs a proof of concept. The logo goes into the fundraising deck. Suddenly everyone starts calling it “U.S. traction.”

I would be careful.

A pilot is not traction. A pilot is a question.

The question is not only whether the product works. That is usually the easy part. The harder question is whether the customer has a real problem, a real buyer, a real budget and a real rollout path.

I call this the Four Reals.

Real pain. Real buyer. Real budget. Real rollout.

Without those four things, a pilot is not a bridge into the U.S. market. It is a waiting room with a famous logo on the door.

I have seen many Southeast Asian founders overvalue U.S. interest because the signal feels powerful. A meeting with an American enterprise feels different. A trial with a global company feels like validation. A logo from Silicon Valley, New York or a Fortune 500 company changes how investors listen. It gives the team confidence. It makes the company feel global.

But the U.S. market does not reward interest. It rewards conversion.

American enterprises are very good at experimentation. They will take calls. They will run pilots. They will ask startups to prove value. They will introduce innovation teams, digital transformation teams, strategy teams and product teams. Sometimes that is the beginning of a real commercial relationship. Other times, it is just the enterprise outsourcing its curiosity to startups.

That is the trap.

The founder thinks the company is entering the U.S. market. The enterprise thinks it is learning.

For AI startups, the trap is even worse. Every company wants to experiment with AI. Very few know exactly how to deploy it, govern it, integrate it and pay for it at scale. A demo can be impressive and still fail to become workflow. The model can be strong and still get stuck because the data is messy. The user can love it and still lose to compliance. The champion can be excited and still fail to get finance approval.

This is why I would rather see one boring paid U.S. customer than five glamorous pilots.

A paid customer tells me someone has crossed the internal line from curiosity to commitment. A pilot only tells me someone was interested enough to try.

For founders, the discipline should begin before the pilot starts. Define success in writing. Tie the pilot to a business metric, not a vague feeling. Identify the economic buyer, not just the friendly user. Understand who signs the purchase order. Ask what budget this comes from. Map the security, legal and procurement path early. Set a clear end date. Most importantly, agree upfront on what happens if the pilot works.

A good pilot should create a buying decision.

A bad pilot creates activity.

This matters for Southeast Asian founders because capital efficiency can hide go-to-market weakness. Many founders from this region are excellent at building with fewer resources. They are used to fragmented markets, different cultures and difficult operating environments. That is a strength. But the U.S. is difficult in a different way. It is not fragmented by geography as much as it is fragmented inside the customer.

The user is not always the buyer. The buyer is not always the budget owner. The budget owner is not always the decision maker. The decision maker may still need legal, compliance, procurement, finance and IT to say yes.

So when a founder tells me they have a U.S. pilot, I do not ask whether the logo is impressive.

I ask what happens next.

Is the pain real? Is the buyer real? Is the budget real? Is the rollout real?

If yes, that pilot may be the start of U.S. traction.

If no, it may only be a very expensive conversation.

AI Labor Is Not SaaS

For twenty years, software investors underwrote a simple assumption: software scales better than labor. SaaS sold access to reusable code. Add another customer, seat, or department, and the marginal cost was close to zero. That was the economic magic behind 80–90% gross margins, high net retention, and the valuation framework that built modern enterprise software.

AI agents complicate that assumption. They are software, but they behave economically like labor. Every completed task carries a cost: inference, tool calls, orchestration, retrieval, verification, retries, exception handling, and sometimes human escalation. In traditional SaaS, usage was mostly evidence of retention. In agentic AI, usage is also a cost event.

The obvious story is that AI agents will replace seats. Why pay for fifty users when one person with agents can do the same work? That story is directionally right, but it misses the harder question. Replacing seats does not automatically create a better business model. A seat is predictable revenue. Work is variable cost.

That is the agent margin trap. Outcomes-based pricing sounds elegant: pay per resolved ticket, qualified lead, invoice processed, or claim reviewed. The customer only pays when value is delivered. But alignment is not the same as margin. An agentic AI company that sells more outcomes also takes on more work, more model calls, more edge cases, more failed attempts, and more dispute risk over whether the outcome was actually achieved.

The bull case is that inference costs will fall. They will. But inference is only the visible cost of autonomy. The hidden costs are verification, exception handling, trust, and liability. Those do not automatically collapse with GPU prices. In many workflows, they become the real cost center.

This is why the key diligence question for agentic AI companies is changing. It is not how much ARR they have, how many workflows they automate, or how accurate the model is. The real question is: what is the contribution margin per outcome after inference, tools, orchestration, verification, exceptions, and human escalation?

If that number does not improve with scale, the company is not AI-native SaaS. It is tech-enabled services with a better interface.
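
That diligence question can be made concrete with a small calculation. The sketch below is a hypothetical model, not a standard accounting definition: the OutcomeCosts fields mirror the cost categories named above, and every price and cost figure is invented for illustration.

```python
from dataclasses import astuple, dataclass


@dataclass
class OutcomeCosts:
    """Average cost per delivered outcome, in the same currency as the price."""
    inference: float
    tools: float
    orchestration: float
    verification: float
    exceptions: float
    human_escalation: float

    def total(self) -> float:
        return sum(astuple(self))


def contribution_margin(price: float, costs: OutcomeCosts):
    """Return (margin per outcome, margin as a fraction of price)."""
    margin = price - costs.total()
    return margin, margin / price


# Invented figures: a resolved ticket priced at 2.00, at two levels of scale.
early = OutcomeCosts(0.30, 0.10, 0.05, 0.25, 0.20, 0.40)
later = OutcomeCosts(0.15, 0.10, 0.05, 0.25, 0.20, 0.40)

for label, costs in (("early", early), ("later", later)):
    margin, pct = contribution_margin(2.00, costs)
    print(f"{label}: margin per outcome {margin:.2f} ({pct:.1%})")
```

In this made-up example, halving inference cost lifts the margin only modestly because verification, exceptions, and escalation are unchanged, which is the pattern to look for when testing whether the number improves with scale.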

The best agentic AI companies will operate in domains where the work is frequent, narrow, measurable, repeatable, and low-liability. Customer support resolution, invoice processing, compliance documentation, sales research, data enrichment, claims triage, and recruiting screening are good examples. In these markets, success can be defined, failure can be measured, exceptions can be reduced, and margins can improve.

The weakest agentic AI companies will operate where the work is ambiguous, judgment-heavy, high-liability, and hard to verify. Strategy, complex legal advice, medical decisioning, financial planning, enterprise sales, and executive recruiting may demo well, but scale poorly. The edge cases become the product. The exception queue becomes the company.

Investors are trying to value AI labor with SaaS multiples. But labor and software are different economic objects. Software scales by replication. Labor scales by execution. SaaS monetized access. Agents monetize work.

The next great AI companies will not be the ones that automate the most work. They will be the ones that automate the most profitable work.

AI does not kill SaaS. It forces us to ask a better question: is this software, or is this labor with an API?