Designing AI Pilots That Actually Scale
The pattern is frustratingly common: AI pilot succeeds, stakeholders are excited, funding is approved for full rollout, and then… the scaled deployment underperforms dramatically. Harvard Business Review has documented this pattern extensively.
I’ve seen this enough times to recognise the failure isn’t in the scaling – it’s in how the pilot was designed. Pilots optimised to demonstrate possibility don’t prove production viability.
Here’s how to design pilots that actually predict scaled success.
Why Pilots Deceive
Several factors cause pilots to outperform subsequent scaled deployments:
Selected users. Pilots typically involve enthusiastic early adopters. Scaled deployment includes sceptics and reluctant users.
Attention and support. Pilot projects get disproportionate attention. Issues are quickly addressed. Users get extensive support. Scaled deployment doesn’t get the same resources.
Simplified scope. Pilots often address the easy version of the problem. Edge cases and exceptions are deferred. Scaled deployment must handle the messy reality.
Clean data. Pilot datasets are often curated. Production data includes all the quality issues that were excluded from the pilot.
Motivated measurement. People want pilots to succeed. Measurement may unconsciously favour positive interpretation, and nobody's career is advanced by reporting a pilot's failure accurately.
These factors combine to create pilots that look great but don’t predict production reality.
Designing Predictive Pilots
The goal is designing pilots that accurately predict scaled performance. This requires different choices than pilots designed to look good.
Include Representative Users
Don’t just pilot with enthusiasts. Include:
- Sceptical users who will stress-test the AI
- Average performers, not just top performers
- People who will use it reluctantly
- Users with varying technical comfort
If your pilot group is self-selected enthusiasts, your results will be optimistic. Recruit deliberately to match the actual user population you’ll eventually deploy to.
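If it helps to make "recruit deliberately" concrete, here's a minimal sketch of drawing a pilot group whose mix of user types matches the population you'll eventually deploy to. The user types, proportions, and data structure are illustrative assumptions, not a prescription:

```python
import random

def recruit_pilot_group(candidates, target_mix, group_size):
    """Draw a pilot group whose mix of user types matches the scaled population.

    candidates: user_type -> list of potential pilot users (hypothetical structure)
    target_mix: user_type -> fraction of the eventual user population
    """
    group = []
    for user_type, fraction in target_mix.items():
        n = round(group_size * fraction)
        group.extend(random.sample(candidates[user_type], n))
    return group

# e.g. recruit_pilot_group(all_users_by_type,
#                          {"enthusiast": 0.2, "average": 0.6, "sceptic": 0.2},
#                          group_size=30)
```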
Reduce Special Support
Pilot support should match what you can sustain at scale.
If the pilot has a dedicated support team responding to issues within minutes, but scaled deployment will have standard IT support with 24-hour SLAs, test with scaled support levels.
If users can call the project team directly during the pilot, but that won’t be possible at scale, remove that access.
The pilot should feel like production, not like a coddled experiment.
Include Edge Cases
Real-world operations include exceptions, unusual situations, and difficult cases. Include them in the pilot:
- The 10% of transactions that don’t fit standard patterns
- The customers with complex histories
- The documents with unusual formatting
- The situations that currently require manual escalation
If the AI handles the easy 90% well but can’t handle the difficult 10%, scaled deployment will have problems. Test with the difficult cases.
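One way to keep the difficult 10% visible is to report pilot metrics by segment rather than as a single blended number. A minimal sketch, assuming you tag each pilot case as standard or edge case (the field names are hypothetical):

```python
from collections import defaultdict

def accuracy_by_segment(results):
    """Report accuracy per segment so edge cases aren't hidden in an average.

    `results` is a list of dicts with hypothetical keys:
    'segment' ('standard' or 'edge_case') and 'correct' (bool).
    """
    totals = defaultdict(lambda: [0, 0])  # segment -> [correct, total]
    for r in results:
        totals[r["segment"]][0] += int(r["correct"])
        totals[r["segment"]][1] += 1
    return {seg: correct / total for seg, (correct, total) in totals.items()}

# Illustrative numbers: 92% blended accuracy looks fine, but the difficult
# 10% of cases is only handled correctly 4 times out of 10.
results = (
    [{"segment": "standard", "correct": True}] * 88
    + [{"segment": "standard", "correct": False}] * 2
    + [{"segment": "edge_case", "correct": True}] * 4
    + [{"segment": "edge_case", "correct": False}] * 6
)
print(accuracy_by_segment(results))  # {'standard': ~0.978, 'edge_case': 0.4}
```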
Use Production-Quality Data
Resist the temptation to clean up data for the pilot. Use data that matches production reality:
- Include records with quality issues
- Don’t exclude difficult cases from the dataset
- Test with current data volumes and refresh rates
- Include historical data even if it’s messier
AI that performs well on curated data but poorly on production data isn’t ready for deployment. Learn this during the pilot, not during rollout.
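A quick check, assuming tabular data and pandas, is to profile the pilot dataset against a sample of production data before the pilot begins; if the pilot data is dramatically cleaner, it has been curated. The file paths and measures here are placeholders:

```python
import pandas as pd

def quality_profile(df: pd.DataFrame) -> dict:
    """Rough data-quality profile: volume, missing values, duplicate rows."""
    return {
        "rows": len(df),
        "missing_rate": float(df.isna().mean().mean()),   # average share of missing values per column
        "duplicate_rate": float(df.duplicated().mean()),  # share of fully duplicated rows
    }

pilot = pd.read_csv("pilot_dataset.csv")            # placeholder paths
production = pd.read_csv("production_sample.csv")

print("pilot:     ", quality_profile(pilot))
print("production:", quality_profile(production))
# If the pilot profile is dramatically cleaner than production,
# the pilot won't predict production behaviour.
```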
Measure Honestly
Establish measurement before the pilot starts, and protect measurement integrity:
- Define success criteria in advance
- Use automated measurement where possible
- Include negative indicators, not just positive ones
- Have measurement owned by someone not invested in pilot success
- Compare against meaningful baselines
If the pilot team controls measurement, unconscious bias will affect results. Create structural independence.
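What "defined in advance and automated" can look like in practice: freeze the criteria as data before the pilot starts, version-control them, and have a script owned outside the pilot team compare measured results against them. The metrics, thresholds, and values below are purely illustrative:

```python
# Illustrative success criteria, frozen and version-controlled before the
# pilot starts, owned by someone outside the pilot team.
SUCCESS_CRITERIA = {
    "task_completion_rate": {"baseline": 0.78, "target": 0.85, "higher_is_better": True},
    "avg_handle_time_s":    {"baseline": 310,  "target": 280,  "higher_is_better": False},
    "escalation_rate":      {"baseline": 0.12, "target": 0.12, "higher_is_better": False},  # negative indicator: must not worsen
}

def evaluate(measured: dict) -> dict:
    """Compare measured pilot metrics against the pre-agreed baselines and targets."""
    verdicts = {}
    for name, spec in SUCCESS_CRITERIA.items():
        value = measured[name]
        sign = 1 if spec["higher_is_better"] else -1
        if sign * (value - spec["target"]) >= 0:
            verdicts[name] = "met target"
        elif sign * (value - spec["baseline"]) >= 0:
            verdicts[name] = "above baseline, below target"
        else:
            verdicts[name] = "worse than baseline"
    return verdicts

# Measured values should come from automated pipelines, not the pilot team's slides.
print(evaluate({"task_completion_rate": 0.87, "avg_handle_time_s": 295, "escalation_rate": 0.15}))
# {'task_completion_rate': 'met target',
#  'avg_handle_time_s': 'above baseline, below target',
#  'escalation_rate': 'worse than baseline'}
```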
The Pilot Design Checklist
Before launching a pilot, verify:
User selection:
- Includes sceptics and reluctant users
- Matches the demographics of the scaled population
- Not exclusively enthusiasts and early adopters
Support model:
- Matches what’s sustainable at scale
- No special access to project team
- Training matches what’s scalable
Scope:
- Includes edge cases and exceptions
- Represents full complexity of the workflow
- Doesn’t defer difficult scenarios
Data:
- Uses production-quality data
- Includes known data quality issues
- Tests at realistic volumes
Measurement:
- Criteria defined before pilot starts
- Measurement independent of pilot team
- Includes negative indicators
- Baseline comparisons are meaningful
If you can’t check all these boxes, your pilot will likely produce misleading results.
What a Realistic Pilot Looks Like
A realistic pilot will be harder to execute:
- Some users won’t engage enthusiastically
- Issues will take longer to resolve
- Edge cases will reveal limitations
- Data quality problems will affect results
- Some metrics will be disappointing
That’s not pilot failure – that’s learning what production will look like. A pilot that exposes problems is more valuable than one that hides them.
After the Pilot: Go/No-Go Decisions
When the pilot completes, assess honestly:
Strong go: Metrics met success criteria with representative users, realistic support, and production data. Scaling should preserve results.
Conditional go: Metrics were acceptable but revealed specific issues that need addressing before scale. Scale only after fixes are validated.
No-go: Metrics fell short, or problems are fundamental rather than fixable. Scaling would waste resources. Return to development or abandon.
Many organisations turn “conditional go” into “go” without addressing the conditions. This is how scaling failures happen.
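A sketch of how the three outcomes might be encoded so that a "conditional go" can't quietly become a "go" while its conditions are still open. It assumes per-metric verdicts like those from the evaluation sketch in the measurement section, plus an explicit list of unresolved issues:

```python
def go_no_go(verdicts: dict, unresolved_issues: list) -> str:
    """Map per-metric verdicts and open issues to a scaling decision.

    verdicts: metric -> 'met target' / 'above baseline, below target' /
              'worse than baseline' (as in the earlier evaluation sketch).
    unresolved_issues: specific problems that must be fixed and re-validated
    before scaling -- a conditional go keeps them attached to the decision.
    """
    if any(v == "worse than baseline" for v in verdicts.values()):
        return "no-go"
    below_target = [m for m, v in verdicts.items() if v != "met target"]
    if below_target or unresolved_issues:
        return "conditional go: " + "; ".join(below_target + list(unresolved_issues))
    return "strong go"
```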
The Counter-Argument
I sometimes hear: “If we make the pilot that hard, it won’t look good enough to get funding for scale.”
That’s a valid concern. But consider the alternative: a pilot that looks good, gets funding, and then fails at scale. That outcome is worse for everyone involved – it wastes more resources and damages credibility for future AI initiatives.
Better to have a realistic pilot that leads to appropriate decisions than an optimistic pilot that leads to expensive failures.
Final Thought
The purpose of a pilot isn’t to demonstrate that AI can work in ideal conditions. It’s to predict whether AI will work in production conditions.
Design pilots for prediction, not demonstration. You’ll get fewer impressive pilot results and more successful scaled deployments.
That’s a trade-off worth making.