
Research Overview
We evaluated several hundred new product concepts using both an AI synthetic model and an online consumer panel survey. The AI model was trained to respond as virtual consumers would. Both the synthetic consumers and the real consumers viewed a visual and a brief written description of each new product idea, in random order, and answered the same set of questions.
From these questions, five key performance metrics were measured on a 5-point Likert scale: Overall Score, Purchase Interest, New & Unique, Solves a Need, and Virality. We averaged these scores across all concepts to see how closely the AI synthetic model predicted how real consumers would score.
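To make the scoring step concrete, here is a minimal sketch of how a top-two box (TTB) share, the percentage of respondents answering 4 or 5 on the 5-point scale, can be computed and averaged across concepts. The array values and shapes are illustrative only, not the study's actual data or pipeline.

```python
import numpy as np

# Hypothetical ratings for one question (e.g., Purchase Interest):
# rows = respondents, columns = concepts, values = 1-5 Likert scores.
ratings = np.array([
    [5, 3, 4],
    [4, 2, 5],
    [3, 4, 4],
])

# Top-two box (TTB): share of respondents rating a concept 4 or 5.
ttb_per_concept = (ratings >= 4).mean(axis=0)   # one share per concept
average_ttb = ttb_per_concept.mean()            # average across all concepts

print(ttb_per_concept)        # approx. [0.667, 0.333, 1.0]
print(f"{average_ttb:.0%}")   # 67%
```

Running the same computation on the synthetic and the real panels' answers yields the paired TTB figures reported below.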
We used scores from the Purchase Interest questions to rank the concepts from top to bottom, then statistically compared the synthetic panel's ranking against the real consumers' ranking.
We also tracked turnaround time and cost per study.
Outcome Metrics
- Overall Score. The synthetic panel gave the average concept a 72% top-two box (“TTB”) score, closely matching the consumer panel, which aggregated to 70% TTB.
- Purchase Interest. The synthetic panel gave the average concept a 68% TTB score, vs. 65% for the consumer panel. Note that these questions don’t name specific prices; they ask about buying the item “at a price comparable to similar products you have bought.”
- New & Unique. The synthetic panel gave the average concept a score of 74% TTB, compared to 70% for the consumer panel. This suggests the synthetic model performs quite well at detecting novelty, which makes sense given how much organic consumer data it was trained on.
- Solves a Need. The synthetic panel averaged 71% TTB vs. 68% for the consumer panel. The three-point spread matches the relationship seen with Purchase Interest.
- Virality. The synthetic panel scored Likelihood to Share for the average product at 69% TTB, vs. 61% for the consumer panel. This metric had the widest gap between the AI and real consumers. However, self‑reported “I’ll tell others” tends to be conservative, so the higher synthetic score may actually be a closer proxy for real‑world word‑of‑mouth.
Rank Order of Concepts
The synthetic model’s predictions closely fit the real consumers’ ranking, with an R² of 0.70: the AI model explained about 70% of the variance in the order consumers chose. The remaining 30% reflects real‑world nuances the synthetic model didn’t capture.
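As a rough sketch of that comparison (with made-up TTB scores, not the study data), the two rank orders can be correlated directly with Spearman's rho; and for a one-predictor linear fit, R² is simply the squared correlation, so an R² of 0.70 corresponds to a correlation of roughly 0.84.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical Purchase Interest TTB scores for the same six concepts
# from each panel (illustrative values only).
synthetic_ttb = np.array([0.81, 0.64, 0.72, 0.55, 0.90, 0.68])
consumer_ttb = np.array([0.74, 0.66, 0.75, 0.58, 0.85, 0.62])

# Spearman's rho compares the two rank orders directly.
rho, p_value = spearmanr(synthetic_ttb, consumer_ttb)

# For a single-predictor linear fit, R^2 equals the squared Pearson r.
r = np.corrcoef(synthetic_ttb, consumer_ttb)[0, 1]
print(f"Spearman rho = {rho:.2f}, R^2 = {r**2:.2f}")
```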
Cycle Time Comparison
Traditional surveys averaged fourteen calendar days from brief to final report. The synthetic panels computed and returned a full dashboard of scores in under five minutes, enabling multiple rounds of testing within one workday.
Cost Efficiency
Median cost: $23,500 per test for online survey-based quantitative research, versus a $0-49 per month subscription for AI for CPG. Redirecting these savings can fund additional concept iterations, qualitative and quantitative testing (including in-home use tests), price sensitivity testing, or pilot marketing.
Study Context
This benchmark covers diverse concepts in food, beverage, and personal care. While scores may vary by category nuance, the improvements in speed and cost hold across segments.
Recommendations
Use the synthetic panel to rapidly screen early ideas, identifying which concepts deserve further investment. This will shave several months off the traditional process for the front end of innovation. Then, conduct traditional qualitative and quantitative testing to explore finalist ideas in more depth and build confidence in the plans.