LLMs: A Credible Alternative to Traditional Surveys?

Key Takeaways

1. Recent studies show that LLMs can reproduce the average responses of household surveys to within one percentage point.

2. Despite this accuracy on averages, LLMs fail to reproduce the diversity of individual responses observed in real surveys.

3. Unlearning methods have been tested to widen the dispersion of LLM responses, but challenges remain in faithfully mimicking RCTs.

💡 Why it matters: The use of LLMs could revolutionize data collection, but their ability to reflect the diversity of opinions remains limited.

Large Language Models

The idea of replacing human respondents with large language models (LLMs) in economic surveys is attracting growing interest. Recent research has explored the ability of these models to simulate the responses of 6,000 American households to economic questions, such as those about inflation. The results are promising: LLMs can reproduce the average responses from major household surveys to within one percentage point, according to a study by Zarifhonarvar in 2026. For example, the 2020 Survey of Consumer Expectations (SCE) reported a median one-year-ahead inflation expectation of about 3%, a figure that LLMs also reproduce when properly prompted. This accuracy could make LLMs an interesting and cost-effective complement to traditional surveys such as the SCE, the Michigan Survey, and the Survey of Professional Forecasters.

However, a recent paper titled Can LLMs Mimic Household Surveys?, co-authored by Ami Dalloul of the University of Duisburg-Essen, highlights a major limitation of LLMs. While these models hit the survey median accurately, they fail to capture the diversity of individual responses. For instance, the Llama-3 model places 95% of its simulated respondents within a narrow range of two percentage points, whereas actual SCE responses in 2020 ranged from -25% to +27%. In other words, the average may be correct, but the diversity of opinions underlying it is absent, which reduces the simulation to a single representative agent.
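The gap between a correct median and a collapsed distribution can be made concrete with a dispersion measure. The sketch below is only an illustration: the two samples are randomly generated placeholders, not the study's data, and the function name central_range_width and the chosen spreads (0.4 and 8.0) are assumptions made for this example.

```python
import random
import statistics


def central_range_width(responses, coverage_cuts=40):
    """Width of the central 95% interval of a list of responses.

    statistics.quantiles with n=40 returns cut points at 2.5%, 5%, ..., 97.5%,
    so the first and last cut points bound the central 95% of the sample.
    """
    cuts = statistics.quantiles(responses, n=coverage_cuts)
    return cuts[-1] - cuts[0]


rng = random.Random(0)  # fixed seed so the illustration is reproducible

# Hypothetical "LLM-like" sample: tightly clustered around 3%.
narrow_sample = [rng.gauss(3.0, 0.4) for _ in range(6000)]

# Hypothetical "survey-like" sample: same centre, much wider disagreement.
wide_sample = [rng.gauss(3.0, 8.0) for _ in range(6000)]

narrow_median = statistics.median(narrow_sample)
wide_median = statistics.median(wide_sample)
narrow_width = central_range_width(narrow_sample)
wide_width = central_range_width(wide_sample)

print(f"medians: {narrow_median:.2f} vs {wide_median:.2f}")
print(f"central 95% widths: {narrow_width:.2f} vs {wide_width:.2f}")
```

Both samples share nearly the same median, so a comparison based on central tendency alone would not distinguish them. Only the width of the interval reveals that one of them behaves like a single agent answering repeatedly.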

Unlearning in LLMs

To address this limitation, one approach is to remove certain memorized statistics from the model itself, rather than merely prompting it to ignore them. Two unlearning methods have been applied to the Llama-3.1-8B-Instruct model, an open-weight model whose internal weights can be modified. The first method, Gradient Ascent (GA), maximizes the prediction loss on a specific dataset while preserving the model's general ability to reason about micro-surveys. The second method, Negative Preference Optimization (NPO), treats the data to be forgotten as non-preferred completions and minimizes a preference loss measured relative to a reference model.
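To make the two objectives concrete, here is a minimal sketch that evaluates them on plain numbers. It assumes the commonly published form of the NPO loss, -(2/β) · mean(log σ(-β · (log π_θ − log π_ref))). The study's own hyperparameters, retain-set terms, and training code are not described here, so the function names, the default beta, and the sample log-probabilities are all hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def gradient_ascent_loss(logp_forget):
    """Quantity to minimize under Gradient Ascent.

    logp_forget holds log pi_theta(y|x) for each forget-set sequence.
    The prediction loss is -mean(logp); maximizing it is the same as
    minimizing mean(logp), which is what this function returns.
    """
    return sum(logp_forget) / len(logp_forget)


def npo_loss(logp_forget, logp_reference, beta=0.1):
    """NPO loss relative to a frozen reference model (assumed standard form)."""
    total = 0.0
    for logp, logp_ref in zip(logp_forget, logp_reference):
        # The loss shrinks as the current model assigns less probability
        # to the forget-set completion than the reference model does.
        total += log_sigmoid(-beta * (logp - logp_ref))
    return -(2.0 / beta) * total / len(logp_forget)


# Hypothetical sequence log-probabilities, before and after some unlearning.
reference = [-1.0, -2.0, -1.5]
before = [-1.0, -2.0, -1.5]   # identical to the reference model
after = [-5.0, -6.0, -4.5]    # forget-set completions made less likely

print(gradient_ascent_loss(before), gradient_ascent_loss(after))
print(npo_loss(before, reference), npo_loss(after, reference))
```

One design difference is visible in the numbers: the GA objective is unbounded and keeps falling as the log-probabilities drop, whereas the NPO loss is bounded below by zero and flattens out once the model has moved far from the reference.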

The data targeted for forgetting includes the official inflation record: the monthly Consumer Price Index (CPI) series and the average inflation expectations published by the SCE and Michigan surveys. The effects of these unlearning methods on the distribution of responses are detailed in Table 1 of the study.

Simulating a Randomized Controlled Trial

While broadening the distribution of responses is a step forward, it is not sufficient to achieve the ultimate goal of the study: to reproduce survey-based randomized controlled trials (RCTs) with synthetic respondents. RCTs are costly, and once the data is collected, a researcher cannot go back to test new theories or adjust treatments. Synthetic agents could potentially fill this gap, provided their behavior accurately reflects that of human respondents.

To evaluate this capability, the study replicated a real RCT conducted by Coibion, Gorodnichenko, and Weber in 2022. In this trial, respondents were randomly assigned to several groups: a control group with no information, several treatment groups each receiving different economic information, and a placebo group exposed to content unrelated to inflation. Participants first reported their prior inflation expectation, then received the assigned information, and finally reported a new, posterior expectation. The difference between these two expectations is the respondent's revision.

A treatment is deemed effective if the observed revisions differ significantly from those of the control group and if the direction of change matches what economic reasoning predicts: for example, downward revisions after a communication from the FOMC, or upward revisions following news of rising gasoline prices. The test for synthetic agents is whether their revisions behave like those of human respondents.
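The effectiveness check described above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the study's procedure: the responses are invented toy numbers, the Welch t-statistic and the 1.96 cutoff (a large-sample normal approximation) are choices made for this example, and all function names are hypothetical.

```python
import math
import statistics


def revisions(priors, posteriors):
    """Revision per respondent: posterior expectation minus prior expectation."""
    return [post - prior for prior, post in zip(priors, posteriors)]


def welch_t(sample_a, sample_b):
    """Welch's t-statistic for a difference in means with unequal variances."""
    variance_a = statistics.variance(sample_a)
    variance_b = statistics.variance(sample_b)
    std_error = math.sqrt(variance_a / len(sample_a) + variance_b / len(sample_b))
    return (statistics.fmean(sample_a) - statistics.fmean(sample_b)) / std_error


def treatment_is_effective(treated_rev, control_rev, expected_sign, t_cutoff=1.96):
    """True if revisions differ from control and move in the expected direction.

    expected_sign is -1 for an expected downward revision, +1 for upward.
    t_cutoff is an assumed threshold, crude for samples this small.
    """
    effect = statistics.fmean(treated_rev) - statistics.fmean(control_rev)
    significant = abs(welch_t(treated_rev, control_rev)) > t_cutoff
    return significant and effect * expected_sign > 0


# Toy inflation expectations in percent (made up for illustration).
control_rev = revisions([3.0, 5.0, 2.0, 8.0, 4.0, 6.0],
                        [3.0, 5.5, 2.0, 7.5, 4.0, 6.0])
treated_rev = revisions([3.0, 5.0, 2.0, 8.0, 4.0, 6.0],
                        [2.5, 3.5, 2.0, 5.0, 3.0, 4.5])

print("mean control revision:", statistics.fmean(control_rev))
print("mean treated revision:", statistics.fmean(treated_rev))
print("effective (downward expected):",
      treatment_is_effective(treated_rev, control_rev, expected_sign=-1))
```

Running the same function on revisions produced by synthetic agents and on those from human respondents gives a like-for-like comparison of whether each treatment is effective and in which direction.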

Conclusion

For researchers and practitioners considering using LLMs to conduct surveys, several key points emerge:

LLMs fail to reproduce the diversity of personas. Survey simulation boils down to a single agent answering the same question thousands of times and returning a very precise average each time, sometimes to four decimal places.
Targeted unlearning methods recover a significant portion of the response dispersion and a substantial share of the treatment effect, although challenges remain in accurately mimicking RCTs.

In summary, LLMs offer a promising alternative for data collection, but their ability to reflect the diversity of human opinions remains limited, and further improvements are needed before they can be used as a full substitute for human respondents.
