SocialReasoning-Bench: AI and User Interest Advocacy

Key Takeaways

1. SocialReasoning-Bench evaluates the ability of AI agents to advocate for user interests in social contexts.

2. Current AI agents, even with explicit instructions, struggle to optimize outcomes for users.

3. The tested models, such as GPT-5.4, show progress with defensive prompting but remain inadequate.

Why it matters: AI agents must advocate reliably for the people they represent before they can be trusted to act in complex social contexts, and that reliability is crucial for their widespread adoption.

SocialReasoning-Bench: Evaluating the Social Reasoning of AI Agents

Artificial intelligence (AI) agents are increasingly integrated into social contexts, where they must manage schedules, negotiate purchases, or interact with other agents on behalf of a user. To accomplish these tasks, they require more than just technical skills: they need social reasoning. This is where SocialReasoning-Bench comes into play, a benchmark designed to evaluate this crucial capability.

Objectives and Methodology

SocialReasoning-Bench tests the ability of AI agents to negotiate on behalf of a user in two realistic contexts: Calendar Coordination and Market Negotiation. This benchmark measures not only the outcomes achieved but also the processes followed by the agents. It evaluates the optimality of outcomes (the value that agents manage to secure for the user) and due diligence (the competence of the decision-making process).

Current AI models, while generally able to complete tasks, often leave value on the table. For instance, they frequently accept suboptimal meeting times or poor offers instead of effectively advocating for the user's interests. Even with explicit instructions to act in the user's best interest, performance remains far below what a trustworthy delegate should achieve.

Calendar Coordination

In calendar coordination, an assistant agent manages a user's schedule for a single day and receives a meeting request from another agent. The assistant has access to a value function over time slots, which scores the user's scheduling preferences between 0 and 1. This function can be provided explicitly by the user or inferred from their calendar history.

The counterpart is a requesting agent representing another person who wishes to schedule a meeting with the user. This counterpart has its own value function, constructed as the inverse of the user's. Some requesters negotiate in good faith, while others use the interaction to extract private details from the calendar or push the assistant toward times the user does not want.

In each task, there exists a zone of possible agreement (ZOPA), a term borrowed from negotiation theory to denote the set of outcomes that both parties could accept. In calendar coordination, the ZOPA is the set of time slots that are mutually free on both calendars. Each task is designed so that the ZOPA contains at least three slots with different preference scores for the user, and the opening request from the requester always conflicts with the user's calendar.
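As a minimal sketch of these task constraints, assume hourly slots identified by their start hour and calendars stored as sets of busy slots. The representation, the function names, and every value below are illustrative assumptions, not the benchmark's actual code or data. "Different preference scores" is read here as at least three distinct scores inside the ZOPA.

```python
def zopa_slots(day_slots, user_busy, requester_busy):
    """Return the slots that are free on both calendars, in day order."""
    return [s for s in day_slots if s not in user_busy and s not in requester_busy]


def is_valid_task(zopa, user_value, opening_request, user_busy):
    """Check the two design conditions: at least three distinct user scores
    inside the ZOPA, and an opening request that conflicts with the user."""
    distinct_scores = {user_value[s] for s in zopa}
    return len(distinct_scores) >= 3 and opening_request in user_busy


# Hypothetical one-day example with hourly slots from 9:00 to 16:00.
day_slots = list(range(9, 17))
user_busy = {9, 12, 13}
requester_busy = {10, 16}

# Hypothetical preference scores (0 to 1) for the user's free slots.
user_value = {10: 0.4, 11: 0.9, 14: 0.5, 15: 0.2, 16: 0.7}

zopa = zopa_slots(day_slots, user_busy, requester_busy)
best_slot = max(zopa, key=user_value.get)  # slot the user prefers most
opening_request = 9                        # falls on a busy slot for the user

print(zopa, best_slot, is_valid_task(zopa, user_value, opening_request, user_busy))
```

In this toy day, the slots at 10:00 and 16:00 are free for the user but not for the requester, so they fall outside the ZOPA even though the user scores 16:00 fairly high.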

Market Negotiation

In market negotiation, a buyer agent representing a user negotiates with a seller agent to purchase a unique product. The user wants to pay as little as possible. Their value function is the gap between the agreed price and a private reservation price, the highest price they would be willing to pay. A larger gap captures more value, and an agreement above the reservation price captures none.

The counterpart is a seller agent with its own private reservation price, set below the buyer's. The counterpart's value function mirrors that of the user: higher agreement prices generate more value, and agreement prices below the seller's reservation price generate none.

The ZOPA is the price range between the seller's and the buyer's reservation prices. The seller's opening offer is always above the buyer's reservation price, forcing the buyer to negotiate the price down.
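The two value functions and the ZOPA check can be sketched as follows. This is one plausible reading of the description, with the value taken as the raw price gap floored at zero. The function names and the prices are hypothetical.

```python
def buyer_value(price, buyer_reservation):
    """Gap between the buyer's reservation price and the agreed price.
    An agreement above the reservation price captures no value."""
    return max(0.0, buyer_reservation - price)


def seller_value(price, seller_reservation):
    """Mirror image of the buyer's value: higher prices are better,
    and a price below the seller's reservation captures no value."""
    return max(0.0, price - seller_reservation)


def in_zopa(price, seller_reservation, buyer_reservation):
    """True when the price lies between the two reservation prices."""
    return seller_reservation <= price <= buyer_reservation


# Hypothetical task: the seller would go as low as 60, the buyer as high as 100.
seller_reservation = 60.0
buyer_reservation = 100.0
opening_offer = 120.0  # above the buyer's reservation, so outside the ZOPA

for price in (opening_offer, 100.0, 80.0, 60.0):
    print(price,
          in_zopa(price, seller_reservation, buyer_reservation),
          buyer_value(price, buyer_reservation),
          seller_value(price, seller_reservation))
```

Every price inside the ZOPA leaves both sides with non-negative value, and each unit the buyer negotiates off the price moves one unit of value from the seller to the buyer.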

New Metrics for a New Framework

Existing benchmarks often focus on task completion: was the meeting scheduled? Was the exchange concluded? In principal-agent contexts, what matters is not only whether the task is completed but also how it is accomplished. SocialReasoning-Bench introduces new measures to capture this distinction.

Optimality of Outcomes

The optimality of outcomes measures the share of available value that the agent captured for its principal, on a scale from 0 to 1. The outcome within the ZOPA most favorable to the principal scores 1, while the outcome most favorable to the counterpart scores 0. Intermediate outcomes are rated according to where the principal's value function places them between these two extremes.
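Read literally, this is a min-max normalization of the principal's value between the two ends of the ZOPA. The sketch below assumes that reading. The function name, the clamping of outcomes that fall outside the ZOPA, and the example numbers are assumptions rather than the benchmark's published formula.

```python
def outcome_optimality(principal_value, best_value, worst_value):
    """Share of available value captured for the principal, in [0, 1].

    best_value:  principal's value at the ZOPA outcome best for the principal.
    worst_value: principal's value at the ZOPA outcome best for the counterpart.
    """
    if best_value == worst_value:
        raise ValueError("ZOPA offers no range of value to normalize over")
    score = (principal_value - worst_value) / (best_value - worst_value)
    # Assumption: outcomes outside the ZOPA are clamped to the nearest end.
    return min(1.0, max(0.0, score))


# Hypothetical market task: buyer reservation 100, seller reservation 60.
# The buyer's value at price p is 100 - p, so best = 40 (p = 60), worst = 0 (p = 100).
deal_at_70 = outcome_optimality(100.0 - 70.0, best_value=40.0, worst_value=0.0)
deal_at_100 = outcome_optimality(100.0 - 100.0, best_value=40.0, worst_value=0.0)

# Hypothetical calendar task: ZOPA slots scored 0.9, 0.5, and 0.2 by the user.
middle_slot = outcome_optimality(0.5, best_value=0.9, worst_value=0.2)

print(deal_at_70, deal_at_100, round(middle_slot, 3))
```

Normalizing per task makes scores comparable across tasks whose ZOPAs have different widths.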

Optimality of outcomes alone conflates skill with luck. An agent that immediately accepts the first offer from a counterpart, without examining its situation or making a counterproposal, can still achieve a good score if the counterpart happens to propose a good outcome. To separate skill from luck, SocialReasoning-Bench introduces a process metric.

Due Diligence

Due diligence assesses the quality of the process on a scale from 0 to 1 by comparing the agent's actions at each decision point in the trajectory with the action that a reasonable deterministic agent policy would have taken in the same state. The reasonable agent policy is a greedy procedure that captures what a competent advocate would do at each step, such as gathering relevant context before acting, starting with a favorable position for its principal, and conceding only after better options have been exhausted. The due diligence score is the rate at which the agent's actual choices align with those of the reasonable agent throughout the trajectory.
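The agreement rate can be sketched as below. The state encoding, the action names, and the toy reference policy are all hypothetical stand-ins. The benchmark's actual reasonable-agent policy is not reproduced here, only its general shape: gather context first, propose the best remaining option, and concede last.

```python
def toy_reference_policy(state):
    """Hypothetical greedy policy over a simplified state.

    state is a pair (context_gathered, better_options_left).
    """
    context_gathered, better_options_left = state
    if not context_gathered:
        return "gather_context"
    if better_options_left > 0:
        return "propose_best_remaining"
    return "accept"


def due_diligence(trajectory, reference_policy):
    """Fraction of decision points where the agent's action matches the
    action the reference policy would take in the same state."""
    if not trajectory:
        raise ValueError("trajectory has no decision points")
    matches = sum(1 for state, action in trajectory
                  if reference_policy(state) == action)
    return matches / len(trajectory)


# Hypothetical trajectory: the agent concedes at the third step while one
# better option is still untried, so it matches the reference on 3 of 4 steps.
trajectory = [
    ((False, 2), "gather_context"),
    ((True, 2), "propose_best_remaining"),
    ((True, 1), "accept"),
    ((True, 0), "accept"),
]

score = due_diligence(trajectory, toy_reference_policy)
print(score)
```

Because the reference policy is deterministic, the same trajectory always yields the same score, which keeps the metric reproducible across runs.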

Together, the optimality of outcomes and due diligence form an operational notion of an agent's duty of care toward the person it represents. An agent that achieves a good outcome through a negligent process is fragile, while an agent that follows a good process but achieves a poor outcome indicates a lack of capability rather than negligence. Only an agent that scores well on both demonstrates strong social reasoning.

Experimental Setup

For the calendar assistant agent and the market buyer agent, SocialReasoning-Bench evaluates GPT-4.1 with chain of thought, GPT-5.4 at high reasoning effort, and both Claude Sonnet 4.6 and Gemini 3 Flash at high reasoning levels. The counterpart (the requester in calendar coordination and the seller in market negotiation) is always Gemini 3 Flash at medium reasoning effort, kept constant across all conditions so that any score differences reflect the tested model rather than the difficulty of its opponent.

Each model is run under two prompt conditions: Base Prompt, where the agent receives only role and tool descriptions, and Defensive Prompt, where the agent additionally receives explicit instructions to consult all available sources and advocate for the user toward the best possible outcome.

Each task runs for at most 10 negotiation rounds. The counterpart always makes the first offer in each task.
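A negotiation loop with this turn order and round cap might look like the sketch below. It assumes that one round consists of a counterpart offer followed by one agent reply, which the text does not specify. The callables, the return format, and the scripted players are hypothetical.

```python
MAX_ROUNDS = 10


def run_negotiation(counterpart_offer, agent_reply, max_rounds=MAX_ROUNDS):
    """Run a capped negotiation in which the counterpart moves first.

    counterpart_offer(round_idx, last_proposal) -> offer
    agent_reply(round_idx, offer) -> ("accept", None) or ("counter", proposal)
    Returns (agreed_offer, rounds_used); agreed_offer is None without a deal.
    """
    last_proposal = None
    for round_idx in range(1, max_rounds + 1):
        offer = counterpart_offer(round_idx, last_proposal)
        decision, proposal = agent_reply(round_idx, offer)
        if decision == "accept":
            return offer, round_idx
        last_proposal = proposal
    return None, max_rounds


# Scripted seller: opens at 120 and drops the price by 10 each round.
def scripted_seller(round_idx, last_proposal):
    return 120 - 10 * (round_idx - 1)


# Scripted buyer: counters with 70 until the offer is 80 or lower.
def patient_buyer(round_idx, offer):
    return ("accept", None) if offer <= 80 else ("counter", 70)


# Scripted buyer that never accepts, to exercise the round cap.
def stubborn_buyer(round_idx, offer):
    return ("counter", 10)


deal = run_negotiation(scripted_seller, patient_buyer)
no_deal = run_negotiation(scripted_seller, stubborn_buyer)
print(deal, no_deal)
```

The patient buyer closes at 80 in the fifth round, while the stubborn buyer exhausts all 10 rounds without an agreement.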

What We Discovered

Discovery 1: Agents complete tasks at near-perfect rates but produce poor outcomes.

In calendar coordination, agents almost always succeed in booking the meeting, but most often at suboptimal times. In market negotiation, agreements are almost always reached, but frequently at the worst possible price. Tasks are completed, but not completed well: task completion signals success, while the optimality of outcomes reveals a consistent failure to act in the principal's best interest.

Figure 2: Task Completion vs. Optimality of Outcomes by model and domain. All models complete tasks at near-perfect rates but produce poor outcomes. The optimality of outcomes was measured under both prompts, base and defensive. Defensive prompting helps but does not close the gap.

Discovery 2: Defensive prompting helps, but is not enough to close the gap.

When we explicitly instruct agents to work hard on behalf of their principal, outcomes improve in both domains, but not by enough to close the gap. GPT-5.4 benefits the most from defensive prompting (+0.21 in calendar coordination, +0.12 in market negotiation), while GPT-4.1 barely responds in either domain. Other models fall somewhere in between.

Discovery 3: The optimality of outcomes shows how much value agents leave on the table.

The optimality of outcomes reflects where each agreement sits within the ZOPA. When we plot the outcomes, they cluster closer to the counterpart's ideal than to the principal's.

Figure 3: Distribution of Optimality of Outcomes (OO) by model and domain. Each point is a task instance. OO=1.0 means the agent captured all available value for its principal; OO=0.0 means the counterpart captured everything.