SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering

Challenge: Agent Out-of-Sync

Consider a human-AI collaboration scenario:

• While Agent implements changes based on its understanding at time T_i, Human modifies the codebase at T_j (T_i < T_j < T_k)

• Agent's subsequent update at T_k becomes incompatible with the current state S_k due to its outdated belief state B_k

This raises the critical challenge: How can collaborators effectively recognize that their belief is out-of-sync (B_k ≠ S_k), diagnose the root causes, and recover their belief B_k to match the world state S_k?
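The scenario above can be illustrated with a toy model. This is only a minimal sketch under stated assumptions: the codebase is reduced to a mapping from symbol names to signatures, and the function `out_of_sync` and the example symbols are hypothetical, not part of SyncMind.

```python
def out_of_sync(belief: dict[str, str], world: dict[str, str]) -> set[str]:
    """Return the symbols on which the belief B_k and the world state S_k disagree."""
    names = belief.keys() | world.keys()
    return {name for name in names if belief.get(name) != world.get(name)}


# T_i: the agent snapshots the codebase, so its belief matches the world.
world = {"parse": "parse(text)", "save": "save(obj, path)"}
belief = dict(world)

# T_j: the human changes a signature; the agent's belief is now stale.
world["parse"] = "parse(text, strict)"

# T_k: before committing its update, the agent compares belief and world.
stale = out_of_sync(belief, world)
print(stale)  # {'parse'}

# Recovery: refresh the belief from the world state so that B_k == S_k.
belief = dict(world)
print(out_of_sync(belief, world))  # set()
```

In a real repository the agent cannot diff its belief against the world directly; it has to infer the mismatch from signals such as failing tests or error messages, which is what makes recognition and diagnosis hard.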

Agent Out-of-Sync Recovery

Evaluation

SyncMind: Agent Out-of-Sync Recovery Framework

Agent Out-of-Sync Recovery:
Tackling the challenge of agent out-of-sync in collaborative software engineering, we propose SyncMind, a framework that systematically evaluates agent out-of-sync recovery in collaborative scenarios.

Resource-Aware Out-of-Sync Recovery:
We integrate a resource-aware recovery module into SyncMind to evaluate agents' awareness of temporal and financial resources.

SyncBench: Agent Out-of-Sync Benchmark

To systematically evaluate the out-of-sync recovery capabilities of LLM-powered agents, we construct SyncBench, a benchmark featuring agent out-of-sync in collaborative software engineering.

Evaluation: Agent Out-of-Sync Recovery

Recovery Ability: Out-of-Sync Recovery

We evaluate LLM agents' out-of-sync recovery abilities through five complementary metrics:

• SR: success rate

• LA: localization accuracy

• CSR: conditional success rate

• ASR: assistance seeking rate

• Eff: recovery efficiency
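As a rough illustration of how such metrics could be aggregated, the sketch below computes four of them from per-task records. The definitions used here are assumptions, since this page does not spell them out: SR is the share of recovered tasks, LA the share of tasks where the out-of-sync code was localized, CSR the success rate among localized tasks, and ASR the share of turns spent asking the collaborator. The `Episode` layout and field names are hypothetical, and Eff is left out because its formula is not given here; see the paper for the exact definitions.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One recovery attempt (hypothetical record layout)."""
    localized: bool   # agent identified the out-of-sync code
    recovered: bool   # agent's final update passed the task's checks
    turns: int        # total turns taken
    ask_turns: int    # turns spent asking the collaborator


def rate(numerator: int, denominator: int) -> float:
    """Percentage, defined as 0.0 when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else 0.0


def summarize(episodes: list[Episode]) -> dict[str, float]:
    """Aggregate the assumed metric definitions over a list of episodes."""
    localized = [e for e in episodes if e.localized]
    return {
        "SR": rate(sum(e.recovered for e in episodes), len(episodes)),
        "LA": rate(len(localized), len(episodes)),
        "CSR": rate(sum(e.recovered for e in localized), len(localized)),
        "ASR": rate(sum(e.ask_turns for e in episodes),
                    sum(e.turns for e in episodes)),
    }


episodes = [
    Episode(localized=True, recovered=True, turns=10, ask_turns=1),
    Episode(localized=True, recovered=False, turns=30, ask_turns=0),
    Episode(localized=False, recovered=False, turns=30, ask_turns=0),
    Episode(localized=False, recovered=False, turns=30, ask_turns=1),
]
print(summarize(episodes))
# {'SR': 25.0, 'LA': 50.0, 'CSR': 50.0, 'ASR': 2.0}
```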

Collaboration Ability: Collaborative Out-of-Sync Recovery

Experiment results reveal significant limitations in LLM agents' collaboration capabilities:

• Willingness to collaborate

• Communication quality

• Strategic out-of-sync recovery

Resource Awareness: Resource-Aware Out-of-Sync Recovery

Resource-aware out-of-sync recovery reveals fundamental limitations in LLM agents' resource awareness, providing insights for the future development of resource-efficient collaborative systems:

• Time management

• Cost sensitivity

• Resource-efficient collaboration

Key Findings

(1) Significant Ability Gaps Among Different LLM Agents

We observe significant variations in different LLM agents' out-of-sync recovery performance.

Viewing experiment results on Caller and Callee separately, agents' recovery performance ranges from Llama-3.1 agents (SR<=4.00%) to Claude-3.5-Sonnet (SR>=25.41%).
These gaps persist across task complexities and recovery settings (see the Appendix of our paper for details).

(2) Beneficial Collaborator Assistance In Agent Recovery Success

Collaborator assistance demonstrates beneficial impact on agents' out-of-sync recovery success.

Comparing LLM agents' out-of-sync recovery performance between independent (deeper colors) and collaborative (lighter colors) recoveries, collaborator assistance by and large improves agents' recovery success.
The positive effects of collaborator assistance grow stronger as task complexity increases.
The effectiveness of collaborator assistance hinges not only on agents' collaborative willingness, but also on their communication quality and strategy. These aspects also significantly affect agents' localization efficiency and recovery success.

Collaborative willingness: LLM agents generally show limited collaboration initiative (ASR<=4.86%).
Question quality: Higher question quality correlates positively with agents' localization accuracy and recovery success.
Recovery strategy: Early environment exploration exhibits beneficial influence on recovery success, underlining the significance of strategic out-of-sync recovery.

(3) LLM Agents' Lack of Collaboration Willingness

Our calculation of ASR reveals existing LLM agents' lack of willingness to collaborate (ASR<=4.86%).
An increase in agents' collaboration willingness is positively associated with their recovery success.

(4) LLM Agents' Lack of Resource Awareness

Our resource-aware out-of-sync recovery experiments evaluate agents' resource awareness in both temporal and financial dimensions.

Resource-aware out-of-sync recovery:

Temporal awareness: We extend the maximum time limit for out-of-sync recovery from 30 turns to 50 turns.
Financial awareness: We adjust the hypothetical total budget and action cost to evaluate agents' financial resource awareness.

Budget awareness: We triple the hypothetical total budget from $1000 (insufficient for 30-turn recovery) to $3000 (sufficient for 30-turn recovery under any action-taking pattern).
Cost awareness: We halve and double the cost of seeking collaborator assistance, respectively.

The minimal differences in agents' SR scores uncover existing LLM agents' general lack of resource awareness, despite notable benefits obtained from collaborator assistance.
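The experimental settings above can be pictured as variants of one resource configuration. The sketch below is illustrative only: the turn limits and budgets come from the text, while the per-action costs (`act_cost`, `ask_cost`), the `ResourceConfig` class, and the `turns_completed` helper are hypothetical values and names chosen so that $1000 cannot cover 30 turns and $3000 can.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ResourceConfig:
    max_turns: int = 30       # time limit from the text
    budget: float = 1000.0    # hypothetical total budget from the text
    act_cost: float = 50.0    # assumed cost of an ordinary action
    ask_cost: float = 100.0   # assumed cost of asking the collaborator


BASE = ResourceConfig()
VARIANTS = {
    "baseline": BASE,
    "more_time": replace(BASE, max_turns=50),                 # 30 -> 50 turns
    "more_budget": replace(BASE, budget=3 * BASE.budget),     # $1000 -> $3000
    "cheap_help": replace(BASE, ask_cost=BASE.ask_cost / 2),  # halved ask cost
    "costly_help": replace(BASE, ask_cost=BASE.ask_cost * 2), # doubled ask cost
}


def turns_completed(cfg: ResourceConfig, actions: list[str]) -> int:
    """Count how many planned actions fit within the turn limit and the budget."""
    remaining = cfg.budget
    done = 0
    for action in actions[: cfg.max_turns]:
        cost = cfg.ask_cost if action == "ask" else cfg.act_cost
        if cost > remaining:
            break  # budget exhausted before the turn limit
        remaining -= cost
        done += 1
    return done


plan = ["act"] * 50
for name, cfg in VARIANTS.items():
    print(name, turns_completed(cfg, plan))
```

Under these assumed costs, an agent that ignores the budget runs out of money before the turn limit in the baseline setting, which is the kind of behavior a resource-aware agent would be expected to adapt to.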

Resources

Paper

Check out our paper to view more details about SyncMind and SyncBench.


Data

Access our agent out-of-sync benchmark with two datasets: Caller and Callee.


Code

View our implementation of SyncMind and SyncBench for the out-of-sync challenge.


BibTeX

@article{guo2025syncmind,
  title={SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering},
  author={Guo, Xuehang and Wang, Xingyao and Chen, Yangyi and Li, Sha and Han, Chi and Li, Manling and Ji, Heng},
  journal={arXiv preprint arXiv:2502.06994},
  year={2025}
}