SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering

Challenge: Agent Out-of-Sync

Consider a human-AI collaboration scenario:
• While the Agent implements changes based on its understanding at time Ti, the Human modifies the codebase at Tj (Ti < Tj < Tk).
• The Agent's subsequent update at Tk becomes incompatible with the current state Sk because its belief state Bk is outdated.
• This raises the critical challenge: how can collaborators effectively recognize that their belief is out-of-sync (Bk ≠ Sk), diagnose the root causes, and recover their belief Bk to match the world state Sk?
Agent Out-of-Sync Visualization
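
The scenario above can be made concrete with a toy model. The sketch below is only an illustration, not the SyncMind implementation: it assumes a codebase can be reduced to a mapping from function name to signature, and the names `Codebase`, `out_of_sync`, `load_config`, and `run` are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Codebase = Dict[str, str]  # hypothetical model: function name -> signature


def out_of_sync(belief: Codebase, state: Codebase) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (believed, actual)} for every entry where B_k differs from S_k."""
    names = sorted(set(belief) | set(state))
    return {
        name: (belief.get(name), state.get(name))
        for name in names
        if belief.get(name) != state.get(name)
    }


# T_i: the agent reads the codebase and forms its belief.
state_i: Codebase = {"load_config": "load_config(path)", "run": "run(cfg)"}
belief = dict(state_i)

# T_j: the human changes a signature; the agent is not notified.
state_k = dict(state_i)
state_k["load_config"] = "load_config(path, strict)"

# T_k: the agent's belief B_k no longer matches the world state S_k.
mismatch = out_of_sync(belief, state_k)
print(mismatch)

# Recovery (changed entries only): refresh stale entries so that B_k == S_k again.
belief.update({name: actual for name, (_, actual) in mismatch.items() if actual is not None})
print(out_of_sync(belief, state_k))
```

In this toy setting recognition, diagnosis, and recovery collapse into a dictionary comparison. In a real repository the agent has to infer the mismatch from indirect signals such as failing tests or error messages, which is what makes the problem hard.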

SyncMind: Agent Out-of-Sync Recovery Framework

Agent Out-of-Sync Recovery:
Tackling the challenge of agent out-of-sync in collaborative software engineering, we propose SyncMind, a framework that systematically evaluates agent out-of-sync recovery in collaborative scenarios.

SyncMind Framework

Resource-Aware Out-of-Sync Recovery:
We integrate a resource-aware recovery module into SyncMind to evaluate agents' awareness of temporal and financial resources.

SyncMind Framework Modules

SyncBench: Agent Out-of-Sync Benchmark

To systematically evaluate the out-of-sync recovery capabilities of LLM-powered agents, we construct SyncBench, a benchmark featuring agent out-of-sync in collaborative software engineering.

SyncBench Construction

Evaluation: Agent Out-of-Sync Recovery

Recovery Ability: Out-of-Sync Recovery

We evaluate LLM agents' out-of-sync recovery abilities through five complementary metrics:

SR : success rate
LA : localization accuracy
CSR : conditional success rate
ASR : assistance seeking rate
Eff : recovery efficiency
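
As a rough illustration of how these five metrics could be aggregated from per-task logs, consider the sketch below. The exact definitions are given in the paper; the formulas here are assumptions made for the example (CSR as the success rate among correctly localized tasks, ASR as the share of turns spent asking for help, Eff as the mean number of turns over successful recoveries), and the `RecoveryRun` record is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RecoveryRun:
    """Hypothetical per-task log entry; not the SyncBench schema."""
    success: bool     # recovery succeeded
    localized: bool   # agent located the out-of-sync cause
    turns: int        # turns used in this run
    help_turns: int   # turns spent asking the collaborator


def summarize(runs: List[RecoveryRun]) -> Dict[str, float]:
    """Aggregate assumed versions of SR, LA, CSR, ASR (percent) and Eff (turns)."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    localized = [r for r in runs if r.localized]
    solved = [r for r in runs if r.success]
    total_turns = sum(r.turns for r in runs)
    return {
        "SR": 100.0 * len(solved) / n,
        "LA": 100.0 * len(localized) / n,
        # Assumed: success rate restricted to runs with correct localization.
        "CSR": 100.0 * sum(r.success for r in localized) / len(localized) if localized else 0.0,
        # Assumed: fraction of all turns in which the agent asked for help.
        "ASR": 100.0 * sum(r.help_turns for r in runs) / total_turns if total_turns else 0.0,
        # Assumed: mean turns over successful runs (lower is more efficient).
        "Eff": sum(r.turns for r in solved) / len(solved) if solved else float("nan"),
    }


runs = [
    RecoveryRun(success=True, localized=True, turns=10, help_turns=1),
    RecoveryRun(success=False, localized=True, turns=30, help_turns=0),
    RecoveryRun(success=False, localized=False, turns=30, help_turns=0),
    RecoveryRun(success=True, localized=True, turns=20, help_turns=2),
]
metrics = summarize(runs)
print({name: round(value, 2) for name, value in metrics.items()})
```

With these four made-up runs, SR is 50%, LA is 75%, and CSR is about 66.67%, showing how CSR separates "found the cause" from "fixed it".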

Collaboration Ability: Collaborative Out-of-Sync Recovery

Experimental results reveal significant limitations in LLM agents' collaboration capabilities:

• Willingness to collaborate
• Communication quality
• Strategic out-of-sync recovery

Resource Awareness: Resource-Aware Out-of-Sync Recovery

Resource-aware out-of-sync recovery reveals fundamental limitations in LLM agents' resource awareness, providing insights for the future development of resource-efficient collaborative systems:

• Time management
• Cost sensitivity
• Resource-efficient collaboration

Key Findings

(1) Significant Ability Gaps Among Different LLM Agents

We observe significant variations in different LLM agents' out-of-sync recovery performance.

  • Viewing the experimental results on Caller and Callee separately, agents' recovery performance ranges from Llama-3.1 agents (SR<=4.00%) to Claude-3.5-Sonnet (SR>=25.41%).
  • These gaps remain large across varying task complexity and recovery settings (see the Appendix of our paper for details).
Ability Gaps Among LLM Agents
(2) Beneficial Collaborator Assistance In Agent Recovery Success

Collaborator assistance has a beneficial impact on agents' out-of-sync recovery success.

  • Comparing LLM agents' out-of-sync recovery performance between independent (deeper colors) and collaborative (lighter colors) recoveries, collaborator assistance generally improves agents' recovery success.
  • The positive effects of collaborator assistance grow stronger as task complexity increases.
  • The effectiveness of collaborator assistance hinges not only on agents' collaborative willingness, but also on their communication quality and strategy. These aspects also significantly affect agents' localization efficiency and recovery success.
    • Collaborative willingness: LLM agents generally show limited collaboration initiative (ASR<=4.86%).
    • Question quality: Higher question quality correlates positively with agents' localization accuracy and recovery success.
    • Recovery strategy: Early environment exploration exhibits beneficial influence on recovery success, underlining the significance of strategic out-of-sync recovery.
(3) LLM Agents' Lack of Collaboration Willingness
  • Our ASR measurements reveal that existing LLM agents lack willingness to collaborate (ASR<=4.86%).
  • Increased collaboration willingness is positively associated with agents' recovery success.
(4) LLM Agents' Lack of Resource Awareness

Our resource-aware out-of-sync recovery experiments evaluate agents' resource awareness in both temporal and financial dimensions.

  • Resource-aware out-of-sync recovery:
    • Temporal awareness: We extend the maximum time limit for out-of-sync recovery from 30 turns to 50 turns.
    • Financial awareness: We adjust the hypothetical total budget and action cost to evaluate agents' financial resource awareness.
      • Budget awareness: We triple the hypothetical total budget from $1000 (insufficient for 30-turn recovery) to $3000 (sufficient for 30-turn recovery under any action-taking pattern).
      • Cost awareness: We halve and double the cost of seeking collaborator assistance, respectively.
  • The minimal differences in agents' SR scores uncover existing LLM agents' general lack of resource awareness, despite notable benefits obtained from collaborator assistance.
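
To make the budget and cost settings more tangible, the sketch below counts how many turns an agent could afford under a given budget. It is an illustration only: the per-action costs (`SOLO_COST`, `ASK_COST`) are hypothetical values, chosen so that $1000 cannot cover 30 turns while $3000 can, and they are not the costs used in our experiments.

```python
from typing import Iterable

MAX_TURNS = 30     # default turn limit described above

SOLO_COST = 50.0   # hypothetical cost of one independent action
ASK_COST = 100.0   # hypothetical cost of one request for collaborator assistance


def affordable_turns(budget: float, costs: Iterable[float], max_turns: int = MAX_TURNS) -> int:
    """Count how many planned actions fit within both the budget and the turn limit."""
    spent = 0.0
    turns = 0
    for cost in costs:
        if turns >= max_turns or spent + cost > budget:
            break
        spent += cost
        turns += 1
    return turns


all_solo = [SOLO_COST] * MAX_TURNS
all_ask = [ASK_COST] * MAX_TURNS

# Budget awareness: $1000 runs out early, $3000 covers every pattern shown here.
print(affordable_turns(1000, all_solo))   # 20
print(affordable_turns(3000, all_solo))   # 30
print(affordable_turns(3000, all_ask))    # 30

# Cost awareness: halving or doubling the assistance cost under the $1000 budget.
print(affordable_turns(1000, [ASK_COST / 2] * MAX_TURNS))  # 20
print(affordable_turns(1000, [ASK_COST * 2] * MAX_TURNS))  # 5
```

A resource-aware agent would be expected to change its behavior as these numbers change, for example by asking for help earlier when assistance is cheap. The minimal differences in SR reported above suggest that current agents largely do not.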

Resources

Paper

Check out our paper to view more details about SyncMind and SyncBench.

Paper

Data

Access our agent out-of-sync benchmark with two datasets: Caller and Callee.

Dataset

Code

View our implementation of SyncMind and SyncBench for the out-of-sync challenge.

GitHub

BibTeX

@article{guo2025syncmind,
  title={SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering},
  author={Guo, Xuehang and Wang, Xingyao and Chen, Yangyi and Li, Sha and Han, Chi and Li, Manling and Ji, Heng},
  journal={arXiv preprint arXiv:2502.06994},
  year={2025}
}