In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current AI systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table & figure analysis. To quantify these challenges, we introduce ANABENCH, a large-scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose ANAGENT, a multi-agent framework for enhanced scientific table & figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. Comprehensive evaluation across 9 broad domains with 170 subdomains demonstrates that ANAGENT achieves substantial improvements, up to +13.43% in training-free settings and +42.12% with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table & figure analysis.
Figure 2: Challenges In Scientific Table & Figure Analysis
Figure 8: Preliminary Study
Scientific tables and figures are highly heterogeneous in format (e.g., LaTeX, XML), modality (e.g., text, image), and structural complexity (e.g., nested hierarchies, mathematical notation, domain-specific conventions). Effective analysis requires: (1) Multimodal understanding to interpret charts, plots, diagrams, and cross-reference elements; (2) Structural parsing to extract data from varying layouts; (3) Adaptive reasoning strategies that adjust to the analysis objective, since methodology-oriented tasks require accurate understanding of domain-specific knowledge and novel method design, while experiment-oriented tasks demand professional interpretation and reasoning grounded in empirical evidence; (4) Task-specific depth calibration between shallow descriptive analysis and in-depth inferential synthesis.
High-quality analysis cannot rely on tables and figures in isolation; it demands systematic evidence integration across extensive document contexts: (1) Captions provide essential contextual framing and variable definitions; (2) Related sections (e.g., methodologies, experimental results, discussions and findings) contain critical information and domain knowledge; (3) Cross-references to other sections and documents extend the context and make it harder to understand; (4) Citations to prior work ground findings in existing literature. Models need to maintain coherence across thousands of tokens while selectively retrieving relevant information from long, heterogeneous scientific documents.
Scientific analysis requires more than language understanding; it demands deep domain and task-specific expertise: (1) Specialized terminology with precise technical meanings that vary across disciplines; (2) Domain conventions for data presentation, statistical reporting, and experimental design; (3) Background knowledge of established theories, methodologies, and research paradigms within each field; (4) Interpretive frameworks grounded in disciplinary norms across 9 broad domains and 170 fine-grained subdomains. Without domain-specific knowledge, models produce generic descriptions or even inaccurate content rather than scientifically meaningful analysis.
Scientific table and figure analysis presents multifaceted challenges that demand comprehensive evaluation:
63,178 instances across 9 domains, 170 subdomains, categorized along 7 complexity dimensions
Figure 3: Four-Stage Benchmark Construction for ANABENCH
Multimodal & Multi-Layout Scientific Content: Instances contain tables, figures, or both combined, and different types of data require distinct parsing strategies. For example, tables tend to demand structural understanding, while figures rely heavily on accurate visual interpretation.
Scientific Discipline Coverage: Spans 9 broad domains (Computer Science, Electrical Engineering, Mathematics, Physics, Economics, Quantitative Biology, Quantitative Finance, Statistics, Biomedicine) with 170 fine-grained subdomains. Each domain has unique terminology, conventions, knowledge, etc., that demand adaptive reasoning and analysis.
Source Representation: Scientific content can appear in different formats, such as LaTeX and XML. These different formats greatly affect parsing complexity and information extraction strategies essential for accurate analysis.
Publication Type: Distinguishes general research papers from review & survey papers. Publication type greatly influences the structure and layout of the paper, the interpretation of its contents, and the density of cross-references and contextual dependencies.
Information Scope and Context Breadth: Self-contained analysis uses only the data provided; Internal references require related sections within the paper; External references involve cross-document retrieval and comprehension; and Mixed involves multiple contextual sources. As wider scope requires sophisticated long-context understanding and multi-hop reasoning, analysis width is one of the essential dimensions for accurate scientific analysis.
Analytical Rigor and Reasoning Complexity: Shallow analysis involves surface-level description and factual summarization from provided contents, while in-depth analysis requires inferential reasoning, cross-evidence synthesis, identifying implicit patterns, drawing domain-grounded conclusions, and generating insights beyond literal interpretation.
Analysis Purpose and Focus: Methodology-oriented analysis examines approaches, techniques, experimental designs, and procedural details. On the other hand, experiment-oriented analysis focuses on results interpretation, performance comparisons, statistical significance, and empirical findings. Different objectives demand distinct reasoning strategies and domain knowledge, posing significant challenges to both high-level planning and analysis implementation.
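The seven dimensions above can be read as a tagging schema for each benchmark instance. The following is a minimal sketch of such a schema, not the ANABENCH data format: the class names, field names, and enum values are hypothetical and chosen only to mirror the categories described in the text.

```python
from dataclasses import dataclass, fields
from enum import Enum


# All names below are illustrative assumptions, not the released ANABENCH schema.
class Content(Enum):
    TABLE = "table"
    FIGURE = "figure"
    BOTH = "table+figure"


class SourceFormat(Enum):
    LATEX = "latex"
    XML = "xml"


class PublicationType(Enum):
    RESEARCH = "research"
    REVIEW_SURVEY = "review_survey"


class Scope(Enum):
    SELF_CONTAINED = "self_contained"  # only the provided data
    INTERNAL = "internal"              # related sections of the same paper
    EXTERNAL = "external"              # cross-document retrieval
    MIXED = "mixed"                    # multiple contextual sources


class Depth(Enum):
    SHALLOW = "shallow"
    IN_DEPTH = "in_depth"


class Purpose(Enum):
    METHODOLOGY = "methodology"
    EXPERIMENT = "experiment"


@dataclass(frozen=True)
class InstanceTags:
    content: Content
    domain: str  # one of the 9 broad domains
    source_format: SourceFormat
    publication_type: PublicationType
    scope: Scope
    depth: Depth
    purpose: Purpose


def needs_retrieval(tags: InstanceTags) -> bool:
    """Anything beyond self-contained scope requires extra context."""
    return tags.scope is not Scope.SELF_CONTAINED


# One field per complexity dimension.
DIMENSIONS = [f.name for f in fields(InstanceTags)]

example = InstanceTags(
    content=Content.TABLE,
    domain="Physics",
    source_format=SourceFormat.LATEX,
    publication_type=PublicationType.RESEARCH,
    scope=Scope.INTERNAL,
    depth=Depth.IN_DEPTH,
    purpose=Purpose.EXPERIMENT,
)
print(len(DIMENSIONS), needs_retrieval(example))
```

Making the tags an immutable record keeps every instance comparable along the same seven axes, which is what allows results to be sliced per dimension.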
| Broad Domain | Fine-Grained Subdomain |
|---|---|
| Computer Science (40 Subdomains) | Artificial Intelligence; Hardware Architecture; Computational Complexity; Computational Engineering, Finance, and Science; Computational Geometry; Computation and Language; Cryptography and Security; Computer Vision and Pattern Recognition; Computers and Society; Databases; Distributed, Parallel, and Cluster Computing; Digital Libraries; Discrete Mathematics; Data Structures and Algorithms; Emerging Technologies; Formal Languages and Automata Theory; General Literature; Graphics; Computer Science and Game Theory; Human-Computer Interaction; Information Retrieval; Information Theory; Machine Learning; Logic in Computer Science; Multiagent Systems; Multimedia; Mathematical Software; Numerical Analysis; Neural and Evolutionary Computing; Networking and Internet Architecture; Other Computer Science; Operating Systems; Performance; Programming Languages; Robotics; Symbolic Computation; Sound; Software Engineering; Social and Information Networks; Systems and Control |
| Economics (3 Subdomains) | Econometrics; General Economics; Theoretical Economics |
| Electrical Engineering (4 Subdomains) | Audio and Speech Processing; Image and Video Processing; Signal Processing; Systems and Control |
| Mathematics (32 Subdomains) | Commutative Algebra; Algebraic Geometry; Analysis of PDEs; Algebraic Topology; Classical Analysis and ODEs; Combinatorics; Category Theory; Complex Variables; Differential Geometry; Dynamical Systems; Functional Analysis; General Mathematics; General Topology; Group Theory; Geometric Topology; History and Overview; Information Theory; K-Theory and Homology; Logic; Metric Geometry; Mathematical Physics; Numerical Analysis; Number Theory; Operator Algebras; Optimization and Control; Probability; Quantum Algebra; Rings and Algebras; Representation Theory; Symplectic Geometry; Spectral Theory; Statistics Theory |
| Physics (51 Subdomains) | Astrophysics (Cosmology and Nongalactic Astrophysics; Earth and Planetary Astrophysics; Astrophysics of Galaxies; High Energy Astrophysical Phenomena; Instrumentation and Methods for Astrophysics; Solar and Stellar Astrophysics); Condensed Matter (Disordered Systems and Neural Networks; Mesoscale and Nanoscale Physics; Materials Science; Other Condensed Matter; Quantum Gases; Soft Condensed Matter; Statistical Mechanics; Strongly Correlated Electrons; Superconductivity); General Relativity and Quantum Cosmology (General Relativity and Quantum Cosmology); High Energy Physics - Experiment; High Energy Physics - Lattice; High Energy Physics - Phenomenology; High Energy Physics - Theory; Mathematical Physics; Nonlinear Sciences; Nuclear Experiment; Nuclear Theory; Physics (Accelerator Physics; Atmospheric and Oceanic Physics; Applied Physics; Biological Physics; Chemical Physics; Classical Physics; Computational Physics; Data Analysis, Statistics and Probability; Physics Education; Fluid Dynamics; General Physics; Geophysics; History and Philosophy of Physics; Instrumentation and Detectors; Medical Physics; Optics; Plasma Physics; Popular Physics; Physics and Society; Space Physics); Quantum Physics |
| Quantitative Biology (10 Subdomains) | Biomolecules; Cell Behavior; Genomics; Molecular Networks; Neurons and Cognition; Other Quantitative Biology; Populations and Evolution; Quantitative Methods; Subcellular Processes; Tissues and Organs |
| Quantitative Finance (9 Subdomains) | Computational Finance; Economics; General Finance; Mathematical Finance; Portfolio Management; Pricing of Securities; Risk Management; Statistical Finance; Trading and Market Microstructure |
| Statistics (6 Subdomains) | Applications; Computation; Methodology; Machine Learning; Other Statistics; Statistics Theory |
| Biomedicine (15 Subdomains) | General Pathology; Infectious Disease; Neurological Disease; Endocrine & Metabolic Disease; Psychiatry; Oncology; Cardiovascular System; Cell Biology; Genetics; Endocrinology; Immunology; Biochemistry; Metabolism; Histology; Virology |
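
As a quick consistency check, the per-domain subdomain counts stated in the table headers can be summed against the headline figure of 170 subdomains across 9 domains. This sketch uses only the counts printed in the table; the dictionary name is an assumption made for illustration.

```python
# Subdomain counts as stated in the first column of the table above.
SUBDOMAIN_COUNTS = {
    "Computer Science": 40,
    "Economics": 3,
    "Electrical Engineering": 4,
    "Mathematics": 32,
    "Physics": 51,
    "Quantitative Biology": 10,
    "Quantitative Finance": 9,
    "Statistics": 6,
    "Biomedicine": 15,
}

total_domains = len(SUBDOMAIN_COUNTS)
total_subdomains = sum(SUBDOMAIN_COUNTS.values())
print(total_domains, total_subdomains)  # 9 170
```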
Four specialized agents working collaboratively through planning, knowledge retrieval, solution generation, and iterative refinement
1. Strategic Planning: Understand the analysis objectives and decompose them into manageable subtasks, determining what information is needed and how different evidence should be integrated.
2. Knowledge Acquisition: Search for relevant context both within the work and across related works, examining different methodologies, related concepts, empirical insights, etc., to build comprehensive understanding.
3. Synthesis & Reasoning: Integrate multimodal information and domain-specific knowledge to generate coherent analysis grounded in evidence.
4. Critical Evaluation: Reflect on their analysis, verifying consistency with evidence, checking logical coherence, and refining interpretations with domain expertise.
Figure 1: Simulating Human Scientific Analysis Workflow
Figure 4: ANAGENT for Multi-Agent Coordinative Scientific Table & Figure Analysis
Decomposition & Planning
Analyzes inputs and task objectives, examines what information and domain knowledge are needed, and decomposes complex tasks into actionable subtasks
Knowledge Searching & Retrieval
Performs iterative knowledge acquisition through specialized tool execution and targeted information retrieval
Reasoning & Generation
Synthesizes accumulated knowledge and information with provided input to generate coherent, context-aware analysis solutions
Reflection & Refinement
Assesses generated analysis through a five-dimensional evaluation protocol and provides targeted feedback for iterative improvement
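The four roles described above can be pictured as a plan, retrieve, solve, critique loop. The sketch below is a toy control flow under stated assumptions, not the ANAGENT implementation: the function signatures, the `Review` type, the `dim_1` to `dim_5` score keys, the pass threshold, and the round limit are all hypothetical, and the agents are plain functions standing in for model calls and tool execution.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Review:
    """Critic output: one score per quality dimension plus textual feedback."""
    scores: Dict[str, float]
    feedback: str

    def passed(self, threshold: float) -> bool:
        # Accept only if every dimension reaches the threshold.
        return min(self.scores.values()) >= threshold


def run_analysis(
    task: str,
    planner: Callable[[str], List[str]],
    expert: Callable[[str], str],
    solver: Callable[[str, List[str], str], str],
    critic: Callable[[str, str], Review],
    max_rounds: int = 3,      # hypothetical refinement budget
    threshold: float = 0.8,   # hypothetical acceptance threshold
) -> str:
    subtasks = planner(task)                      # Planner: decompose
    evidence = [expert(sub) for sub in subtasks]  # Expert: retrieve per subtask
    feedback = ""
    analysis = ""
    for _ in range(max_rounds):
        analysis = solver(task, evidence, feedback)  # Solver: synthesize
        review = critic(task, analysis)              # Critic: assess
        if review.passed(threshold):
            break
        feedback = review.feedback  # feed critique into the next round
    return analysis


# Toy stand-ins so the loop can be run end to end.
def toy_planner(task: str) -> List[str]:
    return [f"{task}: read caption", f"{task}: compare rows"]


def toy_expert(subtask: str) -> str:
    return f"evidence({subtask})"


def toy_solver(task: str, evidence: List[str], feedback: str) -> str:
    note = f" [revised: {feedback}]" if feedback else ""
    return f"analysis of {task} using {len(evidence)} evidence items{note}"


def toy_critic(task: str, analysis: str) -> Review:
    score = 1.0 if "revised" in analysis else 0.5
    return Review(
        scores={f"dim_{i}": score for i in range(1, 6)},  # five placeholder dimensions
        feedback="cite the table values",
    )


result = run_analysis("Table 2", toy_planner, toy_expert, toy_solver, toy_critic)
print(result)
```

With these toy agents the first draft is rejected and the second, which incorporates the Critic's feedback, is accepted. Passing the agents in as callables keeps the loop independent of any particular model or toolkit.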
Five specialized toolkits with 16 tools enabling augmented scientific analysis:
Test-time adaptation with k-shot exemplars for enhanced agent adaptability
Five-dimensional feedback guidance to reduce errors and improve analysis quality
SFT + specialized RL for optimizing individual capabilities while maintaining collaboration effectiveness
| Model | Size | Setting | SSEM (%) | SLEX (%) | SAVG (%) | Improvement |
|---|---|---|---|---|---|---|
| GPT-4.1-mini | - | Baseline | 45.18 | 10.54 | 27.86 | - |
| GPT-4.1-mini | - | Zero-Shot | 48.11 | 11.75 | 29.93 | +7.43% |
| GPT-4.1-mini | - | One-Shot | 49.47 | 12.98 | 31.22 | +12.06% |
| Gemini-2.5-Flash | - | Baseline | 42.47 | 9.20 | 25.84 | - |
| Gemini-2.5-Flash | - | Zero-Shot | 44.79 | 10.09 | 27.44 | +6.19% |
| Gemini-2.5-Flash | - | One-Shot | 47.64 | 10.98 | 29.31 | +13.43% |
| InternVL-3.5 | 4B | Baseline | 43.78 | 9.37 | 26.58 | - |
| InternVL-3.5 | 4B | Zero-Shot | 46.44 | 10.17 | 28.31 | +6.51% |
| InternVL-3.5 | 4B | One-Shot | 47.41 | 11.12 | 29.27 | +10.12% |
| InternVL-3.5 | 8B | Baseline | 44.71 | 9.98 | 27.34 | - |
| InternVL-3.5 | 8B | Zero-Shot | 47.77 | 10.85 | 29.31 | +7.21% |
| InternVL-3.5 | 8B | One-Shot | 48.52 | 12.22 | 30.37 | +11.08% |
| Qwen2.5-VL | 3B | Baseline | 43.68 | 9.49 | 26.59 | - |
| Qwen2.5-VL | 3B | Zero-Shot | 46.18 | 10.91 | 28.55 | +7.37% |
| Qwen2.5-VL | 3B | One-Shot | 47.26 | 11.31 | 29.29 | +10.15% |
| Qwen2.5-VL | 7B | Baseline | 44.74 | 9.98 | 27.31 | - |
| Qwen2.5-VL | 7B | Zero-Shot | 46.97 | 11.14 | 29.06 | +6.41% |
| Qwen2.5-VL | 7B | One-Shot | 48.22 | 12.29 | 30.25 | +10.77% |
| Qwen3-VL | 4B | Baseline | 43.99 | 9.53 | 26.76 | - |
| Qwen3-VL | 4B | Zero-Shot | 46.95 | 10.50 | 28.73 | +7.36% |
| Qwen3-VL | 4B | One-Shot | 47.55 | 11.20 | 29.38 | +9.79% |
| Qwen3-VL | 8B | Baseline | 44.73 | 10.16 | 27.44 | - |
| Qwen3-VL | 8B | Zero-Shot | 48.12 | 11.64 | 29.88 | +8.89% |
| Qwen3-VL | 8B | One-Shot | 49.15 | 12.98 | 31.07 | +13.23% |
| Model | Size | Training | Setting | SSEM (%) | SLEX (%) | SAVG (%) | Improvement |
|---|---|---|---|---|---|---|---|
| InternVL-3.5 | 4B | SFT | Zero-Shot | 51.85 | 18.92 | 35.39 | +33.15% |
| InternVL-3.5 | 4B | SFT | One-Shot | 52.13 | 18.99 | 35.56 | +33.78% |
| InternVL-3.5 | 8B | SFT | Zero-Shot | 53.20 | 20.03 | 36.61 | +33.91% |
| InternVL-3.5 | 8B | SFT | One-Shot | 53.66 | 20.14 | 36.90 | +34.97% |
| Qwen2.5-VL | 3B | SFT | Zero-Shot | 50.57 | 17.68 | 34.12 | +28.32% |
| Qwen2.5-VL | 3B | SFT | One-Shot | 51.02 | 17.69 | 34.35 | +29.18% |
| Qwen2.5-VL | 3B | RL | Zero-Shot | 46.13 | 12.42 | 29.28 | +10.12% |
| Qwen2.5-VL | 3B | RL | One-Shot | 47.10 | 13.07 | 30.09 | +13.16% |
| Qwen2.5-VL | 3B | SFT+RL | Zero-Shot | 51.91 | 18.86 | 35.39 | +33.10% |
| Qwen2.5-VL | 3B | SFT+RL | One-Shot | 52.40 | 20.29 | 36.34 | +36.67% |
| Qwen2.5-VL | 7B | SFT | Zero-Shot | 52.27 | 19.69 | 35.98 | +31.75% |
| Qwen2.5-VL | 7B | SFT | One-Shot | 53.34 | 20.43 | 36.89 | +35.08% |
| Qwen3-VL | 4B | SFT | Zero-Shot | 52.19 | 19.85 | 36.02 | +34.60% |
| Qwen3-VL | 4B | SFT | One-Shot | 52.34 | 20.05 | 36.20 | +35.28% |
| Qwen3-VL | 4B | RL | Zero-Shot | 47.43 | 12.58 | 30.03 | +12.22% |
| Qwen3-VL | 4B | RL | One-Shot | 48.45 | 12.89 | 30.67 | +14.61% |
| Qwen3-VL | 4B | SFT+RL | Zero-Shot | 53.08 | 20.81 | 36.95 | +38.08% |
| Qwen3-VL | 4B | SFT+RL | One-Shot | 54.14 | 21.92 | 38.03 | +42.12% |
| Qwen3-VL | 8B | SFT | Zero-Shot | 54.34 | 22.16 | 38.25 | +39.40% |
| Qwen3-VL | 8B | SFT | One-Shot | 54.55 | 22.22 | 38.39 | +39.91% |
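
The source does not define how the SAVG and Improvement columns are computed. The reported numbers are consistent with SAVG being the mean of SSEM and SLEX, and Improvement being the relative gain of the reported SAVG over the same model's Baseline SAVG. The sketch below encodes that reading as an assumption; the function names are illustrative.

```python
def s_avg(s_sem: float, s_lex: float) -> float:
    """Assumed definition: arithmetic mean of the two scores."""
    return (s_sem + s_lex) / 2


def relative_improvement(score: float, baseline: float) -> float:
    """Assumed definition: relative gain over the baseline, in percent."""
    return (score - baseline) / baseline * 100


# GPT-4.1-mini Baseline row: SSEM 45.18, SLEX 10.54, reported SAVG 27.86.
print(round(s_avg(45.18, 10.54), 2))

# Gemini-2.5-Flash One-Shot (SAVG 29.31) vs. its Baseline (SAVG 25.84).
print(round(relative_improvement(29.31, 25.84), 2))  # 13.43

# Qwen3-VL 4B SFT+RL One-Shot (SAVG 38.03) vs. its Baseline (SAVG 26.76).
print(round(relative_improvement(38.03, 26.76), 2))  # 42.12
```

The two improvement values reproduce the headline figures of +13.43% (training-free) and +42.12% (with finetuning). Because the tables report rounded scores, recomputing from SSEM and SLEX directly can differ from the printed values in the last digit.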
If you find this work useful, please kindly cite:
@article{guo2026anagent,
title={ANAGENT For Enhancing Scientific Table \& Figure Analysis},
author={Guo, Xuehang and Lu, Zhiyong and Hope, Tom and Wang, Qingyun},
journal={arXiv preprint arXiv:2602.10081},
url={https://arxiv.org/abs/2602.10081},
year={2026}
}