
The NeurIPS 2024 Preshow: Creating SPIQA: Addressing the Limitations of Existing Datasets for Scientific VQA


Scientific papers are more than words; they are a symphony of text, figures, and tables, all working together to communicate complex research findings.

However, current AI models struggle to grasp the full depth of these papers, as traditional datasets for training AI systems in scientific paper comprehension have primarily focused on textual data, often neglecting essential visual components like figures and tables. This limitation hinders the development of AI systems that can truly understand and interact with scientific literature in a meaningful way. 

Enter SPIQA: a solution to this challenge, offering a new approach to building datasets for scientific paper comprehension. SPIQA’s approach incorporates visual data from approximately 26,000 computer science research papers, providing a robust platform for training more comprehensive AI systems.

NeurIPS 2024 Paper: Creating SPIQA: Addressing the Limitations of Existing Datasets for Scientific VQA

Author: Shraman Pramanick is a Ph.D. student in the Department of Electrical and Computer Engineering at Johns Hopkins University.

The Shortcomings of Current Datasets

Existing datasets for training and evaluating AI systems on scientific papers suffer from several key limitations:

  • Text-centricity: Most datasets prioritize text, neglecting the critical role that figures and tables play in conveying research findings. These visual elements often present essential data, illustrate intricate relationships, and offer concise visual summaries that are crucial for a complete understanding of the research.
  • Limited Scale: Creating large-scale datasets for scientific papers is complex and expensive. The need for domain-specific expertise, the intricacies of scientific language, and the time required to extract and annotate relevant information all contribute to the small scale of existing datasets. This limitation restricts the ability to train robust AI models capable of handling diverse topics and methodologies.
  • Lack of Granularity: Many datasets treat scientific papers as monolithic entities, failing to capture the hierarchical structure and the interconnectedness of different sections, figures, and tables. This lack of granularity limits the ability to train AI models that can perform fine-grained analysis and reasoning over specific aspects of the paper.

SPIQA: A New Approach

SPIQA extracts figures, tables, and associated captions from an extensive collection of papers covering domains such as NLP, machine learning, and databases. It addresses the limitations above in three ways:

  • Multimodality: SPIQA embraces the multimodal nature of scientific papers by incorporating a vast collection of figures, tables, and their associated captions, along with the paper’s full text. This approach ensures that AI systems trained on SPIQA can learn to interpret textual and visual information, developing a more holistic understanding of the research.
  • Large Scale: SPIQA is the first large-scale dataset specifically designed to interpret complex figures and tables within the context of scientific papers. It covers a wide range of computer science domains, encompassing over 25,000 papers, 150,000 figures, and 117,000 tables. This expansive scale provides a rich training ground for AI models, enabling them to generalize better to unseen papers and topics.
  • Strategic Integration: SPIQA leverages existing human-annotated datasets, such as QASA and QASPER, to supplement its automatically generated content. This strategy allows for a more diverse and comprehensive collection of questions, ensuring a higher level of difficulty and requiring a deeper understanding of the papers.

More importantly, the dataset facilitates multimodal AI systems capable of linking visual data to accompanying text, thus achieving a more rounded understanding of scientific material.
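To make the idea of linking visual data to text concrete, here is a minimal sketch of how one paper's record might be organized. The class and field names (Visual, QAPair, PaperRecord, reference_id) are hypothetical and chosen for illustration; they are not SPIQA's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Visual:
    """A figure or table extracted from a paper (hypothetical schema)."""
    image_path: str
    caption: str
    kind: str  # "figure" or "table"


@dataclass
class QAPair:
    """A question, its answer, and the visual it is grounded in."""
    question: str
    answer: str
    reference_id: str  # key into PaperRecord.visuals


@dataclass
class PaperRecord:
    """One paper: full text plus its visuals and questions."""
    paper_id: str
    full_text: str
    visuals: Dict[str, Visual] = field(default_factory=dict)
    qa_pairs: List[QAPair] = field(default_factory=list)

    def context_for(self, qa: QAPair) -> Visual:
        """Return the figure or table needed to answer a question."""
        return self.visuals[qa.reference_id]


# Toy record with made-up content, for illustration only.
record = PaperRecord(
    paper_id="example-paper",
    full_text="We compare two models on a toy benchmark.",
    visuals={"fig1": Visual("fig1.png", "Accuracy versus model size.", "figure")},
    qa_pairs=[QAPair("Which model size is most accurate?", "The largest one.", "fig1")],
)
print(record.context_for(record.qa_pairs[0]).caption)
```

Keeping an explicit reference from each question to a figure or table is one simple way to let a model (or an evaluator) retrieve the exact visual a question depends on.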

Balancing Automation and Human Expertise

The creation of SPIQA involved a carefully balanced approach, combining automatic question generation using large language models (LLMs) with manual curation by domain experts.

Automatic question generation enabled the creation of a large-scale dataset, addressing the issue of limited scale. However, a rigorous human filtering process was implemented to ensure quality and mitigate potential biases introduced by LLMs. This process involved multiple criteria, such as verifying question relevance, answer correctness, and the necessity of figures and tables to answer the questions. 

Additionally, two existing datasets, QASA and QASPER, were meticulously filtered to identify questions requiring an understanding of figures and tables, further enriching SPIQA with high-quality, human-written questions.
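The filtering criteria described above can be pictured as a simple conjunction of reviewer judgments. This is only a sketch: the Candidate class and its boolean fields are assumptions made for illustration, and the actual review process is not specified at this level of detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """An LLM-generated question with reviewer judgments (hypothetical fields)."""
    question: str
    answer: str
    is_relevant: bool     # reviewer: the question is relevant to the paper
    answer_correct: bool  # reviewer: the answer is correct
    needs_visual: bool    # reviewer: a figure or table is required to answer


def keep(candidate: Candidate) -> bool:
    """A candidate survives only if it passes every criterion."""
    return (
        candidate.is_relevant
        and candidate.answer_correct
        and candidate.needs_visual
    )


def filter_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that pass all review criteria, in order."""
    return [c for c in candidates if keep(c)]


# Toy candidates with made-up judgments.
candidates = [
    Candidate("What trend does Figure 2 show?", "Loss decreases.", True, True, True),
    Candidate("Who wrote the paper?", "The authors.", True, True, False),
    Candidate("What is the best score in Table 1?", "42", True, False, True),
]
kept = filter_candidates(candidates)
print([c.question for c in kept])
```

The same needs_visual check captures the idea behind filtering QASA and QASPER: only questions that cannot be answered without a figure or table are retained.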

Automated and Manual Curation: A Fine Balance

A standout feature of SPIQA is its dual approach to question generation: blending automated methods with manual curation. 

While manual generation ensures quality, it does not scale to datasets of this magnitude. Hence, leveraging large language models (LLMs) for automated question generation becomes essential. Despite potential biases, this method has proven effective: human evaluators found 89% of LLM-generated questions indistinguishable from manually created ones.

This combination of strategies demonstrates a thoughtful balance between scalability and quality control.

Evaluating and Enhancing Performance

Pramanick’s team employed several baseline models to assess SPIQA’s efficacy, including both closed-source and open-source models, revealing a broad performance spectrum. Notably, the team introduces the L3 score, a novel metric leveraging LLM confidence scores, which offers a more nuanced evaluation of AI systems’ answers and addresses shortcomings of traditional token-based metrics.
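As a rough illustration of what a confidence-based metric can look like, the sketch below assumes a judge LLM is asked whether a candidate answer matches the reference and exposes log-probabilities for the tokens "yes" and "no". The function name confidence_score and this exact normalization are assumptions for illustration; the precise definition of the L3 score is given in the paper and may differ.

```python
import math


def confidence_score(logprob_yes: float, logprob_no: float) -> float:
    """Probability mass on "yes", renormalized over "yes" and "no".

    Both arguments are log-probabilities that a (hypothetical) judge LLM
    assigns to answering "yes" or "no" when asked if an answer is correct.
    """
    # Subtract the larger value before exponentiating for numerical stability.
    largest = max(logprob_yes, logprob_no)
    p_yes = math.exp(logprob_yes - largest)
    p_no = math.exp(logprob_no - largest)
    return p_yes / (p_yes + p_no)


# A judge that is fairly sure the answer is correct.
print(confidence_score(math.log(0.9), math.log(0.1)))
# A judge that is undecided.
print(confidence_score(-1.0, -1.0))
```

Unlike a hard yes/no verdict, a score of this kind is continuous, so a hesitant judgment counts for less than a confident one.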

Broader Implications and Future Directions

SPIQA sets a new standard for datasets in scientific paper comprehension, providing the resources necessary to develop more advanced AI systems. 

These systems can potentially revolutionize how we interact with scientific literature, offering benefits in research, education, and beyond. Students and researchers could leverage AI-powered tools to quickly identify relevant papers, extract key insights, and answer complex questions requiring an understanding of text and visual elements.

The creation of SPIQA involved overcoming several challenges in data curation and generation, including:

  • Balancing Automation and Human Expertise: Finding the right balance between automatic question generation and manual curation was crucial to achieving both scale and quality.
  • Domain-Specific Considerations: Scientific papers often contain specialized terminology and complex concepts that require domain expertise for accurate annotation.
  • Evaluation Metrics: Developing robust evaluation metrics for free-form question answering, where answers can be descriptive and vary in style, is crucial for accurately assessing AI systems’ performance.
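To see why free-form answers are hard to score, consider a common token-overlap F1 of the kind often used for question answering. This is a generic sketch, not the evaluation code used for SPIQA, and the example sentences are made up.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the larger model is more accurate"
paraphrase = "accuracy improves as size grows"

print(token_f1(reference, reference))   # identical wording
print(token_f1(paraphrase, reference))  # similar meaning, no shared tokens
```

The paraphrase conveys much the same idea as the reference yet shares no tokens with it, so a purely token-based metric gives it no credit. This is the kind of gap that motivates metrics based on a model's judgment of meaning.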

Conclusion

SPIQA represents more than an evolution in scientific AI comprehension; it’s a paradigm shift. By embracing the full spectrum of data available in scientific papers, it paves the way for future advancements in document understanding and AI applications across diverse scientific disciplines. As the community awaits further expansions, such as potential applications in biology and chemistry, SPIQA serves as a cornerstone in the pursuit of more intelligent interaction with scientific information.

As AI continues to evolve, datasets like SPIQA will undoubtedly play an integral role in bridging the gap between human insights and machine intelligence. Whether you’re in research or industry, SPIQA’s innovations promise exciting possibilities for a future where AI systems might fully grasp the intricate language of science.

Keep an eye out for more insights from Voxel51 in upcoming blog posts, and if you’re attending NeurIPS in Vancouver, don’t forget to visit our booth to talk with our team in person and grab some exclusive swag!