---
title: surveyor-0
app_file: scripts/run_db_interface.py
sdk: gradio
sdk_version: 4.40.0
---
# Mapping the Data Landscape For Generalizable Scientific Models

We introduce a method for building a knowledge base that stores structured information extracted from scientific publications, datasets, and articles using large language models. We aim to cover all of "science" and to support semantic search over the scientific literature for highly specific knowledge discovery. This tool helps us gather aggregate information and statistics on the current state of scientific research, identify where current foundation models lack coverage and where they generalize well, and discover overlaps in the methods used across different fields, which can facilitate building more unified foundation models for science.

### Example Preview: Concept Co-occurrence Connectivity Graph for Astrophysics!

https://github.com/user-attachments/assets/d0c2c4ac-924d-4ba8-80d5-5c665d910652

We use the Llama-3-70B-Instruct model on 2 A100 80GB GPUs for structured information extraction.

## Workflow
Fig 1: Prompt optimization pipeline that maximizes the precision of the model-annotated predictions by evaluating them on a manually annotated subset of the scientific corpora. The tagged outputs can be generated as JSON or in a readable format, and the sampling hyperparameters (temperature and nucleus sampling) are swept.

Fig 2: Illustration of the structured prediction pipeline on the full corpus of scientific papers, which runs the optimized prompts and stores the model's outputs in a SQL database.
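
The two stages in the figures can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes tags are compared as plain string sets, uses SQLite as a stand-in for the SQL database, and the table name `predictions`, its columns, and the helper names `precision`, `init_db`, and `store_prediction` are all hypothetical.

```python
import json
import sqlite3


def precision(predicted: set[str], gold: set[str]) -> float:
    """Fraction of model-predicted tags that also appear in the manual annotation."""
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table that holds one row of tags per paper."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS predictions (
            paper_id  TEXT PRIMARY KEY,
            field     TEXT NOT NULL,
            tags_json TEXT NOT NULL
        )
        """
    )


def store_prediction(conn: sqlite3.Connection, paper_id: str, field: str, tags: set[str]) -> None:
    """Serialize the tags as a sorted JSON list and upsert them for the given paper."""
    conn.execute(
        "INSERT OR REPLACE INTO predictions (paper_id, field, tags_json) VALUES (?, ?, ?)",
        (paper_id, field, json.dumps(sorted(tags))),
    )


# Made-up example data: tags the model produced vs. tags a human annotated.
predicted = {"dataset", "method", "telescope"}
gold = {"dataset", "method"}

# Stage 1 (Fig 1): score a prompt on the annotated subset.
score = precision(predicted, gold)

# Stage 2 (Fig 2): store the model output in a SQL database (in-memory here).
conn = sqlite3.connect(":memory:")
init_db(conn)
store_prediction(conn, "paper-001", "astrophysics", predicted)
conn.commit()

rows = conn.execute("SELECT paper_id, field, tags_json FROM predictions").fetchall()
```

In a sweep, `precision` would be averaged over the annotated subset for each prompt and sampling configuration, and only the best-scoring prompt would be used when writing to the database.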