---
title: Scientific Document Insights Q/A
emoji: 📝
colorFrom: yellow
colorTo: pink
sdk: streamlit
sdk_version: 1.36.0
app_file: streamlit_app.py
pinned: false
license: apache-2.0
app_port: 8501
---

# DocumentIQA: Scientific Document Insights Q/A

**Work in progress** :construction_worker: 

<img src="https://github.com/lfoppiano/document-qa/assets/15426/f0a04a86-96b3-406e-8303-904b93f00015" width=300 align="right" />

https://lfoppiano-document-qa.hf.space/

## Introduction

Question answering on scientific documents using LLMs: ChatGPT-3.5-turbo, GPT4, GPT4-Turbo, Mistral-7b-instruct and Zephyr-7b-beta.
This Streamlit application demonstrates an implementation of RAG (Retrieval Augmented Generation) on scientific documents.
**Unlike most similar projects**, we focus on scientific articles and extract text from a structured document.
We target only the full text, using [Grobid](https://github.com/kermitt2/grobid), which provides cleaner results than a raw PDF-to-text converter (the approach used by most other solutions).

Additionally, this frontend visualises named entities in the LLM responses, highlighting <span style="color:yellow">physical quantities and measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span style="color:blue">material</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).

(The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)

[<img src="https://img.youtube.com/vi/M4UaYs5WKGs/hqdefault.jpg" height="300" align="right" 
/>](https://www.youtube.com/embed/M4UaYs5WKGs)

## Getting started

- Select the model+embedding combination you want to use.
- If using gpt3.5-turbo, gpt4 or gpt4-turbo, enter your API key ([OpenAI](https://platform.openai.com/account/api-keys)).
- Upload a scientific article as a PDF document. A spinner is shown while the document is being processed.
- Once the spinner disappears, you can ask your questions.

 ![screenshot2.png](docs%2Fimages%2Fscreenshot2.png)

## Documentation

### Embedding selection
The latest version lets you select both the embedding function and the LLM. There are some limitations: OpenAI embeddings cannot be used with open source models, and vice versa.

### Context size
Sets the number of blocks from the original document that are used as context when answering.
The default size of each block is 250 tokens (this can be changed before uploading the first document).
With default settings, each question uses around 1000 tokens.

**NOTE**: if the chat answers something like "the information is not provided in the given context", **changing the context size will likely help**. 

### Chunk size
When uploaded, each document is split into blocks of a fixed size (250 tokens by default).
This setting lets users change the size of these blocks.
Smaller blocks result in a smaller context made of more precise sections of the document.
Larger blocks result in a larger context that is less tightly focused on the question.
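
As a rough illustration of how block size and context size interact, here is a minimal sketch. It is not the application's actual code: the function names are hypothetical, tokens are approximated by whitespace-separated words, and the figure of 4 blocks per question is an assumption chosen only because it matches the "around 1000 tokens" default mentioned above.

```python
def split_into_blocks(text: str, block_size: int = 250) -> list[str]:
    """Split text into consecutive blocks of at most block_size tokens.

    Assumption: a token is approximated by a whitespace-separated word.
    A real tokenizer would produce different counts.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + block_size])
        for i in range(0, len(tokens), block_size)
    ]


def context_budget(n_blocks: int, block_size: int = 250) -> int:
    """Upper bound on the context tokens sent along with one question."""
    return n_blocks * block_size


# A toy 600-word document yields blocks of 250, 250 and 100 words.
document = " ".join(f"w{i}" for i in range(600))
blocks = split_into_blocks(document)
print(len(blocks))        # 3
print(context_budget(4))  # 1000 (4 blocks is an assumed context size)
```

Halving the block size while keeping the same number of blocks halves the context budget, which is why the two settings are best tuned together.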

### Query mode
Indicates whether the question is sent to the LLM (Large Language Model) or to the vector storage.
 - **LLM** (default): enables question answering on the document content.
 - **Embeddings**: the response consists of the raw text from the document related to the question (based on the embeddings). This mode helps to investigate why answers are sometimes unsatisfying or incomplete.
 - **Question coefficient** (experimental): provides a coefficient that indicates how close or far the question is from the retrieved context.
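
The three modes can be pictured with the sketch below. It is only an illustration under stated assumptions: the embeddings are hand-written toy vectors, the function names and mode strings are hypothetical, the LLM call is replaced by building a prompt string, and the question coefficient is assumed here to be the mean cosine similarity between the question and the retrieved blocks (the application may compute it differently).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(question_vec, blocks, k=2):
    """Return the k (score, text) pairs most similar to the question."""
    scored = [(cosine_similarity(question_vec, vec), text) for text, vec in blocks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def query(question_vec, blocks, mode="llm", k=2):
    """Dispatch on a hypothetical mode name mirroring the options above."""
    hits = retrieve(question_vec, blocks, k)
    if mode == "embeddings":
        # Raw text of the retrieved blocks, no LLM involved.
        return [text for _, text in hits]
    if mode == "question_coefficient":
        # Assumed definition: mean similarity of the retrieved blocks.
        return sum(score for score, _ in hits) / len(hits) if hits else 0.0
    # "llm": a real application would send this prompt to the model.
    context = "\n".join(text for _, text in hits)
    return f"Answer using only this context:\n{context}"


# Toy (text, embedding) pairs standing in for the vector storage.
blocks = [
    ("Tc is 39 K", [1.0, 0.0]),
    ("Samples were annealed", [0.0, 1.0]),
    ("MgB2 superconducts", [0.8, 0.6]),
]
print(query([1.0, 0.0], blocks, mode="embeddings"))
```

Comparing the output of the embeddings mode with the LLM answer for the same question is a quick way to tell whether a poor answer comes from retrieval or from generation.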

### NER (Named Entity Recognition)
This feature is specifically designed for people working with scientific documents in materials science.
It runs NER on the response from the LLM to identify mentions of materials and properties (quantities, measurements).
This feature relies on the external services [grobid-quantities](https://github.com/kermitt2/grobid-quantities) and [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors).

### Troubleshooting
If you see the error `streamlit: Your system has an unsupported version of sqlite3. Chroma requires sqlite3 >= 3.35.0`, see this [solution for Linux](https://stackoverflow.com/questions/76958817/streamlit-your-system-has-an-unsupported-version-of-sqlite3-chroma-requires-sq).
For more information, see the [details](https://docs.trychroma.com/troubleshooting#sqlite) on the Chroma website.
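
To check in advance whether your Python environment is affected, you can compare the SQLite version linked into the standard `sqlite3` module with the minimum quoted in the error message. This is a small diagnostic sketch; the helper name `sqlite_is_supported` is hypothetical and not part of this project or of Chroma.

```python
import sqlite3

# Minimum version quoted in the error message above.
REQUIRED = (3, 35, 0)


def sqlite_is_supported(version=sqlite3.sqlite_version_info, required=REQUIRED) -> bool:
    """Return True if the given SQLite version meets the minimum."""
    return tuple(version) >= required


# Report the version of the SQLite library used by this interpreter.
print(sqlite3.sqlite_version, sqlite_is_supported())
```

If this reports `False`, the error above is expected when starting the application, and the linked solutions apply.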

## Disclaimer on Data, Security, and Privacy ⚠️

Please read carefully:

- Avoid uploading sensitive data. We temporarily store text from the uploaded PDF documents only for processing your request, and we disclaim any responsibility for subsequent use or handling of the submitted data by third-party LLMs.
- Mistral and Zephyr are FREE to use and do not require any API key, but as we rely on the free API endpoint, there is no guarantee that all requests will go through. Use at your own risk.
- We do not assume responsibility for how the data is utilized by the LLM end-points API.

## Development notes

To release a new version: 

- `bump-my-version bump patch` 
- `git push --tags`

To use docker: 

- `docker run lfoppiano/document-insights-qa:{latest_version}`
- `docker run lfoppiano/document-insights-qa:latest-develop` for the latest development version

To install the library from PyPI:

- `pip install document-qa-engine` 


## Acknowledgement 

The project was initiated at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan. 
Development is currently made possible by [ScienciLAB](https://www.sciencialab.com).
This project received contributions from [Guillaume Lambard](https://github.com/GLambard) and the [Lambard-ML-Team](https://github.com/Lambard-ML-Team), [Pedro Ortiz Suarez](https://github.com/pjox), and [Tomoya Mato](https://github.com/t29mato).
Thanks also to [Patrice Lopez](https://www.science-miner.com), the author of [Grobid](https://github.com/kermitt2/grobid).