---
title: LLLM QA Eval
emoji: 💬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: apache-2.0
---
# Evaluate and Optimize Open-Source LLMs' Performance for Question Answering with RAG and Non-RAG
This project contains the source code, datasets and results for the titled paper.
## Results for [WebQSP Dataset](./data/datasets/WebQSP.test.wikidata.json)
| Model Name | RAG | RAG with Chat Template | Non-RAG | Note |
| -------------------------------- | ----------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- | ----------------- |
| Phi-3-mini-128k-instruct (batch) | [Phi-3-mini-128k-instruct_wd_rag_batch_4](./data/results/Phi-3-mini-128k-instruct_wd_rag_batch_4.csv) | [Phi-3-mini-128k-instruct_wd_true](./data/results/Phi-3-mini-128k-instruct_wd_true.csv) | [Phi-3-mini-128k-instruct_wd_non_rag_batch_16](./data/results/Phi-3-mini-128k-instruct_wd_non_rag_batch_16.csv) | Evaluated 3 types |
| gemma-1.1-2b-it | [gemma-1.1-2b-it_wd](./data/results/gemma-1.1-2b-it_wd.csv) | [gemma-1.1-2b-it_wd_true](./data/results/gemma-1.1-2b-it_wd_true.csv) | [gemma-1.1-2b-it_wd_non_rag](./data/results/gemma-1.1-2b-it_wd_non_rag.csv) | Evaluated 3 types |
| gemma-1.1-7b-it | [gemma-1.1-7b-it_wd](./data/results/gemma-1.1-7b-it_wd.csv) | [gemma-1.1-7b-it_wd_true](./data/results/gemma-1.1-7b-it_wd_true.csv) | [gemma-1.1-7b-it_wd_non_rag](./data/results/gemma-1.1-7b-it_wd_non_rag.csv) | Evaluated 3 types |
| Mistral-7B-Instruct-v0.2 | [Tune_2024-03-29_11-28-20](./data/results/Tune_2024-03-29_11-28-20.csv) | [Mistral-7B-Instruct-v0.2_wd_true](./data/results/Mistral-7B-Instruct-v0.2_wd_true.csv) | [Tune_2024-04-16_12-24-27](./data/results/Tune_2024-04-16_12-24-27.csv) | Evaluated 3 types |
| Llama-2-7b-chat-hf | [Tune_2024-03-20_15-35-37](./data/results/Tune_2024-03-20_15-35-37.csv) | [Llama-2-7b-chat-hf_wd_true](./data/results/Llama-2-7b-chat-hf_wd_true.csv) | [Tune_2024-04-09_09-19-22](./data/results/Tune_2024-04-09_09-19-22.csv) | Evaluated 3 types |
| Meta-Llama-3-8B-Instruct | [Meta-Llama-3-8B-Instruct_wd](./data/results/Meta-Llama-3-8B-Instruct_wd.csv) | [Meta-Llama-3-8B-Instruct_wd_true](./data/results/Meta-Llama-3-8B-Instruct_wd_true.csv) | [Meta-Llama-3-8B-Instruct_wd_non_rag](./data/results/Meta-Llama-3-8B-Instruct_wd_non_rag.csv) (generic prompt) | Evaluated 3 types |
| | | | [Meta-Llama-3-8B-Instruct_wd_1_non_rag](./data/results/Meta-Llama-3-8B-Instruct_wd_1_non_rag.csv) | Evaluated Non-RAG |
| Llama-2-13b-chat-hf | [Tune_2024-03-25_23-32-57](./data/results/Tune_2024-03-25_23-32-57.csv) | [Llama-2-13b-chat-hf_wd_true](./data/results/Llama-2-13b-chat-hf_wd_true.csv) | [Tune_2024-04-10_16-53-38](./data/results/Tune_2024-04-10_16-53-38.csv) | Evaluated 3 types |
| Llama-2-70b-chat-hf | [Llama-2-70b-chat-hf_wd](./data/results/Llama-2-70b-chat-hf_wd.csv) | [Llama-2-70b-chat-hf_wd_true](./data/results/Llama-2-70b-chat-hf_wd_true.csv) | [Llama-2-70b-chat-hf_wd_non_rag](./data/results/Llama-2-70b-chat-hf_wd_non_rag.csv) | Evaluated 3 types |
| Meta-Llama-3-70B-Instruct | [Meta-Llama-3-70B-Instruct_wd](./data/results/Meta-Llama-3-70B-Instruct_wd.csv) | [Meta-Llama-3-70B-Instruct_wd_true](./data/results/Meta-Llama-3-70B-Instruct_wd_true.csv) | [Meta-Llama-3-70B-Instruct_wd_non_rag](./data/results/Meta-Llama-3-70B-Instruct_wd_non_rag.csv) | Evaluated 3 types |
| gpt-3.5-turbo | [gpt-3.5-turbo_rag](./data/results/gpt-3.5-turbo_rag.csv) | | [gpt-3.5-turbo_non_rag](./data/results/gpt-3.5-turbo_non_rag.csv) | Evaluated both |
## Results for [MS MARCO Dataset](./data/datasets/ms_macro.json)
| Model Name | RAG | RAG with Chat Template | Non-RAG | Note |
| ------------------------- | ---------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ | ---- |
| gemma-1.1-2b-it | [gemma-1.1-2b-it_mm_false](data/results/gemma-1.1-2b-it_mm_true_false.csv) | [gemma-1.1-2b-it_mm_true](data/results/gemma-1.1-2b-it_mm_true.csv) | [gemma-1.1-2b-it_mm_non_rag.csv](data/results/gemma-1.1-2b-it_mm_true_false_non_rag.csv) | |
| Phi-3-mini-128k-instruct | [Phi-3-mini-128k-instruct_mm_false](data/results/Phi-3-mini-128k-instruct_mm_false.csv) | [Phi-3-mini-128k-instruct_mm_true](data/results/Phi-3-mini-128k-instruct_mm_true.csv) | [Phi-3-mini-128k-instruct_mm_non_rag.csv](data/results/Phi-3-mini-128k-instruct_mm_non_rag.csv) | |
| gemma-1.1-7b-it | [gemma-1.1-7b-it_mm_false](data/results/gemma-1.1-7b-it_mm_false.csv) | [gemma-1.1-7b-it_mm_true](data/results/gemma-1.1-7b-it_mm_true.csv) | [gemma-1.1-7b-it_mm_non_rag.csv](data/results/gemma-1.1-7b-it_mm_non_rag.csv) | |
| Mistral-7B-Instruct-v0.2 | [Mistral-7B-Instruct-v0.2_mm_false](data/results/Mistral-7B-Instruct-v0.2_mm_false.csv) | [Mistral-7B-Instruct-v0.2_mm_true](data/results/Mistral-7B-Instruct-v0.2_mm_true.csv) | [Mistral-7B-Instruct-v0.2_mm_non_rag.csv](data/results/Mistral-7B-Instruct-v0.2_mm_non_rag.csv) | |
| Llama-2-7b-chat-hf | [Llama-2-7b-chat-hf_mm_false](data/results/Llama-2-7b-chat-hf_mm_true_false.csv) | [Llama-2-7b-chat-hf_mm_true](data/results/Llama-2-7b-chat-hf_mm_true.csv) | [Llama-2-7b-chat-hf_mm_non_rag.csv](data/results/Llama-2-7b-chat-hf_mm_true_false_non_rag.csv) | |
| Meta-Llama-3-8B-Instruct | [Meta-Llama-3-8B-Instruct_mm_false](data/results/Meta-Llama-3-8B-Instruct_mm_true_false.csv) | [Meta-Llama-3-8B-Instruct_mm_true](data/results/Meta-Llama-3-8B-Instruct_mm_true.csv) | [Meta-Llama-3-8B-Instruct_mm_non_rag.csv](data/results/Meta-Llama-3-8B-Instruct_mm_true_false_non_rag.csv) | |
| Llama-2-13b-chat-hf | [Llama-2-13b-chat-hf_mm_false](data/results/Llama-2-13b-chat-hf_mm_false.csv) | [Llama-2-13b-chat-hf_mm_true](data/results/Llama-2-13b-chat-hf_mm_true.csv) | [Llama-2-13b-chat-hf_mm_non_rag.csv](data/results/Llama-2-13b-chat-hf_mm_non_rag.csv) | |
| Llama-2-70b-chat-hf | [Llama-2-70b-chat-hf_mm_false](data/results/Llama-2-70b-chat-hf_mm_false.csv) | [Llama-2-70b-chat-hf_mm_true](data/results/Llama-2-70b-chat-hf_mm_true.csv) | [Llama-2-70b-chat-hf_mm_non_rag.csv](data/results/Llama-2-70b-chat-hf_mm_non_rag.csv) | |
| Meta-Llama-3-70B-Instruct | [Meta-Llama-3-70B-Instruct_mm_false](data/results/Meta-Llama-3-70B-Instruct_mm_true_false.csv) | [Meta-Llama-3-70B-Instruct_mm_true](data/results/Meta-Llama-3-70B-Instruct_mm_true.csv) | [Meta-Llama-3-70B-Instruct_mm_non_rag.csv](data/results/Meta-Llama-3-70B-Instruct_mm_true_false_non_rag.csv) | |
| gpt-3.5-turbo | [gpt-3.5-turbo_rag](./data/results/gpt-3.5-turbo_mm_RP_1.300.csv) | | [gpt-3.5-turbo_non_rag](./data/results/gpt-3.5-turbo_mm_non_rag_RP_1.300.csv) | |
## How it works
This project uses an AI methodology called Conversational Retrieval Augmentation (CRAG): LLMs are used off the shelf (i.e., without any fine-tuning), and their behavior is controlled through prompting and conditioning on private “contextual” data, e.g., text extracted from your PDF files.
At a very high level, the workflow can be divided into three stages:
1. Data preprocessing / embedding: This stage involves storing private data (your PDF files) to be retrieved later. Typically, the documents are broken into chunks and passed through an embedding model, and the resulting embeddings are stored in a vectorstore.
2. Prompt construction / retrieval: When a user submits a query, the application constructs a series of prompts to submit to the language model. A compiled prompt typically combines a prompt template and a set of relevant documents retrieved from the vectorstore.
3. Prompt execution / inference: Once the prompts have been compiled, they are submitted to a pre-trained LLM for inference. The LLM can be a proprietary model API or an open-source or self-trained model.
The tech stack includes LangChain, Gradio, Chroma and FAISS.
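The first two stages can be illustrated with a minimal, standard-library-only sketch. It is not the project's implementation: the word-count `embed` function stands in for a real embedding model, the in-memory list stands in for a Chroma or FAISS vectorstore, and all function names and the prompt template are hypothetical. Stage 3 would send the compiled prompt to an LLM, which is left out here.

```python
import math
import re
from collections import Counter

# Hypothetical prompt template; the real app defines its own.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def chunk_text(text, chunk_size=200, overlap=50):
    """Stage 1a: split a document into overlapping character chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(text):
    """Stage 1b: toy embedding (word counts), standing in for an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_store(documents, chunk_size=200, overlap=50):
    """Stage 1: list of (chunk, embedding) pairs, standing in for a vectorstore."""
    return [
        (chunk, embed(chunk))
        for doc in documents
        for chunk in chunk_text(doc, chunk_size, overlap)
    ]


def retrieve(store, query, k=2):
    """Stage 2a: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(store, question, k=2):
    """Stage 2b: combine the template with the retrieved chunks."""
    context = "\n---\n".join(retrieve(store, question, k))
    return PROMPT_TEMPLATE.format(context=context, question=question)


if __name__ == "__main__":
    docs = [
        "Git LFS stores large files outside the repository.",
        "FAISS is a library for similarity search over dense vectors.",
    ]
    print(build_prompt(build_store(docs), "What is FAISS used for?", k=1))
```

In a real deployment the similarity search is delegated to the vectorstore, so only the chunking parameters and the prompt template would remain application code.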
- LangChain is an open-source framework that makes it easier to build scalable AI/LLM apps and chatbots.
- Gradio is an open-source Python library that is used to build machine learning and data science demos and web applications.
- Chroma and FAISS are open-source vectorstores for storing embeddings for your files.
## Running Locally
1. Check pre-conditions:
- [Git Large File Storage (LFS)](https://git-lfs.com/) must be installed.
- Run `python --version` to make sure you're running Python version 3.10 or above.
- [CMake](https://cmake.org/) must be installed. Here is a sample command to install `CMake` on `ubuntu`:
```
sudo apt install cmake
```
2. Clone the repo
```
git lfs install
git clone --recursive https://github.com/smu-ai/Evaluation-of-Orca-2-Models-for-Conversational-RAG.git
```
3. Make sure the latest PyTorch is installed:
```
# using CUDA with Nvidia GPU
make install-torch-cuda
# using Apple Silicon or other CPU
make install-torch
```
4. Install packages
```
pip install -r requirements.txt
```
5. Set up your environment variables
- By default, environment variables are loaded from the `.env.example` file.
- If you don't want to use the default settings, copy `.env.example` into `.env`. You can then update it for your local runs.
6. Run the automated tests:
```
make test
```
7. Start the local server at `http://localhost:7860`:
```
make start
```
8. Tune repetition penalty parameters:
```
make tune
```
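The fallback described in step 5 (use `.env` when it exists, otherwise `.env.example`) can be sketched as follows. This is an illustration under assumptions, not the repo's loader: the project may well rely on a library such as `python-dotenv`, and the `parse_env_file` and `load_settings` helpers below are hypothetical names.

```python
from pathlib import Path


def parse_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(folder="."):
    """Return (file_name, settings), preferring .env over .env.example."""
    folder = Path(folder)
    for name in (".env", ".env.example"):
        candidate = folder / name
        if candidate.is_file():
            return name, parse_env_file(candidate)
    return None, {}


if __name__ == "__main__":
    source, settings = load_settings(".")
    print(f"Loaded {len(settings)} settings from {source}")
```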
## Talk to Your Own PDF Files
- The sample PDF files are downloaded from [PCI DSS official website](https://www.pcisecuritystandards.org/document_library/?category=pcidss) and the corresponding embeddings are stored in folders `data/chromadb_1024_512` and `data/faiss_1024_512` with Chroma & FAISS formats respectively, which allows you to run locally without any additional effort.
- You can also put your own PDF files into any folder specified in `SOURCE_PDFS_PATH` and run the command below to generate embeddings, which will be stored in the folder given by `FAISS_INDEX_PATH` or `CHROMADB_INDEX_PATH`. If both `*_INDEX_PATH` env vars are set, `FAISS_INDEX_PATH` takes precedence. Make sure the folder specified by `*_INDEX_PATH` doesn't exist; otherwise the command will simply load the index from that folder and run a simple similarity search, as a way to verify that the embeddings were generated and stored properly. Please note that the HuggingFace embedding model specified by `HF_EMBEDDINGS_MODEL_NAME` will be used to generate the embeddings.
```
python ingest.py
```
- Once embeddings are generated, you can test them out locally, or check them into your duplicated space. Please note that the HF Spaces git server does not allow PDF files to be checked in.
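The precedence and "generate vs. verify" rules described above can be summarized in a small sketch. It only mirrors the behavior stated in this README; the `resolve_index` function and its return values are hypothetical and are not taken from `ingest.py`.

```python
import os


def resolve_index(env=None):
    """Pick the index backend and decide whether to generate or just verify.

    FAISS_INDEX_PATH wins when both variables are set. If the chosen folder
    already exists, the index is loaded and checked; otherwise it is generated.
    """
    env = os.environ if env is None else env
    for backend, var in (("faiss", "FAISS_INDEX_PATH"), ("chroma", "CHROMADB_INDEX_PATH")):
        path = env.get(var)
        if path:
            action = "load_and_verify" if os.path.exists(path) else "generate"
            return backend, path, action
    raise ValueError("Set FAISS_INDEX_PATH or CHROMADB_INDEX_PATH")


if __name__ == "__main__":
    # Example values only; the paths follow the sample folders named in this README.
    print(resolve_index({
        "FAISS_INDEX_PATH": "data/faiss_1024_512",
        "CHROMADB_INDEX_PATH": "data/chromadb_1024_512",
    }))
```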
## Play with Different Large Language Models
The source code supports different LLM types, as shown at the top of `.env.example`:
```
# LLM_MODEL_TYPE=openai
# LLM_MODEL_TYPE=gpt4all-j
# LLM_MODEL_TYPE=gpt4all
# LLM_MODEL_TYPE=llamacpp
# LLM_MODEL_TYPE=huggingface
# LLM_MODEL_TYPE=mosaicml
# LLM_MODEL_TYPE=stablelm
# LLM_MODEL_TYPE=openllm
LLM_MODEL_TYPE=hftgi
```
- By default, the app runs the `microsoft/orca-2-13b` model with HF Text Generation Inference, which runs on a research server and might be down from time to time.
- Uncomment/comment the above to play with different LLM types. You may also want to update other related env vars. E.g., here's the list of HF models which have been tested with the code:
```
# HUGGINGFACE_MODEL_NAME_OR_PATH="microsoft/orca-2-7b"
HUGGINGFACE_MODEL_NAME_OR_PATH="microsoft/orca-2-13b"
# HUGGINGFACE_MODEL_NAME_OR_PATH="TheBloke/wizardLM-7B-HF"
# HUGGINGFACE_MODEL_NAME_OR_PATH="TheBloke/vicuna-7B-1.1-HF"
# HUGGINGFACE_MODEL_NAME_OR_PATH="nomic-ai/gpt4all-j"
# HUGGINGFACE_MODEL_NAME_OR_PATH="nomic-ai/gpt4all-falcon"
# HUGGINGFACE_MODEL_NAME_OR_PATH="lmsys/fastchat-t5-3b-v1.0"
# HUGGINGFACE_MODEL_NAME_OR_PATH="meta-llama/Llama-2-7b-chat-hf"
# HUGGINGFACE_MODEL_NAME_OR_PATH="meta-llama/Llama-2-13b-chat-hf"
# HUGGINGFACE_MODEL_NAME_OR_PATH="meta-llama/Llama-2-70b-chat-hf"
```
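To make the comment/uncomment convention concrete, here is a small sketch that finds the active (uncommented) value of a key in env-file text and checks it against the LLM types listed above. The helper names are hypothetical and the validation step is an assumption; the app itself may handle unknown types differently.

```python
# LLM types listed at the top of .env.example in this README.
SUPPORTED_LLM_TYPES = (
    "openai", "gpt4all-j", "gpt4all", "llamacpp", "huggingface",
    "mosaicml", "stablelm", "openllm", "hftgi",
)


def active_value(env_text, key):
    """Return the last uncommented assignment of `key`, or None if absent."""
    value = None
    for raw in env_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue  # commented-out or non-assignment lines are ignored
        name, _, rest = line.partition("=")
        if name.strip() == key:
            value = rest.strip().strip('"')
    return value


def select_llm_type(env_text):
    """Return the active LLM_MODEL_TYPE, rejecting values not listed above."""
    llm_type = active_value(env_text, "LLM_MODEL_TYPE")
    if llm_type not in SUPPORTED_LLM_TYPES:
        raise ValueError(f"Unsupported LLM_MODEL_TYPE: {llm_type!r}")
    return llm_type


if __name__ == "__main__":
    sample = "# LLM_MODEL_TYPE=openai\nLLM_MODEL_TYPE=hftgi\n"
    print(select_llm_type(sample))
```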