Mercury7353 committed on
Commit 64d01aa
•
1 Parent(s): 35eab02

Upload 7 files
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ images/leaderboard.png filter=lfs diff=lfs merge=lfs -text
+ images/main.png filter=lfs diff=lfs merge=lfs -text
+ images/Screen_recording-2024-07-03_16-39-54.mp4 filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,86 @@
- ---
- license: apache-2.0
- ---
+ <h1 align="center"> PyBench: Evaluate LLM Agent on Real World Tasks </h1>
+
+ <p align="center">
+ <a href="coming soon">📃 Paper</a>
+ •
+ <a href="https://huggingface.co/datasets/Mercury7353/PyInstruct">🤗 Data (PyInstruct)</a>
+ •
+ <a href="https://huggingface.co/Mercury7353/PyLlama3">🤗 Model (PyLlama3)</a>
+ </p>
+
+ PyBench is a comprehensive benchmark for evaluating LLMs on real-world coding tasks, including **chart analysis**, **text analysis**, **image/audio editing**, **complex math**, and **software/website development**.
+ We collect files from Kaggle, arXiv, and other sources and automatically generate queries according to the type and content of each file.
+
+ ![Overview](images/main.png)
+
+ ## Why PyBench?
+
+ An LLM agent equipped with a code interpreter can automatically solve real-world coding tasks, such as data analysis and image processing.
+ However, existing benchmarks primarily focus either on simplistic tasks, such as completing a few lines of code, or on extremely complex and specific tasks at the repository level, neither of which is representative of everyday coding tasks.
+ To address this gap, we introduce **PyBench**, a benchmark that encompasses 6 main categories of real-world tasks, covering more than 10 types of files.
+ ![How PyBench Works](images/generateTraj.png)
+
+ ## 📝 PyInstruct
+
+ To enhance the model's ability on PyBench, we generate a homologous dataset: **PyInstruct**. PyInstruct contains multi-turn interactions between the model and files, exercising the model's capabilities in coding, debugging, and multi-turn complex task solving. Compared to other datasets that focus on multi-turn coding ability, PyInstruct has longer trajectories, with more turns and tokens each.
+
+ ![Data Statistics](images/data.png)
+ *Dataset statistics. Token statistics are computed with the Llama-2 tokenizer.*
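+
+ For a quick look at the data, the dataset can be pulled from the Hub; a minimal sketch, assuming a recent `huggingface_hub` CLI is installed:
+
+ ```bash
+ # Download the PyInstruct dataset into a local directory
+ huggingface-cli download Mercury7353/PyInstruct --repo-type dataset --local-dir ./PyInstruct
+ ```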
+
+ ## 🪄 PyLlama
+
+ We trained Llama3-8B-base on PyInstruct, CodeActInstruct, CodeFeedback, and the Jupyter Notebook Corpus to obtain PyLlama3, which achieves outstanding performance on PyBench.
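+
+ The resulting model is available on the Hub; a similarly hedged sketch for pulling the weights:
+
+ ```bash
+ # Download the PyLlama3 model weights into a local directory
+ huggingface-cli download Mercury7353/PyLlama3 --local-dir ./PyLlama3
+ ```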
+
+ ## 🚀 Model Evaluation with PyBench!
+ <video src="https://github.com/Mercury7353/PyBench/assets/103104011/fef85310-55a3-4ee8-a441-612e7dbbaaab"> </video>
+ *Demonstration of the chat interface.*
+
+ ### Environment Setup
+ Begin by creating the required conda environment:
+
+ ```bash
+ conda env create -f environment.yml
+ ```
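+
+ Then activate it. The environment name is defined at the top of `environment.yml`, so `pybench` below is an assumption:
+
+ ```bash
+ # Activate the newly created environment (name is an assumption)
+ conda activate pybench
+ ```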
+
+ ### Model Configuration
+ Launch a local server with the vLLM framework; it defaults to port 8001:
+
+ ```bash
+ bash SetUpModel.sh
+ ```
+
+ A Jinja chat template is required to launch a vLLM server; commonly used templates can be found in the `./jinja/` directory.
+ Before starting the server, specify the model path and the Jinja template path in `SetUpModel.sh`.
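+
+ The script is expected to wrap a vLLM launch along these lines (a minimal sketch, not the repository's actual script; all paths are placeholders):
+
+ ```bash
+ # Hypothetical equivalent of SetUpModel.sh: serve the model with an
+ # OpenAI-compatible API on port 8001, using a Jinja chat template.
+ python -m vllm.entrypoints.openai.api_server \
+     --model /path/to/PyLlama3 \
+     --chat-template ./jinja/llama3.jinja \
+     --port 8001
+ ```
+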
+ ### Configuration Adjustments
+ Specify your model's path and the server port in `./config/model.yaml`. This configuration file also allows for customization of the system prompts.
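+
+ A hypothetical sketch of such a config, written to a new file so nothing is overwritten (every field name below is an assumption, not the repository's actual schema):
+
+ ```bash
+ # Write a minimal example config; field names here are assumptions.
+ cat > ./config/example.yaml << 'EOF'
+ model_name: PyLlama3
+ base_url: http://localhost:8001/v1
+ system_prompt: You are a helpful assistant with access to a Python interpreter.
+ EOF
+ ```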
+
+ ### Execution on PyBench
+ Be sure to update the output trajectory file path before execution:
+
+ ```bash
+ python inference.py --config_path ./config/<your config>.yaml --task_path ./data/meta/task.json --output_path <your trajectory.jsonl path>
+ ```
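+
+ The output is JSONL, one JSON record per line, so a quick sanity check might look like this (assumes `jq` is installed; the path placeholder is the same as above):
+
+ ```bash
+ # Pretty-print the first recorded trajectory
+ head -n 1 <your trajectory.jsonl path> | jq .
+ ```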
+
+ ### Unit Testing Procedure
+ - **Step 1:** Store the output files in `./output`.
+ - **Step 2:** Define the trajectory file path in `./data/unit_test/enter_point.py`.
+ - **Step 3:** Execute the unit test script:
+
+ ```bash
+ python data/unit_test/enter_point.py
+ ```
+
+ ## 📊 Leaderboard
+ ![LLM Leaderboard](images/leaderboard.png)
+
+ ## 📚 Citation
+ ```bibtex
+ TBD
+ ```
images/Screen_recording-2024-07-03_16-39-54.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:43692e7d0a6925a082f33d4ae2c5326fdd3af118e37849c8d858c0a8a7fde029
+ size 5153739
images/data.png ADDED
images/generateTraj.png ADDED
images/hook.png ADDED
images/leaderboard.png ADDED

Git LFS Details

  • SHA256: 084d7fc0ae7f65e632a1998392ef156cc051d0e251b2ebb40077c2e91c187d55
  • Pointer size: 132 Bytes
  • Size of remote file: 1 MB
images/main.png ADDED

Git LFS Details

  • SHA256: f9f9b942c1ef76ccab6c04f5dd9961e905959c5bff72c0ebca25a01efd48af5a
  • Pointer size: 132 Bytes
  • Size of remote file: 1.33 MB