# Model description
This is a repository for the **CodeLlama-7b** model fine-tuned on the [KStack-clean](https://huggingface.co/datasets/JetBrains/KStack-clean) dataset with rule-based filtering, in the *Hugging Face Transformers* format. KStack-clean is a small subset of [KStack](https://huggingface.co/datasets/JetBrains/KStack), the largest collection of permissively licensed Kotlin code, automatically filtered to include files that have the highest "educational value for learning algorithms in Kotlin".
# How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
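# The checkpoint ID, prompt, and generation settings below are assumptions,
# sketching one complete call sequence; adjust them to your setup.
import torch

checkpoint = "JetBrains/CodeLlama-7B-KStack-clean"  # assumed Hub repo ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # half precision to fit a 7B model on one GPU
    device_map="auto",
)

# Left-to-right completion of a Kotlin snippet.
prompt = "fun fibonacci(n: Int): Int {"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```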
# Training setup

The model was trained on one A100 GPU with the following hyperparameters:

| **Hyperparameter** | **Value** |
|:---------------------------:|:----------------------------------------:|
| `total_batch_size` | 32 (~30K tokens per step) |
| `num_epochs` | 2 |
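As a rough illustration, the two settings above could be expressed with Hugging Face `TrainingArguments`; this is a sketch only, and the per-device batch size, accumulation steps, and precision flag are assumptions rather than values from the report:

```python
from transformers import TrainingArguments

# Sketch: total_batch_size = per_device_train_batch_size
#         * gradient_accumulation_steps on a single A100.
args = TrainingArguments(
    output_dir="codellama-7b-kstack-clean",
    per_device_train_batch_size=4,   # assumed split of the global batch
    gradient_accumulation_steps=8,   # 4 * 8 = 32, matching `total_batch_size`
    num_train_epochs=2,              # `num_epochs` from the table above
    bf16=True,                       # assumed mixed precision on A100
)
```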
More details about fine-tuning can be found in the technical report.
# Fine-tuning data
For this model, we used 25K examples from the [KStack-clean](https://huggingface.co/datasets/JetBrains/KStack-clean) dataset, selected from the larger [KStack](https://huggingface.co/datasets/JetBrains/KStack) dataset according to educational value for learning algorithms. In total, the dataset contains about 23M tokens.
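A minimal sketch of pulling the dataset with the `datasets` library (the split name is an assumption; see the dataset card for the actual schema):

```python
from datasets import load_dataset

# Load KStack-clean from the Hugging Face Hub.
kstack_clean = load_dataset("JetBrains/KStack-clean", split="train")
print(kstack_clean)     # inspect the features and the number of examples
print(kstack_clean[0])  # peek at one Kotlin file record
```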
# Evaluation
For evaluation, we used the [Kotlin HumanEval](https://huggingface.co/datasets/JetBrains/Kotlin_HumanEval) dataset, which contains all 161 tasks from HumanEval translated into Kotlin by human experts. You can find more details about the pre-processing necessary to obtain our results, including the code for running the evaluation, on the [dataset's page](https://huggingface.co/datasets/JetBrains/Kotlin_HumanEval).
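For reference, a minimal sketch of loading the benchmark itself (the split name is an assumption; the dataset page above documents the actual evaluation pipeline):

```python
from datasets import load_dataset

# Load the Kotlin translations of the HumanEval tasks.
kotlin_humaneval = load_dataset("JetBrains/Kotlin_HumanEval", split="train")
print(len(kotlin_humaneval))  # expected: 161 tasks
print(kotlin_humaneval[0])    # inspect one task record
```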
Here are the results of our evaluation:
| **Model name** | **Kotlin HumanEval Pass Rate** |
|:---------------------------:|:----------------------------------------:|
| `CodeLlama-7B` | 26.89 |
| `CodeLlama-7B-KStack-clean` | **37.89** |
# Ethical Considerations and Limitations
CodeLlama-7B-KStack-clean is a new technology that carries risks with use. The testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, CodeLlama-7B-KStack-clean's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. The model was fine-tuned on a specific data format (Kotlin tasks), and deviation from this format can also lead to inaccurate or undesirable responses to user queries. Therefore, before deploying any applications of CodeLlama-7B-KStack-clean, developers should perform safety testing and tuning tailored to their specific applications of the model.