gx-ai-architect committed
Commit 80a24c4 (1 parent: f4ff397)

Update README.md

Files changed (1): README.md (+1 −1)
README.md CHANGED
@@ -18,7 +18,7 @@ base_model: mistralai/Mistral-7B-v0.1
 # Model Card for Merlinite-7B-pt 🔥
 
 ### Overview
-We introduce **Merlinite-7B-pt**, a strong open-source chat model, aligned using AI feedback **without proprietary models or using any human annotation**.
+We introduce **Merlinite-7B-pt**, a strong open-source chat model, preference aligned using AI feedback **without proprietary models or using any human annotation**.
 - **Merlinite-7B-pt** is first supervised-finetuned (SFT) via [LAB](https://arxiv.org/abs/2403.01081) using Mistral-7B-v0.1 as base model, and then preference-tuned via AI feedback.
 - Our preference tuning recipe uses the DPO reward from Mixtral-8x7B-Instruct-v0.1 as the proxy for human preferences, and applies iterative rejection sampling to finetune the SFT policy.
 - We show that DPO log-ratios can serve as a reliable reward signal, showing clear correlation between reward improvements and MT-Bench improvements.
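
The overview above refers to the DPO implicit reward, which is the scaled log-ratio r(x, y) = β · (log π_θ(y|x) − log π_ref(y|x)). Below is a minimal sketch of how such a log-ratio could score sampled responses during best-of-n rejection sampling. It assumes the summed token log-probabilities under the policy and reference models are already computed; the function names, the β value, and the dummy numbers are illustrative assumptions, not the model card's actual recipe.

```python
import torch

def dpo_log_ratio_reward(policy_logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor,
                         beta: float = 0.1) -> torch.Tensor:
    """DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    Both inputs hold the summed token log-probabilities of each full
    response, shape (num_candidates,). beta is an illustrative choice.
    """
    return beta * (policy_logprobs - ref_logprobs)

def rejection_sample(candidates, policy_logprobs, ref_logprobs, beta=0.1):
    """Best-of-n rejection sampling: keep the candidate with the highest
    DPO log-ratio reward; kept winners would form the next round's
    finetuning data in an iterative scheme."""
    rewards = dpo_log_ratio_reward(policy_logprobs, ref_logprobs, beta)
    return candidates[int(rewards.argmax())]

# Toy usage with dummy log-probabilities for 4 sampled responses.
cands = ["resp_a", "resp_b", "resp_c", "resp_d"]
pi = torch.tensor([-42.0, -39.5, -40.1, -44.3])   # log pi_theta(y|x)
ref = torch.tensor([-41.0, -41.2, -40.0, -43.0])  # log pi_ref(y|x)
print(rejection_sample(cands, pi, ref))           # highest log-ratio wins
```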