	---
	datasets:
	- louisbrulenaudet/Romulus-cpt-fr
	license: llama3
	language:
	- fr
	base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
	pipeline_tag: text-generation
	library_name: transformers
	tags:
	- law
	- droit
	- unsloth
	- trl
	- transformers
	- sft
	- llama
	---
	<img src="assets/thumbnail.webp">

# Romulus, continually pre-trained models for French law

Romulus is a series of continually pre-trained models enriched with French law and intended as a base for fine-tuning on labeled data. Note that these models have not been aligned to produce usable text as they stand; they will need to be fine-tuned on the target task to give satisfactory results.

The training corpus contains 34,864,949 tokens (counted with the meta-llama/Meta-Llama-3.1-8B-Instruct tokenizer).
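A token count like the one above can be reproduced by summing encoded lengths over the corpus. The following is a minimal sketch, not the author's counting script: `count_tokens` and `whitespace_encode` are hypothetical names, and the whitespace encoder is only a stand-in so the example runs offline.

```python
from typing import Callable, Iterable, List


def count_tokens(texts: Iterable[str], encode: Callable[[str], List[int]]) -> int:
    """Sum the number of token ids produced by `encode` over all texts."""
    return sum(len(encode(text)) for text in texts)


def whitespace_encode(text: str) -> List[int]:
    """Stand-in encoder (assumption): one fake token id per whitespace-separated word."""
    return list(range(len(text.split())))


# Two illustrative strings, not rows from the actual dataset.
sample = ["Article 1240 du Code civil", "Tout fait quelconque de l'homme"]
total = count_tokens(sample, whitespace_encode)
print(total)
```

To count with the real tokenizer, one option is to load it with `AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")` from `transformers` and pass `lambda t: tokenizer.encode(t, add_special_tokens=False)` as `encode`. Whether the figure reported above includes special tokens is not stated, so the result may differ slightly.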

	## Hyperparameters

	The following table outlines the key hyperparameters used for training Romulus.

| Parameter | Description | Value |
|---|---|---|
| `max_seq_length` | Maximum sequence length for the model | 4096 |
| `load_in_4bit` | Whether to load the model in 4-bit precision | False |
| `model_name` | Pre-trained model name from Hugging Face | meta-llama/Meta-Llama-3.1-8B-Instruct |
| `r` | Rank of the LoRA adapter | 128 |
| `lora_alpha` | Alpha value for the LoRA module | 32 |
| `lora_dropout` | Dropout rate for LoRA layers | 0 |
| `bias` | Bias type for LoRA adapters | none |
| `use_gradient_checkpointing` | Whether to use gradient checkpointing | unsloth |
| `train_batch_size` | Per device training batch size | 8 |
| `gradient_accumulation_steps` | Number of gradient accumulation steps | 8 |
| `warmup_ratio` | Warmup steps as a fraction of total steps | 0.1 |
| `num_train_epochs` | Number of training epochs | 1 |
| `learning_rate` | Learning rate for the model | 5e-5 |
| `embedding_learning_rate` | Learning rate for embeddings | 1e-5 |
| `optim` | Optimizer used for training | adamw_8bit |
| `weight_decay` | Weight decay to prevent overfitting | 0.01 |
| `lr_scheduler_type` | Type of learning rate scheduler | linear |
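Several of these values only make sense in combination. The sketch below works out the effective batch size, the LoRA scaling factor, and the shape of a linear schedule with warmup. It is an illustration under stated assumptions rather than the trainer's internals: `linear_schedule_lr` is a hypothetical helper, `total_steps = 1000` is a made-up figure (the real count depends on the number of rows in the dataset), and rounding the warmup length up is an assumption about how the ratio is converted to steps.

```python
import math

# Values taken from the hyperparameter table.
per_device_batch_size = 8
gradient_accumulation_steps = 8
r, lora_alpha = 128, 32
learning_rate = 5e-5
warmup_ratio = 0.1

# Sequences contributing to one optimizer step on a single device.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Standard LoRA scales the adapter update by alpha / r. Rank-stabilized LoRA
# (use_rslora=True in the training script) uses alpha / sqrt(r) instead.
standard_scaling = lora_alpha / r
rslora_scaling = lora_alpha / math.sqrt(r)


def linear_schedule_lr(step, total_steps, peak_lr=learning_rate, ratio=warmup_ratio):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    warmup_steps = math.ceil(total_steps * ratio)  # assumption: round up
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


total_steps = 1000  # hypothetical, for illustration only
print("effective batch size:", effective_batch_size)
print("LoRA scaling (standard vs rsLoRA):", standard_scaling, rslora_scaling)
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_lr(step, total_steps))
```

With a rank as high as 128, the standard scaling would shrink the adapter update considerably, which is one reason to pair a large `r` with rank-stabilized scaling.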

## Training script

Romulus was trained with Unsloth on an NVIDIA H100 instance in the Azure East US region, provided by the Microsoft for Startups program, using the following script:

```python
# -*- coding: utf-8 -*-
from datasets import load_dataset
from unsloth import (
    FastLanguageModel,
    is_bfloat16_supported,
    UnslothTrainer,
    UnslothTrainingArguments,
)

max_seq_length = 4096
dtype = None  # None lets Unsloth pick the dtype for the current GPU
load_in_4bit = False

# Load the base model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    token="hf_token",  # placeholder: replace with your Hugging Face access token
)

# Attach LoRA adapters. The embeddings and the LM head are included in the
# target modules so that they are also updated during continued pre-training.
model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "embed_tokens",
        "lm_head",
    ],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=True,
    loftq_config=None,
)

prompt = """### Référence :
{}
### Contenu :
{}"""

EOS_TOKEN = tokenizer.eos_token


def formatting_prompts_func(examples):
    """
    Format input examples into prompts for a language model.

    This function takes a batch of examples containing references and texts,
    combines them into formatted prompts, and appends an end-of-sequence token.

    Parameters
    ----------
    examples : dict
        A dictionary containing two keys:
        - 'ref': A list of references.
        - 'texte': A list of corresponding text content.

    Returns
    -------
    dict
        A dictionary with a single key 'text', containing a list of formatted prompts.

    Notes
    -----
    - The function assumes the existence of a global `prompt` variable, which is a
      formatting string used to combine the reference and text.
    - The function also assumes the existence of a global `EOS_TOKEN` variable,
      which is appended to the end of each formatted prompt.
    - The input lists 'ref' and 'texte' are expected to have the same length.

    Examples
    --------
    >>> examples = {
    ...     'ref': ['Reference 1', 'Reference 2'],
    ...     'texte': ['Content 1', 'Content 2']
    ... }
    >>> formatting_prompts_func(examples)
    {'text': ['<formatted_prompt_1><EOS>', '<formatted_prompt_2><EOS>']}
    """
    refs = examples["ref"]
    texts = examples["texte"]
    outputs = []

    for ref, text in zip(refs, texts):
        text = prompt.format(ref, text) + EOS_TOKEN
        outputs.append(text)

    return {
        "text": outputs,
    }


# Load the continued pre-training corpus.
cpt_dataset = load_dataset(
    "louisbrulenaudet/Romulus-cpt-fr",
    split="train",
    token="hf_token",  # placeholder: replace with your Hugging Face access token
)

# Add a "text" column holding the formatted prompts.
cpt_dataset = cpt_dataset.map(
    formatting_prompts_func,
    batched=True,
)

trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=cpt_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,
        warmup_ratio=0.1,
        num_train_epochs=1,
        learning_rate=5e-5,
        embedding_learning_rate=1e-5,
        # Use bf16 when the GPU supports it, otherwise fall back to fp16.
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        report_to="wandb",
        save_steps=350,
        run_name="romulus-cpt",
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

trainer_stats = trainer.train()
```
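To see what a single training sample looks like without loading the model, the prompt formatting can be run in isolation. This is a small sketch under assumptions: `format_batch` is a hypothetical stand-alone version of the formatting step, `"<EOS>"` is a stand-in for `tokenizer.eos_token`, and the batch content is illustrative rather than a row from the dataset.

```python
PROMPT = """### Référence :
{}
### Contenu :
{}"""

EOS_TOKEN = "<EOS>"  # stand-in (assumption) for tokenizer.eos_token


def format_batch(examples):
    """Build one formatted prompt per (ref, texte) pair and append the EOS token."""
    return {
        "text": [
            PROMPT.format(ref, texte) + EOS_TOKEN
            for ref, texte in zip(examples["ref"], examples["texte"])
        ]
    }


# Illustrative batch using the column names read by the training script.
batch = {
    "ref": ["Code civil, art. 1240"],
    "texte": ["Tout fait quelconque de l'homme..."],
}
formatted = format_batch(batch)
print(formatted["text"][0])
```

A model fine-tuned from Romulus will likely behave best when downstream prompts keep the same `### Référence :` / `### Contenu :` layout it saw during continued pre-training.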

	<img src="assets/loss.png">

	## Citing & Authors

If you use this code in your research, please cite it using the following BibTeX entry.

```BibTeX
@misc{louisbrulenaudet2024,
  author = {Louis Brulé Naudet},
  title = {Romulus, continually pre-trained models for French law},
  year = {2024},
  howpublished = {\url{https://huggingface.co/datasets/louisbrulenaudet/Romulus-cpt-fr}},
}
```

	## Feedback

	If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).