BGE-Small Fine-Tuned on USCode-QueryPairs

This is a fine-tuned version of the BGE Small embedding model, trained on the USCode-QueryPairs dataset, a subset of the USLawQA corpus. The model is optimized for generating embeddings for legal text, achieving 75% accuracy on the test set.

Overview

  • Base Model: BGE Small
  • Dataset: USCode-QueryPairs
  • Training Details:
    • Hardware: Google Colab (T4 GPU)
    • Training Time: 2 hours
  • Accuracy: 75% on the test set from USLawQA

Applications

This model is well suited to:

  • Legal Text Retrieval: Efficient semantic search across legal documents.
  • Question Answering: Answering legal queries based on context from the US Code.
  • Embeddings Generation: Generating high-quality embeddings for downstream legal NLP tasks.

Usage

The model generates embeddings for input text. The snippet below loads it with the Hugging Face transformers library and encodes a short query:

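# Load the tokenizer and model from the Hugging Face Hub
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")
model = AutoModel.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")
model.eval()
text = "Duties of the president"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
# BGE models conventionally use the [CLS] token's hidden state as the sentence embedding
embedding = outputs.last_hidden_state[:, 0]
# L2-normalize so that dot products equal cosine similarities
embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)
# Print the embedding
print(embedding)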
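
For the retrieval use case listed under Applications, embeddings are typically compared by cosine similarity and the passages sorted by score. The following is a minimal sketch of that ranking step only, in plain Python. The helper names cosine_similarity and rank_passages are hypothetical, and the small vectors are toy stand-ins for real embeddings, which would come from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return (index, score) pairs for the top_k passages most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
query = [1.0, 0.0, 0.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]

for index, score in rank_passages(query, passages, top_k=2):
    print(index, round(score, 3))
```

If the embeddings are already L2-normalized, the cosine similarity reduces to a plain dot product, which is cheaper for large collections.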

Evaluation

The model was evaluated on the test set of USLawQA and achieved the following metrics:

  • Accuracy: 75%
  • Task: Semantic similarity and legal question answering.
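
The card does not state how accuracy was computed. One common reading for a query-passage task is top-1 retrieval accuracy: the fraction of queries whose highest-scoring passage is the labeled one. The sketch below illustrates that definition under this assumption. The function name top1_accuracy and the scores are hypothetical and are not taken from the actual evaluation.

```python
def top1_accuracy(similarity, correct_index):
    """Fraction of queries whose highest-scoring passage is the labeled one.

    similarity[q][p] is the score between query q and passage p;
    correct_index[q] is the index of the relevant passage for query q.
    """
    if not similarity:
        return 0.0
    hits = 0
    for scores, target in zip(similarity, correct_index):
        best = max(range(len(scores)), key=lambda p: scores[p])
        if best == target:
            hits += 1
    return hits / len(similarity)

# Toy scores for 4 queries against 3 passages (hypothetical, not real results).
similarity = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.6, 0.5, 0.7],
    [0.2, 0.9, 0.3],
]
correct_index = [0, 1, 2, 0]

print(top1_accuracy(similarity, correct_index))
```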

Related Resources

📧 Contact

For any inquiries, suggestions, or feedback, feel free to reach out:

Archit Rastogi
📧 architrastogi20@gmail.com


📜 License

This model is distributed under the Apache 2.0 License. Please ensure compliance with applicable copyright laws when using it.
