en-af-sql-training-1727527893

This model is a fine-tuned version of t5-small on three datasets: b-mc2/sql-create-context, Clinton/Text-to-sql-v1, knowrohit07/know-sql. It achieves the following results on the evaluation set:

Loss: 0.0210

Model description

This is a fine-tuned Afrikaans-to-SQL model. The pretrained t5-small was used to train our SQL model.

Training and Evaluation Datasets

As mentioned, to train the model we used a combination of three dataset which we split into training, testing, and validation sets. THe dataset can be found by following these links:

We did a 80-10-10 split on each dataset and then combined them into a single DatasetDict object with train, test, and validation sets.

DatasetDict({
    train: Dataset({
        features: ['answer', 'question', 'context', 'afr question'],
        num_rows: 118692
    })
    test: Dataset({
        features: ['answer', 'question', 'context', 'afr question'],
        num_rows: 14838
    })
    validation: Dataset({
        features: ['answer', 'question', 'context', 'afr question'],
        num_rows: 14838
    })
})

The pretrained model was then fine-tuned on the dataset splits. Rather than using only the question, the model also takes in the schema context such that it can generate more accurate queries for a given database.

Input prompt

Table context: CREATE TABLE table_55794 (
    "Home team" text,
    "Home team score" text,
    "Away team" text,
    "Away team score" text,
    "Venue" text,
    "Crowd" real,
    "Date" text
)
Question: Watter tuisspan het'n span mebbourne?
Answer:

Expected Output

SELECT "Home team score" FROM table_55794 WHERE "Away team" = 'melbourne'

Intended uses & limitations

This model takes in a single prompt (similar to the one above) that is tokenized and it then uses the input_ids to generate an output SQL query. However the prompt must be structured in a specific way.

The prompt must start with the table/schema description followed by the question followed by an empty answer. Below we illustrate an example on how to use it. Furthermore, our combined dataset looks as follows:

Tokenized Dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 118692
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 14838
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 14838
    })
})

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments
# Load the model and tokenizer from Hugging Face Hub
repo_name = "JsteReubsSoftware/en-af-sql-training-1727527893"
en_af_sql_model = AutoModelForSeq2SeqLM.from_pretrained(repo_name, torch_dtype=torch.bfloat16)
en_af_sql_model = en_af_sql_model.to('cuda')
tokenizer = AutoTokenizer.from_pretrained(repo_name)

question = "Watter tuisspan het'n span mebbourne?"
context = "CREATE TABLE table_55794 (
    "Home team" text,
    "Home team score" text,
    "Away team" text,
    "Away team score" text,
    "Venue" text,
    "Crowd" real,
    "Date" text
)"

prompt = f"""Tables:
{context}

Question:
{question}

Answer:
"""
inputs = tokenizer(prompt, return_tensors='pt')
inputs = inputs.to('cuda')

output = tokenizer.decode(
    en_af_sql_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

print("Predicted SQL Query:")
print(output)

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.005
train_batch_size: 32
eval_batch_size: 32
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 2

We used the following in our program:

output_dir = f'./en-af-sql-training-{str(int(time.time()))}'

training_args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=5e-3,
        num_train_epochs=2,
        per_device_train_batch_size=16,     # batch size per device during training
        per_device_eval_batch_size=16,      # batch size for evaluation
        weight_decay=0.01,
        logging_steps=50,
        evaluation_strategy='steps',        # evaluation strategy to adopt during training
        eval_steps=500,                     # number of steps between evaluation
)

trainer = Trainer(
        model=finetuned_model,
        args=training_args,
        train_dataset=tokenized_datasets['train'],
        eval_dataset=tokenized_datasets['validation'],
)

Training results

Training Loss	Epoch	Step	Validation Loss
0.0573	0.1348	500	0.0452
0.0424	0.2695	1000	0.0364
0.037	0.4043	1500	0.0323
0.0356	0.5391	2000	0.0287
0.0328	0.6739	2500	0.0269
0.0281	0.8086	3000	0.0255
0.0286	0.9434	3500	0.0238
0.0269	1.0782	4000	0.0233
0.0247	1.2129	4500	0.0225
0.0245	1.3477	5000	0.0217
0.0226	1.4825	5500	0.0214
0.0245	1.6173	6000	0.0211
0.024	1.7520	6500	0.0210
0.0249	1.8868	7000	0.0210

Testing results

After our model was trained and validated, we evaluated the model using four evaluation metrics.

Exact Match Accuracy: This measured the accuracy of our model predicting the exact same SQL query as the target query.
TSED score: This metric ranges from 0 to 1 and was proposed by this paper. It allows us to estimate the execution performance of the output query, allowing us to estimate the model's execution accuracy.
SQAM accuracy: Similar to TSED, we can used this to estimate the output query's execution accuracy (also see this paper).
BLEU score: This helps us measure the similarity between the output query and the target query.

The following were the obtained results over the testing set (14838 records):

Exact Match = 35.98 %
TSED score: 0.897
SQAM score: 74.31 %
BLEU score: 0.762

Citing this work:

@misc{jstereubssoftware_2024_Afr2SQL,
  title   = {en-af-sql fine-tuned model},
  author  = {JsteReubsSoftware}, 
  year    = {2024},
  url     = {https://huggingface.co/JsteReubsSoftware/en-af-sql-training-1727527893}
}

Framework versions

Transformers 4.44.2
Pytorch 2.4.0
Datasets 3.0.0
Tokenizers 0.19.1

JsteReubsSoftware
/

en-af-sql-training-1727527893