Model card

This is a generative model from the paper "Byte Pair Encoding for Symbolic Music" (EMNLP 2023). It was trained on the Maestro dataset to generate classical piano music, using the TSD tokenizer with Byte Pair Encoding (BPE).

Model Details

Model Description

The tokenizer has a vocabulary of 20k tokens learned with Byte Pair Encoding (BPE) using MidiTok.

Developed and shared by: Nathan Fradet
Affiliations: Sorbonne University (LIP6 lab) and Aubay
Model type: causal autoregressive Transformer
Backbone model: GPT2
Music genres: Classical piano 🎹
License: Apache 2.0

Model Sources

Repository: https://github.com/Natooz/BPE-Symbolic-Music
Paper: ACL Anthology https://aclanthology.org/2023.emnlp-main.123/ - arXiv https://arxiv.org/abs/2301.11975

Uses

The model is designed for autoregressive music generation. It generates the continuation of a music prompt.

How to Get Started with the Model

Use the code below to get started with the model. It requires the miditok (>=v2.1.7), transformers, torch and symusic packages, all of which can be installed with pip.

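import torch
from transformers import AutoModelForCausalLM
from miditok import TSD
from symusic import Score

# Allocate new tensors on the GPU by default (requires a CUDA device)
torch.set_default_device("cuda")
model = AutoModelForCausalLM.from_pretrained("Natooz/Maestro-TSD-bpe20k", trust_remote_code=True, torch_dtype="auto")
tokenizer = TSD.from_pretrained("Natooz/Maestro-TSD-bpe20k")

# Load the MIDI prompt and tokenize it
input_midi = Score("path/to/file.mid")
input_tokens = tokenizer(input_midi)

# generate() expects a tensor of shape (batch_size, sequence_length), not a list of ids
input_ids = torch.tensor([input_tokens.ids])
generated_token_ids = model.generate(input_ids, max_length=500)

# Decode the generated ids back to a score and save it as a MIDI file
generated_midi = tokenizer(generated_token_ids)
generated_midi.dump_midi("path/to/continued.mid")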

Training Details

Training Data

The model has been trained on the Maestro dataset. The dataset contains about 200 hours of classical piano music. The tokenizer is trained with Byte Pair Encoding (BPE) to build a vocabulary of 20k tokens.
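To illustrate what learning a BPE vocabulary means here, the following is a minimal toy sketch that operates on sequences of integer token ids. It is not MidiTok's implementation; the function names learn_bpe and merge_pair are hypothetical, and the input ids stand in for base tokens such as pitch, velocity, duration and time-shift tokens. At each step, the most frequent pair of adjacent tokens is replaced by a new token id, which grows the vocabulary by one and shortens the sequences.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges (toy version, hypothetical helper).

    Returns the learned merges {(left, right): new_id} and the re-encoded sequences.
    """
    sequences = [list(seq) for seq in sequences]
    # New token ids are allocated after the largest base id
    next_id = max(tok for seq in sequences for tok in seq) + 1
    merges = {}
    for _ in range(num_merges):
        # Count all pairs of adjacent tokens across the corpus
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair = max(pair_counts, key=pair_counts.get)
        # Stop when no pair occurs more than once: nothing left worth merging
        if pair_counts[best_pair] < 2:
            break
        merges[best_pair] = next_id
        sequences = [merge_pair(seq, best_pair, next_id) for seq in sequences]
        next_id += 1
    return merges, sequences


merges, encoded = learn_bpe([[0, 1, 0, 1, 2, 0, 1]], num_merges=5)
print(merges, encoded)
```

In the model described here, the same idea is applied until the vocabulary reaches 20k tokens, so frequent combinations of base tokens end up represented by a single token.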

Training Procedure

Training regime: fp16 mixed precision on V100 PCIE 32GB GPUs
Compute Region: France

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 64
eval_batch_size: 96
seed: 444
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine_with_restarts
lr_scheduler_warmup_ratio: 0.3
training_steps: 100000
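
As a rough illustration of how the learning rate evolves under these settings, here is a sketch of a cosine schedule with linear warmup and hard restarts, written in plain Python. The function name lr_at_step is hypothetical. The number of restart cycles is not stated in this card, so num_cycles=1 is an assumption; with a single cycle the schedule reduces to a linear warmup followed by one cosine decay.

```python
import math


def lr_at_step(step, base_lr=1e-4, total_steps=100_000, warmup_ratio=0.3, num_cycles=1):
    """Learning rate at `step` (hypothetical helper; num_cycles=1 is an assumption)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Cosine decay, restarted from base_lr at the start of each cycle
    cycle_position = (num_cycles * progress) % 1.0
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))


for s in (0, 15_000, 30_000, 65_000, 100_000):
    print(s, lr_at_step(s))
```

With a warmup ratio of 0.3 and 100000 training steps, the first 30000 steps are warmup, and the learning rate peaks at 0.0001 at the end of that phase.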

Environmental impact

We cannot reliably estimate the amount of CO2eq emitted, as we lack data on the exact power source used during training. However, the cluster used is mostly powered by nuclear energy, a low-carbon energy source, which reduces the direct environmental impact.

Citation

BibTeX:

@inproceedings{bpe-symbolic-music,
    title = "Byte Pair Encoding for Symbolic Music",
    author = "Fradet, Nathan  and
      Gutowski, Nicolas  and
      Chhel, Fabien  and
      Briot, Jean-Pierre",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.123",
    doi = "10.18653/v1/2023.emnlp-main.123",
    pages = "2001--2020",
}