Example of `training.txt` and `validation.txt` for fine tuning ProtGPT2

#24

by littleworth - opened May 4, 2023

Discussion

littleworth

May 4, 2023

•

edited May 4, 2023

@nferruz Hi Noelia,

Thank you for your great work. I sincerely believe it will make a great contribution to protein biology.

I would like to try fine-tuning your ProtGPT2.
In particular, I am following this command from your model card:

python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06

Can you give an example of what the content of training.txt and validation.txt looks like?

Let's say I have this fasta file that I want to turn into training.txt.

>myseq1
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
>myseq2
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG

Should I format it this way:

<|endoftext|>
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG

Or this way?

<|endoftext|>
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
<|endoftext|>
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG

Thanks and hope to hear from you again.

Sincerely,
Littleworth.

nferruz

Owner May 4, 2023

Hi Littleworth,

It should be like your second example. Please bear in mind that it must follow the FASTA format exactly, with a newline character every 60 amino acids:

Like this:

<|endoftext|>
MKDIDTLISNNALWSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERLTGLEPGEL
FVHRNVANLVIHTDLNCLSVVQYAVDVLEVEHIIICGHYGCGGVQAAVENPELGLINNWL
LHIRDIWFKHSSLLGEMPQERRLDTLCELNVMEQVYNLGHSTIMQSAWKRGQKVTIHGWA
YGIHDGLLRDLDVTATNRETLEQRYRHGISNLKLKHANHK
<|endoftext|>
#ANOTHER SEQUENCE

But not like this:

<|endoftext|>
MKDIDTLISNNALWSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERLTGLEPGELFVHRNVANLVIHTDLNCLSVVQYAVDVLEVEHIIICGHYGCGGVQAAVENPELGLINNWLLHIRDIWFKHSSLLGEMPQERRLDTLCELNVMEQVYNLGHSTIMQSAWKRGQKVTIHGWAYGIHDGLLRDLDVTATNRETLEQRYRHGISNLKLKHANHK
<|endoftext|>
#ANOTHER SEQUENCE

Thank you for using ProtGPT2 and posting!
Noelia
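
For readers who want to automate this, here is a minimal sketch of converting FASTA text into the layout described above: one `<|endoftext|>` line before each sequence, and the sequence wrapped at 60 residues per line. The function names `parse_fasta` and `to_protgpt2_format` are hypothetical helpers written for this illustration; they are not part of ProtGPT2 or `run_clm.py`. The sketch also assumes the FASTA headers can simply be dropped.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_protgpt2_format(fasta_text, width=60):
    """Hypothetical helper: emit '<|endoftext|>' before each sequence,
    with the sequence re-wrapped at `width` residues per line."""
    records = []
    for _, seq in parse_fasta(fasta_text):
        wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        records.append("<|endoftext|>\n" + wrapped)
    return "\n".join(records) + "\n" if records else ""


# The two sequences from the question above.
example = """>myseq1
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
>myseq2
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG
"""
print(to_protgpt2_format(example), end="")
```

The returned string can be written to `training.txt` (and a held-out subset to `validation.txt`). Because the sequences are re-wrapped from scratch, input FASTA files with any line width, including single-line sequences, end up in the 60-per-line layout.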

littleworth

May 4, 2023

@nferruz Thank you so much!

emrecicekyurt

Sep 12, 2023

Hello dear @nferruz,

Even though my train_file is in the following format, I get train_samples = 1 in the output. It should be 2 instead of 1 for this example.

<|endoftext|>
ETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAG
QEQLGRRIHYSQNDLVEYSPVTEKHLTDG
<|endoftext|>
ETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAG
QEQLGRRIHYSQNDLVEYSPVTEKHLTAG

Why does the model not get the input file correctly? Do you have any idea how to resolve it?

nferruz

Owner Sep 12, 2023

Hello!

I believe you mean during training? In that case, the number of samples is the number of groups of 512 tokens that are passed in batches to the model, not the number of sequences in the file. With those two sequences you’re below 512 tokens, so you don’t get more than one sample.

Hope this helps
Noelia
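
To make the counting concrete, here is a simplified sketch of how a token stream can be cut into fixed-size blocks. It is not the actual code of `run_clm.py`; the function name `group_into_blocks` is hypothetical, and the handling of a stream shorter than one block (kept as a single short sample) is an assumption chosen to match the behaviour described in this thread. The real script may differ between versions.

```python
def group_into_blocks(token_ids, block_size=512):
    """Split a flat list of token ids into consecutive blocks.

    Assumption: when at least one full block exists, the trailing
    remainder is dropped; a stream shorter than block_size is kept
    as one short block.
    """
    total = len(token_ids)
    if total >= block_size:
        total = (total // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, total, block_size)]


# Stand-in token ids; in practice these would come from the tokenizer.
short_stream = list(range(100))    # fewer than 512 tokens
long_stream = list(range(1100))    # a bit more than two full blocks

print(len(group_into_blocks(short_stream)))
print(len(group_into_blocks(long_stream)))
```

Under these assumptions the sample count depends only on the total number of tokens in the file, so two short sequences that together tokenize to fewer than 512 tokens form a single sample.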
