Example of `training.txt` and `validation.txt` for fine-tuning ProtGPT2
@nferruz Hi Noelia,
Thank you for your great work. I sincerely believe it will make a great contribution to protein biology.
I would like to try fine-tuning your ProtGPT2.
In particular, I want to run this command, which follows your model card:
python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06
Can you give an example of what the content of training.txt and validation.txt looks like?
Let's say I have this FASTA file that I want to turn into training.txt.
>myseq1
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
>myseq2
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG
Should I format it this way:
<|endoftext|>
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG
Or this way?
<|endoftext|>
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
<|endoftext|>
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG
Thanks and hope to hear from you again.
Sincerely,
Littleworth.
Hi Littleworth,
It should be like your second example. Please bear in mind that the sequences must be wrapped exactly as in FASTA format, with a newline character every 60 amino acids.
Like this:
<|endoftext|>
MKDIDTLISNNALWSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERLTGLEPGEL
FVHRNVANLVIHTDLNCLSVVQYAVDVLEVEHIIICGHYGCGGVQAAVENPELGLINNWL
LHIRDIWFKHSSLLGEMPQERRLDTLCELNVMEQVYNLGHSTIMQSAWKRGQKVTIHGWA
YGIHDGLLRDLDVTATNRETLEQRYRHGISNLKLKHANHK
<|endoftext|>
#ANOTHER SEQUENCE
But not like this:
<|endoftext|>
MKDIDTLISNNALWSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERLTGLEPGELFVHRNVANLVIHTDLNCLSVVQYAVDVLEVEHIIICGHYGCGGVQAAVENPELGLINNWLLHIRDIWFKHSSLLGEMPQERRLDTLCELNVMEQVYNLGHSTIMQSAWKRGQKVTIHGWAYGIHDGLLRDLDVTATNRETLEQRYRHGISNLKLKHANHK
<|endoftext|>
#ANOTHER SEQUENCE
Thank you for using ProtGPT2 and posting!
Noelia
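For readers who want to automate the conversion described above, here is a minimal sketch in Python. It assumes a plain FASTA input, replaces each header line with `<|endoftext|>`, and rewraps each sequence at 60 residues per line. The function names `read_fasta` and `to_training_text` are hypothetical helpers written for this illustration; they are not part of ProtGPT2 or `run_clm.py`.

```python
from typing import Iterable, Iterator

EOS = "<|endoftext|>"  # separator token used in the examples above


def read_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield each sequence in a FASTA stream as one unwrapped string."""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header starts a new record; emit the previous one, if any.
            if chunks:
                yield "".join(chunks)
            chunks = []
        else:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def to_training_text(sequences: Iterable[str], width: int = 60) -> str:
    """Write each sequence as an EOS line followed by lines of `width` residues."""
    out = []
    for seq in sequences:
        out.append(EOS)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

A possible way to use it, assuming an input file called `my_sequences.fasta` (a placeholder name):

```python
with open("my_sequences.fasta") as src, open("training.txt", "w") as dst:
    dst.write(to_training_text(read_fasta(src)))
```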
@nferruz Thank you so much!
Hello dear @nferruz,
Even though my train_file is in the following format, I get train_samples = 1 in the output. For the example below, it should be 2 instead of 1.
<|endoftext|>
ETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAG
QEQLGRRIHYSQNDLVEYSPVTEKHLTDG
<|endoftext|>
ETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAG
QEQLGRRIHYSQNDLVEYSPVTEKHLTAG
Why does the model not read the input file correctly? Do you have any idea how to resolve this?
Hello!
I believe you mean during training? In that case, the number of samples is not the number of sequences but the number of groups of 512 tokens that are passed in batches to the model. Your two sequences together are below 512 tokens, hence you don't get more than one sample.
Hope this helps
Noelia
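To illustrate the grouping described above, here is a simplified sketch in Python. It is not the actual code of `run_clm.py`: it assumes that all token ids are concatenated and then cut into blocks of 512, and that an input shorter than one block is kept as a single short sample. How a short remainder is handled is an assumption here and may differ between versions of the script. The function name `group_into_blocks` is hypothetical.

```python
from typing import List


def group_into_blocks(token_ids: List[int], block_size: int = 512) -> List[List[int]]:
    """Split a concatenated token stream into fixed-size training samples."""
    total = len(token_ids)
    if total >= block_size:
        # Drop the trailing remainder so every block is exactly block_size long.
        total = (total // block_size) * block_size
    # If total < block_size, the whole input becomes one short block (assumption).
    return [token_ids[i:i + block_size] for i in range(0, total, block_size)]
```

Under these assumptions, two short sequences that tokenize to, say, 100 tokens in total give `len(group_into_blocks(list(range(100)))) == 1`, which would match the single training sample reported in the question. The sample count only grows once the file contains more than 512 tokens overall.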