jbetker/tortoise-tts-v2

jbetker committed on Apr 11, 2022

Commit 8215af8 (1 parent: b07fb37)

Add read script


Files changed (3)

data/riding_hood.txt +54 -0
do_tts.py +4 -6
read.py +76 -0

data/riding_hood.txt ADDED

	@@ -0,0 +1,54 @@

+Once upon a time there lived in a certain village a little country girl, the prettiest creature who was ever seen. Her mother was excessively fond of her; and her grandmother doted on her still more. This good woman had a little red riding hood made for her. It suited the girl so extremely well that everybody called her Little Red Riding Hood.
+One day her mother, having made some cakes, said to her, "Go, my dear, and see how your grandmother is doing, for I hear she has been very ill. Take her a cake, and this little pot of butter."
+Little Red Riding Hood set out immediately to go to her grandmother, who lived in another village.
+As she was going through the wood, she met with a wolf, who had a very great mind to eat her up, but he dared not, because of some woodcutters working nearby in the forest. He asked her where she was going. The poor child, who did not know that it was dangerous to stay and talk to a wolf, said to him, "I am going to see my grandmother and carry her a cake and a little pot of butter from my mother."
+"Does she live far off?" said the wolf.
+"Oh I say," answered Little Red Riding Hood; "it is beyond that mill you see there, at the first house in the village."
+"Well," said the wolf, "and I'll go and see her too. I'll go this way and go you that, and we shall see who will be there first."
+The wolf ran as fast as he could, taking the shortest path, and the little girl took a roundabout way, entertaining herself by gathering nuts, running after butterflies, and gathering bouquets of little flowers. It was not long before the wolf arrived at the old woman's house. He knocked at the door: tap, tap.
+"Who's there?"
+"Your grandchild, Little Red Riding Hood," replied the wolf, counterfeiting her voice; "who has brought you a cake and a little pot of butter sent you by mother."
+The good grandmother, who was in bed, because she was somewhat ill, cried out, "Pull the bobbin, and the latch will go up."
+The wolf pulled the bobbin, and the door opened, and then he immediately fell upon the good woman and ate her up in a moment, for it had been more than three days since he had eaten. He then shut the door and got into the grandmother's bed, expecting Little Red Riding Hood, who came some time afterwards and knocked at the door: tap, tap.
+"Who's there?"
+Little Red Riding Hood, hearing the big voice of the wolf, was at first afraid; but believing her grandmother had a cold and was hoarse, answered, "It is your grandchild Little Red Riding Hood, who has brought you a cake and a little pot of butter mother sends you."
+The wolf cried out to her, softening his voice as much as he could, "Pull the bobbin, and the latch will go up."
+Little Red Riding Hood pulled the bobbin, and the door opened.
+The wolf, seeing her come in, said to her, hiding himself under the bedclothes, "Put the cake and the little pot of butter upon the stool, and come get into bed with me."
+Little Red Riding Hood took off her clothes and got into bed. She was greatly amazed to see how her grandmother looked in her nightclothes, and said to her, "Grandmother, what big arms you have!"
+"All the better to hug you with, my dear."
+"Grandmother, what big legs you have!"
+"All the better to run with, my child."
+"Grandmother, what big ears you have!"
+"All the better to hear with, my child."
+"Grandmother, what big eyes you have!"
+"All the better to see with, my child."
+"Grandmother, what big teeth you have got!"
+"All the better to eat you up with."
+And, saying these words, this wicked wolf fell upon Little Red Riding Hood, and ate her all up.

do_tts.py CHANGED

@@ -5,7 +5,7 @@ import torch
 import torch.nn.functional as F
 import torchaudio
-from api_new_autoregressive import TextToSpeech, load_conditioning
+from api import TextToSpeech, load_conditioning
 from utils.audio import load_audio
 from utils.tokenizer import VoiceBpeTokenizer
@@ -18,6 +18,7 @@ if __name__ == '__main__':
         'harris': ['voices/harris/1.wav', 'voices/harris/2.wav'],
         'lescault': ['voices/lescault/1.wav', 'voices/lescault/2.wav'],
         'otto': ['voices/otto/1.wav', 'voices/otto/2.wav'],
+        'obama': ['voices/obama/1.wav', 'voices/obama/2.wav'],
         # Female voices
         'atkins': ['voices/atkins/1.wav', 'voices/atkins/2.wav'],
         'grace': ['voices/grace/1.wav', 'voices/grace/2.wav'],
@@ -27,8 +28,8 @@ if __name__ == '__main__':
     parser = argparse.ArgumentParser()
     parser.add_argument('-text', type=str, help='Text to speak.', default="I am a language model that has learned to speak.")
-    parser.add_argument('-voice', type=str, help='Use a preset conditioning voice (defined above). Overrides cond_path.', default='dotrice,harris,lescault,otto,atkins,grace,kennard,mol')
-    parser.add_argument('-num_samples', type=int, help='How many total outputs the autoregressive transformer should produce.', default=32)
+    parser.add_argument('-voice', type=str, help='Use a preset conditioning voice (defined above). Overrides cond_path.', default='obama,dotrice,harris,lescault,otto,atkins,grace,kennard,mol')
+    parser.add_argument('-num_samples', type=int, help='How many total outputs the autoregressive transformer should produce.', default=128)
     parser.add_argument('-batch_size', type=int, help='How many samples to process at once in the autoregressive model.', default=16)
     parser.add_argument('-num_diffusion_samples', type=int, help='Number of outputs that progress to the diffusion stage.', default=16)
     parser.add_argument('-output_path', type=str, help='Where to store outputs.', default='results/')
@@ -38,9 +39,6 @@ if __name__ == '__main__':
     tts = TextToSpeech(autoregressive_batch_size=args.batch_size)
     for voice in args.voice.split(','):
-        tokenizer = VoiceBpeTokenizer()
-        text = torch.IntTensor(tokenizer.encode(args.text)).unsqueeze(0).cuda()
-        text = F.pad(text, (0,1))  # This may not be necessary.
         cond_paths = preselected_cond_voices[voice]
         conds = []
         for cond_path in cond_paths:
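
Note on the `-voice` handling above: the script splits the argument on commas and indexes `preselected_cond_voices` directly, so a misspelled name would surface as a `KeyError` partway through the loop. The following is a minimal sketch of validating the list up front. `resolve_voices` and `PRESET_VOICES` are hypothetical names used only for illustration, and the table is a small subset of the presets shown in the diff; none of this is part of the commit.

```python
from typing import Dict, List

# Hypothetical subset of the preset table from do_tts.py, for illustration only.
PRESET_VOICES: Dict[str, List[str]] = {
    'obama': ['voices/obama/1.wav', 'voices/obama/2.wav'],
    'dotrice': ['voices/dotrice/1.wav', 'voices/dotrice/2.wav'],
    'atkins': ['voices/atkins/1.wav', 'voices/atkins/2.wav'],
}


def resolve_voices(voice_arg: str, presets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map a comma-separated -voice value to its conditioning clip paths.

    Raises ValueError listing every unknown name, rather than letting a
    KeyError appear in the middle of a long generation run.
    """
    # Tolerate spaces around commas and ignore empty entries such as "a,,b".
    names = [v.strip() for v in voice_arg.split(',') if v.strip()]
    unknown = [n for n in names if n not in presets]
    if unknown:
        raise ValueError(
            f"Unknown voice(s): {', '.join(unknown)}; "
            f"available: {', '.join(sorted(presets))}"
        )
    # Preserve the order the user asked for.
    return {n: presets[n] for n in names}
```

Calling `resolve_voices(args.voice, preselected_cond_voices)` before constructing `TextToSpeech` would fail fast on a typo, before any model weights are loaded.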

read.py ADDED

	@@ -0,0 +1,76 @@

+import argparse
+import os
+import torch
+import torch.nn.functional as F
+import torchaudio
+from api import TextToSpeech, load_conditioning
+from utils.audio import load_audio
+from utils.tokenizer import VoiceBpeTokenizer
+def split_and_recombine_text(texts, desired_length=200, max_len=300):
+    # TODO: also split across '!' and '?'. Attempt to keep quotations together.
+    texts = [s.strip() + "." for s in texts.split('.') if s.strip()]
+    i = 0
+    while i < len(texts):
+        ltxt = texts[i]
+        if len(ltxt) >= desired_length or i == len(texts)-1:
+            i += 1
+            continue
+        if len(ltxt) + len(texts[i+1]) > max_len:
+            i += 1
+            continue
+        texts[i] = f'{ltxt} {texts[i+1]}'
+        texts.pop(i+1)
+    return texts
+if __name__ == '__main__':
+    # These are voices drawn randomly from the training set. You are free to substitute your own voices in, but testing
+    # has shown that the model does not generalize to new voices very well.
+    preselected_cond_voices = {
+        # Male voices
+        'dotrice': ['voices/dotrice/1.wav', 'voices/dotrice/2.wav'],
+        'harris': ['voices/harris/1.wav', 'voices/harris/2.wav'],
+        'lescault': ['voices/lescault/1.wav', 'voices/lescault/2.wav'],
+        'otto': ['voices/otto/1.wav', 'voices/otto/2.wav'],
+        'obama': ['voices/obama/1.wav', 'voices/obama/2.wav'],
+        'carlin': ['voices/carlin/1.wav', 'voices/carlin/2.wav'],
+        # Female voices
+        'atkins': ['voices/atkins/1.wav', 'voices/atkins/2.wav'],
+        'grace': ['voices/grace/1.wav', 'voices/grace/2.wav'],
+        'kennard': ['voices/kennard/1.wav', 'voices/kennard/2.wav'],
+        'mol': ['voices/mol/1.wav', 'voices/mol/2.wav'],
+        'lj': ['voices/lj/1.wav', 'voices/lj/2.wav'],
+    }
+    parser = argparse.ArgumentParser()
+    parser.add_argument('-textfile', type=str, help='A file containing the text to read.', default="data/riding_hood.txt")
+    parser.add_argument('-voice', type=str, help='Use a preset conditioning voice (defined above). Overrides cond_path.', default='dotrice')
+    parser.add_argument('-num_samples', type=int, help='How many total outputs the autoregressive transformer should produce.', default=256)
+    parser.add_argument('-batch_size', type=int, help='How many samples to process at once in the autoregressive model.', default=16)
+    parser.add_argument('-output_path', type=str, help='Where to store outputs.', default='results/longform/')
+    args = parser.parse_args()
+    os.makedirs(args.output_path, exist_ok=True)
+    with open(args.textfile, 'r', encoding='utf-8') as f:
+        text = ''.join([l for l in f.readlines()])
+    texts = split_and_recombine_text(text)
+    tts = TextToSpeech(autoregressive_batch_size=args.batch_size)
+    priors = []
+    for j, text in enumerate(texts):
+        cond_paths = preselected_cond_voices[args.voice]
+        conds = priors.copy()
+        for cond_path in cond_paths:
+            c = load_audio(cond_path, 22050)
+            conds.append(c)
+        gen = tts.tts(text, conds, num_autoregressive_samples=args.num_samples, temperature=.7, top_p=.7)
+        torchaudio.save(os.path.join(args.output_path, f'{j}.wav'), gen.squeeze(0).cpu(), 24000)
+        priors.append(torchaudio.functional.resample(gen, 24000, 22050).squeeze(0))
+        while len(priors) > 2:
+            priors.pop(0)
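
Note on `split_and_recombine_text`: its TODO mentions also splitting on '!' and '?'. The following is a minimal sketch of one way that could look, assuming a regex-based sentence boundary (terminal punctuation, optionally followed by a closing quote, then whitespace). The function name `split_and_recombine`, the boundary regex, and the greedy merge are illustrative choices, not the author's implementation. Unlike the committed function, this variant counts the joining space toward `max_len` and drops empty fragments.

```python
import re
from typing import List

# Sentence boundary: whitespace preceded by '.', '!' or '?', optionally with a
# closing quote after the punctuation. Each lookbehind is fixed-width, as
# Python's `re` module requires.
_BOUNDARY = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]["\']))\s+')


def split_and_recombine(text: str, desired_length: int = 200, max_len: int = 300) -> List[str]:
    """Split text into sentences, then greedily merge short neighbours.

    A chunk keeps absorbing the next sentence while it is shorter than
    `desired_length` and the merged result would not exceed `max_len`.
    """
    sentences = [s.strip() for s in _BOUNDARY.split(text) if s.strip()]

    chunks: List[str] = []
    current = ''
    for sentence in sentences:
        if not current:
            current = sentence
        elif len(current) < desired_length and len(current) + 1 + len(sentence) <= max_len:
            current = f'{current} {sentence}'
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == '__main__':
    for chunk in split_and_recombine('Hi there! How are you? Fine.', desired_length=10, max_len=25):
        print(chunk)
```

This is a heuristic: it does not try to keep a quotation together with its attribution, so a line such as `"Does she live far off?" said the wolf.` is split after the closing quote and only rejoined if the length limits allow it.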