homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2

sound language model


jan-hq committed on Oct 4

Commit 482c8d0 • 1 Parent(s): 40a072d

Update README.md

Files changed (1)

README.md +2 -15


@@ -47,8 +47,8 @@ if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
 vq_model = RQBottleneckTransformer.load_model(
         "whisper-vq-stoks-medium-en+pl-fixed.model"
     ).to(device)
 def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
-    vq_model.ensure_whisper(device)
     wav, sr = torchaudio.load(audio_path)
     if sr != 16000:
@@ -59,19 +59,6 @@ def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
     result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
     return f'<|sound_start|>{result}<|sound_end|>'
-def audio_to_sound_tokens_transcript(audio_path, target_bandwidth=1.5, device=device):
-    vq_model.ensure_whisper(device)
-    wav, sr = torchaudio.load(audio_path)
-    if sr != 16000:
-        wav = torchaudio.functional.resample(wav, sr, 16000)
-    with torch.no_grad():
-        codes = vq_model.encode_audio(wav.to(device))
-        codes = codes[0].cpu().tolist()
-    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
-    return f'<|reserved_special_token_69|><|sound_start|>{result}<|sound_end|>'
 ```
 Then, we can inference the model the same as any other LLM.
@@ -136,7 +123,7 @@ We utilize [torchtune](https://github.com/pytorch/torchtune) library for the lat
 | Parameter                  | Instruction Fine-tuning      |
 |----------------------------|-------------------------|
 | **Epoch**                  | 1                       |
-| **Global batch size**      | 128                     |
 | **Learning Rate**          | 7e-5                  |
 | **Learning Scheduler**     | Cosine with warmup      |
 | **Optimizer**              | Adam torch fused        |

 vq_model = RQBottleneckTransformer.load_model(
         "whisper-vq-stoks-medium-en+pl-fixed.model"
     ).to(device)
+vq_model.ensure_whisper(device)
 def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
     wav, sr = torchaudio.load(audio_path)
     if sr != 16000:
     result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
     return f'<|sound_start|>{result}<|sound_end|>'
 ```
 Then, we can inference the model the same as any other LLM.
 | Parameter                  | Instruction Fine-tuning      |
 |----------------------------|-------------------------|
 | **Epoch**                  | 1                       |
+| **Global batch size**      | 256                     |
 | **Learning Rate**          | 7e-5                  |
 | **Learning Scheduler**     | Cosine with warmup      |
 | **Optimizer**              | Adam torch fused        |
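
For reference, the token-formatting step that this commit leaves untouched in `audio_to_sound_tokens` can be sketched on its own, without loading the VQ model. This is a minimal illustration only: the helper name `format_sound_tokens` is hypothetical and does not appear in the README, and the input is assumed to be a plain list of non-negative integer codes, like the one `codes[0].cpu().tolist()` yields in the removed lines.

```python
def format_sound_tokens(codes):
    """Render integer codes as a sound-token string.

    Hypothetical helper: it mirrors the two formatting lines shown in the
    diff, but is not itself part of the README.
    """
    # Each code becomes a zero-padded, four-digit sound token.
    body = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    # The token sequence is wrapped in start/end markers.
    return f'<|sound_start|>{body}<|sound_end|>'


if __name__ == "__main__":
    print(format_sound_tokens([7, 42, 511]))
    # prints: <|sound_start|><|sound_0007|><|sound_0042|><|sound_0511|><|sound_end|>
```

The deleted `audio_to_sound_tokens_transcript` differed from this only in prepending `<|reserved_special_token_69|>` to the returned string.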