I am trying to put together a minimal replication of the streaming behavior, and I eventually arrived at this code:
```python
import time
import soundfile as sf

from src.simuleval_transcoder import SimulevalTranscoder
from src.simuleval_agent_directory import SimulevalAgentDirectory, AgentWithInfo
from src.transcoder_helpers import get_transcoder_output_events

agent_directory = SimulevalAgentDirectory()
pre_agent = agent_directory.build_agent_if_available("SeamlessStreaming", "vad_s2st_sc_main.yaml")
agent = AgentWithInfo(pre_agent, 'SeamlessStream', 's2t', 'en')

transcoder = SimulevalTranscoder(
    agent,
    sample_rate=16000,
    debug=True,
    buffer_limit=5,
)
transcoder.start()

# Generator that simulates streaming audio from a file
def stream_audio(file_path, chunk_size=320):
    # Note: chunk_size is in frames (samples), so 320 frames is 20 ms at 16 kHz
    with sf.SoundFile(file_path, 'r') as f:
        while True:
            data = f.read(chunk_size)
            if len(data) == 0:
                break
            yield data

# Stream the audio and transcribe
audio_file_path = "converted_file.wav"
for chunk in stream_audio(audio_file_path):
    # Process incoming bytes
    transcoder.process_incoming_bytes(chunk, dynamic_config={"targetLanguage": "en"})

    # Check for transcription output
    events = get_transcoder_output_events(transcoder)
    for event in events:
        print(event)
        if event['event'] == 'translation_text':
            print(event['payload'])  # Print the transcribed text
    time.sleep(0.02)

# Finalize
transcoder.close = True
```
But something appears to be wrong: although I can see that the GPU is being utilized, `get_transcoder_output_events(transcoder)` never returns any events. Am I doing something wrong?

I can also see from the debug folder that the dumped audio files are gibberish, so apparently the audio chunks are not being processed correctly. Could it be related to this issue?
https://github.com/facebookresearch/seamless_communication/issues/237
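One thing I suspect: `SoundFile.read` returns float64 numpy arrays by default, while a method named `process_incoming_bytes` sounds like it wants raw bytes. Here is the variant I am considering; this is a sketch only, since I am assuming (not sure) that the transcoder expects 16-bit PCM bytes:

```python
import soundfile as sf

def stream_audio_bytes(file_path, chunk_size=320):
    # Assumption: the transcoder wants raw 16-bit PCM bytes rather than
    # the float64 numpy arrays that soundfile returns by default.
    with sf.SoundFile(file_path, 'r') as f:
        while True:
            data = f.read(chunk_size, dtype='int16')  # int16 samples
            if len(data) == 0:
                break
            yield data.tobytes()  # raw little-endian PCM bytes
```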
Hi @rodrigoheck - wanted to share this Colab notebook we prepared: https://fb.me/mt-neurips, which shows an example of simplified standalone streaming inference (scroll to the bottom). It simulates what is happening in the HF demo - i.e. getting an unsegmented audio stream and passing it in 320ms chunks to the streaming system.
It also provides a visualization/sample audio of the output translation which we overlay on top of the input audio, adding in the appropriate delay/silence to reflect how it would sound if you were actually streaming it in real time.
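For reference, at a 16 kHz sample rate a 320 ms chunk corresponds to 16000 × 0.32 = 5120 samples (note that `f.read(320)` in the snippet above reads 320 frames, i.e. only 20 ms at 16 kHz). A rough sketch of the chunking described here, with `chunk_stream` being an illustrative helper rather than code taken from the notebook:

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_MS = 320
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 5120 samples per chunk

def chunk_stream(samples: np.ndarray, chunk_samples: int = CHUNK_SAMPLES):
    # Split an unsegmented audio array into fixed-size streaming chunks;
    # the final chunk may be shorter than chunk_samples.
    for start in range(0, len(samples), chunk_samples):
        yield samples[start:start + chunk_samples]
```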
Hello @rodrigoheck, have you fixed this minimal demo? I tried debugging it but cannot detect the problem: no events come out of the transcoder, and the output_queue's size is always zero.

The demo at https://fb.me/mt-neurips does work, but I can't really understand what is going on in it, so I am trying to build a more minimal approach, just like the code block above.
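One more thing I am trying, in case the streaming policy only emits after it has accumulated enough context: keep polling for a short grace period after the last chunk has been fed in. A sketch, assuming `get_transcoder_output_events` is non-blocking:

```python
import time

# Keep polling briefly after the last chunk: the streaming policy may
# only emit output once it has seen enough input context.
deadline = time.time() + 5.0  # arbitrary 5-second grace period
while time.time() < deadline:
    events = get_transcoder_output_events(transcoder)
    for event in events:
        print(event)
    if events:
        deadline = time.time() + 5.0  # extend while events keep arriving
    time.sleep(0.1)
```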