sarulab-speech/hubert-base-jtube

Wataru committed on Feb 5

Commit f3c78a4 • 1 Parent(s): 0570a7e

Update README.md

Files changed (1)

README.md +16 -1

@@ -7,6 +7,20 @@ library_name: transformers
 # hubert-base-jtube
 This repo provides model weights for the [hubert-base model](https://arxiv.org/abs/2106.07447) trained on the [JTubeSpeech](https://github.com/sarulab-speech/jtubespeech) corpus.
 ## Dataset
 We extracted approximately 2720 hours of Japanese speech from the single-speaker subset of the JTubeSpeech corpus.
@@ -45,4 +59,5 @@ hidden_states = model(input_values).last_hidden_state
 # 謝辞/acknowledgements
 本研究は、国立研究開発法人産業技術総合研究所事業の令和5年度覚醒プロジェクトの助成を受けたものです。
-/This work was supported by AIST KAKUSEI project (FY2023).

 # hubert-base-jtube
 This repo provides model weights for the [hubert-base model](https://arxiv.org/abs/2106.07447) trained on the [JTubeSpeech](https://github.com/sarulab-speech/jtubespeech) corpus.
+Scroll down for model usage.
+# FAQ
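+Q. What does this model do?<br>
+A. It embeds speech into latent variables. It can be used for recognition-type tasks such as speech recognition (transcription).
+Q. Is a "spoken language model" a speech version of ChatGPT?<br>
+A. Transformers come in two types: encoder and decoder. Put simply, an encoder is for recognition (a model that obtains latent variables from the original data) and a decoder is for generation (a model that reconstructs the original data). The HuBERT model released here is an encoder-type (recognition) model, which differs from decoder-type (generation) models such as ChatGPT.
+Q. So it cannot produce voices?<br>
+A. It is a model that recognizes speech, not one that generates it. It cannot be used for generation.
+Q. Are there plans to release a decoder-type (generative) model in the future?<br>
+A. No. Releasing a generative model could infringe on individuals' rights, so we have no plans to do so. Rather, we believe that developing technology to protect individuals' rights over their voices is the task of speech technologists. (This spoken language model is a first step toward that.)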
 ## Dataset
 We extracted approximately 2720 hours of Japanese speech from the single-speaker subset of the JTubeSpeech corpus.
 # 謝辞/acknowledgements
 本研究は、国立研究開発法人産業技術総合研究所事業の令和5年度覚醒プロジェクトの助成を受けたものです。
+/This work was supported by AIST KAKUSEI project (FY2023).
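
The FAQ describes the model as an encoder that embeds speech into latent variables, and the hunk header above shows those being read from `model(input_values).last_hidden_state`. As a rough illustration of what that tensor looks like, the sketch below estimates its shape from the waveform length. It rests on assumptions that this README does not state: the kernel sizes, strides, and hidden size are the standard HuBERT base values (the `HubertConfig` defaults in `transformers`), and the input is sampled at 16 kHz. The helper names `num_frames` and `hidden_state_shape` are hypothetical and not part of any library.

```python
# Assumed, not confirmed by this repo: standard HuBERT base feature extractor
# (default conv_kernel / conv_stride of transformers' HubertConfig).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
HIDDEN_SIZE = 768  # hidden size of the base architecture (assumed unchanged)


def num_frames(num_samples: int) -> int:
    """Estimate how many encoder frames a waveform of num_samples yields."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        if length < kernel:
            raise ValueError("waveform is too short for the feature extractor")
        # Unpadded 1-D convolution: floor((length - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


def hidden_state_shape(num_samples: int, batch_size: int = 1) -> tuple:
    """Expected (batch, frames, hidden) shape of last_hidden_state."""
    return (batch_size, num_frames(num_samples), HIDDEN_SIZE)


if __name__ == "__main__":
    one_second = 16_000  # one second of audio, assuming 16 kHz input
    print(hidden_state_shape(one_second))  # (1, 49, 768)
```

Under these assumptions the encoder emits roughly one 768-dimensional vector per 20 ms of audio. A downstream recognition model would consume that sequence of vectors.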