Why does vocab.json look garbled? It should contain Chinese, but I can't find any

#10

by TigerZ - opened Jun 21

Discussion

TigerZ

Jun 21

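Why does the vocab.json file look garbled when I open it? It should contain Chinese, but I can't find any. How does our BPE handle Chinese: does it split text into whole single characters, whole words, or partial words?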

nomore

Jul 19

..........Bro, there really is Chinese in there. Take a look at the BPE encoding.

gefeifan

Aug 14

How do I convert this BPE encoding back into Chinese characters?

mocorr

Sep 11

edited Sep 11

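# Assumes `tokenizer` has already been loaded, e.g. with AutoTokenizer.from_pretrained(...)
text = "xxx模型在信息检索中的应用"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
# Drop the batch dimension and convert the ids to a plain Python list
input_ids = inputs["input_ids"].squeeze(0).tolist()
tokens = tokenizer.convert_ids_to_tokens(input_ids)
for token in tokens:
    # Convert each BPE token back to readable text
    decode_text = tokenizer.convert_tokens_to_string([token])
    print(f"Decoded Text: {decode_text}")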
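
For background on why the vocab.json entries look garbled, here is a minimal sketch that assumes this tokenizer uses GPT-2-style byte-level BPE (the thread does not confirm this). In that scheme, text is first encoded as UTF-8 bytes, and each byte is mapped to a printable stand-in character before BPE merges are learned, so a Chinese character is stored as a short run of Latin-looking characters. The helper names `to_vocab_form` and `from_vocab_form` are made up for illustration and are not part of any library.

```python
def bytes_to_unicode():
    """Build a GPT-2-style table: each of the 256 byte values -> one printable character."""
    # Bytes that are already printable keep their own code point.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Remaining bytes (control chars, space, etc.) are shifted to code points >= 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_vocab_form(text):
    """Hypothetical helper: how a string would be spelled inside vocab.json."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_vocab_form(token):
    """Hypothetical helper: map a vocab.json-style string back to readable text."""
    raw = bytes(CHAR_TO_BYTE[c] for c in token)
    # A token that covers only part of a character decodes to U+FFFD.
    return raw.decode("utf-8", errors="replace")


garbled = to_vocab_form("中文")
print(garbled)                   # looks like mojibake, contains no Chinese
print(from_vocab_form(garbled))  # round-trips to the original text
```

Under this assumption, a single token does not have to line up with a whole character or word: one token may hold several characters, exactly one, or only some of the bytes of one. In the last case, decoding that token alone can print a replacement character, while decoding the full token sequence together restores the text.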

@TigerZ @gefeifan
