Request: an embedding model for Traditional Chinese
Hi,
Thank you very much for your contributions to Traditional Chinese LLMs.
However, would you consider contributing an embedding model for Traditional Chinese, along the lines of "text2vec-large-chinese" or "bge-large-zh-v1.5"?
I would really appreciate it!
Thanks for your kind words.
I have thought about an embedding model: most of our pretraining corpus has structural information (e.g., title–context pairs), which is well suited for training retrieval/embedding models. I also used bge-zh on twllm.com for reranking search results, and it works quite well.
The current blockers are my limited bandwidth and my low expectations for the impact.
If the open-source community could contribute cases where current embedding models, including the OpenAI API, "text2vec-large-chinese", or "bge-large-zh-v1.5", fail in our language, I would have a better estimate of how much I can contribute.
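For anyone who wants to collect such cases, here is a minimal sketch of a failure-case check (the model ID is real, but the sentences and setup below are illustrative assumptions, not my evaluation code):

```python
# A sketch of a failure-case check, assuming sentence-transformers is
# installed; the sentences below are made-up examples, not evaluation data.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

# A Traditional Chinese query and candidate passages (one relevant, one not).
query = "台北有哪些適合親子的景點？"
passages = [
    "台北市立動物園是熱門的親子景點。",  # relevant
    "台北的房價近年持續上漲。",          # irrelevant
]

# bge-zh recommends prefixing short queries with its (Simplified Chinese)
# retrieval instruction, which itself hints at the Simplified-only training.
instruction = "为这个句子生成表示以用于检索相关文章："
q_emb = model.encode(instruction + query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)

# Rank passages by cosine similarity; if the irrelevant passage wins,
# that is exactly the kind of failure case worth reporting.
scores = util.cos_sim(q_emb, p_embs)[0]
for passage, score in sorted(zip(passages, scores), key=lambda x: -float(x[1])):
    print(f"{float(score):.3f}  {passage}")
```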
Hi,
Thank you for the explanation. I will run some tests with your model.
However, I want to share one thing I learned from the authors of "bge-large-zh-v1.5": their embedding model was in fact trained only on Simplified Chinese, not Traditional Chinese (https://huggingface.co/BAAI/bge-large-zh-v1.5/discussions/3#654b7ac2c5fd9382862542d4).
Therefore, I suspect "text2vec-large-chinese" does not support Traditional Chinese either.
As for the OpenAI embedding models, they are a different matter entirely, because my goal is to run everything locally on my PC.
That seems like good motivation for training Traditional Mandarin embedding models.
Hi,
Absolutely, yes. It is still a fairly new idea for now.
However, I believe many engineering labs in Taiwan will pay attention to this model, because it is genuinely needed.
One more thing:
Is there a required prompt template for Taiwan-LLM-13B-v2.0-chat?
In my application, can I use the following template?
template = """You are a chatbot having a conversation with a human.
Given the following extracted parts of a long document and a question, create a final answer.
{context}
{chat_history}
Human: {human_input}
Chatbot:"""
```python
from transformers import AutoTokenizer

# Messages in the standard Hugging Face chat format.
chat = [
    # {"role": "system", "content": "你講中文"},  # optional system turn: "Speak Chinese"
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer = AutoTokenizer.from_pretrained("yentinglin/Taiwan-LLM-7B-v2.0.1-chat")
# Render the conversation with the model's built-in chat template and append
# the assistant header so the model continues as the assistant.
prompt_for_generation = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```
Try this; it should work for all my models.
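For completeness, a minimal generation sketch (the decoding settings here are illustrative, not tuned recommendations):

```python
# A sketch of feeding the templated prompt to the model for generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yentinglin/Taiwan-LLM-7B-v2.0.1-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

chat = [{"role": "user", "content": "你好，請自我介紹。"}]  # "Hello, please introduce yourself."
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
reply = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(reply)
```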
Re: the embedding model, it would be great if industry labs could sponsor me :)
Hi,
However, in my case I also need "context" and "chat_history", which are my local data and the history of the chat, respectively.
Could you please tell me how I should put "context" and "chat_history" into your prompt template?
Thank you for your help!
Your template looks fine.
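One way to wire it up (a sketch with placeholder contents, not an official recipe): pass the chat history as real user/assistant turns and fold the retrieved context into the final user message before applying the chat template.

```python
# A sketch: fold the retrieved context into the final user turn, and pass
# the chat history as real turns; all variable contents are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yentinglin/Taiwan-LLM-7B-v2.0.1-chat")

context = "...retrieved parts of the long document go here..."   # your {context}
chat_history = [                                                  # your {chat_history}
    {"role": "user", "content": "What is Taiwan-LLM?"},
    {"role": "assistant", "content": "A Traditional Chinese LLM."},
]
human_input = "Which base model was it trained from?"             # your {human_input}

chat = chat_history + [{
    "role": "user",
    "content": (
        "Given the following extracted parts of a long document and a question, "
        f"create a final answer.\n\n{context}\n\nQuestion: {human_input}"
    ),
}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```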
Thank you for your answer!