dpr-ctx_encoder-bert-base-multilingual
Description
Multilingual DPR model based on bert-base-multilingual-cased. See also: DPR model, DPR repo.
Data
question pairs for train: 644,217
question pairs for dev: 73,710
*DRCD and MLQA were converted to the DPR format using the haystack script squad_to_dpr.py
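To give a rough idea of what a SQuAD-to-DPR conversion involves, here is a minimal sketch. It is not the haystack script: the function name `squad_to_dpr_records` is hypothetical, and the output field names (`positive_ctxs`, `negative_ctxs`, `hard_negative_ctxs`) are assumed to follow the usual DPR training JSON layout. The sketch also leaves the negative lists empty, whereas a real conversion would normally fill them with retrieved passages.

```python
import json


def squad_to_dpr_records(squad, dataset_name="squad"):
    """Flatten a SQuAD-style dict into DPR-style training records.

    Assumption: the output keys mirror the common DPR training layout.
    Negative contexts are left empty in this sketch.
    """
    records = []
    for article in squad["data"]:
        title = article.get("title", "")
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = [a["text"] for a in qa.get("answers", [])]
                if not answers:
                    # Skip questions without an answer span.
                    continue
                records.append({
                    "dataset": dataset_name,
                    "question": qa["question"],
                    "answers": answers,
                    # The paragraph that contains the answer is the positive passage.
                    "positive_ctxs": [{"title": title, "text": context}],
                    "negative_ctxs": [],
                    "hard_negative_ctxs": [],
                })
    return records


# Tiny made-up SQuAD-style input, for illustration only.
example = {
    "data": [{
        "title": "Taipei",
        "paragraphs": [{
            "context": "Taipei is the capital of Taiwan.",
            "qas": [
                {"id": "q1", "question": "What is the capital of Taiwan?",
                 "answers": [{"text": "Taipei", "answer_start": 0}]},
                {"id": "q2", "question": "Unanswerable question?", "answers": []},
            ],
        }],
    }]
}

records = squad_to_dpr_records(example)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Each answerable question becomes one record paired with its source paragraph; the unanswerable one is dropped.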
Training Script
I used the training script from haystack.
Usage
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
# Load the tokenizer and the context (passage) encoder from the Hub
tokenizer = DPRContextEncoderTokenizer.from_pretrained('voidful/dpr-ctx_encoder-bert-base-multilingual')
model = DPRContextEncoder.from_pretrained('voidful/dpr-ctx_encoder-bert-base-multilingual')
# Tokenize one passage and return PyTorch tensors
input_ids = tokenizer("Hello, is my dog cute ?", return_tensors='pt')["input_ids"]
# pooler_output holds one embedding vector per input passage
embeddings = model(input_ids).pooler_output
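DPR ranks passages by the dot product between a question embedding and each passage embedding. The sketch below shows only that scoring step, using short made-up vectors in place of real encoder outputs (real embeddings from this model would be 768-dimensional, and the question vector would come from the matching question encoder). The helper names `dot` and `rank_passages` are hypothetical and not part of transformers or haystack.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(question_vec, passage_vecs):
    """Return (passage_index, score) pairs sorted by descending dot product."""
    scores = [dot(question_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy vectors standing in for encoder outputs (assumption: in practice these
# would be rows of pooler_output converted with .tolist()).
question_vec = [1.0, 0.0, 2.0]
passage_vecs = [
    [0.5, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
]

ranking = rank_passages(question_vec, passage_vecs)
print(ranking)  # [(1, 3.0), (0, 0.5), (2, 0.0)]
```

For more than a handful of passages, the same scores are usually computed as a single matrix product or looked up through a vector index rather than with a Python loop.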
Follow the haystack tutorial: Better Retrievers via "Dense Passage Retrieval"
from haystack.retriever.dense import DensePassageRetriever
# document_store must be an existing haystack document store created beforehand
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="voidful/dpr-question_encoder-bert-base-multilingual",
                                  passage_embedding_model="voidful/dpr-ctx_encoder-bert-base-multilingual",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=True,
                                  use_fast_tokenizers=True)