
TREC DL 19 Metric Mismatch

#1
by krypticmouse - opened

Hi, I was trying to benchmark this model on the TREC DL19 track; I was mainly reranking a ranking I got from another model. However, the metrics seem low and don't even match the metrics of the original ranking:

NDCG@5: 0.6620280234158401
NDCG@10: 0.6530871918417204

I'm using pytrec_eval to evaluate. What could be the issue?
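
For reference, the evaluation I'm running looks roughly like this (a minimal sketch with made-up qrels/run entries; the real ones come from the DL19 qrels file and my reranked run):

    import pytrec_eval

    # Toy example: qrels map qid -> {docid: graded relevance}, run maps qid -> {docid: score}.
    qrels = {'19335': {'1017759': 3, '1082489': 0}}
    run = {'19335': {'1017759': 12.3, '1082489': 7.1}}

    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'ndcg_cut.5', 'ndcg_cut.10'})
    per_query = evaluator.evaluate(run)

    # Macro-average NDCG@10 over queries.
    ndcg10 = sum(q['ndcg_cut_10'] for q in per_query.values()) / len(per_query)
    print(ndcg10)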

Castorini org

Hi @krypticmouse
just fyi I have seen your comments. I am looking into it now, will get back to you asap.

Thanks a lot! Attaching the inference code I used. maxlen is 180, bsize is 1, and the model is castorini/rankllama-v1-7b-lora-passage:

        tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', token="hf_...")
        tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        model = self.get_model(self.model)
        
        assert len(qids) == len(pids), (len(qids), len(pids))

        scores = []

        model.eval()
        with torch.inference_mode():
            with torch.cuda.amp.autocast():
                for offset in tqdm.tqdm(range(0, len(qids), self.bsize), disable=(not show_progress)):
                    endpos = offset + self.bsize

                    queries_ = [f'query: {self.queries[qid]}</s>' for qid in qids[offset:endpos]]
                    passages_ = [f'document: {self.collection[pid]}</s>' for pid in pids[offset:endpos]]

                    # Tokenize each query/passage pair into a single cross-encoder input.
                    features = tokenizer(queries_, passages_, padding='longest', truncation=True,
                                         return_tensors='pt', max_length=self.maxlen).to(self.device)

                    batch_scores = model(**features).logits.view(-1).float()

                    scores.append(batch_scores)

        # Concatenate the per-batch score tensors (torch.cat handles any batch size).
        scores = torch.cat(scores).tolist()

Hi, @krypticmouse
could you try to change these two lines to:

queries_ = [f'query: {self.queries[qid]}' for qid in qids[offset:endpos]]
passages_ = [f'document: {self.collection[pid]}' for pid in pids[offset:endpos]]

I just noticed that in our implementation we ended up not using the </s> token's representation to compute relevance scores for the cross-encoder reranker, as it caused errors like https://github.com/microsoft/DeepSpeed/issues/4017 when fine-tuning on V100 machines with fp16 (this issue didn't happen with the bi-encoder).

We instead use the last token of the input sequence, i.e. the last token of the document, to compute the score. I will make updates accordingly.
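
Roughly, the pooling looks like this (an illustrative sketch of the idea, not the exact modeling code; it assumes right padding and per-token scores of shape (batch, seq_len, 1)):

    import torch

    def last_token_scores(token_scores, input_ids, pad_token_id):
        # token_scores: (batch, seq_len, 1) per-token logits; input_ids: (batch, seq_len), right-padded.
        # Index of the last non-padding token in each sequence.
        last_positions = (input_ids != pad_token_id).sum(dim=-1) - 1
        # Score each pair at that position rather than at an appended </s>.
        return token_scores[torch.arange(token_scores.size(0)), last_positions].squeeze(-1)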

Oh alright! Will try it out, thanks a lot!

Did you use this exact prompt for the benchmarks in the paper, or was it different?

Castorini org

same prompt.

Another potential difference: the MS MARCO passage corpus we use is the 'with title' version.
https://huggingface.co/datasets/Tevatron/msmarco-passage-corpus.
https://arxiv.org/pdf/2304.12904.pdf
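
If you want to match that setup, something along these lines should work (a rough sketch; I'm assuming the dataset exposes docid/title/text fields):

    from datasets import load_dataset

    # 'With title' MS MARCO passage corpus (assumed fields: docid, title, text).
    corpus = load_dataset('Tevatron/msmarco-passage-corpus', split='train')

    # Prepend the title to the passage text before building the 'document: ...' input.
    collection = {row['docid']: f"{row['title']} {row['text']}" for row in corpus}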

Oh right, thanks a lot! I went ahead with that change and it seems to have improved things. Thanks for the help :)

Anything about how to do batch inference?


I am working on that, will get back in a day.

Thanks a ton for the help! Really appreciate it :)

Castorini org

Hi @MrLight, @krypticmouse,
I was able to reproduce the performance, but the running time seems quite long. The running time for DL19 was fair, but when scaling to large collections like NQ or FEVER in the BEIR benchmark, the estimated running time is 48+ hours. Any thoughts on this?

Castorini org
edited Apr 22

Hi @cramraj8, yes, a limitation of LLM embeddings is the inference time. You would have to speed up the corpus encoding by running on multiple GPUs in parallel, and bf16 and FlashAttention-2 should also help speed things up.
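
For example, something along these lines when loading the model (a rough sketch following the usual PEFT loading pattern; flash-attn needs to be installed separately for attn_implementation='flash_attention_2'):

    import torch
    from transformers import AutoModelForSequenceClassification
    from peft import PeftConfig, PeftModel

    peft_model_name = 'castorini/rankllama-v1-7b-lora-passage'
    config = PeftConfig.from_pretrained(peft_model_name)

    # Load the base model in bf16 with FlashAttention-2, then merge the LoRA adapter.
    base = AutoModelForSequenceClassification.from_pretrained(
        config.base_model_name_or_path,
        num_labels=1,
        torch_dtype=torch.bfloat16,
        attn_implementation='flash_attention_2',
    )
    model = PeftModel.from_pretrained(base, peft_model_name).merge_and_unload().eval().to('cuda')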

Ah, sorry, are you running reranking or retrieval? How many GPUs are you running with?

Got it. I was doing reranking only, with 1 GPU. Earlier I was using the example code given on the model card page, which reranks query by query; that was very slow. But the batch inference codebase is pretty fast.
