	---
	license: apache-2.0
	pipeline_tag: text-classification
	tags:
	- transformers
	- sentence-transformers
	- text-embeddings-inference
	language:
	- ko
	- multilingual
	---


	# upskyy/ko-reranker-8k

	ko-reranker-8k is a model obtained by fine-tuning [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) on [Korean data](https://huggingface.co/datasets/upskyy/ko-wiki-reranking).

	## Usage
	## Using FlagEmbedding
	```bash
	pip install -U FlagEmbedding
	```

	Get relevance scores (higher scores indicate more relevance):

	```python
	from FlagEmbedding import FlagReranker


	# use_fp16=True speeds up computation at a slight cost in accuracy
	reranker = FlagReranker('upskyy/ko-reranker-8k', use_fp16=True)

	score = reranker.compute_score(['query', 'passage'])
	print(score) # -8.3828125

	# Setting normalize=True maps the score into the 0-1 range by applying a sigmoid function
	score = reranker.compute_score(['query', 'passage'], normalize=True)
	print(score) # 0.000228713314721116

	scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
	print(scores) # [-11.2265625, 8.6875]

	# normalize=True works the same way for a batch of pairs
	scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], normalize=True)
	print(scores) # [1.3315579521758342e-05, 0.9998313472460109]
	```
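
The `normalize=True` option applies a sigmoid function to the raw score. As a minimal sketch using only the Python standard library (the helper name `sigmoid` is illustrative and not part of FlagEmbedding), the mapping can be reproduced by hand to check the values shown in the comments above:

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw relevance score to the (0, 1) range."""
    # Branch on the sign so that exp() never overflows for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


raw_scores = [-11.2265625, 8.6875]  # raw scores from the example above
normalized = [sigmoid(s) for s in raw_scores]
print(normalized)  # approximately [1.33e-05, 0.99983]
```

A raw score of 0 maps to 0.5, so negative raw scores land below 0.5 and positive ones above it.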


	## Using Hugging Face Transformers

	Get relevance scores (higher scores indicate more relevance):


	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer


	tokenizer = AutoTokenizer.from_pretrained('upskyy/ko-reranker-8k')
	model = AutoModelForSequenceClassification.from_pretrained('upskyy/ko-reranker-8k')
	model.eval()

	pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
	with torch.no_grad():
	    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
	    scores = model(**inputs, return_dict=True).logits.view(-1).float()
	    print(scores)
	```
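
Because higher scores indicate more relevance, reranking amounts to sorting candidate passages by score in descending order. The following sketch assumes the scores have already been computed and converted to a plain list (for example with `scores.tolist()`); the helper `rank_passages` and its `top_k` parameter are hypothetical names, not part of transformers or FlagEmbedding, and the scores are illustrative values:

```python
from typing import List, Optional, Tuple


def rank_passages(
    passages: List[str],
    scores: List[float],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Return (passage, score) pairs sorted from most to least relevant."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


passages = ["hi", "The giant panda is a bear species endemic to China."]
scores = [-11.2265625, 8.6875]  # illustrative scores, one per passage
for passage, score in rank_passages(passages, scores, top_k=1):
    print(f"{score:.4f}  {passage}")
```

Sorting works on either raw or sigmoid-normalized scores, since the sigmoid is monotonic and does not change the ordering.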



	## Citation

	```bibtex
	@misc{li2023making,
	    title={Making Large Language Models A Better Foundation For Dense Retrieval},
	    author={Chaofan Li and Zheng Liu and Shitao Xiao and Yingxia Shao},
	    year={2023},
	    eprint={2312.15503},
	    archivePrefix={arXiv},
	    primaryClass={cs.CL}
	}

	@misc{chen2024bge,
	    title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
	    author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
	    year={2024},
	    eprint={2402.03216},
	    archivePrefix={arXiv},
	    primaryClass={cs.CL}
	}
	```


	## References

	- [Dongjin-kr/ko-reranker](https://huggingface.co/Dongjin-kr/ko-reranker)
	- [reranker-kr](https://github.com/aws-samples/aws-ai-ml-workshop-kr/tree/master/genai/aws-gen-ai-kr/30_fine_tune/reranker-kr)
	- [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)