
	---
	license: cc-by-4.0
	language:
	- he
	inference: false
	---
	# Google's mT5-XL - Finetuned for Hebrew Question-Answering

	[Google's mT5](https://github.com/google-research/multilingual-t5) multilingual Seq2Seq model, finetuned on [HeQ](https://u.cs.biu.ac.il/~yogo/heq.pdf) for the Hebrew Question-Answering task.

	This is the model reported in the [`DictaBERT` release](https://arxiv.org/abs/2308.16687).

	Sample usage:

	```python
	import torch
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained('dicta-il/mt5-xl-heq')
	model = AutoModelForSeq2SeqLM.from_pretrained('dicta-il/mt5-xl-heq')

	model.eval()

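	# Hebrew input. The question reads: "How was the information obtainable through cookies restricted?"
	# The context is a passage about legal restrictions on web cookies.
	question='כיצד הוגבל המידע שניתן להשיג באמצעות העוגיות?'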
	context='בניית פרופילים של משתמשים נחשבת על ידי רבים כאיום פוטנציאלי על הפרטיות. מסיבה זו הגבילו חלק מהמדינות באמצעות חקיקה את המידע שניתן להשיג באמצעות עוגיות ואת אופן השימוש בעוגיות. ארצות הברית, למשל, קבעה חוקים נוקשים בכל הנוגע ליצירת עוגיות חדשות. חוקים אלו, אשר נקבעו בשנת 2000, נקבעו לאחר שנחשף כי המשרד ליישום המדיניות של הממשל האמריקאי נגד השימוש בסמים (ONDCP) בבית הלבן השתמש בעוגיות כדי לעקוב אחרי משתמשים שצפו בפרסומות נגד השימוש בסמים במטרה לבדוק האם משתמשים אלו נכנסו לאתרים התומכים בשימוש בסמים. דניאל בראנט, פעיל הדוגל בפרטיות המשתמשים באינטרנט, חשף כי ה-CIA שלח עוגיות קבועות למחשבי אזרחים במשך עשר שנים. ב-25 בדצמבר 2005 גילה בראנט כי הסוכנות לביטחון לאומי (ה-NSA) השאירה שתי עוגיות קבועות במחשבי מבקרים בגלל שדרוג תוכנה. לאחר שהנושא פורסם, הם ביטלו מיד את השימוש בהן.'

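	# inference_mode disables gradient tracking, which generation does not need
	with torch.inference_mode():
	    # Join the question and context into a single prompt string
	    prompt = 'question: %s context: %s ' % (question, context)
	    kwargs = dict(
	        inputs=tokenizer(prompt, return_tensors='pt').input_ids.to(model.device),
	        do_sample=True,  # sampling: the output can vary between runs
	        top_k=50,
	        top_p=0.95,
	        temperature=0.75,
	        max_length=100,
	        min_new_tokens=2
	    )

	    print(tokenizer.batch_decode(model.generate(**kwargs), skip_special_tokens=True))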
	```

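	Output (Hebrew for "through legislation"; because the sample uses `do_sample=True`, results may vary between runs):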
	```json
	["באמצעות חקיקה"]
	```
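
For reproducible answers, the sample above can be adapted to greedy decoding. The following is a minimal sketch, not the authors' evaluation setup: the helper names `build_prompt` and `answer_question` are hypothetical, and the only things taken from the sample are the prompt template and the `max_length` / `min_new_tokens` values. Switching to `do_sample=False` is an assumption made here for determinism, so answers may differ from those produced with sampling.

```python
def build_prompt(question: str, context: str) -> str:
    """Format a question/context pair with the template from the sample usage."""
    # Same template as the sample, including the trailing space.
    return 'question: %s context: %s ' % (question, context)


def answer_question(model, tokenizer, question: str, context: str) -> str:
    """Return a single answer string using greedy decoding.

    `model` and `tokenizer` are expected to be the objects loaded in the
    sample usage (AutoModelForSeq2SeqLM / AutoTokenizer).
    """
    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(build_prompt(question, context), return_tensors='pt').to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=False,   # assumption: greedy decoding instead of sampling
        max_length=100,    # same limits as the sample usage
        min_new_tokens=2,
    )
    # generate returns a batch; we passed one prompt, so take the first result.
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

With the `model`, `tokenizer`, `question`, and `context` from the sample usage, `print(answer_question(model, tokenizer, question, context))` prints a single answer string rather than a list.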


	## Citation

	If you use `mt5-xl-heq` in your research, please cite *DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew*.

	BibTeX:

	```bibtex
	@misc{shmidman2023dictabert,
	title={DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew},
	author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel},
	year={2023},
	eprint={2308.16687},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```

	## License

	Shield: [![CC BY 4.0][cc-by-shield]][cc-by]

	This work is licensed under a
	[Creative Commons Attribution 4.0 International License][cc-by].

	[![CC BY 4.0][cc-by-image]][cc-by]

	[cc-by]: http://creativecommons.org/licenses/by/4.0/
	[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
	[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg