---
library_name: transformers
license: apache-2.0
language:
- en
- he
widget:
- text: <|endoftext|>\%Hugging face
- text: <|endoftext|>\%Machine learning
- text: <|endoftext|>\%Wikipedia
- text: <|endoftext|>\%דורון אדלר
- text: <|endoftext|>\%
datasets:
- wikimedia/wikipedia
---
# SmolLM-135M-FakyPedia-EngHeb
## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
- [Training](#training)
## Model Details
**Base Model**
This model extends the tokenizer of [SmolLM-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct) and is fine-tuned from it.
**Model Description:**
A bilingual (English and Hebrew) nonsense generation model which produces silly Wikipedia-like abstract text.
- **Fine tuned by:** [Doron Adler](https://linktr.ee/Norod78)
- **Model Type:** Text Generation
- **Language(s):** English, Hebrew
- **License:** apache-2.0 (as a derived work of SmolLM)
## Uses
### Input format
The BOS token (`<|endoftext|>`), followed by `\%`, followed by an optional title for the fake "Wikipedia" article.
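For example, the raw prompt for a fake article titled "Hugging Face" (the same string used in the widget examples above) can be built like this:
```python
# "\\%" in Python source is the two characters "\%"
prompt = "<|endoftext|>" + "\\%" + "Hugging Face"  # -> <|endoftext|>\%Hugging Face
```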
### Generation
```bash
pip install transformers
```
```python
# pip install transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "Norod78/SmolLM-135M-FakyPedia-EngHeb"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id
bos_token = tokenizer.bos_token

model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
model.generation_config.pad_token_id = tokenizer.pad_token_id

torch.manual_seed(1234)

def generate_fakypedia(article_title: str):
    with torch.no_grad():
        # Prompt format: BOS token, then "\%", then the optional article title
        string_to_tokenize = f"{bos_token}\\%{article_title}"
        input_ids = tokenizer(string_to_tokenize, return_tensors="pt").input_ids.to(device)
        sample_outputs = model.generate(
            input_ids,
            do_sample=True,
            repetition_penalty=1.2,
            temperature=0.5,
            max_length=96,
            num_return_sequences=3,
        )
        print(f"# Fakypedia results for \"{article_title}\" \n")
        for sample_output in sample_outputs:
            # Turn the "\%" markers back into readable Markdown and restore newlines
            decoded_output = (
                tokenizer.decode(sample_output, skip_special_tokens=True)
                .replace(f"\\%{article_title}", f"## {article_title}")
                .replace("\\%", " ")
                .replace("\\n", " \n")
            )
            print(decoded_output + "\n")

generate_fakypedia("Hugging Face")
```
### Generate with llama.cpp
Download [SmolLM-135M-FakyPedia-EngHeb-BF16.gguf](https://huggingface.co/Norod78/SmolLM-135M-FakyPedia-EngHeb/resolve/main/SmolLM-135M-FakyPedia-EngHeb-BF16.gguf)
Run:
```bash
llama-cli -m SmolLM-135M-FakyPedia-EngHeb-BF16.gguf -p "<|endoftext|>\\%Hugging Face"
```
## Risks, Limitations and Biases
**CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**
This model is essentially a joke and is intended to generate silly, fake results.
## Training
#### Training Data
[English and Hebrew Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
#### Training Procedure
* A tokenizer with a vocabulary size of 14,000 was trained.
* The trained tokenizer was then [merged](https://huggingface.co/Norod78/gpt2-tokenizer-with-added-hebrew-14k) onto the end of the base model's tokenizer using [this script](https://github.com/huggingface/tokenizers/issues/690#issuecomment-830665989), so the original base model's knowledge was retained while making the model easier to fine-tune on Hebrew text.
* The Hebrew and English datasets were [interleaved](https://huggingface.co/docs/datasets/en/process#interleave) so that each language contributed an identical number of samples (see the sketch after this list).
* Each example was processed in the following manner:
```python
def add_prefix(example):
    # Wrap the title in "\%" markers and escape real newlines as the literal two characters "\n"
    example["text"] = ("\\%" + example["title"] + "\\%\n" + example["text"]).replace("\n", "\\n")
    return example
```
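Below is a minimal sketch of the interleaving and preprocessing steps, assuming the `20231101.en` and `20231101.he` configurations of [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) and the 🤗 `datasets` library; the exact dump date and training recipe are assumptions, not the precise setup used.
```python
# Sketch of the data preparation described above; the dataset config names
# (dump date "20231101") are assumptions, not the exact training recipe.
from datasets import load_dataset, interleave_datasets

def add_prefix(example):
    # Same preprocessing as shown above
    example["text"] = ("\\%" + example["title"] + "\\%\n" + example["text"]).replace("\n", "\\n")
    return example

en_wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
he_wiki = load_dataset("wikimedia/wikipedia", "20231101.he", split="train")

# Alternate English and Hebrew rows; interleaving stops when the smaller
# dataset is exhausted, so both languages contribute the same number of samples.
mixed = interleave_datasets([en_wiki, he_wiki])

# Apply the "\%<title>\%" prefix and newline escaping to every example
mixed = mixed.map(add_prefix)
print(mixed[0]["text"][:120])
```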