---
library_name: transformers
license: apache-2.0
language:
- en
- he
widget:
- text: <|endoftext|>\%Hugging face
- text: <|endoftext|>\%Machine learning
- text: <|endoftext|>\%Wikipedia
- text: <|endoftext|>\%דורון אדלר
- text: <|endoftext|>\%
datasets:
- wikimedia/wikipedia
---

# SmolLM-135M-FakyPedia-EngHeb

## Table of Contents
- [Model Details](#model-details)
- [Uses](#uses)
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
- [Training](#training)

## Model Details

**Base Model**

This model is a fine-tuned version of [SmolLM-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct) with an extended tokenizer.

**Model Description:**

A bilingual (English and Hebrew) nonsense-generation model that produces silly, Wikipedia-like abstract text.

- **Fine tuned by:** [Doron Adler](https://linktr.ee/Norod78)
- **Model Type:** Text Generation
- **Language(s):** English, Hebrew
- **License:** apache-2.0 (as a derived work of SmolLM)

## Uses

### Input format

The BOS token, followed by `\%`, followed by an optional title for the fake "Wikipedia" article.
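
For example, a full prompt for an article titled "Hugging Face" looks like this (SmolLM's BOS token is `<|endoftext|>`):

```
<|endoftext|>\%Hugging Face
```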

### Generation
```bash
pip install transformers
```

```python
# pip install transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "Norod78/SmolLM-135M-FakyPedia-EngHeb"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id
bos_token = tokenizer.bos_token
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
model.generation_config.pad_token_id = tokenizer.pad_token_id

torch.manual_seed(1234)

def generate_fakypedia(article_title: str):
    with torch.no_grad():
        # Prompt format: BOS token, then "\%", then the (optional) article title
        string_to_tokenize = f"{bos_token}\\%{article_title}"
        input_ids = tokenizer(string_to_tokenize, return_tensors="pt").input_ids.to(device)
        sample_outputs = model.generate(input_ids, do_sample=True, repetition_penalty=1.2,
                                        temperature=0.5, max_length=96, num_return_sequences=3)
        print(f"# Fakypedia results for \"{article_title}\"  \n")
        for sample_output in sample_outputs:
            # Undo the training-time escaping: "\%" marked the title and
            # literal "\n" sequences stood in for real newlines
            decoded_output = (tokenizer.decode(sample_output, skip_special_tokens=True)
                              .replace(f"\\%{article_title}", f"## {article_title}")
                              .replace("\\%", " ")
                              .replace("\\n", "  \n"))
            print(f"{decoded_output}\n")

generate_fakypedia("Hugging Face")
```
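
Since the title is optional, calling `generate_fakypedia("")` prompts the model with a bare `\%` and lets it invent its own article titles (this mirrors the last widget example above).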

### Generate with llama.cpp

Download [SmolLM-135M-FakyPedia-EngHeb-BF16.gguf](https://huggingface.co/Norod78/SmolLM-135M-FakyPedia-EngHeb/resolve/main/SmolLM-135M-FakyPedia-EngHeb-BF16.gguf)  
Run:
```bash
llama-cli -m SmolLM-135M-FakyPedia-EngHeb-BF16.gguf -p "<|endoftext|>\\%Hugging Face"
```
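
Hebrew titles work the same way, e.g.:

```bash
llama-cli -m SmolLM-135M-FakyPedia-EngHeb-BF16.gguf -p "<|endoftext|>\\%דורון אדלר"
```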

## Risks, Limitations and Biases
**CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**

This model is essentially a joke, intended to generate silly and fake results.

## Training

#### Training Data
 [English and Hebrew Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)

#### Training Procedure

* A new tokenizer with a vocabulary size of 14,000 was trained
* The trained tokenizer was then [merged](https://huggingface.co/Norod78/gpt2-tokenizer-with-added-hebrew-14k) onto the end of the base model's tokenizer using [this script](https://github.com/huggingface/tokenizers/issues/690#issuecomment-830665989), so the original base model's knowledge was retained while making it easier to fine-tune on Hebrew text (see the merge sketch below)
* The Hebrew and English datasets were [interleaved](https://huggingface.co/docs/datasets/en/process#interleave) so that each language contributed an identical number of samples (see the interleaving sketch below)
* Each example was processed in the following manner:
```python
def add_prefix(example):
    # Wrap the title as "\%title\%" and escape literal newlines as "\n"
    # so each article occupies a single line
    example["text"] = ("\\%" + example["title"] + "\\%\n" + example["text"]).replace("\n", "\\n")
    return example
```
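
A minimal sketch of the tokenizer-merge idea (the linked script edits the BPE vocab and merges files directly; this simplified version just appends the novel tokens with `add_tokens`, and the Hebrew tokenizer path is hypothetical):

```python
# Simplified sketch: append the Hebrew tokenizer's novel tokens to the base
# tokenizer. The linked script merges the BPE vocab/merges files directly;
# add_tokens() is a coarser approximation. The Hebrew tokenizer path is illustrative.
from transformers import AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")
hebrew_tokenizer = AutoTokenizer.from_pretrained("path/to/hebrew-14k-tokenizer")  # hypothetical

base_vocab = base_tokenizer.get_vocab()
novel_tokens = [tok for tok in hebrew_tokenizer.get_vocab() if tok not in base_vocab]
base_tokenizer.add_tokens(novel_tokens)

# The model's embedding matrix must grow to match before fine-tuning:
# model.resize_token_embeddings(len(base_tokenizer))
```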
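
And a sketch of the interleaving step using `interleave_datasets` (the exact Wikipedia dump configs are assumptions):

```python
# Sketch of the 1:1 interleaving, assuming these wikimedia/wikipedia configs.
from datasets import load_dataset, interleave_datasets

ds_en = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
ds_he = load_dataset("wikimedia/wikipedia", "20231101.he", split="train")

# Alternates one example from each dataset and stops when the smaller one
# runs out, so both languages contribute an identical number of samples.
mixed = interleave_datasets([ds_en, ds_he], stopping_strategy="first_exhausted")
mixed = mixed.map(add_prefix)
```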