---
library_name: transformers
license: apache-2.0
language:
- en
- he
widget:
- text: <|endoftext|>\%Hugging face
- text: <|endoftext|>\%Machine learning
- text: <|endoftext|>\%Wikipedia
- text: <|endoftext|>\%דורון אדלר
- text: <|endoftext|>\%
datasets:
- wikimedia/wikipedia
---

# SmolLM-135M-FakyPedia-EngHeb

## Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
- [Training](#training)

## Model Details

**Base Model**

This model extends the tokenizer of, and is a fine-tune of, [SmolLM-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct).

**Model Description:** A bilingual (English and Hebrew) nonsense-generation model that produces silly Wikipedia-like abstract text.

- **Fine-tuned by:** [Doron Adler](https://linktr.ee/Norod78)
- **Model Type:** Text Generation
- **Language(s):** English, Hebrew
- **License:** apache-2.0 (as a derived work of SmolLM)

## Uses

### Input format

The BOS token, followed by `\%`, followed by an optional title for the fake "Wikipedia" article.

### Generation

```bash
pip install transformers
```

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "Norod78/SmolLM-135M-FakyPedia-EngHeb"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id
bos_token = tokenizer.bos_token

model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
model.generation_config.pad_token_id = tokenizer.pad_token_id

torch.manual_seed(1234)

def generate_fakypedia(article_title: str):
    with torch.no_grad():
        # Prompt format: BOS token, then "\%", then the optional article title
        string_to_tokenize = f"{bos_token}\\%{article_title}"
        input_ids = tokenizer(string_to_tokenize, return_tensors="pt").input_ids.to(device)
        sample_outputs = model.generate(
            input_ids,
            do_sample=True,
            repetition_penalty=1.2,
            temperature=0.5,
            max_length=96,
            num_return_sequences=3,
        )
        print(f"# Fakypedia results for \"{article_title}\" \n")
        for i, sample_output in enumerate(sample_outputs):
            decoded_output = (
                tokenizer.decode(sample_output, skip_special_tokens=True)
                .replace(f"\\%{article_title}", f"## {article_title}")
                .replace("\\%", " ")
                .replace("\\n", " \n")
            )
            print("{}\n".format(decoded_output))

generate_fakypedia("Hugging Face")
```

### Generate with llama.cpp

Download [SmolLM-135M-FakyPedia-EngHeb-BF16.gguf](https://huggingface.co/Norod78/SmolLM-135M-FakyPedia-EngHeb/resolve/main/SmolLM-135M-FakyPedia-EngHeb-BF16.gguf)

Run:

```bash
llama-cli -m SmolLM-135M-FakyPedia-EngHeb-BF16.gguf -p "<|endoftext|>\\%Hugging Face"
```

#### Misuse and Out-of-scope Use

## Risks, Limitations and Biases

**CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**

This model is essentially a joke and is intended to generate silly and fake results.
## Training

#### Training Data

[English and Hebrew Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)

#### Training Procedure

* A Hebrew tokenizer with a vocabulary size of 14,000 was trained.
* The trained tokenizer was then [merged](https://huggingface.co/Norod78/gpt2-tokenizer-with-added-hebrew-14k) onto the end of the base model's tokenizer using [this script](https://github.com/huggingface/tokenizers/issues/690#issuecomment-830665989), so the original base model's knowledge was retained while making the model easier to fine-tune on Hebrew text.
* The Hebrew and English datasets were [interleaved](https://huggingface.co/docs/datasets/en/process#interleave) so that each language contributed an identical number of samples (a sketch of this step follows the snippet below).
* Each example was processed in the following manner:

```python
def add_prefix(example):
    # Prefix each article with "\%<title>\%" and flatten real newlines
    # into the literal two-character sequence "\n"
    example["text"] = ("\\%" + example["title"] + "\\%\n" + example["text"]).replace("\n", "\\n")
    return example
```
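For reference, below is a minimal, hypothetical sketch of how the interleaving and prefixing steps could be wired together with the `datasets` library. It is not the exact training script, and the `20231101` Wikipedia dump date is an assumption; any snapshot that provides `en` and `he` configs works.

```python
# Hypothetical sketch (not the exact training script): interleave the English and
# Hebrew Wikipedia splits 1:1 and apply the add_prefix preprocessing shown above.
from datasets import load_dataset, interleave_datasets

# Assumption: the "20231101" wikimedia/wikipedia dump date.
en_wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
he_wiki = load_dataset("wikimedia/wikipedia", "20231101.he", split="train")

def add_prefix(example):
    # Same preprocessing as above: "\%<title>\%" prefix, newlines flattened to literal "\n"
    example["text"] = ("\\%" + example["title"] + "\\%\n" + example["text"]).replace("\n", "\\n")
    return example

# Without sampling probabilities, interleave_datasets alternates examples one-by-one,
# so both languages contribute the same number of samples until the smaller set is exhausted.
mixed = interleave_datasets([en_wiki, he_wiki])
mixed = mixed.map(add_prefix)

print(mixed[0]["text"][:200])
```

Alternating examples (rather than concatenating the corpora) keeps the much larger English Wikipedia from dominating each training batch.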