---
license: gpl-3.0
language:
- en
datasets:
- HuggingFaceTB/cosmopedia-100k
- pleisto/wikipedia-cn-20230720-filtered
pipeline_tag: text-generation
tags:
- text-generation-inference
---
# NanoLM-365M-base
English | [简体中文](README_zh-CN.md)
## Introduction
NanoLM-365M-base is based on [Qwen2-0.5B](https://huggingface.co/Qwen/Qwen2-0.5B), with the tokenizer replaced by [BilingualTokenizer-8K](https://huggingface.co/Mxode/Bilingual-Tokenizer) to shrink the vocabulary and, with it, the embedding parameters. This reduces the total parameter count from 0.5B to 365M.
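A minimal sketch of what the tokenizer swap could look like with `transformers`. This is not the author's exact script, and it assumes the 8K tokenizer resolves directly from `Mxode/Bilingual-Tokenizer` (it may live under a subfolder or revision of that repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original 0.5B backbone and the smaller bilingual tokenizer.
# NOTE: "Mxode/Bilingual-Tokenizer" is assumed to resolve to the 8K variant.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Mxode/Bilingual-Tokenizer")

# Shrinking the vocabulary (~150K -> 8K tokens) removes most of the embedding
# parameters; the new embedding rows no longer match the old token ids, which
# is why the embeddings are retrained afterwards (see Details).
model.resize_token_embeddings(len(tokenizer))

total = sum(p.numel() for p in model.parameters())
print(f"total params: {total / 1e6:.0f}M")  # roughly 365M
```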
## Details
After replacing the tokenizer, I froze the backbone parameters and trained only the embedding layer, both to recover some of the lost performance and to make fine-tuning for downstream tasks easier. Training ran for 40,000 steps on [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered) and [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k).
| Setting | Value |
| :-------------------------: | :----------------------------------------------------------: |
| Total Params | 365 M |
| Trainable Params | < 10 M |
| Trainable Parts | `model.embed_tokens` |
| Training Steps | 40,000 |
| Training Dataset | [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered), [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) |
| Optimizer | adamw_torch |
| Learning Rate | 2e-4 |
| LR Scheduler | cosine |
| Weight Decay | 0.1 |
| Warm-up Ratio | 0.03 |
| Batch Size | 16 |
| Gradient Accumulation Steps | 1 |
| Seq Len | 4096 |
| Dtype | bf16 |
| Peak GPU Memory | < 48 GB |
| Device | NVIDIA A100-SXM4-80GB |
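A minimal sketch of the freezing setup described above, assuming a standard `transformers` workflow (the actual training script is not included here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rebuild the shrunken model as in the Introduction sketch.
# NOTE: the tokenizer repo id is assumed, see above.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Mxode/Bilingual-Tokenizer")
model.resize_token_embeddings(len(tokenizer))

# Freeze the whole backbone, then unfreeze only the input embedding table
# (model.embed_tokens); with the 8K vocabulary this leaves < 10M trainable
# parameters, matching the table above.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable / 1e6:.1f}M")
```

The frozen model can then be passed to any standard training loop (e.g. `transformers.Trainer`) with the hyperparameters listed in the table.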
The training records are shown below:
![result](static/result.png)
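For completeness, a minimal text-generation example. The repo id `Mxode/NanoLM-365M-base` is an assumption here; adjust it to wherever the model is actually published:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Mxode/NanoLM-365M-base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Artificial intelligence is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```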