shortyforce
commited on
Commit
•
4c6095e
1
Parent(s):
a6e5daf
Update README.md
Browse files
README.md
CHANGED
@@ -14,7 +14,7 @@ inference: false
|
|
14 |
- **模型结构**:XVERSE-13B 使用主流 Decoder-only 的标准 Transformer 网络结构,支持 8K 的上下文长度(Context Length),为同尺寸模型中最长,能满足更长的多轮对话、知识问答与摘要等需求,模型应用场景更广泛。
|
15 |
- **训练数据**:构建了 1.4 万亿 token 的高质量、多样化的数据对模型进行充分训练,包含中、英、俄、西等 40 多种语言,通过精细化设置不同类型数据的采样比例,使得中英两种语言表现优异,也能兼顾其他语言效果。
|
16 |
- **分词**:基于 BPE(Byte-Pair Encoding)算法,使用上百 GB 语料训练了一个词表大小为 100,278 的分词器,能够同时支持多语言,而无需额外扩展词表。
|
17 |
-
-
|
18 |
|
19 |
## Model Introduction
|
20 |
|
@@ -23,7 +23,7 @@ inference: false
|
|
23 |
- **Model Structure**: XVERSE-13B uses the mainstream Decoder-only Transformer network structure, supports 8k context length, the longest one among models of the same size, which can meet the need of longer multi-round dialogues, knowledge question-answering, and summarization. This makes the model more versatile in application scenarios.
|
24 |
- **Training Data**: The model has been thoroughly trained on a diversified and high-quality dataset consisting of 1.4 trillion of tokens, including more than 40 languages such as Chinese, English, Russian, and Spanish. The sampling ratio of different types of data is finely set, which makes the performance of Chinese and English excellent, and also takes into account the effect of other languages.
|
25 |
- **Tokenization**: Based on the BPE (Byte-Pair Encoding) algorithm, a tokenizer with a vocabulary size of 100,278 has been trained using hundreds of gigabytes of language data. This tokenizer is capable of supporting multilingual without the need for additional vocabulary expansion.
|
26 |
-
- **Training Framework**:
|
27 |
|
28 |
## 评测结果
|
29 |
|
|
|
14 |
- **模型结构**:XVERSE-13B 使用主流 Decoder-only 的标准 Transformer 网络结构,支持 8K 的上下文长度(Context Length),为同尺寸模型中最长,能满足更长的多轮对话、知识问答与摘要等需求,模型应用场景更广泛。
|
15 |
- **训练数据**:构建了 1.4 万亿 token 的高质量、多样化的数据对模型进行充分训练,包含中、英、俄、西等 40 多种语言,通过精细化设置不同类型数据的采样比例,使得中英两种语言表现优异,也能兼顾其他语言效果。
|
16 |
- **分词**:基于 BPE(Byte-Pair Encoding)算法,使用上百 GB 语料训练了一个词表大小为 100,278 的分词器,能够同时支持多语言,而无需额外扩展词表。
|
17 |
+
- **训练框架**:自主研发多项关键技术,包括高效算子、显存优化、并行调度策略、数据-计算-通信重叠、平台和框架协同等,让训练效率更高,模型稳定性强,在千卡集群上的峰值算力利用率可达到 58.5%,位居业界前列。
|
18 |
|
19 |
## Model Introduction
|
20 |
|
|
|
23 |
- **Model Structure**: XVERSE-13B uses the mainstream Decoder-only Transformer network structure, supports 8k context length, the longest one among models of the same size, which can meet the need of longer multi-round dialogues, knowledge question-answering, and summarization. This makes the model more versatile in application scenarios.
|
24 |
- **Training Data**: The model has been thoroughly trained on a diversified and high-quality dataset consisting of 1.4 trillion of tokens, including more than 40 languages such as Chinese, English, Russian, and Spanish. The sampling ratio of different types of data is finely set, which makes the performance of Chinese and English excellent, and also takes into account the effect of other languages.
|
25 |
- **Tokenization**: Based on the BPE (Byte-Pair Encoding) algorithm, a tokenizer with a vocabulary size of 100,278 has been trained using hundreds of gigabytes of language data. This tokenizer is capable of supporting multilingual without the need for additional vocabulary expansion.
|
26 |
+
- **Training Framework**: Several key technologies have also been independently developed, including efficient operators, memory optimization, parallel scheduling strategies, overlap of data-computation-communication, and synergy between platforms and frameworks. These advancements enhance training efficiency and model stability. With these technologies, the peak computational power utilization rate on a thousand-card cluster can reach 58.5%, ranking at the forefront of the industry.
|
27 |
|
28 |
## 评测结果
|
29 |
|