jasonfang3900 committed
Commit: 80dbf03
Parent(s): bb5e7ca

Update README.md

README.md CHANGED
@@ -2,10 +2,11 @@
 license: apache-2.0
 ---
 
-# Tele-FLM
+# Tele-FLM-1T
 Tele-FLM-1T (aka FLM-2-1T) is a 1T open-source multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgement capabilities.
-Built upon the decoder-only transformer architecture, it has been trained on approximately
-Tele-FLM series
+Built upon the decoder-only transformer architecture, it has been trained on approximately 2.3T tokens.
+Tele-FLM-1T, currently the largest model in the Tele-FLM series, is built upon Tele-FLM (52B), which delivers superior performance at its scale, and is in all likelihood capable of handling even harder tasks with better performance.
+For now, it is still under evaluation due to limited computing resources.
 In addition to sharing the model weights, we provide the core designs, engineering practices, and training details, anticipating their benefits for both academic and industrial communities.
 
 ## Model Details
@@ -38,7 +39,7 @@ Based on growth technology, the Tele-FLM-1T model training is divided into three
 - Input and output multiplier
 
 Consequently, Tele-FLM-1T is largely compatible with Llama architecturally.
-To maximize convenience for the community, we made minimal adjustments to Llama's code to adapt it to Tele-FLM and released it as open source.
+To maximize convenience for the community, we made minimal adjustments to Llama's code to adapt it to Tele-FLM-1T and released it as open source.
 
 
 | Models | layer<br>number | attention<br>heads | hidden<br>size | ffn hidden<br>size | vocab<br>size | context<br>length | params<br>count |
@@ -56,8 +57,8 @@ All nodes are interconnected via InfiniBand (IB). The training process lasted ar
 
 ### Software
 
-Tele-FLM utilizes 3D parallel training, combining the prevailing methodologies: data parallelism, tensor parallelism, and pipeline parallelism.
-The parallel training setup for Tele-FLM is configured as follows: tensor parallel=32, pipeline parallel=28, and data parallel=1.
+Tele-FLM-1T utilizes 3D parallel training, combining the prevailing methodologies: data parallelism, tensor parallelism, and pipeline parallelism.
+The parallel training setup for Tele-FLM-1T is configured as follows: tensor parallel=32, pipeline parallel=28, and data parallel=1.
 
 ### Related Work
 [Tele-FLM (52B)](https://huggingface.co/CofeAI/Tele-FLM)
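
The updated README states that Tele-FLM-1T stays largely Llama-compatible and ships with minimally adjusted Llama code. Below is a minimal sketch of what loading such a checkpoint with Hugging Face transformers could look like; the repository id `CofeAI/Tele-FLM-1T` is an assumption (only the 52B repo `CofeAI/Tele-FLM` is linked above), and the dtype and device settings are purely illustrative.

```python
# Hedged sketch: loading a Tele-FLM checkpoint with Hugging Face transformers.
# The repo id "CofeAI/Tele-FLM-1T" is a placeholder assumption; only CofeAI/Tele-FLM (52B)
# is linked in the README above. trust_remote_code=True lets transformers run the
# repository's own (Llama-adapted) modeling code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CofeAI/Tele-FLM-1T"  # hypothetical repo id for the 1T checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # illustrative; a 1T-parameter model cannot fit on one device
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("Tele-FLM-1T is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

In practice a 1-trillion-parameter checkpoint requires multi-node sharding, so `device_map="auto"` here only stands in for whatever distributed loading strategy is actually used.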
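
The parallel layout quoted in the README (tensor parallel=32, pipeline parallel=28, data parallel=1) fixes how many GPUs a single model replica occupies. The sketch below shows only that arithmetic, derived from the three figures in the document; the total GPU count is not stated explicitly in this excerpt.

```python
# Minimal sketch: GPU count implied by the 3D-parallel layout quoted in the README.
# Only TP=32, PP=28, DP=1 come from the document; the rest is illustrative arithmetic.
tensor_parallel = 32    # GPUs sharing each transformer layer's weight shards
pipeline_parallel = 28  # sequential pipeline stages the layer stack is split into
data_parallel = 1       # number of full model replicas

gpus_per_replica = tensor_parallel * pipeline_parallel
world_size = gpus_per_replica * data_parallel

print(f"GPUs per model replica: {gpus_per_replica}")  # 896
print(f"Total GPUs (world size): {world_size}")       # 896
```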