Why is nobody talking about the new training corpus released by MBZUAI today?
TxT360 is a 15+ trillion-token corpus that outperforms FineWeb on several metrics. Ablation studies were run on up to 1T tokens.
Read the blog here: LLM360/TxT360
Dataset: LLM360/TxT360