metadata
license: gpl-3.0
language:
- en
- zh
- ja
- de
datasets:
- JosephusCheung/GuanacoDataset
- meta-math/MetaMathQA
- jondurbin/airoboros-3.1
- WizardLM/WizardLM_evol_instruct_V2_196k
- RyokoAI/ShareGPT52K
- RyokoAI/Fandom23K
- milashkaarshif/MoeGirlPedia_wikitext_raw_archive
- wikipedia
- wiki_lingua
- garage-bAInd/Open-Platypus
- LDJnr/Puffin
- BAAI/COIG
- TigerResearch/tigerbot-zhihu-zh-10k
- liwu/MNBVC
- teknium/openhermes
- CausalLM/Refined-Anime-Text
- microsoft/orca-math-word-problems-200k
- m-a-p/CodeFeedback-Filtered-Instruction
Tokenizer is different from cohere - and chat template is ChatML - fully fine-tuned at 128K+ ~ 30M entries long, web crawl input, GPT-4-32k/3.5-16k output, synthetic dataset - 1 epoch
For another candidate version of 1 epoch - https://huggingface.co/CausalLM/35b-beta - somehow less overfitting?
No loras, no quants, no tricks.
This one is not "very 128k", use https://huggingface.co/CausalLM/35b-beta-long for long context. But better in general tasks, knowledge, coding and so on.
And, merge them if you want!