---
license: apache-2.0
---
Base checkpoint
augmxnt/shisa-7b-v1
- Mistral-7B base
- Pre-trained on an additional 8B tokens of MADLAD-Ja
- Fine-tuned on Japanese instruction data
- Highest-scoring 7B model on the JA MT-Bench conversation benchmark
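
For reference, the base checkpoint can be loaded with Hugging Face `transformers`; the snippet below is a minimal sketch, and the dtype/device settings are assumptions rather than recommendations from this card.

```python
# Minimal sketch for loading the base checkpoint with transformers.
# torch_dtype and device_map here are assumptions, not values from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("augmxnt/shisa-7b-v1")
model = AutoModelForCausalLM.from_pretrained(
    "augmxnt/shisa-7b-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```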
Training datasets (total ~7B tokens)
- Aozora Bunko
- Japanese Law Precedent Dataset
- Japanese Wikipedia
- Web scrapes of the .lg.jp, .go.jp, and .ac.jp domains from CulturaX (documents sharing the same first 25 characters were de-duplicated; see the sketch after this list)
- English Ultrachat200K-gen, included so the model retains the English and chat abilities learned in the base checkpoint
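
The prefix de-duplication applied to the CulturaX web scrapes can be illustrated with a short sketch; the function name and the plain-string document format are assumptions, and only the 25-character prefix rule comes from the list above.

```python
def dedupe_by_prefix(documents: list[str], prefix_len: int = 25) -> list[str]:
    """Keep the first document seen for each distinct prefix of length `prefix_len`."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        key = doc[:prefix_len]
        if key not in seen:  # later documents sharing this prefix are dropped
            seen.add(key)
            unique.append(doc)
    return unique
```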