Tucano
Tucano is a series of decoder-transformers based on the Llama 2 architecture, natively pre-trained in Portuguese.
- Paper • arXiv:2411.07854
TucanoBR/Tucano-2b4
Text Generation • Note: 2.4 billion-parameter version of the Tucano series.
TucanoBR/Tucano-2b4-Instruct
Text Generation • Note: 2.4 billion-parameter version of the Tucano series, fine-tuned on the TucanoBR/Tucano-SFT dataset.
TucanoBR/Tucano-1b1
Text Generation • Note: 1.1 billion-parameter version of the Tucano series.
TucanoBR/Tucano-1b1-Instruct
Text Generation • Note: 1.1 billion-parameter version of the Tucano series, fine-tuned on the TucanoBR/Tucano-SFT dataset.
TucanoBR/Tucano-630m
Text Generation • Note: 630 million-parameter version of the Tucano series.
TucanoBR/Tucano-160m
Text Generation • Note: 160 million-parameter version of the Tucano series.
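The checkpoints listed above share the naming pattern `TucanoBR/Tucano-<size>`, with an `-Instruct` suffix for the fine-tuned variants. Since the series is based on the Llama 2 architecture, the models can presumably be loaded with the Hugging Face `transformers` library like any other causal language model. The following is a minimal sketch under that assumption; the helper names `tucano_repo_id` and `generate`, the default size, and the generation settings are illustrative and not taken from the model cards.

```python
SIZES = ("160m", "630m", "1b1", "2b4")   # sizes listed in this collection
INSTRUCT_SIZES = ("1b1", "2b4")          # sizes with a listed "-Instruct" variant


def tucano_repo_id(size: str, instruct: bool = False) -> str:
    """Build the Hub repository id for a Tucano checkpoint (hypothetical helper)."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    if instruct and size not in INSTRUCT_SIZES:
        raise ValueError(f"no Instruct variant is listed for size {size!r}")
    suffix = "-Instruct" if instruct else ""
    return f"TucanoBR/Tucano-{size}{suffix}"


def generate(prompt: str, size: str = "160m", max_new_tokens: int = 40) -> str:
    """Download a checkpoint and continue `prompt` (needs network access)."""
    # Imported lazily so tucano_repo_id works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = tucano_repo_id(size)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("A floresta amazônica é")` would download the 160m checkpoint and return the prompt followed by the model's continuation. Starting with the smallest model keeps the download small while checking that the setup works.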
TucanoBR/BERTimbau-large-text-filter
Text Classification • Note: BERTimbau-large fine-tuned on the TucanoBR/GigaVerbo-Text-Filter dataset.
TucanoBR/BERTimbau-base-text-filter
Text Classification • Note: BERTimbau-base fine-tuned on the TucanoBR/GigaVerbo-Text-Filter dataset.
TucanoBR/XGBClassifier-text-filter
Note: XGBClassifier trained on the TucanoBR/GigaVerbo-Text-Filter dataset (requires the embeddings generated by sentence-transformers/LaBSE).
TucanoBR/XGBRegressor-text-filter
Note: XGBRegressor trained on the TucanoBR/GigaVerbo-Text-Filter dataset (requires the embeddings generated by sentence-transformers/LaBSE).
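The two XGBoost filters above take LaBSE sentence embeddings as input rather than raw text, so using them is a two-step process: embed, then predict. The sketch below illustrates that flow with the `sentence-transformers` and `xgboost` packages. It assumes the model file has already been downloaded to a local path; the file name and format inside the repository are not stated in this listing, so `xgb_model_path` is a placeholder, and both function names are hypothetical.

```python
def load_components(xgb_model_path: str):
    """Load the LaBSE encoder and an XGBoost classifier from a local file.

    `xgb_model_path` is a placeholder: the actual file name in the
    TucanoBR repository is an assumption the caller must resolve.
    """
    # Imported lazily so score_texts can be used (and tested) without them.
    from sentence_transformers import SentenceTransformer
    from xgboost import XGBClassifier

    encoder = SentenceTransformer("sentence-transformers/LaBSE")
    classifier = XGBClassifier()
    classifier.load_model(xgb_model_path)
    return encoder, classifier


def score_texts(texts, encoder, classifier):
    """Embed `texts` with the encoder, then classify the embeddings."""
    embeddings = encoder.encode(list(texts))
    return list(classifier.predict(embeddings))
```

Keeping the encoder and classifier as arguments of `score_texts` means the embedding model is loaded once and reused across batches. The same function should work with the XGBRegressor variant, since it only relies on a `predict` method.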
TucanoBR/GigaVerbo
145M rows • Note: GigaVerbo is a 780 GB corpus of Portuguese text containing over 200 billion tokens, built by concatenating several datasets available on Hugging Face.
TucanoBR/GigaVerbo-Text-Filter
110k rows • Note: GigaVerbo Text-Filter is a dataset of 110,000 randomly selected samples from 9 subsets of GigaVerbo, all scored by GPT-4o.
TucanoBR/Tucano-SFT
680k rows • Note: The dataset used to train the "Instruct" versions of the Tucano series.
TucanoBR/lambada-pt
5.15k rows • Note: A Portuguese translation of the LAMBADA test split, as pre-processed by OpenAI.
TucanoBR/alpaca-eval-pt
805 rows • Note: This dataset contains 805 samples from the Alpaca dataset, translated into Portuguese.
nicholasKluge/reward-aira-dataset
70k rows • Note: This dataset contains pairs of completions to prompts. Used for DPO fine-tuning.
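To illustrate how pairs of completions feed into DPO fine-tuning, here is a minimal sketch of the standard Direct Preference Optimization loss for a single preferred/rejected pair. It assumes the usual formulation, in which the loss is the negative log-sigmoid of the scaled difference between the policy's and a frozen reference model's log-probability ratios. The function name, argument names, and `beta=0.1` are illustrative; this listing does not state the settings used for the Tucano models.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) completion pair.

    Each argument is the summed log-probability of a completion given the
    prompt, under either the policy being trained or the reference model.
    """
    # Implicit rewards: how much more likely the policy makes each
    # completion compared with the reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the preferred completion. In practice this would be computed over batches of tensors by a training library rather than on Python floats.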