Data for Continued Pre-Training
Hi, first of all, awesome work! I just wanted to check in and ask about the data used for this step of continued pre-training:
"finally, continued pre-training for the entire model."
I understand that direct sharing may not be possible, but I wanted to ask: was any of the continued pre-training data synthetically generated via OpenAI models (or any other source with similarly restrictive terms of use)?
I'm also curious: how many tokens were used for the continued pre-training?
I'm curious too. I read their paper and didn't find these details: https://browse.arxiv.org/html/2312.15166v1
Details of Datasets and Training Techniques: Thank you for your interest! Unfortunately, due to the high level of competition in this field, we are unable to share detailed information about the training techniques and datasets used. We appreciate your understanding. However, we have released a list of the fine-tuning datasets at https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0.
How much data was used for the continued pre-training?
@hunkim
Thanks! Understood. I'm primarily interested in the upstage/SOLAR-10.7B-v1.0 checkpoint, since it is Apache-2.0 licensed. Based on your response, it seems like you all have done your homework, so I assume there is no issue with using upstage/SOLAR-10.7B-v1.0 to the fullest extent of its Apache-2.0 license, including synthetic data generation, commercial use, etc. Please advise if my interpretation is incorrect, and thanks again.
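For reference, here is a minimal sketch of the kind of usage I have in mind: loading the Apache-2.0 base checkpoint with the Hugging Face transformers library and sampling a completion (e.g. as a starting point for synthetic data generation). The prompt and sampling settings below are just placeholders of mine, not anything recommended by Upstage:

```python
# Minimal sketch (not an official Upstage example): load the Apache-2.0 base
# checkpoint and sample one completion with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "upstage/SOLAR-10.7B-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumes a GPU with enough memory; adjust as needed
    device_map="auto",
)

# Placeholder prompt for drawing synthetic text from the base (non-instruct) model.
prompt = "Write a short paragraph explaining depth up-scaling of language models:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,   # placeholder sampling settings, not Upstage's
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```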