Data for Continued Pre-Training
Hi, first of all, awesome work! I just wanted to check in and ask about the data used for this step of continued pre-training:
"finally, continued pre-training for the entire model."
I understand that direct sharing may not be possible, but I wanted to ask: was any of the continued pre-training data synthetically generated via OpenAI models (or any other source with similarly restrictive terms of use)?
I'm also curious: how many tokens were used for the continued pre-training?
I'm curious too. I read their paper and didn't find these details: https://browse.arxiv.org/html/2312.15166v1
Details of Datasets and Training Techniques: Thank you for your interest! Unfortunately, due to the high level of competition in this field, we are unable to share detailed information about the training techniques and datasets used. We appreciate your understanding. However, we have released a list of the fine-tuning datasets at https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0.
How much data was used for the continued pre-training?
@hunkim
Thanks! Understood. I'm primarily interested in the upstage/SOLAR-10.7B-v1.0 checkpoint, since it is Apache-2.0 licensed. Based on your response, it seems like you all have done your homework, so I assume there is no issue with using upstage/SOLAR-10.7B-v1.0 to the fullest extent of its Apache-2.0 license, including synthetic data generation, commercial use, etc. Please advise if my interpretation is incorrect, and thanks again.
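For reference, here is a minimal sketch of the kind of usage I have in mind: loading the Apache-2.0 base checkpoint with the Hugging Face transformers library and sampling a completion (e.g. as a starting point for synthetic data generation). The prompt and sampling settings below are just placeholders of mine, not anything recommended by Upstage:

```python
# Minimal sketch (not an official Upstage example): load the Apache-2.0 base
# checkpoint and sample one completion with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "upstage/SOLAR-10.7B-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumes a GPU with enough memory; adjust as needed
    device_map="auto",
)

# Placeholder prompt for drawing synthetic text from the base (non-instruct) model.
prompt = "Write a short paragraph explaining depth up-scaling of language models:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,   # placeholder sampling settings, not Upstage's
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```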