Detailed Gemini Summary
Gemini highlights:
Dataset:
- Multimodal and multilingual
- Web documents, books, and code
Dataset size
- #Tokens to train Pro+Ultra chosen using Chinchilla scaling laws.
- Nano models trained on more tokens than predicted by Chinchilla scaling laws, following LLaMA
Filtering
- Quality filtering: heuristics + model-based classifiers (see the sketch after this list)
- Safety filtering done to remove harmful content
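Here's a minimal sketch of what a heuristics-then-classifier filter could look like (the specific rules, the `quality_model` interface, and the threshold are my assumptions, not details from the report):

```python
# Hypothetical sketch of heuristic + model-classifier quality filtering.
# The rules, quality_model interface, and threshold are illustrative
# assumptions, not details from the Gemini report.

def heuristic_ok(doc: str) -> bool:
    """Cheap rule-based checks, run before the (more expensive) classifier."""
    words = doc.split()
    if len(words) < 50:                      # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive text
        return False
    return True

def filter_corpus(docs, quality_model, threshold=0.5):
    """Keep documents that pass the heuristics AND score well under the classifier."""
    for doc in docs:
        if heuristic_ok(doc) and quality_model.score(doc) >= threshold:
            yield doc
```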
Dataset composition
- Which datasets to include + weighting determined by ablations on smaller models
- Training done in multiple stages: the weight of domain-relevant data is increased towards the end of training (see the sketch after this list)
- Data quality is critical
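A minimal sketch of a staged mixture schedule (the domains, weights, and switch point are illustrative guesses; the report doesn't give the actual mixture):

```python
# Hypothetical staged data-mixture schedule: domain-relevant data (here,
# code) is up-weighted toward the end of training. All numbers are
# illustrative assumptions, not Gemini's actual mixture.

MIXTURE_SCHEDULE = [
    # (fraction of training completed, {domain: sampling weight})
    (0.0, {"web": 0.60, "books": 0.20, "code": 0.20}),
    (0.8, {"web": 0.40, "books": 0.20, "code": 0.40}),  # up-weight code late
]

def mixture_at(progress: float) -> dict:
    """Return the sampling weights in effect at a given training progress in [0, 1]."""
    current = MIXTURE_SCHEDULE[0][1]
    for start, weights in MIXTURE_SCHEDULE:
        if progress >= start:
            current = weights
    return current
```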
Model architecture:
- Transformer, decoder-only
- 32,768-token context length
- Uses multi-query attention (plus other transformer efficiency techniques, which are not named in the report; MQA is sketched below)
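For reference, here's a minimal NumPy sketch of multi-query attention: all query heads share a single key/value head, which shrinks the KV cache at inference time. Shapes and dims are illustrative, not Gemini's actual configuration:

```python
import numpy as np

# Minimal multi-query attention sketch: n_heads query heads, but only ONE
# shared key/value head. Dimensions are illustrative, not Gemini's.

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """x: (seq, d_model); w_q: (d_model, n_heads*d_head); w_k, w_v: (d_model, d_head)."""
    seq, d_model = x.shape
    d_head = w_k.shape[1]
    q = (x @ w_q).reshape(seq, n_heads, d_head)   # per-head queries
    k = x @ w_k                                   # single shared key head
    v = x @ w_v                                   # single shared value head
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    out = np.einsum("hqk,kd->qhd", weights, v)    # (seq, n_heads, d_head)
    return out.reshape(seq, n_heads * d_head)
```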
Model sizes
- 4 model sizes
- Ultra <-> GPT-4V (performance-wise on benchmarks)
- Pro <-> GPT3.5/4-Turbo (empirically, as reported by people using Bard today)
- Nano: Nano-1 (1.8B) and Nano-2 (3.25B), 4-bit quantized, distilled from larger models. Nano-1 is designed for low-memory devices, Nano-2 for high-memory devices. ({Gemini Nano is on Pixel 8 phones as of today})
Input:
- Text, interleaved with images, audio, video
- Multi-modal starting from pre-training, as opposed to adding other modalities to text later
- Text input: SentencePiece tokenizer (unknown whether BPE, WordPiece, or unigram)
- Visual input: "inspired by previous work: Flamingo, CoCa, PaLI", i.e. a ViT (probably PaLI-style, as it is the simplest and most recent)
- Video: frames encoded as image inputs (evaluation done on 16 frames, equally spaced apart; see the sampling sketch after this list)
- Audio: Universal Speech Model (USM) features @ 16kHz
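A minimal sketch of that sampling scheme (the helper name and defaults are mine):

```python
# Sketch of the evaluation setup described above: 16 frames, equally
# spaced across the video, each fed to the model as an image input.
# The helper name and defaults are illustrative.

def sample_frame_indices(total_frames: int, n_frames: int = 16) -> list[int]:
    """Indices of n_frames equally spaced frames across a video."""
    if total_frames <= n_frames:
        return list(range(total_frames))
    step = total_frames / n_frames
    return [int(i * step) for i in range(n_frames)]

# e.g. a 320-frame clip -> frames [0, 20, 40, ..., 300]
print(sample_frame_indices(320))
```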
Output:
- Text + image
- Image: inspired by previous approaches (cites DALL-E and Parti): probably Parti, since it is auto-regressive
Implementation Details
- Programmed in JAX (unsurprisingly)
- Trained using Pathways on TPUv5e and TPUv4 (also unsurprisingly)
- Keeps an in-memory copy of the model state to recover from hardware failures, instead of checkpointing to disk. This cuts recovery time, so the model trains for 97% of wall-clock time (up from 85%), though it takes more training resources than checkpointing. ({Jeff Dean on X: only matters for larger models, shouldn't matter for smaller ones})
Alignment:
- Quality > quantity, especially for larger models, when instruction tuning (SFT, reward-model training, RLHF) (also avoiding dataset leakage)
- Cites LLaMA 2 on quality: LLaMA 2 uses fewer but high-quality/high-diversity self-collected SFT examples (especially for chat instructions) instead of millions of low-quality/low-diversity third-party SFT examples from various sources, which improves results (LLaMA 2 uses 27,540 annotations)
- The reward-model data must balance examples of refusals and helpful responses
- Multi-objective optimization: a weighted sum of reward scores for helpfulness, factuality, and safety is used to train a multi-headed reward model (i.e. three outputs, one per objective; the RM loss is a weighted sum, and so is the reward; see the sketch after this list)
- To generate a harmful-response dataset: for each of 20 identified harm types, pass several variants of Google's content-policy language as "constitutions" to a pre-aligned model, and use zero-shot CoT to revise responses and choose between multiple response candidates
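A minimal sketch of such a multi-headed reward model (the pooled feature extractor is abstracted away, and the mixing weights are placeholders; the report doesn't publish them):

```python
import numpy as np

# Sketch of a multi-headed reward model: one shared representation, three
# scalar heads (helpfulness, factuality, safety), and a weighted sum for
# the final reward. Weights and dims are assumptions, not from the report.

HEADS = ("helpfulness", "factuality", "safety")
WEIGHTS = np.array([0.5, 0.3, 0.2])  # assumed mixing weights

rng = np.random.default_rng(0)
d_model = 64
head_params = rng.normal(size=(len(HEADS), d_model))  # one linear head per objective

def reward(features: np.ndarray) -> float:
    """features: (d_model,) pooled representation of a (prompt, response) pair."""
    per_head = head_params @ features   # one scalar score per objective
    return float(WEIGHTS @ per_head)    # weighted-sum reward

# Training would apply the same weighted sum to the per-head losses,
# e.g. sum_i w_i * loss_i(score_i, label_i).
```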
Factuality-focused adaptation (part of instruction tuning):
- Attribution: if asked to generate a response attributed to the prompt context, Gemini should be faithful to that context (incl. summarization, citation, QA over a long prompt (e.g. a book), and adherence to a prompted output format)
- Closed-book response generation: don't answer a fact-seeking prompt without sources, whether it asks for facts directly or a semi-creative prompt indirectly requires facts for the answer
- Hallucination: should hedge instead of trying to answer "unanswerable" questions
Novel MMLU decoding scheme: uncertainty-routed chain-of-thought
- Produce k chain-of-thought samples and select the majority-vote answer if the model is confident beyond a threshold (sketched below)
- Otherwise, return the greedy-decoded answer
- Improves Gemini Ultra on MMLU by 6% (84.0 -> 90.0), vs. 3.1% (84.2 -> 87.3) for GPT-4
- Plain CoT only improves Gemini Ultra's performance on MMLU by ~1%
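A minimal sketch of the routing logic (the `model` interface and threshold value are placeholders; the paper tunes the threshold on a validation split and uses k=32 samples for the reported numbers):

```python
from collections import Counter

# Sketch of uncertainty-routed chain-of-thought: take k CoT samples and
# return the majority-vote answer only when the vote is confident enough;
# otherwise fall back to the greedy answer. The model interface and the
# threshold value are placeholder assumptions.

def uncertainty_routed_cot(model, prompt, k=32, threshold=0.7):
    samples = [model.sample_cot_answer(prompt) for _ in range(k)]  # k CoT answers
    answer, votes = Counter(samples).most_common(1)[0]
    if votes / k >= threshold:          # strong consensus: trust the vote
        return answer
    return model.greedy_answer(prompt)  # low consensus: fall back to greedy
```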
(Everything else is just examples of use-case/benchmark results)
Anyone got a guess or a leak on the models' parameter counts!?
Guess: Pro ≈ 20B, Ultra ≈ 200B. Pro ≈ 70B, Ultra ≈ 200B
Pro ≈ 30-70B, Ultra ≈ 150B-1.5T
For a better analysis, plot MMLU (or other metric) vs. log(#parameters) for a bunch of the newer LLMs and LMMs and extrapolate until you find a suitable number for Pro and Ultra, LOL
Edit: Or just find the max. number of tokens you can train on from the Internet, then use Chinchilla scaling laws to find the corresponding model size.
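Back-of-the-envelope version of that, using Chinchilla's ~20-tokens-per-parameter rule of thumb (the usable-token count is purely an assumption):

```python
# Chinchilla's compute-optimal rule of thumb is roughly 20 training tokens
# per parameter. The usable-token count below is an illustrative assumption,
# not a known figure.

TOKENS_PER_PARAM = 20      # approximate Chinchilla-optimal ratio
usable_tokens = 10e12      # assume ~10T usable high-quality tokens

optimal_params = usable_tokens / TOKENS_PER_PARAM
print(f"Chinchilla-optimal size: ~{optimal_params / 1e9:.0f}B parameters")
# -> ~500B parameters under this assumption
```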
@shermansiu thanks man!
Automatic Speech Recognition:
FLEURS is evaluated on 62 languages, even though the full dataset has 102 languages, following USM. As for other metrics... I couldn't find them on the Whisper paper/model page, so I couldn't get the full v2/v3 (large, 1.5B) metrics for a better comparison.
Gemini Nano-1 and Nano-2 beat Whisper v2 and v3, though, going by the reported results.
All Gemini models (even the 1.8B one) have beaten the previous SOTA, Whisper-v3-large (1.55B)
Machine Translation:
WMT23: presented at EMNLP 2023, and metrics for other models aren't available yet
Pro and Ultra models
Text
Pro and Ultra are generally better than OSS models at text, esp. for math. OSS is not far behind though.
I removed DROP for the same reason it was removed from the HF Open LLM Leaderboard.
Not enough LLMs report results on Natural2Code.
Image
For images, 7-13B OSS models have the same performance as Pro. Performance improvements could come from scaling up the number of parameters.
Video
Gemini Pro and Ultra don't do nearly as well as I thought they would on video, esp. given the parameter sizes of the OSS models they're competing against.
Multilingual
Too few new LLM papers report benchmarks on MGSM (math), XLSum (summarization), or WikiLingua (summarization), opting instead to test multilinguality on Chinese benchmarks, e.g. CMMLU (language understanding), GAOKAO (university admissions test), and C-Eval (multiple-choice questions covering 52 disciplines, from humanities to STEM, at 4 difficulty levels: middle school, high school, college, professional).
Automatic Speech Recognition
Once again, the Nano models beat the OSS SOTA (Whisper-v3-large, 1.55B). Comparing Whisper to the Pro and Ultra models is unnecessary.
And yes, I'm comparing generalist models to a bunch of specialized models. ¯\_(ツ)_/¯