Gumini outperforms Qwen-2.5-1.5B with 5,732× less data and surpasses the 2× larger Llama-3.2-3B with 2,866× less data.
The new standard for data-efficient Korean LLMs.
Model performance was evaluated on two Korean benchmarks to ensure robustness in Korean contexts: the test split of a standard Korean Boolean QA benchmark, and the latest Korean Wikipedia dump for language modeling.
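As a rough illustration of this token-level setup, here is a minimal sketch that computes perplexity and top-1 next-token accuracy for a single passage with a causal LM. The example passage, and the reuse of Gumini-1.5B-Base as the scored model, are placeholders; this is not the actual evaluation harness.

```python
# Minimal sketch: per-passage perplexity and top-1 next-token accuracy.
# The model ID and the example passage are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GuminiResearch/Gumini-1.5B-Base"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "구미는 경상북도의 도시입니다."  # example Korean sentence ("Gumi is a city in North Gyeongsang Province.")
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    logits = model(ids).logits[:, :-1]   # predictions for positions 1..N-1
    targets = ids[:, 1:]                 # the tokens those positions should predict
    loss = torch.nn.functional.cross_entropy(
        logits.float().flatten(0, 1), targets.flatten()
    )
    top1 = (logits.argmax(-1) == targets).float().mean()

print(f"perplexity={loss.exp().item():.2f}  top-1 acc={top1.item():.1%}")
```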
Standard LLMs waste compute on redundant "lazy layers" in deeper networks. Inheritune addresses this by inheriting potent early layers and progressively expanding the network, achieving comparable performance with far fewer parameters and tokens.
More Data Efficient than Qwen-2.5-1.5B
Gumini-1.5B (PPL 8.49) surpasses Qwen-2.5-1.5B (PPL 8.84) at the same scale with 5,732× less data.
Gumini demonstrates that smart architectural choices and curriculum learning can dramatically reduce data requirements.
Evaluated on Korean benchmarks.
Gumini outperforms larger models trained on significantly more data.
| RANK | MODEL | PARAMS | OVERALL PPL ↓ | OVERALL TOP-1 ACC ↑ | OVERALL TOP-5 ACC ↑ | OVERALL SCORE ↑ |
|---|---|---|---|---|---|---|
| #1 | Qwen-2.5-7B | 7.62B | 6.39 | 58.8% | 79.7% | 0.8003 |
| #2 | Gemma-2B | 2B | 8.15 | 54.9% | 76.5% | 0.7759 |
| #3 | Gumini-1.5B | 1.54B | 8.49 | 53.6% | 74.8% | 0.7662 |
| #4 | Qwen-2.5-1.5B | 1.5B | 8.84 | 53.3% | 74.6% | 0.7639 |
| #5 | Llama-3.2-3B | 3.21B | 9.47 | 53.0% | 74.6% | 0.7671 |
| #6 | EXAONE-3.5-2.4B | 2.4B | 9.80 | 54.0% | 76.1% | 0.7766 |
| #7 | Gumini-1B | 1.08B | 11.19 | 49.5% | 70.7% | 0.6971 |
| #8 | Llama-3.2-1B | 1.24B | 12.14 | 49.4% | 70.8% | 0.6720 |
| #9 | Qwen-2.5-0.5B | 0.5B | 13.37 | 47.2% | 68.5% | 0.6240 |
| #10 | BLOOM-1.1B | 1.1B | 16.03 | 41.9% | 64.6% | 0.5365 |
| #11 | Polyglot-Ko-1.3B | 1.3B | 25.05 | 48.6% | 69.1% | 0.4889 |
| Model | Training Tokens | Efficiency Multiplier (vs Gumini's 3.14B tokens) | Calculation |
|---|---|---|---|
| Qwen-2.5-7B | 18T | 5,732× | 18,000B ÷ 3.14B |
| Qwen-2.5-1.5B | 18T | 5,732× | 18,000B ÷ 3.14B |
| Qwen-2.5-0.5B | 18T | 5,732× | 18,000B ÷ 3.14B |
| Llama-3.2-3B | 9T | 2,866× | 9,000B ÷ 3.14B |
| Llama-3.2-1B | 9T | 2,866× | 9,000B ÷ 3.14B |
| EXAONE-3.5-2.4B | ~6.5T | ~2,070× | 6,500B ÷ 3.14B |
| Gemma-2B | 2T | 637× | 2,000B ÷ 3.14B |
| BLOOM-1.1B | 350B | 111× | 350B ÷ 3.14B |
| Polyglot-Ko-1.3B | 213B | 68× | 213B ÷ 3.14B |
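The multipliers above reduce to a single division against Gumini's 3.14B-token budget; the snippet below simply reproduces the table's arithmetic.

```python
# Reproduce the efficiency multipliers: baseline training tokens / Gumini's tokens.
GUMINI_TOKENS_B = 3.14  # Gumini's pre-training budget in billions of tokens

baselines_b = {
    "Qwen-2.5-7B": 18_000,
    "Qwen-2.5-1.5B": 18_000,
    "Qwen-2.5-0.5B": 18_000,
    "Llama-3.2-3B": 9_000,
    "Llama-3.2-1B": 9_000,
    "EXAONE-3.5-2.4B": 6_500,  # approximate token count
    "Gemma-2B": 2_000,
    "BLOOM-1.1B": 350,
    "Polyglot-Ko-1.3B": 213,
}

for model, tokens_b in baselines_b.items():
    print(f"{model}: {tokens_b / GUMINI_TOKENS_B:,.0f}x less data")
```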
"Less is More." Gumini uses a progressive training strategy where layers are added incrementally, ensuring maximum efficiency.
Standard LLMs have inefficient "lazy layers" in deeper networks. Inheritune initializes a compact model by inheriting potent early layers from a larger pre-trained model, then progressively retrains and expands it, achieving comparable or better performance with significantly fewer layers.
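As a concrete sketch of that initialization step, the snippet below builds a shallower model by copying the embeddings, the first k transformer blocks, the final norm, and the LM head from a larger pre-trained reference model. The reference model ID and the number of inherited layers are assumptions for illustration; the actual Gumini training code may differ.

```python
# Sketch of Inheritune-style initialization: inherit the early layers of a larger
# pre-trained model into a shallower "student", then continue pre-training it.
# The reference model and K are illustrative assumptions.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

REFERENCE = "Qwen/Qwen2.5-1.5B"  # hypothetical larger pre-trained model
K = 8                            # number of early transformer blocks to inherit

reference = AutoModelForCausalLM.from_pretrained(REFERENCE, torch_dtype=torch.bfloat16)

# Same architecture, but only K transformer blocks deep.
config = AutoConfig.from_pretrained(REFERENCE)
config.num_hidden_layers = K
student = AutoModelForCausalLM.from_config(config)

# Copy every parameter whose name and shape match: embeddings, blocks 0..K-1,
# the final norm, and the LM head. Blocks K and beyond do not exist in the
# student, so they are skipped automatically.
ref_state = reference.state_dict()
student_state = student.state_dict()
for name, tensor in student_state.items():
    if name in ref_state and ref_state[name].shape == tensor.shape:
        student_state[name] = ref_state[name].clone()
student.load_state_dict(student_state)

# The student is then trained further on the target corpus; progressive expansion
# repeats this step with a larger num_hidden_layers, reusing the already-trained blocks.
```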
Charts: Perplexity Comparison · Efficiency Curve (PPL vs. Params) · Performance Ranking
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model in bfloat16 and place it automatically on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    "GuminiResearch/Gumini-1.5B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("GuminiResearch/Gumini-1.5B-Base")

# Korean prompt: "I am Gumini."
prompt = "저는 구미니입니다."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation with nucleus sampling and a mild repetition penalty.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    repetition_penalty=1.2,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training token counts and source references for benchmark models.
| Model | Tokens | Source |
|---|---|---|
| Qwen-2.5 (7B / 1.5B / 0.5B) | 18T | arXiv |
| Llama-3.2 (3B / 1B) | 9T | HuggingFace |
| Gemma-2B | 2T | arXiv |
| EXAONE-3.5-2.4B | ~6.5T | arXiv |
| BLOOM-1.1B | 350B | HuggingFace |
| Polyglot-Ko-1.3B | 213B | HuggingFace |
| Symbol | Description |
|---|---|
| $D$ | Set of evaluation datasets |
| $L_d$ | Average cross-entropy loss on dataset $d$ |
| $T_d$ | Total token count in dataset $d$ |
| $C_d$ | Correctly predicted tokens in dataset $d$ |
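From these symbols, the per-dataset metrics follow directly. The final aggregation across datasets shown below is an assumption (token-weighted averaging), included only to make the relationship concrete; the document does not specify the exact overall-score formula.

$$
\mathrm{PPL}_d = \exp(L_d), \qquad \mathrm{Acc}_d = \frac{C_d}{T_d}
$$

$$
\mathrm{PPL}_{\text{overall}} = \exp\!\left(\frac{\sum_{d \in D} T_d \, L_d}{\sum_{d \in D} T_d}\right), \qquad
\mathrm{Acc}_{\text{overall}} = \frac{\sum_{d \in D} C_d}{\sum_{d \in D} T_d}
$$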