Research Preview

Gumini (구미니)

Gumini outperforms Qwen-2.5-1.5B with 5,732× less data and surpasses the 2× larger Llama-3.2-3B with 2,866× less data.
The new standard for data-efficient Korean LLMs.

  • Training Tokens: 3.14B
  • Perplexity: 8.49
  • Overall Rank: #3

Evaluation Methodology

I evaluated model performance on two Korean benchmarks to verify robustness in Korean-language contexts.

  • KoBEST BoolQ (Korean standard): the standard Korean Boolean QA benchmark, test split.
  • Wikipedia KO (recent ko-wiki): the latest Korean Wikipedia dump, used for language modeling.
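
Both datasets can be pulled with the Hugging Face datasets library; the minimal sketch below assumes the public skt/kobest_v1 BoolQ config and a recent wikimedia/wikipedia Korean snapshot (the exact dump used for evaluation is not pinned here).

from datasets import load_dataset

# KoBEST BoolQ: Korean Boolean QA benchmark, evaluated on its test split.
kobest_boolq = load_dataset("skt/kobest_v1", "boolq", split="test")

# Korean Wikipedia dump used for language-modeling perplexity.
# The snapshot name is an assumption; substitute the dump actually used.
wiki_ko = load_dataset("wikimedia/wikipedia", "20231101.ko", split="train")

print(len(kobest_boolq), "BoolQ test examples")
print(len(wiki_ko), "Korean Wikipedia articles")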

Data Efficiency Revolution

Standard LLMs waste compute on redundant "lazy layers" deep in the network. Inheritune addresses this by inheriting the potent early layers and progressively expanding the model, achieving comparable performance with far fewer parameters and training tokens.

Key Achievement

5,732×

More Data Efficient than Qwen-2.5-1.5B

Gumini-1.5B (PPL 8.49) surpasses Qwen-2.5-1.5B (PPL 8.84) at the same scale with 5,732× less data.

Doing More With Less

Gumini demonstrates that smart architectural choices and curriculum learning can dramatically reduce data requirements.

  • Qwen-2.5-7B (Alibaba): 18T tokens, 5,732× Gumini's data
  • Qwen-2.5-1.5B (Alibaba): 18T tokens, 5,732×
  • Qwen-2.5-0.5B (Alibaba): 18T tokens, 5,732×
  • EXAONE-3.5-2.4B (LG AI): ~6.5T tokens, ~2,070×
  • Llama-3.2-3B (Meta): 9T tokens, 2,866×
  • Llama-3.2-1B (Meta): 9T tokens, 2,866×
  • Gemma-2B (Google): 2T tokens, 637×
  • BLOOM-1.1B (BigScience): 350B tokens, 111×
  • Polyglot-Ko-1.3B (EleutherAI): 213B tokens, 68×
  • Gumini-1.5B (ours): 3.14B tokens, baseline (1×)

Performance Comparison

Evaluated on Korean benchmarks.
Gumini outperforms larger models trained on significantly more data.

RANK MODEL PARAMS OVERALL PPL TOP-1 ACC TOP-5 ACC OVERALL SCORE ↑
#1 Qwen-2.5-7B 7.62B 6.39 58.8% 79.7% 0.8003
#2 Gemma-2B 2B 8.15 54.9% 76.5% 0.7759
#3 Gumini-1.5B 1.54B 8.49 53.6% 74.8% 0.7662
#4 Qwen-2.5-1.5B 1.5B 8.84 53.3% 74.6% 0.7639
#5 Llama-3.2-3B 3.21B 9.47 53.0% 74.6% 0.7671
#6 EXAONE-3.5-2.4B 2.4B 9.80 54.0% 76.1% 0.7766
#7 Gumini-1B 1.08B 11.19 49.5% 70.7% 0.6971
#8 Llama-3.2-1B 1.24B 12.14 49.4% 70.8% 0.6720
#9 Qwen-2.5-0.5B 0.5B 13.37 47.2% 68.5% 0.6240
#10 BLOOM-1.1B 1.1B 16.03 41.9% 64.6% 0.5365
#11 Polyglot-Ko-1.3B 1.3B 25.05 48.6% 69.1% 0.4889

Data Efficiency Comparison

Model Training Tokens Efficiency Multiplier (vs Gumini) Calculation
Qwen-2.5-7B 18T 5,732× 18,000B ÷ 3.14B
Qwen-2.5-1.5B 18T 5,732× 18,000B ÷ 3.14B
Qwen-2.5-0.5B 18T 5,732× 18,000B ÷ 3.14B
Llama-3.2-3B 9T 2,866× 9,000B ÷ 3.14B
Llama-3.2-1B 9T 2,866× 9,000B ÷ 3.14B
EXAONE-3.5-2.4B ~6.5T ~2,070× 6,500B ÷ 3.14B
Gemma-2B 2T 637× 2,000B ÷ 3.14B
BLOOM-1.1B 350B 111× 350B ÷ 3.14B
Polyglot-Ko-1.3B 213B 68× 213B ÷ 3.14B
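
Each multiplier in the table reduces to a single division; the short script below reproduces the Calculation column, with token counts taken from the table (EXAONE's ~6.5T is approximate).

# Reproduce the "Efficiency Multiplier" column: baseline training tokens
# divided by Gumini's 3.14B training tokens.
GUMINI_TOKENS = 3.14e9

baseline_tokens = {
    "Qwen-2.5-7B": 18e12,
    "Qwen-2.5-1.5B": 18e12,
    "Qwen-2.5-0.5B": 18e12,
    "Llama-3.2-3B": 9e12,
    "Llama-3.2-1B": 9e12,
    "EXAONE-3.5-2.4B": 6.5e12,  # approximate
    "Gemma-2B": 2e12,
    "BLOOM-1.1B": 350e9,
    "Polyglot-Ko-1.3B": 213e9,
}

for name, tokens in baseline_tokens.items():
    print(f"{name}: {tokens / GUMINI_TOKENS:,.0f}×")
# Qwen-2.5-*: 5,732×; Llama-3.2-*: 2,866×; Gemma-2B: 637×; Polyglot-Ko: 68×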

Training Method: Inheritune

"Less is More." Gumini uses a progressive training strategy where layers are added incrementally, ensuring maximum efficiency.

Core Philosophy

Standard LLMs have inefficient "lazy layers" in deeper networks. Inheritune initializes a compact model by inheriting potent early layers from a larger pre-trained model, then progressively retrains and expands it, achieving comparable or better performance with significantly fewer layers.

Gumini-1.5B Growth Schedule

  • Stage 0 (start): 10 layers
  • Stages 1-5: +1 layer per stage
  • Stage 6 (final): 16 layers (3.14B tokens total)
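
As a rough illustration of this growth schedule (not the actual training code), the sketch below inherits the first 10 layers from a larger reference stack, trains them, and then appends one layer per stage until the 16-layer target is reached. The reference stack, toy loss, and training loop are placeholders.

import copy
import torch
import torch.nn as nn

D_MODEL, N_HEADS = 256, 4
# Stands in for a larger pre-trained model's layer stack.
reference = nn.ModuleList(
    nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True)
    for _ in range(32)
)

START_LAYERS, FINAL_LAYERS = 10, 16

# Stage 0: inherit the first 10 layers from the reference model.
layers = copy.deepcopy(reference[:START_LAYERS])

def train_stage(layers: nn.ModuleList, steps: int = 10) -> None:
    """Placeholder training loop: one forward/backward pass per step."""
    params = [p for layer in layers for p in layer.parameters()]
    opt = torch.optim.AdamW(params, lr=1e-4)
    for _ in range(steps):
        x = torch.randn(2, 16, D_MODEL)  # dummy batch of hidden states
        for layer in layers:
            x = layer(x)
        loss = x.pow(2).mean()  # toy objective in place of language modeling
        opt.zero_grad()
        loss.backward()
        opt.step()

train_stage(layers)  # Stage 0

# Stages 1-6: add one layer per stage (10 -> 16 layers), retraining each time.
while len(layers) < FINAL_LAYERS:
    layers.append(copy.deepcopy(reference[len(layers)]))
    train_stage(layers)

print(f"Final depth: {len(layers)} layers")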

Benchmark Figures

  • Perplexity Comparison
  • Efficiency Curve (PPL vs. Params)
  • Performance Ranking

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model in bfloat16 and shard it across available devices.
model = AutoModelForCausalLM.from_pretrained(
    "GuminiResearch/Gumini-1.5B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("GuminiResearch/Gumini-1.5B-Base")

prompt = "저는 구미니입니다."  # "I am Gumini."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Nucleus sampling with a repetition penalty to keep generations varied.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    repetition_penalty=1.2,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Appendix: Model Sources

Training token counts and source references for benchmark models.

Model Tokens Source
Qwen-2.5 (7B / 1.5B / 0.5B) 18T arXiv
Llama-3.2 (3B / 1B) 9T HuggingFace
Gemma-2B 2T arXiv
EXAONE-3.5-2.4B ~6.5T arXiv
BLOOM-1.1B 350B HuggingFace
Polyglot-Ko-1.3B 213B HuggingFace

Evaluation Metrics

Perplexity (PPL)

Per-dataset:
$$ PPL_d = \exp(L_d) $$
Overall:
$$ PPL_{overall} = \exp\left(\frac{\sum_{d \in D} L_d \cdot T_d}{\sum_{d \in D} T_d}\right) $$

Top-k Accuracy

Per-dataset:
$$ Acc_d = \frac{C_d}{T_d} $$
Overall:
$$ Acc_{overall} = \frac{\sum_{d \in D} C_d}{\sum_{d \in D} T_d} $$
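
A small sketch of how these aggregates combine per-dataset statistics; the loss, token, and correct-token counts below are illustrative placeholders, not the reported numbers.

import math

# Per-dataset statistics: average cross-entropy loss L_d, token count T_d,
# and correctly predicted tokens C_d (illustrative values only).
datasets = {
    "kobest_boolq": {"L": 2.10, "T": 50_000, "C": 27_500},
    "wikipedia_ko": {"L": 2.15, "T": 950_000, "C": 510_000},
}

# Per-dataset PPL_d = exp(L_d) and Acc_d = C_d / T_d.
for name, d in datasets.items():
    print(name, round(math.exp(d["L"]), 2), round(d["C"] / d["T"], 3))

# Overall PPL: exponential of the token-weighted average loss.
total_tokens = sum(d["T"] for d in datasets.values())
ppl_overall = math.exp(
    sum(d["L"] * d["T"] for d in datasets.values()) / total_tokens
)

# Overall accuracy: pooled correct tokens over pooled total tokens.
acc_overall = sum(d["C"] for d in datasets.values()) / total_tokens

print(f"Overall PPL: {ppl_overall:.2f}, Overall Acc: {acc_overall:.3f}")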

Notation

Symbol Description
\( D \) Set of evaluation datasets
\( L_d \) Average cross-entropy loss on dataset \( d \)
\( T_d \) Total token count in dataset \( d \)
\( C_d \) Correctly predicted tokens in dataset \( d \)