Research Preview

Gumini (구미니)

Gumini outperforms Qwen-2.5-1.5B with 5,732× less data and surpasses the 2× larger Llama-3.2-3B with 2,866× less data.
The new standard for data-efficient Korean LLMs.

  • Training Tokens: 3.14B
  • Perplexity: 8.49
  • Overall Rank: #3

Evaluation Methodology

I evaluated model performance on two Korean benchmarks to verify robustness in Korean-language contexts.

  • KoBEST BoolQ (Korean standard): the standard Korean Boolean QA benchmark, test split.
  • Wikipedia KO (recent ko-wiki): the latest Korean Wikipedia dump, used for language modeling.
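
Both datasets can be pulled with the Hugging Face datasets library; the minimal sketch below assumes the public skt/kobest_v1 BoolQ config and a recent wikimedia/wikipedia Korean snapshot (the exact dump used for evaluation is not pinned here).

from datasets import load_dataset

# KoBEST BoolQ: Korean Boolean QA benchmark, evaluated on its test split.
kobest_boolq = load_dataset("skt/kobest_v1", "boolq", split="test")

# Korean Wikipedia dump used for language-modeling perplexity.
# The snapshot name is an assumption; substitute the dump actually used.
wiki_ko = load_dataset("wikimedia/wikipedia", "20231101.ko", split="train")

print(len(kobest_boolq), "BoolQ test examples")
print(len(wiki_ko), "Korean Wikipedia articles")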

Data Efficiency Revolution

Standard LLMs waste compute on redundant "lazy layers" deep in the network. Inheritune addresses this by inheriting the potent early layers and progressively expanding the model, achieving comparable performance with far fewer parameters and training tokens.

Key Achievement

5,732×

More Data Efficient than Qwen-2.5-1.5B

Gumini-1.5B (PPL 8.49) surpasses Qwen-2.5-1.5B (PPL 8.84) at the same scale with 5,732× less data.

Doing More With Less

Gumini demonstrates that smart architectural choices and curriculum learning can dramatically reduce data requirements.

  • Qwen-2.5-7B (Alibaba): 18T tokens, 5,732× Gumini's data
  • Qwen-2.5-1.5B (Alibaba): 18T tokens, 5,732×
  • Qwen-2.5-0.5B (Alibaba): 18T tokens, 5,732×
  • EXAONE-3.5-2.4B (LG AI): ~6.5T tokens, ~2,070×
  • Llama-3.2-3B (Meta): 9T tokens, 2,866×
  • Llama-3.2-1B (Meta): 9T tokens, 2,866×
  • Gemma-2B (Google): 2T tokens, 637×
  • BLOOM-1.1B (BigScience): 350B tokens, 111×
  • Polyglot-Ko-1.3B (EleutherAI): 213B tokens, 68×
  • Gumini-1.5B (ours): 3.14B tokens, baseline (1×)

Performance Comparison

Evaluated on Korean benchmarks.
Gumini outperforms larger models trained on significantly more data.

RANK MODEL PARAMS OVERALL PPL TOP-1 ACC TOP-5 ACC OVERALL SCORE ↑
#1 Qwen-2.5-7B 7.62B 6.39 58.8% 79.7% 0.8003
#2 Gemma-2B 2B 8.15 54.9% 76.5% 0.7759
#3 Gumini-1.5B 1.54B 8.49 53.6% 74.8% 0.7662
#4 Qwen-2.5-1.5B 1.5B 8.84 53.3% 74.6% 0.7639
#5 Llama-3.2-3B 3.21B 9.47 53.0% 74.6% 0.7671
#6 EXAONE-3.5-2.4B 2.4B 9.80 54.0% 76.1% 0.7766
#7 Gumini-1B 1.08B 11.19 49.5% 70.7% 0.6971
#8 Llama-3.2-1B 1.24B 12.14 49.4% 70.8% 0.6720
#9 Qwen-2.5-0.5B 0.5B 13.37 47.2% 68.5% 0.6240
#10 BLOOM-1.1B 1.1B 16.03 41.9% 64.6% 0.5365
#11 Polyglot-Ko-1.3B 1.3B 25.05 48.6% 69.1% 0.4889

Data Efficiency Comparison

Model Training Tokens Efficiency Multiplier (vs Gumini) Calculation
Qwen-2.5-7B 18T 5,732× 18,000B ÷ 3.14B
Qwen-2.5-1.5B 18T 5,732× 18,000B ÷ 3.14B
Qwen-2.5-0.5B 18T 5,732× 18,000B ÷ 3.14B
Llama-3.2-3B 9T 2,866× 9,000B ÷ 3.14B
Llama-3.2-1B 9T 2,866× 9,000B ÷ 3.14B
EXAONE-3.5-2.4B ~6.5T ~2,070× 6,500B ÷ 3.14B
Gemma-2B 2T 637× 2,000B ÷ 3.14B
BLOOM-1.1B 350B 111× 350B ÷ 3.14B
Polyglot-Ko-1.3B 213B 68× 213B ÷ 3.14B
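
Each multiplier in the table reduces to a single division; the short script below reproduces the Calculation column, with token counts taken from the table (EXAONE's ~6.5T is approximate).

# Reproduce the "Efficiency Multiplier" column: baseline training tokens
# divided by Gumini's 3.14B training tokens.
GUMINI_TOKENS = 3.14e9

baseline_tokens = {
    "Qwen-2.5-7B": 18e12,
    "Qwen-2.5-1.5B": 18e12,
    "Qwen-2.5-0.5B": 18e12,
    "Llama-3.2-3B": 9e12,
    "Llama-3.2-1B": 9e12,
    "EXAONE-3.5-2.4B": 6.5e12,  # approximate
    "Gemma-2B": 2e12,
    "BLOOM-1.1B": 350e9,
    "Polyglot-Ko-1.3B": 213e9,
}

for name, tokens in baseline_tokens.items():
    print(f"{name}: {tokens / GUMINI_TOKENS:,.0f}×")
# Qwen-2.5-*: 5,732×; Llama-3.2-*: 2,866×; Gemma-2B: 637×; Polyglot-Ko: 68×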

Training Method: Inheritune

"Less is More." Gumini uses a progressive training strategy where layers are added incrementally, ensuring maximum efficiency.

Core Philosophy

Standard LLMs have inefficient "lazy layers" in deeper networks. Inheritune initializes a compact model by inheriting potent early layers from a larger pre-trained model, then progressively retrains and expands it, achieving comparable or better performance with significantly fewer layers.

Gumini-1.5B Growth Schedule

  • Stage 0 (start): 10 layers
  • Stages 1-5: +1 layer per stage
  • Stage 6 (final): 16 layers (3.14B tokens total)
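
As a rough illustration of this growth schedule (not the actual training code), the sketch below inherits the first 10 layers from a larger reference stack, trains them, and then appends one layer per stage until the 16-layer target is reached. The reference stack, toy loss, and training loop are placeholders.

import copy
import torch
import torch.nn as nn

D_MODEL, N_HEADS = 256, 4
# Stands in for a larger pre-trained model's layer stack.
reference = nn.ModuleList(
    nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True)
    for _ in range(32)
)

START_LAYERS, FINAL_LAYERS = 10, 16

# Stage 0: inherit the first 10 layers from the reference model.
layers = copy.deepcopy(reference[:START_LAYERS])

def train_stage(layers: nn.ModuleList, steps: int = 10) -> None:
    """Placeholder training loop: one forward/backward pass per step."""
    params = [p for layer in layers for p in layer.parameters()]
    opt = torch.optim.AdamW(params, lr=1e-4)
    for _ in range(steps):
        x = torch.randn(2, 16, D_MODEL)  # dummy batch of hidden states
        for layer in layers:
            x = layer(x)
        loss = x.pow(2).mean()  # toy objective in place of language modeling
        opt.zero_grad()
        loss.backward()
        opt.step()

train_stage(layers)  # Stage 0

# Stages 1-6: add one layer per stage (10 -> 16 layers), retraining each time.
while len(layers) < FINAL_LAYERS:
    layers.append(copy.deepcopy(reference[len(layers)]))
    train_stage(layers)

print(f"Final depth: {len(layers)} layers")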

Benchmark Figures

  • Perplexity Comparison
  • Efficiency Curve (PPL vs. Params)
  • Performance Ranking

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model in bfloat16 and shard it across available devices.
model = AutoModelForCausalLM.from_pretrained(
    "GuminiResearch/Gumini-1.5B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("GuminiResearch/Gumini-1.5B-Base")

prompt = "저는 구미니입니다."  # "I am Gumini."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Nucleus sampling with a repetition penalty to keep generations varied.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    repetition_penalty=1.2,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Appendix: Model Sources

Training token counts and source references for benchmark models.

Model Tokens Source
Qwen-2.5 (7B / 1.5B / 0.5B) 18T arXiv
Llama-3.2 (3B / 1B) 9T HuggingFace
Gemma-2B 2T arXiv
EXAONE-3.5-2.4B ~6.5T arXiv
BLOOM-1.1B 350B HuggingFace
Polyglot-Ko-1.3B 213B HuggingFace

Evaluation Metrics

Perplexity (PPL)

Per-dataset:
$$ PPL_d = \exp(L_d) $$
Overall:
$$ PPL_{overall} = \exp\left(\frac{\sum_{d \in D} L_d \cdot T_d}{\sum_{d \in D} T_d}\right) $$

Top-k Accuracy

Per-dataset:
$$ Acc_d = \frac{C_d}{T_d} $$
Overall:
$$ Acc_{overall} = \frac{\sum_{d \in D} C_d}{\sum_{d \in D} T_d} $$
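
A small sketch of how these aggregates combine per-dataset statistics; the loss, token, and correct-token counts below are illustrative placeholders, not the reported numbers.

import math

# Per-dataset statistics: average cross-entropy loss L_d, token count T_d,
# and correctly predicted tokens C_d (illustrative values only).
datasets = {
    "kobest_boolq": {"L": 2.10, "T": 50_000, "C": 27_500},
    "wikipedia_ko": {"L": 2.15, "T": 950_000, "C": 510_000},
}

# Per-dataset PPL_d = exp(L_d) and Acc_d = C_d / T_d.
for name, d in datasets.items():
    print(name, round(math.exp(d["L"]), 2), round(d["C"] / d["T"], 3))

# Overall PPL: exponential of the token-weighted average loss.
total_tokens = sum(d["T"] for d in datasets.values())
ppl_overall = math.exp(
    sum(d["L"] * d["T"] for d in datasets.values()) / total_tokens
)

# Overall accuracy: pooled correct tokens over pooled total tokens.
acc_overall = sum(d["C"] for d in datasets.values()) / total_tokens

print(f"Overall PPL: {ppl_overall:.2f}, Overall Acc: {acc_overall:.3f}")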

Notation

Symbol Description
\( D \) Set of evaluation datasets
\( L_d \) Average cross-entropy loss on dataset \( d \)
\( T_d \) Total token count in dataset \( d \)
\( C_d \) Correctly predicted tokens in dataset \( d \)