SpecViT

A systematic comparison of deep learning architectures for stellar surface gravity estimation

What attention learns from stellar spectra, where transformers help, and where hybrids still win

R² on DESI (50k): 0.711
MAE (dex): 0.196
σ on APOGEE: 0.098

Key Findings

🎯

Near-Optimal at Low SNR

SpecViT comes within 0.02 R² of the Fisher Information / CRLB theoretical ceiling at SNR ≈ 4.6. The model extracts nearly all the information that is physically present in the data.

Clean Priors > Big Data

BOSZ-only training achieves σ = 0.098 dex on APOGEE; DESI-360k training reaches only σ = 0.175, i.e. 78% more scatter. Clean synthetic spectra provide more useful physical priors than massive noisy catalogs.

🔄

Cross-Survey Transfer

A single SpecViT model works across optical (DESI, 710–885 nm) and near-infrared (APOGEE, 1.5–1.7 μm) surveys. One architecture, multiple instruments.

The Challenge

Every star's light encodes its physical properties — gravity, temperature, composition. But extracting a single parameter (like surface gravity log g) from a noisy 4096-pixel spectrum is non-trivial, especially for faint targets at magnitude >20.

Key Insight: Not all wavelengths matter equally. Attention mechanisms can learn which spectral features are physically informative — the calcium triplet, magnesium lines, and iron features that carry the strongest gravity signal.

Interactive: SNR Degradation Explorer

Drag the slider to see how the same stellar spectrum degrades as the star gets fainter.

Method: Architecture Walkthrough

SpecViT applies Vision Transformer architecture to 1D stellar spectra:

Spectrum (4096 px) → 256 Patches → Token Embed → 6-Layer ViT → log g
SpecViT Architecture: Spectrum → Conv1D Patches → Transformer Encoder → log g prediction
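The patchify-and-embed step can be sketched in a few lines of NumPy. This is a shape-level illustration only: the projection and positional embeddings below are random stand-ins for learned weights, not the trained SpecViT parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spectrum: 4096 flux values (dimensions follow the text; weights are random stand-ins).
spectrum = rng.standard_normal(4096)

PATCH, EMBED = 16, 384                      # patch size 16 -> 256 patches, embed_dim 384
patches = spectrum.reshape(-1, PATCH)       # (256, 16): non-overlapping 1D patches

# Linear projection, equivalent to a Conv1d with kernel = stride = 16.
W = rng.standard_normal((PATCH, EMBED)) * 0.02
pos = rng.standard_normal((patches.shape[0], EMBED)) * 0.02  # positional embeddings

tokens = patches @ W + pos                  # (256, 384): token sequence for the ViT encoder
print(tokens.shape)                         # (256, 384)
```

The resulting 256-token sequence is what the 6-layer transformer encoder consumes.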

Interactive: Tokenization Visualizer

Hover over patches to see how the spectrum is divided. Patch colors show attention weight — brighter means higher attention.

Key Equations

Patch embedding (patch size P = 16, embed dim d = 384; x_i is the i-th patch, p_i its positional embedding):

z_i = W_E x_i + p_i,  x_i ∈ ℝ^16,  z_i, p_i ∈ ℝ^384

Self-attention:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Training loss (Huber, δ = 0.5, residual r = ŷ − log g):

L_δ(r) = ½r² if |r| ≤ δ,  else δ(|r| − ½δ)
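Both the attention operation and the Huber training loss are compact enough to sketch numerically. This is an illustrative NumPy version, not the model's optimized implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def huber(residual, delta=0.5):
    """Huber loss: quadratic for |r| <= delta, linear beyond (delta = 0.5 per the text)."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

# Small residuals are penalized quadratically, outliers only linearly:
print(huber(np.array([0.1, 0.5, 2.0])))   # [0.005 0.125 0.875]
```

The linear tail of the Huber loss is what keeps catastrophic log g outliers from dominating the gradient at low SNR.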

Attention Map Explorer

Explore what the model learns to attend to. The deepest layers concentrate attention on the Ca II infrared triplet (λ8498, 8542, 8662 Å) — the strongest surface gravity indicator in this wavelength range.

Figure (static fallback): averaged attention weights across the 6 transformer layers, showing peaks at the Ca II triplet wavelengths.

Per-Head Attention Heatmap

Each attention head learns to focus on different spectral features. Head 1 specializes in the Ca II infrared triplet, while Head 5 captures positional patterns across the full wavelength range. Hover for details.

Results Dashboard

Explore SpecViT's performance across multiple axes. Click tabs to switch views.

Transfer Learning Story

How do synthetic spectra inform real-world prediction? We compare three strategies:

BOSZ Synthetic → DESI Real → APOGEE Test

Strategy                  Training Data   σ (dex)   R²      MAE
BOSZ-only (Clean Prior)   50k BOSZ        0.098     0.956   0.071
DESI Direct               360k DESI       0.175     0.889   0.128
Two-Stage                 BOSZ + DESI     0.112     0.942   0.082

Key Insight: More data ≠ better. The clean synthetic prior (50k BOSZ, σ = 0.098) beats the massive noisy catalog (360k DESI, σ = 0.175, 78% more scatter). Quality of physics modeling trumps quantity of observations.

Spectrum Explorer

Browse 7 example DESI spectra spanning giants, subgiants, and dwarfs. Drag to pan, scroll to zoom.

Performance by Stellar Type

SpecViT maintains consistent low MAE across stellar evolutionary stages — giants, subgiants, and dwarfs — while LightGBM struggles most with evolved stars where spectral features are subtler.

Deep Dive

Fisher Information & CRLB

The Cramér-Rao Lower Bound (CRLB) defines the minimum achievable variance for any unbiased estimator. For a parameter θ estimated from data x with likelihood p(x|θ):

Var(θ̂) ≥ 1 / I(θ),  where  I(θ) = E[(∂ ln p(x|θ) / ∂θ)²]

We compute Fisher information from BOSZ synthetic spectra, where the noise model is known exactly. This gives us a theoretical performance ceiling against which we benchmark SpecViT.
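A toy example makes the benchmarking idea concrete. For N Gaussian samples with known noise σ, the Fisher information for the mean is I = N/σ², so the CRLB is σ²/N; the sketch below checks numerically that the sample mean attains it. This is an illustrative stand-in with a known-closed-form model, not the actual BOSZ-based computation:

```python
import numpy as np

# CRLB sanity check: estimate the mean of N Gaussian draws with known sigma.
# Fisher information I = N / sigma**2, so Var(theta_hat) >= sigma**2 / N.
rng = np.random.default_rng(1)
N, sigma, theta = 100, 0.3, 2.5

crlb = sigma**2 / N                                   # theoretical variance floor, 9e-4
estimates = rng.normal(theta, sigma, size=(5000, N)).mean(axis=1)

# The sample mean is efficient here: its variance sits at the bound (up to MC noise).
print(crlb, estimates.var())
```

For the spectra, the same logic applies with θ = log g and the pixel-wise noise model in place of the simple Gaussian, which is why a known-noise synthetic set like BOSZ is needed to evaluate I(θ).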

Training Procedure

SpecViT-S (Small) configuration:

  • 6 Transformer layers, 6 attention heads, embed_dim = 384
  • Patch size = 16, num_patches = 256
  • AdamW optimizer, lr = 3e-4, cosine annealing
  • Huber loss (δ = 0.5), batch size = 256
  • Training: ~50 epochs on 1× A100 GPU (~30 min)
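The cosine-annealed learning rate in the configuration above is a simple closed-form schedule. A minimal sketch (lr_max = 3e-4 from the text; lr_min = 0 and no warmup are assumptions):

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=0.0):
    """Cosine annealing: lr decays from lr_max to lr_min over total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

# Starts at 3e-4, halves at mid-training, decays to ~0 by the final epoch:
print(cosine_lr(0, 50), cosine_lr(25, 50), cosine_lr(50, 50))
```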
Computational Cost

Model              Params   Train Time   Inference   R²
Ridge Regression   4K       <1 min       <1 ms       0.507
LightGBM           ~100K    ~2 min       <1 ms       0.647
SpecViT-S          7.4M     ~30 min      ~2 ms       0.731