Where standard PyTorch dies, and where NoNans holds.

Each row is a workload chosen because it fails reliably on standard PyTorch. The point of the benchmark is not speed. It is whether the run completes at all.

| Workload | Standard PyTorch | PyTorch + NoNans | Outcome |
| --- | --- | --- | --- |
| FP8 training, 7B transformer (8× H100, batch 64, lr 3e-4) | Diverges at step ~840 | Completes 50K steps | Impossible → routine |
| Long-context attention (sequence 256K, vanilla softmax) | NaN at step 1 | Stable through 10K steps | Crash → continuation |
| Aggressive learning rate (lr 5e-3 on 13B, no warmup) | Gradient explosion in epoch 2 | Converges, 32% faster | Failed → 1.32× speedup |
| RLHF / GRPO post-training (7B base, KL penalty 0.05) | 3 of 5 runs collapse | 5 of 5 complete | 60% → 100% completion |
| Mixed-precision pretraining (70B params, BF16, FSDP) | 1.4 rollbacks/week on average | 0 rollbacks over a 4-week observation window | ~$31K/week recovered |

Same layer. Different surface.

The continuity layer applies equally to inference workloads where numerical singularities cause production crashes. Long context and aggressive batching are the two regimes most commonly affected.

| Workload | Standard runtime | Runtime + NoNans | Outcome |
| --- | --- | --- | --- |
| 1M-token inference (Llama 3.1 70B, vanilla attention) | Softmax overflow at 256K | Stable through 1.0M tokens | 4× context unlock |
| Large-batch inference (batch 128, FP8 weights) | OOM / NaN at batch 64 | Batch 128 stable | 2× throughput |
| Custom kernel ops (user-written CUDA extension) | CUDA error 700 | Resolved in-kernel | Zero downtime |
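
The overflow mode behind the first row is easy to reproduce in isolation. A minimal, self-contained illustration (not part of the benchmark image) of a vanilla softmax overflowing in half precision, contrasted with PyTorch's max-subtracted softmax:

```python
import torch

# A naive softmax exponentiates raw attention logits. In FP16, exp(12) already
# exceeds the dtype maximum (~65504), so numerator and denominator both become
# inf and the division yields NaN. torch.softmax subtracts the row max first,
# so the same logits stay finite.
logits = torch.full((1, 8), 12.0, dtype=torch.float16)   # large attention scores

naive = torch.exp(logits) / torch.exp(logits).sum(dim=-1, keepdim=True)
stable = torch.softmax(logits, dim=-1)

print(naive)   # all NaN
print(stable)  # uniform 0.125
```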

Run the benchmark on your own hardware.

The public benchmark Docker image bundles every workload above with fixed seeds and reference outputs. Pull it, run it, compare. The detection layer is open-source. The resolution core runs under a 30-day trial token issued at first run.

```bash
# 1. Pull the public benchmark image
$ docker pull ghcr.io/nonans/bench:v1.0.4

# 2. Run the standard PyTorch baseline (no resolution layer)
$ docker run --gpus all ghcr.io/nonans/bench:v1.0.4 baseline

# 3. Run the same workload with NoNans active (trial token auto-issued)
$ docker run --gpus all ghcr.io/nonans/bench:v1.0.4 nonans

# 4. Compare
$ docker run ghcr.io/nonans/bench:v1.0.4 compare

# Reports written to ./out/ as JSON + Markdown.
# Trial tokens last 30 days, are GPU-UUID bound, and are read-only beyond that window.
```
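
The report schema isn't documented on this page, so a schema-agnostic way to see what the compare step wrote to ./out/ is to print the top-level keys of each JSON report; a minimal sketch:

```python
import json
import pathlib

# Load every JSON report in ./out/ and list its top-level keys.
# No field names are assumed beyond the reports being JSON documents.
for report in sorted(pathlib.Path("out").glob("*.json")):
    data = json.loads(report.read_text())
    keys = sorted(data) if isinstance(data, dict) else type(data).__name__
    print(f"{report.name}: {keys}")
```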

Every assumption, on the table.

No hidden tricks. Identical seeds, identical hardware, identical model code. The only variable is whether the resolution layer is active.

| Assumption | Configuration | Details |
| --- | --- | --- |
| Hardware | H100 SXM5, 80 GB | 8-GPU node for <13B workloads, 16-GPU node for 70B+. CUDA 12.4, driver 550.x. Reproducible on A100 with the seeds noted in the bench output. |
| Software | PyTorch 2.5, FlashAttention 2.7 | Stock PyTorch with the standard optimizer family (AdamW, Lion). FSDP and DeepSpeed configurations are published in the repo. |
| Seeds | Fixed across runs | Every workload uses a fixed seed for the data pipeline, model init, and dropout. Random-seed runs ship a separate set of results in the appendix. |
| Comparisons | Identical model code | The model class is the same Python file in both runs. The only difference is the single line that wraps the model (sketch below). Diffs are available in the repo. |
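
To make the single-line difference concrete, a hypothetical sketch of what the wrapped run could look like. The `nonans.wrap` name and signature are illustrative only; the real one-line diff is the one published in the repo.

```python
import torch
import torch.nn as nn
# import nonans                                  # hypothetical module name

class SharedModel(nn.Module):                    # stand-in for the identical model class
    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(1024, 1024)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

model = SharedModel()
# model = nonans.wrap(model)                     # hypothetical: the only line that differs between runs
```
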
Want the full replication kit?
The public Docker image runs the headline scenarios above. Larger workloads (70B+, frontier-scale RLHF, multi-node FSDP) require an MNDA replication kit because they include configuration material we don't publish openly. We send the kit within 24 hours of MNDA execution.
Request MNDA kit →

What technical reviewers ask first.

If your question isn't here, reach engineering directly. We answer technical due-diligence questions in writing within 48 hours.

What is the per-step overhead when no singularity occurs?
Less than 0.3% on average across the workloads above, measured with NVIDIA Nsight on the hot path. The detection branch is in-kernel; the resolution path only fires when an actual singularity is detected, so the amortized cost across non-failing steps is dominated by detection alone.
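
Reviewers who want to sanity-check that figure without Nsight can time identical steps with CUDA events; a sketch, assuming the wrapped model comes from the repo's single-line diff and the objective is a placeholder:

```python
import torch

def median_step_ms(model: torch.nn.Module, batch: torch.Tensor,
                   steps: int = 200, warmup: int = 20) -> float:
    """Median per-step wall time in milliseconds, measured with CUDA events."""
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    times = []
    for i in range(steps + warmup):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        loss = model(batch).float().pow(2).mean()   # placeholder objective
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)
        end.record()
        torch.cuda.synchronize()
        if i >= warmup:
            times.append(start.elapsed_time(end))
    times.sort()
    return times[len(times) // 2]

# Overhead = median_step_ms(wrapped_model, batch) / median_step_ms(baseline_model, batch) - 1
```
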
Does the resolved tensor preserve optimizer state coherence?
Yes. Optimizer state buffers remain numerically valid through the resolution: the first- and second-moment buffers for Adam variants, the momentum buffer for Lion. The benchmark output reports buffer-norm continuity across the resolved step. This is the property that makes the resolution a continuation rather than a soft restart.
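
The continuity check itself is easy to reproduce outside the benchmark; a sketch for Adam-family state, using the standard `exp_avg` / `exp_avg_sq` keys in `optimizer.state`:

```python
import math
import torch

def moment_buffer_norms(optimizer: torch.optim.Optimizer) -> dict[str, float]:
    """Global L2 norm of each Adam-family moment buffer across all parameters."""
    sq_sums: dict[str, float] = {}
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            for name in ("exp_avg", "exp_avg_sq"):        # standard Adam/AdamW state keys
                if name in state:
                    sq_sums[name] = sq_sums.get(name, 0.0) + state[name].norm().item() ** 2
    return {name: math.sqrt(total) for name, total in sq_sums.items()}

# Record the norms on the step before and after a resolved singularity; continuity means
# both ratios stay finite and close to 1 rather than resetting toward zero.
```
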
How does this differ from gradient clipping or loss scaling?
Gradient clipping bounds the gradient norm after the fact, and loss scaling is a multiplicative trick to keep low-precision (FP16/FP8) values in range. Both are in-tree PyTorch features, and both fail on the workloads above; the benchmark is run with them enabled. NoNans operates on a different layer of the problem: it acts at the kernel boundary at the moment a singularity arises, regardless of whether scaling or clipping was already applied upstream.
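
For reference, this is roughly what "run with them enabled" means in the baseline: a minimal loop with dynamic loss scaling and gradient clipping both active. The model and data are placeholders, not the benchmark workloads.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.amp.GradScaler("cuda")                                # dynamic loss scaling
batches = [torch.randn(32, 1024, device="cuda") for _ in range(10)]  # placeholder data

for batch in batches:
    with torch.autocast("cuda", dtype=torch.float16):                # reduced-precision forward/backward
        loss = model(batch).float().pow(2).mean()
    scaler.scale(loss).backward()
    scaler.unscale_(opt)                                             # so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) # post-hoc norm bound
    scaler.step(opt)
    scaler.update()
    opt.zero_grad(set_to_none=True)
```
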
Can you describe the resolution mechanism?
The mechanism is patent-pending and ships only inside the compiled binary. It is not described in public materials or even in the MNDA replication kit. Technical due diligence is conducted by reproducing behavior on real workloads, not by reading source. This is a deliberate posture and one we treat as non-negotiable.
What happens if the layer encounters a workload it cannot resolve?
The layer fails open: it surfaces a clear out-of-scope event in the dashboard and lets the standard PyTorch failure path proceed. We do not silently corrupt training. Out-of-scope events are reported transparently in the customer telemetry and are excluded from the rollback statistics on the landing page.
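
As a point of reference for the fail-open behavior (separate from the NoNans implementation, which is not public), the general pattern is: detect, record, and leave the tensor untouched so the standard failure path proceeds unchanged.

```python
import logging
import torch

log = logging.getLogger("overflow_guard")

def fail_open_check(name: str, tensor: torch.Tensor) -> torch.Tensor:
    """Record an out-of-scope event but never modify the value: downstream code
    sees exactly the tensor it would have seen without the guard."""
    if not torch.isfinite(tensor).all():
        log.warning("non-finite values in %s; standard failure path proceeds", name)
    return tensor
```
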
Why publish this benchmark openly?
Because the benchmark is the pitch. We don't have a sales motion that depends on slides; we have a sales motion that depends on infrastructure engineers running the Docker image and seeing the result on their own hardware. The numbers above are reproducible by anyone with H100 access. That's the entire point.