Where standard PyTorch dies, and where NoNans holds.

Each row is a workload chosen because it fails reliably on standard PyTorch. The point of the benchmark is not speed. It is whether the run completes at all.

| Workload | Standard PyTorch | PyTorch + NoNans | Outcome |
| --- | --- | --- | --- |
| FP8 training, 7B transformer (8× H100, batch 64, lr 3e-4) | Diverges at step ~840 | Completes 50K steps | Impossible → routine |
| Long-context attention (sequence 256K, vanilla softmax) | NaN at step 1 | Stable through 10K steps | Crash → continuation |
| Aggressive learning rate (lr 5e-3 on 13B, no warmup) | Gradient explosion in epoch 2 | Converges, 32% faster | Failed → 1.32× speedup |
| RLHF / GRPO post-training (7B base, KL penalty 0.05) | 3 of 5 runs collapse | 5 of 5 complete | 60% → 100% completion |
| Mixed-precision pretraining (70B params, BF16, FSDP) | 1.4 rollbacks/week on average | 0 rollbacks over a 4-week observation window | ~$31K/week recovered |

Same layer. Different surface.

The continuity layer applies equally to inference workloads where numerical singularities cause production crashes. Long context and aggressive batching are the two regimes most commonly affected.

| Workload | Standard runtime | Runtime + NoNans | Outcome |
| --- | --- | --- | --- |
| 1M-token inference (Llama 3.1 70B, vanilla attention) | Softmax overflow at 256K | Stable through 1.0M tokens | 4× context unlock |
| Large-batch inference (batch 128, FP8 weights) | OOM / NaN at batch 64 | Batch 128 stable | 2× throughput |
| Custom kernel ops (user-written CUDA extension) | CUDA error 700 | Resolved in-kernel | Zero downtime |
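
The overflow mode behind the first row is easy to reproduce in isolation. A minimal, self-contained illustration (not part of the benchmark image) of a vanilla softmax overflowing in half precision, contrasted with PyTorch's max-subtracted softmax:

```python
import torch

# A naive softmax exponentiates raw attention logits. In FP16, exp(12) already
# exceeds the dtype maximum (~65504), so numerator and denominator both become
# inf and the division yields NaN. torch.softmax subtracts the row max first,
# so the same logits stay finite.
logits = torch.full((1, 8), 12.0, dtype=torch.float16)   # large attention scores

naive = torch.exp(logits) / torch.exp(logits).sum(dim=-1, keepdim=True)
stable = torch.softmax(logits, dim=-1)

print(naive)   # all NaN
print(stable)  # uniform 0.125
```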

Run the benchmark on your own hardware.

The public benchmark Docker image bundles every workload above with fixed seeds and reference outputs. Pull it, run it, compare. The detection layer is open-source. The resolution core runs under a 30-day trial token issued at first run.

```bash
# 1. Pull the public benchmark image
$ docker pull ghcr.io/nonans/bench:v1.0.4

# 2. Run the standard PyTorch baseline (no resolution layer)
$ docker run --gpus all ghcr.io/nonans/bench:v1.0.4 baseline

# 3. Run the same workload with NoNans active (trial token auto-issued)
$ docker run --gpus all ghcr.io/nonans/bench:v1.0.4 nonans

# 4. Compare
$ docker run ghcr.io/nonans/bench:v1.0.4 compare

# Reports written to ./out/ as JSON + Markdown.
# Trial tokens last 30 days, are GPU-UUID bound, and are read-only beyond that window.
```
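
The report schema isn't documented on this page, so a schema-agnostic way to see what the compare step wrote to ./out/ is to print the top-level keys of each JSON report; a minimal sketch:

```python
import json
import pathlib

# Load every JSON report in ./out/ and list its top-level keys.
# No field names are assumed beyond the reports being JSON documents.
for report in sorted(pathlib.Path("out").glob("*.json")):
    data = json.loads(report.read_text())
    keys = sorted(data) if isinstance(data, dict) else type(data).__name__
    print(f"{report.name}: {keys}")
```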

Every assumption, on the table.

No hidden tricks. Identical seeds, identical hardware, identical model code. The only variable is whether the resolution layer is active.

| Assumption | Configuration | Details |
| --- | --- | --- |
| Hardware | H100 SXM5, 80 GB | 8-GPU node for <13B workloads, 16-GPU node for 70B+. CUDA 12.4, driver 550.x. Reproducible on A100 with the seeds noted in the bench output. |
| Software | PyTorch 2.5, FlashAttention 2.7 | Stock PyTorch with the standard optimizer family (AdamW, Lion). FSDP and DeepSpeed configurations are published in the repo. |
| Seeds | Fixed across runs | Every workload uses a fixed seed for the data pipeline, model init, and dropout. Random-seed runs ship a separate set of results in the appendix. |
| Comparisons | Identical model code | The model class is the same Python file in both runs. The only difference is the single line that wraps the model (sketch below). Diffs are available in the repo. |
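
To make the single-line difference concrete, a hypothetical sketch of what the wrapped run could look like. The `nonans.wrap` name and signature are illustrative only; the real one-line diff is the one published in the repo.

```python
import torch
import torch.nn as nn
# import nonans                                  # hypothetical module name

class SharedModel(nn.Module):                    # stand-in for the identical model class
    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(1024, 1024)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

model = SharedModel()
# model = nonans.wrap(model)                     # hypothetical: the only line that differs between runs
```
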
Want the full replication kit?
The public Docker image runs the headline scenarios above. Larger workloads (70B+, frontier-scale RLHF, multi-node FSDP) require an MNDA replication kit because they include configuration material we don't publish openly. We send the kit within 24 hours of MNDA execution.
Request MNDA kit →

What technical reviewers ask first.

If your question isn't here, reach engineering directly. We answer technical due-diligence questions in writing within 48 hours.

What is the per-step overhead when no singularity occurs?
Less than 0.3% on average across the workloads above, measured with NVIDIA Nsight on the hot path. The detection branch is in-kernel; the resolution path only fires when an actual singularity is detected, so the amortized cost across non-failing steps is dominated by detection alone.
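
Reviewers who want to sanity-check that figure without Nsight can time identical steps with CUDA events; a sketch, assuming the wrapped model comes from the repo's single-line diff and the objective is a placeholder:

```python
import torch

def median_step_ms(model: torch.nn.Module, batch: torch.Tensor,
                   steps: int = 200, warmup: int = 20) -> float:
    """Median per-step wall time in milliseconds, measured with CUDA events."""
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    times = []
    for i in range(steps + warmup):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        loss = model(batch).float().pow(2).mean()   # placeholder objective
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)
        end.record()
        torch.cuda.synchronize()
        if i >= warmup:
            times.append(start.elapsed_time(end))
    times.sort()
    return times[len(times) // 2]

# Overhead = median_step_ms(wrapped_model, batch) / median_step_ms(baseline_model, batch) - 1
```
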
Does the resolved tensor preserve optimizer state coherence?
Yes. Optimizer state buffers remain numerically valid through the resolution: the first- and second-moment buffers for Adam variants, the momentum buffer for Lion. The benchmark output reports buffer-norm continuity across the resolved step. This is the property that makes the resolution a continuation rather than a soft restart.
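
The continuity check itself is easy to reproduce outside the benchmark; a sketch for Adam-family state, using the standard `exp_avg` / `exp_avg_sq` keys in `optimizer.state`:

```python
import math
import torch

def moment_buffer_norms(optimizer: torch.optim.Optimizer) -> dict[str, float]:
    """Global L2 norm of each Adam-family moment buffer across all parameters."""
    sq_sums: dict[str, float] = {}
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            for name in ("exp_avg", "exp_avg_sq"):        # standard Adam/AdamW state keys
                if name in state:
                    sq_sums[name] = sq_sums.get(name, 0.0) + state[name].norm().item() ** 2
    return {name: math.sqrt(total) for name, total in sq_sums.items()}

# Record the norms on the step before and after a resolved singularity; continuity means
# both ratios stay finite and close to 1 rather than resetting toward zero.
```
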
How does this differ from gradient clipping or loss scaling?
Gradient clipping bounds the gradient norm after the fact, and loss scaling is a multiplicative trick to keep low-precision (FP16/FP8) values in range. Both are in-tree PyTorch features, and both fail on the workloads above; the benchmark is run with them enabled. NoNans operates on a different layer of the problem: it acts at the kernel boundary at the moment a singularity arises, regardless of whether scaling or clipping was already applied upstream.
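
For reference, this is roughly what "run with them enabled" means in the baseline: a minimal loop with dynamic loss scaling and gradient clipping both active. The model and data are placeholders, not the benchmark workloads.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.amp.GradScaler("cuda")                                # dynamic loss scaling
batches = [torch.randn(32, 1024, device="cuda") for _ in range(10)]  # placeholder data

for batch in batches:
    with torch.autocast("cuda", dtype=torch.float16):                # reduced-precision forward/backward
        loss = model(batch).float().pow(2).mean()
    scaler.scale(loss).backward()
    scaler.unscale_(opt)                                             # so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) # post-hoc norm bound
    scaler.step(opt)
    scaler.update()
    opt.zero_grad(set_to_none=True)
```
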
Can you describe the resolution mechanism?
The mechanism is patent-pending and ships only inside the compiled binary. It is not described in public materials or even in the MNDA replication kit. Technical due diligence is conducted by reproducing behavior on real workloads, not by reading source. This is a deliberate posture and one we treat as non-negotiable.
What happens if the layer encounters a workload it cannot resolve?
The layer fails open: it surfaces a clear out-of-scope event in the dashboard and lets the standard PyTorch failure path proceed. We do not silently corrupt training. Out-of-scope events are reported transparently in the customer telemetry and are excluded from the rollback statistics on the landing page.
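
As a point of reference for the fail-open behavior (separate from the NoNans implementation, which is not public), the general pattern is: detect, record, and leave the tensor untouched so the standard failure path proceeds unchanged.

```python
import logging
import torch

log = logging.getLogger("overflow_guard")

def fail_open_check(name: str, tensor: torch.Tensor) -> torch.Tensor:
    """Record an out-of-scope event but never modify the value: downstream code
    sees exactly the tensor it would have seen without the guard."""
    if not torch.isfinite(tensor).all():
        log.warning("non-finite values in %s; standard failure path proceeds", name)
    return tensor
```
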
Why publish this benchmark openly?
Because the benchmark is the pitch. We don't have a sales motion that depends on slides; we have a sales motion that depends on infrastructure engineers running the Docker image and seeing the result on their own hardware. The numbers above are reproducible by anyone with H100 access. That's the entire point.