How to reproduce
Run the benchmark on your own hardware.
The public benchmark Docker image bundles every workload above with fixed seeds and reference outputs. Pull it, run it, compare. The detection layer is open-source. The resolution core runs under a 30-day trial token issued at first run.
$ docker pull ghcr.io/nonans/bench:v1.0.4
$ docker run --gpus all ghcr.io/nonans/bench:v1.0.4 baseline
$ docker run --gpus all ghcr.io/nonans/bench:v1.0.4 nonans
$ docker run ghcr.io/nonans/bench:v1.0.4 compare
Methodology
Every assumption, on the table.
No hidden tricks. Identical seeds, identical hardware, identical model code. The only variable is whether the resolution layer is active.
Hardware
H100 SXM5, 80GB
8-GPU node for <13B workloads, 16-GPU node for 70B+. CUDA 12.4, driver 550.x. Reproducible on A100 with seeds noted in the bench output.
Software
PyTorch 2.5, FlashAttention 2.7
Stock PyTorch with the standard optimizer family (AdamW, Lion). FSDP and DeepSpeed configurations published in the repo.
Seeds
Fixed across runs
Every workload uses a fixed seed for the data pipeline, model init, and dropout. Runs with randomized seeds are reported as a separate result set in the appendix.
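What "fixed seeds" means mechanically can be sketched in plain Python; this is an illustrative analogue only, assuming the usual pattern of seeding every RNG source up front (the actual harness would also seed PyTorch's generators via `torch.manual_seed` and its CUDA counterpart):

```python
import random
import numpy as np

def seed_everything(seed: int) -> None:
    # Fix every host-side source of randomness up front. A real training
    # harness would additionally call torch.manual_seed(seed) and
    # torch.cuda.manual_seed_all(seed) for device-side generators.
    random.seed(seed)
    np.random.seed(seed)

seed_everything(1234)
a = np.random.randn(4)
seed_everything(1234)
b = np.random.randn(4)
assert np.array_equal(a, b)  # identical draws across seeded runs
```

The point of the fixed seed is exactly this property: two runs that differ only in the wrapper line produce bitwise-identical data order and initialization.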
Comparisons
Identical model code
The model class is the same Python file in both runs. The only difference is the single line that wraps the model. Diffs available in the repo.
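As a sketch of what "a single wrapping line" means in practice; the `nonans.wrap` name and the `TransformerLM` class below are illustrative placeholders, not the shipped API (consult the diffs in the repo for the real call):

```diff
 model = TransformerLM(config)
+model = nonans.wrap(model)  # the single added line (wrapper name hypothetical)
 optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```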
FAQ
What technical reviewers ask first.
If your question isn't here, reach engineering directly. We answer technical due-diligence questions in writing within 48 hours.
What is the per-step overhead when no singularity occurs?
Less than 0.3% on average across the workloads above, measured with NVIDIA Nsight on the hot path. The detection branch is in-kernel; the resolution path only fires when an actual singularity is detected, so the amortized cost across non-failing steps is dominated by detection alone.
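The detection kernel itself is proprietary, but the property it checks is the standard one: every value in the tensor must be finite. A host-side numpy analogue, for illustration only; the real check runs fused in-kernel, which is why the overhead stays sub-percent:

```python
import numpy as np

def has_singularity(grad: np.ndarray) -> bool:
    # A "singularity" in the NaN/Inf sense: any non-finite entry.
    # isfinite is False for both NaN and +/-Inf, so one reduction covers both.
    return not np.isfinite(grad).all()

g = np.array([0.1, -2.5, 3.0])
assert not has_singularity(g)     # healthy gradient: detection branch only
g[1] = np.inf                     # an overflow typically appears as Inf first...
assert has_singularity(g)
g[1] = np.nan                     # ...then propagates to NaN via 0*Inf or Inf-Inf
assert has_singularity(g)
```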
Does the resolved tensor preserve optimizer state coherence?
Yes. Momentum and second-moment buffers (Adam variants, Lion, etc.) remain numerically valid through the resolution. The benchmark output reports buffer-norm continuity across the resolved step. This is the property that makes the resolution a continuation rather than a soft restart.
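What "buffer-norm continuity" means can be sketched outside the product with a toy Adam moment update in numpy (illustrative only; the real figures come from the benchmark output): a valid continuation must leave the moment buffers finite, with norms that move only as much as an ordinary step would move them, rather than zeroed as a soft restart would.

```python
import numpy as np

def adam_moments(m, v, grad, beta1=0.9, beta2=0.999):
    # One Adam moment update: exponential moving averages of the
    # gradient and squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    return m, v

m = np.full(4, 0.5)
v = np.full(4, 0.25)
m2, v2 = adam_moments(m, v, grad=np.full(4, 0.1))
# Continuity check: finite buffers whose norms moved only slightly.
assert np.isfinite(m2).all() and np.isfinite(v2).all()
assert abs(np.linalg.norm(m2) - np.linalg.norm(m)) < 0.2 * np.linalg.norm(m)
```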
How does this differ from gradient clipping or loss scaling?
Gradient clipping bounds the norm post-hoc, and loss scaling is a multiplicative trick to keep BF16/FP8 in range. Both are in-tree PyTorch features, and both fail on the workloads above; the benchmark runs with them enabled. NoNans operates at a different layer of the problem: it acts at the kernel boundary at the moment a singularity arises, regardless of whether scaling or clipping was already applied upstream.
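A concrete way to see the layering: clip-by-global-norm rescales the gradient by max_norm / ‖g‖ after the fact, so a NaN already present in the gradient survives the clip unchanged, because NaN times anything is NaN. A numpy sketch of the standard clipping rule (the same rescale rule as `torch.nn.utils.clip_grad_norm_`):

```python
import numpy as np

def clip_by_global_norm(grad, max_norm=1.0, eps=1e-6):
    # Post-hoc rescale: multiply by max_norm / total_norm only when
    # the global norm exceeds max_norm.
    total = np.linalg.norm(grad)
    scale = max_norm / (total + eps)
    return grad * min(1.0, scale)

g = np.array([3.0, 4.0])            # norm 5 -> rescaled to norm ~1
assert abs(np.linalg.norm(clip_by_global_norm(g)) - 1.0) < 1e-3

g_bad = np.array([3.0, np.nan])     # a single NaN makes the norm NaN...
clipped = clip_by_global_norm(g_bad)
assert np.isnan(clipped).any()      # ...so clipping cannot repair it
```

Clipping bounds magnitude; it has no mechanism for repairing a non-finite value that already exists, which is the case this benchmark exercises.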
Can you describe the resolution mechanism?
The mechanism is patent-pending and ships only inside the compiled binary. It is not described in public materials or even in the MNDA replication kit. Technical due diligence is conducted by reproducing behavior on real workloads, not by reading source. This is a deliberate posture and one we treat as non-negotiable.
What happens if the layer encounters a workload it cannot resolve?
The layer fails open: it surfaces a clear out-of-scope event in the dashboard and lets the standard PyTorch failure path proceed. We do not silently corrupt training. Out-of-scope events are reported transparently in the customer telemetry and are excluded from the rollback statistics on the landing page.
Why publish this benchmark openly?
Because the benchmark is the pitch. We don't have a sales motion that depends on slides; we have a sales motion that depends on infrastructure engineers running the Docker image and seeing the result on their own hardware. The numbers above are reproducible by anyone with H100 access. That's the entire point.