v1.0.4 · Patent pending · 64-test public suite

The numerical continuity layer for GPU computing.

When a CUDA kernel produces a numerical singularity, NoNans intercepts the event at the kernel boundary, resolves it inside our framework, and returns a finite, optimizer-coherent tensor to the GPU. Training continues at the next step. No rollback. No checkpoint reload. No code change.

8–18%
Compute recovered
<0.3%
Step overhead
0
Rollbacks since v1.0
training_loop.py
# 1. Install
$ pip install nonans

# 2. Wrap your model. That’s the integration.
import nonans
import torch

model = MyModel().cuda()
model = nonans.wrap(model, mode='auto')
optimizer = torch.optim.AdamW(model.parameters())

# 3. Train as usual.
# Detection runs every step.
# Resolution fires only on singularity events.
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()
The problem

Frontier training's largest controllable cost is rework.

IEEE 754 assigns no finite value to a class of operations that occur regularly in modern training: division by zero in attention denominators, softmax overflow at long context, gradient accumulation past the representable range. When one fires, the standard practice is to roll back to the last checkpoint, discarding hours of compute. NoNans operates one layer below: at the kernel boundary, where the singularity originates, before propagation begins.
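All three failure modes reproduce in a few lines of stock fp16 PyTorch. Illustrative only; none of this is NoNans code:

import torch

# Naive softmax overflow: exp() exceeds the fp16 maximum (65504), so the
# normalization divides inf by inf and every score becomes NaN.
scores = torch.full((4,), 12.0, dtype=torch.float16)  # exp(12) ≈ 162755 > 65504
e = torch.exp(scores)                                 # tensor([inf, inf, inf, inf])
print(e / e.sum())                                    # tensor([nan, nan, nan, nan])

# Division by zero in an attention denominator: IEEE 754 defines 0/0 as NaN.
print(torch.tensor(0.0) / torch.tensor(0.0))          # tensor(nan)

# Accumulation past the representable range saturates to inf.
g = torch.tensor(65504.0, dtype=torch.float16)        # fp16 max
print(g + g)                                          # tensor(inf, dtype=torch.float16)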

01 — Detection too late
Gradient clipping reacts after the cascade
By the time loss-scaling and clipping see a NaN, it has already propagated through optimizer state, momentum buffers, and weight tensors. The run is structurally compromised; clipping just delays the recognition.
Lost: 4–14 epochs per event
02 — Rollback is the default
Checkpoint recovery wastes compute by design
Watchdog tools like Mosaic AutoResume and DeepSpeed's checkpoint cycler restore the last valid state. The compute between checkpoint and failure is gone. For 70B+ runs, a single rollback typically destroys $5K–$40K of GPU time.
Lost: $5K–$40K per rollback
03 — Frequency scales nonlinearly
Singularity rate grows with model size
As parameter counts move from 70B to 700B, gradient singularity events increase nonlinearly. The market for stability infrastructure is growing faster than the GPU market itself, and existing tools at the framework level cannot intercept fast enough.
Curve: superlinear past 100B
Architecture

A transparent intercept at the kernel boundary.

The layer sits between PyTorch (or any framework that ultimately calls CUDA) and your GPU. On the normal path, it does nothing measurable. When a kernel produces a numerical singularity, NoNans intercepts the event, hands it to the resolution runtime, receives a finite tensor back, and returns it to the GPU. Three components, two boundaries, one new behavior at the moment of failure.
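As a mental model (every name below is an assumption, not our API; the real hooks live in the open-source detector), the intercept reduces to a finiteness check on the normal path and a resolution call on the failure path:

import torch

class RuntimeUnreachable(Exception):
    """Hypothetical: raised when the resolution runtime cannot be reached."""

def resolve(event: dict) -> torch.Tensor:
    """Stand-in for the proprietary resolution runtime, which is not public."""
    raise RuntimeUnreachable

def on_kernel_output(tensor: torch.Tensor, ctx: dict) -> torch.Tensor:
    # Normal path: a finiteness check is the only per-operation cost.
    if torch.isfinite(tensor).all():
        return tensor
    # Failure path: package a SingularityEvent and ask the runtime to resolve it.
    event = {"fingerprint": tensor.data_ptr(), **ctx}
    try:
        return resolve(event)   # finite, optimizer-coherent replacement
    except RuntimeUnreachable:
        # Fail open: return the tensor untouched so PyTorch's native error
        # path takes over. Never corrupt training to mask an outage.
        return tensor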

Layer 1
Training framework — PyTorch · JAX · vLLM · Megatron · DeepSpeed · FSDP
Your existing code. No modifications. The wrap call is the entire integration surface.
unchanged
single import · nonans.wrap(model)
Layer 2
Detection layer — open source, MIT, in-process
Lightweight kernel hooks observe every tensor operation. When a non-finite value would emerge, a structured SingularityEvent is constructed with the tensor fingerprint, layer name, operator, and step (a sketch of the record follows the layer diagram). Detection cost is the only overhead on the normal path.
<0.3% per step
if singularity detected · binary IPC
Layer 3
Resolution runtime — proprietary, hosted at runtime.nonans.com
Receives the SingularityEvent over TLS, resolves it inside our framework, returns a finite, optimizer-coherent tensor reference. Three deployment modes: hosted (default), local sidecar (Pro), or in-process binary (Enterprise on-prem). The mechanism is patent pending.
5–15ms / event
resolved tensor returned
Layer 4
CUDA / GPU hardware
Receives the resolved tensor. The kernel that produced the singularity now has a finite value to continue with. The next step proceeds. Optimizer state norm is preserved through the resolution.
normal path
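For orientation, one plausible shape of the SingularityEvent record Layer 2 emits. Field names are assumptions drawn from the description above; the authoritative schema is the MIT detector source on GitHub:

from dataclasses import dataclass

@dataclass(frozen=True)
class SingularityEvent:
    # Illustrative only: fields follow the Layer 2 description, not the real schema.
    tensor_fingerprint: str   # stable hash identifying the offending tensor
    layer_name: str           # e.g. "transformer.h.41.attn" (hypothetical name)
    operator: str             # kernel/op that produced the non-finite value
    step: int                 # global training step at detection time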
Detection is open and inspectable. The full source of Layer 2 is on GitHub under MIT. ML engineers can read every hook, every classifier, every event field. Trust comes from inspection, not assertion.
Resolution is a service, not a binary on your machine. The mechanism never ships to customer infrastructure by default. Source-code escrow available for Enterprise contracts that require it.
Failure is explicit, not silent. If the runtime is unreachable, NoNans falls back to PyTorch's native error path. We never corrupt training to mask an outage.
Composes with everything. FSDP, DeepSpeed, Megatron-LM, torch.compile, AMP, gradient clipping. All coexist. NoNans operates at a layer below the framework's view of execution.
Integration depths

Three depths. Same core. Different reach.

The detection layer is identical across tiers. The resolution mode determines deployment topology, latency profile, and pricing. Choose the depth that matches your trust model, network constraints, and contract scale.

L1
Hosted runtime — TLS to runtime.nonans.com
Default. The detection layer ships in-process; the resolution runtime is hosted on our infrastructure. Trial token issued automatically on first use, valid 30 days. Tensor data never leaves your VPC unless you explicitly opt in to remote tensor mirroring.
Free → $0.50/GPU-hr
Distribution funnel
View on GitHub
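L2
Local sidecar — Unix socket, inside your VPC
Pro. The resolution runtime runs as a sidecar process in your cluster and communicates with the detection layer over a Unix socket. Lower latency than the hosted path; events never cross your network boundary. Usage-based with no seat fees.
$0.50/GPU-hr
Usage-based · Pro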
L3
In-process — Enterprise on-premise
Compiled binary linked into the training process directly. Universal coverage across CUDA calls. SLA-backed uptime, custom kernel extensions, source-code escrow under defined trigger events, MNDA + IP indemnification.
$250K–$2M ARR
Strategic / Enterprise
Talk to engineering
What it unlocks

Not a rescue layer. A capability layer.

Removing numerical singularities as a constraint changes which training regimes are viable in production. Each of the four below is a regime that frontier teams currently avoid or carefully babysit. With the layer underneath, they become routine.

Lower precision

FP8 training, stable in production

FP8 is roughly twice as fast as BF16 on H100 and Blackwell. Almost nobody runs it for full training because the loss-scaling dance is fragile and one bad batch ends the run. With NoNans in place, FP8 stops being a research demo and becomes the default precision.

≈ 1.8× throughput per GPU vs BF16
Aggressive learning rates

2–3× higher LR without explosion

High learning rates cause gradient explosions that end in NaNs. Frontier teams babysit warmup schedules to avoid them. With a stable layer below, the warmup loosens, the schedule simplifies, and convergence accelerates measurably.

Time to convergence cut 30–40%
Long context

Million-token attention without collapse

Standard softmax attention denominators collapse at sequence lengths the industry races toward. NoNans holds at million-token context where attention scores would otherwise produce singularities, both in training and inference.

Stable through 1M+ tokens
RL post-training

RLHF / GRPO / DPO without the chaos

Reinforcement learning is the most numerically unstable training regime in modern ML. KL explosions, reward variance, gradient norms swinging across orders of magnitude in a single step. Our layer holds where stock PyTorch loses 40% of runs.

100% completion vs 60% baseline
Use cases

One layer. Six revenue surfaces.

The same primitive deployed across distinct buyer pains. Each surface has its own customer profile, pricing logic, and sales cycle. The training and inference surfaces are the wedge; the others compound as the deployment base grows.

#
Surface
What it sells
Pricing model
01
LLM pre-training stability
For: ML infra leads · AI labs · Frontier teams
Frontier-scale training without checkpoint rollbacks. The flagship surface. Per-event dollar magnitude scales with model size, from $5K on 7B to $500K on 700B+.
$0.50/GPU-hr
or 8% of spend
02
Long-context inference
For: Inference platforms · Enterprise serving
Softmax overflow, attention denominator collapse, NaN at sequence > 128K. The pain that vLLM and SGLang teams know intimately. Faster sales cycle than training.
Per inference-hr
$0.25–$0.40
03
RL post-training stability
For: Post-training teams · Alignment teams
The most unstable regime in modern ML. Weekly pain, smaller experiments, ML engineers with discretionary budget. The fastest wedge into frontier labs.
Project-based
$5K–$25K/run
04
Singularity telemetry corpus
For: Hardware vendors · Framework core teams
Anonymized event corpus across deployments: classifiers, layer positions, gradient distributions. NVIDIA, PyTorch core, framework teams pay for the feed; nobody else can replicate it.
Advisory feed
$50K–$200K/yr
05
Underwritten uptime SLA
For: Enterprise procurement · CFO-led buys
Contractual no-rollback guarantee on production training. Pricing as a percentage of compute. Reinsurance-backed for tail risk on frontier-scale runs.
3–5% of compute
$150K–$500K/run
06
Cloud / framework partnerships
For: Together AI · Lambda · CoreWeave · Modular
License the kernel-boundary layer to GPU clouds and compiler vendors as a stability differentiator they sell to their customers. Per-GPU-hour passthrough revenue, no direct sales effort.
30–50%
passthrough share
Measured behavior

Numbers from production deployments.

Measured on H100 SXM5 clusters running 70B+ parameter training runs. Methodology and replication kit available under MNDA. The public benchmark for the headline scenarios is reproducible by anyone with H100 access.

Compute recovered
8–18%
Of compute spend recovered per month, depending on training regime and failure baseline. Measured across early deployments.
Median 12% · validated under MNDA
Per-step overhead
<0.3%
In-kernel intercept on the hot path. Detection runs every step. Resolution fires only on actual singularity events.
Sub-microsecond detection
Optimizer coherence
99.9%+
Post-resolution optimizer-state integrity. Momentum and second-moment buffers preserved through resolution.
Across 100B+ param runs
Rollbacks required
0
Across all in-scope production deployments since v1.0. Out-of-scope events surfaced transparently in dashboards.
Reproducible per benchmark
Defensibility

What makes this hard to replicate.

The mechanism is patent pending and lives only on our servers (or, for Enterprise on-prem, inside compiled binaries under contract). The system around it compounds with every deployment.

01 — IP
Patent-pending system claim
In-kernel singularity resolution with optimizer-state coherence. Provisional filed 2026. Claim shape covers the system, not the underlying mathematics. Source code never ships to customer infrastructure by default.
02 — Data
Production telemetry corpus
Every deployment generates labeled singularity events: classifier, layer, operator, gradient state, optimizer config. The corpus compounds with each customer and can be replicated only by reaching the production scale we already have.
03 — Timing
Singularity rate scales nonlinearly
As parameter counts grow from 70B to 700B, gradient singularity events compound nonlinearly. The market for a numerical-continuity layer is growing faster than the GPU market itself. The category will exist; the question is who owns it.
Pricing

Aligned with the value. You pay when we save the run.

Free tier covers experimentation up to 8 GPUs. Pro is usage-based with no seat fees, scaling with your cluster. Enterprise is contracted, SLA-backed, and supports on-premise and source-escrow terms.

Detect
Free / always
Open-source detection layer plus 30-day resolution trial. MIT license on the detector.
  • Open-source detection layer (MIT)
  • 30-day resolution trial
  • Three-line PyTorch / JAX integration
  • Up to 8 GPUs, single host
  • Public benchmark replication kit
  • Community support
View on GitHub
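Pro
Usage-based / $0.50 per GPU-hr
No seat fees. Local sidecar resolution inside your VPC, scaling with your cluster.
  • Everything in Detect
  • Local sidecar runtime (Unix socket; nothing leaves your VPC)
  • No seat fees, usage-based billing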
Enterprise
Custom / annual
From $250K ARR. On-prem, SLA-backed, dedicated infra engineer. Source-code escrow available.
  • Everything in Pro
  • In-process integration (Enterprise on-prem)
  • SLA-backed uptime guarantees
  • On-premise deployment
  • Custom kernel extensions
  • MNDA + IP indemnification
  • Source-code escrow (defined triggers)
Talk to engineering
Common questions

What ML infra leads ask first.

Does the resolution path send my tensor data over the public internet?
No, not by default. The hosted runtime receives a SingularityEvent record (event metadata plus a tensor fingerprint) over TLS. The actual tensor data stays on your hardware. For deployments that need lower latency or full data isolation, the sidecar runtime runs locally inside your cluster and communicates over a Unix socket; nothing leaves your VPC. Enterprise on-prem ships an in-process binary linked into the training job, so the resolution path is purely intra-process.
How does this differ from gradient clipping or loss scaling?
Both are framework-level techniques applied after a tensor is computed. Clipping bounds the gradient norm post-hoc; loss scaling rescales the loss to keep BF16/FP8 in range. Both fail on the workloads in our public benchmark, which we run with them enabled. NoNans operates one layer below: at the kernel boundary, at the moment a singularity arises, regardless of whether scaling or clipping was already applied upstream.
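The post-hoc limitation is visible in stock PyTorch. Illustrative only:

import torch

# Clamping bounds values that already exist; IEEE 754 comparisons with NaN
# are false, so a NaN passes straight through unchanged.
g = torch.tensor([float('nan'), 2.0])
print(torch.clamp(g, -1.0, 1.0))    # tensor([nan, 1.])

# Gradient-norm clipping has the same blind spot: one NaN gradient makes the
# total norm NaN, and rescaling by it spreads NaN to every parameter.
lin = torch.nn.Linear(2, 1)
lin.weight.grad = torch.tensor([[float('nan'), 1.0]])
lin.bias.grad = torch.tensor([0.5])
torch.nn.utils.clip_grad_norm_(lin.parameters(), max_norm=1.0)
print(lin.bias.grad)                # tensor([nan])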
Can you describe the resolution mechanism in detail?
No. The mechanism is patent pending and lives only on our infrastructure. Technical due diligence is conducted by reproducing behavior on real workloads, not by reading source. The public benchmark Docker image runs the same code that produces the numbers on this site; the MNDA replication kit covers larger workloads. We treat this posture as non-negotiable.
What happens if the runtime is unreachable mid-training?
NoNans fails open. Detection continues to run in-process and emit telemetry; the wrapped model executes exactly as it would without NoNans. We never silently corrupt training to mask an outage. Customer dashboards surface every out-of-scope event explicitly.
How do I integrate with FSDP, DeepSpeed, or Megatron-LM?
Wrap before or after the distributed wrapper. Both work. The detection layer is implemented as a transparent passthrough that does not interfere with parameter sharding, gradient accumulation, or distributed collectives. See the FSDP and DeepSpeed examples in the public repository.
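A minimal sketch of both orders, using the nonans.wrap call from the quickstart and stock PyTorch FSDP (distributed process-group setup elided):

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
import nonans

model = MyModel().cuda()   # your module, as in the quickstart above

# Order 1: shard first, then wrap the sharded module.
model = nonans.wrap(FSDP(model), mode='auto')

# Order 2: wrap first, then shard the wrapped module.
# model = FSDP(nonans.wrap(model, mode='auto'))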
What's included in the public GitHub repository?
The full detection layer (MIT license), the public client (TLS to runtime.nonans.com plus sidecar and in-process discovery), the benchmark harness with eight reproducible workloads, integration examples for PyTorch / FSDP / vLLM, and a 64-test suite that runs on every PR. The resolution runtime is not in the repository. It lives on our infrastructure.
Built by

The founder.

ahlem makhebi
Founder · Doctoral researcher · Computer scientist

Doctoral researcher and computer scientist operating at the confluence of advanced mathematics, artificial intelligence, and systems design. Author of the resolution framework that powers NoNans, validated on production neural networks and large language models.

Pre-seed open · talking to design partners

If you run training at scale, we should talk.

Open to ML infrastructure leads, technical founders, and pre-seed investors who understand that compute waste is the largest controllable cost in frontier AI.

Contact engineering · Reproduce the benchmark
Patent pending · v1.0.4 · MNDA replication kit available · infra@nonans.com