Paper validation and evidence¶
Lab-side validation for Position Encoding in Detection-Based LiDAR–Camera Matching: A Diagnostic Study at Infrastructure Sites (accepted at IEEE Sensors Letters): how to reproduce the public A9 Table I column, verify UTC reference numbers when institutional data are available, and capture the Thor latency numbers. Frozen evaluation summaries and how they map to supplemental tables live in the evidence index.
The pairwise baseline calibrefine is ported from
CalibRefine (Cheng et al., IEEE
TIM 2026, arXiv:2502.17648); citation entries for both papers
are in the repo README.
1. Paper validation¶
pixi run validate-a9 # public A9 r02_s01 release
# UTC reference verification; requires local institutional HDF5 caches.
pixi run validate-utc4
pixi run validate-utc3
The A9 command is the public reproducibility path for the release. UTC4/UTC3 commands verify the
same Table I reference numbers only for partners who already hold compatible institutional HDF5
caches and weights; those raw UTC artefacts are not distributed in this repository. All commands
evaluate the held-out test split used by the paper experiments. The A9 r02_s01 caches ship inside
this repo at datasets/a9_dataset_r02_s01/hdf5_cache/a9_r02_s01_{train,val,test}.h5 (and on the public HF
dataset repo). The training pipeline reserves the _val.h5 split for early-stopping /
best-checkpoint selection, so _test.h5 is the only file in each cache that is never seen during
training.
A9 note (multi-camera frames). A9 r02_s01 frames carry detections from two cameras
(s110_camera_basler_south1_8mm, s110_camera_basler_south2_8mm); the loader returns every
referenced camera image plus a per-detection source-camera array, and the cropping path picks each
2-D box's own camera (prepare_frame(..., images=, camera_per_det=)) exactly like the paper-side
consistent_loader. The A9 configs also use bbox_expansion: 1.25 (UTC uses 1.05), mirroring the
per-site lab dataset configs.
By default the script evaluates the 5 paper-claimed methods only (calibrefine, crlite,
crlite_2dpe, crlite_vit_exp1, crlite_vit_exp3). To also include the delivery-only
crlite_vit_exp4 or to customize the H5 path / write a JSON report:
pixi run python scripts/paper/validate_paper.py \
--site utc4 \
--hdf5 datasets/utc_dataset4/hdf5_cache/utc_test.h5 \
--models crlite crlite_2dpe crlite_vit_exp1 crlite_vit_exp3 crlite_vit_exp4 calibrefine \
--output docs/evidence/standalone_paper_validation_utc4.json
--output writes a JSON file containing site, hdf5 (absolute path of the H5 actually used),
device, and a results dict keyed by model name — each value has top1, top3, mrr,
latency_ms_mean, latency_ms_p50, latency_ms_p95, latency_ms_p99, latency_ms_std,
latency_ms_min, latency_ms_max, throughput_fps_mean, n_iters, n_frames, and
wall_seconds. Parent directories are created automatically.
1.1 UTC reference numbers (RTX 5090, utc_test.h5, 50 frames per site)¶
Canonical reference numbers captured in docs/evidence/standalone_paper_validation_utc{3,4}.json on
2026-05-16 with the retrained calibrefine UTC4 weights. They are frozen for manuscript
reconciliation; only partners with local UTC data should expect to rerun the same commands. Latency
is end-to-end model forward (no I/O); fps is
throughput_fps_mean from the JSON. Top-1 / Top-3 / MRR are stable within ±0.3 pp across runs (CUDA
non-determinism); latency drifts ±20 % depending on GPU thermal state, which is why we also report
p50 / p95 / p99 in the JSON.
UTC4
| Model | Top-1 | Top-3 | MRR | mean lat | p95 | fps | Paper Table I |
|---|---|---|---|---|---|---|---|
crlite_2dpe |
99.75 | 100.00 | 0.999 | 7.6 ms | 10.3 | 132.1 | ≈97.9 (PE abl.) |
crlite |
99.02 | 99.75 | 0.994 | 10.9 ms | 17.2 | 91.7 | 99.0 |
crlite_vit_exp3 |
97.54 | 99.26 | 0.984 | 8.3 ms | 13.0 | 119.9 | 98.0 |
crlite_vit_exp1 |
96.81 | 99.26 | 0.980 | 7.8 ms | 13.6 | 127.7 | 96.8 |
calibrefine |
97.54 | 100.00 | 0.988 | 1093.8 ms | 1689.4 | 0.9 | 97.5 (note 1) |
UTC3
| Model | Top-1 | Top-3 | MRR | mean lat | p95 | fps | Paper Table I |
|---|---|---|---|---|---|---|---|
crlite_2dpe |
97.84 | 100.00 | 0.989 | 9.6 ms | 14.3 | 103.8 | n/a |
crlite |
96.46 | 99.41 | 0.979 | 11.5 ms | 22.6 | 86.7 | 94.6 |
crlite_vit_exp1 |
92.53 | 98.43 | 0.956 | 5.6 ms | 7.6 | 178.5 | 90.5 |
calibrefine |
92.14 | 99.02 | 0.956 | 1196.2 ms | 1892.0 | 0.8 | 91.9 |
crlite_vit_exp3 |
91.75 | 99.61 | 0.956 | 5.2 ms | 7.3 | 191.6 | 89.9 |
1.1a Public A9 numbers (RTX 5090, a9_r02_s01_test.h5, 29 frames, captured 2026-06-11)¶
Canonical public A9 numbers in docs/evidence/standalone_paper_validation_a9_dataset_r02_s01.json from
pixi run validate-a9. "Paper Table I" is the manuscript's per-frame Top-1, while the standalone
accumulator micro-averages over detections (same definition gap that explains the UTC3 offsets
above), so agreement within ~2 pp is expected and observed.
| Model | Top-1 | Top-3 | MRR | mean lat | fps | Paper Table I |
|---|---|---|---|---|---|---|
calibrefine |
98.07 | 99.68 | 0.989 | 1956.8 ms | 0.5 | 98.4 |
crlite_2dpe |
97.43 | 100.00 | 0.987 | 25.4 ms | 39.3 | 97.5 (PE abl.) |
crlite |
91.96 | 100.00 | 0.959 | 32.6 ms | 30.6 | 92.7 |
crlite_vit_exp1 |
90.35 | 97.43 | 0.942 | 13.3 ms | 75.1 | 92.4 |
crlite_vit_exp3 |
90.03 | 99.04 | 0.945 | 11.6 ms | 86.2 | 89.8 |
The paper's A9 ordering is preserved: pairwise > 2D-only > ResNet + 3D PE, and ViT (no PE) ≥ ViT + 3D PE. A9 latencies are higher than UTC because the A9 gantry frames contain more detections per frame (latency is per frame, not per pair).
1.2 ONNX accuracy validation (lossless-export check)¶
Section 1.1 reproduces the paper numbers with the PyTorch model. For the partner deployment we
need to confirm that the ONNX artifact they consume on Thor still computes the same matching
quality on the same utc_test.h5 split — not just per-tensor parity on a synthetic batch (which the
export-time check at max|torch − onnx| ≈ 1e-6 already covers).
pixi run validate-onnx-utc4
pixi run validate-onnx-utc3
Each task loads the graphs from onnx/<model>_<site>/ (per-site dir introduced 2026-05-17 — see
§2.1 for the silent-mismatch bug it fixes), drives them with onnxruntime (GPU on dev, CPU on CI /
Thor), and feeds them the same utc_test.h5 frames validate-utc{3,4} use. Output:
docs/evidence/standalone_onnx_validation_utc{3,4}.json— same schema as the PyTorch validation JSON (Top-1 / Top-3 / MRR / latency stats per model, plusproviders,hdf5, andonnx_dirsmetadata; the last field records which on-disk ONNX directory each row was loaded from, so a_utc3validator that silently picked up_utc4weights is immediately visible in the JSON).- A pretty-printed per-model row table on stdout.
Direct comparison against docs/evidence/standalone_paper_validation_utc{3,4}.json should show
agreement to within ±0.3 pp on Top-1 / Top-3 (FP32 vs FP32, sole sources of drift are ORT graph
fusions and CUDA non- determinism). Anything larger means the ONNX export silently changed behaviour
— the script is the regression-guard for that.
The script knows about all four control flows we ship — two-stage hybrid (crlite, crlite_2dpe),
single-graph cosine (crlite_vit_exp1, crlite_vit_exp3), GNN cosine (crlite_vit_exp4), and
pairwise (calibrefine) — and re-implements each in numpy + ORT calls. No PyTorch op runs on the
eval path. To override the provider, model set, or H5:
pixi run python scripts/paper/validate_onnx.py --site utc4 \
--provider cpu --models crlite calibrefine \
--output reports/onnx_smoke.json
CPU is fine for crlite / crlite_2dpe / ViT variants (~0.2 s/frame). calibrefine is pairwise so
it has to issue N×M ORT calls (up to 1024 per frame); CPU is OK for a smoke run but use
--provider cuda or auto for the full 50-frame split unless you don't mind a ~10 minute wall.
Reference numbers (ORT-CPU, FP32 graph, captured 2026-05-17 in
docs/evidence/standalone_onnx_validation_utc{3,4}.json):
| Model | UTC3 ONNX | UTC3 PyTorch | UTC4 ONNX | UTC4 PyTorch |
|---|---|---|---|---|
crlite_2dpe |
97.84 % | 97.84 % | 99.75 % | 99.75 % |
crlite |
96.46 % | 96.46 % | 99.02 % | 99.02 % |
crlite_vit_exp1 |
92.53 % | 92.53 % | 96.81 % | 96.81 % |
crlite_vit_exp3 |
91.75 % | 91.75 % | 97.54 % | 97.54 % |
calibrefine |
91.94 % | 92.14 % | 97.54 % | 97.54 % |
Every row matches PyTorch FP32 to the rounding floor on both sites (the single 0.20 pp calibrefine
UTC3 row is within CUDA non-determinism). The ONNX export is matching-quality-preserving on both the
deployed UTC3 and UTC4 weights — this is the headline claim cited in supplemental §S-7.
1.3 What this does not check¶
- TRT FP16 quantisation effect. Both §1.1 and §1.2 are FP32 evaluations (PyTorch FP32 and
ONNX-Runtime FP32 respectively). The shipped engine on Thor is FP16. We have not yet wired a Top-1
/ Top-3 / MRR check on top of the TRT FP16 engine itself, in part because doing it cleanly on Thor
needs
utc_test.h5to live there too. If the partner ever sees Top-1 drop more than ~1 pp on real data we should add that script (sketch:scripts/validate_trt.py, mirrorvalidate_onnx.pybut usetensorrt.Runtimeinstead ofonnxruntime.InferenceSession). - Throughput numbers in §1.1 / §1.2 are wall-time PyTorch / ORT numbers and should not be
cited as edge latency. The Thor numbers for the paper / supplemental come from §3
(
benchmark_thor.py, PyTorch eager) and §2.3(a) (bench-trt, TRT FP16) only.
2. ONNX export and TensorRT engines for Thor¶
2.1 Export ONNX¶
All ONNX / engine / report artifacts are site-tagged so the UTC3 and UTC4 weight sets cannot silently overwrite each other (an earlier issue: a UTC4-only export was unintentionally evaluated on UTC3 frames and produced 13–28 % Top-1 noise; per-site subdirs are the structural fix):
onnx/<model>_<site>/{stage1,stage2,model}.onnx
engines/<model>_<site>/{stage1,stage2,model}.<precision>.engine
docs/evidence/standalone_onnx_validation_<site>.json
docs/evidence/thor_trt_latency_<site>.json
# Single model, default site (utc4):
pixi run export-onnx # crlite -> onnx/crlite_utc4/
# Both sites, all 6 models in one go (cleanest entry point):
pixi run export-onnx-all # = utc4-all + utc3-all
# Per-site batch:
pixi run export-onnx-utc4-all # writes onnx/<model>_utc4/
pixi run export-onnx-utc3-all # writes onnx/<model>_utc3/
# Explicit override:
pixi run python scripts/paper/export_onnx.py \
--model crlite --site utc3 \
--weights checkpoints/crlite_utc3_best.pth \
--output onnx/crlite_utc3
The same invocation works on both the lab workstation and Thor itself — the pixi install step
picks the right torch wheel for the host platform, so the command line stays identical.
Back-compat: the resolver in validate_onnx.py / build_trt.py / bench_trt.py still falls
back to the legacy site-less layout (onnx/<model>/, engines/<model>/) only when
--site=utc4 is requested and the per-site dir doesn't exist. Any UTC3 lookup requires the per-site
dir, so the silent-mismatch bug cannot recur.
For crlite / crlite_2dpe this writes two graphs:
stage1.onnx— backbones + position encoding + cosine retrieval matrix.stage2.onnx— embed-fusion + similarity-head MLP for Top-K refinement.
For the ViT variants (crlite_vit_exp1, crlite_vit_exp3, crlite_vit_exp4) a single model.onnx
is written that directly returns the N×M cosine similarity matrix. calibrefine exports as one
pairwise graph indexed by batch B (one logit per (i, j) pair).
2.2 Build TensorRT engines¶
All graphs have dynamic axes on detection counts (N, M, K, B). The helper task auto-locates
trtexec (it ships at /usr/src/tensorrt/bin/trtexec on JetPack 7 / Thor and is not on $PATH
by default) and passes the right min/opt/max shape profile for each model:
# Both sites, paper-validated models only, FP16 (default precision):
pixi run build-trt-all # = utc4-all + utc3-all
# Per-site batch (writes engines/<model>_<site>/):
pixi run build-trt-utc4-all
pixi run build-trt-utc3-all
# Single model + site:
pixi run build-trt --model crlite --site utc4 --precision fp16
# Partner deliverable that also wants the ViT+GNN engine:
pixi run python scripts/paper/build_trt.py --all-incl-delivery --site utc4 \
--log-dir reports/trt_logs
# Override trtexec location explicitly if it lives somewhere else:
pixi run python scripts/paper/build_trt.py --model crlite --site utc4 \
--trtexec /opt/nvidia/tensorrt/bin/trtexec
Scope note (--all semantics): --all builds only the 5 paper-validated models (crlite,
crlite_2dpe, crlite_vit_exp1, crlite_vit_exp3, calibrefine). The 6th shipped model
crlite_vit_exp4 (ViT + Edge-Aware GAT, delivery-only) is not included because its
TorchScript-traced ONNX uses ScatterND with dynamic axes which TensorRT 10.x on Blackwell does not
yet support; empirically the build fails on Thor under JetPack 7 for both UTC3 and UTC4 weights,
with no impact on the paper. If you need the vit_exp4 engine for partner delivery, request it
explicitly with --all-incl-delivery (or --model crlite_vit_exp4); the failure mode is the same
regardless of site.
Engines land at engines/<model>_<site>/<stage>.<precision>.engine. trtexec's own latency summary
is printed at the end of each build; it's the canonical TensorRT-FP16 number we cite in the
supplemental.
INT8 calibration is left to the partner — pass extra trtexec args through after --, e.g.
pixi run build-trt --model crlite --precision fp16 -- \
--int8 --calib=/path/to/crops.cache --workspace=8192
If trtexec cannot be located the script prints an actionable error listing the directories it
searched.
2.3 Use the engine¶
pixi run build-trt-all writes engines/<model>_<site>/{stage1,stage2,model}.fp16.engine. There
are three ways to put those engines to work, depending on what you need.
(a) Canonical Thor-TRT-FP16 latency for the paper. Re-runs trtexec on every built engine with
the opt-profile shapes (N=8, M=12, B=80 for crlite stage2, B=96 for calibrefine pairs) and parses
the GPU Compute Time block out of stdout into a single JSON. This is what we cite in the
supplemental table.
pixi run bench-trt # = bench-trt-utc4 (default site)
pixi run bench-trt-utc4 # explicit; engines/<model>_utc4/
pixi run bench-trt-utc3 # explicit; engines/<model>_utc3/
pixi run bench-trt --model crlite # one model on utc4
pixi run bench-trt --iterations 1000 # tighter percentiles
pixi run bench-trt -- --useCudaGraph # forward extra flags to trtexec
Outputs:
docs/evidence/thor_trt_latency_<site>.json—{model: [{stage, precision, latency_ms_{mean,p50,p95,p99,min,max}, throughput_fps_mean}, ...]}. For--site utc4we also keep the legacy site-less aliasdocs/evidence/thor_trt_latency.jsonso the existing supplemental table doesn't have to be re-pathed.reports/trt_logs/<model>_<site>_<stage>.fp16.bench.log— full trtexec stdout per engine.
Reference Thor numbers (JetPack 7, FP16, N=8 / M=12 dets, warmup=50, iters=200; captured
2026-05-16/17 in docs/evidence/thor_trt_latency.json (UTC4) and docs/evidence/thor_trt_latency_utc3.json
(UTC3)):
| Model | Stage(s) | UTC3 mean | UTC3 p99 | UTC3 thr. | UTC4 mean | UTC4 p99 | UTC4 thr. |
|---|---|---|---|---|---|---|---|
crlite_2dpe |
stage1+stage2 | 0.645 | 0.72 | 1550 fps | 0.640 | 0.74 | 1565 fps |
crlite |
stage1+stage2 | 0.685 | 0.79 | 1459 fps | 0.700 | 0.81 | 1430 fps |
crlite_vit_exp1 |
model | 0.924 | 1.03 | 1082 fps | 0.925 | 1.03 | 1081 fps |
crlite_vit_exp3 |
model | 0.934 | 1.04 | 1071 fps | 0.926 | 1.02 | 1080 fps |
calibrefine |
model (B=96) | 334.7 | 345.0 | 2.99 fps | 342.6 | 370.9 | 2.92 fps |
The crlite two-stage column is stage1 + stage2; on either site stage 1 alone is ~0.64 ms
(~1530–1566 fps) and stage 2 alone is ~0.046 ms (~21 700–22 300 fps), so end-to-end is dominated by
stage 1. Both sites agree to within 2.3 % — engines share topology so latency is essentially
weight-independent. These are the canonical FP16 Thor numbers cited in supplemental §S-7 (Table
S-VII).
If you also want a sanity-check without building the engine first, trtexec can do it all in one shot:
trtexec --loadEngine=engines/crlite_utc4/stage1.fp16.engine \
--shapes=image_crops:8x3x32x32,lidar_crops:12x1024x3,img_centers:8x2,lid_centers:12x3 \
--iterations=200 --warmUp=50 --useSpinWait --noDataTransfers
(b) ONNX-Runtime smoke test (any host, no TRT needed). The exported ONNX graphs are validated
against PyTorch at export time (max|torch-onnx| ≈ 1e-6) and again at the matching-quality level by
§1.2 (pixi run validate-onnx-utc{3,4}, full Top-1 / Top-3 / MRR on utc_test.h5). If you just
want a programmatic smoke — for example to sanity check a checkpoint update on a partner box without
rebuilding the engine — call onnxruntime directly. It's already in the pixi environment.
import numpy as np, onnxruntime as ort
sess = ort.InferenceSession(
"onnx/crlite_utc4/stage1.onnx",
providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
out = sess.run(
None,
{
"image_crops": np.random.randn(8, 3, 32, 32).astype(np.float32),
"lidar_crops": np.random.randn(12, 1024, 3).astype(np.float32),
"img_centers": np.random.randn(8, 2).astype(np.float32),
"lid_centers": np.random.randn(12, 3).astype(np.float32),
},
)
img_embed, lid_embed = out # [8, 256], [12, 256]
similarity = (img_embed / np.linalg.norm(img_embed, axis=1, keepdims=True)) @ \
(lid_embed / np.linalg.norm(lid_embed, axis=1, keepdims=True)).T
For the cosine-only and pairwise models (crlite_vit_exp1, crlite_vit_exp3, crlite_vit_exp4,
calibrefine) there is just one graph (onnx/<model>_<site>/model.onnx) and the output is the
similarity matrix or per-pair logit directly.
(c) Production deployment via Python TensorRT. This is what the deployment pipeline actually consumes. We don't ship a TRT runtime (it is partner-side glue), but the integration is small — load the engine, allocate buffers, set dynamic shapes, run, copy back:
import numpy as np, tensorrt as trt
import pycuda.driver as cuda, pycuda.autoinit # or cudart-py
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("engines/crlite_utc4/stage1.fp16.engine", "rb") as f:
engine = runtime.deserialize_cuda_engine(f.read())
ctx = engine.create_execution_context()
# Set the actual N / M for this frame (must lie in min..max of the build profile)
ctx.set_input_shape("image_crops", (8, 3, 32, 32))
ctx.set_input_shape("lidar_crops", (12, 1024, 3))
ctx.set_input_shape("img_centers", (8, 2))
ctx.set_input_shape("lid_centers", (12, 3))
# ... allocate device buffers, copy inputs in, run ctx.execute_v3(stream),
# copy outputs out. See NVIDIA's `samples/python/onnx_resnet50` for the
# full buffer-management boilerplate; the contract here is identical, only
# the input names differ.
For crlite / crlite_2dpe you run stage1 to get the (N, D) / (M, D) embeddings, do an N×M
cosine + top-K on the host (milliseconds, not worth a kernel), and feed the K selected pairs into
stage2's (B, D) × (B, D) → (B, 1) engine. The pairwise wrapper in
xcalib.engine.wrappers.HybridTwoStageWrapper is the torch-side reference for that orchestration;
mirror its control flow 1-to-1 in C++/Python around the two engines.
3. Thor latency benchmark¶
Matching-quality validation (Top-1 / Top-3 / MRR) is done on the lab workstation, where we keep the full UTC HDF5 caches. The Jetson AGX Thor's role in the deliverable is exclusively to back up the real-time-feasibility claim — i.e. that these lightweight models can keep up with a 10 Hz fused-perception pipeline on the partner's iGPU.
pixi run benchmark # all 6 shipped models, warmup=20, iters=200
# writes reports/thor_latency.json + .png
pixi run benchmark-quick # warmup=5, iters=30 -- fast sanity run
# Single model with explicit weights:
pixi run python scripts/paper/benchmark_thor.py --model crlite_vit_exp4 \
--weights checkpoints/crlite_vit_exp4_utc4_best.pth
The script uses synthetic frames (random crops + point clouds at the right shapes), so no HDF5 cache
is required on the Thor. It reports per-model FP32 + FP16 mean / p50 / p95 / p99 / std / min / max /
fps across the timed iterations, and the matplotlib bar chart annotates each model with its FPS and
overlays the 100 ms (10 Hz) real-time budget line. The pairwise baseline calibrefine is shown
alongside the lightweight models so the figure tells the speed story end-to-end.
3.1 Reference numbers (RTX 5090, N=8 / M=12 dets, warmup=20, iters=200)¶
Captured 2026-05-16 in reports/thor_latency.json with the matching reports/thor_latency.png.
These are dev-box reference numbers — the partner reports the analogous Thor numbers from running
the same pixi run benchmark on the iGPU.
FP32 (all 6 shipped models)
| Model | mean | p50 | p95 | p99 | std | fps |
|---|---|---|---|---|---|---|
crlite_vit_exp3 |
3.0 ms | 2.9 | 3.8 | 4.5 | 0.5 | 331.4 |
crlite_vit_exp1 |
4.4 ms | 4.1 | 5.7 | 6.7 | 3.9 | 227.5 |
crlite_2dpe |
5.4 ms | 5.2 | 6.7 | 7.5 | 0.7 | 184.2 |
crlite |
5.5 ms | 5.4 | 6.5 | 7.5 | 0.6 | 183.1 |
crlite_vit_exp4 |
7.6 ms | 7.3 | 9.1 | 10.2 | 0.8 | 132.3 |
calibrefine |
547.5 ms | 534.9 | 601.0 | 718.5 | 44.7 | 1.8 |
FP16 (PyTorch model.half(), all 6 shipped models)
| Model | mean | p50 | p95 | p99 | std | fps |
|---|---|---|---|---|---|---|
crlite_vit_exp3 |
2.8 ms | 2.7 | 3.7 | 4.0 | 0.4 | 351.8 |
crlite_vit_exp1 |
3.2 ms | 3.0 | 4.3 | 4.8 | 0.7 | 310.4 |
crlite |
5.5 ms | 5.5 | 6.6 | 7.5 | 0.6 | 180.5 |
crlite_2dpe |
5.5 ms | 5.3 | 6.8 | 7.8 | 0.6 | 182.1 |
crlite_vit_exp4 |
7.5 ms | 7.3 | 9.0 | 10.0 | 0.8 | 133.5 |
calibrefine |
532.3 ms | 528.6 | 575.5 | 595.9 | 24.8 | 1.9 |
FP16 note. All six models run in PyTorch eager-mode FP16 after model.half(). An earlier
revision tripped Index put requires the source and destination dtypes match for crlite,
crlite_2dpe, and calibrefine because three pre-allocated buffers (final_sim in
CRLiteModel.forward_inference, and scores
img/lid_centerszero-pads inPairwiseWrapper, and the FPS distance buffer in_pointnet2_ops.farthest_point_sample) defaulted tofloat32while the model output wasfloat16. Those allocations now honour the model's parameter dtype, so the FP16 column is fully populated. TensorRT FP16 engines on Thor are still the recommended deployment path — they do better mixed-precision scheduling than eager-modemodel.half()— but the eager-mode FP16 numbers above are a useful sanity check on the iGPU beforetrtexecruns.
3.2 What to expect on Thor¶
Thor's Blackwell iGPU is roughly 0.4–0.6× the throughput of an RTX 5090 on these workloads, so
absolute FP32 latencies will be ~2× larger but the ~250× spread between the lightweight models
and the pairwise baseline carries over. TensorRT FP16 engines on Thor are expected to give another
2–3× over PyTorch FP32 inside this directory, putting every lightweight model comfortably below the
100 ms (10 Hz) real-time budget. The pairwise baseline (calibrefine) is reported for reference
only — it is the slow ceiling that motivates the lightweight design and is not the recommended
deployment model.