xcalib input protocol v1.0
This document is the data contract between the partner's perception stack and the xcalib package.
Its executable counterpart is xcalib.protocol.validate_frame_inputs, which every Matcher.match()
call runs by default (validate="warn").
Versioning: the protocol version (xcalib.protocol.PROTOCOL_VERSION) is bumped on any breaking
change to this contract and noted in the release notes. Additive relaxations do not bump the
version.
1. Per-frame inputs
matcher.match(image, point_cloud, bboxes_2d, bboxes_3d) consumes one time-synchronized frame:
| Input | Shape / dtype | Contract |
|---|---|---|
| image | [H, W, 3] uint8 | RGB channel order; OpenCV decodes BGR, so convert with cv2.cvtColor(img, cv2.COLOR_BGR2RGB). Values 0–255. |
| point_cloud | [P, >=3] float32 | Columns 0–2 are X, Y, Z in meters; extra columns (intensity, ring, ...) are ignored. No NaN/Inf. Must be in the same coordinate frame as bboxes_3d (sensor or global; consistency is what matters). |
| bboxes_2d | [K, 4] float32 | (x1, y1, x2, y2) in pixel coordinates of image, with x1 < x2, y1 < y2. One row per camera detection. |
| bboxes_3d | [M, 6] float32 | Either axis-aligned extents (xmin, ymin, zmin, xmax, ymax, zmax) or center+dimensions (cx, cy, cz, dx, dy, dz). One row per LiDAR detection. |
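The hard part of the table above (rank, dtype, finiteness, box ordering) can be checked with plain NumPy before anything reaches the matcher. The following is a minimal sketch, not the package's validator: check_frame_shapes and its messages are hypothetical names, and it assumes the NaN/Inf rule applies to the XYZ columns only, since extra columns are ignored.

```python
import numpy as np


def check_frame_shapes(image, point_cloud, bboxes_2d, bboxes_3d):
    """Return a list of problems; an empty list means the arrays match the table."""
    problems = []
    if image.ndim != 3 or image.shape[2] != 3 or image.dtype != np.uint8:
        problems.append("image must be [H, W, 3] uint8")
    if point_cloud.ndim != 2 or point_cloud.shape[1] < 3 or point_cloud.dtype != np.float32:
        problems.append("point_cloud must be [P, >=3] float32")
    elif not np.isfinite(point_cloud[:, :3]).all():
        # Assumption: only X, Y, Z are checked because extra columns are ignored.
        problems.append("point_cloud XYZ contains NaN/Inf")
    if bboxes_2d.ndim != 2 or bboxes_2d.shape[1] != 4 or bboxes_2d.dtype != np.float32:
        problems.append("bboxes_2d must be [K, 4] float32")
    elif not ((bboxes_2d[:, 0] < bboxes_2d[:, 2]) & (bboxes_2d[:, 1] < bboxes_2d[:, 3])).all():
        problems.append("bboxes_2d needs x1 < x2 and y1 < y2")
    if bboxes_3d.ndim != 2 or bboxes_3d.shape[1] != 6 or bboxes_3d.dtype != np.float32:
        problems.append("bboxes_3d must be [M, 6] float32")
    return problems
```

An empty detection list such as np.zeros((0, 4), dtype=np.float32) passes the shape checks in this sketch.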
Concrete file example
The public demo includes a tiny manifest-backed sample under
demo/frames/a9_sample/. Each frame directory uses ordinary files:
frame_0000/
    image.png
    point_cloud.pcd
    detections.json
image.png can be any PNG/JPEG decoded to RGB uint8. OpenCV reads BGR, so
convert explicitly:
import cv2
image_bgr = cv2.imread("image.png", cv2.IMREAD_COLOR)
if image_bgr is None:
    raise FileNotFoundError("image.png")
image = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
Pillow also works and already returns RGB when requested:
from PIL import Image
import numpy as np
image = np.asarray(Image.open("image.png").convert("RGB"), dtype=np.uint8)
point_cloud.pcd should contain XYZ points in meters. ASCII PCD is easy to
inspect:
# .PCD v0.7 - Point Cloud Data file format
VERSION 0.7
FIELDS x y z
SIZE 4 4 4
TYPE F F F
COUNT 1 1 1
WIDTH 3
HEIGHT 1
VIEWPOINT 0 0 0 1 0 0 0
POINTS 3
DATA ascii
7.100 -3.250 0.200
7.450 -3.250 0.450
7.800 -3.000 0.700
Use Open3D if it is already in your perception stack:
import numpy as np
import open3d as o3d
pcd = o3d.io.read_point_cloud("point_cloud.pcd")
point_cloud = np.asarray(pcd.points, dtype=np.float32)
For minimal ASCII PCD files, the demo uses a small fallback loader:
import numpy as np
def load_pcd_ascii(path: str) -> np.ndarray:
    """Load XYZ rows from a minimal ASCII PCD file (FIELDS x y z only)."""
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    # Point rows start on the line after the "DATA ascii" header entry.
    for idx, line in enumerate(lines):
        if line.strip().lower() == "data ascii":
            data_start = idx + 1
            break
    else:
        raise ValueError("PCD file does not contain DATA ascii")
    return np.loadtxt(lines[data_start:], dtype=np.float32).reshape(-1, 3)
detections.json stores the camera detections and LiDAR detections for the
same frame:
{
"frame_key": "a9_sample_0000",
"camera_name": "s110_camera_basler_south2_8mm",
"image": "image.png",
"point_cloud": "point_cloud.pcd",
"bboxes_2d": [
[42.0, 62.0, 96.0, 128.0],
[138.0, 54.0, 198.0, 124.0]
],
"bboxes_3d": [
[6.7, -3.6, -0.2, 9.3, -2.4, 1.6],
[13.7, -1.1, -0.2, 16.3, 0.1, 1.6]
]
}
The JSON-to-API conversion is:
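import json
from pathlib import Path
import numpy as np
from PIL import Image
from xcalib import Matcher

def load_image_rgb(path):
    # Assumed helper: any loader that returns RGB uint8 works (OpenCV or Pillow, as above).
    return np.asarray(Image.open(path).convert("RGB"), dtype=np.uint8)

frame_dir = Path("demo/frames/a9_sample/frame_0000")
det = json.loads((frame_dir / "detections.json").read_text())
image = load_image_rgb(frame_dir / det["image"])
# load_pcd_ascii is the fallback loader shown above.
point_cloud = load_pcd_ascii(frame_dir / det["point_cloud"])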
bboxes_2d = np.asarray(det["bboxes_2d"], dtype=np.float32).reshape(-1, 4)
bboxes_3d = np.asarray(det["bboxes_3d"], dtype=np.float32).reshape(-1, 6)
matcher = Matcher.from_pretrained("crlite", site="a9_dataset_r02_s01")
result = matcher.match(image, point_cloud, bboxes_2d, bboxes_3d)
print(result.matches)
3D bbox format auto-detection
The package disambiguates the two [M, 6] conventions per row: when bbox[3:6] >= bbox[0:3]
holds element-wise on all three axes, the row is treated as min/max extents, otherwise as
center+dimensions. Mixed conventions inside a single call are technically handled but strongly
discouraged; pick one and keep it. For example, a center-style box at (30, 5, 1) with dims
(4, 2, 2) fails the test on x and y (4 < 30, 2 < 5), so it parses correctly as center+dimensions.
A center-style box whose dims are all at least as large as its center coordinates would be
misread as extents; with metric, road-scene boxes this does not occur in practice.
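The per-row rule can be written down in a few lines of NumPy. This is a sketch of the rule as described here, not the package's code: to_min_max is a hypothetical name, and it assumes (dx, dy, dz) are full box dimensions, so the half-extent is dims / 2.

```python
import numpy as np


def to_min_max(bboxes_3d):
    """Convert [M, 6] rows to (xmin, ymin, zmin, xmax, ymax, zmax) with the per-row rule."""
    boxes = np.asarray(bboxes_3d, dtype=np.float32).reshape(-1, 6)
    first, second = boxes[:, :3], boxes[:, 3:]
    # Extents if the last three values are >= the first three on every axis.
    is_extents = np.all(second >= first, axis=1)
    # Otherwise treat the row as center + full dimensions (assumption: dims, not half-dims).
    half = second / 2.0
    from_center = np.concatenate([first - half, first + half], axis=1)
    return np.where(is_extents[:, None], boxes, from_center)
```

With this sketch, the row (30, 5, 1, 4, 2, 2) becomes (28, 4, 0, 32, 6, 2), while a center-style row such as (1, 0.5, 0.5, 4, 2, 2) passes the extents test and is returned unchanged, which is the misread case described above.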
Synchronization & framing requirements
- Image and point cloud must come from the same trigger window; the matchers tolerate normal ITS jitter (≤ ~50 ms at 10 Hz) but are not built to match across frames.
- The LiDAR points inside each 3D box are what the model sees: boxes are expanded by bbox_expansion (default 1.25×, per-model YAML) and points inside are resampled to point_cloud_size (default 1024).
- Detections are produced upstream (the package does no detection).
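To make the expand-and-resample step concrete, here is a rough sketch of what such a crop could look like for an axis-aligned min/max box. It is not the package's implementation: crop_points is a hypothetical name, and the choices to scale the box about its center on all axes and to sample randomly (with replacement only when the crop has fewer points than requested) are assumptions.

```python
import numpy as np


def crop_points(point_cloud, box_min_max, expansion=1.25, size=1024, rng=None):
    """Return [size, 3] points from inside the expanded box, or None if the crop is empty."""
    rng = np.random.default_rng() if rng is None else rng
    xyz = np.asarray(point_cloud, dtype=np.float32)[:, :3]  # extra columns are ignored
    lo, hi = np.asarray(box_min_max[:3]), np.asarray(box_min_max[3:])
    center = (lo + hi) / 2.0
    half = (hi - lo) / 2.0 * expansion  # assumption: uniform scaling about the center
    inside = np.all((xyz >= center - half) & (xyz <= center + half), axis=1)
    pts = xyz[inside]
    if len(pts) == 0:
        return None  # empty crops are dropped
    # Assumption: random resampling, with replacement only when there are too few points.
    idx = rng.choice(len(pts), size=size, replace=len(pts) < size)
    return pts[idx]
```

Counting inside.sum() per box before resampling is also a quick way to see how many LiDAR points each detection actually carries.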
2. Quality floors (soft: warnings)
| Check | Floor | Why |
|---|---|---|
| Image resolution | ≥ 640 × 480 | Training data was full-HD ITS footage; small frames upsample crops. |
| 2D bbox size | ≥ 8 px per side | Crops are resized to crop_size (32 px default); tiny boxes carry no texture. |
| Points per 3D box | ≥ ~50 inside the (expanded) box | PointNet embeddings degrade on near-empty crops; empty crops are dropped. |
| Detections per frame | ≤ 32 image, ≤ 32 LiDAR | Matches the TensorRT dynamic-shape maxShapes the Thor engines are built with (xcalib/engine/trt.py). PyTorch/ONNX-RT inference has no hard cap. |
Violating a floor logs a warning (once per kind); inference still runs. Hard violations — wrong
rank/dtype, NaN/Inf coordinates — raise xcalib.ProtocolError instead.
result = matcher.match(image, pc, b2, b3) # validate="warn" (default)
result = matcher.match(image, pc, b2, b3, validate="strict") # any violation raises
result = matcher.match(image, pc, b2, b3, validate="off") # trusted hot path
To pre-flight a recording without running inference:
from xcalib import validate_frame_inputs
for v in validate_frame_inputs(image, pc, b2, b3):
    print(v)  # [warning] bboxes_2d.small: 2 bbox(es) are smaller than 8px ...
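For recordings that are screened on a machine without xcalib installed, the image, 2D-box and detection-count floors from the table can be approximated with NumPy alone. This is an illustrative sketch: soft_floor_warnings and its message strings are hypothetical and are not the package's output format, 640 × 480 is read as width × height, and the points-per-box floor is left out because it needs the box crop.

```python
import numpy as np

MIN_WIDTH, MIN_HEIGHT = 640, 480  # image resolution floor (assumed width x height)
MIN_BBOX_SIDE = 8                 # pixels per side
MAX_DETECTIONS = 32               # per modality, per frame


def soft_floor_warnings(image, bboxes_2d, bboxes_3d):
    """Return warning strings for the image, 2D-box and detection-count floors."""
    warnings = []
    height, width = image.shape[:2]
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        warnings.append(f"image.small: {width}x{height} is below {MIN_WIDTH}x{MIN_HEIGHT}")
    b2 = np.asarray(bboxes_2d, dtype=np.float32).reshape(-1, 4)
    shortest_side = np.minimum(b2[:, 2] - b2[:, 0], b2[:, 3] - b2[:, 1])
    n_small = int((shortest_side < MIN_BBOX_SIDE).sum())
    if n_small:
        warnings.append(f"bboxes_2d.small: {n_small} bbox(es) are smaller than {MIN_BBOX_SIDE}px")
    for name, count in (("bboxes_2d", len(b2)), ("bboxes_3d", len(bboxes_3d))):
        if count > MAX_DETECTIONS:
            warnings.append(f"{name}.count: {count} detections exceed {MAX_DETECTIONS}")
    return warnings
```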
3. Camera intrinsics (calibrate / one-shot only)
matcher.match() never needs intrinsics. matcher.calibrate() and matcher.oneshot() do — they
solve / use the camera-LiDAR projection:
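import numpy as np
from xcalib import CameraIntrinsics
# Either from focal lengths and principal point, all in pixels:
K = CameraIntrinsics(fx=2666.7, fy=2666.7, cx=960.0, cy=540.0)
# or from an existing 3x3 matrix, optionally with OpenCV-ordered distortion coefficients:
K = CameraIntrinsics.from_matrix(K_3x3, distortion=np.array([k1, k2, p1, p2, k3]))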
- Intrinsics are fixed and known (factory calibration); this package estimates only the camera↔LiDAR extrinsics [R|t] and reports P = K [R|t].
- Distortion coefficients follow OpenCV ordering and are optional; if the image stream is already rectified, omit them.
- The solved extrinsics map coordinates from the bboxes_3d frame (whatever frame the 3D boxes and point cloud were given in) to the camera frame.
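As a sanity check on a solved calibration, P = K [R|t] can be applied to a few LiDAR points by hand. The sketch below is plain pinhole projection without distortion; project_points is a hypothetical helper, K is a raw 3x3 NumPy matrix rather than a CameraIntrinsics object, and the points are assumed to have positive depth in the camera frame.

```python
import numpy as np


def project_points(points_xyz, K, R, t):
    """Project [P, 3] points in the bboxes_3d frame to [P, 2] pixels using P = K [R|t]."""
    P = K @ np.hstack([R, np.reshape(t, (3, 1))])  # 3x4 projection matrix
    homog = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])  # [P, 4] homogeneous
    uvw = homog @ P.T  # [P, 3], last column is depth along the optical axis
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide (assumes depth > 0)


K = np.array([[2666.7, 0.0, 960.0],
              [0.0, 2666.7, 540.0],
              [0.0, 0.0, 1.0]])
# Identity extrinsics: a point on the optical axis lands on the principal point.
pixels = project_points(np.array([[0.0, 0.0, 10.0]]), K, np.eye(3), np.zeros(3))
```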
4. Outputs
MatchResult.similarity is [K', M'] where K' <= K, M' <= M after degenerate/empty detections
are dropped; kept_2d_indices / kept_3d_indices map the surviving rows/columns back to the
caller's original indices, and entries of matches are already expressed in the caller's numbering.
Score semantics per model: cosine in [-1, 1] for the ViT models, normalized Stage-2 scores in
[0, 1] (non-top-K entries are -1) for crlite/crlite_2dpe, sigmoid pair probability in
[0, 1] for calibrefine. Ranking, not absolute magnitude, is the supported signal; threshold per
model after a short on-site calibration run.
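When reading similarity directly instead of matches, row and column positions have to be translated through the kept index arrays. The following sketch shows only that translation: best_per_detection is a hypothetical helper that takes a row-wise argmax, which is not necessarily how the package builds matches.

```python
import numpy as np


def best_per_detection(similarity, kept_2d_indices, kept_3d_indices):
    """For each surviving 2D detection, return (orig_2d_idx, orig_3d_idx, score) of its top column."""
    sim = np.asarray(similarity)
    if sim.size == 0:
        return []  # every detection on one side was dropped
    best_cols = sim.argmax(axis=1)  # ranking, not absolute magnitude, is the signal
    return [
        (int(kept_2d_indices[row]), int(kept_3d_indices[col]), float(sim[row, col]))
        for row, col in enumerate(best_cols)
    ]
```

For a similarity of [[0.1, 0.9], [0.8, 0.2]] with kept_2d_indices [0, 2] and kept_3d_indices [1, 3], this returns (0, 3, 0.9) and (2, 1, 0.8): positions in the caller's original arrays, not in the reduced matrix.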
5. TensorRT engine envelope (Thor deployment)
The shipped engines are built with these dynamic-shape profiles (min / opt / max):
| Tensor | min | opt | max |
|---|---|---|---|
| image_crops (N) | 1 | 8 | 32 |
| lidar_crops (M) | 1 | 12 | 32 |
| stage-2 pairs (B = N·K) | 1 | 80 | 320 |
| calibrefine pairs (B = N·M) | 1 | 96 | 1024 |
Frames outside the envelope must be split or truncated before the engine call; the PyTorch and ONNX-Runtime paths accept any size.
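One way to split an oversized frame is to tile the detections into index ranges that each fit the max profile. The helpers below are hypothetical and only compute the ranges. Whether tiles can be scored independently depends on the model (for example, scores normalized across a whole frame may change), so treat this as a sketch under that assumption.

```python
def split_into_chunks(n_items, max_per_call=32):
    """Return (start, stop) index ranges covering n_items, each at most max_per_call long."""
    return [(s, min(s + max_per_call, n_items)) for s in range(0, n_items, max_per_call)]


def envelope_tiles(n_image, n_lidar, max_image=32, max_lidar=32):
    """Return every (image_range, lidar_range) tile needed to cover the full frame."""
    return [
        (img_range, lidar_range)
        for img_range in split_into_chunks(n_image, max_image)
        for lidar_range in split_into_chunks(n_lidar, max_lidar)
    ]
```

A frame with 40 camera and 70 LiDAR detections yields 2 × 3 = 6 tiles, each with at most 32 · 32 = 1024 pairs, which matches the calibrefine max profile.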