What accuracy means for probabilistic predictions
DimensionsPot returns a statistical estimate for each dimension, not a measurement. Accuracy has two components:
- Point accuracy: how close the predicted value is to the true measurement, on average. Expressed as Mean Absolute Error (MAE).
- Interval calibration: whether the 95% prediction interval (range_95) actually contains the true measurement 95% of the time. A well-calibrated API is one where a stated 95% PI contains the true value in at least 95% of cases.
Both are validated separately. Point accuracy tells you how far off a typical prediction is. Interval calibration tells you whether the API is honest about its uncertainty.
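As a rough illustration of how the two metrics differ, the sketch below computes MAE and 95% PI coverage for a handful of predictions. The function names and the sample numbers are hypothetical, chosen for illustration only; they are not taken from the validation run or from the API's own test code.

```python
from typing import Sequence, Tuple


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between predictions and ground truth."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def interval_coverage(intervals: Sequence[Tuple[float, float]], actual: Sequence[float]) -> float:
    """Fraction of ground-truth values that fall inside their (low, high) interval."""
    if len(intervals) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for (low, high), a in zip(intervals, actual) if low <= a <= high)
    return hits / len(actual)


# Hypothetical values in mm (illustrative only, not validation data).
predicted = [85.0, 88.0, 90.0, 84.0]
intervals = [(79.0, 91.0), (82.0, 94.0), (84.0, 96.0), (78.0, 90.0)]
measured = [87.0, 86.0, 97.0, 83.0]

mae = mean_absolute_error(predicted, measured)
coverage = interval_coverage(intervals, measured)
print(f"MAE: {mae:.1f} mm, 95% PI coverage: {coverage:.0%}")
```

In this toy sample the third measurement falls outside its interval, so coverage is 75% even though the MAE is only 3.0 mm. A low MAE does not by itself imply calibrated intervals, which is why the two are reported separately.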
Validation framework
API v1.4.0 (model adult_ridge_v4.0) was validated before deployment across 7 test tracks totalling 5,200+ API calls against ground-truth measurements from three independent public datasets. No cherry-picking: precision thresholds were set before the runs, and every track is included in the report.
| Track | Population | Input | N | Reference |
|---|---|---|---|---|
| T1 | ANSUR II athletic adults | Height + mass | 300 | ANSUR II 2012 |
| T2 | ANSUR II athletic adults | 1–3 random anchors | 200 | ANSUR II 2012 |
| T3 | NHANES civilian adults | Height + mass | 500 | NHANES 2001–2018 |
| T4 | NHANES civilian adults | 2–3 random anchors | 300 | NHANES 2001–2018 |
| T5 | Pediatric 0–18 y | Age-stratified | 400 | NHANES pediatric |
| T6 | Multi body-build | Height + mass + circumferences | 2,100 | ANSUR II 2012 |
| T7 | Weak-anchor audit | Single anchor only | 1,400 | ANSUR II 2012 |
Server errors across all 5,200+ calls: 0.
Selected dimension performance — height + mass input
Under the most common production condition (height and mass only, no additional anchors), MAE and 95% PI coverage for selected BONE dimensions (Track 1, n=300):
| Dimension | MAE | 95% PI coverage |
|---|---|---|
| bimalleolar_breadth | 2.8 mm | 84.0% |
| hand_breadth | 2.9 mm | 88.3% |
| lateral_malleolus_height | 3.9 mm | 92.3% |
| head_breadth | 4.3 mm | 91.3% |
| menton_sellion_length | 4.8 mm | 94.3% |
| wrist_circumference | 5.4 mm | 86.7% |
| head_length | 5.7 mm | 89.0% |
| hand_length | 6.6 mm | 89.0% |
| forearm_length | 10.0 mm | 90.7% |
| ankle_circumference | 10.8 mm | 78.7% |
FLESH dimensions (circumferences) carry higher MAE by design: soft-tissue volume is not fully determined by height and mass alone. Supplying circumference anchors directly reduces FLESH MAE — see Anchor Strategy for the precision impact of additional inputs.
NHANES civilian holdout (Tracks 3–4)
ANSUR II is a military dataset of athletic adults. To assess generalization to a civilian population, two separate NHANES holdouts were run: 500 subjects with height and mass only (Track 3) and 300 subjects with 2–3 random anchors (Track 4).
| Metric | Result | Target |
|---|---|---|
| Average MAE across dimensions | 14.1 mm | ≤ 16 mm ✓ |
| 95% PI coverage | 77–79% | > 75% ✓ |
Note on NHANES landmark differences: NHANES measures upper_arm_length from a different anatomical landmark than ANSUR II, producing a systematic ~50 mm offset. The engine produces values consistent with ANSUR II / ISO 7250-1 landmark definitions. If your application requires NHANES-convention measurements, apply a fixed 50 mm correction to upper_arm_length outputs.
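A minimal sketch of applying that correction client-side is shown below. It assumes each dimension in the response is a mapping with a value field and a two-element range_95 list; that shape is an assumption for illustration, as are the helper name and the sample numbers. The note above gives only the magnitude of the offset, so the sign used here is also an assumption: pass a negative offset_mm if your comparison against NHANES data shows the shift runs the other way.

```python
# Magnitude comes from the landmark note; the positive sign is an assumption.
NHANES_UPPER_ARM_OFFSET_MM = 50.0


def to_nhanes_convention(dimensions: dict, offset_mm: float = NHANES_UPPER_ARM_OFFSET_MM) -> dict:
    """Return a copy of the dimensions with upper_arm_length shifted by offset_mm.

    Both the point estimate and the interval bounds are shifted, because a
    fixed landmark offset moves the whole distribution without changing its width.
    """
    corrected = {name: dict(entry) for name, entry in dimensions.items()}
    entry = corrected.get("upper_arm_length")
    if entry is not None:
        entry["value"] = entry["value"] + offset_mm
        entry["range_95"] = [bound + offset_mm for bound in entry["range_95"]]
    return corrected


# Hypothetical response fragment in mm (field layout assumed, values illustrative).
response = {
    "upper_arm_length": {"value": 330.0, "range_95": [310.0, 350.0]},
    "hand_length": {"value": 190.0, "range_95": [177.0, 203.0]},
}

adjusted = to_nhanes_convention(response)
print(adjusted["upper_arm_length"])
```

Only upper_arm_length is touched; every other dimension passes through unchanged, and the original response is not mutated.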
Effect of circumference anchors — Track 6
When clients supply circumference measurements alongside height and mass (PRIMARY_RICH tier), the engine constrains soft-tissue predictions directly rather than estimating them from body mass.
| Configuration | Dimensions improved | Neutral | Degraded |
|---|---|---|---|
| Height + mass + hip | 15 | 35 | 5 |
| Height + mass + waist | 11 | 40 | 4 |
| Height + mass + chest | 15 | 37 | 3 |
| Height + mass + neck + wrist | 7 | 45 | 2 |
| Height + mass + all 5 circumferences | 34 | 17 | 0 |
| Height + mass + all 5 (civilian build) | 34 | 17 | 0 |
| Height + mass + all 5 (overweight) | 29 | 19 | 3 |
With 5 circumference anchors, 34 of 56 tested dimensions improve and zero degrade for athletic and civilian body types. Additional input signal propagates correctly through the prediction pipeline — it does not introduce regression artifacts.
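One way to produce an improved / neutral / degraded tally like the one above is to compare per-dimension MAE with and without the extra anchors. The sketch below does this with a tolerance band; the 0.5 mm tolerance, the function names, and the sample MAE values are hypothetical, and the rule the validation script actually uses is not stated here.

```python
from collections import Counter


def classify_change(baseline_mae: float, anchored_mae: float, tolerance_mm: float = 0.5) -> str:
    """Label a dimension by how its MAE moved once anchors were added."""
    delta = anchored_mae - baseline_mae
    if delta < -tolerance_mm:
        return "improved"
    if delta > tolerance_mm:
        return "degraded"
    return "neutral"


def tally_changes(baseline: dict, anchored: dict, tolerance_mm: float = 0.5) -> Counter:
    """Count improved / neutral / degraded dimensions across two MAE tables."""
    return Counter(
        classify_change(baseline[name], anchored[name], tolerance_mm)
        for name in baseline
    )


# Hypothetical per-dimension MAE in mm (illustrative only, not validation data).
baseline = {"dimension_a": 40.0, "dimension_b": 30.0, "dimension_c": 6.0}
anchored = {"dimension_a": 12.0, "dimension_b": 29.8, "dimension_c": 7.0}

counts = tally_changes(baseline, anchored)
print(counts["improved"], counts["neutral"], counts["degraded"])
```

The tolerance band keeps small fluctuations in MAE from being counted as improvement or regression.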
Reproducibility
All results above were generated by tests/precision_validation_v4.py (RANDOM_SEED=42). Three independent runs — local development, local post-cleanup, and production Google Cloud Run EU — produced numerically identical output across all MAE, bias, and coverage metrics (0.0 mm delta). The validation is not environment-specific; it is a property of the API itself.
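A cross-environment check of this kind can be sketched as a comparison of the metric tables produced by two runs. The helper below is hypothetical and is not part of tests/precision_validation_v4.py; the metric names and values are placeholders.

```python
def max_metric_delta(run_a: dict, run_b: dict) -> float:
    """Largest absolute difference between two runs' metrics, keyed by metric name."""
    if not run_a or run_a.keys() != run_b.keys():
        raise ValueError("runs must report the same non-empty set of metrics")
    return max(abs(run_a[name] - run_b[name]) for name in run_a)


# Placeholder metric tables (names and values are illustrative only).
local_run = {"mae_mm": 3.0, "bias_mm": -0.4, "coverage_95": 0.91}
cloud_run = {"mae_mm": 3.0, "bias_mm": -0.4, "coverage_95": 0.91}

delta = max_metric_delta(local_run, cloud_run)
print(f"max delta across metrics: {delta}")
```

With a fixed seed and a deterministic model, the expected delta is exactly 0.0; any nonzero value would point to an environment difference such as a library version or floating-point configuration.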
Limitations
| Area | Detail |
|---|---|
| Training population | ANSUR II is US military: athletic adults, narrow BMI distribution. NHANES civilian holdout (n=800) confirms generalization, but predictions for extreme BMI subjects carry higher FLESH error. |
| MIDDLE_EAST female | Coefficients derived from global baseline. Dedicated regional survey not available. |
| AFRICA | Standard deviations proxied from ANSUR II with a conservative confidence penalty applied. |
| INDIA female | ASIA_PACIFIC fallback coefficients. |
| Torso circumferences without anchors | waist_circumference, hip_circumference, calf_circumference have higher MAE at PRIMARY_BOTH tier — body composition at a given BMI varies substantially across individuals. Supplying one circumference anchor brings these dimensions to PRIMARY_RICH accuracy. |