Bias Auditing in Body Measurement APIs: What to Test

A body measurement prediction model trained predominantly on European and North American populations will be more accurate for European and North American users than for East Asian, South Asian, or African users. This isn’t a hypothesis — it’s a structural consequence of the underlying training data, which is itself a consequence of which populations have been surveyed and which haven’t.

For applications serving global user bases, demographic bias in body measurement predictions produces systematically worse size recommendations for some groups of users. This is worth testing for and, where possible, correcting.

What bias looks like in anthropometric prediction

Anthropometric bias manifests in two specific ways:

Accuracy disparity: The model’s prediction error is higher for some demographic groups than for others. If the root mean square error (RMSE) of a hip circumference prediction is 45 mm for European women and 70 mm for East Asian women, the model is less useful for East Asian users.

Systematic offset: The model consistently predicts too high or too low for a specific demographic group. A model trained on US military (ANSUR II) data and applied to East Asian civilians will predict larger waist-to-height ratios than are typical for that population, because East Asian populations have different body proportions.

Both are problems, but systematic offsets are often more damaging because they produce consistently wrong recommendations in the same direction — always too large, or always too small — for entire population segments.
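To make the distinction concrete, here is a minimal sketch using made-up signed errors (predicted minus measured, in mm) for two hypothetical groups. The numbers are illustrative only, not measured data, and summarize_errors is a helper name introduced for this example.

```python
import numpy as np

def summarize_errors(errors_mm):
    """Return (mean signed error, RMSE) for a list of signed errors in mm."""
    errors = np.asarray(errors_mm, dtype=float)
    mean_error = float(np.mean(errors))           # systematic offset
    rmse = float(np.sqrt(np.mean(errors ** 2)))   # overall error magnitude
    return mean_error, rmse

# Illustrative values only (not real measurements)
noisy_but_centered = [-40, 40, -40, 40]  # large errors that cancel on average
consistently_high = [30, 30, 30, 30]     # smaller errors, all in one direction

print(summarize_errors(noisy_but_centered))  # (0.0, 40.0)
print(summarize_errors(consistently_high))   # (30.0, 30.0)
```

The second group has the lower RMSE, yet every one of its predictions errs in the same direction, which is why the mean signed error is worth reporting alongside RMSE.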

Which groups are most at risk

The public anthropometric datasets that most models draw on have well-documented coverage gaps:

East Asian populations: SIZE KOREA provides coverage, but many models rely primarily on ANSUR II and NHANES, which are US-centric. East Asian populations have systematically different body proportions — trunk-to-leg ratios, waist-to-hip ratios, and proportional differences in limb measurements — that are not captured by US or European training data.

South Asian populations: Public anthropometric data for Indian subcontinent populations is limited. Models tend to generalize from adjacent populations (East Asian or European) with unclear accuracy.

Sub-Saharan African populations: Full-dimension anthropometric survey data is very limited. Models applied to African populations are typically extrapolating beyond their training distribution.

Higher BMI ranges: Military datasets such as ANSUR II draw on fitness-screened participants, which leaves individuals with a BMI above roughly 30–32 underrepresented in the training data. Models may extrapolate less accurately for this population.

Older adults (65+): Most major datasets focus on working-age adults. Body composition and proportions shift with age (sarcopenic changes, height reduction, redistributed fat patterns) in ways that aren’t captured by models trained on younger populations.

How to test for bias

If you have access to a ground-truth measurement dataset (even a small one — 200–500 subjects is sufficient for basic bias testing), you can measure prediction accuracy separately by demographic group.

import numpy as np

def bias_audit(
    predictions: list[dict],
    ground_truth: list[dict],
    demographic_key: str,
    dimension_key: str
) -> dict:
    """
    Compute bias metrics for a specific dimension, stratified by a demographic group.
    
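    predictions: list of flat dicts mapping dimension name to predicted value (mm),
        in the same subject order as ground_truth
    ground_truth: list of dicts with actual measurements (mm) and demographic
        fields for the same subjects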
    demographic_key: e.g. "region", "gender", "age_category"
    dimension_key: e.g. "waist_circumference_natural"
    """
    groups = {}
    
    for pred, truth in zip(predictions, ground_truth):
        group = truth.get(demographic_key, "unknown")
        predicted_val = pred.get(dimension_key)
        true_val = truth.get(dimension_key)
        
        if predicted_val is None or true_val is None:
            continue
        
        if group not in groups:
            groups[group] = []
        groups[group].append((predicted_val, true_val))
    
    results = {}
    for group, pairs in groups.items():
        if len(pairs) < 10:  # Too few samples for reliable stats
            continue
        
        predicted = np.array([p for p, _ in pairs])
        true_vals = np.array([t for _, t in pairs])
        errors = predicted - true_vals
        
        results[group] = {
            "n": len(pairs),
            "mean_error_mm": round(float(np.mean(errors)), 1),          # Systematic offset
            "mae_mm": round(float(np.mean(np.abs(errors))), 1),         # Mean absolute error
            "rmse_mm": round(float(np.sqrt(np.mean(errors**2))), 1),   # RMSE
            "error_p5_mm": round(float(np.percentile(errors, 5)), 1),   # 5th percentile error
            "error_p95_mm": round(float(np.percentile(errors, 95)), 1)  # 95th percentile error
        }
    
    # Flag groups whose RMSE exceeds the unweighted mean of the group RMSEs
    # by more than 10 mm. This is a heuristic threshold, not a significance test.
    if results:
        mean_group_rmse = np.mean([v["rmse_mm"] for v in results.values()])
        for group, metrics in results.items():
            metrics["vs_overall"] = round(float(metrics["rmse_mm"] - mean_group_rmse), 1)
            metrics["flag"] = "HIGH_BIAS" if metrics["vs_overall"] > 10 else "OK"
    
    return results

# Example usage
# results = bias_audit(predictions, ground_truth, "region", "waist_circumference_natural")
# Output (abridged; the mae_mm, percentile, and vs_overall fields are omitted):
# {
#   "EUROPE": {"n": 124, "mean_error_mm": 1.2, "rmse_mm": 42.1, "flag": "OK"},
#   "ASIA_PACIFIC": {"n": 87, "mean_error_mm": 8.4, "rmse_mm": 61.3, "flag": "HIGH_BIAS"},
#   ...
# }
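With only a few dozen subjects per group, a gap of 10 mm in RMSE can be sampling noise. One way to check is a percentile bootstrap on each group's signed errors. The sketch below is an illustration, not part of the audit function above: bootstrap_rmse_ci is a hypothetical helper, and the resample count and 95% level are arbitrary defaults.

```python
import numpy as np

def bootstrap_rmse_ci(errors_mm, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for RMSE from signed errors (mm)."""
    errors = np.asarray(errors_mm, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row of idx selects one resample (with replacement) of the errors
    idx = rng.integers(0, len(errors), size=(n_resamples, len(errors)))
    rmse_samples = np.sqrt(np.mean(errors[idx] ** 2, axis=1))
    low, high = np.quantile(rmse_samples, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)

# Example with synthetic errors (illustrative only)
synthetic = np.random.default_rng(1).normal(loc=5.0, scale=40.0, size=80)
print(bootstrap_rmse_ci(synthetic))
```

If the intervals for two groups overlap heavily, treat a HIGH_BIAS flag as a prompt to collect more validation data rather than as a confirmed disparity.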

Testing without ground-truth data

If you don’t have a validation dataset, you can still perform a basic structural audit: hold height, weight, and gender fixed, change only the region parameters (input_origin_region and target_region in the request below), and check whether the API’s predictions vary across regions.

import requests

def cross_region_consistency_test(
    gender: str,
    height_mm: int,
    weight_kg: float,
    dimension: str = "waist_circumference_natural"
) -> dict:
    """
    Test whether regional calibration produces meaningfully different predictions
    for the same height/weight combination across regions.
    
    If predictions are identical across all regions, the model likely has no
    regional calibration — a signal of potential population bias.
    """
    regions = ["GLOBAL", "EUROPE", "ASIA_PACIFIC", "AFRICA", "LATAM", "INDIA", "MIDDLE_EAST"]
    results = {}
    
    for region in regions:
        response = requests.post(
            "https://dimensionspot-bodysize-engine.p.rapidapi.com/v1/predict",
            json={
                "input_data": {
                    "input_unit_system": "metric",
                    "subject": {"gender": gender, "input_origin_region": region},
                    "anchors": {"body_height": height_mm, "body_mass": weight_kg}
                },
                "output_settings": {
                    "calculation": {"target_region": region},
                    "requested_dimensions": {"specific_dimensions": [dimension]},
                    "output_format": {"include_range_95": False}
                }
            },
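            headers={
                "X-RapidAPI-Key": "YOUR_API_KEY",
                "X-RapidAPI-Host": "dimensionspot-bodysize-engine.p.rapidapi.com"
            },
            timeout=10  # Avoid hanging indefinitely on a stalled connection
        )
        # Fail loudly on HTTP errors; otherwise failed requests would be silently
        # skipped and an empty result would look like "zero spread"
        response.raise_for_status()
        data = response.json()
        dim = data.get("body_dimensions", {}).get(dimension)
        if dim:
            results[region] = dim.get("value")
    
    # Compute the range (max minus min) of predictions across regions
    values = [v for v in results.values() if v is not None]
    spread_mm = max(values) - min(values) if len(values) > 1 else 0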
    
    return {
        "dimension": dimension,
        "predictions_by_region": results,
        "spread_mm": spread_mm,
        "note": "Higher spread indicates more regional calibration. Zero spread suggests no regional variation."
    }

What to do when you find bias

Use regional calibration parameters. If the API supports it, always pass the correct input_origin_region for your user’s population. Defaulting to GLOBAL when you know your users are primarily East Asian produces systematically biased results.

Display wider uncertainty intervals for underrepresented populations. The 95% prediction interval should be wider for populations where training data is sparse. If you can identify that your user is in an underrepresented group, surface this in the UI: “Our estimates are less precise for [population]. Providing your waist measurement will improve accuracy.”
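If the interval you receive is not already adjusted by population, one rough option is to widen it yourself using your own audit results. The sketch below assumes you have a point estimate and a 95% range as plain numbers in mm; both function names are hypothetical, and scaling by the ratio of group RMSE to a reference RMSE is a heuristic, not a calibrated statistical procedure.

```python
def widening_factor(group_rmse_mm, reference_rmse_mm):
    """Hypothetical heuristic: ratio of group RMSE to reference RMSE, never below 1."""
    return max(1.0, group_rmse_mm / reference_rmse_mm)

def widen_interval(value_mm, low_mm, high_mm, factor):
    """Scale a prediction interval around its point estimate by `factor`."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return (
        value_mm - (value_mm - low_mm) * factor,
        value_mm + (high_mm - value_mm) * factor,
    )

factor = widening_factor(60, 40)                 # 1.5
print(widen_interval(800, 760, 840, factor))     # (740.0, 860.0)
```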

Collect feedback and use it. If you have any way to collect post-purchase fit feedback (returns, ratings), segment it by user demographics. This gives you ground-truth bias data from your own user base.
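A minimal sketch of that segmentation, assuming each feedback record is a dict with a demographic field and a "fit" field whose values are "too_small", "ok", or "too_large". The record shape and the function name are assumptions made for this example; adapt them to whatever your returns or ratings data looks like.

```python
from collections import Counter, defaultdict

def fit_feedback_by_group(feedback, group_key="region"):
    """Summarize too-small and too-large rates per demographic group."""
    counts = defaultdict(Counter)
    for record in feedback:
        group = record.get(group_key, "unknown")
        counts[group][record.get("fit", "unknown")] += 1

    summary = {}
    for group, c in counts.items():
        n = sum(c.values())
        summary[group] = {
            "n": n,
            "too_small_rate": round(c["too_small"] / n, 2),
            "too_large_rate": round(c["too_large"] / n, 2),
        }
    return summary

# Illustrative records only
feedback = [
    {"region": "EUROPE", "fit": "ok"},
    {"region": "EUROPE", "fit": "too_small"},
    {"region": "EUROPE", "fit": "ok"},
    {"region": "EUROPE", "fit": "too_large"},
    {"region": "ASIA_PACIFIC", "fit": "too_large"},
    {"region": "ASIA_PACIFIC", "fit": "too_large"},
    {"region": "ASIA_PACIFIC", "fit": "too_large"},
    {"region": "ASIA_PACIFIC", "fit": "ok"},
]
print(fit_feedback_by_group(feedback))
```

A group whose "too large" rate is far above its "too small" rate is showing the same kind of one-directional offset described earlier, only measured through returns instead of a tape measure.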

Ask the API provider. A responsible provider should be able to tell you which datasets were used for regional calibration, the validation accuracy by population group, and known limitations. If this information isn’t available, factor that uncertainty into how you present predictions.

Fall back gracefully. For users in populations you know are poorly covered, consider making size recommendations more conservative: suggest two adjacent sizes rather than a single recommendation, or prompt the user to provide a self-measured circumference.
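One simple way to implement the adjacent-size fallback is to recommend every size whose range overlaps the prediction interval rather than only the size containing the point estimate. The size chart below is invented for illustration, and recommend_sizes is a hypothetical helper, not part of any API.

```python
def recommend_sizes(low_mm, high_mm, size_chart):
    """Return every size whose (min, max) range overlaps the prediction interval."""
    return [
        label
        for label, size_min, size_max in size_chart
        if size_min < high_mm and low_mm < size_max
    ]

# Hypothetical waist size chart in mm: (label, min, max)
size_chart = [("S", 640, 720), ("M", 720, 800), ("L", 800, 880)]

print(recommend_sizes(740, 770, size_chart))  # ['M']
print(recommend_sizes(780, 830, size_chart))  # ['M', 'L']
```

A narrow interval still yields a single size, while an interval that straddles a boundary yields two. A very wide interval can yield more than two, which is a reasonable trigger for asking the user to supply a self-measured circumference instead.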

Framing bias to stakeholders

For CTOs and product leads: demographic bias in a sizing model is not a distant ethics problem — it’s a user experience problem with direct business consequences. Users who consistently get wrong size recommendations become users who return products, or stop trusting your recommendations, or leave for a competitor.

Framing the bias audit as “accuracy testing for non-European users” rather than “ethics review” often lands more concretely in product planning conversations. The goal is accurate predictions for all users, not just the majority.

Bias testing is not a one-time exercise. If you update your model, add new training data, or expand to new markets, repeat the audit. Population bias in anthropometric models is a structural issue rooted in data collection history — it doesn’t disappear unless you explicitly measure and address it.
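Repeating the audit is easier if you keep each run's per-group results and compare them automatically. The sketch below assumes two dicts in the shape returned by bias_audit above (group name mapped to metrics including "rmse_mm"); compare_audits and the 5 mm tolerance are assumptions chosen for illustration.

```python
def compare_audits(previous, current, tolerance_mm=5.0):
    """Return groups whose RMSE worsened by more than tolerance_mm between audits."""
    regressions = {}
    for group, metrics in current.items():
        before = previous.get(group)
        if before is None:
            continue  # New group: no baseline to compare against
        delta = round(metrics["rmse_mm"] - before["rmse_mm"], 1)
        if delta > tolerance_mm:
            regressions[group] = delta
    return regressions

# Illustrative audit results only
previous = {"EUROPE": {"rmse_mm": 42.1}, "ASIA_PACIFIC": {"rmse_mm": 61.3}}
current = {
    "EUROPE": {"rmse_mm": 43.0},
    "ASIA_PACIFIC": {"rmse_mm": 70.0},
    "INDIA": {"rmse_mm": 65.0},
}
print(compare_audits(previous, current))  # {'ASIA_PACIFIC': 8.7}
```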
