Life Expectancy Predictor

A Gradient Boosting Regressor trained on 14 health and lifestyle features to predict a person's life expectancy in years. Built with scikit-learn and wrapped in a production-ready FastAPI service.

Model Description

This model takes a snapshot of an individual's health profile — including physical attributes, lifestyle habits, and medical history — and returns a predicted life expectancy in years. The primary model is a GradientBoostingRegressor achieving an R² of ~0.87 on the held-out test set. A baseline LinearRegression model is also included for comparison.

Artifact             File                         Description
Primary model        gradient_boosting_model.pkl  GradientBoostingRegressor (472 KB)
Baseline model       linear_model.pkl             LinearRegression (4 KB)
Feature scaler       scaler.pkl                   StandardScaler for all features
Categorical encoder  preprocessor.pkl             LabelEncoder mapping for categorical inputs

Intended Use

  • Research & education: understanding which health factors most affect life expectancy.
  • Health-tech prototypes: powering wellness apps or patient-facing dashboards.
  • Academic exploration: studying gradient boosting on tabular health data.

Not intended for: clinical diagnosis, medical decision-making, or any high-stakes healthcare decisions. Predictions are statistical estimates, not medical advice.

How to Use

Install dependencies

pip install "scikit-learn>=1.5.0" joblib numpy

Load and run inference

import joblib
import numpy as np

# Load artifacts
model = joblib.load("gradient_boosting_model.pkl")
scaler = joblib.load("scaler.pkl")
preprocessor = joblib.load("preprocessor.pkl")  # dict of LabelEncoders

# --- Prepare a sample input ---
# Categorical columns and their LabelEncoders are stored in preprocessor.pkl
# Categorical features: Gender, Physical_Activity, Smoking_Status,
#                       Alcohol_Consumption, Diet, Blood_Pressure

def encode_and_predict(sample: dict) -> float:
    """
    sample keys (all required):
        Gender, Height, Weight, BMI, Physical_Activity, Smoking_Status,
        Alcohol_Consumption, Diet, Blood_Pressure, Cholesterol,
        Diabetes, Hypertension, Heart_Disease, Asthma
    """
    categorical_cols = [
        "Gender", "Physical_Activity", "Smoking_Status",
        "Alcohol_Consumption", "Diet", "Blood_Pressure",
    ]
    encoded = dict(sample)  # work on a copy so the caller's dict is not mutated
    for col in categorical_cols:
        le = preprocessor[col]          # LabelEncoder for this column
        encoded[col] = le.transform([encoded[col]])[0]

    feature_order = [
        "Gender", "Height", "Weight", "BMI",
        "Physical_Activity", "Smoking_Status", "Alcohol_Consumption",
        "Diet", "Blood_Pressure", "Cholesterol",
        "Diabetes", "Hypertension", "Heart_Disease", "Asthma",
    ]
    X = np.array([[encoded[f] for f in feature_order]])
    X_scaled = scaler.transform(X)
    return float(model.predict(X_scaled)[0])


sample = {
    "Gender": "Male",
    "Height": 175,
    "Weight": 75,
    "BMI": 24.5,
    "Physical_Activity": "Medium",
    "Smoking_Status": "Never",
    "Alcohol_Consumption": "Moderate",
    "Diet": "Good",
    "Blood_Pressure": "Normal",
    "Cholesterol": 190,
    "Diabetes": 0,
    "Hypertension": 0,
    "Heart_Disease": 0,
    "Asthma": 0,
}

prediction = encode_and_predict(sample)
print(f"Predicted life expectancy: {prediction:.1f} years")
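
LabelEncoder.transform raises a ValueError when it meets a label it was not fitted on, so it can help to validate a sample before encoding it. The following is a minimal sketch, assuming the allowed categories match the Input Features table; the fitted encoders' classes_ attribute remains the authoritative source. REQUIRED_KEYS, ALLOWED_VALUES, and validate_sample are hypothetical names, not part of the released artifacts.

```python
# Hypothetical helper: checks a sample dict before it is passed to encode_and_predict.
REQUIRED_KEYS = [
    "Gender", "Height", "Weight", "BMI",
    "Physical_Activity", "Smoking_Status", "Alcohol_Consumption",
    "Diet", "Blood_Pressure", "Cholesterol",
    "Diabetes", "Hypertension", "Heart_Disease", "Asthma",
]

# Assumed from the Input Features table; compare with preprocessor[col].classes_.
ALLOWED_VALUES = {
    "Gender": {"Male", "Female"},
    "Physical_Activity": {"Low", "Medium", "High"},
    "Smoking_Status": {"Never", "Former", "Current"},
    "Alcohol_Consumption": {"None", "Moderate", "Heavy"},
    "Diet": {"Poor", "Average", "Good"},
    "Blood_Pressure": {"Low", "Normal", "High"},
}


def validate_sample(sample: dict) -> list:
    """Return a list of problems; an empty list means the sample looks usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in sample]
    for col, allowed in ALLOWED_VALUES.items():
        if col in sample and sample[col] not in allowed:
            problems.append(f"{col}: unexpected value {sample[col]!r}")
    return problems
```

Calling validate_sample(sample) before encode_and_predict(sample) lets an application report every problem at once instead of failing on the first unseen label.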

Download from the Hub

from huggingface_hub import hf_hub_download
import joblib

model = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="gradient_boosting_model.pkl")
)
scaler = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="scaler.pkl")
)
preprocessor = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="preprocessor.pkl")
)

Input Features

Feature              Type         Values / Range            Description
Gender               categorical  Male / Female             Biological sex
Height               numerical    cm                        Body height
Weight               numerical    kg                        Body weight
BMI                  numerical    continuous                Body Mass Index
Physical_Activity    categorical  Low / Medium / High       Exercise level
Smoking_Status       categorical  Never / Former / Current  Smoking history
Alcohol_Consumption  categorical  None / Moderate / Heavy   Alcohol intake
Diet                 categorical  Poor / Average / Good     Overall diet quality
Blood_Pressure       categorical  Low / Normal / High       Blood pressure category
Cholesterol          numerical    mg/dL                     Total cholesterol level
Diabetes             binary       0 / 1                     Diabetes diagnosis flag
Hypertension         binary       0 / 1                     Hypertension diagnosis flag
Heart_Disease        binary       0 / 1                     Heart disease diagnosis flag
Asthma               binary       0 / 1                     Asthma diagnosis flag
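
Height, Weight, and BMI are supplied as separate features, so an inconsistent combination is an easy input mistake. The sketch below assumes the BMI column follows the standard definition (weight in kg divided by height in metres squared), which this card does not state explicitly; bmi_from_height_weight is a hypothetical helper.

```python
def bmi_from_height_weight(height_cm: float, weight_kg: float) -> float:
    """Standard BMI: weight in kg divided by height in metres squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


# Height and Weight taken from the sample input in "Load and run inference".
print(round(bmi_from_height_weight(175, 75), 1))  # 24.5
```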

Output

A single continuous float representing predicted life expectancy in years.

Training Details

Dataset

  • Size: ~10,002 records
  • Split: 68 % train / 10 % validation / 22 % test
  • Target variable: Age (life expectancy in years)
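
One way to obtain a 68 / 10 / 22 split with scikit-learn is to call train_test_split twice. This is only a sketch: the card does not say how the split was produced, and the placeholder arrays and random_state below are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the documented row count; the real features are not reproduced here.
X = np.arange(10_002).reshape(-1, 1)
y = np.arange(10_002)

# First hold out 22 % of all rows for testing.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.22, random_state=42
)
# Then take 10 % of the full dataset for validation (0.10 / 0.78 of the remainder).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.10 / 0.78, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```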

Preprocessing

  1. Fill missing categorical values with "None".
  2. LabelEncoder applied per categorical column (encoders saved in preprocessor.pkl).
  3. StandardScaler applied to all 14 features after encoding (saved in scaler.pkl).
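
The three steps above can be sketched on a toy table as follows. The rows, the reduced column set, and the variable names are illustrative assumptions rather than the original training script; in practice the scaler would normally be fitted on the training split only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy rows (hypothetical values): Smoking_Status, Alcohol_Consumption, Cholesterol
rows = [
    ("Never", "Moderate", 190.0),
    ("Current", None, 240.0),
    ("Former", "Heavy", 210.0),
    ("Never", None, 180.0),
]
categorical = {"Smoking_Status": 0, "Alcohol_Consumption": 1}  # name -> column index

# Step 1: fill missing categorical values with the string "None".
columns = [list(col) for col in zip(*rows)]
for idx in categorical.values():
    columns[idx] = ["None" if value is None else value for value in columns[idx]]

# Step 2: one LabelEncoder per categorical column, kept for reuse at inference time.
encoders = {}
for name, idx in categorical.items():
    encoders[name] = LabelEncoder()
    columns[idx] = encoders[name].fit_transform(columns[idx])

# Step 3: scale every feature after encoding.
X = np.column_stack(columns).astype(float)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(encoders["Alcohol_Consumption"].classes_.tolist())
print(X_scaled.shape)
```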

Primary Model — GradientBoostingRegressor

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)

Baseline Model — LinearRegression

A standard LinearRegression is also provided (linear_model.pkl) for interpretability and benchmarking.
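
To illustrate how the two models can be benchmarked side by side, the sketch below fits both on synthetic stand-in data from make_friedman1. The data, split, and resulting scores are assumptions for demonstration and say nothing about the published metrics; only the GradientBoostingRegressor hyperparameters come from this card.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 14 features; NOT the model's training set.
X, y = make_friedman1(n_samples=600, n_features=14, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.22, random_state=0
)

candidates = {
    "gradient_boosting": GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.1, max_depth=5,
        min_samples_split=5, min_samples_leaf=2, random_state=42,
    ),
    "linear_baseline": LinearRegression(),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    scores[name] = estimator.score(X_test, y_test)  # R² on the held-out split
    print(f"{name}: R² = {scores[name]:.3f}")
```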

Performance

Metric            Value
R² (test set)     0.85 – 0.92
RMSE              3 – 5 years
Confidence score  0.87

Metrics are on the held-out test split (~22 % of 10 k records).
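
For reference, R² and RMSE can be computed from predictions as in the sketch below. The function names and the four example values are hypothetical and are not taken from the model's test set.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (years)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical true and predicted values, in years.
y_true = [70, 75, 80, 85]
y_pred = [72, 74, 79, 88]
print(f"R² = {r_squared(y_true, y_pred):.2f}, RMSE = {rmse(y_true, y_pred):.2f} years")
# R² = 0.88, RMSE = 1.94 years
```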

Limitations

  • The model is trained on a synthetic / illustrative dataset; real-world generalization is not guaranteed.
  • It does not account for socioeconomic factors, genetics, geography, or environmental exposures.
  • Categorical label encodings are order-sensitive: the integer codes depend on the classes each encoder was fitted on, so always use the supplied preprocessor.pkl rather than re-encoding independently.
  • Predictions for feature combinations far outside the training distribution may be unreliable.
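
The encoding caveat can be seen directly: scikit-learn's LabelEncoder assigns integer codes in sorted class order, so the codes follow alphabetical order rather than any Low < Medium < High ranking. The sketch below fits a fresh encoder on the documented Physical_Activity values; the encoder shipped in preprocessor.pkl may have been fitted on different classes.

```python
from sklearn.preprocessing import LabelEncoder

# Fitted here on the documented Physical_Activity values; the shipped encoder may differ.
le = LabelEncoder().fit(["Low", "Medium", "High"])

print(le.classes_.tolist())                              # ['High', 'Low', 'Medium']
print(le.transform(["Low", "Medium", "High"]).tolist())  # [1, 2, 0]
```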

Ethical Considerations

  • Not a medical device. Do not use predictions to make clinical, insurance, or policy decisions.
  • Fairness: The model may reflect biases present in the training data. Subgroup performance (by gender, age bracket, etc.) has not been audited.
  • Privacy: No personal data is stored or transmitted by this model artifact; ensure your application handles user health data in compliance with applicable regulations (HIPAA, GDPR, etc.).

Citation

If you use this model in your research or application, please cite:

@misc{lebiraja2024lifeexpectancy,
  author       = {lebiraja},
  title        = {Life Expectancy Predictor},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/lebiraja/life-expectancy-predictor}},
}

License

MIT
