Life Expectancy Predictor

A Gradient Boosting Regressor trained on 14 health and lifestyle features to predict a person's life expectancy in years. Built with scikit-learn and wrapped in a production-ready FastAPI service.

Model Description

This model takes a snapshot of an individual's health profile — including physical attributes, lifestyle habits, and medical history — and returns a predicted life expectancy in years. The primary model is a GradientBoostingRegressor achieving an R² of ~0.87 on the held-out test set. A baseline LinearRegression model is also included for comparison.

Artifact	File	Description
Primary model	`gradient_boosting_model.pkl`	GradientBoostingRegressor (472 KB)
Baseline model	`linear_model.pkl`	LinearRegression (4 KB)
Feature scaler	`scaler.pkl`	StandardScaler for all features
Categorical encoder	`preprocessor.pkl`	LabelEncoder mapping for categorical inputs

Intended Use

Research & education: understanding which health factors most affect life expectancy.
Health-tech prototypes: powering wellness apps or patient-facing dashboards.
Academic exploration: studying gradient boosting on tabular health data.

Not intended for: clinical diagnosis, medical decision-making, or any high-stakes healthcare decisions. Predictions are statistical estimates, not medical advice.

How to Use

Install dependencies

pip install "scikit-learn>=1.5.0" joblib numpy

Load and run inference

import joblib
import numpy as np

# Load artifacts
model = joblib.load("gradient_boosting_model.pkl")
scaler = joblib.load("scaler.pkl")
preprocessor = joblib.load("preprocessor.pkl")  # dict of LabelEncoders

# --- Prepare a sample input ---
# Categorical columns and their LabelEncoders are stored in preprocessor.pkl
# Categorical features: Gender, Physical_Activity, Smoking_Status,
#                       Alcohol_Consumption, Diet, Blood_Pressure

def encode_and_predict(sample: dict) -> float:
    """
    sample keys (all required):
        Gender, Height, Weight, BMI, Physical_Activity, Smoking_Status,
        Alcohol_Consumption, Diet, Blood_Pressure, Cholesterol,
        Diabetes, Hypertension, Heart_Disease, Asthma
    """
    categorical_cols = [
        "Gender", "Physical_Activity", "Smoking_Status",
        "Alcohol_Consumption", "Diet", "Blood_Pressure",
    ]
    encoded = dict(sample)              # copy so the caller's dict is not mutated
    for col in categorical_cols:
        le = preprocessor[col]          # LabelEncoder for this column
        encoded[col] = le.transform([encoded[col]])[0]

    # Column order must match the order used when the scaler and model were fitted
    feature_order = [
        "Gender", "Height", "Weight", "BMI",
        "Physical_Activity", "Smoking_Status", "Alcohol_Consumption",
        "Diet", "Blood_Pressure", "Cholesterol",
        "Diabetes", "Hypertension", "Heart_Disease", "Asthma",
    ]
    X = np.array([[encoded[f] for f in feature_order]])
    X_scaled = scaler.transform(X)
    return float(model.predict(X_scaled)[0])


sample = {
    "Gender": "Male",
    "Height": 175,
    "Weight": 75,
    "BMI": 24.5,
    "Physical_Activity": "Medium",
    "Smoking_Status": "Never",
    "Alcohol_Consumption": "Moderate",
    "Diet": "Good",
    "Blood_Pressure": "Normal",
    "Cholesterol": 190,
    "Diabetes": 0,
    "Hypertension": 0,
    "Heart_Disease": 0,
    "Asthma": 0,
}

prediction = encode_and_predict(sample)
print(f"Predicted life expectancy: {prediction:.1f} years")
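The sample above supplies Height, Weight, and BMI as separate fields. The model card does not say how BMI was computed in the training data; assuming the standard definition (weight in kg divided by height in metres squared), a small hypothetical helper such as `bmi_from_height_weight` can keep the three fields consistent:

```python
def bmi_from_height_weight(height_cm: float, weight_kg: float) -> float:
    """Standard BMI: weight (kg) / height (m)^2.

    Hypothetical helper for illustration; it is not part of the model repo.
    """
    if height_cm <= 0 or weight_kg <= 0:
        raise ValueError("height and weight must be positive")
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


# Height and weight taken from the sample input above.
bmi = bmi_from_height_weight(175, 75)
print(f"BMI: {bmi:.1f}")  # prints "BMI: 24.5", matching the sample's BMI field
```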

Download from the Hub

from huggingface_hub import hf_hub_download
import joblib

model = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="gradient_boosting_model.pkl")
)
scaler = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="scaler.pkl")
)
preprocessor = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="preprocessor.pkl")
)

Input Features

Feature	Type	Values / Range	Description
`Gender`	categorical	Male / Female	Biological sex
`Height`	numerical	cm	Body height
`Weight`	numerical	kg	Body weight
`BMI`	numerical	continuous	Body Mass Index
`Physical_Activity`	categorical	Low / Medium / High	Exercise level
`Smoking_Status`	categorical	Never / Former / Current	Smoking history
`Alcohol_Consumption`	categorical	None / Moderate / Heavy	Alcohol intake
`Diet`	categorical	Poor / Average / Good	Overall diet quality
`Blood_Pressure`	categorical	Low / Normal / High	Blood pressure category
`Cholesterol`	numerical	mg/dL	Total cholesterol level
`Diabetes`	binary	0 / 1	Diabetes diagnosis flag
`Hypertension`	binary	0 / 1	Hypertension diagnosis flag
`Heart_Disease`	binary	0 / 1	Heart disease diagnosis flag
`Asthma`	binary	0 / 1	Asthma diagnosis flag
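Because the encoders only accept the labels they were fitted on, it can help to check an input against the table above before encoding it. The following is a minimal sketch, assuming the labels stored in preprocessor.pkl match the table exactly; `validate_sample` and the constants are hypothetical names, not part of the repo.

```python
ALLOWED_CATEGORIES = {
    "Gender": {"Male", "Female"},
    "Physical_Activity": {"Low", "Medium", "High"},
    "Smoking_Status": {"Never", "Former", "Current"},
    "Alcohol_Consumption": {"None", "Moderate", "Heavy"},
    "Diet": {"Poor", "Average", "Good"},
    "Blood_Pressure": {"Low", "Normal", "High"},
}
NUMERICAL = ("Height", "Weight", "BMI", "Cholesterol")
BINARY = ("Diabetes", "Hypertension", "Heart_Disease", "Asthma")


def validate_sample(sample: dict) -> list:
    """Return a list of problems; an empty list means the sample matches the table."""
    problems = []
    for col, allowed in ALLOWED_CATEGORIES.items():
        if col not in sample:
            problems.append(f"missing feature: {col}")
        elif sample[col] not in allowed:
            problems.append(f"{col}: {sample[col]!r} not in {sorted(allowed)}")
    for col in NUMERICAL:
        value = sample.get(col)
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{col}: expected a number, got {value!r}")
    for col in BINARY:
        value = sample.get(col)
        if isinstance(value, bool) or value not in (0, 1):
            problems.append(f"{col}: expected 0 or 1, got {value!r}")
    return problems


good = {
    "Gender": "Male", "Height": 175, "Weight": 75, "BMI": 24.5,
    "Physical_Activity": "Medium", "Smoking_Status": "Never",
    "Alcohol_Consumption": "Moderate", "Diet": "Good",
    "Blood_Pressure": "Normal", "Cholesterol": 190,
    "Diabetes": 0, "Hypertension": 0, "Heart_Disease": 0, "Asthma": 0,
}
bad = dict(good, Diet="Excellent", Asthma=2)

print(validate_sample(good))  # []
for problem in validate_sample(bad):
    print(problem)
```

Returning a list of problems instead of raising on the first one lets a calling application report every invalid field at once.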

Output

A single continuous float representing predicted life expectancy in years.

Training Details

Dataset

Size: 10,002 records
Split: 68 % train / 10 % validation / 22 % test
Target variable: Age (life expectancy in years)
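The card does not state how the split was produced (shuffling, stratification, or seed). As a rough illustration of what the stated proportions mean in row counts, here is a standard-library sketch that shuffles row indices and cuts them 68 / 10 / 22; `split_indices` and the seed value are assumptions made for this example.

```python
import random


def split_indices(n_records, train_frac=0.68, val_frac=0.10, seed=42):
    """Shuffle row indices and cut them into train / validation / test lists.

    The test set receives whatever remains after train and validation.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # seed is an assumption, for repeatability
    n_train = round(n_records * train_frac)
    n_val = round(n_records * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(10_002)
print(len(train_idx), len(val_idx), len(test_idx))  # 6801 1000 2201
```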

Preprocessing

Fill missing categorical values with "None".
LabelEncoder applied per categorical column (encoders saved in preprocessor.pkl).
StandardScaler applied to all 14 features after encoding (saved in scaler.pkl).
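The three steps above can be sketched in plain Python to show what each one does to a single column. This is an illustration, not the author's training code: it mimics two scikit-learn behaviours (LabelEncoder assigns integer codes in sorted label order, and StandardScaler divides by the population standard deviation), and the toy column is invented.

```python
from statistics import fmean, pstdev


def fill_missing(values, placeholder="None"):
    """Step 1: replace missing categorical entries with the string "None"."""
    return [placeholder if v is None else v for v in values]


def fit_label_codes(values):
    """Step 2: map each distinct label to an integer, in sorted label order
    (the ordering scikit-learn's LabelEncoder uses)."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def standardize(values):
    """Step 3: z-score using the population standard deviation (ddof=0).

    A zero standard deviation is replaced by 1.0 so a constant column
    does not cause a division by zero.
    """
    mean = fmean(values)
    std = pstdev(values) or 1.0
    return [(v - mean) / std for v in values]


# Toy column, invented for illustration only.
alcohol = fill_missing(["Moderate", None, "Heavy", "Moderate"])
codes = fit_label_codes(alcohol)
encoded = [codes[v] for v in alcohol]
scaled = standardize(encoded)

print(codes)    # {'Heavy': 0, 'Moderate': 1, 'None': 2}
print(encoded)  # [1, 2, 0, 1]
print([round(z, 3) for z in scaled])
```

Note that the integer codes follow alphabetical order, not severity, so they carry no ordinal meaning of their own.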

Primary Model — GradientBoostingRegressor

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)

Baseline Model — LinearRegression

A standard LinearRegression is also provided (linear_model.pkl) for interpretability and benchmarking.

Performance

Metric	Value
R² (test set)	0.85 – 0.92
RMSE	3 – 5 years
Confidence score	0.87

Metrics are on the held-out test split (~22 % of 10 k records).
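For readers who want to recompute these metrics on their own split, the standard definitions are R² = 1 − SS_res / SS_tot and RMSE = sqrt(mean squared error). The sketch below implements both with the standard library; the ages and predictions are made up and are not results from this model.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (years)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Made-up values, for illustration only.
y_true = [60.0, 70.0, 80.0, 90.0]
y_pred = [62.0, 68.0, 83.0, 89.0]

print(f"R2:   {r2_score(y_true, y_pred):.3f}")   # R2:   0.964
print(f"RMSE: {rmse(y_true, y_pred):.2f} years")  # RMSE: 2.12 years
```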

Limitations

The model is trained on a synthetic / illustrative dataset; real-world generalization is not guaranteed.
It does not account for socioeconomic factors, genetics, geography, or environmental exposures.
Categorical label encodings are order-sensitive — always use the supplied preprocessor.pkl rather than re-encoding independently.
Predictions for feature combinations far outside the training distribution may be unreliable.

Ethical Considerations

Not a medical device. Do not use predictions to make clinical, insurance, or policy decisions.
Fairness: The model may reflect biases present in the training data. Subgroup performance (by gender, age bracket, etc.) has not been audited.
Privacy: No personal data is stored or transmitted by this model artifact; ensure your application handles user health data in compliance with applicable regulations (HIPAA, GDPR, etc.).

Citation

If you use this model in your research or application, please cite:

@misc{lebiraja2024lifeexpectancy,
  author       = {lebiraja},
  title        = {Life Expectancy Predictor},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/lebiraja/life-expectancy-predictor}},
}

License

MIT

Evaluation results

Metric	Value	Source
R² Score	0.870	self-reported
RMSE (years)	4.000	self-reported