Life Expectancy Predictor
A Gradient Boosting Regressor trained on 14 health and lifestyle features to predict a person's life expectancy in years. Built with scikit-learn and wrapped in a production-ready FastAPI service.
Model Description
This model takes a snapshot of an individual's health profile — including physical attributes, lifestyle habits, and medical history — and returns a predicted life expectancy in years. The primary model is a GradientBoostingRegressor achieving an R² of ~0.87 on the held-out test set. A baseline LinearRegression model is also included for comparison.
| Artifact | File | Description |
|---|---|---|
| Primary model | gradient_boosting_model.pkl | GradientBoostingRegressor (472 KB) |
| Baseline model | linear_model.pkl | LinearRegression (4 KB) |
| Feature scaler | scaler.pkl | StandardScaler for all features |
| Categorical encoder | preprocessor.pkl | LabelEncoder mapping for categorical inputs |
Intended Use
- Research & education: understanding which health factors most affect life expectancy.
- Health-tech prototypes: powering wellness apps or patient-facing dashboards.
- Academic exploration: studying gradient boosting on tabular health data.
Not intended for: clinical diagnosis, medical decision-making, or any high-stakes healthcare decisions. Predictions are statistical estimates, not medical advice.
How to Use
Install dependencies
pip install "scikit-learn>=1.5.0" joblib numpy
Load and run inference
import joblib
import numpy as np
# Load artifacts
model = joblib.load("gradient_boosting_model.pkl")
scaler = joblib.load("scaler.pkl")
preprocessor = joblib.load("preprocessor.pkl") # dict of LabelEncoders
# --- Prepare a sample input ---
# Categorical columns and their LabelEncoders are stored in preprocessor.pkl
# Categorical features: Gender, Physical_Activity, Smoking_Status,
# Alcohol_Consumption, Diet, Blood_Pressure
def encode_and_predict(sample: dict) -> float:
    """
    sample keys (all required):
        Gender, Height, Weight, BMI, Physical_Activity, Smoking_Status,
        Alcohol_Consumption, Diet, Blood_Pressure, Cholesterol,
        Diabetes, Hypertension, Heart_Disease, Asthma
    """
    sample = dict(sample)  # work on a copy so the caller's dict is not mutated
    categorical_cols = [
        "Gender", "Physical_Activity", "Smoking_Status",
        "Alcohol_Consumption", "Diet", "Blood_Pressure",
    ]
    for col in categorical_cols:
        le = preprocessor[col]  # LabelEncoder for this column
        sample[col] = le.transform([sample[col]])[0]
    # Column order must match the order used when the scaler and model were fitted
    feature_order = [
        "Gender", "Height", "Weight", "BMI",
        "Physical_Activity", "Smoking_Status", "Alcohol_Consumption",
        "Diet", "Blood_Pressure", "Cholesterol",
        "Diabetes", "Hypertension", "Heart_Disease", "Asthma",
    ]
    X = np.array([[sample[f] for f in feature_order]])
    X_scaled = scaler.transform(X)
    return float(model.predict(X_scaled)[0])
sample = {
    "Gender": "Male",
    "Height": 175,
    "Weight": 75,
    "BMI": 24.5,
    "Physical_Activity": "Medium",
    "Smoking_Status": "Never",
    "Alcohol_Consumption": "Moderate",
    "Diet": "Good",
    "Blood_Pressure": "Normal",
    "Cholesterol": 190,
    "Diabetes": 0,
    "Hypertension": 0,
    "Heart_Disease": 0,
    "Asthma": 0,
}
prediction = encode_and_predict(sample)
print(f"Predicted life expectancy: {prediction:.1f} years")
Download from the Hub
from huggingface_hub import hf_hub_download
import joblib
model = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="gradient_boosting_model.pkl")
)
scaler = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="scaler.pkl")
)
preprocessor = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="preprocessor.pkl")
)
Input Features
| Feature | Type | Values / Range | Description |
|---|---|---|---|
| Gender | categorical | Male / Female | Biological sex |
| Height | numerical | cm | Body height |
| Weight | numerical | kg | Body weight |
| BMI | numerical | continuous | Body Mass Index |
| Physical_Activity | categorical | Low / Medium / High | Exercise level |
| Smoking_Status | categorical | Never / Former / Current | Smoking history |
| Alcohol_Consumption | categorical | None / Moderate / Heavy | Alcohol intake |
| Diet | categorical | Poor / Average / Good | Overall diet quality |
| Blood_Pressure | categorical | Low / Normal / High | Blood pressure category |
| Cholesterol | numerical | mg/dL | Total cholesterol level |
| Diabetes | binary | 0 / 1 | Diabetes diagnosis flag |
| Hypertension | binary | 0 / 1 | Hypertension diagnosis flag |
| Heart_Disease | binary | 0 / 1 | Heart disease diagnosis flag |
| Asthma | binary | 0 / 1 | Asthma diagnosis flag |
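Because the encoders only accept the category labels listed in the table, it can help to check a sample before passing it to the model. The sketch below is one way to do that. `validate_sample` and the constants are hypothetical helpers, not part of the released artifacts, and the "positive number" rule for numerical features is an assumption rather than a documented constraint.

```python
from __future__ import annotations

# Allowed labels copied from the feature table; everything else here is illustrative.
CATEGORICAL_VALUES = {
    "Gender": {"Male", "Female"},
    "Physical_Activity": {"Low", "Medium", "High"},
    "Smoking_Status": {"Never", "Former", "Current"},
    "Alcohol_Consumption": {"None", "Moderate", "Heavy"},
    "Diet": {"Poor", "Average", "Good"},
    "Blood_Pressure": {"Low", "Normal", "High"},
}
NUMERICAL_FEATURES = ("Height", "Weight", "BMI", "Cholesterol")
BINARY_FEATURES = ("Diabetes", "Hypertension", "Heart_Disease", "Asthma")


def validate_sample(sample: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the sample looks valid."""
    problems = []
    expected = set(CATEGORICAL_VALUES) | set(NUMERICAL_FEATURES) | set(BINARY_FEATURES)
    for name in sorted(expected - set(sample)):
        problems.append(f"missing feature: {name}")
    for name in sorted(set(sample) - expected):
        problems.append(f"unexpected feature: {name}")
    for name, allowed in CATEGORICAL_VALUES.items():
        if name in sample and sample[name] not in allowed:
            problems.append(f"{name}: {sample[name]!r} not in {sorted(allowed)}")
    for name in NUMERICAL_FEATURES:
        if name not in sample:
            continue
        value = sample[name]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        # Assumption: physical measurements must be strictly positive.
        if not is_number or value <= 0:
            problems.append(f"{name}: expected a positive number, got {value!r}")
    for name in BINARY_FEATURES:
        if name in sample and sample[name] not in (0, 1):
            problems.append(f"{name}: expected 0 or 1, got {sample[name]!r}")
    return problems
```

Calling `validate_sample(sample)` before `encode_and_predict(sample)` turns an opaque `LabelEncoder` error into a readable message such as `Diet: 'Excellent' not in ['Average', 'Good', 'Poor']`.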
Output
A single continuous float representing predicted life expectancy in years.
Training Details
Dataset
- Size: ~10,002 records
- Split: 68 % train / 10 % validation / 22 % test
- Target variable: Age (life expectancy in years)
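To make the split proportions concrete, the sketch below shuffles record indices and cuts them 68 / 10 / 22. It is an illustration only: the card does not say how the split was produced, so the function name `three_way_split`, the seed, and the rounding rule are all assumptions.

```python
import random


def three_way_split(n_records, train=0.68, val=0.10, seed=42):
    """Shuffle record indices and cut them into train / validation / test lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # seed is an assumed value, for reproducibility
    n_train = round(n_records * train)
    n_val = round(n_records * val)
    # Whatever remains after train and validation becomes the test set (~22 %).
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


train_idx, val_idx, test_idx = three_way_split(10_002)
print(len(train_idx), len(val_idx), len(test_idx))  # 6801 1000 2201
```

With 10,002 records this gives roughly 6,800 training rows, 1,000 validation rows, and 2,200 test rows.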
Preprocessing
- Fill missing categorical values with "None".
- LabelEncoder applied per categorical column (encoders saved in preprocessor.pkl).
- StandardScaler applied to all 14 features after encoding (saved in scaler.pkl).
Primary Model — GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)
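For readers studying how `n_estimators` and `learning_rate` interact, here is a toy pure-Python version of gradient boosting for squared error. It is a teaching sketch, not the scikit-learn implementation: it uses depth-1 stumps on a single feature instead of depth-5 trees on 14 features, the helper names are hypothetical, and the data are made up.

```python
from statistics import fmean


def fit_stump(xs, residuals):
    """Find the single-threshold split that minimises squared error on the residuals.

    Requires at least two distinct values in xs.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = fmean(left), fmean(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)


def fit_boosting(xs, ys, n_estimators=100, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = fmean(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        # Each stump contributes only a fraction (learning_rate) of its correction.
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


# Made-up toy data, unrelated to the real training set.
xs = [1, 2, 3, 4, 5, 6]
ys = [60.0, 62.0, 65.0, 70.0, 74.0, 80.0]
toy_model = fit_boosting(xs, ys)
print([round(predict(toy_model, x), 2) for x in xs])
```

A smaller `learning_rate` makes each tree's correction more cautious, so more estimators are generally needed to reach the same training error.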
Baseline Model — LinearRegression
A standard LinearRegression is also provided (linear_model.pkl) for interpretability and benchmarking.
Performance
| Metric | Value |
|---|---|
| R² (test set) | 0.85 – 0.92 |
| RMSE | 3 – 5 years |
| Confidence score | 0.87 |
Metrics are on the held-out test split (~22 % of 10 k records).
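For reference, the two headline metrics can be computed as in the minimal sketch below. The numbers are made-up toy values, not outputs of this model, and `rmse` and `r2_score` are local helpers that follow the standard definitions.

```python
from math import sqrt
from statistics import fmean


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (years)."""
    return sqrt(fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only.
y_true = [70, 80, 75, 65]
y_pred = [72, 78, 74, 68]
print(f"RMSE: {rmse(y_true, y_pred):.2f} years")  # RMSE: 2.12 years
print(f"R²:   {r2_score(y_true, y_pred):.3f}")    # R²:   0.856
```

An R² of 0.87 therefore means the model explains about 87 % of the variance in the target on the test split, while RMSE gives the typical error in years.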
Limitations
- The model is trained on a synthetic / illustrative dataset; real-world generalization is not guaranteed.
- It does not account for socioeconomic factors, genetics, geography, or environmental exposures.
- Categorical label encodings are order-sensitive; always use the supplied preprocessor.pkl rather than re-encoding independently.
- Predictions for feature combinations far outside the training distribution may be unreliable.
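The order-sensitivity point can be seen with a small pure-Python emulation of how scikit-learn's `LabelEncoder` assigns codes: by the sorted order of the labels it was fitted on. `fit_label_codes` is a hypothetical stand-in for the encoder, used here so the example runs without the saved artifacts.

```python
def fit_label_codes(values):
    """Assign integer codes in sorted label order, emulating LabelEncoder.fit."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Fitted on all three activity levels.
full = fit_label_codes(["Low", "Medium", "High"])
# Re-fitted on data in which "Low" never occurs.
partial = fit_label_codes(["Medium", "High"])

print(full)     # {'High': 0, 'Low': 1, 'Medium': 2}
print(partial)  # {'High': 0, 'Medium': 1}
```

The same label, `"Medium"`, maps to 2 in one encoding and 1 in the other, so a model trained on the first would misread inputs encoded with the second. Note also that the codes follow alphabetical order, not the natural Low < Medium < High ordering.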
Ethical Considerations
- Not a medical device. Do not use predictions to make clinical, insurance, or policy decisions.
- Fairness: The model may reflect biases present in the training data. Subgroup performance (by gender, age bracket, etc.) has not been audited.
- Privacy: No personal data is stored or transmitted by this model artifact; ensure your application handles user health data in compliance with applicable regulations (HIPAA, GDPR, etc.).
Citation
If you use this model in your research or application, please cite:
@misc{lebiraja2024lifeexpectancy,
author = {lebiraja},
title = {Life Expectancy Predictor},
year = {2024},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/lebiraja/life-expectancy-predictor}},
}
License
MIT
Evaluation results
- R² Score (self-reported): 0.870
- RMSE in years (self-reported): 4.000