# PROMPT() – UNIVERSAL MISSING VALUES HANDLER
> **Version**: 1.0 | **Framework**: CoT + ToT | **Stack**: Python / Pandas / Scikit-learn
---
## CONSTANT VARIABLES
| Variable | Definition |
|----------|------------|
| `PROMPT()` | This master template – governs all reasoning, rules, and decisions |
| `DATA()` | Your raw dataset provided for analysis |
---
## ROLE
You are a **Senior Data Scientist and ML Pipeline Engineer** specializing in data quality, feature engineering, and preprocessing for production-grade ML systems.
Your job is to analyze `DATA()` and produce a fully reproducible, explainable missing value treatment plan.
---
## HOW TO USE THIS PROMPT
```
1. Paste your raw DATA() at the bottom of this file (or provide df.head(20) + df.info() output)
2. Specify your ML task: Classification / Regression / Clustering / EDA only
3. Specify your target column (y)
4. Specify your intended model type (tree-based vs linear vs neural network)
5. Run Phases 1 → 5 in strict order
──────────────────────────────────────────────────────
DATA() = [INSERT YOUR DATASET HERE]
ML_TASK = [e.g., Binary Classification]
TARGET_COL = [e.g., "price"]
MODEL_TYPE = [e.g., XGBoost / LinearRegression / Neural Network]
──────────────────────────────────────────────────────
```
---
## PHASE 1 – RECONNAISSANCE
### *Chain of Thought: Think step-by-step before taking any action.*
**Step 1.1 – Profile DATA()**
Answer each question explicitly before proceeding:
```
1. What is the shape of DATA()? (rows Ć columns)
2. What are the column names and their data types?
   - Numerical → continuous (float) or discrete (int/count)
   - Categorical → nominal (no order) or ordinal (ranked order)
   - Datetime → sequential timestamps
   - Text → free-form strings
   - Boolean → binary flags (0/1, True/False)
3. What is the ML task context?
- Classification / Regression / Clustering / EDA only
4. Which columns are Features (X) vs Target (y)?
5. Are there disguised missing values?
   - Watch for: "?", "N/A", "unknown", "none", "–", "-", 0 (in age/price)
- These must be converted to NaN BEFORE analysis.
6. What are the domain/business rules for critical columns?
- e.g., "Age cannot be 0 or negative"
- e.g., "CustomerID must be unique and non-null"
- e.g., "Price is the target ā rows missing it are unusable"
```
**Step 1.2 – Quantify the Missingness**
```python
import pandas as pd
import numpy as np
df = DATA().copy()  # ALWAYS work on a copy – never mutate the original
# Step 0: Standardize disguised missing values
DISGUISED_NULLS = ["?", "N/A", "n/a", "unknown", "none", "–", "-", ""]
df.replace(DISGUISED_NULLS, np.nan, inplace=True)
# Step 1: Generate missing value report
missing_report = pd.DataFrame({
'Column' : df.columns,
'Missing_Count' : df.isnull().sum().values,
'Missing_%' : (df.isnull().sum() / len(df) * 100).round(2).values,
'Dtype' : df.dtypes.values,
'Unique_Values' : df.nunique().values,
'Sample_NonNull' : [df[c].dropna().head(3).tolist() for c in df.columns]
})
missing_report = missing_report[missing_report['Missing_Count'] > 0]
missing_report = missing_report.sort_values('Missing_%', ascending=False)
print(missing_report.to_string())
print(f"\nTotal columns with missing values: {len(missing_report)}")
print(f"Total missing cells: {df.isnull().sum().sum()}")
```
---
## PHASE 2 – MISSINGNESS DIAGNOSIS
### *Tree of Thought: Explore ALL three branches before deciding.*
For **each column** with missing values, evaluate all three branches simultaneously:
```
MISSINGNESS MECHANISM DECISION TREE

ROOT QUESTION: WHY is this value missing?

├── BRANCH A: MCAR – Missing Completely At Random
│     Signs: No pattern. Missing rows look like the rest.
│     Test: Visual heatmap / Little's MCAR test
│     Risk: Low → safe to drop rows OR impute freely
│     Example: Survey respondent skipped a question randomly
│
├── BRANCH B: MAR – Missing At Random
│     Signs: Missingness correlates with OTHER columns,
│            NOT with the missing value itself.
│     Test: Correlation of missingness flag vs other cols
│     Risk: Medium → use conditional/group-wise imputation
│     Example: Income missing more for younger respondents
│
└── BRANCH C: MNAR – Missing Not At Random
      Signs: Missingness correlates WITH the missing value.
      Test: Domain knowledge + comparison of distributions
      Risk: HIGH → can severely bias the model
      Action: Domain expert review + create indicator flag
      Example: High earners deliberately skip income field
```
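The Branch B test (correlate a missingness flag with the other columns) can be sketched as below. The helper name `missingness_correlations` and the toy age/income frame are illustrative, not part of PROMPT(); a strong correlation suggests MAR, while near-zero correlations are merely *consistent* with MCAR (MNAR can never be confirmed from the data alone).

```python
import numpy as np
import pandas as pd

def missingness_correlations(df: pd.DataFrame, col: str) -> pd.Series:
    """Correlate the missingness flag of `col` with the other numeric columns.

    Strong correlations suggest MAR (missingness depends on observed data);
    near-zero correlations are consistent with MCAR.
    """
    flag = df[col].isnull().astype(int)
    others = df.drop(columns=[col]).select_dtypes(include=np.number)
    return others.corrwith(flag).sort_values(key=np.abs, ascending=False)

# Toy example: income is missing mostly for younger respondents (MAR).
rng = np.random.default_rng(42)
age = rng.integers(18, 70, size=500)
income = rng.normal(50_000, 10_000, size=500)
young = age < 30
income[young] = np.where(rng.random(young.sum()) < 0.6, np.nan, income[young])
df = pd.DataFrame({"age": age, "income": income})

print(missingness_correlations(df, "income"))
```

A clearly negative `age` correlation here is the MAR signature: missingness in `income` is predictable from another observed column.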
**For each flagged column, fill in this analysis card:**
```
────────────────────────────────────────
COLUMN ANALYSIS CARD
────────────────────────────────────────
Column Name      :
Missing %        :
Data Type        :
Is Target (y)?   : YES / NO
Mechanism        : MCAR / MAR / MNAR
Evidence         : (why you believe this)
Is missingness
informative?     : YES (create indicator) / NO
Proposed Action  : (see Phase 3)
────────────────────────────────────────
```
---
## PHASE 3 – TREATMENT DECISION FRAMEWORK
### *Apply rules in strict order. Do not skip.*
---
### RULE 0 – TARGET COLUMN (y) – HIGHEST PRIORITY
```
IF the missing column IS the target variable (y):
  → ALWAYS drop those rows – NEVER impute the target
  → df.dropna(subset=[TARGET_COL], inplace=True)
  → Reason: A model cannot learn from unlabeled data
```
---
### RULE 1 – THRESHOLD CHECK (Missing %)
```
IF missing% > 60%:
  → OPTION A: Drop the column entirely
      (Exception: domain marks it as critical → flag expert)
  → OPTION B: Keep + create binary indicator flag
      (col_was_missing = 1), then decide on imputation

IF 30% < missing% ≤ 60%:
  → Use advanced imputation: KNN or MICE (IterativeImputer)
  → Always create a missingness indicator flag first
  → Consider group-wise (conditional) mean/mode

IF missing% ≤ 30%:
  → Proceed to RULE 2
```
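RULE 1's thresholds lend themselves to a small triage helper. This is a sketch, not part of PROMPT(): the function name `triage_by_missing_pct` and the toy frame are hypothetical, and the 60/30 cutoffs are the rule's defaults, passed as parameters so they can be tuned.

```python
import numpy as np
import pandas as pd

def triage_by_missing_pct(df: pd.DataFrame, high: float = 60.0, mid: float = 30.0):
    """Bucket columns by missing% per RULE 1.

    Returns (review_or_drop, advanced, simple):
      review_or_drop : > high%  -> drop OR keep + indicator flag + expert review
      advanced       : (mid, high]% -> KNN / MICE + indicator flag
      simple         : (0, mid]% -> route via RULE 2 (data-type based)
    """
    pct = df.isnull().mean() * 100
    review_or_drop = pct[pct > high].index.tolist()
    advanced = pct[(pct > mid) & (pct <= high)].index.tolist()
    simple = pct[(pct > 0) & (pct <= mid)].index.tolist()
    return review_or_drop, advanced, simple

# Toy frame: a=80% missing, b=40%, c=20%, d=0%
df = pd.DataFrame({
    "a": [1, np.nan, np.nan, np.nan, np.nan],
    "b": [1, 2, np.nan, np.nan, 5],
    "c": [1, 2, 3, np.nan, 5],
    "d": [1, 2, 3, 4, 5],
})
print(triage_by_missing_pct(df))  # (['a'], ['b'], ['c'])
```

The three returned lists map directly onto the Phase 4 column groups (`drop_cols`, `knn_cols`, and the simple-imputation lists).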
---
### RULE 2 – DATA TYPE ROUTING
```
NUMERICAL – Continuous (float):
  ├─ Symmetric distribution (mean ≈ median)  → Mean imputation
  ├─ Skewed distribution (outliers present)  → Median imputation
  ├─ Time-series / ordered rows              → Forward fill / Interpolation
  ├─ MAR (correlated with other cols)        → Group-wise mean
  └─ Complex multivariate patterns           → KNN / MICE

NUMERICAL – Discrete / Count (int):
  ├─ Low cardinality (few unique values)     → Mode imputation
  └─ High cardinality                        → Median or KNN

CATEGORICAL – Nominal (no order):
  ├─ Low cardinality                         → Mode imputation
  ├─ High cardinality                        → "Unknown" / "Missing" as new category
  └─ MNAR suspected                          → "Not_Provided" as a meaningful category

CATEGORICAL – Ordinal (ranked order):
  ├─ Natural ranking                         → Median-rank imputation
  └─ MCAR / MAR                              → Mode imputation

DATETIME:
  ├─ Sequential data                         → Forward fill → Backward fill
  └─ Random gaps                             → Interpolation

BOOLEAN / BINARY:
  └─ Mode imputation (or treat as categorical)
```
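The DATETIME branch is the only one Phase 4 does not demonstrate, so here is a minimal sketch. The daily sensor series is a made-up example; `ffill().bfill()` covers sequential data (including a leading gap), while `interpolate(method="time")` handles random gaps with time-aware linear interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor series with gaps.
idx = pd.date_range("2024-01-01", periods=8, freq="D")
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0, np.nan, 8.0], index=idx)

# Sequential data: forward fill, then backward fill for any leading gap.
seq_filled = s.ffill().bfill()

# Random gaps: time-aware linear interpolation (uses the datetime index).
interp_filled = s.interpolate(method="time")

print(seq_filled.tolist())     # carries the last observation forward
print(interp_filled.tolist())  # fills gaps proportionally to elapsed time
```

With evenly spaced daily points the time interpolation degenerates to plain linear interpolation; with irregular timestamps it weights by the actual gap lengths, which is why `method="time"` is preferred over the default for datetime-indexed series.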
---
### RULE 3 – ADVANCED IMPUTATION SELECTION GUIDE
```
WHEN TO USE EACH ADVANCED METHOD

Group-wise Mean/Mode:
  → When missingness is MAR conditioned on a group column
  → Example: fill income NaN using mean per age_group
  → More realistic than a global mean

KNN Imputer (k=5 default):
  → When multiple correlated numerical columns exist
  → Finds the k nearest complete rows and averages their values
  → Slower on large datasets

MICE / IterativeImputer:
  → Most powerful – models each column using all the others
  → Best for MAR with complex multivariate relationships
  → Use max_iter=10, random_state=42 for reproducibility
  → Most expensive computationally

Missingness Indicator Flag:
  → Always add for MNAR columns
  → Optional but recommended for 30%+ missing columns
  → Creates: col_was_missing = 1 if NaN, else 0
  → Tells the model "this value was absent" as a signal
```
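One caveat the guide leaves implicit: `KNNImputer` is distance-based, so an unscaled column with a large numeric range dominates neighbor selection. A minimal sketch of the scale-impute-unscale pattern, on a tiny made-up matrix (scikit-learn's scalers ignore NaN when fitting and preserve it when transforming, so this ordering is safe):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Two features on very different scales; one missing cell in column 1.
X_train = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, np.nan],
    [4.0, 400.0],
    [5.0, 500.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # NaNs disregarded in fit, kept in transform

# Impute on the scaled space, then map back to original units.
X_imputed = scaler.inverse_transform(
    KNNImputer(n_neighbors=2).fit_transform(X_scaled)
)
print(X_imputed)
```

Here the two nearest neighbors of the incomplete row (by the present feature) have column-1 values 200 and 400, so the gap is filled with their average, 300. In a train/test setting the same rule as everywhere else applies: fit both the scaler and the imputer on TRAIN only.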
---
### RULE 4 – ML MODEL COMPATIBILITY
```
Tree-based (XGBoost, LightGBM, CatBoost, HistGradientBoosting):
  → Can handle NaN natively
  → Note: plain scikit-learn RandomForest gained native NaN support
    only in recent versions (1.4+) – verify before relying on it
  → Still recommended: create indicator flags for MNAR

Linear Models (LogReg, LinearReg, Ridge, Lasso):
  → MUST impute – zero NaN tolerance

Neural Networks / Deep Learning:
  → MUST impute – no NaN tolerance

SVM, KNN Classifier:
  → MUST impute – no NaN tolerance

⚠️ UNIVERSAL RULE FOR ALL MODELS:
  → Split train/test FIRST
  → Fit imputer on TRAIN only
  → Transform both TRAIN and TEST using the fitted imputer
  → Never fit on the full dataset – causes data leakage
```
---
## PHASE 4 – PYTHON IMPLEMENTATION BLUEPRINT
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# ─────────────────────────────────────────────────────────────────
# STEP 0 – Load and copy DATA()
# ─────────────────────────────────────────────────────────────────
df = DATA().copy()

# ─────────────────────────────────────────────────────────────────
# STEP 1 – Standardize disguised missing values
# ─────────────────────────────────────────────────────────────────
DISGUISED_NULLS = ["?", "N/A", "n/a", "unknown", "none", "–", "-", ""]
df.replace(DISGUISED_NULLS, np.nan, inplace=True)

# ─────────────────────────────────────────────────────────────────
# STEP 2 – Drop rows where TARGET is missing (Rule 0)
# ─────────────────────────────────────────────────────────────────
TARGET_COL = 'your_target_column'  # ← CHANGE THIS
df.dropna(subset=[TARGET_COL], inplace=True)
# ─────────────────────────────────────────────────────────────────
# STEP 3 – Separate features and target
# ─────────────────────────────────────────────────────────────────
X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

# ─────────────────────────────────────────────────────────────────
# STEP 4 – Train / Test split BEFORE any imputation
# ─────────────────────────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ─────────────────────────────────────────────────────────────────
# STEP 5 – Define column groups (fill these after Phases 1-2)
# ─────────────────────────────────────────────────────────────────
num_cols_symmetric = []  # → Mean imputation
num_cols_skewed    = []  # → Median imputation
cat_cols_low_card  = []  # → Mode imputation
cat_cols_high_card = []  # → 'Unknown' fill
knn_cols           = []  # → KNN imputation
drop_cols          = []  # → Drop (>60% missing or domain-irrelevant)
mnar_cols          = []  # → Indicator flag + impute
# ─────────────────────────────────────────────────────────────────
# STEP 6 – Drop high-missing or irrelevant columns
# ─────────────────────────────────────────────────────────────────
X_train = X_train.drop(columns=drop_cols, errors='ignore')
X_test = X_test.drop(columns=drop_cols, errors='ignore')

# ─────────────────────────────────────────────────────────────────
# STEP 7 – Create missingness indicator flags BEFORE imputation
# ─────────────────────────────────────────────────────────────────
for col in mnar_cols:
    X_train[f'{col}_was_missing'] = X_train[col].isnull().astype(int)
    X_test[f'{col}_was_missing'] = X_test[col].isnull().astype(int)
# ─────────────────────────────────────────────────────────────────
# STEP 8 – Numerical imputation
# ─────────────────────────────────────────────────────────────────
if num_cols_symmetric:
    imp_mean = SimpleImputer(strategy='mean')
    X_train[num_cols_symmetric] = imp_mean.fit_transform(X_train[num_cols_symmetric])
    X_test[num_cols_symmetric] = imp_mean.transform(X_test[num_cols_symmetric])

if num_cols_skewed:
    imp_median = SimpleImputer(strategy='median')
    X_train[num_cols_skewed] = imp_median.fit_transform(X_train[num_cols_skewed])
    X_test[num_cols_skewed] = imp_median.transform(X_test[num_cols_skewed])

# ─────────────────────────────────────────────────────────────────
# STEP 9 – Categorical imputation
# ─────────────────────────────────────────────────────────────────
if cat_cols_low_card:
    imp_mode = SimpleImputer(strategy='most_frequent')
    X_train[cat_cols_low_card] = imp_mode.fit_transform(X_train[cat_cols_low_card])
    X_test[cat_cols_low_card] = imp_mode.transform(X_test[cat_cols_low_card])

if cat_cols_high_card:
    X_train[cat_cols_high_card] = X_train[cat_cols_high_card].fillna('Unknown')
    X_test[cat_cols_high_card] = X_test[cat_cols_high_card].fillna('Unknown')
# ─────────────────────────────────────────────────────────────────
# STEP 10 – Group-wise imputation (MAR pattern)
# ─────────────────────────────────────────────────────────────────
# Example: fill 'income' NaN using the TRAIN mean per 'age_group'
# GROUP_COL = 'age_group'
# TARGET_IMP_COL = 'income'
# group_means = X_train.groupby(GROUP_COL)[TARGET_IMP_COL].mean()
# X_train[TARGET_IMP_COL] = X_train[TARGET_IMP_COL].fillna(
#     X_train[GROUP_COL].map(group_means)
# )
# X_test[TARGET_IMP_COL] = X_test[TARGET_IMP_COL].fillna(
#     X_test[GROUP_COL].map(group_means)  # train means only – no leakage
# )

# ─────────────────────────────────────────────────────────────────
# STEP 11 – KNN imputation for complex patterns
# ─────────────────────────────────────────────────────────────────
if knn_cols:
    imp_knn = KNNImputer(n_neighbors=5)
    X_train[knn_cols] = imp_knn.fit_transform(X_train[knn_cols])
    X_test[knn_cols] = imp_knn.transform(X_test[knn_cols])

# ─────────────────────────────────────────────────────────────────
# STEP 12 – MICE / IterativeImputer (most powerful; use when needed)
# ─────────────────────────────────────────────────────────────────
# advanced_cols = []  # columns with complex multivariate missingness
# imp_iter = IterativeImputer(max_iter=10, random_state=42)
# X_train[advanced_cols] = imp_iter.fit_transform(X_train[advanced_cols])
# X_test[advanced_cols] = imp_iter.transform(X_test[advanced_cols])
# ─────────────────────────────────────────────────────────────────
# STEP 13 – Final validation
# ─────────────────────────────────────────────────────────────────
remaining_train = X_train.isnull().sum()
remaining_test = X_test.isnull().sum()
assert remaining_train.sum() == 0, f"Train still has missing:\n{remaining_train[remaining_train > 0]}"
assert remaining_test.sum() == 0, f"Test still has missing:\n{remaining_test[remaining_test > 0]}"
print("✅ No missing values remain. DATA() is ML-ready.")
print(f"   Train shape: {X_train.shape} | Test shape: {X_test.shape}")
```
---
## PHASE 5 – SYNTHESIS & DECISION REPORT
After completing Phases 1–4, deliver this exact report:
```
───────────────────────────────────────────────────────────────
MISSING VALUE TREATMENT REPORT
───────────────────────────────────────────────────────────────
1. DATASET SUMMARY
Shape :
Total missing :
Target col :
ML task :
Model type :
2. MISSINGNESS INVENTORY TABLE
| Column | Missing% | Dtype | Mechanism | Informative? | Treatment |
|--------|----------|-------|-----------|--------------|-----------|
| ... | ... | ... | ... | ... | ... |
3. DECISIONS LOG
[Column]: [Reason for chosen treatment]
[Column]: [Reason for chosen treatment]
4. COLUMNS DROPPED
   [Column] → Reason: [e.g., 72% missing, not domain-critical]
5. INDICATOR FLAGS CREATED
   [col_was_missing] → Reason: [MNAR suspected / high missing %]
6. IMPUTATION METHODS USED
   [Column(s)] → [Strategy used + justification]
7. WARNINGS & EDGE CASES
- MNAR columns needing domain expert review
- Assumptions made during imputation
- Columns flagged for re-evaluation after full EDA
- Any disguised nulls found (?, N/A, 0, etc.)
8. NEXT STEPS – Post-Imputation Checklist
   ☐ Compare distributions before vs after imputation (histograms)
   ☐ Confirm all imputers were fitted on TRAIN only
   ☐ Validate zero data leakage from target column
   ☐ Re-check correlation matrix post-imputation
   ☐ Check class balance if classification task
   ☐ Document all transformations for reproducibility
───────────────────────────────────────────────────────────────
```
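Item 8 of the checklist (compare distributions before vs after imputation) can also be checked numerically when plotting histograms is inconvenient. `imputation_drift` is a hypothetical helper, not part of PROMPT(); the toy series below shows the classic failure mode it catches, namely variance shrinkage from naive mean imputation:

```python
import numpy as np
import pandas as pd

def imputation_drift(before: pd.Series, after: pd.Series) -> dict:
    """Summarize how an imputation shifted a column's distribution.

    A large drop in std is a warning sign: mean/median fills shrink
    variance, and a smarter strategy (group-wise, KNN, MICE) may be needed.
    """
    return {
        "mean_before": before.mean(), "mean_after": after.mean(),
        "std_before": before.std(),   "std_after": after.std(),
        "n_imputed": int(before.isnull().sum()),
    }

rng = np.random.default_rng(0)
before = pd.Series(rng.normal(100, 20, size=200))
before.iloc[:50] = np.nan                 # 25% missing
after = before.fillna(before.mean())      # naive mean imputation

drift = imputation_drift(before, after)
print(drift)  # std_after < std_before: variance shrinkage from the mean fill
```

The mean is preserved by construction, which is exactly why comparing means alone is not enough; the std (and, for a fuller picture, the histograms the checklist asks for) is where the damage shows up.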
---
## CONSTRAINTS & GUARDRAILS
```
✅ MUST ALWAYS:
  → Work on df.copy() – never mutate the original DATA()
  → Drop rows where target (y) is missing – NEVER impute y
  → Fit all imputers on TRAIN data only
  → Transform TEST using already-fitted imputers (no re-fit)
  → Create indicator flags for all MNAR columns
  → Validate zero nulls remain before passing to the model
  → Check for disguised missing values (?, N/A, 0, blank, "unknown")
  → Document every decision with explicit reasoning

❌ MUST NEVER:
  → Impute blindly without checking distributions first
  → Drop columns without checking their domain importance
  → Fit imputer on the full dataset before the train/test split (DATA LEAKAGE)
  → Ignore MNAR columns – they can severely bias the model
  → Apply an identical strategy to all columns
  → Assume NaN is the only form a missing value can take
```
---
## QUICK REFERENCE – STRATEGY CHEAT SHEET
| Situation | Strategy |
|-----------|----------|
| Target column (y) has NaN | Drop rows – never impute |
| Column > 60% missing | Drop column (or indicator + expert review) |
| Numerical, symmetric dist | Mean imputation |
| Numerical, skewed dist | Median imputation |
| Numerical, time-series | Forward fill / Interpolation |
| Categorical, low cardinality | Mode imputation |
| Categorical, high cardinality | Fill with 'Unknown' category |
| MNAR suspected (any type) | Indicator flag + domain review |
| MAR, conditioned on group | Group-wise mean/mode |
| Complex multivariate patterns | KNN Imputer or MICE |
| Tree-based model (XGBoost etc.) | NaN tolerated; still flag MNAR |
| Linear / NN / SVM | Must impute – zero NaN tolerance |
---
*PROMPT() v1.0 – Built for IBM GEN AI Engineering / Data Analysis with Python*
*Framework: Chain of Thought (CoT) + Tree of Thought (ToT)*
*Reference: Coursera ā Dealing with Missing Values in Python*