This report builds, tunes and compares regression models to predict Melbourne property prices, then uses the best model to price the current For Sale supply.
Data snapshot: this report uses data updated on 2 May 2026.
The report consumes the cleaned dataset produced by eda_report.ipynb. All cleaning rules, outlier policies and encoding decisions were finalized in EDA and are loaded here from eda_decisions.json to keep the two reports in sync.
Three models are compared: Linear Regression with Ridge regularization (baseline), Random Forest, and XGBoost. The best model is then used to predict prices for all For Sale listings, with the current Year and Month injected at inference to obtain today-priced estimates. Two additional XGBoost quantile models (q=0.1, q=0.9) produce an 80% prediction interval for each listing.
Workflow: setup, problem framing, feature engineering, time-based 70/15/15 split, preprocessing pipeline, three model families, model comparison, SHAP interpretability, final retraining and For Sale inference, save artifacts.
1. Load cleaned data and EDA decisions
I load three things from the EDA output: the cleaned dataset (Sold + For Sale combined into one parquet), the JSON of decisions taken during EDA (split ratios, encoding rules, rare-type list, new-build categories), and a quick sanity check on group sizes.
Code
import warnings
warnings.filterwarnings("ignore")

import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import joblib

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb
import shap

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 200)
sns.set_style("whitegrid")
plt.rcParams["figure.dpi"] = 100
plt.rcParams["figure.figsize"] = (10, 5)

DATA_DIR = Path("report_data")
MODEL_DIR = Path("models")
MODEL_DIR.mkdir(exist_ok=True)

# To convert to html: quarto render report/ml_report.ipynb
print("Setup complete.")
Setup complete.
Code
# Load cleaned dataset and EDA decisions.
df = pd.read_parquet(DATA_DIR / "cleaned_data.parquet")
with open(DATA_DIR / "eda_decisions.json", "r") as f:
    decisions = json.load(f)

print(f"Loaded data: {df.shape}")
print(f"  Sold:     {(df['Status'] == 'Sold').sum():,}")
print(f"  For Sale: {(df['Status'] == 'For Sale').sum():,}")
print("\nDecision keys loaded:")
for k in decisions:
    print(f"  - {k}")
print(f"\nSnapshot date: {decisions['data_snapshot_date']}")
print(f"Split ratios: {decisions['split_ratios']}")
print(f"Target transform: {decisions['target_transform']}")
print(f"Inference Year/Month: {decisions['inference_year']}/{decisions['inference_month']}")
2. Problem Framing
Goal: predict Numeric_Price for For Sale listings.
Training set: Sold rows that have a valid (non-NaN) Numeric_Price. Rows with “Price Withheld” or with any other missing target are excluded from training.
Inference set: all For Sale rows.
Target: log1p(Numeric_Price). The log transform was justified in EDA Section 4.1 by the heavy right-skew of raw price.
Time-aware features: each Sold row uses its actual Year and Month at training time, letting the model absorb the 4-12% YoY inflation observed in EDA Section 6. For Sale rows have no transaction date by structural design, so at inference I inject Year = 2026, Month = 5 (the snapshot date) so predictions reflect today’s market level rather than the historical average.
Evaluation: I report RMSE, MAE, MAPE, and R² on both log scale and original AUD scale. RMSE on AUD is the primary metric because absolute pricing error is what matters for downstream decisions.
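As a quick illustration of why both scales are reported separately (toy numbers, not report data): the log-scale RMSE cannot simply be back-transformed with expm1 to recover the AUD-scale RMSE, so each scale is computed directly from its own residuals.

```python
import numpy as np

# Toy prices only, for illustration - not values from the report.
y_true_aud = np.array([600_000.0, 900_000.0, 1_500_000.0])
y_pred_aud = np.array([650_000.0, 850_000.0, 1_800_000.0])

# log1p forward transform, expm1 exact inverse.
y_true_log = np.log1p(y_true_aud)
y_pred_log = np.log1p(y_pred_aud)
assert np.allclose(np.expm1(y_true_log), y_true_aud)

rmse_log = np.sqrt(np.mean((y_true_log - y_pred_log) ** 2))
rmse_aud = np.sqrt(np.mean((y_true_aud - y_pred_aud) ** 2))

# Back-transforming the log RMSE gives a dimensionless ratio-like quantity,
# nothing close to the AUD error - hence two separate metric computations.
print(f"rmse_log={rmse_log:.4f}, rmse_aud={rmse_aud:,.0f}")
```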
Code
# Split the combined frame into training-eligible Sold and inference-target For Sale.
df_sold = df[df["Status"] == "Sold"].copy()
df_forsale = df[df["Status"] == "For Sale"].copy()

# Training set requires a non-NaN Numeric_Price.
df_train_pool = df_sold.dropna(subset=["Numeric_Price"]).copy()

print(f"Sold total: {len(df_sold):,}")
print(f"Sold with valid price: {len(df_train_pool):,} (training pool)")
print(f"For Sale (inference): {len(df_forsale):,}")
print(f"\nTraining pool date range: "
      f"{df_train_pool['Date_parsed'].min().date()} to "
      f"{df_train_pool['Date_parsed'].max().date()}")
print(f"\nTarget summary (raw AUD):")
print(df_train_pool["Numeric_Price"].describe(percentiles=[.25, .5, .75, .95]).round(0))
Sold total: 124,820
Sold with valid price: 114,594 (training pool)
For Sale (inference): 38,419
Training pool date range: 2005-12-01 to 2026-05-02
Target summary (raw AUD):
count 114594.0
mean 958510.0
std 701764.0
min 50000.0
25% 610000.0
50% 772295.0
75% 1081545.0
95% 2100000.0
max 19500000.0
Name: Numeric_Price, dtype: float64
3. Feature Engineering
I build the feature matrix in this order:
Apply the rare-type grouping (“Other”) and create the is_new_build flag identified in EDA.
Define the four feature buckets: numeric, time, binary flags, categorical.
Keep the engineering deterministic and reusable - the same function will run in production.
The frequency encoding for Suburb is fitted later, on the training fold only (Section 5), to avoid leakage.
About is_new_build: this flag marks properties belonging to four new-construction categories - “New House & Land”, “New Apartments / Off the Plan”, “New Home Designs”, and “New land”. EDA Section 4.3 found that these categories make up 27% of For Sale supply but only 0.3% of Sold history. Without the flag, the model would have very few training examples in these categories and would generalize poorly. The flag acts as a single binary signal the model can latch onto, regardless of how rare each specific new-build category is in training. The original Property_Type column still goes into the model as a one-hot categorical, so the flag adds information rather than replacing it.
Code
def add_engineered_features(d):
    """
    Add features derived from EDA decisions:
    - Group rare Property_Type categories into "Other".
    - Add is_new_build flag for the 4 new-build categories.
    is_land and out_of_metro are already present from EDA cleaning.
    """
    d = d.copy()
    # Group rare types into "Other".
    rare = set(decisions["rare_property_types"])
    d["Property_Type"] = d["Property_Type"].where(~d["Property_Type"].isin(rare), "Other")
    # New-build flag.
    new_build_types = set(decisions["new_build_types"])
    d["is_new_build"] = d["Property_Type"].isin(new_build_types).astype(int)
    return d

df_train_pool = add_engineered_features(df_train_pool)
df_forsale = add_engineered_features(df_forsale)

print("After feature engineering:")
print(f"  Training pool Property_Type distinct values: {df_train_pool['Property_Type'].nunique()}")
print(f"  For Sale Property_Type distinct values: {df_forsale['Property_Type'].nunique()}")
print(f"\nis_new_build share:")
print(f"  Training pool: {df_train_pool['is_new_build'].mean()*100:.1f}%")
print(f"  For Sale: {df_forsale['is_new_build'].mean()*100:.1f}%")
print(f"\nProperty_Type composition (training pool):")
print(df_train_pool["Property_Type"].value_counts().head(15))
After feature engineering:
Training pool Property_Type distinct values: 15
For Sale Property_Type distinct values: 15
is_new_build share:
Training pool: 0.3%
For Sale: 30.0%
Property_Type composition (training pool):
Property_Type
House 83147
Apartment / Unit / Flat 14233
Townhouse 8243
Vacant land 6701
Acreage / Semi-Rural 1138
Rural 378
Villa 260
New House & Land 235
Block of Units 72
Other 69
New Apartments / Off the Plan 48
Studio 37
New land 21
Unknown 8
New Home Designs 4
Name: count, dtype: int64
3.1 Define feature buckets
I separate features into four buckets so the preprocessing pipeline in Section 5 can apply the right transformation to each:
Numeric (continuous): need imputation and (for Linear/Ridge) scaling.
Time: Year and Month as integers; no scaling needed for trees, light scaling for Linear.
Binary flags: already 0/1; pass through.
Categorical: Property_Type → one-hot; Suburb → frequency encoded.
I do not treat Latitude/Longitude as categorical features - they are continuous geographic coordinates and go into the numeric bucket so tree models can split on them.
Code
NUMERIC_FEATURES = [
    "Beds", "Baths", "Car_Spaces", "LandSize_sqm",
    "Distance_to_CBD_km", "dist_nearest_train_km",
    "Latitude", "Longitude",
    "Propertycount",
    "abs_median_income_weekly", "abs_median_age",
    "abs_population", "crime_rate_per_100k",
]
TIME_FEATURES = ["Year", "Month"]
FLAG_FEATURES = ["is_land", "out_of_metro", "is_new_build"]
CATEGORICAL_FEATURES = ["Property_Type"]  # one-hot
FREQ_ENCODED_FEATURES = ["Suburb"]  # frequency encoding, fitted on train fold

ALL_FEATURES = (NUMERIC_FEATURES + TIME_FEATURES + FLAG_FEATURES
                + CATEGORICAL_FEATURES + FREQ_ENCODED_FEATURES)

print(f"Total raw features before encoding: {len(ALL_FEATURES)}")
print(f"  Numeric:     {len(NUMERIC_FEATURES)}")
print(f"  Time:        {len(TIME_FEATURES)}")
print(f"  Flags:       {len(FLAG_FEATURES)}")
print(f"  Categorical: {len(CATEGORICAL_FEATURES)}")
print(f"  Freq-encoded:{len(FREQ_ENCODED_FEATURES)}")

# Sanity: verify all listed features exist in the data.
missing = [c for c in ALL_FEATURES if c not in df_train_pool.columns]
print(f"\nMissing from training pool: {missing if missing else 'none'}")
Total raw features before encoding: 20
Numeric: 13
Time: 2
Flags: 3
Categorical: 1
Freq-encoded:1
Missing from training pool: none
3.2 Inject inference Year/Month for For Sale
For Sale has no transaction date. To produce today-priced predictions I inject the snapshot’s Year and Month from the EDA decisions. I do this now (before the split) so the For Sale frame is ready for inference later without further mutation.
Code
df_forsale["Year"] = decisions["inference_year"]
df_forsale["Month"] = decisions["inference_month"]

print(f"Injected For Sale Year/Month: "
      f"{decisions['inference_year']}/{decisions['inference_month']}")
print(f"For Sale Year unique values: {df_forsale['Year'].unique()}")
print(f"For Sale Month unique values: {df_forsale['Month'].unique()}")
Injected For Sale Year/Month: 2026/5
For Sale Year unique values: [2026]
For Sale Month unique values: [5]
4. Time-based 70/15/15 Split
I sort the training pool by Date_parsed and cut it into three folds in chronological order:
Train (70%): oldest transactions. Used to fit the model.
Validation (15%): middle period. Used to tune hyperparameters.
Test (15%): newest transactions. Used once at the end for unbiased evaluation; never touched during tuning.
A random split would leak future information into training and inflate metrics. Time-based split mimics the production scenario where the model is asked to price recent listings using only older data.
Because EDA Section 4.5 showed 40% of transactions occurred in 2025 alone, even 15% folds will cover several months of recent data - more than enough for stable hyperparameter selection and a meaningful test evaluation.
Code
# Sort by Date_parsed ascending and cut into three contiguous folds.
df_sorted = df_train_pool.sort_values("Date_parsed").reset_index(drop=True)
n = len(df_sorted)
train_end = int(n * 0.70)
val_end = int(n * 0.85)

train_df = df_sorted.iloc[:train_end].copy()
val_df = df_sorted.iloc[train_end:val_end].copy()
test_df = df_sorted.iloc[val_end:].copy()

def fold_info(name, d):
    return {
        "fold": name,
        "n": len(d),
        "date_min": d["Date_parsed"].min().date(),
        "date_max": d["Date_parsed"].max().date(),
        "price_median": int(d["Numeric_Price"].median()),
    }

info = pd.DataFrame([
    fold_info("train", train_df),
    fold_info("val", val_df),
    fold_info("test", test_df),
])
print(info.to_string(index=False))

# Save cutoff dates for production scripts.
split_meta = {
    "train_end_date": str(train_df["Date_parsed"].max().date()),
    "val_end_date": str(val_df["Date_parsed"].max().date()),
    "n_train": len(train_df),
    "n_val": len(val_df),
    "n_test": len(test_df),
}
print("\nSplit metadata:")
print(json.dumps(split_meta, indent=2))
5. Preprocessing Pipeline
Three transformations are applied:
Median imputation for numeric features with NaN (LandSize on rows that originally had 0, Beds/Baths on rooms-less rows, crime_rate for missing-suburb rows).
One-hot encoding for Property_Type.
Frequency encoding for Suburb (cardinality ~540).
All three are fitted on the train fold only. The fitted transformers are then applied to val, test, and For Sale to avoid leakage. Standard scaling is applied only inside the Linear/Ridge pipeline (Section 6); tree models do not need it.
Frequency encoding maps each suburb to its count in the train fold. Unseen suburbs in val/test/For Sale get encoded as 0. The cold-start check in EDA Section 8.2 confirmed this affects only 12 For Sale rows (0.03%), so a simple zero fallback is fine.
Code
class FrequencyEncoder:
    """Maps a categorical column to its training-fold frequency.
    Unseen values map to 0."""

    def fit(self, series):
        self.freq_ = series.value_counts().to_dict()
        return self

    def transform(self, series):
        return series.map(self.freq_).fillna(0).astype(int)

    def fit_transform(self, series):
        return self.fit(series).transform(series)


def prepare_features(d, suburb_encoder, fit=False):
    """Apply preprocessing to a dataframe. If fit=True, fit the transformers first.

    All fitted state (suburb frequencies, one-hot columns, numeric medians)
    comes from the train fold only, so val/test/For Sale stay leakage-free.
    """
    d = d.copy()

    # Frequency-encode Suburb.
    if fit:
        d["Suburb_freq"] = suburb_encoder.fit_transform(d["Suburb"])
    else:
        d["Suburb_freq"] = suburb_encoder.transform(d["Suburb"])

    # One-hot Property_Type manually (sklearn-compatible, leakage-safe).
    if fit:
        global PROPERTY_TYPE_COLUMNS
        PROPERTY_TYPE_COLUMNS = sorted(d["Property_Type"].unique().tolist())
    ohe = pd.get_dummies(d["Property_Type"], prefix="ptype")
    # Ensure all train columns are present in val/test/forsale.
    for col in [f"ptype_{c}" for c in PROPERTY_TYPE_COLUMNS]:
        if col not in ohe.columns:
            ohe[col] = 0
    ohe = ohe[[f"ptype_{c}" for c in PROPERTY_TYPE_COLUMNS]]

    # Numeric + time + flags.
    num_block = d[NUMERIC_FEATURES + TIME_FEATURES + FLAG_FEATURES + ["Suburb_freq"]].copy()

    # Impute numeric/time medians (flags are 0/1, no NaN by construction).
    if fit:
        global NUMERIC_MEDIANS
        NUMERIC_MEDIANS = num_block[NUMERIC_FEATURES + TIME_FEATURES].median()
    num_block[NUMERIC_FEATURES + TIME_FEATURES] = (
        num_block[NUMERIC_FEATURES + TIME_FEATURES].fillna(NUMERIC_MEDIANS)
    )

    # Combine.
    X = pd.concat([num_block.reset_index(drop=True), ohe.reset_index(drop=True)], axis=1)
    X = X.astype(float)
    return X

# Fit on train, transform all four sets.
suburb_encoder = FrequencyEncoder()
PROPERTY_TYPE_COLUMNS = None
NUMERIC_MEDIANS = None

X_train = prepare_features(train_df, suburb_encoder, fit=True)
X_val = prepare_features(val_df, suburb_encoder, fit=False)
X_test = prepare_features(test_df, suburb_encoder, fit=False)
X_fs = prepare_features(df_forsale, suburb_encoder, fit=False)

# Targets in log scale.
y_train = np.log1p(train_df["Numeric_Price"].values)
y_val = np.log1p(val_df["Numeric_Price"].values)
y_test = np.log1p(test_df["Numeric_Price"].values)

print(f"X_train:   {X_train.shape}")
print(f"X_val:     {X_val.shape}")
print(f"X_test:    {X_test.shape}")
print(f"X_forsale: {X_fs.shape}")
print(f"\nFeature columns ({X_train.shape[1]}):")
print(list(X_train.columns))
A quick check that the imputer caught everything. If any NaN slips through, tree models will tolerate it but Linear/Ridge will crash.
Code
for name, X in [("X_train", X_train), ("X_val", X_val),
                ("X_test", X_test), ("X_forsale", X_fs)]:
    n_nan = X.isna().sum().sum()
    print(f"{name}: NaN count = {n_nan}")
X_train: NaN count = 0
X_val: NaN count = 0
X_test: NaN count = 0
X_forsale: NaN count = 0
6. Model 1: Linear Regression + Ridge
The Linear baseline establishes a floor for the more complex models to beat. I run two variants:
Vanilla LinearRegression - no regularization.
Ridge with alpha tuned on the validation fold across {0.1, 1, 10, 100, 1000}.
Both use StandardScaler since linear coefficients are scale-sensitive. EDA Section 5.4 found Beds-Baths correlation at 0.76, so Ridge is expected to outperform vanilla Linear by stabilizing the multicollinear cluster.
Metrics are computed on both log scale (the model’s actual prediction target) and AUD scale (via np.expm1 back-transform). AUD-scale RMSE is the headline metric.
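Before the tuning cell, a minimal synthetic sketch (toy data, not the report's features) of the mechanism being relied on here: Ridge shrinks the coefficient vector of a correlated predictor pair, which is what stabilizes a multicollinear cluster like Beds-Baths.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Toy data mimicking a Beds-Baths-style correlated pair (illustrative only).
rng = np.random.RandomState(0)
beds = rng.normal(3, 1, 500)
baths = 0.8 * beds + rng.normal(0, 0.5, 500)  # strongly correlated with beds
X = np.column_stack([beds, baths])
y = 2.0 * beds + 1.0 * baths + rng.normal(0, 0.1, 500)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=100).fit(X, y)

# Ridge trades a little bias for a smaller, more stable coefficient vector.
print("OLS coefs:  ", ols.coef_, " norm:", np.linalg.norm(ols.coef_))
print("Ridge coefs:", ridge.coef_, " norm:", np.linalg.norm(ridge.coef_))
```

On this toy problem the shrinkage is mild because the noise is small; on noisier real data with the same correlation, the variance reduction is what lets Ridge beat vanilla OLS.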
Code
def metrics(y_true_log, y_pred_log):
    """Return RMSE/MAE/R2 on log scale and RMSE/MAE/MAPE on AUD scale."""
    y_true_aud = np.expm1(y_true_log)
    y_pred_aud = np.expm1(y_pred_log)
    return {
        "rmse_log": float(np.sqrt(mean_squared_error(y_true_log, y_pred_log))),
        "mae_log": float(mean_absolute_error(y_true_log, y_pred_log)),
        "r2": float(r2_score(y_true_log, y_pred_log)),
        "rmse_aud": float(np.sqrt(mean_squared_error(y_true_aud, y_pred_aud))),
        "mae_aud": float(mean_absolute_error(y_true_aud, y_pred_aud)),
        "mape": float(np.mean(np.abs((y_true_aud - y_pred_aud) / y_true_aud)) * 100),
    }

# 1. Vanilla LinearRegression with scaling.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)

lin = LinearRegression().fit(X_train_s, y_train)
m_lin_val = metrics(y_val, lin.predict(X_val_s))
print("Vanilla LinearRegression (val):")
for k, v in m_lin_val.items():
    print(f"  {k:10s}: {v:,.4f}" if "log" in k or "r2" in k or "mape" in k
          else f"  {k:10s}: {v:,.0f}")

# 2. Ridge - tune alpha on validation.
print("\nRidge alpha tuning:")
ridge_results = []
for alpha in [0.1, 1, 10, 100, 1000]:
    r = Ridge(alpha=alpha, random_state=0).fit(X_train_s, y_train)
    m = metrics(y_val, r.predict(X_val_s))
    ridge_results.append({"alpha": alpha, **m})
    print(f"  alpha={alpha:>6}: rmse_aud={m['rmse_aud']:,.0f} "
          f"mape={m['mape']:.2f}% r2={m['r2']:.4f}")

ridge_df = pd.DataFrame(ridge_results)
best_alpha = ridge_df.loc[ridge_df["rmse_aud"].idxmin(), "alpha"]
print(f"\nBest Ridge alpha: {best_alpha}")

# Final Linear baseline metrics for comparison.
best_ridge = Ridge(alpha=best_alpha, random_state=0).fit(X_train_s, y_train)
m_ridge_val = metrics(y_val, best_ridge.predict(X_val_s))
linear_results = {
    "vanilla_linear_val": m_lin_val,
    "best_ridge_alpha": float(best_alpha),
    "best_ridge_val": m_ridge_val,
}
print(f"\nBest Ridge (val): rmse_aud={m_ridge_val['rmse_aud']:,.0f} "
      f"mape={m_ridge_val['mape']:.2f}% r2={m_ridge_val['r2']:.4f}")
Ridge tuning: alpha has near-zero effect. Best alpha=1000 only improves RMSE by ~$330 (0.07%). The Beds-Baths multicollinearity hurts less than expected, likely because the 34-feature space dilutes its impact.
Linear baseline explains 63% of log-price variance. The non-linearities found in EDA (Year x Property_Type interaction, log-log LandSize-price, suburb-level price hierarchy) cannot be captured by a linear model. Tree models should comfortably beat this.
Floor to beat: RMSE $448k, MAPE 23%, R² 0.635.
7. Model 2: Random Forest
Grid search across 36 hyperparameter combinations, evaluated on validation. RF does not need scaling, so I use the raw X_train instead of X_train_s.
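The search cell itself is not shown in this export. A sketch of a validation grid search of this shape, run here on tiny synthetic stand-in data so it executes standalone; the grid values are assumptions (only the 36-combination count mirrors the report), and in the notebook the loop would fit on X_train/y_train and score on X_val/y_val.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the Section 5 matrices (illustrative only).
rng = np.random.RandomState(0)
X_tr = rng.rand(300, 5)
y_tr = X_tr @ np.array([3.0, 1.0, 0.5, 0.0, 2.0]) + rng.normal(0, 0.1, 300)
X_v = rng.rand(100, 5)
y_v = X_v @ np.array([3.0, 1.0, 0.5, 0.0, 2.0]) + rng.normal(0, 0.1, 100)

# Assumed grid: 2 x 3 x 3 x 2 = 36 combinations, matching the reported count.
# n_estimators kept small here so the demo is quick; the report's best used 500.
grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 20, 30],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", 0.5],
}
results = []
for n_est, depth, leaf, feats in product(*grid.values()):
    rf = RandomForestRegressor(
        n_estimators=n_est, max_depth=depth, min_samples_leaf=leaf,
        max_features=feats, random_state=0, n_jobs=-1,
    ).fit(X_tr, y_tr)
    rmse = float(np.sqrt(mean_squared_error(y_v, rf.predict(X_v))))
    results.append({"n_estimators": n_est, "max_depth": depth,
                    "min_samples_leaf": leaf, "max_features": feats,
                    "rmse": rmse})

# Select the config with the lowest validation RMSE.
best_rf = min(results, key=lambda r: r["rmse"])
print(f"{len(results)} configs searched; best: {best_rf}")
```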
Best config: n_estimators=500, max_depth=None (unlimited), min_samples_leaf=1, max_features=0.5. Deep unconstrained trees with moderate feature sampling.
Best RF (val): RMSE $324,712, MAPE 13.07%, R² 0.849. Major improvement over Linear: RMSE drops 27% ($448k → $325k), MAPE nearly halves (23% → 13%), R² jumps from 0.635 to 0.849.
Top 5 are tightly clustered (RMSE $325k-$335k) - the model is robust to hyperparameter choice as long as trees are allowed to grow deep.
max_features=0.5 slightly outperforms “sqrt” - tree diversity matters but moderate feature sampling beats heavy subsampling here.
Floor to beat for XGBoost: RMSE $325k, MAPE 13%, R² 0.85.
8. Model 3: XGBoost
Grid search with early stopping on validation. Early stopping halts training when val RMSE stops improving for 50 rounds, which both speeds up the search and avoids overfitting deep trees.
Search grid:
max_depth: [4, 6, 8]
learning_rate: [0.05, 0.1]
subsample: [0.8, 1.0]
colsample_bytree: [0.8, 1.0]
n_estimators is set high (2000) with early stopping doing the actual selection. tree_method=“hist” is the fast histogram-based algorithm.
Best config: max_depth=6, learning_rate=0.05, subsample=0.8, colsample_bytree=1.0, best_iter=1532. Moderate depth with slow learning rate and many trees.
Best XGBoost (val): RMSE $295,156, MAPE 12.79%, R² 0.864. Beats RF by ~$30k RMSE (9% lower) and improves R² from 0.849 to 0.864.
Top 5 are clustered in two depth tiers (4 and 6) - both work well. Higher depth=8 underperforms slightly, suggesting trees don’t need to go deeper than 6 for this dataset.
Slow learning rate (0.05) with 1500+ trees beats faster learning rate (0.1) with fewer trees - classic gradient boosting pattern.
New floor: XGBoost wins the validation comparison and becomes the candidate for test evaluation in Section 9.
9. Model Comparison and Test Evaluation
Side-by-side comparison of the three families on validation, then a single, final evaluation of the winner on the test fold. The test fold is touched only once - no further hyperparameter changes are allowed after seeing test metrics.
After test evaluation I break the winner’s errors down by Property_Type, price bucket, and Year to identify where the model is weakest.
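The comparison cell is not shown in this export; a minimal sketch of the selection step, with the metric values copied from the validation output reported below (in the notebook these would come from the m_ridge_val dict and the RF/XGBoost validation results):

```python
import pandas as pd

# Validation metrics copied from the report's comparison output.
comparison = pd.DataFrame([
    {"model": "Linear (Ridge alpha=1000)", "rmse_aud": 447644.50, "mape": 23.13, "r2": 0.6346},
    {"model": "Random Forest",             "rmse_aud": 324711.74, "mape": 13.07, "r2": 0.8485},
    {"model": "XGBoost",                   "rmse_aud": 295156.22, "mape": 12.79, "r2": 0.8639},
])

# The winner is the model with the lowest AUD-scale validation RMSE.
winner = comparison.loc[comparison["rmse_aud"].idxmin(), "model"]
print(comparison.to_string(index=False))
print(f"\nWinner: {winner}")
```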
=== Validation comparison ===
model rmse_aud mape r2
Linear (Ridge alpha=1000) 447644.5022 23.1277 0.6346
Random Forest 324711.7352 13.0724 0.8485
XGBoost 295156.2157 12.7929 0.8639
Winner: XGBoost
9.1 Final test evaluation
I retrain the winning XGBoost configuration on train+val combined (using the same n_estimators that early stopping picked) and evaluate on the untouched test fold. This is the unbiased estimate of production performance.
Code
# Retrain XGBoost on train + val with best hyperparameters and the n_estimators found.
X_tv = pd.concat([X_train, X_val], axis=0, ignore_index=True)
y_tv = np.concatenate([y_train, y_val])

final_xgb_val_stage = xgb.XGBRegressor(
    n_estimators=int(best_xgb_params["best_iter"]),
    max_depth=int(best_xgb_params["max_depth"]),
    learning_rate=best_xgb_params["learning_rate"],
    subsample=best_xgb_params["subsample"],
    colsample_bytree=best_xgb_params["colsample_bytree"],
    tree_method="hist",
    random_state=0,
    n_jobs=-1,
    verbosity=0,
).fit(X_tv, y_tv)

y_pred_test = final_xgb_val_stage.predict(X_test)
m_test = metrics(y_test, y_pred_test)

print("=== Test fold evaluation (XGBoost, retrained on train+val) ===")
print(f"  RMSE (AUD): {m_test['rmse_aud']:,.0f}")
print(f"  MAE (AUD):  {m_test['mae_aud']:,.0f}")
print(f"  MAPE:       {m_test['mape']:.2f}%")
print(f"  R²:         {m_test['r2']:.4f}")
print(f"  RMSE (log): {m_test['rmse_log']:.4f}")
=== Test fold evaluation (XGBoost, retrained on train+val) ===
RMSE (AUD): 203,832
MAE (AUD): 114,892
MAPE: 11.83%
R²: 0.8688
RMSE (log): 0.1652
Test results (XGBoost retrained on train+val):
RMSE $203,832, MAE $114,892, MAPE 11.83%, R² 0.869.
Test metrics are better than validation (val RMSE was $295k, test is $204k). This is unusual but explainable: the val fold (Nov 2025 - Feb 2026) overlaps with the post-COVID pullback recovery period, while the test fold (Feb-May 2026) sits in a more stable regime that the train+val combined set learned well. No overfitting signal - the model generalizes properly.
9.2 Error breakdown for the winning model
Where does the model struggle? I break test-set errors by Property_Type, price bucket, and Year. Large differences signal sub-populations the model underperforms on - useful for FastAPI dashboard warnings and for future improvements.
Code
# Build a test-side error dataframe.
err = test_df[["Property_Type", "Year", "Numeric_Price"]].copy().reset_index(drop=True)
err["pred_aud"] = np.expm1(y_pred_test)
err["abs_err"] = (err["pred_aud"] - err["Numeric_Price"]).abs()
err["abs_pct_err"] = err["abs_err"] / err["Numeric_Price"] * 100

# 1. By Property_Type (top 8 by volume).
top_types = err["Property_Type"].value_counts().head(8).index
print("=== Error by Property_Type (top 8) ===")
print(err[err["Property_Type"].isin(top_types)]
      .groupby("Property_Type")
      .agg(n=("abs_err", "size"),
           rmse=("abs_err", lambda s: float(np.sqrt((s**2).mean()))),
           mape=("abs_pct_err", "mean"))
      .round(2)
      .sort_values("rmse", ascending=False)
      .to_string())

# 2. By price bucket.
err["price_bucket"] = pd.cut(
    err["Numeric_Price"],
    bins=[0, 500_000, 750_000, 1_000_000, 1_500_000, 2_500_000, 20_000_000],
    labels=["<500k", "500-750k", "750k-1M", "1-1.5M", "1.5-2.5M", ">2.5M"])
print("\n=== Error by price bucket ===")
print(err.groupby("price_bucket")
      .agg(n=("abs_err", "size"),
           rmse=("abs_err", lambda s: float(np.sqrt((s**2).mean()))),
           mape=("abs_pct_err", "mean"))
      .round(2)
      .to_string())

# 3. By Year.
print("\n=== Error by Year ===")
print(err.groupby("Year")
      .agg(n=("abs_err", "size"),
           rmse=("abs_err", lambda s: float(np.sqrt((s**2).mean()))),
           mape=("abs_pct_err", "mean"))
      .round(2)
      .to_string())
=== Error by Property_Type (top 8) ===
n rmse mape
Property_Type
Acreage / Semi-Rural 25 399751.94 14.60
New House & Land 15 292046.29 11.18
House 10569 223083.99 10.76
Vacant land 523 202393.44 12.78
Townhouse 1840 173816.98 10.51
Villa 76 163011.54 12.45
Apartment / Unit / Flat 4104 149592.43 14.86
Studio 14 74666.44 24.57
=== Error by price bucket ===
n rmse mape
price_bucket
<500k 2045 94875.67 17.06
500-750k 5804 79558.76 8.91
750k-1M 4639 122442.31 10.10
1-1.5M 3134 213196.85 13.15
1.5-2.5M 1339 405681.67 17.43
>2.5M 229 994937.68 23.30
=== Error by Year ===
n rmse mape
Year
2026.0 17190 203831.74 11.83
Error by Property_Type:
Apartment has high MAPE (14.86%) despite low RMSE - smaller absolute prices amplify percent error.
Studio MAPE 24.57% but only 14 rows - noisy estimate, ignore.
Acreage/Semi-Rural has highest RMSE ($400k) due to high price level and small sample (25 rows).
House and Townhouse - the bulk of volume - sit at 10-11% MAPE, the best segments.
Error by price bucket - critical insight:
Sweet spot: $500k-$1M (8.9-10.1% MAPE, where most data lives). Model is reliable here.
Degradation at extremes: <$500k (17%), $1.5-2.5M (17.4%), >$2.5M (23.3%).
Luxury (>$2.5M): $995k RMSE on 229 rows - the model struggles with high-end properties.
The dashboard will need to flag predictions in luxury and budget brackets with wider uncertainty.
Error by Year:
Test fold is entirely 2026 - single-year evaluation. Cannot break down trend by year here, but the strong overall metrics suggest the Year feature is working correctly.
10. Feature Importance and SHAP
I use two views:
Built-in XGBoost feature importance (gain-based) - which features the model splits on most.
SHAP summary plot - direction and magnitude of each feature’s impact on individual predictions.
I compute SHAP on a 2,000-row sample of the test fold to keep runtime reasonable. SHAP for tree models is exact (TreeExplainer), so sampling only affects the visualization smoothness, not correctness.
Observations on the gain-based importance ranking:
Beds (11.5%) and Baths (11.4%) tie for second - core property attributes.
abs_median_income_weekly (10.7%) - suburb wealth as a price driver.
Year (4.8%) ranks 7th - the model uses it but it’s not dominant, because the time dimension is concentrated and explains average shift more than individual variation.
ptype_Apartment, ptype_Studio, ptype_House feature in top 9 - property type one-hots are pulling their weight.
Suburb_freq ranks 15th (1.7%) - lower than expected. Geographic signal is being absorbed by Latitude/Longitude and Distance_to_CBD instead.
Code
# SHAP analysis on a sample of the test fold.
sample_idx = np.random.RandomState(0).choice(len(X_test), size=min(2000, len(X_test)), replace=False)
X_shap = X_test.iloc[sample_idx]

explainer = shap.TreeExplainer(final_xgb_val_stage)
shap_values = explainer.shap_values(X_shap)
print(f"SHAP computed on {len(X_shap):,} test rows.")

# Summary plot.
shap.summary_plot(shap_values, X_shap, max_display=15, show=False)
plt.title("SHAP summary - XGBoost on test sample", pad=20)
plt.tight_layout()
plt.show()
SHAP computed on 2,000 test rows.
Observations on SHAP summary:
Distance_to_CBD_km: high values (red, far from CBD) push prediction up, low values push down. Interesting reversal - far-from-CBD includes peninsula and acreage premiums.
abs_median_age: high values push prediction up. Older suburbs are more expensive.
LandSize_sqm: high values push up significantly with long right tail - big lots add a lot.
abs_median_income_weekly: high income pushes up - same direction as expected.
Latitude (negative numbers in Melbourne): high values (less negative = further north) tend to push prediction down. This captures the south-Melbourne premium.
Suburb_freq: high frequency (densely-traded suburbs) pulls prediction slightly down - consistent with growth-corridor outer suburbs being cheap and high-volume.
11. Final Model and For Sale Inference
I retrain the winning XGBoost on the entire Sold dataset (train + val + test combined) using the hyperparameters chosen on validation. This gives the production model the maximum possible training signal. The test metrics in Section 9 remain valid as the unbiased estimate of production performance.
Two additional quantile XGBoost models (q=0.1 and q=0.9) are trained with the same configuration to produce an 80% prediction interval per listing.
For Sale inference uses Year=2026, Month=5 (already injected in Section 3.2).
Deal-signal logic for listings with an asking price:
Predicted < 0.9 × Asking → Overpriced
Predicted > 1.1 × Asking → Good Deal
Otherwise → Fair
Code
# Retrain final point-estimate model on full Sold (train + val + test).
X_full = pd.concat([X_train, X_val, X_test], axis=0, ignore_index=True)
y_full = np.concatenate([y_train, y_val, y_test])

best_params = {
    "n_estimators": int(best_xgb_params["best_iter"]),
    "max_depth": int(best_xgb_params["max_depth"]),
    "learning_rate": best_xgb_params["learning_rate"],
    "subsample": best_xgb_params["subsample"],
    "colsample_bytree": best_xgb_params["colsample_bytree"],
    "tree_method": "hist",
    "random_state": 0,
    "n_jobs": -1,
    "verbosity": 0,
}
final_model = xgb.XGBRegressor(**best_params).fit(X_full, y_full)
print(f"Final point model trained on {len(X_full):,} rows.")

# Train quantile models for 80% prediction interval (q=0.1, q=0.9).
quantile_params = {**best_params, "objective": "reg:quantileerror"}

print("\nTraining quantile model q=0.1 (lower bound)...")
model_q10 = xgb.XGBRegressor(**quantile_params, quantile_alpha=0.1).fit(X_full, y_full)
print("Training quantile model q=0.9 (upper bound)...")
model_q90 = xgb.XGBRegressor(**quantile_params, quantile_alpha=0.9).fit(X_full, y_full)
print("All three models trained.")
Final point model trained on 114,594 rows.
Training quantile model q=0.1 (lower bound)...
Training quantile model q=0.9 (upper bound)...
All three models trained.
Code
# Predict on For Sale.
y_pred_fs = final_model.predict(X_fs)
y_pred_fs_q10 = model_q10.predict(X_fs)
y_pred_fs_q90 = model_q90.predict(X_fs)

# Inverse-transform from log scale to AUD.
df_forsale["Predicted_Price"] = np.expm1(y_pred_fs).round(-3)  # round to nearest $1k
df_forsale["Predicted_Price_Lower"] = np.expm1(y_pred_fs_q10).round(-3)
df_forsale["Predicted_Price_Upper"] = np.expm1(y_pred_fs_q90).round(-3)

# Sanity: enforce ordering lower <= point <= upper (rare floating-point inversions).
df_forsale["Predicted_Price_Lower"] = df_forsale[["Predicted_Price_Lower", "Predicted_Price"]].min(axis=1)
df_forsale["Predicted_Price_Upper"] = df_forsale[["Predicted_Price_Upper", "Predicted_Price"]].max(axis=1)

print("For Sale predicted price summary:")
print(df_forsale[["Predicted_Price", "Predicted_Price_Lower", "Predicted_Price_Upper"]]
      .describe(percentiles=[.05, .25, .5, .75, .95]).round(0))

# Interval width as a quality check.
df_forsale["Interval_Width_Pct"] = (
    (df_forsale["Predicted_Price_Upper"] - df_forsale["Predicted_Price_Lower"])
    / df_forsale["Predicted_Price"] * 100
)
print(f"\nMedian 80% interval width: {df_forsale['Interval_Width_Pct'].median():.1f}% of predicted price")
For Sale predicted price summary:
Predicted_Price Predicted_Price_Lower Predicted_Price_Upper
count 38419.0 38419.0 38419.0
mean 1021647.0 820242.0 1253588.0
std 740877.0 456156.0 1026249.0
min 72000.0 70000.0 208000.0
5% 424000.0 354000.0 519000.0
25% 665000.0 601000.0 726000.0
50% 776000.0 699000.0 867000.0
75% 1126000.0 897000.0 1407000.0
95% 2367000.0 1653000.0 3187100.0
max 13469000.0 11012000.0 15548000.0
Median 80% interval width: 30.3% of predicted price
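Assuming, as the `expm1` calls imply, that the training target was `log1p(price)`, a tiny round-trip sketch of the inverse transform and the $1k rounding (illustrative numbers only):

```python
import numpy as np

# log1p/expm1 are exact inverses, so predictions come back in AUD;
# round(-3) then snaps to the nearest $1,000.
log_preds = np.log1p(np.array([649_750.0, 1_250_499.0]))
aud = np.expm1(log_preds)     # recovers the original prices
rounded = aud.round(-3)       # nearest $1k: 650000, 1250000
```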
Code
# Deal signal for listings that have an asking price.
def classify_deal(row):
    asking = row["Numeric_Price"]
    if pd.isna(asking):
        return "No Asking Price"
    pred = row["Predicted_Price"]
    if pred < 0.9 * asking:
        return "Overpriced"
    if pred > 1.1 * asking:
        return "Good Deal"
    return "Fair"

df_forsale["Deal_Signal"] = df_forsale.apply(classify_deal, axis=1)

print("Deal signal distribution:")
print(df_forsale["Deal_Signal"].value_counts())
print("\n% of For Sale listings:")
print((df_forsale["Deal_Signal"].value_counts(normalize=True) * 100).round(1))

# Sample of each category.
print("\n=== Sample listings per category ===")
for cat in ["Good Deal", "Fair", "Overpriced", "No Asking Price"]:
    sub = df_forsale[df_forsale["Deal_Signal"] == cat]
    if len(sub) == 0:
        continue
    print(f"\n--- {cat} ({len(sub):,} listings) ---")
    cols = ["Suburb", "Property_Type", "Beds", "Numeric_Price",
            "Predicted_Price", "Predicted_Price_Lower", "Predicted_Price_Upper"]
    print(sub[cols].head(3).to_string(index=False))
Deal signal distribution:
Deal_Signal
Fair 18299
Overpriced 8497
Good Deal 6289
No Asking Price 5334
Name: count, dtype: int64
% of For Sale listings:
Deal_Signal
Fair 47.6
Overpriced 22.1
Good Deal 16.4
No Asking Price 13.9
Name: proportion, dtype: float64
=== Sample listings per category ===
--- Good Deal (6,289 listings) ---
Suburb Property_Type Beds Numeric_Price Predicted_Price Predicted_Price_Lower Predicted_Price_Upper
BRUNSWICK House 4.0 1000000.0 2041000.0 1450000.0 2601000.0
BRUNSWICK House 3.0 1190000.0 1310000.0 1009000.0 1538000.0
BULLEEN House 4.0 1195000.0 1461000.0 1245000.0 1830000.0
--- Fair (18,299 listings) ---
Suburb Property_Type Beds Numeric_Price Predicted_Price Predicted_Price_Lower Predicted_Price_Upper
BRUNSWICK EAST House 3.0 1270000.0 1283000.0 1021000.0 1835000.0
ROSANNA Apartment / Unit / Flat 2.0 520000.0 570000.0 471000.0 723000.0
ROSANNA House 3.0 1360000.0 1288000.0 1037000.0 1445000.0
--- Overpriced (8,497 listings) ---
Suburb Property_Type Beds Numeric_Price Predicted_Price Predicted_Price_Lower Predicted_Price_Upper
RHYLL Vacant land 0.0 600000.0 491000.0 441000.0 664000.0
WEST MELBOURNE House 2.0 1350000.0 1161000.0 898000.0 1369000.0
CARLTON Apartment / Unit / Flat 2.0 377500.0 318000.0 257000.0 591000.0
--- No Asking Price (5,334 listings) ---
Suburb Property_Type Beds Numeric_Price Predicted_Price Predicted_Price_Lower Predicted_Price_Upper
HAWTHORN House 3.0 NaN 1983000.0 1186000.0 2323000.0
IVANHOE House 4.0 NaN 1902000.0 1493000.0 2609000.0
MANOR LAKES Vacant land 0.0 NaN 388000.0 361000.0 413000.0
Observations on inference results:
Predicted price: median $776k (matches 2026 market level confirmed in EDA Section 6), mean $1.02M (typical right-skew from luxury tail). Range $72k to $13.5M.
80% interval width: median 30.3% of predicted price. Tight enough to be informative, wide enough to capture true uncertainty.
Deal signal distribution:
Fair: 47.6% (18,299 listings) - asking and predicted price agree within ±10%
Overpriced: 22.1% (8,497) - predicted price more than 10% below asking
Good Deal: 16.4% (6,289) - predicted price more than 10% above asking
No Asking Price: 13.9% (5,334) - Contact Agent / EOI / Private Sale
The 22% Overpriced share exceeds the 16% Good Deal share - consistent with the well-known tendency of sellers to anchor high.
Sample inspection: predictions look reasonable. The BRUNSWICK 4-bed House predicted at $2M against a $1M asking price is a plausible “Good Deal” flag (or a stale listing). The RHYLL Vacant Land asking $600k against a predicted $491k is flagged Overpriced, consistent with the known $630k → $600k price-reduction history of that exact listing (noted in EDA Section 2).
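The row-wise classify_deal above is clear but slow across 38k listings; an equivalent vectorized version with np.select (same ±10% thresholds, toy data) would be:

```python
import numpy as np
import pandas as pd

# Toy rows mirroring the sample listings above.
df = pd.DataFrame({
    "Numeric_Price":   [1_000_000.0, 1_270_000.0, 600_000.0, np.nan],
    "Predicted_Price": [2_041_000.0, 1_283_000.0, 491_000.0, 1_983_000.0],
})
conditions = [
    df["Numeric_Price"].isna(),                         # no asking price at all
    df["Predicted_Price"] < 0.9 * df["Numeric_Price"],  # model well below asking
    df["Predicted_Price"] > 1.1 * df["Numeric_Price"],  # model well above asking
]
labels = ["No Asking Price", "Overpriced", "Good Deal"]
df["Deal_Signal"] = np.select(conditions, labels, default="Fair")
# → Good Deal, Fair, Overpriced, No Asking Price
```

np.select evaluates the conditions in order, so the NaN check must come first; after that the thresholds match the apply-based version exactly.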
12. Save Artifacts and Next Steps
I persist everything needed for the FastAPI dashboard and the weekly production pipeline:
3 model files (point, lower, upper quantile)
Preprocessing parameters (frequency map, numeric medians, property type column list)
Feature list (column order for inference)
Metrics summary (validation and test)
For Sale predictions with intervals and deal signals
Code
# 1. Models.
joblib.dump(final_model, MODEL_DIR / "model.pkl")
joblib.dump(model_q10, MODEL_DIR / "model_q10.pkl")
joblib.dump(model_q90, MODEL_DIR / "model_q90.pkl")
print(f"Saved 3 models to {MODEL_DIR}/")

# 2. Preprocessing parameters.
preproc_params = {
    "suburb_frequency_map": suburb_encoder.freq_,
    "property_type_columns": PROPERTY_TYPE_COLUMNS,
    "numeric_medians": NUMERIC_MEDIANS.to_dict(),
    "numeric_features": NUMERIC_FEATURES,
    "time_features": TIME_FEATURES,
    "flag_features": FLAG_FEATURES,
    "categorical_features": CATEGORICAL_FEATURES,
    "all_feature_columns": X_train.columns.tolist(),
}
joblib.dump(preproc_params, MODEL_DIR / "preprocessor.pkl")
print(f"Saved preprocessor.pkl ({len(preproc_params)} keys).")

# 3. Metrics summary.
metrics_summary = {
    "data_snapshot_date": decisions["data_snapshot_date"],
    "n_train": len(X_train),
    "n_val": len(X_val),
    "n_test": len(X_test),
    "validation_metrics": {
        "linear_ridge": m_ridge_val,
        "random_forest": {k: best_rf_params[k] for k in ["rmse_aud", "mae_aud", "mape", "r2"]},
        "xgboost": {k: best_xgb_params[k] for k in ["rmse_aud", "mae_aud", "mape", "r2"]},
    },
    "test_metrics_xgboost": m_test,
    "best_hyperparameters": {
        "n_estimators": int(best_xgb_params["best_iter"]),
        "max_depth": int(best_xgb_params["max_depth"]),
        "learning_rate": best_xgb_params["learning_rate"],
        "subsample": best_xgb_params["subsample"],
        "colsample_bytree": best_xgb_params["colsample_bytree"],
    },
    "winner": "XGBoost",
}
with open(MODEL_DIR / "metrics.json", "w") as f:
    json.dump(metrics_summary, f, indent=2, default=str)
print("Saved metrics.json.")

# 4. For Sale predictions parquet for the FastAPI dashboard.
out_cols = [
    "Property_ID", "Status", "Suburb", "Postcode", "Property_Type",
    "Beds", "Baths", "Car_Spaces", "LandSize_sqm",
    "Latitude", "Longitude", "Distance_to_CBD_km",
    "Raw_Price", "Numeric_Price",
    "Predicted_Price", "Predicted_Price_Lower", "Predicted_Price_Upper",
    "Interval_Width_Pct", "Deal_Signal",
    "is_land", "out_of_metro", "is_new_build",
    "Last_Updated",
]
predictions_path = DATA_DIR / "predictions_for_sale.parquet"
df_forsale[out_cols].to_parquet(predictions_path, index=False)
print(f"Saved predictions to {predictions_path}")
print(f"  Rows: {len(df_forsale):,}")
print(f"  Columns: {len(out_cols)}")
Saved 3 models to models/
Saved preprocessor.pkl (8 keys).
Saved metrics.json.
Saved predictions to report_data\predictions_for_sale.parquet
Rows: 38,419
Columns: 23
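Why the feature list (all_feature_columns) is persisted: at inference time the incoming frame must match the training column order exactly, and any column missing from a payload should surface as NaN for the median imputer rather than crash. A minimal sketch with hypothetical column names, not the actual saved list:

```python
import pandas as pd

# Hypothetical subset standing in for preproc_params["all_feature_columns"].
feature_columns = ["Beds", "Baths", "Suburb_Freq"]

# A payload arriving with columns out of order and one column missing.
incoming = pd.DataFrame({"Suburb_Freq": [0.012], "Beds": [3]})

# reindex enforces the training order; absent columns become NaN and can
# then be filled from the saved numeric medians.
aligned = incoming.reindex(columns=feature_columns)
```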
Next steps - production pipeline
This report is the development artifact. The same logic must run weekly on updated data (../data/melbourne_price_data_enriched.csv). To enable that, I will refactor the EDA cleaning logic (EDA Sections 2-3) and the ML training and inference logic (ML Sections 3-11) into three standalone Python scripts:
predict.py - reproduces ML Section 11 (For Sale inference). Output: predictions_for_sale.parquet.
weekly_update.py - orchestrates the three scripts in sequence.
The FastAPI dashboard reads predictions_for_sale.parquet directly, so updating the dashboard is just a matter of re-running the weekly pipeline.
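One way weekly_update.py could sequence the pipeline (a sketch; the stage names and commands here are placeholders, not the report's actual scripts):

```python
import subprocess
import sys

def run_pipeline(stages):
    """Run each (name, command) stage in order; abort the weekly refresh on first failure."""
    for name, cmd in stages:
        print(f"[pipeline] running {name} ...")
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"{name} failed:\n{result.stderr}")
    print("[pipeline] all stages complete.")

if __name__ == "__main__":
    # Placeholder smoke stage; real stages would invoke the refactored scripts.
    run_pipeline([("smoke test", [sys.executable, "-c", "print('ok')"])])
```

Failing fast matters here: if cleaning or training breaks, the dashboard keeps serving last week's predictions_for_sale.parquet instead of a partial refresh.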
Known limitations carried forward:
Model underperforms in two segments: luxury (>$2.5M, 23% MAPE) and budget (<$500k, 17% MAPE). The dashboard should show wider uncertainty for listings in these ranges.
Cold-start suburbs (12 rows currently) get frequency-encoded as 0; Postcode and Lat/Lng carry the geographic signal as fallback.
New-build supply (30% of For Sale) is captured by is_new_build but training data on these types remains thin. Predictions on New House & Land may be less reliable than on resale categories.
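A follow-up worth automating in the weekly pipeline, given these segment weaknesses: an empirical coverage check of the 80% interval on held-out Sold rows, where the share of actual prices falling inside [lower, upper] should sit near 0.80 overall and can be monitored per segment. A self-contained sketch with toy numbers:

```python
import numpy as np

def interval_coverage(y_true, lower, upper):
    """Share of actual values falling inside the prediction interval."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

# Toy example: rows 2 and 4 fall outside their intervals.
y  = [100.0, 200.0, 300.0, 400.0]
lo = [ 90.0, 210.0, 250.0, 380.0]
hi = [110.0, 260.0, 350.0, 390.0]
coverage = interval_coverage(y, lo, hi)   # → 0.5
```

Coverage drifting well below 0.80 in the luxury or budget segments would confirm that the dashboard should widen the displayed uncertainty there.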