Melbourne Property Price - Machine Learning

Author

Hoang Son Lai

Published

May 12, 2026

Introduction

This report builds, tunes and compares regression models to predict Melbourne property prices, then uses the best model to price the current For Sale supply.

Data snapshot: this report uses data updated on 2 May 2026.

The report consumes the cleaned dataset produced by eda_report.ipynb. All cleaning rules, outlier policies and encoding decisions were finalized in EDA and are loaded here from eda_decisions.json to keep the two reports in sync.

Three models are compared: Linear Regression with Ridge regularization (baseline), Random Forest, and XGBoost. The best model is then used to predict prices for all For Sale listings, with the current Year and Month injected at inference to obtain today-priced estimates. Two additional XGBoost quantile models (q=0.1, q=0.9) produce an 80% prediction interval for each listing.

Workflow: setup, problem framing, feature engineering, time-based 70/15/15 split, preprocessing pipeline, three model families, model comparison, SHAP interpretability, final retraining and For Sale inference, save artifacts.

1. Load cleaned data and EDA decisions

I load three things from the EDA output: the cleaned dataset (Sold + For Sale combined into one parquet), the JSON of decisions taken during EDA (split ratios, encoding rules, rare-type list, new-build categories), and a quick sanity check on group sizes.

Code
import warnings
warnings.filterwarnings("ignore")

import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

import joblib

from sklearn.preprocessing  import StandardScaler
from sklearn.impute         import SimpleImputer
from sklearn.compose        import ColumnTransformer
from sklearn.pipeline       import Pipeline
from sklearn.linear_model   import LinearRegression, Ridge
from sklearn.ensemble       import RandomForestRegressor
from sklearn.metrics        import mean_squared_error, mean_absolute_error, r2_score

import xgboost as xgb
import shap

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 200)
sns.set_style("whitegrid")
plt.rcParams["figure.dpi"] = 100
plt.rcParams["figure.figsize"] = (10, 5)

DATA_DIR  = Path("report_data")
MODEL_DIR = Path("models")
MODEL_DIR.mkdir(exist_ok=True)

# To render to HTML: quarto render report/ml_report.ipynb


print("Setup complete.")
Setup complete.
Code
# Load cleaned dataset and EDA decisions.
df = pd.read_parquet(DATA_DIR / "cleaned_data.parquet")

with open(DATA_DIR / "eda_decisions.json", "r") as f:
    decisions = json.load(f)

print(f"Loaded data: {df.shape}")
print(f"  Sold:     {(df['Status'] == 'Sold').sum():,}")
print(f"  For Sale: {(df['Status'] == 'For Sale').sum():,}")

print("\nDecision keys loaded:")
for k in decisions:
    print(f"  - {k}")

print(f"\nSnapshot date: {decisions['data_snapshot_date']}")
print(f"Split ratios:  {decisions['split_ratios']}")
print(f"Target transform: {decisions['target_transform']}")
print(f"Inference Year/Month: {decisions['inference_year']}/{decisions['inference_month']}")
Loaded data: (163239, 32)
  Sold:     124,820
  For Sale: 38,419

Decision keys loaded:
  - data_snapshot_date
  - land_property_types
  - apt_like_types
  - rural_large_types
  - residential_dense_types
  - price_floor
  - price_ceiling
  - landsize_caps_by_type
  - metro_envelope
  - target_transform
  - split_strategy
  - split_ratios
  - inference_year
  - inference_month
  - rare_property_types
  - new_build_types
  - rationale_year_month

Snapshot date: 2026-05-02
Split ratios:  {'train': 0.7, 'val': 0.15, 'test': 0.15}
Target transform: log1p
Inference Year/Month: 2026/5

2. Problem framing

Goal: predict Numeric_Price for For Sale listings.

Training set: Sold rows that have a valid (non-NaN) Numeric_Price. Rows with “Price Withheld” or with any other missing target are excluded from training.

Inference set: all For Sale rows.

Target: log1p(Numeric_Price). The log transform was justified in EDA Section 4.1 by the heavy right-skew of raw price.
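A quick arithmetic check of the transform (illustrative numbers spanning the observed price range): log1p compresses the heavy right tail, and expm1 inverts it exactly for back-transforming predictions.

```python
import numpy as np

# Illustrative prices (AUD) covering roughly the min/median/95th/max range.
prices = np.array([50_000, 772_000, 2_100_000, 19_500_000], dtype=float)

logged = np.log1p(prices)   # model target space
back   = np.expm1(logged)   # exact inverse used at prediction time

# The max/median ratio shrinks from ~14x in AUD to ~1.2x in log space,
# so extreme listings no longer dominate a squared-error loss.
raw_ratio = prices.max() / np.median(prices)
log_ratio = logged.max() / np.median(logged)
```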

Time-aware features: each Sold row uses its actual Year and Month at training time, letting the model absorb the 4-12% YoY inflation observed in EDA Section 6. For Sale rows have no transaction date by structural design, so at inference I inject Year = 2026, Month = 5 (the snapshot date) so predictions reflect today’s market level rather than the historical average.

Evaluation: I report RMSE and MAE on both the log scale and the original AUD scale, plus MAPE (AUD scale) and R² (log scale). RMSE on AUD is the primary metric because absolute pricing error is what matters for downstream decisions.

Code
# Split the combined frame into training-eligible Sold and inference-target For Sale.
df_sold    = df[df["Status"] == "Sold"].copy()
df_forsale = df[df["Status"] == "For Sale"].copy()

# Training set requires a non-NaN Numeric_Price.
df_train_pool = df_sold.dropna(subset=["Numeric_Price"]).copy()

print(f"Sold total:             {len(df_sold):,}")
print(f"Sold with valid price:  {len(df_train_pool):,}  (training pool)")
print(f"For Sale (inference):   {len(df_forsale):,}")

print(f"\nTraining pool date range: "
      f"{df_train_pool['Date_parsed'].min().date()} to "
      f"{df_train_pool['Date_parsed'].max().date()}")
print(f"\nTarget summary (raw AUD):")
print(df_train_pool["Numeric_Price"].describe(percentiles=[.25, .5, .75, .95]).round(0))
Sold total:             124,820
Sold with valid price:  114,594  (training pool)
For Sale (inference):   38,419

Training pool date range: 2005-12-01 to 2026-05-02

Target summary (raw AUD):
count      114594.0
mean       958510.0
std        701764.0
min         50000.0
25%        610000.0
50%        772295.0
75%       1081545.0
95%       2100000.0
max      19500000.0
Name: Numeric_Price, dtype: float64

3. Feature Engineering

I build the feature matrix in this order:

  1. Apply the rare-type grouping (“Other”) and create the is_new_build flag identified in EDA.

  2. Define the four feature buckets: numeric, time, binary flags, categorical.

  3. Keep the engineering deterministic and reusable - the same function will run in production.

The frequency encoding for Suburb is fitted later, on the training fold only (Section 5), to avoid leakage.

About is_new_build: this flag marks properties belonging to four new-construction categories - “New House & Land”, “New Apartments / Off the Plan”, “New Home Designs”, and “New land”. EDA Section 4.3 found that these categories make up 27% of For Sale supply but only 0.3% of Sold history. Without the flag, the model would have very few training examples in these categories and would generalize poorly. The flag acts as a single binary signal the model can latch onto, regardless of how rare each specific new-build category is in training. The original Property_Type column still goes into the model as a one-hot categorical, so the flag adds information rather than replacing it.

Code
def add_engineered_features(d):
    """
    Add features derived from EDA decisions:
    - Group rare Property_Type categories into "Other".
    - Add is_new_build flag for the 4 new-build categories.
    is_land and out_of_metro are already present from EDA cleaning.
    """
    d = d.copy()

    # Group rare types into "Other".
    rare = set(decisions["rare_property_types"])
    d["Property_Type"] = d["Property_Type"].where(~d["Property_Type"].isin(rare), "Other")

    # New-build flag.
    new_build_types = set(decisions["new_build_types"])
    d["is_new_build"] = d["Property_Type"].isin(new_build_types).astype(int)

    return d

df_train_pool = add_engineered_features(df_train_pool)
df_forsale    = add_engineered_features(df_forsale)

print("After feature engineering:")
print(f"  Training pool Property_Type distinct values: {df_train_pool['Property_Type'].nunique()}")
print(f"  For Sale     Property_Type distinct values:  {df_forsale['Property_Type'].nunique()}")
print(f"\nis_new_build share:")
print(f"  Training pool: {df_train_pool['is_new_build'].mean()*100:.1f}%")
print(f"  For Sale:      {df_forsale['is_new_build'].mean()*100:.1f}%")

print(f"\nProperty_Type composition (training pool):")
print(df_train_pool["Property_Type"].value_counts().head(15))
After feature engineering:
  Training pool Property_Type distinct values: 15
  For Sale     Property_Type distinct values:  15

is_new_build share:
  Training pool: 0.3%
  For Sale:      30.0%

Property_Type composition (training pool):
Property_Type
House                            83147
Apartment / Unit / Flat          14233
Townhouse                         8243
Vacant land                       6701
Acreage / Semi-Rural              1138
Rural                              378
Villa                              260
New House & Land                   235
Block of Units                      72
Other                               69
New Apartments / Off the Plan       48
Studio                              37
New land                            21
Unknown                              8
New Home Designs                     4
Name: count, dtype: int64

3.1 Define feature buckets

I separate features into four buckets so the preprocessing pipeline in Section 5 can apply the right transformation to each:

  • Numeric (continuous): need imputation and (for Linear/Ridge) scaling.
  • Time: Year and Month as integers; no scaling needed for trees, light scaling for Linear.
  • Binary flags: already 0/1; pass through.
  • Categorical: Property_Type → one-hot; Suburb → frequency encoded.

I do not treat Latitude/Longitude as categorical features - they are continuous geographic coordinates and go into the numeric bucket so tree models can split on them.

Code
NUMERIC_FEATURES = [
    "Beds", "Baths", "Car_Spaces", "LandSize_sqm",
    "Distance_to_CBD_km", "dist_nearest_train_km",
    "Latitude", "Longitude",
    "Propertycount",
    "abs_median_income_weekly", "abs_median_age",
    "abs_population", "crime_rate_per_100k",
]

TIME_FEATURES = ["Year", "Month"]

FLAG_FEATURES = ["is_land", "out_of_metro", "is_new_build"]

CATEGORICAL_FEATURES = ["Property_Type"]   # one-hot

FREQ_ENCODED_FEATURES = ["Suburb"]         # frequency encoding, fitted on train fold

ALL_FEATURES = (NUMERIC_FEATURES + TIME_FEATURES + FLAG_FEATURES
                + CATEGORICAL_FEATURES + FREQ_ENCODED_FEATURES)

print(f"Total raw features before encoding: {len(ALL_FEATURES)}")
print(f"  Numeric:     {len(NUMERIC_FEATURES)}")
print(f"  Time:        {len(TIME_FEATURES)}")
print(f"  Flags:       {len(FLAG_FEATURES)}")
print(f"  Categorical: {len(CATEGORICAL_FEATURES)}")
print(f"  Freq-encoded:{len(FREQ_ENCODED_FEATURES)}")

# Sanity: verify all listed features exist in the data.
missing = [c for c in ALL_FEATURES if c not in df_train_pool.columns]
print(f"\nMissing from training pool: {missing if missing else 'none'}")
Total raw features before encoding: 20
  Numeric:     13
  Time:        2
  Flags:       3
  Categorical: 1
  Freq-encoded:1

Missing from training pool: none

3.2 Inject inference Year/Month for For Sale

For Sale has no transaction date. To produce today-priced predictions I inject the snapshot’s Year and Month from the EDA decisions. I do this now (before the split) so the For Sale frame is ready for inference later without further mutation.

Code
df_forsale["Year"]  = decisions["inference_year"]
df_forsale["Month"] = decisions["inference_month"]

print(f"Injected For Sale Year/Month: "
      f"{decisions['inference_year']}/{decisions['inference_month']}")
print(f"For Sale Year unique values:  {df_forsale['Year'].unique()}")
print(f"For Sale Month unique values: {df_forsale['Month'].unique()}")
Injected For Sale Year/Month: 2026/5
For Sale Year unique values:  [2026]
For Sale Month unique values: [5]

4. Time-based 70/15/15 Split

I sort the training pool by Date_parsed and cut it into three folds in chronological order:

  • Train (70%): oldest transactions. Used to fit the model.
  • Validation (15%): middle period. Used to tune hyperparameters.
  • Test (15%): newest transactions. Used once at the end for unbiased evaluation; never touched during tuning.

A random split would leak future information into training and inflate metrics. Time-based split mimics the production scenario where the model is asked to price recent listings using only older data.

Because EDA Section 4.5 showed 40% of transactions occurred in 2025 alone, even 15% folds will cover several months of recent data - more than enough for stable hyperparameter selection and a meaningful test evaluation.

Code
# Sort by Date_parsed ascending and cut into three contiguous folds.
df_sorted = df_train_pool.sort_values("Date_parsed").reset_index(drop=True)

n = len(df_sorted)
train_end = int(n * 0.70)
val_end   = int(n * 0.85)

train_df = df_sorted.iloc[:train_end].copy()
val_df   = df_sorted.iloc[train_end:val_end].copy()
test_df  = df_sorted.iloc[val_end:].copy()

def fold_info(name, d):
    return {
        "fold":         name,
        "n":            len(d),
        "date_min":     d["Date_parsed"].min().date(),
        "date_max":     d["Date_parsed"].max().date(),
        "price_median": int(d["Numeric_Price"].median()),
    }

info = pd.DataFrame([
    fold_info("train", train_df),
    fold_info("val",   val_df),
    fold_info("test",  test_df),
])
print(info.to_string(index=False))

# Save cutoff dates for production scripts.
split_meta = {
    "train_end_date": str(train_df["Date_parsed"].max().date()),
    "val_end_date":   str(val_df["Date_parsed"].max().date()),
    "n_train":        len(train_df),
    "n_val":          len(val_df),
    "n_test":         len(test_df),
}
print("\nSplit metadata:")
print(json.dumps(split_meta, indent=2))
 fold     n   date_min   date_max  price_median
train 80215 2005-12-01 2025-11-05        763000
  val 17189 2025-11-05 2026-02-02        807000
 test 17190 2026-02-02 2026-05-02        782000

Split metadata:
{
  "train_end_date": "2025-11-05",
  "val_end_date": "2026-02-02",
  "n_train": 80215,
  "n_val": 17189,
  "n_test": 17190
}

5. Preprocessing Pipeline

The pipeline performs three operations:

  1. Median imputation for numeric features with NaN (LandSize on rows that originally had 0, Beds/Baths on rooms-less rows, crime_rate for missing-suburb rows).
  2. One-hot encoding for Property_Type.
  3. Frequency encoding for Suburb (cardinality ~540).

All three are fitted on the train fold only. The fitted transformers are then applied to val, test, and For Sale to avoid leakage. Standard scaling is applied only inside the Linear/Ridge pipeline (Section 6); tree models do not need it.

Frequency encoding maps each suburb to its count in the train fold. Unseen suburbs in val/test/For Sale get encoded as 0. The cold-start check in EDA Section 8.2 confirmed this affects only 12 For Sale rows (0.03%), so a simple zero fallback is fine.

Code
class FrequencyEncoder:
    """Maps a categorical column to its training-fold frequency. Unseen values map to 0."""

    def fit(self, series):
        self.freq_ = series.value_counts().to_dict()
        return self

    def transform(self, series):
        return series.map(self.freq_).fillna(0).astype(int)

    def fit_transform(self, series):
        return self.fit(series).transform(series)


def build_preprocessor():
    """ColumnTransformer for numeric + flags + one-hot, returns dense array.
    Suburb is handled separately via FrequencyEncoder for cardinality control."""
    from sklearn.preprocessing import OneHotEncoder

    numeric_pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
    ])

    # pd.get_dummies is not a fit/transform estimator, so it cannot live inside
    # a ColumnTransformer; OneHotEncoder is the leakage-safe equivalent.
    return ColumnTransformer(
        transformers=[
            ("num",  numeric_pipe, NUMERIC_FEATURES + TIME_FEATURES),
            ("flag", "passthrough", FLAG_FEATURES),
            ("cat",  OneHotEncoder(handle_unknown="ignore", sparse_output=False),
                     CATEGORICAL_FEATURES),
        ],
        remainder="drop",
        verbose_feature_names_out=False,
    )


def prepare_features(d, preproc, suburb_encoder, fit=False):
    """Apply preprocessing to a dataframe. If fit=True, fit transformers first.
    (preproc is accepted for API symmetry but unused; encoding is done inline.)"""
    d = d.copy()

    # Frequency-encode Suburb.
    if fit:
        d["Suburb_freq"] = suburb_encoder.fit_transform(d["Suburb"])
    else:
        d["Suburb_freq"] = suburb_encoder.transform(d["Suburb"])

    # One-hot Property_Type manually (sklearn-compatible, leakage-safe).
    if fit:
        global PROPERTY_TYPE_COLUMNS
        PROPERTY_TYPE_COLUMNS = sorted(d["Property_Type"].unique().tolist())

    ohe = pd.get_dummies(d["Property_Type"], prefix="ptype")
    # Ensure all train columns are present in val/test/forsale.
    for col in [f"ptype_{c}" for c in PROPERTY_TYPE_COLUMNS]:
        if col not in ohe.columns:
            ohe[col] = 0
    ohe = ohe[[f"ptype_{c}" for c in PROPERTY_TYPE_COLUMNS]]

    # Numeric + time + flags.
    num_block = d[NUMERIC_FEATURES + TIME_FEATURES + FLAG_FEATURES + ["Suburb_freq"]].copy()

    # Impute numeric/time medians (flags are 0/1, no NaN by construction).
    if fit:
        global NUMERIC_MEDIANS
        NUMERIC_MEDIANS = num_block[NUMERIC_FEATURES + TIME_FEATURES].median()
    num_block[NUMERIC_FEATURES + TIME_FEATURES] = (
        num_block[NUMERIC_FEATURES + TIME_FEATURES].fillna(NUMERIC_MEDIANS)
    )

    # Combine.
    X = pd.concat([num_block.reset_index(drop=True),
                   ohe.reset_index(drop=True)], axis=1)
    X = X.astype(float)
    return X


# Fit on train, transform all four sets.
suburb_encoder = FrequencyEncoder()
PROPERTY_TYPE_COLUMNS = None
NUMERIC_MEDIANS = None

X_train = prepare_features(train_df,   None, suburb_encoder, fit=True)
X_val   = prepare_features(val_df,     None, suburb_encoder, fit=False)
X_test  = prepare_features(test_df,    None, suburb_encoder, fit=False)
X_fs    = prepare_features(df_forsale, None, suburb_encoder, fit=False)

# Targets in log scale.
y_train = np.log1p(train_df["Numeric_Price"].values)
y_val   = np.log1p(val_df["Numeric_Price"].values)
y_test  = np.log1p(test_df["Numeric_Price"].values)

print(f"X_train: {X_train.shape}")
print(f"X_val:   {X_val.shape}")
print(f"X_test:  {X_test.shape}")
print(f"X_forsale:    {X_fs.shape}")
print(f"\nFeature columns ({X_train.shape[1]}):")
print(list(X_train.columns))
X_train: (80215, 34)
X_val:   (17189, 34)
X_test:  (17190, 34)
X_forsale:    (38419, 34)

Feature columns (34):
['Beds', 'Baths', 'Car_Spaces', 'LandSize_sqm', 'Distance_to_CBD_km', 'dist_nearest_train_km', 'Latitude', 'Longitude', 'Propertycount', 'abs_median_income_weekly', 'abs_median_age', 'abs_population', 'crime_rate_per_100k', 'Year', 'Month', 'is_land', 'out_of_metro', 'is_new_build', 'Suburb_freq', 'ptype_Acreage / Semi-Rural', 'ptype_Apartment / Unit / Flat', 'ptype_Block of Units', 'ptype_House', 'ptype_New Apartments / Off the Plan', 'ptype_New Home Designs', 'ptype_New House & Land', 'ptype_New land', 'ptype_Other', 'ptype_Rural', 'ptype_Studio', 'ptype_Townhouse', 'ptype_Unknown', 'ptype_Vacant land', 'ptype_Villa']

Verify no NaN in feature matrix

A quick check that the imputer caught everything. XGBoost handles NaN natively and recent scikit-learn tree ensembles tolerate it, but Linear/Ridge will crash on any NaN that slips through.

Code
for name, X in [("X_train", X_train), ("X_val", X_val),
                ("X_test", X_test), ("X_forsale", X_fs)]:
    n_nan = X.isna().sum().sum()
    print(f"{name}: NaN count = {n_nan}")
X_train: NaN count = 0
X_val: NaN count = 0
X_test: NaN count = 0
X_forsale: NaN count = 0

6. Model 1: Linear Regression + Ridge

The Linear baseline establishes a floor for the more complex models to beat. I run two variants:

  1. Vanilla LinearRegression - no regularization.

  2. Ridge with alpha tuned on the validation fold across {0.1, 1, 10, 100, 1000}.

Both use StandardScaler - plain LinearRegression predictions are scale-invariant, but the Ridge penalty is not, and a shared scaled matrix keeps the two variants comparable. EDA Section 5.4 found Beds-Baths correlation at 0.76, so Ridge is expected to outperform vanilla Linear by stabilizing the multicollinear cluster.

Metrics are computed on both log scale (the model’s actual prediction target) and AUD scale (via np.expm1 back-transform). AUD-scale RMSE is the headline metric.
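The intuition that Ridge stabilizes collinear coefficients can be illustrated on synthetic data (illustrative parameters; not part of the report's pipeline): refitting on bootstrap resamples, the Ridge coefficients vary noticeably less than the OLS ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two strongly collinear predictors; only x1 truly drives y.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)        # corr(x1, x2) ~ 0.96
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(size=n)

def coef_spread(model, n_boot=200):
    """Max per-coefficient std across bootstrap refits."""
    coefs = [model.fit(X[idx], y[idx]).coef_.copy()
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return float(np.std(coefs, axis=0).max())

spread_ols   = coef_spread(LinearRegression())
spread_ridge = coef_spread(Ridge(alpha=10.0))
# Ridge shrinks the unstable collinear direction, so its bootstrap
# coefficient spread is smaller than plain OLS.
```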

Code
def metrics(y_true_log, y_pred_log):
    """Return RMSE/MAE/R2 on log scale and RMSE/MAE/MAPE on AUD scale."""
    y_true_aud = np.expm1(y_true_log)
    y_pred_aud = np.expm1(y_pred_log)
    return {
        "rmse_log":  float(np.sqrt(mean_squared_error(y_true_log, y_pred_log))),
        "mae_log":   float(mean_absolute_error(y_true_log, y_pred_log)),
        "r2":        float(r2_score(y_true_log, y_pred_log)),
        "rmse_aud":  float(np.sqrt(mean_squared_error(y_true_aud, y_pred_aud))),
        "mae_aud":   float(mean_absolute_error(y_true_aud, y_pred_aud)),
        "mape":      float(np.mean(np.abs((y_true_aud - y_pred_aud) / y_true_aud)) * 100),
    }


# 1. Vanilla LinearRegression with scaling.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s   = scaler.transform(X_val)

lin = LinearRegression().fit(X_train_s, y_train)
m_lin_val = metrics(y_val, lin.predict(X_val_s))
print("Vanilla LinearRegression (val):")
for k, v in m_lin_val.items():
    print(f"  {k:10s}: {v:,.4f}" if "log" in k or "r2" in k or "mape" in k
          else f"  {k:10s}: {v:,.0f}")

# 2. Ridge - tune alpha on validation.
print("\nRidge alpha tuning:")
ridge_results = []
for alpha in [0.1, 1, 10, 100, 1000]:
    r = Ridge(alpha=alpha, random_state=0).fit(X_train_s, y_train)
    m = metrics(y_val, r.predict(X_val_s))
    ridge_results.append({"alpha": alpha, **m})
    print(f"  alpha={alpha:>6}: rmse_aud={m['rmse_aud']:,.0f}  "
          f"mape={m['mape']:.2f}%  r2={m['r2']:.4f}")

ridge_df = pd.DataFrame(ridge_results)
best_alpha = ridge_df.loc[ridge_df["rmse_aud"].idxmin(), "alpha"]
print(f"\nBest Ridge alpha: {best_alpha}")

# Final Linear baseline metrics for comparison.
best_ridge = Ridge(alpha=best_alpha, random_state=0).fit(X_train_s, y_train)
m_ridge_val = metrics(y_val, best_ridge.predict(X_val_s))

linear_results = {
    "vanilla_linear_val":  m_lin_val,
    "best_ridge_alpha":    float(best_alpha),
    "best_ridge_val":      m_ridge_val,
}
print(f"\nBest Ridge (val): rmse_aud={m_ridge_val['rmse_aud']:,.0f}  "
      f"mape={m_ridge_val['mape']:.2f}%  r2={m_ridge_val['r2']:.4f}")
Vanilla LinearRegression (val):
  rmse_log  : 0.2948
  mae_log   : 0.2169
  r2        : 0.6349
  rmse_aud  : 447,975
  mae_aud   : 224,496
  mape      : 23.2112

Ridge alpha tuning:
  alpha=   0.1: rmse_aud=447,975  mape=23.21%  r2=0.6349
  alpha=     1: rmse_aud=447,975  mape=23.21%  r2=0.6349
  alpha=    10: rmse_aud=447,979  mape=23.21%  r2=0.6349
  alpha=   100: rmse_aud=448,005  mape=23.20%  r2=0.6349
  alpha=  1000: rmse_aud=447,645  mape=23.13%  r2=0.6346

Best Ridge alpha: 1000.0

Best Ridge (val): rmse_aud=447,645  mape=23.13%  r2=0.6346

Observations on the linear baseline:

  • Vanilla Linear: RMSE $448k, MAPE 23.2%, R² 0.635 on validation. Reasonable floor.

  • Ridge tuning: alpha has near-zero effect. Best alpha=1000 only improves RMSE by ~$330 (0.07%). The Beds-Baths multicollinearity hurts less than expected, likely because the 34-feature space dilutes its impact.

  • Linear baseline explains 63% of log-price variance. The non-linearities found in EDA (Year x Property_Type interaction, log-log LandSize-price, suburb-level price hierarchy) cannot be captured by a linear model. Tree models should comfortably beat this.

  • Floor to beat: RMSE $448k, MAPE 23%, R² 0.635.

7. Model 2: Random Forest

Grid search across 36 hyperparameter combinations, evaluated on validation. RF does not need scaling, so I use the raw X_train instead of X_train_s.

Search grid:

  • n_estimators: [200, 500]

  • max_depth: [10, 20, None]

  • min_samples_leaf: [1, 5, 10]

  • max_features: [“sqrt”, 0.5]

random_state=0 for reproducibility. n_jobs=-1 to use all cores.

Code
from itertools import product

rf_grid = list(product(
    [200, 500],         # n_estimators
    [10, 20, None],     # max_depth
    [1, 5, 10],         # min_samples_leaf
    ["sqrt", 0.5],      # max_features
))
print(f"Searching {len(rf_grid)} RF configurations...")

rf_results = []
for i, (n_est, depth, leaf, mf) in enumerate(rf_grid, 1):
    rf = RandomForestRegressor(
        n_estimators=n_est, max_depth=depth, min_samples_leaf=leaf,
        max_features=mf, random_state=0, n_jobs=-1,
    ).fit(X_train, y_train)
    m = metrics(y_val, rf.predict(X_val))
    rf_results.append({
        "n_estimators": n_est, "max_depth": depth,
        "min_samples_leaf": leaf, "max_features": mf, **m,
    })
    if i % 6 == 0 or i == len(rf_grid):
        print(f"  [{i:2d}/{len(rf_grid)}] last: n={n_est}, d={depth}, "
              f"leaf={leaf}, mf={mf} -> rmse_aud={m['rmse_aud']:,.0f}")

rf_df = pd.DataFrame(rf_results).sort_values("rmse_aud").reset_index(drop=True)
print("\nTop 5 RF configurations by val RMSE (AUD):")
print(rf_df.head(5)[["n_estimators", "max_depth", "min_samples_leaf",
                     "max_features", "rmse_aud", "mape", "r2"]].round(4).to_string(index=False))

best_rf_params = rf_df.iloc[0].to_dict()
print(f"\nBest RF config:")
for k in ["n_estimators", "max_depth", "min_samples_leaf", "max_features"]:
    print(f"  {k}: {best_rf_params[k]}")
print(f"\nBest RF (val): rmse_aud={best_rf_params['rmse_aud']:,.0f}  "
      f"mape={best_rf_params['mape']:.2f}%  r2={best_rf_params['r2']:.4f}")
Searching 36 RF configurations...
  [ 6/36] last: n=200, d=10, leaf=10, mf=0.5 -> rmse_aud=381,689
  [12/36] last: n=200, d=20, leaf=10, mf=0.5 -> rmse_aud=348,101
  [18/36] last: n=200, d=None, leaf=10, mf=0.5 -> rmse_aud=349,125
  [24/36] last: n=500, d=10, leaf=10, mf=0.5 -> rmse_aud=381,306
  [30/36] last: n=500, d=20, leaf=10, mf=0.5 -> rmse_aud=348,869
  [36/36] last: n=500, d=None, leaf=10, mf=0.5 -> rmse_aud=349,146

Top 5 RF configurations by val RMSE (AUD):
 n_estimators  max_depth  min_samples_leaf max_features    rmse_aud    mape     r2
          500        NaN                 1          0.5 324711.7352 13.0724 0.8485
          500       20.0                 1          0.5 325006.8499 13.1278 0.8474
          200        NaN                 1          0.5 325236.1795 13.1037 0.8472
          200       20.0                 1          0.5 325311.2890 13.1891 0.8461
          200        NaN                 1         sqrt 334637.4211 12.8346 0.8502

Best RF config:
  n_estimators: 500
  max_depth: nan
  min_samples_leaf: 1
  max_features: 0.5

Best RF (val): rmse_aud=324,712  mape=13.07%  r2=0.8485

Observations on Random Forest:

  • Best config: n_estimators=500, max_depth=None (unlimited), min_samples_leaf=1, max_features=0.5. Deep unconstrained trees with moderate feature sampling.

  • Best RF (val): RMSE $324,712, MAPE 13.07%, R² 0.849. Major improvement over Linear: RMSE drops 27% ($448k → $325k), MAPE nearly halves (23% → 13%), R² jumps from 0.635 to 0.849.

  • Top 5 are tightly clustered (RMSE $325k-$335k) - the model is robust to hyperparameter choice as long as trees are allowed to grow deep.

  • max_features=0.5 slightly outperforms “sqrt” - tree diversity matters but moderate feature sampling beats heavy subsampling here.

  • Floor to beat for XGBoost: RMSE $325k, MAPE 13%, R² 0.85.

8. Model 3: XGBoost

Grid search with early stopping on validation. Early stopping halts training when val RMSE stops improving for 50 rounds, which both speeds up the search and avoids overfitting deep trees.

Search grid:

  • max_depth: [4, 6, 8]

  • learning_rate: [0.05, 0.1]

  • subsample: [0.8, 1.0]

  • colsample_bytree: [0.8, 1.0]

n_estimators is set high (2000) with early stopping doing the actual selection. tree_method=“hist” is the fast histogram-based algorithm.

Code
xgb_grid = list(product(
    [4, 6, 8],          # max_depth
    [0.05, 0.1],        # learning_rate
    [0.8, 1.0],         # subsample
    [0.8, 1.0],         # colsample_bytree
))
print(f"Searching {len(xgb_grid)} XGBoost configurations...")

xgb_results = []
for i, (depth, lr, sub, cs) in enumerate(xgb_grid, 1):
    model = xgb.XGBRegressor(
        n_estimators=2000,
        max_depth=depth,
        learning_rate=lr,
        subsample=sub,
        colsample_bytree=cs,
        tree_method="hist",
        random_state=0,
        n_jobs=-1,
        early_stopping_rounds=50,
        verbosity=0,
    )
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    m = metrics(y_val, model.predict(X_val))
    xgb_results.append({
        "max_depth": depth, "learning_rate": lr,
        "subsample": sub, "colsample_bytree": cs,
        "best_iter": model.best_iteration, **m,
    })
    if i % 6 == 0 or i == len(xgb_grid):
        print(f"  [{i:2d}/{len(xgb_grid)}] last: d={depth}, lr={lr}, "
              f"sub={sub}, cs={cs}, n_iter={model.best_iteration} -> "
              f"rmse_aud={m['rmse_aud']:,.0f}")

xgb_df = pd.DataFrame(xgb_results).sort_values("rmse_aud").reset_index(drop=True)
print("\nTop 5 XGBoost configurations by val RMSE (AUD):")
print(xgb_df.head(5)[["max_depth", "learning_rate", "subsample",
                       "colsample_bytree", "best_iter",
                       "rmse_aud", "mape", "r2"]].round(4).to_string(index=False))

best_xgb_params = xgb_df.iloc[0].to_dict()
print(f"\nBest XGBoost config:")
for k in ["max_depth", "learning_rate", "subsample", "colsample_bytree", "best_iter"]:
    print(f"  {k}: {best_xgb_params[k]}")
print(f"\nBest XGBoost (val): rmse_aud={best_xgb_params['rmse_aud']:,.0f}  "
      f"mape={best_xgb_params['mape']:.2f}%  r2={best_xgb_params['r2']:.4f}")
Searching 24 XGBoost configurations...
  [ 6/24] last: d=4, lr=0.1, sub=0.8, cs=1.0, n_iter=1474 -> rmse_aud=298,745
  [12/24] last: d=6, lr=0.05, sub=1.0, cs=1.0, n_iter=749 -> rmse_aud=302,968
  [18/24] last: d=8, lr=0.05, sub=0.8, cs=1.0, n_iter=798 -> rmse_aud=302,316
  [24/24] last: d=8, lr=0.1, sub=1.0, cs=1.0, n_iter=309 -> rmse_aud=303,038

Top 5 XGBoost configurations by val RMSE (AUD):
 max_depth  learning_rate  subsample  colsample_bytree  best_iter    rmse_aud    mape     r2
         6           0.05        0.8               1.0       1532 295156.2157 12.7929 0.8639
         4           0.10        1.0               0.8       1266 295512.5956 13.5079 0.8524
         4           0.10        1.0               1.0       1349 295830.6999 13.3702 0.8535
         6           0.05        0.8               0.8       1560 296091.7410 12.8526 0.8630
         4           0.05        0.8               0.8       1985 297467.6290 13.3502 0.8544

Best XGBoost config:
  max_depth: 6.0
  learning_rate: 0.05
  subsample: 0.8
  colsample_bytree: 1.0
  best_iter: 1532.0

Best XGBoost (val): rmse_aud=295,156  mape=12.79%  r2=0.8639

Observations on XGBoost:

  • Best config: max_depth=6, learning_rate=0.05, subsample=0.8, colsample_bytree=1.0, best_iter=1532. Moderate depth with slow learning rate and many trees.

  • Best XGBoost (val): RMSE $295,156, MAPE 12.79%, R² 0.864. Beats RF by ~$30k RMSE (9% lower) and improves R² from 0.849 to 0.864.

  • Top 5 are clustered in two depth tiers (4 and 6) - both work well. Higher depth=8 underperforms slightly, suggesting trees don’t need to go deeper than 6 for this dataset.

  • Slow learning rate (0.05) with 1500+ trees beats faster learning rate (0.1) with fewer trees - classic gradient boosting pattern.

  • New floor: XGBoost wins the validation comparison and becomes the candidate for test evaluation in Section 9.

9. Model Comparison and Test Evaluation

Side-by-side comparison of the three families on validation, then a single, final evaluation of the winner on the test fold. The test fold is touched only once - no further hyperparameter changes are allowed after seeing test metrics.

After test evaluation I break the winner’s errors down by Property_Type, price bucket, and Year to identify where the model is weakest.
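The breakdown step amounts to grouping absolute percentage error by segment. A minimal sketch, with a hypothetical helper name and made-up numbers (the real breakdown uses the test fold's Property_Type, price bucket, and Year columns):

```python
import numpy as np
import pandas as pd

def error_breakdown(y_true_aud, y_pred_aud, groups):
    """Per-group error summary: count, mean and median absolute % error."""
    tbl = pd.DataFrame({
        "ape":   np.abs(y_true_aud - y_pred_aud) / y_true_aud * 100,
        "group": np.asarray(groups),
    })
    return (tbl.groupby("group")["ape"]
               .agg(["count", "mean", "median"])
               .sort_values("mean", ascending=False)
               .round(2))

# Tiny illustrative call with invented prices and predictions.
demo = error_breakdown(
    y_true_aud=np.array([800_000, 900_000, 1_200_000, 650_000]),
    y_pred_aud=np.array([760_000, 990_000, 1_080_000, 715_000]),
    groups=["House", "House", "Townhouse", "Townhouse"],
)
# -> Townhouse mean APE 10.0%, House mean APE 7.5%
```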

Code
# Comparison table on validation.
comparison = pd.DataFrame([
    {"model": "Linear (Ridge alpha=1000)",
     "rmse_aud": m_ridge_val["rmse_aud"],
     "mape":     m_ridge_val["mape"],
     "r2":       m_ridge_val["r2"]},
    {"model": "Random Forest",
     "rmse_aud": best_rf_params["rmse_aud"],
     "mape":     best_rf_params["mape"],
     "r2":       best_rf_params["r2"]},
    {"model": "XGBoost",
     "rmse_aud": best_xgb_params["rmse_aud"],
     "mape":     best_xgb_params["mape"],
     "r2":       best_xgb_params["r2"]},
])
print("=== Validation comparison ===")
print(comparison.round(4).to_string(index=False))

winner = "XGBoost"
print(f"\nWinner: {winner}")
=== Validation comparison ===
                    model    rmse_aud    mape     r2
Linear (Ridge alpha=1000) 447644.5022 23.1277 0.6346
            Random Forest 324711.7352 13.0724 0.8485
                  XGBoost 295156.2157 12.7929 0.8639

Winner: XGBoost

9.1 Final test evaluation

I retrain the winning XGBoost configuration on train+val combined (using the same n_estimators that early stopping picked) and evaluate on the untouched test fold. This is the unbiased estimate of production performance.

Code
# Retrain XGBoost on train + val with best hyperparameters and the n_estimators found.
X_tv = pd.concat([X_train, X_val], axis=0, ignore_index=True)
y_tv = np.concatenate([y_train, y_val])

final_xgb_val_stage = xgb.XGBRegressor(
    n_estimators=int(best_xgb_params["best_iter"]),
    max_depth=int(best_xgb_params["max_depth"]),
    learning_rate=best_xgb_params["learning_rate"],
    subsample=best_xgb_params["subsample"],
    colsample_bytree=best_xgb_params["colsample_bytree"],
    tree_method="hist",
    random_state=0,
    n_jobs=-1,
    verbosity=0,
).fit(X_tv, y_tv)

y_pred_test = final_xgb_val_stage.predict(X_test)
m_test = metrics(y_test, y_pred_test)

print("=== Test fold evaluation (XGBoost, retrained on train+val) ===")
print(f"  RMSE (AUD): {m_test['rmse_aud']:,.0f}")
print(f"  MAE  (AUD): {m_test['mae_aud']:,.0f}")
print(f"  MAPE:       {m_test['mape']:.2f}%")
print(f"  R²:         {m_test['r2']:.4f}")
print(f"  RMSE (log): {m_test['rmse_log']:.4f}")
=== Test fold evaluation (XGBoost, retrained on train+val) ===
  RMSE (AUD): 203,832
  MAE  (AUD): 114,892
  MAPE:       11.83%
  R²:         0.8688
  RMSE (log): 0.1652

Test results (XGBoost retrained on train+val):

  • RMSE $203,832, MAE $114,892, MAPE 11.83%, R² 0.869.

  • Test metrics are better than validation (val RMSE $295k vs test $204k). This is unusual but has two explanations: the test-stage model was retrained on train+val, so it saw more (and more recent) data than the model scored on validation; and the val fold (Nov 2025 - Feb 2026) overlaps the post-COVID pullback recovery period, while the test fold (Feb - May 2026) sits in a more stable regime that the combined training set captures well. Neither is an overfitting signal - the model generalizes properly.

9.2 Error breakdown for the winning model

Where does the model struggle? I break test-set errors by Property_Type, price bucket, and Year. Large differences signal sub-populations the model underperforms on - useful for FastAPI dashboard warnings and for future improvements.

Code
# Build a test-side error dataframe.
err = test_df[["Property_Type", "Year", "Numeric_Price"]].copy().reset_index(drop=True)
err["pred_aud"] = np.expm1(y_pred_test)
err["abs_err"] = (err["pred_aud"] - err["Numeric_Price"]).abs()
err["abs_pct_err"] = err["abs_err"] / err["Numeric_Price"] * 100

# 1. By Property_Type (top 8 by volume).
top_types = err["Property_Type"].value_counts().head(8).index
print("=== Error by Property_Type (top 8) ===")
print(err[err["Property_Type"].isin(top_types)]
      .groupby("Property_Type")
      .agg(n=("abs_err", "size"),
           rmse=("abs_err", lambda s: float(np.sqrt((s**2).mean()))),
           mape=("abs_pct_err", "mean"))
      .round(2)
      .sort_values("rmse", ascending=False)
      .to_string())

# 2. By price bucket.
err["price_bucket"] = pd.cut(err["Numeric_Price"],
                              bins=[0, 500_000, 750_000, 1_000_000,
                                    1_500_000, 2_500_000, 20_000_000],
                              labels=["<500k", "500-750k", "750k-1M",
                                      "1-1.5M", "1.5-2.5M", ">2.5M"])
print("\n=== Error by price bucket ===")
print(err.groupby("price_bucket")
      .agg(n=("abs_err", "size"),
           rmse=("abs_err", lambda s: float(np.sqrt((s**2).mean()))),
           mape=("abs_pct_err", "mean"))
      .round(2)
      .to_string())

# 3. By Year (test fold spans late 2025 to May 2026).
print("\n=== Error by Year ===")
print(err.groupby("Year")
      .agg(n=("abs_err", "size"),
           rmse=("abs_err", lambda s: float(np.sqrt((s**2).mean()))),
           mape=("abs_pct_err", "mean"))
      .round(2)
      .to_string())
=== Error by Property_Type (top 8) ===
                             n       rmse   mape
Property_Type                                   
Acreage / Semi-Rural        25  399751.94  14.60
New House & Land            15  292046.29  11.18
House                    10569  223083.99  10.76
Vacant land                523  202393.44  12.78
Townhouse                 1840  173816.98  10.51
Villa                       76  163011.54  12.45
Apartment / Unit / Flat   4104  149592.43  14.86
Studio                      14   74666.44  24.57

=== Error by price bucket ===
                 n       rmse   mape
price_bucket                        
<500k         2045   94875.67  17.06
500-750k      5804   79558.76   8.91
750k-1M       4639  122442.31  10.10
1-1.5M        3134  213196.85  13.15
1.5-2.5M      1339  405681.67  17.43
>2.5M          229  994937.68  23.30

=== Error by Year ===
            n       rmse   mape
Year                           
2026.0  17190  203831.74  11.83

Error by Property_Type:

  • Apartment has high MAPE (14.86%) despite low RMSE - smaller absolute prices amplify percent error.

  • Studio MAPE 24.57% but only 14 rows - noisy estimate, ignore.

  • Acreage/Semi-Rural has highest RMSE ($400k) due to high price level and small sample (25 rows).

  • House and Townhouse - the bulk of volume - sit at 10-11% MAPE, the best segments.

Error by price bucket - critical insight:

  • Sweet spot: $500k-$1M (8.9-10.1% MAPE, where most data lives). Model is reliable here.

  • Degradation at extremes: <$500k (17%), $1.5-2.5M (17.4%), >$2.5M (23.3%).

  • Luxury (>$2.5M): $995k RMSE on only 229 rows - the model struggles with high-end properties.

  • The dashboard will need to flag predictions in luxury and budget brackets with wider uncertainty.

Error by Year:

  • Test fold is entirely 2026 - single-year evaluation. Cannot break down trend by year here, but the strong overall metrics suggest the Year feature is working correctly.
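The dashboard warning flagged above can be as simple as a threshold on predicted price. A hypothetical helper - the thresholds mirror the error breakdown, but the function name and cut-offs are my assumptions, not part of the trained pipeline:

```python
import pandas as pd

# Hypothetical dashboard helper: flag price ranges where the test-fold
# error breakdown showed degraded MAPE (<$500k and >$1.5M).
RELIABLE_LOW, RELIABLE_HIGH = 500_000, 1_500_000  # assumed cut-offs

def uncertainty_flag(predicted_price: float) -> str:
    """Return a warning level based on the test-fold error profile."""
    if predicted_price < RELIABLE_LOW:
        return "wide"    # budget segment: ~17% MAPE
    if predicted_price > RELIABLE_HIGH:
        return "wide"    # luxury segments: 17-23% MAPE
    return "normal"      # $500k-$1.5M sweet spot: ~9-13% MAPE

preds = pd.Series([420_000, 760_000, 2_800_000])
print(preds.map(uncertainty_flag).tolist())  # ['wide', 'normal', 'wide']
```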

10. Feature Importance and SHAP

I use two views:

  1. Built-in XGBoost feature importance (gain-based) - which features the model splits on most.

  2. SHAP summary plot - direction and magnitude of each feature’s impact on individual predictions.

I compute SHAP on a 2,000-row sample of the test fold to keep runtime reasonable. TreeExplainer computes exact SHAP values for every row it explains, so sampling only limits how many rows appear in the summary plot - the per-row attributions themselves are not approximated.

Code
# Built-in feature importance.
imp = pd.DataFrame({
    "feature":    X_train.columns,
    "importance": final_xgb_val_stage.feature_importances_,
}).sort_values("importance", ascending=False)

plt.figure(figsize=(9, 7))
plt.barh(imp["feature"].head(20)[::-1], imp["importance"].head(20)[::-1], color="steelblue")
plt.title("XGBoost feature importance (top 20)")
plt.xlabel("Gain importance")
plt.tight_layout(); plt.show()

print("Top 15 features by importance:")
print(imp.head(15).to_string(index=False))

Top 15 features by importance:
                      feature  importance
               abs_median_age    0.172552
                         Beds    0.115216
                        Baths    0.114072
     abs_median_income_weekly    0.107300
ptype_Apartment / Unit / Flat    0.050002
                 ptype_Studio    0.049840
                         Year    0.048256
                 LandSize_sqm    0.043058
                  ptype_House    0.038090
                    Longitude    0.023144
                   Car_Spaces    0.021492
                     Latitude    0.020225
                  ptype_Villa    0.017725
            ptype_Vacant land    0.017600
                  Suburb_freq    0.016892

Observations on feature importance (gain-based):

  • abs_median_age dominates (17.3%) - older established suburbs systematically command higher prices.

  • Beds (11.5%) and Baths (11.4%) tie for second - core property attributes.

  • abs_median_income_weekly (10.7%) - suburb wealth as a price driver.

  • Year (4.8%) ranks 7th - the model uses it, but it is not dominant: the time feature mostly captures the average market shift over the sample window rather than property-to-property variation.

  • ptype_Apartment, ptype_Studio, ptype_House feature in top 9 - property type one-hots are pulling their weight.

  • Suburb_freq ranks 15th (1.7%) - lower than expected. Geographic signal is being absorbed by Latitude/Longitude and Distance_to_CBD instead.

Code
# SHAP analysis on a sample of the test fold.
sample_idx = np.random.RandomState(0).choice(len(X_test), size=min(2000, len(X_test)), replace=False)
X_shap = X_test.iloc[sample_idx]

explainer = shap.TreeExplainer(final_xgb_val_stage)
shap_values = explainer.shap_values(X_shap)

print(f"SHAP computed on {len(X_shap):,} test rows.")

# Summary plot.
shap.summary_plot(shap_values, X_shap, max_display=15, show=False)
plt.title("SHAP summary - XGBoost on test sample", pad=20)
plt.tight_layout(); plt.show()
SHAP computed on 2,000 test rows.

Observations on SHAP summary:

  • Distance_to_CBD_km: high values (red, far from CBD) push prediction up, low values push down. Interesting reversal - far-from-CBD includes peninsula and acreage premiums.

  • Beds: high values push up, low push down. Clean monotonic effect.

  • abs_median_age: high values push prediction up. Older suburbs are more expensive.

  • LandSize_sqm: high values push up significantly with long right tail - big lots add a lot.

  • abs_median_income_weekly: high income pushes up - same direction as expected.

  • Latitude (negative numbers in Melbourne): high values (less negative = further north) tend to push prediction down. This captures the south-Melbourne premium.

  • ptype_Apartment dummy = 1 strongly pulls prediction down.

  • Suburb_freq: high frequency (densely-traded suburbs) pulls prediction slightly down - consistent with growth-corridor outer suburbs being cheap and high-volume.

11. Final Model and For Sale Inference

I retrain the winning XGBoost on the entire Sold dataset (train + val + test combined) using the hyperparameters chosen on validation. This gives the production model the maximum possible training signal. The test metrics in Section 9 remain valid as the unbiased estimate of production performance.

Two additional quantile XGBoost models (q=0.1 and q=0.9) are trained with the same configuration to produce an 80% prediction interval per listing.
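`reg:quantileerror` minimizes the pinball (quantile) loss, L_q(y, ŷ) = max(q·(y − ŷ), (q − 1)·(y − ŷ)), whose asymmetric penalty is what drags the q=0.1 model toward the lower tail. A small numpy illustration with made-up numbers:

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss: the objective behind reg:quantileerror."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y = np.array([100.0])
# At q=0.1, over-predicting by 10 costs 9x more than under-predicting by 10,
# which pushes the fitted curve down toward the 10th percentile.
print(pinball_loss(y, np.array([110.0]), 0.1))  # 9.0
print(pinball_loss(y, np.array([90.0]),  0.1))  # 1.0
```

Symmetrically, q=0.9 penalizes under-prediction 9x more, pushing that model toward the upper tail.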

For Sale inference uses Year=2026, Month=5 (already injected in Section 3.2).

Deal-signal logic for listings with an asking price:

  • Predicted < 0.9 × Asking → Overpriced

  • Predicted > 1.1 × Asking → Good Deal

  • Otherwise → Fair

Code
# Retrain final point-estimate model on full Sold (train + val + test).
X_full = pd.concat([X_train, X_val, X_test], axis=0, ignore_index=True)
y_full = np.concatenate([y_train, y_val, y_test])

best_params = {
    "n_estimators":     int(best_xgb_params["best_iter"]),
    "max_depth":        int(best_xgb_params["max_depth"]),
    "learning_rate":    best_xgb_params["learning_rate"],
    "subsample":        best_xgb_params["subsample"],
    "colsample_bytree": best_xgb_params["colsample_bytree"],
    "tree_method":      "hist",
    "random_state":     0,
    "n_jobs":           -1,
    "verbosity":        0,
}

final_model = xgb.XGBRegressor(**best_params).fit(X_full, y_full)
print(f"Final point model trained on {len(X_full):,} rows.")

# Train quantile models for 80% prediction interval (q=0.1, q=0.9).
quantile_params = {**best_params, "objective": "reg:quantileerror"}

print("\nTraining quantile model q=0.1 (lower bound)...")
model_q10 = xgb.XGBRegressor(**quantile_params, quantile_alpha=0.1).fit(X_full, y_full)

print("Training quantile model q=0.9 (upper bound)...")
model_q90 = xgb.XGBRegressor(**quantile_params, quantile_alpha=0.9).fit(X_full, y_full)

print("All three models trained.")
Final point model trained on 114,594 rows.

Training quantile model q=0.1 (lower bound)...
Training quantile model q=0.9 (upper bound)...
All three models trained.
Code
# Predict on For Sale.
y_pred_fs     = final_model.predict(X_fs)
y_pred_fs_q10 = model_q10.predict(X_fs)
y_pred_fs_q90 = model_q90.predict(X_fs)

# Inverse-transform from log scale to AUD.
df_forsale["Predicted_Price"]       = np.expm1(y_pred_fs).round(-3)        # round to nearest $1k
df_forsale["Predicted_Price_Lower"] = np.expm1(y_pred_fs_q10).round(-3)
df_forsale["Predicted_Price_Upper"] = np.expm1(y_pred_fs_q90).round(-3)

# Sanity: enforce ordering lower <= point <= upper (rare floating-point inversions).
df_forsale["Predicted_Price_Lower"] = df_forsale[["Predicted_Price_Lower", "Predicted_Price"]].min(axis=1)
df_forsale["Predicted_Price_Upper"] = df_forsale[["Predicted_Price_Upper", "Predicted_Price"]].max(axis=1)

print("For Sale predicted price summary:")
print(df_forsale[["Predicted_Price", "Predicted_Price_Lower", "Predicted_Price_Upper"]]
      .describe(percentiles=[.05, .25, .5, .75, .95]).round(0))

# Interval width as a quality check.
df_forsale["Interval_Width_Pct"] = (
    (df_forsale["Predicted_Price_Upper"] - df_forsale["Predicted_Price_Lower"])
    / df_forsale["Predicted_Price"] * 100
)
print(f"\nMedian 80% interval width: {df_forsale['Interval_Width_Pct'].median():.1f}% of predicted price")
For Sale predicted price summary:
       Predicted_Price  Predicted_Price_Lower  Predicted_Price_Upper
count          38419.0                38419.0                38419.0
mean         1021647.0               820242.0              1253588.0
std           740877.0               456156.0              1026249.0
min            72000.0                70000.0               208000.0
5%            424000.0               354000.0               519000.0
25%           665000.0               601000.0               726000.0
50%           776000.0               699000.0               867000.0
75%          1126000.0               897000.0              1407000.0
95%          2367000.0              1653000.0              3187100.0
max         13469000.0             11012000.0             15548000.0

Median 80% interval width: 30.3% of predicted price
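A natural follow-up check (not run in this report) is empirical coverage: an 80% interval should contain roughly 80% of actual prices on held-out data. A sketch on synthetic numbers - the distribution and bounds here are illustrative, not model outputs:

```python
import numpy as np

def interval_coverage(y_true, lower, upper):
    """Fraction of actuals inside [lower, upper]; for an 80% interval
    (q=0.1 to q=0.9) this should land near 0.80 on held-out data."""
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

rng = np.random.default_rng(0)
y = rng.normal(1_000_000, 200_000, size=10_000)  # synthetic "actual" prices
# True 10th/90th percentiles of this distribution stand in for model bounds.
lo = 1_000_000 + 200_000 * -1.2816
hi = 1_000_000 + 200_000 *  1.2816
print(f"{interval_coverage(y, lo, hi):.3f}")  # ≈ 0.80
```

On the real pipeline, running this on the test fold would validate the quantile models before the intervals are shown on the dashboard.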
Code
# Deal signal for listings that have an asking price.
def classify_deal(row):
    asking = row["Numeric_Price"]
    if pd.isna(asking):
        return "No Asking Price"
    pred = row["Predicted_Price"]
    if pred < 0.9 * asking:
        return "Overpriced"
    if pred > 1.1 * asking:
        return "Good Deal"
    return "Fair"

df_forsale["Deal_Signal"] = df_forsale.apply(classify_deal, axis=1)

print("Deal signal distribution:")
print(df_forsale["Deal_Signal"].value_counts())
print("\n% of For Sale listings:")
print((df_forsale["Deal_Signal"].value_counts(normalize=True) * 100).round(1))

# Sample of each category.
print("\n=== Sample listings per category ===")
for cat in ["Good Deal", "Fair", "Overpriced", "No Asking Price"]:
    sub = df_forsale[df_forsale["Deal_Signal"] == cat]
    if len(sub) == 0:
        continue
    print(f"\n--- {cat} ({len(sub):,} listings) ---")
    cols = ["Suburb", "Property_Type", "Beds", "Numeric_Price",
            "Predicted_Price", "Predicted_Price_Lower", "Predicted_Price_Upper"]
    print(sub[cols].head(3).to_string(index=False))
Deal signal distribution:
Deal_Signal
Fair               18299
Overpriced          8497
Good Deal           6289
No Asking Price     5334
Name: count, dtype: int64

% of For Sale listings:
Deal_Signal
Fair               47.6
Overpriced         22.1
Good Deal          16.4
No Asking Price    13.9
Name: proportion, dtype: float64

=== Sample listings per category ===

--- Good Deal (6,289 listings) ---
   Suburb Property_Type  Beds  Numeric_Price  Predicted_Price  Predicted_Price_Lower  Predicted_Price_Upper
BRUNSWICK         House   4.0      1000000.0        2041000.0              1450000.0              2601000.0
BRUNSWICK         House   3.0      1190000.0        1310000.0              1009000.0              1538000.0
  BULLEEN         House   4.0      1195000.0        1461000.0              1245000.0              1830000.0

--- Fair (18,299 listings) ---
        Suburb           Property_Type  Beds  Numeric_Price  Predicted_Price  Predicted_Price_Lower  Predicted_Price_Upper
BRUNSWICK EAST                   House   3.0      1270000.0        1283000.0              1021000.0              1835000.0
       ROSANNA Apartment / Unit / Flat   2.0       520000.0         570000.0               471000.0               723000.0
       ROSANNA                   House   3.0      1360000.0        1288000.0              1037000.0              1445000.0

--- Overpriced (8,497 listings) ---
        Suburb           Property_Type  Beds  Numeric_Price  Predicted_Price  Predicted_Price_Lower  Predicted_Price_Upper
         RHYLL             Vacant land   0.0       600000.0         491000.0               441000.0               664000.0
WEST MELBOURNE                   House   2.0      1350000.0        1161000.0               898000.0              1369000.0
       CARLTON Apartment / Unit / Flat   2.0       377500.0         318000.0               257000.0               591000.0

--- No Asking Price (5,334 listings) ---
     Suburb Property_Type  Beds  Numeric_Price  Predicted_Price  Predicted_Price_Lower  Predicted_Price_Upper
   HAWTHORN         House   3.0            NaN        1983000.0              1186000.0              2323000.0
    IVANHOE         House   4.0            NaN        1902000.0              1493000.0              2609000.0
MANOR LAKES   Vacant land   0.0            NaN         388000.0               361000.0               413000.0
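For the weekly pipeline, the row-wise `apply` in `classify_deal` can be replaced by a vectorized `np.select` with the same thresholds - a sketch assuming the same column names:

```python
import numpy as np
import pandas as pd

def deal_signal_vectorized(df: pd.DataFrame) -> pd.Series:
    """Vectorized equivalent of classify_deal (same ±10% thresholds)."""
    asking, pred = df["Numeric_Price"], df["Predicted_Price"]
    conditions = [
        asking.isna(),          # checked first: NaN comparisons are False anyway
        pred < 0.9 * asking,
        pred > 1.1 * asking,
    ]
    choices = ["No Asking Price", "Overpriced", "Good Deal"]
    return pd.Series(np.select(conditions, choices, default="Fair"), index=df.index)

demo = pd.DataFrame({
    "Numeric_Price":   [1_000_000, 1_000_000, 1_000_000, np.nan],
    "Predicted_Price": [  850_000, 1_050_000, 1_200_000, 500_000],
})
print(deal_signal_vectorized(demo).tolist())
# ['Overpriced', 'Fair', 'Good Deal', 'No Asking Price']
```

On ~38k rows the difference is negligible, but on a growing weekly dataset the vectorized form avoids the per-row Python overhead of `apply`.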

Observations on inference results:

  • Predicted price: median $776k (matches 2026 market level confirmed in EDA Section 6), mean $1.02M (typical right-skew from luxury tail). Range $72k to $13.5M.

  • 80% interval width: median 30.3% of predicted price. Tight enough to be informative, wide enough to capture true uncertainty.

  • Deal signal distribution:

    • Fair: 47.6% (18,299 listings) - asking and predicted agree within ±10%
    • Overpriced: 22.1% (8,497) - predicted sits more than 10% below asking
    • Good Deal: 16.4% (6,289) - predicted sits more than 10% above asking
    • No Asking Price: 13.9% (5,334) - Contact Agent / EOI / Private Sale
  • The 22% Overpriced share exceeds the 16% Good Deal share - consistent with sellers’ well-known tendency to anchor asking prices high.

  • Sample inspection: predictions look reasonable. The BRUNSWICK 4-bed House predicted at $2M against a $1M asking is a plausible “Good Deal” flag (or a stale listing). The RHYLL Vacant land asking $600k vs predicted $491k, flagged as Overpriced, is consistent with that exact listing’s known $630k → $600k price-reduction history (noted in EDA Section 2).

12. Save Artifacts and Next Steps

I persist everything needed for the FastAPI dashboard and the weekly production pipeline:

  • 3 model files (point, lower, upper quantile)
  • Preprocessing parameters (frequency map, numeric medians, property type column list)
  • Feature list (column order for inference)
  • Metrics summary (validation and test)
  • For Sale predictions with intervals and deal signals
Code
# 1. Models.
joblib.dump(final_model, MODEL_DIR / "model.pkl")
joblib.dump(model_q10,   MODEL_DIR / "model_q10.pkl")
joblib.dump(model_q90,   MODEL_DIR / "model_q90.pkl")
print(f"Saved 3 models to {MODEL_DIR}/")

# 2. Preprocessing parameters.
preproc_params = {
    "suburb_frequency_map":  suburb_encoder.freq_,
    "property_type_columns": PROPERTY_TYPE_COLUMNS,
    "numeric_medians":       NUMERIC_MEDIANS.to_dict(),
    "numeric_features":      NUMERIC_FEATURES,
    "time_features":         TIME_FEATURES,
    "flag_features":         FLAG_FEATURES,
    "categorical_features":  CATEGORICAL_FEATURES,
    "all_feature_columns":   X_train.columns.tolist(),
}
joblib.dump(preproc_params, MODEL_DIR / "preprocessor.pkl")
print(f"Saved preprocessor.pkl ({len(preproc_params)} keys).")

# 3. Metrics summary.
metrics_summary = {
    "data_snapshot_date":   decisions["data_snapshot_date"],
    "n_train":              len(X_train),
    "n_val":                len(X_val),
    "n_test":               len(X_test),
    "validation_metrics": {
        "linear_ridge":  m_ridge_val,
        "random_forest": {k: best_rf_params[k] for k in ["rmse_aud", "mae_aud", "mape", "r2"]},
        "xgboost":       {k: best_xgb_params[k] for k in ["rmse_aud", "mae_aud", "mape", "r2"]},
    },
    "test_metrics_xgboost": m_test,
    "best_hyperparameters": {
        "n_estimators":     int(best_xgb_params["best_iter"]),
        "max_depth":        int(best_xgb_params["max_depth"]),
        "learning_rate":    best_xgb_params["learning_rate"],
        "subsample":        best_xgb_params["subsample"],
        "colsample_bytree": best_xgb_params["colsample_bytree"],
    },
    "winner": "XGBoost",
}
with open(MODEL_DIR / "metrics.json", "w") as f:
    json.dump(metrics_summary, f, indent=2, default=str)
print(f"Saved metrics.json.")

# 4. For Sale predictions parquet for the FastAPI dashboard.
out_cols = [
    "Property_ID", "Status", "Suburb", "Postcode", "Property_Type",
    "Beds", "Baths", "Car_Spaces", "LandSize_sqm",
    "Latitude", "Longitude", "Distance_to_CBD_km",
    "Raw_Price", "Numeric_Price",
    "Predicted_Price", "Predicted_Price_Lower", "Predicted_Price_Upper",
    "Interval_Width_Pct", "Deal_Signal",
    "is_land", "out_of_metro", "is_new_build",
    "Last_Updated",
]
predictions_path = DATA_DIR / "predictions_for_sale.parquet"
df_forsale[out_cols].to_parquet(predictions_path, index=False)
print(f"Saved predictions to {predictions_path}")
print(f"  Rows: {len(df_forsale):,}")
print(f"  Columns: {len(out_cols)}")
Saved 3 models to models/
Saved preprocessor.pkl (8 keys).
Saved metrics.json.
Saved predictions to report_data\predictions_for_sale.parquet
  Rows: 38,419
  Columns: 23

Next steps - production pipeline

This report is the development artifact. The same logic must run weekly on updated data (../data/melbourne_price_data_enriched.csv). To enable that, I will refactor the EDA cleaning (EDA Sections 2-3) and the ML training and inference (Sections 3-11 of this report) into three standalone Python scripts plus an orchestrator:

  • clean.py - reproduces EDA Sections 2-3 (data quality + outlier handling). Output: cleaned_data.parquet.
  • train.py - reproduces ML Sections 3-10 (feature engineering, split, fit, evaluate). Output: model.pkl, preprocessor.pkl, metrics.json.
  • predict.py - reproduces ML Section 11 (For Sale inference). Output: predictions_for_sale.parquet.
  • weekly_update.py - orchestrates the three scripts in sequence.

The FastAPI dashboard reads predictions_for_sale.parquet directly, so updating the dashboard is just a matter of re-running the weekly pipeline.
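The orchestrator can stay thin. A sketch of what weekly_update.py might look like - subprocess-based, stop-on-first-failure so a bad scrape never overwrites good predictions; this is an assumption about the future script, not existing code:

```python
import subprocess
import sys

# Sketch of weekly_update.py: run the three stage scripts in order and
# stop on the first non-zero exit code.
PIPELINE = ["clean.py", "train.py", "predict.py"]

def run_pipeline(scripts=PIPELINE) -> bool:
    """Run each script with the current interpreter; abort on failure."""
    for script in scripts:
        print(f"Running {script} ...")
        result = subprocess.run([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed (exit code {result.returncode}); aborting.")
            return False
    return True
```

A cron or Task Scheduler entry then only needs to invoke this one script; the FastAPI dashboard picks up the refreshed parquet on its next read.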

Known limitations carried forward:

  • Model underperforms in two segments: luxury (>$2.5M, 23% MAPE) and budget (<$500k, 17% MAPE). The dashboard should show wider uncertainty for listings in these ranges.
  • Cold-start suburbs (12 rows currently) get frequency-encoded as 0; Postcode and Lat/Lng carry the geographic signal as fallback.
  • New-build supply (30% of For Sale) is captured by is_new_build but training data on these types remains thin. Predictions on New House & Land may be less reliable than on resale categories.
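The cold-start fallback in the second limitation amounts to a `map` + `fillna(0)`; a minimal sketch with made-up suburbs (the real frequency map is stored in preprocessor.pkl):

```python
import pandas as pd

# Minimal sketch of frequency encoding with a cold-start fallback:
# suburbs unseen in training map to 0, leaving Postcode and Lat/Lng to
# carry the geographic signal for those rows.
train_suburbs = pd.Series(["RICHMOND", "RICHMOND", "RICHMOND", "FITZROY"])
freq_map = train_suburbs.value_counts(normalize=True)  # RICHMOND 0.75, FITZROY 0.25

new_listings = pd.Series(["RICHMOND", "BRAND-NEW-ESTATE"])  # second suburb unseen
suburb_freq = new_listings.map(freq_map).fillna(0.0)
print(suburb_freq.tolist())  # [0.75, 0.0]
```

Zero is a deliberately conservative fallback: it tells the model "no trading-volume signal" rather than inventing one.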