Melbourne Property Market - Exploratory Data Analysis
Author
Hoang Son Lai
Published
May 12, 2026
Introduction
This report performs end-to-end EDA on Melbourne property listings to prepare a clean dataset for downstream price prediction.
Data snapshot: this report uses data updated on 2 May 2026.
The dataset combines two distinct groups of records:
Sold: completed transactions with a transaction Date and (usually) a Numeric_Price. These are the future training labels.
For Sale: active listings with no Date and no reliable price. These are the records I will price-predict in the ML notebook.
Because the two groups play fundamentally different roles in the pipeline, I split them early and analyze each on its own, and also side-by-side, throughout the report.
Workflow: setup, data quality, outlier handling, univariate, bivariate, time series, geospatial, Sold-vs-ForSale drift, summary, save cleaned parquet.
1. Data Loading & Initial Inspection
I read the CSV, normalize a possible BOM in the first column name, then inspect the schema. The very first thing I look at is the Status distribution - this confirms the two-group structure that motivates the entire analysis.
Code
import warnings
warnings.filterwarnings("ignore")

import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 200)
sns.set_style("whitegrid")
plt.rcParams["figure.dpi"] = 100
plt.rcParams["figure.figsize"] = (10, 5)

# To convert to html: quarto render report/eda_report.ipynb
# Load the raw CSV. (The exact file path is not shown in the rendered
# notebook; the name below is an assumption - adjust to the actual data file.)
df = pd.read_csv(Path("data") / "melbourne_properties.csv")

# Strip BOM if present (CSVs saved from Excel often carry one on the first column)
df.columns = [c.lstrip("\ufeff") for c in df.columns]

print(f"Shape: {df.shape}")
print(f"Memory: {df.memory_usage(deep=True).sum()/1e6:.1f} MB\n")
print("Status distribution:")
print(df["Status"].value_counts())
print(f"\nColumns ({len(df.columns)}):")
print(list(df.columns))
Shape: (163466, 27)
Memory: 65.7 MB
Status distribution:
Status
Sold 124959
For Sale 38507
Name: count, dtype: int64
Columns (27):
['Property_ID', 'Status', 'Full_Address', 'Suburb', 'Postcode', 'Property_Type', 'Date', 'Beds', 'Baths', 'Car_Spaces', 'LandSize_sqm', 'Propertycount', 'Raw_Price', 'Numeric_Price', 'Latitude', 'Longitude', 'Distance_to_CBD_km', 'URL', 'Last_Updated', 'abs_median_income_weekly', 'abs_median_age', 'abs_population', 'crime_offence_count', 'crime_suburb_ref', 'crime_rate_per_100k', 'dist_nearest_train_km', 'Enriched_Date']
I parse Date into a real datetime column and extract Year and Month - these will be central features in the ML notebook because they let the model learn price inflation. I then split into df_sold and df_forsale. I also confirm my assumption that For Sale rows have no Date (it’s the defining structural difference between the two groups).
Code
# Format in the file is "10 Dec 2012"
df["Date_parsed"] = pd.to_datetime(df["Date"], format="%d %b %Y", errors="coerce")
df["Year"] = df["Date_parsed"].dt.year
df["Month"] = df["Date_parsed"].dt.month

df_sold = df[df["Status"] == "Sold"].copy()
df_forsale = df[df["Status"] == "For Sale"].copy()

print("Sold - Date coverage:")
print(f"  min : {df_sold['Date_parsed'].min()}")
print(f"  max : {df_sold['Date_parsed'].max()}")
print(f"  NaN : {df_sold['Date_parsed'].isna().sum():,} "
      f"({df_sold['Date_parsed'].isna().mean()*100:.1f}%)")

print("\nFor Sale - Date coverage (expected: all NaN):")
print(f"  NaN : {df_forsale['Date_parsed'].isna().sum():,} "
      f"({df_forsale['Date_parsed'].isna().mean()*100:.1f}%)")

print("\nYear range in Sold:")
print(df_sold["Year"].describe().round(0))
Sold - Date coverage:
min : 2001-10-13 00:00:00
max : 2026-05-02 00:00:00
NaN : 109 (0.1%)
For Sale - Date coverage (expected: all NaN):
NaN : 38,507 (100.0%)
Year range in Sold:
count 124850.0
mean 2024.0
std 3.0
min 2001.0
25% 2023.0
50% 2025.0
75% 2025.0
max 2026.0
Name: Year, dtype: float64
Observations from the output:
For Sale Date is 100% NaN, as expected. This confirms the structural assumption that For Sale listings carry no transaction date. This is why the ML notebook will inject the current Year and Month at inference time to obtain “today-priced” predictions.
Sold Date spans 13 Oct 2001 to 2 May 2026, a window of roughly 24.5 years.
Only 109 Sold rows (0.1%) have a missing Date. These will be excluded from time-aware training because Year and Month features cannot be derived for them.
Year distribution is heavily concentrated in recent years. The 25th percentile is 2023, median is 2025, and 75th percentile is 2025. At least half of Sold transactions occurred in 2025 alone. The bulk of the data sits in 2023–2026 and the early-2000s tail is thin.
Implications for the ML pipeline:
Time-based 70/15/15 split will allocate roughly:
Train: everything up to a cutoff somewhere in early or mid 2025.
Validation: a few months in 2025.
Test: late 2025 to May 2026.
The exact cutoff dates will be computed in the ML notebook from sorted Date_parsed.
Recency bias is a feature, not a bug. The model will be used to price For Sale listings at the current market level (May 2026), so having most training data from 2023–2026 is advantageous. The model will be heavily informed by recent price dynamics.
The early-2000s tail (2001–2015) is sparse and will mostly land in the train fold. Median price by year (Section 6) will likely be noisy in those early years. I will filter low-volume years in the time-series visualization to avoid misleading trend lines.
No data leakage risk from the 2 May 2026 snapshot date. The latest Sold date equals the snapshot date, confirming the data was scraped fresh on that day. For Sale listings represent the supply standing on that exact date.
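To make the 70/15/15 split mechanics concrete, here is a minimal sketch of how the cutoffs could be derived from sorted Date_parsed. The authoritative computation lives in the ML notebook; the fractions below are over rows, not calendar time.
Code
# Sketch: derive time-based 70/15/15 cutoffs from sorted transaction dates.
dates = df_sold["Date_parsed"].dropna().sort_values().reset_index(drop=True)
train_cut = dates.iloc[int(len(dates) * 0.70)]
val_cut = dates.iloc[int(len(dates) * 0.85)]

train = df_sold[df_sold["Date_parsed"] <= train_cut]
val = df_sold[(df_sold["Date_parsed"] > train_cut) & (df_sold["Date_parsed"] <= val_cut)]
test = df_sold[df_sold["Date_parsed"] > val_cut]
print(f"train <= {train_cut.date()}: {len(train):,} | "
      f"val <= {val_cut.date()}: {len(val):,} | test: {len(test):,}")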
2. Data Quality Check
I check existence (missingness, duplicates) before validity (price availability, type sanity), because IQRs or histograms on a NaN-heavy column give misleading results.
The three checks are missingness per group, duplicate Property_IDs, and Property_Type sanity. I also inspect Raw_Price for NaN-price rows to confirm whether any can be recovered. At the end of the section I remove rows identified as unrecoverable.
Missingness report:
Code
# Build a per-column NaN report, sorted by missing count, with both absolute and percentage figures.
def missing_report(d, name):
    miss = d.isnull().sum()
    miss = miss[miss > 0].sort_values(ascending=False)  # only show columns that have at least one NaN
    pct = (miss / len(d) * 100).round(2)
    out = pd.DataFrame({"missing": miss, "pct": pct})
    out.index.name = f"{name} (n={len(d):,})"
    return out

print("=== Missingness - Sold ===")
print(missing_report(df_sold, "Sold"))
print("\n=== Missingness - For Sale ===")
print(missing_report(df_forsale, "For Sale"))
=== Missingness - Sold ===
missing pct
Sold (n=124,959)
Numeric_Price 10334 8.27
Date 109 0.09
Year 109 0.09
Date_parsed 109 0.09
Month 109 0.09
crime_offence_count 19 0.02
crime_suburb_ref 19 0.02
crime_rate_per_100k 19 0.02
=== Missingness - For Sale ===
missing pct
For Sale (n=38,507)
Date_parsed 38507 100.00
Month 38507 100.00
Year 38507 100.00
Date 38464 99.89
Numeric_Price 5421 14.08
LandSize_sqm 87 0.23
Property_Type 87 0.23
Raw_Price 87 0.23
crime_offence_count 7 0.02
crime_suburb_ref 7 0.02
crime_rate_per_100k 7 0.02
Observations from the missingness report:
Sold Numeric_Price is missing in 8.27% of rows. The Price Check below confirms all of these are “Price Withheld”.
Sold Date / Year / Month missing in 109 rows (0.09%). These cannot contribute to time-aware ML training and will be removed.
19 Sold rows have missing crime_rate_per_100k (suburb not in crime reference). These will be median-imputed in ML preprocessing.
For Sale Date / Year / Month are 100% missing as expected.
For Sale Numeric_Price missing in 5,421 rows (14.08%). The Price Check below inspects Raw_Price for these.
87 For Sale rows have simultaneous NaNs in Property_Type, LandSize_sqm and Raw_Price. These are broken scrape records and will be removed.
Duplicate check
Code
# Check uniqueness of Property_ID at three levels: full dataset, within Sold, within For Sale.
dup_total = df["Property_ID"].duplicated().sum()
dup_sold = df_sold["Property_ID"].duplicated().sum()
dup_fs = df_forsale["Property_ID"].duplicated().sum()
print(f"Duplicate Property_IDs in full dataset:  {dup_total:,}")
print(f"Duplicate Property_IDs within Sold:      {dup_sold:,}")
print(f"Duplicate Property_IDs within For Sale:  {dup_fs:,}")

# Cross-group: any property that was sold and is now relisted? Important for leakage analysis.
sold_ids = set(df_sold["Property_ID"])
fs_ids = set(df_forsale["Property_ID"])
overlap = sold_ids & fs_ids
print(f"\nProperty_IDs appearing in BOTH Sold and For Sale: {len(overlap):,}")

# Inspect every row of the duplicate so we can decide which to keep.
dup_in_fs = df_forsale[df_forsale["Property_ID"].duplicated(keep=False)]
if len(dup_in_fs) > 0:
    print(f"\nThe duplicate For Sale Property_ID:")
    cols = ["Property_ID", "Status", "Suburb", "Property_Type", "Raw_Price", "Last_Updated"]
    print(dup_in_fs[cols].to_string(index=False))
Duplicate Property_IDs in full dataset: 1
Duplicate Property_IDs within Sold: 0
Duplicate Property_IDs within For Sale: 1
Property_IDs appearing in BOTH Sold and For Sale: 0
The duplicate For Sale Property_ID:
Property_ID Status Suburb Property_Type Raw_Price Last_Updated
2020548366 For Sale RHYLL Vacant land $630,000 2026-03-17
2020548366 For Sale RHYLL Vacant land $600,000 2026-05-02
Zero duplicates within Sold and zero overlap between Sold and For Sale.
The single For Sale duplicate (Property_ID 2020548366, RHYLL vacant land) has two rows differing in price ($630,000 vs $600,000) and Last_Updated (2026-03-17 vs 2026-05-02). The newer row reflects a price reduction and is the listing’s current state. I keep the latest and drop the older.
Price Check
Code
# Confirm that 100% of Sold NaN-prices are "Price Withheld".
print("=== Sold - Raw_Price values when Numeric_Price is NaN ===")
sold_no_price = df_sold[df_sold["Numeric_Price"].isna()]
print(f"Rows with NaN Numeric_Price: {len(sold_no_price):,}")
print("\nMost common Raw_Price values:")
print(sold_no_price["Raw_Price"].value_counts(dropna=False))

# Inspect the 5,421 For Sale rows lacking a numeric price.
# This tells us whether any of them carry recoverable information.
print("\n=== For Sale - Raw_Price values when Numeric_Price is NaN ===")
fs_no_price = df_forsale[df_forsale["Numeric_Price"].isna()]
print(f"Rows with NaN Numeric_Price: {len(fs_no_price):,}")
print("\nTop 20 most common Raw_Price values:")
print(fs_no_price["Raw_Price"].value_counts(dropna=False).head(20))
=== Sold - Raw_Price values when Numeric_Price is NaN ===
Rows with NaN Numeric_Price: 10,334
Most common Raw_Price values:
Raw_Price
Price Withheld 10334
Name: count, dtype: int64
=== For Sale - Raw_Price values when Numeric_Price is NaN ===
Rows with NaN Numeric_Price: 5,421
Top 20 most common Raw_Price values:
Raw_Price
Contact Agent 2193
CONTACT AGENT 380
Private Sale 147
Price on Application 127
Expressions of Interest 108
NaN 87
Contact agent 63
Contact Agent! 56
Call before it's SOLD !! 47
Expressions of interest 45
Private sale 39
Auction 36
contact agent 29
Contact Agent !! 27
THE DEAL: Expressions of Interest 25
AUCTION 24
Expression of Interest 18
Price Upon Application 17
ON HOLD 17
contact Agent 17
Name: count, dtype: int64
Sold side: all 10,334 NaN-price rows are “Price Withheld”. These have no label so won’t contribute to ML training, but I keep them for descriptive analysis to avoid biasing the supply-side picture.
For Sale side: none of the 5,421 NaN-price rows are price ranges. Ranges like “$1,100,000 - $1,200,000” do exist in For Sale but the source parser already collapsed them into Numeric_Price as midpoints. The remaining NaN-price rows are non-numeric expressions (“Contact Agent”, “Private Sale”, “Expressions of Interest”, “Auction”). None can be turned into a number.
Implications:
No price-recovery step needed; Numeric_Price is as complete as the data allows.
The 5,421 rows still have full physical and locational features, so they remain valid ML inference inputs. They simply can’t participate in the predicted-vs-asking deal-signal comparison.
Casing inconsistency in Raw_Price (“Contact Agent” vs “CONTACT AGENT” etc.) affects only the text column, not Numeric_Price, so I don’t normalize it.
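For reference, a minimal re-implementation of the range-to-midpoint collapse attributed to the upstream parser. This is hypothetical - the actual parser is not part of this notebook - but it illustrates the behavior described above.
Code
import re

def price_to_numeric(raw):
    """Hypothetical sketch of the upstream parser: collapse
    '$1,100,000 - $1,200,000' style ranges to their midpoint, parse single
    prices, and return None for text like 'Contact Agent'."""
    nums = [float(m.replace(",", "")) for m in re.findall(r"\d[\d,]{3,}", str(raw))]
    if not nums:
        return None
    return sum(nums[:2]) / min(len(nums), 2)

print(price_to_numeric("$1,100,000 - $1,200,000"))  # 1150000.0
print(price_to_numeric("$630,000"))                 # 630000.0
print(price_to_numeric("Contact Agent"))            # None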
Property Type sanity
Code
# Property_Type frequencies per group, plus a long-tail report on rare types.
print("=== Property_Type - Sold ===")
print(df_sold["Property_Type"].value_counts(dropna=False).head(15))
print(f"NaN: {df_sold['Property_Type'].isna().sum()}")

print("\n=== Property_Type - For Sale ===")
print(df_forsale["Property_Type"].value_counts(dropna=False).head(15))
print(f"NaN: {df_forsale['Property_Type'].isna().sum()}")

# Identify categories with very few records across the entire dataset.
# These will be grouped into "Other" later in ML feature engineering.
type_counts_all = df["Property_Type"].value_counts(dropna=False)
rare_threshold = 50
rare_types = type_counts_all[type_counts_all < rare_threshold]
print(f"\nProperty_Type categories with < {rare_threshold} records (long tail): {len(rare_types)}")
print(rare_types)
=== Property_Type - Sold ===
Property_Type
House 89082
Apartment / Unit / Flat 15845
Townhouse 8996
Vacant land 8158
Acreage / Semi-Rural 1305
Rural 512
New House & Land 338
Villa 290
Block of Units 106
New land 102
New Apartments / Off the Plan 60
Studio 48
Farm 21
Development Site 19
Unknown 18
Name: count, dtype: int64
NaN: 0
=== Property_Type - For Sale ===
Property_Type
House 15229
New House & Land 10276
Apartment / Unit / Flat 4772
Townhouse 3357
Vacant land 2843
New Home Designs 637
New Apartments / Off the Plan 517
Acreage / Semi-Rural 280
New land 98
Unknown 90
NaN 87
Villa 86
Rural 76
Block of Units 41
Studio 38
Name: count, dtype: int64
NaN: 87
Property_Type categories with < 50 records (long tail): 13
Property_Type
Development Site 38
Farm 30
Rural Lifestyle 24
Specialist Farm 20
Semi-Detached 18
Terrace 12
Duplex 11
Car Space 8
Penthouse 7
Farmlet 2
Mixed Farming 2
Grazing 1
Livestock 1
Name: count, dtype: int64
Sold has zero NaN Property_Type; For Sale has 87, aligning with the broken-scrape records above.
The composition differs sharply: House is 71% of Sold but only 40% of For Sale, while “New House & Land” is 0.3% of Sold but 27% of For Sale. “New Home Designs” and “New Apartments / Off the Plan” are also disproportionately present in For Sale.
This is a real market phenomenon - new-build packages are heavily marketed during listing but typically resell later under plain “House” or “Apartment” labels. The ML model will see far more new-build inputs at inference than in training. Section 8 quantifies this drift.
The long tail of 13 categories with fewer than 50 records each (Development Site, Farm, Rural Lifestyle, Specialist Farm, Semi-Detached, Terrace, Duplex, Car Space, Penthouse, Farmlet, Mixed Farming, Grazing, Livestock; ~164 records total) will be grouped into “Other” in ML feature engineering. NaN Property_Type values will be replaced with “Unknown” as its own category.
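A minimal sketch of that grouping step as it could look in the ML notebook (the 50-record threshold comes from the report above; the column name Property_Type_grouped is illustrative):
Code
# Sketch of the planned feature-engineering step: long-tail types -> "Other",
# NaN -> "Unknown". Threshold of 50 records matches the long-tail report above.
type_counts = df["Property_Type"].value_counts()
rare = set(type_counts[type_counts < 50].index)

def consolidate_type(t):
    if pd.isna(t):
        return "Unknown"
    return "Other" if t in rare else t

df["Property_Type_grouped"] = df["Property_Type"].apply(consolidate_type)
print(df["Property_Type_grouped"].nunique(), "categories after grouping")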
Apply removals identified in this section
Three classes of rows have been identified as unrecoverable by the checks above:
The 109 Sold rows with a missing Date, which cannot contribute to time-aware ML training.
The 87 broken-scrape rows in For Sale with simultaneous NaNs in Property_Type, LandSize_sqm and Raw_Price, which carry no usable content.
The older copy of the duplicate For Sale Property_ID 2020548366. I keep only the row with the latest Last_Updated.
I apply these removals now so that every section that follows operates on the same cleaned df_sold and df_forsale frames. Outlier-driven removals are deliberately deferred to Section 3.
Code
# 1. Drop Sold rows with a missing parsed Date (cannot derive Year/Month for time-aware ML).
n_before = len(df_sold)
df_sold = df_sold.dropna(subset=["Date_parsed"]).copy()
print(f"Sold: dropped {n_before - len(df_sold):,} rows with missing Date "
      f"({n_before:,} -> {len(df_sold):,}).")

# 2. Drop For Sale rows that are entirely broken scrapes
#    (no Property_Type AND no LandSize AND no Raw_Price).
broken_mask = (df_forsale["Property_Type"].isna()
               & df_forsale["LandSize_sqm"].isna()
               & df_forsale["Raw_Price"].isna())
n_before = len(df_forsale)
df_forsale = df_forsale[~broken_mask].copy()
print(f"For Sale: dropped {n_before - len(df_forsale):,} broken-scrape rows "
      f"({n_before:,} -> {len(df_forsale):,}).")

# 3. Resolve the For Sale duplicate by keeping the most recent Last_Updated.
#    Sort descending on Last_Updated, then drop_duplicates(keep='first') keeps the newest.
df_forsale["Last_Updated_dt"] = pd.to_datetime(df_forsale["Last_Updated"], errors="coerce")
n_before = len(df_forsale)
df_forsale = (df_forsale
              .sort_values("Last_Updated_dt", ascending=False)
              .drop_duplicates(subset="Property_ID", keep="first")
              .drop(columns=["Last_Updated_dt"])
              .reset_index(drop=True))
print(f"For Sale: dropped {n_before - len(df_forsale):,} older duplicate row "
      f"({n_before:,} -> {len(df_forsale):,}).")

print("\nFinal sizes after Section 2 cleaning:")
print(f"  Sold:     {len(df_sold):,}")
print(f"  For Sale: {len(df_forsale):,}")
Sold: dropped 109 rows with missing Date (124,959 -> 124,850).
For Sale: dropped 87 broken-scrape rows (38,507 -> 38,420).
For Sale: dropped 1 older duplicate row (38,420 -> 38,419).
Final sizes after Section 2 cleaning:
Sold: 124,850
For Sale: 38,419
3. Outlier Analysis
I run outlier analysis before the descriptive EDA because outliers distort summary statistics and charts. The cleaning philosophy is cap > flag > drop: dropping is the most destructive and reserved for unrecoverable values like price = $1; capping preserves the row while limiting damage; flagging adds information without removing anything.
I examine four families of outliers: price (Sold only), physical attributes (Beds, Baths, Car_Spaces), LandSize, and geography. I also build an is_land flag because land properties have fundamentally different price dynamics and must be treated separately from residential properties.
Build the is_land flag
A property is land if its Property_Type is one of the land categories (Vacant land, New land, Development Site, Farm, Farmlet, Grazing, Livestock, Mixed Farming, Specialist Farm), OR if it has zero rooms and positive land area (which catches mislabeled rows). The flag is added to both df_sold and df_forsale because outlier rules differ for land vs residential.
Code
# Property types that are explicitly land-related.
LAND_TYPES = {
    "Vacant land", "New land", "Development Site",
    "Farm", "Farmlet", "Grazing", "Livestock",
    "Mixed Farming", "Specialist Farm",
}

def flag_land(d):
    """
    Add an `is_land` column. True if the row is land-related either by type
    or by physical pattern (no rooms but positive land area).
    """
    d = d.copy()
    type_is_land = d["Property_Type"].isin(LAND_TYPES)
    no_rooms = (d["Beds"] == 0) & (d["Baths"] == 0)
    has_land = d["LandSize_sqm"].fillna(0) > 0
    d["is_land"] = (type_is_land | (no_rooms & has_land)).astype(int)
    return d

df_sold = flag_land(df_sold)
df_forsale = flag_land(df_forsale)

print(f"is_land in Sold:     {df_sold['is_land'].sum():,} "
      f"({df_sold['is_land'].mean()*100:.1f}%)")
print(f"is_land in For Sale: {df_forsale['is_land'].sum():,} "
      f"({df_forsale['is_land'].mean()*100:.1f}%)")

# Sanity: cross-tab is_land vs Property_Type to confirm the flag picks up the right rows.
print("\nProperty_Type breakdown of is_land=1 rows (Sold):")
print(df_sold[df_sold["is_land"] == 1]["Property_Type"].value_counts().head(10))
is_land in Sold: 8,433 (6.8%)
is_land in For Sale: 3,059 (8.0%)
Property_Type breakdown of is_land=1 rows (Sold):
Property_Type
Vacant land 8141
New land 100
Rural 95
House 26
Farm 20
Development Site 19
Acreage / Semi-Rural 18
Specialist Farm 7
Rural Lifestyle 3
Unknown 1
Name: count, dtype: int64
Price outliers (Sold only)
Only Sold has reliable transaction prices. I plot raw and log10 scale to confirm the heavy right tail and motivate the log-transform of the ML target. I then inspect global percentiles to set hard floor/ceiling bounds.
Code
sold_priced = df_sold.dropna(subset=["Numeric_Price"]).copy()

# Two-panel histogram: raw scale shows the right tail, log10 shows distribution shape.
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
axes[0].hist(sold_priced["Numeric_Price"] / 1e6, bins=80, color="steelblue", edgecolor="white")
axes[0].set_xlabel("Price (AUD million)"); axes[0].set_title("Sold price - raw scale")
axes[1].hist(np.log10(sold_priced["Numeric_Price"]), bins=80, color="seagreen", edgecolor="white")
axes[1].set_xlabel("log10(Price)"); axes[1].set_title("Sold price - log10 scale")
plt.tight_layout(); plt.show()

print("Global Numeric_Price percentiles (Sold):")
print(sold_priced["Numeric_Price"].describe(percentiles=[.001, .01, .05, .95, .99, .999]).round(0))
Global Numeric_Price percentiles (Sold):
count 114624.0
mean 965566.0
std 1067423.0
min 1.0
0.1% 110000.0
1% 255000.0
5% 378000.0
95% 2100000.0
99% 3650000.0
99.9% 8150000.0
max 140000000.0
Name: Numeric_Price, dtype: float64
The raw histogram is dominated by one extreme bar around zero because of the $140M maximum; the log10 view shows a clean roughly-normal distribution centered around log10 ≈ 5.8 (i.e. ~$630k), which confirms log1p(Price) is the right ML target.
The global percentiles show the tails are sharp: minimum is $1 (clearly not a real transaction), 99.9th percentile is $8.15M, and the maximum is $140M. Per-type IQR ceilings (e.g. House ≈ $2.7M, Apartment ≈ $1.47M) are tight and would discard many legitimate luxury sales, so I use them only as context, not as hard bounds.
I set hard bounds at a $50,000 floor and a $20,000,000 ceiling. This removes only a handful of rows in the deepest tails (30, per the cleaning output later in this section), none of which are recoverable.
Physical attribute outliers
Beds, Baths, Car_Spaces should follow human-scale property logic. Values above 10 are almost always typos. I count suspicious patterns; cleaning is applied in one batch at the end of Section 3.
Code
# Distribution and tail counts on Beds, Baths, Car_Spaces in the Sold set.
print("Beds distribution (Sold):")
print(df_sold["Beds"].describe().round(2))
print(f"  Beds > 10: {(df_sold['Beds'] > 10).sum()} rows")
print(f"  Beds = 0 AND not flagged as land: "
      f"{((df_sold['Beds'] == 0) & (df_sold['is_land'] == 0)).sum()} rows")

print("\nBaths distribution (Sold):")
print(df_sold["Baths"].describe().round(2))
print(f"  Baths > 10: {(df_sold['Baths'] > 10).sum()} rows")
print(f"  Baths > Beds*2 (with Beds > 0): "
      f"{((df_sold['Baths'] > df_sold['Beds']*2) & (df_sold['Beds'] > 0)).sum()} rows")

print("\nCar_Spaces distribution (Sold):")
print(df_sold["Car_Spaces"].describe().round(2))
print(f"  Car_Spaces > 10: {(df_sold['Car_Spaces'] > 10).sum()} rows")

# Same checks on For Sale, since the same caps will be applied to inference inputs.
print("\n--- For Sale ---")
print(f"Beds > 10: {(df_forsale['Beds'] > 10).sum()}")
print(f"Beds = 0 AND not flagged as land: "
      f"{((df_forsale['Beds'] == 0) & (df_forsale['is_land'] == 0)).sum()}")
print(f"Baths > Beds*2 (with Beds > 0): "
      f"{((df_forsale['Baths'] > df_forsale['Beds']*2) & (df_forsale['Beds'] > 0)).sum()}")
print(f"Car_Spaces > 10: {(df_forsale['Car_Spaces'] > 10).sum()}")
Beds distribution (Sold):
count 124850.00
mean 3.14
std 1.29
min 0.00
25% 3.00
50% 3.00
75% 4.00
max 30.00
Name: Beds, dtype: float64
Beds > 10: 56 rows
Beds = 0 AND not flagged as land: 227 rows
Baths distribution (Sold):
count 124850.00
mean 1.75
std 0.85
min 0.00
25% 1.00
50% 2.00
75% 2.00
max 20.00
Name: Baths, dtype: float64
Baths > 10: 18 rows
Baths > Beds*2 (with Beds > 0): 4 rows
Car_Spaces distribution (Sold):
count 124850.00
mean 2.07
std 33.85
min 0.00
25% 1.00
50% 2.00
75% 2.00
max 11951.00
Name: Car_Spaces, dtype: float64
Car_Spaces > 10: 277 rows
--- For Sale ---
Beds > 10: 23
Beds = 0 AND not flagged as land: 95
Baths > Beds*2 (with Beds > 0): 0
Car_Spaces > 10: 99
Beds and Car_Spaces both contain absurd values (Beds max=30, Car_Spaces max=11,951). I cap each at 10. The 227 Sold rows with Beds=0 on non-land are dominated by House (102) and Rural (39) - these will be set to NaN for ML imputation rather than kept as zero, because zero is not a plausible value for a house. The 4 Baths > Beds*2 cases are negligible and will be capped.
LandSize outliers
LandSize has two failure modes. Zero LandSize is normal for apartment-like types (Apartment / Unit / Flat, Studio, Penthouse, Block of Units, Car Space), which typically share or have no individual land, but anomalous for House, Townhouse, Villa, Rural, Acreage, etc. Per-type analysis is needed because acreage can legitimately run to hundreds of thousands of sqm while a House should not exceed roughly 50,000 sqm.
Code
# LandSize percentiles to inform the cap level.
print("LandSize_sqm percentiles (Sold, all rows):")
print(df_sold["LandSize_sqm"].describe(percentiles=[.5, .9, .99, .999]).round(0))

# Zero LandSize on non-apartment, non-land property types.
APT_LIKE = {
    "Apartment / Unit / Flat", "Studio", "Penthouse",
    "Block of Units", "Car Space",
}
bad_zero_sold = (df_sold["LandSize_sqm"] == 0) \
    & (~df_sold["Property_Type"].isin(APT_LIKE)) \
    & (df_sold["is_land"] == 0)
bad_zero_fs = (df_forsale["LandSize_sqm"] == 0) \
    & (~df_forsale["Property_Type"].isin(APT_LIKE)) \
    & (df_forsale["is_land"] == 0)
print(f"\nLandSize=0 on non-apartment, non-land rows:")
print(f"  Sold:     {bad_zero_sold.sum():,}")
print(f"  For Sale: {bad_zero_fs.sum():,}")

# Top-tail look: largest LandSize values among non-land rows.
print("\nTop 10 largest LandSize_sqm among non-land Sold rows:")
print(df_sold[df_sold["is_land"] == 0]
      .nlargest(10, "LandSize_sqm")[["Property_Type", "Suburb", "LandSize_sqm", "Numeric_Price"]]
      .to_string(index=False))
LandSize_sqm percentiles (Sold, all rows):
count 124850.0
mean 5289.0
std 565297.0
min 0.0
50% 448.0
90% 1184.0
99% 68796.0
99.9% 449200.0
max 197433984.0
Name: LandSize_sqm, dtype: float64
LandSize=0 on non-apartment, non-land rows:
Sold: 25,102
For Sale: 7,776
Top 10 largest LandSize_sqm among non-land Sold rows:
Property_Type Suburb LandSize_sqm Numeric_Price
House GRUYERE 197433984.0 2420000.0
House MYRNIONG 17178100.0 1425000.0
House KURUNJANG 8093700.0 412500.0
House SORRENTO 5876000.0 1520000.0
House YERING 2060000.0 6500000.0
Rural YERING 2023400.0 8305000.0
Rural SUNBURY 1841300.0 NaN
House SUNBURY 1529700.0 NaN
Acreage / Semi-Rural CONNEWARRE 1397300.0 NaN
Acreage / Semi-Rural CURLEWIS 1351700.0 NaN
Two serious LandSize problems are visible.
First, 25,102 Sold rows (20% of the data) have LandSize=0 on property types that should have land - dominated by House (18,568) and Townhouse (5,883). This is a systemic scraping gap, not random noise. I set these zeros to NaN now so the gap is explicit rather than disguised as a real measurement; per-Property_Type median imputation is deferred to the ML preprocessor, fit on the train fold only so validation and test rows cannot leak into the imputation statistics.
Second, the top of the LandSize distribution contains clear Property_Type misclassifications: a “House” with 197 million sqm in GRUYERE, another “House” with 17 million sqm in MYRNIONG. For House, the 99th percentile is about 51,800 sqm but the 99.9th is 302,051 sqm - a 6x gap. I apply per-Property_Type caps using the 99th percentile (computed on positive values) for dense residential types such as House/Townhouse/Villa, and the 99.9th percentile for Acreage/Rural types, where extreme values are legitimate. This is tighter than a global cap and respects the natural land-size scale of each type.
Geographic outliers
I define a Melbourne metro envelope of approximately:
Latitude: -38.6 to -37.4
Longitude: 144.4 to 145.7
Properties outside this envelope are not necessarily errors. Domain.com.au lists genuinely regional properties (Geelong, Ballarat, Mornington Peninsula extensions, etc.) that are legitimately outside the inner metro box. I therefore flag rather than drop. The flag (out_of_metro) lets downstream geospatial visualizations focus on the metro core while still keeping regional rows in the modeling set.
Code
# Count and inspect rows outside the Melbourne metro envelope.
LAT_MIN, LAT_MAX = -38.6, -37.4
LON_MIN, LON_MAX = 144.4, 145.7

geo_out_sold = df_sold[(~df_sold["Latitude"].between(LAT_MIN, LAT_MAX))
                       | (~df_sold["Longitude"].between(LON_MIN, LON_MAX))]
geo_out_fs = df_forsale[(~df_forsale["Latitude"].between(LAT_MIN, LAT_MAX))
                        | (~df_forsale["Longitude"].between(LON_MIN, LON_MAX))]

print(f"Rows outside Melbourne metro envelope:")
print(f"  Sold:     {len(geo_out_sold):,} ({len(geo_out_sold)/len(df_sold)*100:.1f}%)")
print(f"  For Sale: {len(geo_out_fs):,} ({len(geo_out_fs)/len(df_forsale)*100:.1f}%)")

# Show a few examples to confirm these look like real regional listings rather than coordinate errors.
print("\nExamples of out-of-metro Sold rows:")
print(geo_out_sold[["Suburb", "Postcode", "Latitude", "Longitude", "Distance_to_CBD_km"]]
      .head(8).to_string(index=False))
Rules applied to both Sold and For Sale (so inference data matches training shape). The price floor/ceiling applies only to Sold because For Sale has no transaction price.
Issue | Action
--- | ---
Price < $50k or > $20M (Sold) | Drop
Beds > 10 | Cap at 10
Baths > Beds*2 when Beds > 0 | Cap at Beds*2, then cap absolute at 10
Car_Spaces > 10 | Cap at 10
Beds = 0 AND not land | Set Beds and Baths to NaN for ML imputation
LandSize = 0 on non-apartment, non-land | Set to NaN for ML imputation
LandSize > per-type cap (99th percentile for House/Townhouse/Villa; 99.9th for Acreage/Rural) | Cap
Lat/Lng outside metro envelope | Flag out_of_metro
Property_Type = NaN | Set to “Unknown”
Per-type LandSize caps are computed once on the Sold side (training-side parameters; For Sale must not influence them) and then applied to both groups. Per-type median imputation of the resulting NaNs is left to the ML preprocessor, fit on the train fold only.
Code
# Per-Property_Type LandSize CAPS only, computed from Sold positive-LandSize rows.
# Imputation of NaN is left to the ML preprocessor (fit on train fold only).
RESIDENTIAL_DENSE = {
    "House", "Townhouse", "Villa", "Semi-Detached",
    "Terrace", "Duplex", "New House & Land",
    "New Apartments / Off the Plan", "New Home Designs",
    "Block of Units",
}
RURAL_LARGE = {"Acreage / Semi-Rural", "Rural", "Rural Lifestyle"}

land_caps = {}
for ptype in df_sold["Property_Type"].dropna().unique():
    sub = df_sold[(df_sold["Property_Type"] == ptype)
                  & (df_sold["LandSize_sqm"] > 0)]["LandSize_sqm"]
    if len(sub) < 10:
        continue
    cap_q = 0.999 if ptype in RURAL_LARGE else 0.99
    land_caps[ptype] = float(sub.quantile(cap_q))

print("Per-Property_Type LandSize caps (top 10 by cap value):")
for k, v in sorted(land_caps.items(), key=lambda x: -x[1])[:10]:
    print(f"  {k:35s} cap={v:>12,.0f}")

def clean_frame(d, is_sold):
    """Apply outlier handling rules. is_sold=True additionally drops price extremes."""
    d = d.copy()
    # Property_Type: NaN -> "Unknown".
    d["Property_Type"] = d["Property_Type"].fillna("Unknown")
    # Cap Beds and Car_Spaces at 10.
    d["Beds"] = d["Beds"].clip(upper=10)
    d["Car_Spaces"] = d["Car_Spaces"].clip(upper=10)
    # Baths: cap at Beds*2 when Beds > 0, then absolute cap at 10.
    mask_baths = (d["Beds"] > 0) & (d["Baths"] > d["Beds"] * 2)
    d.loc[mask_baths, "Baths"] = d.loc[mask_baths, "Beds"] * 2
    d["Baths"] = d["Baths"].clip(upper=10)
    # Beds=0 on non-land: set Beds and Baths to NaN for ML imputation.
    mask_no_rooms = (d["Beds"] == 0) & (d["is_land"] == 0)
    d.loc[mask_no_rooms, ["Beds", "Baths"]] = np.nan
    # LandSize=0 on non-apartment, non-land -> NaN for ML imputation.
    bad_zero = (d["LandSize_sqm"] == 0) \
        & (~d["Property_Type"].isin(APT_LIKE)) \
        & (d["is_land"] == 0)
    d.loc[bad_zero, "LandSize_sqm"] = np.nan
    # LandSize upper cap per Property_Type.
    for ptype, cap in land_caps.items():
        mask_cap = (d["Property_Type"] == ptype) & (d["LandSize_sqm"] > cap)
        d.loc[mask_cap, "LandSize_sqm"] = cap
    # Flag (do not drop) out-of-metro rows.
    d["out_of_metro"] = ((~d["Latitude"].between(LAT_MIN, LAT_MAX))
                         | (~d["Longitude"].between(LON_MIN, LON_MAX))).astype(int)
    if is_sold:
        # Drop unrecoverable price extremes; keep Price Withheld (NaN) rows for descriptive use.
        priced = d["Numeric_Price"].notna()
        keep = (~priced) | (d["Numeric_Price"].between(50_000, 20_000_000))
        before = len(d)
        d = d[keep].copy()
        print(f"  Sold: dropped {before - len(d):,} rows with price < 50k or > 20M.")
    return d

print("\nCleaning Sold:")
df_sold = clean_frame(df_sold, is_sold=True)
print(f"  Final size: {len(df_sold):,}")

print("\nCleaning For Sale:")
df_forsale = clean_frame(df_forsale, is_sold=False)
print(f"  Final size: {len(df_forsale):,}")

# Post-clean sanity checks.
print("\nPost-clean sanity:")
print(f"  Sold Beds max:       {df_sold['Beds'].max()}")
print(f"  Sold Baths max:      {df_sold['Baths'].max()}")
print(f"  Sold Car_Spaces max: {df_sold['Car_Spaces'].max()}")
print(f"  Sold LandSize max:   {df_sold['LandSize_sqm'].max():,.0f}")
print(f"  Sold price range:    {df_sold['Numeric_Price'].min():,.0f} - "
      f"{df_sold['Numeric_Price'].max():,.0f}")
print(f"  Sold Beds NaN (rooms-less):    {df_sold['Beds'].isna().sum():,}")
print(f"  Sold LandSize NaN (was 0):     {df_sold['LandSize_sqm'].isna().sum():,}")
print(f"  For Sale LandSize NaN (was 0): {df_forsale['LandSize_sqm'].isna().sum():,}")
print(f"  out_of_metro Sold:     {df_sold['out_of_metro'].mean()*100:.1f}%")
print(f"  out_of_metro For Sale: {df_forsale['out_of_metro'].mean()*100:.1f}%")
Per-Property_Type LandSize caps (top 10 by cap value):
Rural cap= 1,956,387
Acreage / Semi-Rural cap= 1,330,778
Farm cap= 842,160
Rural Lifestyle cap= 632,469
Villa cap= 263,045
Vacant land cap= 213,926
House cap= 51,860
Development Site cap= 45,532
New land cap= 12,048
Apartment / Unit / Flat cap= 5,414
Cleaning Sold:
Sold: dropped 30 rows with price < 50k or > 20M.
Final size: 124,820
Cleaning For Sale:
Final size: 38,419
Post-clean sanity:
Sold Beds max: 10.0
Sold Baths max: 10.0
Sold Car_Spaces max: 10
Sold LandSize max: 1,956,387
Sold price range: 50,000 - 19,500,000
Sold Beds NaN (rooms-less): 227
Sold LandSize NaN (was 0): 25,095
For Sale LandSize NaN (was 0): 7,776
out_of_metro Sold: 3.5%
out_of_metro For Sale: 4.7%
4. Univariate Analysis
I examine each variable in isolation. For attributes shared by both groups (Beds, LandSize, Property_Type, Suburb, Distance_to_CBD), I plot Sold and For Sale side-by-side to enable a visual drift check that Section 8 will quantify formally with statistical tests.
Price is the only Sold-only analysis here, because For Sale has no transaction price. I plot it on both raw and log1p scales to confirm the right-skew that justifies log1p(Price) as the ML target.
4.1 Sold price distribution
I use Numeric_Price filtered to non-NaN rows only (i.e. excluding the ~10k Price Withheld rows). I show both raw and log1p views and report key percentiles.
Code
# Work on Sold rows that have a usable transaction price.
sold_priced = df_sold.dropna(subset=["Numeric_Price"]).copy()

fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Raw scale in AUD million.
axes[0].hist(sold_priced["Numeric_Price"] / 1e6, bins=80, color="steelblue", edgecolor="white")
axes[0].set_xlabel("Price (AUD million)")
axes[0].set_ylabel("Count")
axes[0].set_title("Sold price - raw scale")

# log1p scale to reveal distribution shape.
axes[1].hist(np.log1p(sold_priced["Numeric_Price"]), bins=80, color="darkorange", edgecolor="white")
axes[1].set_xlabel("log1p(Price)")
axes[1].set_title("Sold price - log1p scale")

plt.tight_layout(); plt.show()

print("Sold Numeric_Price summary (n={:,}):".format(len(sold_priced)))
print(sold_priced["Numeric_Price"].describe(percentiles=[.05, .25, .5, .75, .95, .99]).round(0))
Sold Numeric_Price summary (n=114,594):
count 114594.0
mean 958510.0
std 701764.0
min 50000.0
5% 378500.0
25% 610000.0
50% 772295.0
75% 1081545.0
95% 2100000.0
99% 3630140.0
max 19500000.0
Name: Numeric_Price, dtype: float64
Observations on Sold price distribution:
Raw scale is sharply right-skewed; log1p scale is approximately bell-shaped and roughly symmetric. This confirms log1p(Numeric_Price) as the ML target.
Median = $772,295, 95th = $2.1M, 99th = $3.63M. The 99th percentile being only $3.63M confirms the $20M ceiling from Section 3 was conservative.
Mean ($959k) exceeds median ($772k), confirming right-skew even after capping. Median-based summaries are used throughout the rest of the notebook.
4.2 Physical attributes - side by side
Beds, Baths, Car_Spaces, and LandSize are compared between Sold and For Sale. I use the same x-axis bins on both panels so the eye can directly compare shapes. LandSize is plotted on log1p scale because it spans from ~100 sqm (apartments) to ~1.9M sqm (rural) - a linear scale would compress everything into one bar.
Code
def side_by_side(col, bins=30, log=False, drop_na=True):
    """Plot Sold and For Sale distributions of `col` on a shared scale."""
    fig, axes = plt.subplots(1, 2, figsize=(13, 3.5))
    a = df_sold[col].dropna() if drop_na else df_sold[col]
    b = df_forsale[col].dropna() if drop_na else df_forsale[col]
    if log:
        a, b = np.log1p(a), np.log1p(b)
    # Shared bin edges for fair visual comparison.
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    axes[0].hist(a, bins=edges, color="steelblue", edgecolor="white")
    axes[0].set_title(f"Sold - {col}{' (log1p)' if log else ''}")
    axes[1].hist(b, bins=edges, color="indianred", edgecolor="white")
    axes[1].set_title(f"For Sale - {col}{' (log1p)' if log else ''}")
    plt.tight_layout(); plt.show()

side_by_side("Beds", bins=12)
side_by_side("Baths", bins=12)
side_by_side("Car_Spaces", bins=12)
side_by_side("LandSize_sqm", bins=40, log=True)
side_by_side("Distance_to_CBD_km", bins=40)
Observations on physical attributes (Sold vs For Sale):
Beds, Baths, Car_Spaces: both groups peak at 3-4 / 2 / 2 respectively. For Sale leans more tightly to 4 beds, 2 baths, 2 cars - consistent with standardized new-build packages.
LandSize (log1p): bimodal in both groups - one spike at zero (apartment-like rows, whose LandSize = 0 is legitimate; the Section 3 scraping-gap zeros are now NaN and excluded from the plot) and a main bell at log1p ≈ 6.3 (~550 sqm typical lot).
Distance_to_CBD: Sold spreads broadly across 0-80 km with multiple peaks. For Sale is concentrated at 25-35 km (growth corridors) with notably less supply in the 40-65 km range.
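As a preview of the Section 8 quantification, one plausible drift test is a two-sample Kolmogorov-Smirnov comparison. This is a sketch - Section 8 may use different tests - applied here to Distance_to_CBD_km, the most visibly shifted continuous feature.
Code
from scipy import stats

# Two-sample KS test: do Sold and For Sale share the same Distance_to_CBD
# distribution? A large statistic with a tiny p-value indicates drift.
a = df_sold["Distance_to_CBD_km"].dropna()
b = df_forsale["Distance_to_CBD_km"].dropna()
ks_stat, p_val = stats.ks_2samp(a, b)
print(f"KS statistic = {ks_stat:.3f}, p-value = {p_val:.2e}")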
4.3 Property_Type composition
I compare the share (percentage) of each Property_Type across the two groups rather than raw counts, because Sold has ~3x as many rows as For Sale. I show the top 12 types side-by-side.
Code
def type_share(d, name):
    return (d["Property_Type"].value_counts(normalize=True) * 100).rename(name)

types = pd.concat([type_share(df_sold, "Sold %"),
                   type_share(df_forsale, "For Sale %")], axis=1).fillna(0)
types = types.sort_values("Sold %", ascending=False).head(12)

types.plot(kind="barh", figsize=(9, 5), color=["steelblue", "indianred"])
plt.gca().invert_yaxis()
plt.title("Property type composition - Sold vs For Sale (top 12)")
plt.xlabel("% of group")
plt.tight_layout(); plt.show()

print(types.round(2))
Sold % For Sale %
Property_Type
House 71.30 39.64
Apartment / Unit / Flat 12.69 12.42
Townhouse 7.20 8.74
Vacant land 6.52 7.40
Acreage / Semi-Rural 1.04 0.73
Rural 0.40 0.20
New House & Land 0.27 26.75
Villa 0.23 0.22
Block of Units 0.08 0.11
New land 0.08 0.26
New Apartments / Off the Plan 0.05 1.35
Studio 0.04 0.10
Observations on Property_Type composition:
House: 71.3% Sold vs 39.6% For Sale (32-point gap).
“New House & Land”: 0.27% Sold vs 26.75% For Sale (100x). This is the largest single drift in the dataset.
“New Apartments / Off the Plan” (0.05% vs 1.35%) and “New land” (0.08% vs 0.26%) show the same pattern.
All other types are roughly aligned between groups.
Cause: new-construction packages dominate listings but, once resold years later, are recorded under their physical type (“House”, “Apartment”). The For Sale snapshot therefore overweights new-build categories that the ML model, trained on Sold, will rarely have seen.
Mitigation for ML: I will add an is_new_build flag grouping the four new-build categories, so the model can learn a new-build premium without depending on rare training categories.
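A minimal sketch of that flag as it could look in the ML notebook (the four categories are those identified above; the column name is_new_build is the one planned):
Code
# Sketch of the planned is_new_build flag: group the four new-build
# categories into one indicator so the model can learn a new-build effect.
NEW_BUILD_TYPES = {"New House & Land", "New Home Designs",
                   "New Apartments / Off the Plan", "New land"}
for d in (df_sold, df_forsale):
    d["is_new_build"] = d["Property_Type"].isin(NEW_BUILD_TYPES).astype(int)
print(f"is_new_build share - Sold: {df_sold['is_new_build'].mean()*100:.1f}%, "
      f"For Sale: {df_forsale['is_new_build'].mean()*100:.1f}%")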
4.4 Top suburbs
The two top-15 suburb lists indicate where activity is concentrated in each group. Differences may reflect normal market dynamics (some suburbs turn over faster than they list) rather than data issues.
Code
top_sub_sold = df_sold["Suburb"].value_counts().head(15)
top_sub_fs = df_forsale["Suburb"].value_counts().head(15)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
top_sub_sold.plot(kind="barh", ax=axes[0], color="steelblue")
axes[0].invert_yaxis()
axes[0].set_title("Top 15 suburbs - Sold")
axes[0].set_xlabel("Number of records")
top_sub_fs.plot(kind="barh", ax=axes[1], color="indianred")
axes[1].invert_yaxis()
axes[1].set_title("Top 15 suburbs - For Sale")
axes[1].set_xlabel("Number of records")
plt.tight_layout(); plt.show()

# Total unique suburbs in each group for context.
print(f"Unique suburbs in Sold:     {df_sold['Suburb'].nunique():,}")
print(f"Unique suburbs in For Sale: {df_forsale['Suburb'].nunique():,}")
Unique suburbs in Sold: 543
Unique suburbs in For Sale: 527
Observations on suburbs:
543 unique in Sold, 527 in For Sale. Section 8 will check overlap.
For Sale top suburbs are almost all outer-ring growth corridors (TARNEIT, CLYDE NORTH, DONNYBROOK, WOLLERT, MICKLEHAM, FRASER RISE, etc.), matching the new-build supply concentration.
Sold top suburbs are more diverse, including established coastal/peninsula suburbs (MOUNT MARTHA, COWES, PORTARLINGTON, ROSEBUD) absent from the For Sale top list.
The ML model will face systematic input bias toward growth-corridor suburbs at inference time.
4.5 Sold transaction time distribution
For Sold rows I plot the year-month distribution of transactions to confirm that the data is heavily concentrated in 2023-2026, as Section 1 suggested, and to identify low-volume years that I should filter out of the time-series chart in Section 6.
Code
# Yearly transaction count.
year_counts = df_sold["Year"].value_counts().sort_index()

fig, axes = plt.subplots(1, 2, figsize=(14, 4))
axes[0].bar(year_counts.index, year_counts.values, color="steelblue", edgecolor="white")
axes[0].set_xlabel("Year")
axes[0].set_ylabel("Number of Sold transactions")
axes[0].set_title("Sold transactions per year")
axes[0].tick_params(axis="x", rotation=45)

# Monthly transaction count (pooled across years) to check seasonality.
month_counts = df_sold["Month"].value_counts().sort_index()
axes[1].bar(month_counts.index, month_counts.values, color="seagreen", edgecolor="white")
axes[1].set_xlabel("Month")
axes[1].set_ylabel("Number of Sold transactions")
axes[1].set_title("Sold transactions per month (all years pooled)")
axes[1].set_xticks(range(1, 13))
plt.tight_layout(); plt.show()

print("Years with fewer than 200 Sold transactions (will be filtered in Section 6):")
print(year_counts[year_counts < 200])
Years with fewer than 200 Sold transactions (will be filtered in Section 6):
Year
2001.0 1
2004.0 1
2005.0 1
2006.0 8
2007.0 35
2008.0 47
Name: count, dtype: int64
Observations on transaction timing:
2025 alone holds ~51,000 transactions (40% of Sold). 2024 has ~14,000, 2023 has ~5,000. Years before 2020 are individually under 5,000.
Six early years (2001, 2004-2008) have fewer than 200 transactions each and will be filtered out of the time-series median plot in Section 6.
Monthly distribution shows clear seasonality: a March peak (~15,000) and a May-June trough (~6,500), consistent with Australia’s spring/autumn auction seasons. The amplitude is moderate (~2x peak-to-trough), so Month adds useful but secondary signal beyond Year.
5. Bivariate Analysis - Price Drivers
I examine how each feature relates to price. All charts in this section use Sold rows with a valid Numeric_Price only, because For Sale has no transaction price.
The goal is to confirm which features carry signal, identify non-linear relationships, and check for multicollinearity. Insights here directly shape the ML feature set.
5.1 Price vs physical features
I use boxplots for discrete features (Beds, Baths) and scatter plots for continuous ones (LandSize, Distance_to_CBD). For scatters I sample 5,000 points with low alpha to avoid over-plotting, and use log1p(Price) on the y-axis so the price scale is comparable across panels.
Code
fig, axes = plt.subplots(2, 2, figsize=(13, 8))

# Price vs Beds (boxplot, discrete).
sub_beds = sold_priced[sold_priced["Beds"].between(0, 7)]
sns.boxplot(x="Beds", y=np.log1p(sub_beds["Numeric_Price"]), data=sub_beds,
            ax=axes[0, 0], color="steelblue")
axes[0, 0].set_title("log1p(Price) vs Beds")
axes[0, 0].set_ylabel("log1p(Price)")

# Price vs Baths.
sub_baths = sold_priced[sold_priced["Baths"].between(0, 5)]
sns.boxplot(x="Baths", y=np.log1p(sub_baths["Numeric_Price"]), data=sub_baths,
            ax=axes[0, 1], color="steelblue")
axes[0, 1].set_title("log1p(Price) vs Baths")
axes[0, 1].set_ylabel("")

# Price vs LandSize (scatter on log-log, sampled).
sample = sold_priced.dropna(subset=["LandSize_sqm"]).sample(
    min(5000, len(sold_priced.dropna(subset=["LandSize_sqm"]))), random_state=0)
axes[1, 0].scatter(np.log1p(sample["LandSize_sqm"]), np.log1p(sample["Numeric_Price"]),
                   alpha=0.2, s=8, color="steelblue")
axes[1, 0].set_xlabel("log1p(LandSize_sqm)")
axes[1, 0].set_ylabel("log1p(Price)")
axes[1, 0].set_title("log1p(Price) vs log1p(LandSize)")

# Price vs Distance_to_CBD (scatter, sampled).
sample2 = sold_priced.sample(min(5000, len(sold_priced)), random_state=0)
axes[1, 1].scatter(sample2["Distance_to_CBD_km"], np.log1p(sample2["Numeric_Price"]),
                   alpha=0.2, s=8, color="darkorange")
axes[1, 1].set_xlabel("Distance to CBD (km)")
axes[1, 1].set_ylabel("")
axes[1, 1].set_title("log1p(Price) vs Distance to CBD")

plt.tight_layout(); plt.show()
Observations on physical features:
Beds and Baths: clear monotonic increase in median price from 0 to 7 beds and 0 to 5 baths. Beds=0 and Beds=1 have very similar medians (~13 on log scale), but from Beds=2 onward the relationship is strongly linear in log-space. Baths shows an even cleaner monotonic pattern. Both will be strong features.
LandSize (log-log): two distinct populations visible. A vertical band at log1p=0 (apartments and rooms-less rows with LandSize=0) spans the full price range, while the main residential cloud at log1p(LandSize) 4-8 shows a clear positive trend. Price grows roughly linearly with log(LandSize), confirming log1p as the right transform for both axes.
Distance_to_CBD: surprisingly weak visual relationship - the cloud is nearly flat across 0-80 km. This is because the dataset is dominated by outer-ring suburbs (15-65 km) where price varies more by suburb-specific factors than by raw distance. Distance alone will be a weak feature; the model will rely more on Latitude/Longitude and Suburb-derived features.
5.2 Price by Property_Type
A boxplot of log1p(Price) across the top 8 Property_Types, to confirm the type-specific price levels assumed in Section 3’s outlier handling.
Code
top_types = sold_priced["Property_Type"].value_counts().head(8).index
sub = sold_priced[sold_priced["Property_Type"].isin(top_types)]

plt.figure(figsize=(11, 4.5))
sns.boxplot(x="Property_Type", y=np.log1p(sub["Numeric_Price"]), data=sub,
            order=top_types, color="steelblue")
plt.xticks(rotation=30, ha="right")
plt.ylabel("log1p(Price)")
plt.title("log1p(Price) by Property Type (top 8 by volume)")
plt.tight_layout(); plt.show()

# Median price per type for reference.
print("Median Numeric_Price by Property_Type (top 8):")
print(sub.groupby("Property_Type")["Numeric_Price"].median().round(0).sort_values(ascending=False))
Median Numeric_Price by Property_Type (top 8):
Property_Type
Acreage / Semi-Rural 1720000.0
Rural 1350000.0
House 827500.0
Townhouse 780000.0
Villa 715000.0
New House & Land 685000.0
Apartment / Unit / Flat 570000.0
Vacant land 410000.0
Name: Numeric_Price, dtype: float64
Observations on price by Property_Type:
Acreage / Semi-Rural commands the highest median ($1.72M), followed by Rural ($1.35M). These types are land-heavy and concentrated in premium peri-urban locations.
House ($828k) and Townhouse ($780k) form the middle tier.
Villa ($715k), New House & Land ($685k), and Apartment ($570k) are the lower residential tier.
Vacant land has the lowest median ($410k) - makes sense as it carries no dwelling premium.
The Property_Type ordering is consistent and will be a strong categorical feature. The model will need this distinction because the same LandSize can imply very different prices depending on type (e.g. 1,000 sqm on Vacant land vs. House).
5.3 Price vs socio-economic features
Suburb-level demographics (income, crime, age) are enriched onto each row. I plot their relationship with log1p(Price) to see whether they carry independent signal beyond Distance_to_CBD.
Code
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Sample once for consistency across panels.
samp = sold_priced.sample(min(5000, len(sold_priced)), random_state=0)

axes[0].scatter(samp["abs_median_income_weekly"], np.log1p(samp["Numeric_Price"]),
                alpha=0.2, s=8, color="steelblue")
axes[0].set_xlabel("Median weekly income (AUD)")
axes[0].set_ylabel("log1p(Price)")
axes[0].set_title("Price vs median income")

axes[1].scatter(samp["crime_rate_per_100k"], np.log1p(samp["Numeric_Price"]),
                alpha=0.2, s=8, color="indianred")
axes[1].set_xlabel("Crime rate per 100k")
axes[1].set_title("Price vs crime rate")

axes[2].scatter(samp["abs_median_age"], np.log1p(samp["Numeric_Price"]),
                alpha=0.2, s=8, color="seagreen")
axes[2].set_xlabel("Median age (years)")
axes[2].set_title("Price vs median age")

plt.tight_layout(); plt.show()
Observations on socio-economic features:
Median income: weak positive relationship. Price drifts upward as income rises but with very high variance at every income level. Income alone is not a strong predictor.
Crime rate: weak negative trend. Most data sits at crime rates 2k-10k per 100k; the few extreme-crime suburbs (>20k) cluster at lower prices but the relationship is noisy.
Median age: moderate positive relationship - older-population suburbs (50-65 years) tend to be more expensive, consistent with established, settled coastal/peninsula areas commanding premiums.
All three socio-economic features carry some signal but each is weak individually. The ML model should benefit from their combined contribution rather than relying on any one.
5.4 Correlation heatmap
A Pearson correlation matrix on the numeric features in the Sold set. I use it to verify the bivariate findings above and to spot multicollinearity, which matters for linear models in the ML notebook.
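The matrix is computed and rendered along these lines (a sketch; the exact column list is an assumption based on the schema and the features discussed in this section):
Code
# Sketch of the heatmap cell. Column list assumed from the dataset schema
# and the features discussed in this section.
num_cols = ["Numeric_Price", "Beds", "Baths", "Car_Spaces", "LandSize_sqm",
            "Distance_to_CBD_km", "abs_median_income_weekly", "abs_median_age",
            "abs_population", "crime_rate_per_100k", "dist_nearest_train_km"]
corr = sold_priced[num_cols].corr()

plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Pearson correlation - Sold numeric features")
plt.tight_layout(); plt.show()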
Baths (0.43), Beds (0.37), abs_median_age (0.31), Car_Spaces (0.29), LandSize (0.24), abs_median_income_weekly (0.24) are the strongest linear correlates of price.
Distance_to_CBD has essentially zero linear correlation (0.03) - confirming the scatter observation. The relationship is non-linear and location-dependent.
crime_rate_per_100k (-0.15) and abs_population (-0.16) are mildly negative.
Multicollinearity to watch:
Beds ↔ Baths: 0.76 (strong). Linear regression will need regularization (Ridge) to handle this. Tree models are unaffected.
Beds ↔ Car_Spaces: 0.50 (moderate).
abs_median_age ↔ abs_population: -0.52 (older suburbs are smaller).
abs_median_age ↔ Distance_to_CBD: 0.48 (further-out suburbs are older).
abs_median_income ↔ Distance_to_CBD: -0.25 (closer-in suburbs are richer).
Implications for ML:
For tree-based models (RandomForest, XGBoost), all features can be included; multicollinearity does not hurt them.
For Linear Regression baseline, I will apply Ridge regularization to stabilize coefficients given the Beds-Baths-Car_Spaces collinearity cluster.
No feature is so dominantly correlated with price that I should treat it as an obvious primary driver - this is a multi-feature problem where the model must combine signals.
6. Time Series Analysis - Price Inflation
This section is the empirical justification for using Year and Month as ML features.
The ML notebook trains on Sold rows using each transaction’s actual Year and Month, then at inference time injects the current Year (2026) and Month (5) into For Sale rows so predictions reflect today’s market level. This design only makes sense if price actually changes materially over time. Section 6 verifies that.
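Concretely, the injection amounts to something like this sketch (performed in the ML notebook; 2026 and 5 are the snapshot year and month):
Code
# Sketch: at inference time, stamp every For Sale row with the snapshot
# Year and Month so predictions sit on the current (May 2026) price level.
X_forsale = df_forsale.copy()
X_forsale["Year"] = 2026
X_forsale["Month"] = 5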
I produce three views:
Median price per year across all property types, with transaction volume as context.
Median price per year broken down by Property_Type, to check whether inflation is uniform.
Median price per month pooled across years, to quantify seasonality.
Years with fewer than 200 transactions are excluded from the median plots, as identified in Section 4.5.
6.1 Median price per year (overall)
The line shows median Sold price per year. The bar overlay shows transaction volume to give the reader context on sample-size confidence: a year with thousands of transactions has a more reliable median than a year with a few hundred.
Code
# Group by year, keep only years with adequate sample size.
ts = (sold_priced
      .groupby("Year")["Numeric_Price"]
      .agg(["median", "count"])
      .reset_index())
ts = ts[ts["count"] >= 200].copy()
ts["Year"] = ts["Year"].astype(int)

fig, ax1 = plt.subplots(figsize=(11, 4))

# Median line on left y-axis.
ax1.plot(ts["Year"], ts["median"] / 1e6, marker="o", color="steelblue", linewidth=2)
ax1.set_xlabel("Year")
ax1.set_ylabel("Median price (AUD million)", color="steelblue")
ax1.tick_params(axis="y", labelcolor="steelblue")
# Force integer year ticks (remove .0 / .5)
ax1.set_xticks(ts["Year"])

# Volume bars on right y-axis.
ax2 = ax1.twinx()
ax2.bar(ts["Year"], ts["count"], alpha=0.2, color="grey", width=0.7)
ax2.set_ylabel("Transaction count", color="grey")

plt.title("Median Sold price per year (line) and transaction volume (bars)")
plt.tight_layout()
plt.show()

# Year-over-year growth table.
ts["yoy_pct"] = (ts["median"].pct_change() * 100).round(1)
ts["median_k"] = (ts["median"] / 1e3).round(0)
print("Year-over-year median price growth (years with >=200 transactions):")
print(ts[["Year", "median_k", "count", "yoy_pct"]].to_string(index=False))
Strong rise from $522k (2009) to peak $875k (2022), then pullback to $755k (2024), recovery to $785k (2025), slight softening to $775k (2026 YTD).
YoY swings range from -8% to +12%, easily large enough that a model without Year would systematically misprice older and newer transactions.
2026 figure covers only Jan-2 May (~21k transactions). Volume is dominated by 2024-2026; 2025 alone has more transactions than 2009-2020 combined.
Trajectory is non-monotonic. Tree models handle it naturally; the Linear baseline will likely underperform here.
At inference, injecting Year = 2026 places For Sale predictions on the current price level - the correct “today-priced” reference.
6.2 Median price per year by Property_Type
Different property types may inflate at different rates. I plot the four highest-volume types separately. If the lines diverge sharply, the ML model will need to capture an interaction between Year and Property_Type. Tree-based models do this automatically; linear models would need an explicit interaction term.
Code
big_types = ["House", "Apartment / Unit / Flat", "Townhouse", "Vacant land"]ts_t = (sold_priced[sold_priced["Property_Type"].isin(big_types)] .groupby(["Year", "Property_Type"])["Numeric_Price"] .agg(["median", "count"]) .reset_index())ts_t = ts_t[ts_t["count"] >=50] # min 50 sales per (year, type) cellplt.figure(figsize=(11, 4.5))colors = {"House": "steelblue", "Apartment / Unit / Flat": "indianred","Townhouse": "seagreen", "Vacant land": "darkorange"}for t in big_types: sub = ts_t[ts_t["Property_Type"] == t] plt.plot(sub["Year"], sub["median"] /1e6, marker="o", label=t, color=colors[t], linewidth=2)plt.legend()plt.ylabel("Median price (AUD million)")plt.xlabel("Year")plt.title("Median price by Property Type over time")plt.tight_layout(); plt.show()
Observations on median price per year by Property_Type:
All four types share the same broad trajectory (rise to 2022, dip 2023-2024, recovery 2025-2026) - driven by common macro factors.
House and Townhouse move closely. Apartment is much flatter in dollar terms: its percentage growth over the decade is similar (+58% vs +57% for House), but from a lower base, so the absolute House-Apartment gap widens from ~$200k (2016) to ~$300k (2026).
A simple Year coefficient would imply uniform inflation - wrong. The model needs a Year x Property_Type interaction, which tree models capture automatically.
6.3 Seasonality - median price by month (pooled)
I pool all years to isolate the month-of-year effect from the year-on-year trend. A flat bar chart would mean Month adds no signal; a clear pattern would justify including it as a feature alongside Year.
Code
month_med = (sold_priced
             .groupby("Month")["Numeric_Price"]
             .median()
             .reindex(range(1, 13)))

month_med.plot(kind="bar", figsize=(9, 3.5), color="steelblue", edgecolor="white")
plt.title("Median Sold price by month of sale (all years pooled)")
plt.ylabel("Median price (AUD)")
plt.xlabel("Month")
plt.xticks(rotation=0)
plt.tight_layout(); plt.show()

# Compact summary: median per month and deviation from annual median.
overall = sold_priced["Numeric_Price"].median()
month_summary = pd.DataFrame({
    "median_price": month_med.round(0),
    "vs_overall_pct": ((month_med / overall - 1) * 100).round(1),
})
print(f"Overall median: {overall:,.0f}")
print(month_summary.to_string())
Observations on seasonality:
The pooled monthly medians are not flat - there is a clear month-of-year pattern, with roughly 13% spread between the strongest and weakest months (peak-to-trough; see the vs_overall_pct column above).
The effect is secondary to the year-on-year trend, but large enough to justify keeping Month as a feature alongside Year.
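As a quick sanity check on the amplitude, the peak-to-trough spread can be read straight off month_med from the cell above:
Code
# Peak-to-trough seasonal spread across the pooled monthly medians.
spread_pct = (month_med.max() / month_med.min() - 1) * 100
print(f"Peak-to-trough spread: {spread_pct:.1f}%")  # ~13%, per the Section 9 summary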
7. Geospatial Analysis
Geographic patterns matter even after controlling for Distance_to_CBD, because two locations the same distance from the CBD can have very different prices (e.g. inner east vs inner west). This section produces three views:
A spatial scatter of Sold prices, to see how price varies across the metro area.
A spatial scatter of For Sale supply, to confirm the growth-corridor concentration observed in Section 4.
Top and bottom suburbs by median price, to identify the price hierarchy that suburb-level features will need to capture.
7.1 Price map (Sold) and supply map (For Sale)
I sample up to 15,000 points per map to avoid over-plotting. The Sold map is colored by log1p(Price); the For Sale map is uniform color because For Sale has no transaction price - its purpose is to show where current supply is concentrated. Both are clipped to the Melbourne metro envelope for clarity.
Code
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Sold map - colored by log1p(Price).
s = sold_priced.sample(min(15000, len(sold_priced)), random_state=0)
sc = axes[0].scatter(s["Longitude"], s["Latitude"],
                     c=np.log1p(s["Numeric_Price"]),
                     cmap="viridis", s=4, alpha=0.5)
plt.colorbar(sc, ax=axes[0], label="log1p(Price)")
axes[0].set_title(f"Sold (n={len(s):,}) - colored by price")
axes[0].set_xlabel("Longitude")
axes[0].set_ylabel("Latitude")

# For Sale map - uniform color, shows current supply distribution.
f = df_forsale.sample(min(15000, len(df_forsale)), random_state=0)
axes[1].scatter(f["Longitude"], f["Latitude"], color="indianred", s=4, alpha=0.4)
axes[1].set_title(f"For Sale (n={len(f):,}) - current supply")
axes[1].set_xlabel("Longitude")

# Clip both to metro envelope.
for ax in axes:
    ax.set_xlim(144.4, 145.7)
    ax.set_ylim(-38.6, -37.4)
plt.tight_layout(); plt.show()
Observations on the maps:
Sold map shows a clear price gradient: bright yellow-green (high log1p price) concentrated in inner-east and bayside Melbourne (around longitude 145.0-145.1, latitude -37.8 to -37.9), and along the Mornington Peninsula coast (the curved cluster at the bottom). Outer-ring growth corridors to the west and north (longitude 144.5-144.9) sit at lower price levels (darker green/teal).
For Sale supply concentrates in the western and northern growth corridors and middle ring. The peninsula and inner-east are sparser. This visually confirms the Section 4 finding: For Sale supply is biased toward outer-ring new-development zones.
The model will be asked to price For Sale listings concentrated in areas where Sold training data is moderately well represented but the Property_Type mix differs (heavy New House & Land). This is the central drift challenge.
7.2 Top and bottom suburbs by median price
A ranking of the most and least expensive suburbs (with at least 30 Sold transactions each, to keep medians stable).
Code
sub_med = (sold_priced
           .groupby("Suburb")["Numeric_Price"]
           .agg(["median", "count"])
           .query("count >= 30")
           .sort_values("median"))
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
sub_med.tail(15)["median"].div(1e6).plot(kind="barh", ax=axes[0], color="seagreen")
axes[0].set_title("Top 15 suburbs by median price")
axes[0].set_xlabel("Median price (AUD M)")
sub_med.head(15)["median"].div(1e6).plot(kind="barh", ax=axes[1], color="firebrick")
axes[1].set_title("Bottom 15 suburbs by median price")
axes[1].set_xlabel("Median price (AUD M)")
plt.tight_layout(); plt.show()
print(f"Suburb price range (n >= 30): "
      f"{sub_med['median'].min():,.0f} to {sub_med['median'].max():,.0f}")
print(f"Ratio top/bottom: {sub_med['median'].max() / sub_med['median'].min():.1f}x")
Suburb price range (n >=30): 380,000 to 3,062,500
Ratio top/bottom: 8.1x
Observations on suburb price ranking:
Range: $380k (TRAVANCORE) to $3.06M (MERRICKS). Top/bottom ratio = 8.1x. Suburb alone explains a huge share of price variation.
Top 15 dominated by Mornington Peninsula (MERRICKS, MERRICKS NORTH, MAIN RIDGE, PORTSEA, FLINDERS, TUERONG), inner-east (CANTERBURY, BALWYN NORTH, EAGLEMONT, PARK ORCHARDS), and bayside (BRIGHTON, BLACK ROCK, BRIGHTON EAST).
Bottom 15 dominated by far-west growth (MELTON, MELTON SOUTH, EYNESBURY), Geelong region (THOMSON, NORLANE, CORIO, WHITTINGTON), and a few high-density inner CBD-fringe pockets (SOUTHBANK, CARLTON, MELBOURNE, TRAVANCORE) where apartments dominate the mix.
The 8.1x ratio confirms Suburb will be a top-3 feature in the ML model. Frequency encoding (as planned) will preserve this signal without leakage.
8. Sold vs For Sale Distribution Comparison
The ML model trains on Sold but predicts on For Sale. If the two distributions differ materially, the test-set performance on Sold will overstate real-world performance on For Sale.
I quantify the drift with three checks:
Per-feature KS test - compares Sold and For Sale on each numeric feature. Large KS statistic or tiny p-value means the two groups have meaningfully different shapes.
Cold-start suburb check - suburbs that appear in For Sale but have no Sold history. The model has no signal for these.
Property_Type cold-start - same idea at the type level.
The KS test is chosen over a simple median comparison because it captures whole-distribution differences (including tails), is non-parametric, and is robust to the heavy skew that pervades property data.
8.1 KS test on numeric features
The Kolmogorov-Smirnov statistic measures the maximum vertical distance between two cumulative distribution functions. Values close to 0 mean similar distributions; values approaching 1 mean disjoint. For datasets this large, even small distributional differences will produce statistically significant p-values, so the absolute KS statistic matters more than significance.
Code
from scipy.stats import ks_2samp

drift_cols = ["Beds", "Baths", "Car_Spaces", "LandSize_sqm",
              "Distance_to_CBD_km", "dist_nearest_train_km",
              "abs_median_income_weekly", "abs_median_age",
              "abs_population", "crime_rate_per_100k"]
rows = []
for c in drift_cols:
    a = df_sold[c].dropna()
    b = df_forsale[c].dropna()
    stat, p = ks_2samp(a, b)
    rows.append({
        "feature": c,
        "sold_median": round(a.median(), 2),
        "forsale_median": round(b.median(), 2),
        "ks_stat": round(stat, 3),
        "p_value": f"{p:.2e}",
    })
drift_df = pd.DataFrame(rows).sort_values("ks_stat", ascending=False)
print(drift_df.to_string(index=False))
Observations on the KS drift table:
Largest drifts (KS > 0.13): LandSize (For Sale lots are smaller - 400 vs 576 sqm), median age (younger suburbs - 35 vs 39), Distance_to_CBD (closer in - 26.8 vs 33 km), population, Beds, crime rate. All point to the same story: For Sale concentrates in growth-corridor new-build supply.
Smallest drifts (KS < 0.10): Baths, income, train distance, Car_Spaces. Stable across the metro.
All p-values are essentially zero, but with 100k+ rows that is expected; the KS statistic itself is what matters.
Implications: the is_new_build flag planned in Section 4 explicitly handles the new-build sub-population. Tree models will handle the remaining drift through splits.
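For intuition on what the statistic measures, it can be recomputed by hand as the maximum gap between two empirical CDFs. A self-contained sketch on synthetic data (the lognormal parameters are illustrative, not fitted to this dataset):
Code
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
a = rng.lognormal(mean=6.3, sigma=0.5, size=5000)  # stand-in for a Sold feature
b = rng.lognormal(mean=6.0, sigma=0.4, size=3000)  # stand-in for For Sale

# Evaluate both ECDFs on the pooled sample; KS = max vertical gap between them.
grid = np.sort(np.concatenate([a, b]))
ecdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
ecdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
print(f"manual KS: {np.abs(ecdf_a - ecdf_b).max():.4f}")
print(f"scipy  KS: {ks_2samp(a, b).statistic:.4f}")  # should match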
8.2 Cold-start suburbs
A suburb in For Sale but missing from Sold means the model has no historical signal for it. Frequency encoding will assign such suburbs a value of zero (or the overall mean, depending on the strategy), which is a weak fallback. The count of cold-start suburbs and the share of For Sale rows they cover determine whether this is a marginal concern or a real problem.
Code
sold_subs = set(df_sold["Suburb"].unique())
forsale_subs = set(df_forsale["Suburb"].unique())
new_in_fs = forsale_subs - sold_subs
absent_in_fs = sold_subs - forsale_subs
shared = sold_subs & forsale_subs
print(f"Unique suburbs - Sold: {len(sold_subs)}")
print(f"Unique suburbs - For Sale: {len(forsale_subs)}")
print(f"Shared (both groups): {len(shared)}")
print(f"Cold-start (in For Sale, never Sold): {len(new_in_fs)}")
print(f"Absent (in Sold, none For Sale): {len(absent_in_fs)}")

# How many For Sale rows fall in cold-start suburbs?
n_coldstart_rows = df_forsale["Suburb"].isin(new_in_fs).sum()
print(f"\nFor Sale rows in cold-start suburbs: {n_coldstart_rows:,} "
      f"({n_coldstart_rows / len(df_forsale) * 100:.2f}%)")
if new_in_fs:
    print(f"\nExamples of cold-start suburbs (up to 10): {sorted(new_in_fs)[:10]}")
Unique suburbs - Sold: 543
Unique suburbs - For Sale: 527
Shared (both groups): 519
Cold-start (in For Sale, never Sold): 8
Absent (in Sold, none For Sale): 24
For Sale rows in cold-start suburbs: 12 (0.03%)
Examples of cold-start suburbs (up to 10): ['APOLLO BAY', 'BURNLEY', 'CASTELLA', 'FIELDSTONE', 'GRANGEFIELDS', 'INGLISTON', 'LOVELY BANKS', 'MURGHEBOLUC']
Observations on cold-start suburbs:
519 suburbs appear in both groups - 95.6% of the 543 Sold suburbs and 98.5% of the 527 For Sale suburbs.
8 cold-start suburbs in For Sale (APOLLO BAY, BURNLEY, CASTELLA, FIELDSTONE, GRANGEFIELDS, INGLISTON, LOVELY BANKS, MURGHEBOLUC) covering only 12 rows (0.03%). Marginal.
Frequency encoding will assign these zero; Postcode + Lat/Lng carry the geographic signal. No special fallback needed.
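A minimal sketch of that encoding with the zero fallback (the ML notebook may normalize differently; the map-then-fillna pattern is the point):
Code
# Frequency-encode Suburb from Sold counts only - no target leakage.
suburb_freq = df_sold["Suburb"].value_counts(normalize=True)
df_sold["Suburb_freq"] = df_sold["Suburb"].map(suburb_freq)
# Cold-start suburbs map to NaN and fall back to 0.
df_forsale["Suburb_freq"] = df_forsale["Suburb"].map(suburb_freq).fillna(0.0)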
8.3 Property_Type cold-start
Code
sold_types = set(df_sold["Property_Type"].unique())
fs_types = set(df_forsale["Property_Type"].unique())
new_types = fs_types - sold_types
absent_types = sold_types - fs_types
print(f"Property_Types only in For Sale (cold-start): {new_types}")
print(f"Property_Types only in Sold (no listings): {absent_types}")

# Volume of For Sale rows in cold-start Property_Types.
if new_types:
    n_coldstart_types = df_forsale["Property_Type"].isin(new_types).sum()
    print(f"\nFor Sale rows in cold-start Property_Types: {n_coldstart_types:,} "
          f"({n_coldstart_types / len(df_forsale) * 100:.2f}%)")
Property_Types only in For Sale (cold-start): {'Grazing', 'Mixed Farming'}
Property_Types only in Sold (no listings): {'Livestock'}
For Sale rows in cold-start Property_Types: 3 (0.01%)
Observations on Property_Type cold-start:
Only Grazing and Mixed Farming appear in For Sale but not Sold (3 rows, 0.01%), and Livestock appears only in Sold. All three are negligible agricultural categories that will be grouped into "Other" during ML feature engineering anyway.
No additional handling needed.
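For reference, a sketch of the grouping the ML notebook will apply, assuming the agricultural categories fall inside the rare_types index computed earlier in this report (the threshold itself is persisted in Section 9's decisions JSON):
Code
# Group rare Property_Type categories into "Other" so train and inference
# share one category vocabulary (assumes Grazing / Mixed Farming / Livestock
# are captured by rare_types).
rare = set(rare_types.index)
for frame in (df_sold, df_forsale):
    frame["Property_Type_grp"] = frame["Property_Type"].where(
        ~frame["Property_Type"].isin(rare), "Other")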
9. Summary and Save
I consolidate the findings from Sections 1-8 into a concise list of insights and decisions for the ML notebook, then persist the cleaned data and the decision parameters.
9.1 Key insights
The dataset has two structurally different groups: 124,820 Sold transactions (all with a Date; 91.8% with a Numeric_Price) and 38,419 For Sale listings (no Date; 86% with an asking price).
Sold data is heavily concentrated in recent years - 40% of transactions occurred in 2025 alone, and the bulk of data sits in 2023-2026. The early-2000s tail is sparse.
Price is heavily right-skewed; log1p(Numeric_Price) is the ML target.
Year-over-year median price moves 4-12% per year (non-monotonic, with pullbacks in 2012-2013, 2019, 2023-2024). Month adds 13% peak-to-trough seasonality. Year and Month must be features.
Different Property_Types inflate at different rates - the House-Apartment gap widens over time. Tree models will capture this Year x Property_Type interaction; the Linear baseline will not.
Strongest price drivers: Baths (corr 0.43), Beds (0.37), median age (0.31), Car_Spaces (0.29), LandSize (0.24), median income (0.24). Property_Type and Suburb are also strong (suburb price range is 8.1x top to bottom).
Beds-Baths correlation is 0.76 - Linear baseline needs Ridge regularization.
Largest Sold vs For Sale drift: For Sale concentrates in growth-corridor new-build supply (smaller lots, younger demographics, closer to CBD, heavy New House & Land share - 27% vs 0.3%). An is_new_build flag will be added in the ML notebook.
Cold-start risk is marginal: only 8 suburbs (12 rows, 0.03%) and 2 Property_Types (3 rows, 0.01%) appear in For Sale but not Sold.
Imputation: median per Property_Type for LandSize NaN; median for Beds/Baths NaN where the row is non-land; median for crime_rate_per_100k NaN. Fit on train fold only - see the sketch after this list.
At For Sale inference: inject Year = 2026, Month = 5 to reflect the snapshot’s current market level.
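A minimal sketch of the per-type median imputation fit on the train fold only; the 70% time-based slice illustrates the plan and is not the ML notebook's exact code:
Code
# Fit imputation medians on the earliest 70% of Sold (the future train fold).
train = df_sold.sort_values("Date_parsed").iloc[: int(0.70 * len(df_sold))]
land_medians = train.groupby("Property_Type")["LandSize_sqm"].median()
global_land_median = train["LandSize_sqm"].median()

def impute_landsize(frame: pd.DataFrame) -> pd.DataFrame:
    # Per-type median first, global train median as last-resort fallback.
    fill = frame["Property_Type"].map(land_medians).fillna(global_land_median)
    out = frame.copy()
    out["LandSize_sqm"] = out["LandSize_sqm"].fillna(fill)
    return out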
Code
from pathlib import Path

# Save cleaned data
out = pd.concat([df_sold, df_forsale], ignore_index=True)
out_dir = Path("report_data")
out_dir.mkdir(exist_ok=True)
out_path = out_dir / "cleaned_data.parquet"
out.to_parquet(out_path, index=False)
print(f"Saved cleaned data -> {out_path}")
print(f"  Total rows: {len(out):,}")
print(f"  Sold: {(out['Status'] == 'Sold').sum():,}")
print(f"  For Sale: {(out['Status'] == 'For Sale').sum():,}")
print(f"  Columns: {len(out.columns)}")
print(f"  Size on disk: {out_path.stat().st_size / 1e6:.1f} MB")
Saved cleaned data -> report_data\cleaned_data.parquet
Total rows: 163,239
Sold: 124,820
For Sale: 38,419
Columns: 32
Size on disk: 11.7 MB
Code
# Persist the EDA decisions as a JSON the ML notebook will load.
# This is the single source of truth for cleaning/encoding parameters
# so the two notebooks stay in sync.
decisions = {
    "data_snapshot_date": "2026-05-02",
    # Outlier handling
    "land_property_types": sorted(LAND_TYPES),
    "apt_like_types": sorted(APT_LIKE),
    "rural_large_types": sorted(RURAL_LARGE),
    "residential_dense_types": sorted(RESIDENTIAL_DENSE),
    "price_floor": 50_000,
    "price_ceiling": 20_000_000,
    "landsize_caps_by_type": {k: round(v, 0) for k, v in land_caps.items()},
    "metro_envelope": {"lat": [-38.6, -37.4], "lon": [144.4, 145.7]},
    # ML setup
    "target_transform": "log1p",
    "split_strategy": "time_based",
    "split_ratios": {"train": 0.70, "val": 0.15, "test": 0.15},
    "inference_year": 2026,
    "inference_month": 5,
    # Rare Property_Type categories to group as "Other" in ML feature engineering
    "rare_property_types": rare_types.index.tolist(),
    # New-build Property_Type categories to flag with is_new_build
    "new_build_types": ["New House & Land", "New Apartments / Off the Plan",
                        "New Home Designs", "New land"],
    # Rationale
    "rationale_year_month": (
        "Sold prices show 4-12% YoY changes and 13% peak-to-trough seasonality. "
        "Year and Month let the model learn this. At For Sale inference, "
        "current Year/Month are injected to obtain today-priced estimates."
    ),
}
decisions_path = "report_data/eda_decisions.json"
with open(decisions_path, "w") as f:
    json.dump(decisions, f, indent=2)
print(f"Saved decisions -> {decisions_path}")
Saved decisions -> report_data/eda_decisions.json
The EDA report is complete. The next notebook (ml_report.ipynb and ml_report.html) will:
Load cleaned_data.parquet and eda_decisions.json.
Engineer features including is_new_build, rare-type grouping, and frequency encoding for Suburb.
Apply a time-based 70/15/15 split on sorted Date_parsed.
Train Linear Regression, Random Forest, and XGBoost; tune hyperparameters on the validation fold.
Select the best model, evaluate on the test fold, retrain on full Sold, and predict For Sale prices with Year = 2026, Month = 5.
Save model artifacts and predictions for the FastAPI dashboard.