ETL Pipeline Architecture
An end-to-end Extract, Transform, Load (ETL) system engineered to harvest global financial data, calculate complex metrics, and ensure data integrity for downstream analytics.
1. Data Extraction
backend/etl/fetch_data.py
This module acts as the gateway to financial markets. Using the yFinance API, it iteratively requests data for 21 target companies (e.g., Apple, Tesla, JP Morgan). The script is fault-tolerant, wrapping each request in try/except blocks to handle API timeouts gracefully.
- Historical Prices: Pulls 5 years of Open, High, Low, Close, Volume data.
- Financial Statements: Merges Income Statement, Balance Sheet, and Cash Flow into a single view.
- Company Profile: Captures metadata like Sector, Industry, and Website for context.
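The fault-tolerant request pattern described above can be sketched as a small retry wrapper. This is a minimal illustration, not the project's actual code; the helper name `fetch_with_retry` and the retry/delay parameters are assumptions.

```python
import time

def fetch_with_retry(fetch_fn, retries=3, delay=2.0):
    """Call fetch_fn, retrying on failure; return None if every attempt fails."""
    for attempt in range(retries):
        try:
            return fetch_fn()
        except Exception:
            # API timeout or transient error: wait, then retry
            if attempt < retries - 1:
                time.sleep(delay)
    return None

# Hypothetical usage with yfinance (network call, shown for context):
# import yfinance as yf
# history = fetch_with_retry(lambda: yf.Ticker("AAPL").history(period="5y"))
```

Returning None instead of raising lets the loop over all 21 tickers continue even when one symbol fails.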
2. Data Transformation
backend/etl/clean_data.py
Raw data is rarely analysis-ready. This script cleans the dataset and, more importantly, generates financial intelligence: it fills missing values and converts raw numbers into the meaningful ratios investors rely on.
- Standardization: Renames columns to snake_case and normalizes timestamps to UTC.
- Ratio Calculation: Automatically computes 15+ metrics including ROE, ROA, Current Ratio, and Debt-to-Equity.
- Data Hygiene: Handles NaN values and guards against division-by-zero errors to ensure database stability.
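The ratio calculation and division-by-zero handling above can be sketched with pandas. This is an illustrative fragment, not the project's code; the column names (`net_income`, `total_equity`, `total_assets`) and the `safe_ratio` helper are assumptions.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Elementwise division that yields NaN instead of failing on zero denominators."""
    return numerator.divide(denominator.replace(0, np.nan))

fin = pd.DataFrame({
    "net_income":   [95_000, 12_000],
    "total_equity": [60_000, 0],        # a zero here would normally break ROE
    "total_assets": [350_000, 80_000],
})

fin["roe"] = safe_ratio(fin["net_income"], fin["total_equity"])
fin["roa"] = safe_ratio(fin["net_income"], fin["total_assets"])
```

Mapping zero denominators to NaN keeps the output numeric and lets downstream loading treat the missing ratio like any other NaN.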
3. Database Loading
backend/etl/load_to_mysql.py
This step safely deposits processed data into a MySQL database using SQLAlchemy. The logic supports "upsert" operations, meaning it can update existing records or insert new ones without creating duplicates.
- Integrity Checks: Uses INSERT IGNORE and ON DUPLICATE KEY UPDATE to manage data history.
- Batch Processing: Loads data in chunks (2,000 rows per batch) to optimize memory usage.
- Type Mapping: Ensures Python Pandas data types align perfectly with MySQL table schemas.
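The upsert-plus-batching approach above might look roughly like the sketch below. The table name `daily_prices`, its columns, and the `chunked` helper are assumptions for illustration; only the MySQL ON DUPLICATE KEY UPDATE syntax is taken from the source.

```python
from sqlalchemy import create_engine, text

def chunked(rows, size=2000):
    """Yield successive batches of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# Hypothetical target table: PRIMARY KEY (ticker, trade_date)
UPSERT_SQL = text("""
    INSERT INTO daily_prices (ticker, trade_date, close)
    VALUES (:ticker, :trade_date, :close)
    ON DUPLICATE KEY UPDATE close = VALUES(close)
""")

def load(engine, rows):
    """Upsert rows in 2000-row batches inside a single transaction per batch."""
    with engine.begin() as conn:
        for batch in chunked(rows):
            conn.execute(UPSERT_SQL, batch)  # executemany-style parameter list
```

Batching keeps the driver's parameter buffers small, while the ON DUPLICATE KEY UPDATE clause makes daily reruns idempotent.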
4. CI/CD Automation
.github/workflows/daily_etl.yml
To keep the dashboard fresh, the entire pipeline is automated via GitHub Actions and runs without human intervention.
- Daily Trigger: Scheduled to run automatically at 06:00 AM (Vietnam Time) to capture the previous day's US market close.
- Environment: Sets up Python 3.10 and installs dependencies on the fly.
- Auto-Commit: Pushes the newly processed CSV data back to the repository for the frontend to consume.
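A workflow covering the steps above might be structured as follows. This is a hedged sketch, not the repository's actual file: the step names and script entry points are assumptions; the cron expression reflects that 06:00 Vietnam time (UTC+7) is 23:00 UTC the previous day.

```yaml
name: Daily ETL

on:
  schedule:
    - cron: "0 23 * * *"   # 06:00 ICT (UTC+7)
  workflow_dispatch:        # allow manual runs

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: python backend/etl/fetch_data.py      # extract
      - run: python backend/etl/clean_data.py      # transform
      - run: python backend/etl/load_to_mysql.py   # load
      - name: Commit refreshed CSVs
        run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add data/*.csv
          git commit -m "chore: daily data refresh" || echo "no changes"
          git push
```

The `|| echo "no changes"` guard keeps the job green on days when the market data produced no new rows.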
Data Availability Strategy
Optimizing performance by separating "Live" data from "Stable" data.
Dashboard Data Feed
Interactive dashboards are powered by lightweight CSV files regenerated daily. This ensures users see the latest market movements immediately, without database-query overhead. The data flows directly from GitHub-hosted CSV files to the frontend visualizations.
Updated daily via GitHub Actions.
Financial Reports Data
The detailed written reports are based on a verified snapshot of the MySQL database, which ensures consistency in analysis. The current reports use a dataset that was integrity-checked and loaded on February 3, 2026.
Snapshot Date: 03/02/2026
Technology Stack
Modern tools and frameworks powering the ETL pipeline