ETL Pipeline Architecture
An end-to-end Extract, Transform, Load (ETL) system engineered to harvest global financial data, calculate complex metrics, and ensure data integrity for downstream analytics.
1. Data Extraction
backend/etl/fetch_data.py
This module acts as the gateway to financial markets. Using the yfinance API, it iteratively requests data for 21 target companies (e.g., Apple, Tesla, JPMorgan). The script is designed to be fault-tolerant, using try/except blocks to handle API timeouts gracefully.
- Historical Prices: Pulls five years of Open, High, Low, Close, and Volume (OHLCV) data.
- Financial Statements: Merges Income Statement, Balance Sheet, and Cash Flow into a single view.
- Company Profile: Captures metadata like Sector, Industry, and Website for context.
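The fault-tolerant fetch loop described above might look like the following sketch. The `fetch_with_retry` helper and the usage lines are illustrative assumptions, not the actual module code; the usage assumes the yfinance package is installed.

```python
import time

def fetch_with_retry(fetch_fn, retries=3, delay=2.0):
    """Call fetch_fn; on any exception, wait and retry.

    Returns the fetched value, or None once all attempts fail.
    """
    for attempt in range(retries):
        try:
            return fetch_fn()
        except Exception:
            if attempt < retries - 1:
                time.sleep(delay)
    return None

# Illustrative usage (requires `pip install yfinance`):
# import yfinance as yf
# ticker = yf.Ticker("AAPL")
# prices = fetch_with_retry(lambda: ticker.history(period="5y"))  # OHLCV history
# profile = fetch_with_retry(lambda: ticker.info)                 # sector, industry, website
```

Wrapping each call this way keeps one flaky ticker from aborting the whole 21-company loop.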
2. Data Transformation
backend/etl/clean_data.py
Raw data is rarely analysis-ready. This script cleans the dataset and, more importantly, generates financial intelligence. It fills missing values and converts raw numbers into meaningful ratios used by investors.
- Standardization: Renames columns to snake_case and normalizes timestamps to UTC.
- Ratio Calculation: Automatically computes 15+ metrics including ROE, ROA, Current Ratio, and Debt-to-Equity.
- Data Hygiene: Handles NaN values and prevents division-by-zero errors to ensure database stability.
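The NaN and division-by-zero guards behind the ratio calculations can be sketched as follows. The field names (`net_income`, `total_equity`, etc.) are assumptions for illustration; the real schema may differ.

```python
import math

def safe_divide(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is missing, NaN, or zero."""
    if denominator is None or denominator == 0:
        return None
    if isinstance(denominator, float) and math.isnan(denominator):
        return None
    return numerator / denominator

def compute_ratios(row):
    # Field names are illustrative, not the actual column names.
    return {
        "roe": safe_divide(row.get("net_income"), row.get("total_equity")),
        "roa": safe_divide(row.get("net_income"), row.get("total_assets")),
        "current_ratio": safe_divide(row.get("current_assets"), row.get("current_liabilities")),
        "debt_to_equity": safe_divide(row.get("total_debt"), row.get("total_equity")),
    }
```

Returning None (stored as SQL NULL) instead of raising keeps a single bad row from crashing the load step.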
3. Database Loading
backend/etl/load_to_mysql.py & backend/etl/load_to_supabase.py
This step safely deposits processed data into a dual-database architecture: a local MySQL database for backup and a cloud-hosted Supabase (PostgreSQL) database, both loaded via SQLAlchemy.
- Integrity Checks: Uses ON DUPLICATE KEY UPDATE (MySQL) and ON CONFLICT (PostgreSQL) to handle upsert operations seamlessly.
- Batch Processing: Loads data in chunks to optimize memory usage and prevent connection timeouts.
- Hybrid Storage: Keeps the local development database and the cloud production database in sync.
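The two upsert dialects named above can be sketched as a small SQL builder. Table and column names here are hypothetical, and in practice the loaders would issue these statements chunk by chunk through SQLAlchemy.

```python
def upsert_sql(table, columns, key_cols, dialect="postgresql"):
    """Build an upsert statement for the given dialect, using named-parameter placeholders."""
    cols = ", ".join(columns)
    params = ", ".join(f":{c}" for c in columns)
    non_keys = [c for c in columns if c not in key_cols]
    if dialect == "postgresql":
        # PostgreSQL / Supabase: ON CONFLICT ... DO UPDATE with EXCLUDED.*
        updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in non_keys)
        return (f"INSERT INTO {table} ({cols}) VALUES ({params}) "
                f"ON CONFLICT ({', '.join(key_cols)}) DO UPDATE SET {updates}")
    # MySQL: ON DUPLICATE KEY UPDATE with VALUES(col)
    updates = ", ".join(f"{c} = VALUES({c})" for c in non_keys)
    return (f"INSERT INTO {table} ({cols}) VALUES ({params}) "
            f"ON DUPLICATE KEY UPDATE {updates}")
```

Executing the statement per chunk of rows (e.g., `conn.execute(text(sql), chunk)` in SQLAlchemy) gives the batch behavior described above.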
4. CI/CD Automation
.github/workflows/daily_etl.yml
To keep the dashboard fresh, the entire pipeline is automated via GitHub Actions and runs without human intervention.
- Daily Trigger: Scheduled to run automatically at 22:00 UTC to capture the same day's U.S. market close.
- Environment Setup: Automatically provisions Python 3.10, installs dependencies, and configures secure cloud credentials.
- Data Sync & Auto-Commit: Regenerates the processed CSV files used by the dashboards and commits them to the repository, then pushes the data to the Supabase cloud database.
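A hedged sketch of what such a workflow file might contain; the step layout and the secret name are assumptions for illustration, not the actual file contents.

```yaml
name: Daily ETL
on:
  schedule:
    - cron: "0 22 * * *"    # 22:00 UTC, after the U.S. market close
  workflow_dispatch:         # allow manual runs
jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: python backend/etl/fetch_data.py
      - run: python backend/etl/clean_data.py
      - run: python backend/etl/load_to_supabase.py
        env:
          SUPABASE_DB_URL: ${{ secrets.SUPABASE_DB_URL }}  # assumed secret name
```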
Data Availability Strategy
Optimizing performance by separating "Live" data from "Stable" data.
Dashboard Data Feed
Interactive dashboards are powered by lightweight CSV files regenerated daily, so users see the latest market movements immediately without database query overhead. The data flows directly from GitHub-hosted CSV files to the frontend visualizations.
Updated Daily via GitHub Actions
Financial Reports Data
The detailed written reports are based on a verified snapshot of the SQL database, which ensures consistency in analysis. The current reports use the dataset that was integrity-checked and loaded on April 8, 2026.
Snapshot Date: 08/04/2026
Technology Stack
Modern tools and frameworks powering the ETL pipeline