ETL Pipeline Architecture
An end-to-end Extract, Transform, Load (ETL) system engineered to harvest global financial data, calculate complex metrics, and ensure data integrity for downstream analytics.
1. Data Extraction
backend/etl/fetch_data.py
This module acts as the gateway to financial markets. Using the yFinance API, it iteratively requests data for 21 target companies (e.g., Apple, Tesla, JP Morgan). The script is fault-tolerant, wrapping each request in try/except blocks to handle API timeouts gracefully.
- Historical Prices: Pulls 5 years of Open, High, Low, Close, Volume data.
- Financial Statements: Merges Income Statement, Balance Sheet, and Cash Flow into a single view.
- Company Profile: Captures metadata like Sector, Industry, and Website for context.
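The fault-tolerant request pattern described above can be sketched as a small retry wrapper. This is a minimal illustration, not the project's actual code; the helper name `fetch_with_retry` and the retry/delay parameters are assumptions.

```python
import time

def fetch_with_retry(fetch_fn, retries=3, delay=2.0):
    """Call fetch_fn, retrying on failure; return None if every attempt fails."""
    for attempt in range(retries):
        try:
            return fetch_fn()
        except Exception:
            # API timeout or transient error: wait, then retry
            if attempt < retries - 1:
                time.sleep(delay)
    return None

# Hypothetical usage with yfinance (network call, shown for context):
# import yfinance as yf
# history = fetch_with_retry(lambda: yf.Ticker("AAPL").history(period="5y"))
```

Returning None instead of raising lets the loop over all 21 tickers continue even when one symbol fails.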
2. Data Transformation
backend/etl/clean_data.py
Raw data is rarely analysis-ready. This script cleans the dataset and, more importantly, generates financial intelligence: it fills missing values and converts raw numbers into the meaningful ratios investors rely on.
- Standardization: Renames columns to snake_case and normalizes timestamps to UTC.
- Ratio Calculation: Automatically computes 15+ metrics including ROE, ROA, Current Ratio, and Debt-to-Equity.
- Data Hygiene: Handles NaN values and guards against division-by-zero errors to ensure database stability.
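The ratio calculation and division-by-zero handling above can be sketched with pandas. This is an illustrative fragment, not the project's code; the column names (`net_income`, `total_equity`, `total_assets`) and the `safe_ratio` helper are assumptions.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Elementwise division that yields NaN instead of failing on zero denominators."""
    return numerator.divide(denominator.replace(0, np.nan))

fin = pd.DataFrame({
    "net_income":   [95_000, 12_000],
    "total_equity": [60_000, 0],        # a zero here would normally break ROE
    "total_assets": [350_000, 80_000],
})

fin["roe"] = safe_ratio(fin["net_income"], fin["total_equity"])
fin["roa"] = safe_ratio(fin["net_income"], fin["total_assets"])
```

Mapping zero denominators to NaN keeps the output numeric and lets downstream loading treat the missing ratio like any other NaN.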
3. Database Loading
backend/etl/load_to_mysql.py
This step safely deposits processed data into a MySQL database using SQLAlchemy. The logic supports "upsert" operations, meaning it can update existing records or insert new ones without creating duplicates.
- Integrity Checks: Uses INSERT IGNORE and ON DUPLICATE KEY UPDATE to manage data history.
- Batch Processing: Loads data in chunks (2,000 rows per batch) to optimize memory usage.
- Type Mapping: Ensures Python Pandas data types align perfectly with MySQL table schemas.
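The upsert-plus-batching approach above might look roughly like the sketch below. The table name `daily_prices`, its columns, and the `chunked` helper are assumptions for illustration; only the MySQL ON DUPLICATE KEY UPDATE syntax is taken from the source.

```python
from sqlalchemy import create_engine, text

def chunked(rows, size=2000):
    """Yield successive batches of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# Hypothetical target table: PRIMARY KEY (ticker, trade_date)
UPSERT_SQL = text("""
    INSERT INTO daily_prices (ticker, trade_date, close)
    VALUES (:ticker, :trade_date, :close)
    ON DUPLICATE KEY UPDATE close = VALUES(close)
""")

def load(engine, rows):
    """Upsert rows in 2000-row batches inside a single transaction per batch."""
    with engine.begin() as conn:
        for batch in chunked(rows):
            conn.execute(UPSERT_SQL, batch)  # executemany-style parameter list
```

Batching keeps the driver's parameter buffers small, while the ON DUPLICATE KEY UPDATE clause makes daily reruns idempotent.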
4. CI/CD Automation
.github/workflows/daily_etl.yml
To keep the dashboard fresh, the entire pipeline is automated via GitHub Actions and runs without human intervention.
- Daily Trigger: Scheduled to run automatically at 06:00 AM (Vietnam Time) to capture the previous day's US market close.
- Environment: Sets up Python 3.10 and installs dependencies on the fly.
- Auto-Commit: Pushes the newly processed CSV data back to the repository for the frontend to consume.
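A workflow covering the steps above might be structured as follows. This is a hedged sketch, not the repository's actual file: the step names and script entry points are assumptions; the cron expression reflects that 06:00 Vietnam time (UTC+7) is 23:00 UTC the previous day.

```yaml
name: Daily ETL

on:
  schedule:
    - cron: "0 23 * * *"   # 06:00 ICT (UTC+7)
  workflow_dispatch:        # allow manual runs

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: python backend/etl/fetch_data.py      # extract
      - run: python backend/etl/clean_data.py      # transform
      - run: python backend/etl/load_to_mysql.py   # load
      - name: Commit refreshed CSVs
        run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add data/*.csv
          git commit -m "chore: daily data refresh" || echo "no changes"
          git push
```

The `|| echo "no changes"` guard keeps the job green on days when the market data produced no new rows.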
Data Availability Strategy
Optimizing performance by separating "Live" data from "Stable" data.
Dashboard Data Feed
Interactive dashboards are powered by lightweight CSV files regenerated daily. This ensures users see the latest market movements immediately, without database-query overhead. The data flows directly from GitHub-hosted CSV files to the frontend visualizations.
Updated daily via GitHub Actions.
Financial Reports Data
The detailed written reports are based on a verified snapshot of the MySQL database, which ensures consistency in analysis. The current reports use a dataset that was integrity-checked and loaded on February 3, 2026.
Snapshot Date: 03/02/2026
Technology Stack
Modern tools and frameworks powering the ETL pipeline