Stock Market Trend Forecast

Technical Analysis
Quantitative Analysis
Machine Learning
Author

Hoang Son Lai

Published

February 4, 2026

Introduction

The stock market is inherently volatile and fraught with unpredictable risks. Relying on a single analytical lens often leaves investors with significant blind spots when making investment decisions. The “Stock Market Trend Forecast” report provides a comprehensive, in-depth evaluation of 21 representative stock tickers.

The foundation of this report is an exceptionally uniform and high-quality historical price dataset spanning from December 7, 2020, to February 3, 2026.

To navigate the complexities of market valuation and forecasting, this report applies a multi-dimensional analytical framework comprising three main pillars:

  • Technical Analysis: Focuses on reading current market behaviour by evaluating trend strength, momentum, and volatility.

  • Quantitative Analysis: Assesses the long-term structural profitability of the stocks, measuring risk-adjusted returns and the ability to control capital drawdowns.

  • Machine Learning: Moves beyond traditional static rules by applying advanced artificial intelligence algorithms to identify complex, non-linear patterns, thereby generating forward-looking predictions regarding medium-term profitability.

By synthesizing the results from all three approaches, this report not only diagnoses the past and present “health” of each stock but also constructs a composite scoring system. Ultimately, the report provides actionable, strategic recommendations to help investors efficiently allocate their capital for medium- and long-term objectives.

Data Description

Code
# Load libraries
import os
import joblib
import numpy as np
import pandas as pd
import pandas_ta as ta
import seaborn as sns
import matplotlib.pyplot as plt
import mplfinance as mpf
from pandas.plotting import table
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "notebook_connected"
pio.templates.default = "plotly_white"
import lightgbm as lgb
import xgboost as xgb
from xgboost import XGBRegressor, XGBClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score,
    r2_score, mean_squared_error, mean_absolute_error, explained_variance_score,
)
import ipywidgets as widgets
from IPython.display import display, clear_output
# To convert to HTML: quarto render ml_report.ipynb

# Load Data
try:
    df = pd.read_csv('report_data/stock_prices.csv')
    df['date'] = pd.to_datetime(df['date'])
    # Sort by ticker and date for accurate rolling calculations
    df = df.sort_values(['ticker', 'date'])
except FileNotFoundError:
    print("Error: File 'report_data/stock_prices.csv' not found.")

This dataset provides a robust historical record of market performance for 21 unique tickers, spanning from December 7, 2020, to February 3, 2026.

Notably, the data exhibits exceptional quality and balance: every ticker contains exactly 1,295 trading days, ensuring a uniform time-series structure. With 0% missing values and no duplicate entries found, this clean dataset serves as a reliable foundation for technical analysis and algorithmic backtesting.

Code
def describe_market_data(df):
    # 1. Basic Structure
    print(f"▶ Total Rows:       {df.shape[0]:,}")
    print(f"▶ Total Columns:    {df.shape[1]}")
    print(f"▶ Column Name: {df.columns}")

    # 2. Temporal Analysis (Date Range)
    # Ensure the date column is datetime
    if 'date' in df.columns:
        if not pd.api.types.is_datetime64_any_dtype(df['date']):
            try:
                df['date'] = pd.to_datetime(df['date'])
                print("✓ 'date' column converted to datetime format.")
            except (ValueError, TypeError):
                print("⚠ Warning: Could not convert 'date' column.")
        
        start_date = df['date'].min().strftime('%Y-%m-%d')
        end_date = df['date'].max().strftime('%Y-%m-%d')
        duration = (df['date'].max() - df['date'].min()).days
        
        print("\nTIME PERIOD:")
        print(f"▶ Start Date:       {start_date}")
        print(f"▶ End Date:         {end_date}")
        print(f"▶ Duration:         {duration} days")
    else:
        print("\n⚠ 'date' column not found!")

    # 3. Ticker Analysis
    if 'ticker' in df.columns:
        unique_tickers = df['ticker'].nunique()
        tickers_list = df['ticker'].unique()
        
        print("\nTICKER STATISTICS:")
        print(f"▶ Unique Tickers:   {unique_tickers}")
        
        # Check for data balance (Top 5 and Bottom 5 tickers by row count)
        ticker_counts = df['ticker'].value_counts()
        print(f"▶ Most Data:        {ticker_counts.index[0]} ({ticker_counts.iloc[0]} rows)")
        print(f"▶ Least Data:       {ticker_counts.index[-1]} ({ticker_counts.iloc[-1]} rows)")
        
        # Example list
        if unique_tickers > 10:
            print(f"▶ Examples:         {', '.join(tickers_list[:5])} ... {', '.join(tickers_list[-5:])}")
        else:
            print(f"▶ List:             {', '.join(tickers_list)}")
    else:
        print("\n⚠ 'ticker' column not found!")

    # 4. Data Quality Check
    print("\nDATA QUALITY CHECK:")
    
    # Missing Values
    missing_data = df.isnull().sum()
    total_cells = np.prod(df.shape)
    total_missing = missing_data.sum()
    
    print(f"▶ Missing Values:   {total_missing:,} cells ({total_missing/total_cells:.2%})")
    
    if total_missing > 0:
        print("  - Columns with most missing values:")
        print(missing_data[missing_data > 0].sort_values(ascending=False).head(5).to_string(header=False))

    # Duplicates Check (assuming Ticker + Date should be unique)
    if 'ticker' in df.columns and 'date' in df.columns:
        duplicates = df.duplicated(subset=['ticker', 'date']).sum()
        if duplicates > 0:
            print(f"⚠ CRITICAL: Found {duplicates} duplicate rows based on Ticker + Date!")
        else:
            print("✓ Integrity Check:  No duplicate (Ticker + Date) pairs found.")
        
describe_market_data(df)
▶ Total Rows:       27,195
▶ Total Columns:    9
▶ Column Name: Index(['id', 'ticker', 'date', 'open', 'high', 'low', 'close', 'adj_close',
       'volume'],
      dtype='str')

TIME PERIOD:
▶ Start Date:       2020-12-07
▶ End Date:         2026-02-03
▶ Duration:         1884 days

TICKER STATISTICS:
▶ Unique Tickers:   21
▶ Most Data:        AAPL (1295 rows)
▶ Least Data:       WMT (1295 rows)
▶ Examples:         AAPL, ADBE, AMZN, BAC, DIS ... PYPL, TSLA, UNH, V, WMT

DATA QUALITY CHECK:
▶ Missing Values:   0 cells (0.00%)
✓ Integrity Check:  No duplicate (Ticker + Date) pairs found.

Part 1. Technical Analysis

1.1 Objective

The purpose of this section is to identify medium- to long-term investment opportunities by analyzing price trends, momentum, and volatility. I evaluate three core dimensions for each stock:

  • Trend Strength - Is the stock in a sustained uptrend?

  • Momentum - Is buying pressure dominant?

  • Volatility - Is the stock stable enough for long-term holding?

To ensure objectivity and consistency, I apply a transparent, rule-based technical scoring system (maximum 8 points). Higher scores indicate stronger technical setups suitable for core portfolio holdings.

1.2 Indicator Calculation

I calculate the following key technical indicators on a per-ticker basis using groupby operations. This ensures that each stock's rolling windows are computed independently, preventing data leakage across different securities.

Here is a detailed explanation of each indicator:

SMA50 & SMA200 (Simple Moving Averages):

The 50-day and 200-day simple moving averages represent short- and long-term price trends, respectively. The SMA200 is particularly important for identifying the primary (long-term) trend. A price trading above the SMA200 generally indicates a bullish long-term environment.

RSI (Relative Strength Index, 14-period):

The Relative Strength Index (RSI) measures the speed and magnitude of recent price movements by comparing average gains to average losses over a specified period. Values above 70 typically indicate overbought conditions, while values below 30 suggest oversold conditions.

In this framework, RSI is transformed into a momentum score ranging from 0 to 2. A score of 2 is assigned when RSI is below 30 (strong buy zone) or between 50 and 70 (healthy uptrend momentum). A score of 1 is assigned when RSI falls between 30 and 50, reflecting mild bullish momentum. A score of 0 is given when RSI exceeds 70, indicating elevated overbought risk and reduced upside sustainability.
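The mapping above can be sketched as a small helper (a hypothetical `rsi_score` function for illustration; the report's scoring code later applies the same rules with `DataFrame.loc` masks):

```python
def rsi_score(rsi: float) -> int:
    """Map a 14-period RSI value to the 0-2 momentum score described above."""
    if rsi < 30:       # strong buy zone
        return 2
    if rsi < 50:       # mild bullish momentum
        return 1
    if rsi <= 70:      # healthy uptrend momentum
        return 2
    return 0           # overbought: elevated risk, reduced upside sustainability

print(rsi_score(25), rsi_score(42), rsi_score(61), rsi_score(75))  # 2 1 2 0
```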

MACD (Moving Average Convergence Divergence):

The MACD consists of three components:

  • MACD line: The difference between the 12-day and 26-day EMAs.

  • Signal line: A 9-day EMA of the MACD line.

  • Histogram: The difference between the MACD line and the signal line.

A positive and rising histogram indicates strengthening bullish momentum.
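For reference, the three components can be computed directly with pandas exponential moving averages. This is a minimal sketch equivalent in spirit to the `pandas_ta.macd` call used below with its default 12/26/9 settings; exact smoothing details may differ slightly from the library:

```python
import pandas as pd

def macd_components(close: pd.Series,
                    fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """Return the MACD line, signal line, and histogram for a price series."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow                            # 12-day EMA minus 26-day EMA
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()  # 9-day EMA of the MACD line
    hist = macd_line - signal_line                             # bullish when positive and rising
    return pd.DataFrame({'MACD': macd_line, 'Signal': signal_line, 'Hist': hist})

prices = pd.Series(range(100, 160), dtype=float)  # toy, steadily rising series
out = macd_components(prices)
print(out.tail(1))  # a sustained uptrend yields a positive MACD line
```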

ATR (Average True Range, 14-period):

ATR quantifies market volatility by measuring the average range between high and low prices over a period. Lower ATR values suggest more stable price movement, which is preferable for long-term investors who want to avoid excessive risk.
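A minimal ATR sketch, assuming Wilder-style (RMA) smoothing, which is the default smoothing mode in `pandas_ta.atr`; the toy bar values are illustrative only:

```python
import pandas as pd

def atr(high: pd.Series, low: pd.Series, close: pd.Series, length: int = 14) -> pd.Series:
    """Average True Range: Wilder-smoothed mean of the true range."""
    prev_close = close.shift(1)
    true_range = pd.concat([
        high - low,                   # intraday range
        (high - prev_close).abs(),    # gap up from the prior close
        (low - prev_close).abs(),     # gap down from the prior close
    ], axis=1).max(axis=1)
    # RMA smoothing: an EMA with alpha = 1/length
    return true_range.ewm(alpha=1.0 / length, adjust=False).mean()

# Toy bars: the second bar gaps up relative to the prior close
high  = pd.Series([10.0, 12.0, 12.5])
low   = pd.Series([ 9.0, 11.0, 11.5])
close = pd.Series([ 9.5, 11.5, 12.0])
print(atr(high, low, close).tolist())
```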

Code
# Work on clean copy
df_tech = df.copy()
df_tech = df_tech.sort_values(['ticker', 'date']).reset_index(drop=True)

# ----- Moving Averages -----

df_tech['SMA50'] = (
    df_tech.groupby('ticker')['close']
    .transform(lambda x: x.rolling(50, min_periods=50).mean())
)

df_tech['SMA200'] = (
    df_tech.groupby('ticker')['close']
    .transform(lambda x: x.rolling(200, min_periods=200).mean())
)

# ----- RSI -----
df_tech['RSI'] = (
    df_tech.groupby('ticker')['close']
    .transform(lambda x: ta.rsi(x, length=14))
)

# ----- MACD -----
macd_df = (
    df_tech.groupby('ticker')['close']
    .apply(lambda x: ta.macd(x))
)

# groupby().apply() returns a (ticker, row) MultiIndex; drop the ticker level
macd_df = macd_df.reset_index(level=0, drop=True)

df_tech[['MACD','MACD_signal','MACD_hist']] = macd_df.values

# ----- ATR -----
df_tech['ATR'] = (
    df_tech.groupby('ticker')
    .apply(lambda x: ta.atr(x['high'], x['low'], x['close'], length=14))
    .reset_index(level=0, drop=True)
)

# Final sanity check
print(df_tech.columns)
Index(['id', 'ticker', 'date', 'open', 'high', 'low', 'close', 'adj_close',
       'volume', 'SMA50', 'SMA200', 'RSI', 'MACD', 'MACD_signal', 'MACD_hist',
       'ATR'],
      dtype='str')

1.3 Technical Scoring Framework

I constructed a rule-based scoring model to rank stocks objectively.

Scoring Rules (Maximum = 8 points):

  • Price > SMA200 → +2 points (Long-term uptrend)

  • SMA50 > SMA200 → +2 points (Bullish structure / Golden Cross)

  • RSI < 30 → +2 points (Strong buy zone); RSI between 30 and 50 → +1 point (Mild bullish momentum); RSI between 50 and 70 → +2 points (Healthy uptrend momentum).

  • MACD Histogram > 0 → +1 point (Positive momentum)

  • ATR below cross-sectional median → +1 point (Controlled volatility, suitable for long-term holding)

Higher scores reflect stronger technical positioning for medium- to long-term investment.

Code
# ============================================
# TECHNICAL SCORING
# ============================================

latest_data = (
    df_tech
    .dropna(subset=['SMA200', 'SMA50', 'RSI', 'MACD_hist', 'ATR'])
    .sort_values(['ticker', 'date'])
    .groupby('ticker', as_index=False)
    .last()
    .copy()
)

atr_median = latest_data['ATR'].median()
latest_data['Technical_Score'] = 0

# Rules
latest_data.loc[latest_data['close'] > latest_data['SMA200'], 'Technical_Score'] += 2
latest_data.loc[latest_data['SMA50'] > latest_data['SMA200'], 'Technical_Score'] += 2
latest_data.loc[latest_data['RSI'] < 30, 'Technical_Score'] += 2
latest_data.loc[(latest_data['RSI'] >= 30) & (latest_data['RSI'] < 50), 'Technical_Score'] += 1
latest_data.loc[(latest_data['RSI'] >= 50) & (latest_data['RSI'] <= 70), 'Technical_Score'] += 2
latest_data.loc[latest_data['MACD_hist'] > 0, 'Technical_Score'] += 1
latest_data.loc[latest_data['ATR'] <= atr_median, 'Technical_Score'] += 1

# Final ranking table with styling
technical_ranking = (
    latest_data
    .sort_values('Technical_Score', ascending=False)
    [['ticker', 'close', 'SMA50', 'SMA200', 'RSI', 'MACD_hist', 'ATR', 'Technical_Score']]
    .reset_index(drop=True)
)

# Styled table
styled_table = technical_ranking.style\
    .format({
        'close': '{:.2f}',
        'SMA50': '{:.2f}',
        'SMA200': '{:.2f}',
        'RSI': '{:.2f}',
        'MACD_hist': '{:.4f}',
        'ATR': '{:.4f}',
        'Technical_Score': '{:.0f}'
    })\
    .set_table_styles([
        {'selector': 'th',
         'props': [('background-color', 'steelblue'),
                   ('color', 'white'),
                   ('font-weight', 'bold'),
                   ('text-align', 'center')]}
    ])\
    .background_gradient(subset=['Technical_Score'], cmap='RdYlGn', vmin=0, vmax=8)\
    .hide(axis='index')

styled_table
ticker close SMA50 SMA200 RSI MACD_hist ATR Technical_Score
AMZN 238.62 233.27 222.43 51.20 2.0392 6.0538 8
AAPL 269.48 268.38 237.10 61.64 -3.4410 5.7653 7
BAC 54.45 53.99 48.90 55.86 -0.4719 1.0486 7
NVDA 180.34 183.78 168.49 42.39 0.8151 5.5977 7
GOOGL 339.71 320.15 236.20 64.11 6.3572 8.3686 7
PG 155.32 145.44 152.82 69.74 1.3932 2.4784 6
WMT 127.71 114.71 103.25 76.45 1.7854 2.6163 6
JPM 314.85 313.15 292.76 54.39 -3.2201 6.4258 6
JNJ 233.10 211.16 178.97 82.03 4.8995 3.8927 6
KO 76.89 71.06 69.28 78.42 0.8591 1.1461 6
TSLA 421.96 444.33 378.96 42.56 -5.2252 15.4590 5
META 691.70 652.10 682.94 57.50 6.6052 21.5360 5
HD 381.10 359.42 371.62 59.69 6.1622 8.5331 5
UNH 284.18 328.36 324.68 31.48 -5.5849 11.6865 3
NFLX 79.94 93.60 112.67 21.28 -3.2426 2.3706 3
PYPL 41.70 58.55 66.88 12.84 -1.6890 2.0007 3
ADBE 271.93 326.49 355.65 23.30 -11.3831 9.4392 2
DIS 104.22 110.07 112.09 34.15 -0.1644 2.6815 2
MA 550.72 553.05 562.91 52.06 -6.9823 11.5092 2
MSFT 411.21 473.26 485.25 27.38 -8.3554 13.1559 2
V 328.93 337.68 344.12 45.09 -4.0178 6.9771 1

Key Findings:

  • AMZN shows the strongest technical setup (score 8/8).

  • Stocks scoring 6-8 are considered high-conviction candidates.

  • Several names (e.g., V, ADBE, MSFT, MA, DIS) score 1-2, indicating weak trends or excessive volatility.

A stock is considered technically strong and suitable for medium- to long-term holding if it meets most of the following: sustained price above SMA200, Golden Cross, healthy RSI, positive MACD histogram, and moderate ATR.

1.4 Interactive Technical Charts

The following interactive dashboards provide a comprehensive visualization of each stock’s technical condition:

  • Price Action - Historical closing prices to observe overall movement and structure

  • Trend Indicators - 50-day and 200-day Simple Moving Averages (SMA50 & SMA200) to evaluate long-term trend direction

  • Momentum (RSI) - Relative Strength Index to assess overbought and oversold conditions

  • MACD & Histogram - Momentum acceleration and trend confirmation signals

Users can dynamically switch between tickers using the dropdown menu located at the top-left of each chart.

Code
df_plot = df_tech.copy() 
df_plot['date'] = pd.to_datetime(df_plot['date']) 
df_plot = df_plot.sort_values(['ticker', 'date']) 
tickers = df_plot['ticker'].unique()


# Get full date range for padding
min_date = df_plot['date'].min()
max_date = df_plot['date'].max()
date_padding = pd.Timedelta(days=10)

fig = make_subplots(
    rows=3, cols=1,
    shared_xaxes=True,
    vertical_spacing=0.08,
    row_heights=[0.5, 0.25, 0.25],
    subplot_titles=("Price & Moving Averages", "RSI", "MACD")
)

TRACES_PER_TICKER = 7
visibility_matrix = []

for i, ticker in enumerate(tickers):

    data = df_plot[df_plot['ticker'] == ticker]
    visible = [False] * (len(tickers) * TRACES_PER_TICKER)
    base = i * TRACES_PER_TICKER
    for j in range(TRACES_PER_TICKER):
        visible[base + j] = True
    visibility_matrix.append(visible)

    # ================= PRICE =================
    fig.add_trace(go.Scatter(
        x=data['date'], y=data['close'],
        mode='lines',
        name='Close',
        legendgroup=f'price_{ticker}',
        legend='legend1',
        visible=(i==0),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>Close: %{y:.2f}<extra></extra>"
    ), row=1, col=1)

    fig.add_trace(go.Scatter(
        x=data['date'], y=data['SMA50'],
        mode='lines',
        name='SMA50',
        legendgroup=f'price_{ticker}',
        legend='legend1',
        visible=(i==0),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>SMA50: %{y:.2f}<extra></extra>"
    ), row=1, col=1)

    fig.add_trace(go.Scatter(
        x=data['date'], y=data['SMA200'],
        mode='lines',
        name='SMA200',
        legendgroup=f'price_{ticker}',
        legend='legend1',
        visible=(i==0),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>SMA200: %{y:.2f}<extra></extra>"
    ), row=1, col=1)

    # ================= RSI =================
    fig.add_trace(go.Scatter(
        x=data['date'], y=data['RSI'],
        mode='lines',
        name='RSI',
        legendgroup=f'rsi_{ticker}',
        legend='legend2',
        visible=(i==0),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>RSI: %{y:.2f}<extra></extra>"
    ), row=2, col=1)

    # ================= MACD =================
    fig.add_trace(go.Scatter(
        x=data['date'], y=data['MACD'],
        mode='lines',
        name='MACD',
        legendgroup=f'macd_{ticker}',
        legend='legend3',
        visible=(i==0),
        line=dict(color='blue'),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>MACD: %{y:.4f}<extra></extra>"
    ), row=3, col=1)

    fig.add_trace(go.Scatter(
        x=data['date'], y=data['MACD_signal'],
        mode='lines',
        name='Signal',
        legendgroup=f'macd_{ticker}',
        legend='legend3',
        visible=(i==0),
        line=dict(color='orange'),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>Signal: %{y:.4f}<extra></extra>"
    ), row=3, col=1)

    fig.add_trace(go.Scatter(
        x=data['date'], y=data['MACD_hist'],
        mode='lines',
        name='Histogram',
        legendgroup=f'macd_{ticker}',
        legend='legend3',
        visible=(i==0),
        line=dict(color='red', dash='dot'),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>Histogram: %{y:.4f}<extra></extra>"
    ), row=3, col=1)

# Dropdown
buttons = []
for i, ticker in enumerate(tickers):
    buttons.append(dict(
        label=ticker,
        method="update",
        args=[
            {"visible": visibility_matrix[i]},
            {"title": f"{ticker} - Technical Dashboard"}
        ]
    ))

# Reference lines
fig.add_hline(y=70, line_dash="dash", row=2, col=1)
fig.add_hline(y=30, line_dash="dash", row=2, col=1)
fig.add_hline(y=0, line_dash="dash", row=3, col=1)

fig.update_layout(
    height=1000,
    hovermode="x unified",

    legend1=dict(x=1.02, y=0.92, xanchor="left"),
    legend2=dict(x=1.02, y=0.55, xanchor="left"),
    legend3=dict(x=1.02, y=0.18, xanchor="left"),

    updatemenus=[dict(
        buttons=buttons,
        direction="down",
        x=0.01,
        y=1.08,
        xanchor="left",
        yanchor="top"
    )],

    margin=dict(l=60, r=160, t=80, b=60)
)

# Add padding so first & last dates are visible
fig.update_xaxes(
    type="date",
    range=[min_date - date_padding, max_date + date_padding]
)

fig.show()

Part 2. Quantitative Analysis

2.1 Objective

The purpose of this section is to evaluate each stock from a quantitative investment perspective by measuring three core aspects:

  • Historical returns (how much value was created)

  • Risk exposure (how much volatility and drawdown investors endured)

  • Risk-adjusted performance (how efficiently returns were generated relative to risk)

Unlike technical analysis, which focuses on timing and momentum, quantitative metrics assess structural, long-term performance. The goal is to identify stocks that not only delivered strong returns but did so with disciplined risk management.

2.2 Historical Returns

Return measures how much value an investment generates over time. I evaluate three complementary metrics:

  • Daily Return: Percentage change in adjusted close price from one day to the next.

  • Total Return: Cumulative growth from the first to the last trading day.

  • CAGR (Compound Annual Growth Rate): The smoothed annual return that equates the starting value to the ending value over the full period. CAGR is especially valuable for medium- to long-term investors because it accounts for compounding and removes the distorting effect of volatility.
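As a quick numeric check of the CAGR definition (toy numbers, not taken from the dataset): a position growing from $100 to $200 over 5 years compounds at (200/100)^(1/5) − 1 ≈ 14.87% per year, even though the cumulative total return is 100%.

```python
start, end, years = 100.0, 200.0, 5.0
total_return = end / start - 1            # cumulative growth over the full period
cagr = (end / start) ** (1 / years) - 1   # smoothed annual rate with compounding
print(f"Total return: {total_return:.0%}, CAGR: {cagr:.2%}")  # Total return: 100%, CAGR: 14.87%
```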

Code
# Start from the indicator-enriched frame; ensure datetime format and sort order
df = df_tech.copy()
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['ticker', 'date'])

# -----------------------------
# Daily Returns
# -----------------------------
df['daily_return'] = df.groupby('ticker')['adj_close'].pct_change()

# -----------------------------
# Cumulative Returns
# -----------------------------
df['cum_return'] = (
    df.groupby('ticker')['daily_return']
    .transform(lambda x: (1 + x).cumprod())
)

# Get full date range for padding
min_date = df['date'].min()
max_date = df['date'].max()
date_padding = pd.Timedelta(days=10)

# Get latest cumulative return for each ticker
latest_perf = (
    df.sort_values('date')
      .groupby('ticker')
      .tail(1)[['ticker', 'cum_return']]
      .sort_values('cum_return', ascending=False)
)

ordered_tickers = latest_perf['ticker'].tolist()

df['ticker'] = pd.Categorical(
    df['ticker'],
    categories=ordered_tickers,
    ordered=True
)

df = df.sort_values(['ticker', 'date'])

# Build figure
fig_cum = px.line(
    df,
    x='date',
    y='cum_return',
    color='ticker',
    category_orders={"ticker": ordered_tickers},
    title='Cumulative Return Over Time'
)

# Force daily format + unified hover
fig_cum.update_traces(
    hovertemplate=
    "<b>%{fullData.name}</b><br>" +
    "Date: %{x|%Y-%m-%d}<br>" +
    "Cumulative Return: %{y:.2f}<extra></extra>"
)

fig_cum.update_layout(
    height=500,
    yaxis_title="Cumulative Return (Growth of $1)",
    xaxis_title="Date",
    legend_title="Ticker",
    xaxis=dict(
        type="date",
        hoverformat="%Y-%m-%d"
    )
)

# Add padding so first & last dates are visible
fig_cum.update_xaxes(
    type="date",
    range=[min_date - date_padding, max_date + date_padding]
)

fig_cum.show()
Code
# -----------------------------
# Total Return
# -----------------------------
total_return = (
    df.groupby('ticker')['adj_close']
    .agg(lambda x: x.iloc[-1] / x.iloc[0] - 1)
    .reset_index(name='Total Return')
)

# -----------------------------
# CAGR
# -----------------------------
years = (df['date'].max() - df['date'].min()).days / 365

cagr = (
    df.groupby('ticker')['adj_close']
    .agg(lambda x: (x.iloc[-1] / x.iloc[0]) ** (1 / years) - 1)
    .reset_index(name='CAGR')
)

# -----------------------------
# Mean Daily Return
# -----------------------------

mean_daily = (
    df.groupby('ticker')['daily_return']
    .mean()
    .reset_index(name='Mean Daily Return')
)

# -----------------------------
# Merge All Return Metrics
# -----------------------------

return_table = (
    total_return
    .merge(mean_daily, on='ticker')
    .merge(cagr, on='ticker')
)

# -----------------------------
# Display Table
# -----------------------------


# Display Return Table (no index, styled header)
return_table_styled = (
    return_table
    .sort_values('CAGR', ascending=False)
    .style
    .format({
        'Total Return': '{:.2%}',
        'Mean Daily Return': '{:.4%}',
        'CAGR': '{:.2%}'
    })
    .set_table_styles([
        {'selector': 'th', 
         'props': [('background-color', 'steelblue'), 
                   ('color', 'white'), 
                   ('font-weight', 'bold')]}
    ])
    .hide(axis='index')   # Hide the row index
)

return_table_styled
ticker Total Return Mean Daily Return CAGR
NVDA 1229.10% 0.2524% 65.07%
GOOGL 276.50% 0.1213% 29.28%
JPM 194.13% 0.0951% 23.25%
WMT 177.32% 0.0875% 21.85%
META 143.70% 0.1063% 18.84%
AAPL 123.70% 0.0775% 16.88%
BAC 111.21% 0.0722% 15.59%
MSFT 99.99% 0.0670% 14.37%
TSLA 97.25% 0.1244% 14.07%
JNJ 80.36% 0.0512% 12.10%
KO 68.66% 0.0455% 10.66%
MA 66.16% 0.0508% 10.34%
HD 63.90% 0.0492% 10.05%
V 60.39% 0.0467% 9.59%
NFLX 54.99% 0.0718% 8.86%
AMZN 51.12% 0.0560% 8.33%
PG 27.85% 0.0249% 4.87%
UNH -11.96% 0.0096% -2.44%
DIS -31.05% -0.0112% -6.95%
ADBE -44.76% -0.0210% -10.86%
PYPL -80.82% -0.0912% -27.38%

Key Findings:

  • NVDA delivered exceptional performance with a CAGR of 65.07%.

  • Strong performers also include GOOGL, JPM, and WMT.

  • Several stocks (PYPL, ADBE, DIS) posted negative returns over the period.

2.3 Risk Analysis

I evaluate risk using three complementary metrics:

  • Annualized Volatility: Standard deviation of daily returns, scaled by √252 (trading days per year). It captures total price fluctuation.

  • Maximum Drawdown: The largest peak-to-trough decline. It answers: “What was the worst loss an investor would have suffered if they bought at the peak?”

  • Downside Deviation: Volatility of only negative returns. It focuses purely on harmful downside risk, making it more relevant for long-term investors than total volatility.

Code
# -----------------------------
# Annualized Volatility
# -----------------------------
volatility = (
    df.groupby('ticker')['daily_return']
    .std()
    .reset_index(name='Volatility')
)

volatility['Volatility'] *= np.sqrt(252)

# -----------------------------
# Max Drawdown
# -----------------------------
def max_drawdown(series):
    cumulative = (1 + series).cumprod()
    peak = cumulative.cummax()
    drawdown = cumulative / peak - 1
    return drawdown.min()

max_dd = (
    df.groupby('ticker')['daily_return']
    .apply(max_drawdown)
    .reset_index(name='Max Drawdown')
)

# -----------------------------
# Downside Deviation
# -----------------------------
def downside_dev(series):
    negative_returns = series[series < 0]
    return negative_returns.std() * np.sqrt(252)

downside = (
    df.groupby('ticker')['daily_return']
    .apply(downside_dev)
    .reset_index(name='Downside Deviation')
)

# -----------------------------
# Merge All Risk Metrics
# -----------------------------

risk_table = volatility.merge(max_dd, on='ticker') \
                       .merge(downside, on='ticker')

# Display table

risk_table_styled = (
    risk_table
    .sort_values('Volatility', ascending=False)
    .style
    .format({
        'Volatility': '{:.2%}',
        'Max Drawdown': '{:.2%}',
        'Downside Deviation': '{:.2%}'
    })
    .set_table_styles([
        {'selector': 'th', 
         'props': [('background-color', 'steelblue'), 
                   ('color', 'white'), 
                   ('font-weight', 'bold')]}
    ])
    .hide(axis='index')
)

risk_table_styled
ticker Volatility Max Drawdown Downside Deviation
TSLA 60.39% -73.63% 38.64%
NVDA 51.60% -66.34% 32.15%
META 43.25% -76.74% 32.76%
NFLX 42.90% -75.95% 34.69%
PYPL 42.30% -86.45% 33.41%
ADBE 35.13% -60.50% 28.77%
AMZN 34.85% -56.15% 24.05%
GOOGL 30.82% -44.32% 20.87%
UNH 30.72% -61.39% 28.59%
DIS 29.76% -60.72% 21.20%
AAPL 27.83% -33.36% 18.78%
BAC 26.95% -46.64% 17.94%
MSFT 26.02% -37.15% 18.10%
JPM 24.28% -38.77% 17.12%
MA 24.17% -28.25% 17.36%
HD 23.56% -34.73% 16.34%
V 22.68% -28.60% 16.28%
WMT 20.84% -25.74% 16.21%
PG 17.26% -23.77% 12.68%
JNJ 16.78% -18.41% 11.35%
KO 15.94% -17.27% 11.20%
Code
risk_return = return_table.merge(risk_table, on='ticker')

fig_scatter = px.scatter(
    risk_return,
    x='Volatility',
    y='CAGR',
    text='ticker',
    title='Risk vs Return (CAGR vs Volatility)'
)

fig_scatter.update_traces(
    textposition='top center',
    marker=dict(size=12),
    hovertemplate=
        "<b>%{text}</b><br>" +
        "CAGR: %{y:.2%}<br>" +
        "Volatility: %{x:.2%}<br>" +
        "Max Drawdown: %{customdata[0]:.2%}<br>" +
        "<extra></extra>",
    customdata=risk_return[['Max Drawdown']]
)

fig_scatter.update_layout(
    height=500,
    hovermode="closest"
)

fig_scatter.show()

Key observations:

  • NVDA stands out in the top-right quadrant: extremely high return paired with some of the highest volatility in the group. This reflects its aggressive growth profile - suitable for investors with high risk tolerance.

  • WMT, JPM, and GOOGL occupy a favorable middle zone: strong CAGR with relatively moderate volatility, indicating efficient return generation.

  • PYPL and TSLA show high volatility combined with poor or negative returns - the least attractive risk-return profile.

  • Defensive names such as JNJ and KO appear on the left side (low volatility) with modest but stable returns, ideal for conservative portfolios.

2.4 Risk-Adjusted Performance

To evaluate how efficiently returns were generated per unit of risk, I compute three widely respected ratios:

  • Sharpe Ratio = (CAGR - Risk-free rate) / Annualized Volatility

This is the most common risk-adjusted metric. It shows excess return (above a risk-free benchmark, here 2%) per unit of total volatility. A Sharpe Ratio > 1.0 is generally considered excellent; values above 0.5 are acceptable for equity strategies.

  • Sortino Ratio = (CAGR - Risk-free rate) / Downside Deviation

An improvement over the Sharpe Ratio because it only penalizes harmful (downside) volatility. It is particularly useful for equity investors who are more concerned about large losses than upward swings. Higher Sortino values indicate stronger protection against downside risk.

  • Calmar Ratio = CAGR / |Maximum Drawdown|

This ratio focuses on drawdown risk, which is highly relevant for long-term holding periods. It measures how much annual return is earned for every 1% of the worst historical loss. A Calmar Ratio > 0.5 is often viewed as attractive for medium- to long-term strategies.

Code
risk_free_rate = 0.02

metrics = risk_return.copy()

# Sharpe
metrics['Sharpe'] = (
    (metrics['CAGR'] - risk_free_rate) / metrics['Volatility']
)

# Sortino
metrics['Sortino'] = (
    (metrics['CAGR'] - risk_free_rate) / metrics['Downside Deviation']
)

# Calmar
metrics['Calmar'] = (
    metrics['CAGR'] / abs(metrics['Max Drawdown'])
)

# Keep the relevant columns
metrics = metrics[
    ['ticker','CAGR','Volatility','Max Drawdown',
     'Sharpe','Sortino','Calmar']
]

# Display table
risk_adjusted_styled = (
    metrics[['ticker', 'Sharpe', 'Sortino', 'Calmar']]
    .sort_values('Sharpe', ascending=False)
    .style
    .format({
        'Sharpe': '{:.3f}',
        'Sortino': '{:.3f}',
        'Calmar': '{:.3f}'
    })
    .set_table_styles([
        {'selector': 'th', 
         'props': [('background-color', 'steelblue'), 
                   ('color', 'white'), 
                   ('font-weight', 'bold')]}
    ])
    .hide(axis='index')
)

risk_adjusted_styled
ticker Sharpe Sortino Calmar
NVDA 1.222 1.962 0.981
WMT 0.952 1.224 0.849
GOOGL 0.885 1.307 0.661
JPM 0.875 1.241 0.600
JNJ 0.602 0.890 0.657
KO 0.543 0.773 0.617
AAPL 0.535 0.792 0.506
BAC 0.504 0.757 0.334
MSFT 0.475 0.683 0.387
META 0.389 0.514 0.245
MA 0.345 0.480 0.366
HD 0.341 0.492 0.289
V 0.334 0.466 0.335
TSLA 0.200 0.312 0.191
AMZN 0.182 0.263 0.148
PG 0.167 0.227 0.205
NFLX 0.160 0.198 0.117
UNH -0.144 -0.155 -0.040
DIS -0.301 -0.422 -0.114
ADBE -0.366 -0.447 -0.180
PYPL -0.695 -0.879 -0.317

Key findings

  • NVDA leads with the highest Sharpe (1.22) and Sortino (1.96), showing it delivered exceptional excess return even after accounting for its high volatility.

  • WMT and JPM also score very well, offering strong returns with better risk control.

  • Negative ratios (UNH, DIS, ADBE, PYPL) indicate that these stocks failed to compensate investors adequately for the risk endured.

2.5. Quantitative Scoring

To create a single, objective ranking that combines return, risk, and efficiency, I developed a Quantitative Score ranging from 0 to 10.

Step-by-step methodology:

  1. Min-Max Normalization:

Each raw metric is scaled to a 0-1 range so they become comparable:

  • For “higher is better” metrics (CAGR, Sharpe): score = (value - min) / (max - min)

  • For “lower is better” risk metrics (Volatility, Max Drawdown): score = 1 - (value - min) / (max - min)

This removes differences in scale and units.

  2. Weighted Aggregation

The normalized scores are combined using the following weights, chosen to reflect long-term investor priorities:

  • 40% CAGR → Emphasizes actual wealth creation (growth is the ultimate goal).

  • 30% Sharpe Ratio → Rewards efficient return per unit of total risk.

  • 20% Volatility control → Penalizes excessive price swings.

  • 10% Drawdown control → Gives modest weight to capital preservation (the worst historical loss).

These weights prioritize sustainable compounding while still penalizing high risk.

  3. Final Scaling

The weighted sum is multiplied by 10 to produce a clean 0-10 score.

  • Score > 7.0: Excellent risk-return profile (core holdings).

  • Score 6.0 - 7.0: Solid, well-balanced candidates.

  • Score 4.0 - 6.0: Mixed profiles that warrant caution.

  • Score < 4.0: Weak or high-risk underperformers.

Code
# Step 1 - Normalize Metrics (Min-Max)
score_df = metrics.copy()

def minmax(series):
    return (series - series.min()) / (series.max() - series.min())

score_df['Return_score'] = minmax(score_df['CAGR'])
score_df['Sharpe_score'] = minmax(score_df['Sharpe'])
score_df['Vol_score'] = 1 - minmax(score_df['Volatility'])
score_df['DD_score'] = 1 - minmax(abs(score_df['Max Drawdown']))

# Step 2 - Weighted Quant Score
score_df['Quant Score'] = (
    0.4 * score_df['Return_score'] +
    0.3 * score_df['Sharpe_score'] +
    0.2 * score_df['Vol_score'] +
    0.1 * score_df['DD_score']
)

score_df['Quant Score'] *= 10

score_df = score_df.sort_values('Quant Score', ascending=False)

# Step 3 - Quant Score Ranking
# Round values for presentation
display_table = score_df.copy()

display_table = display_table[[
    'ticker',
    'CAGR',
    'Volatility',
    'Max Drawdown',
    'Sharpe',
    'Return_score',
    'Sharpe_score',
    'Vol_score',
    'DD_score',
    'Quant Score'
]]

display_table = display_table.round(3)

display_table = display_table.sort_values('Quant Score', ascending=False)

def quant_score_style(val):
    if val > 7.0:
        return 'background-color: #006400; color: white; font-weight: bold'   # Dark green
    elif val >= 6:
        return 'background-color: #90EE90; color: black; font-weight: bold'   # Light green
    elif val < 4.0:
        return 'background-color: #FFB6C1; color: black; font-weight: bold'   # Light red
    else:
        return 'background-color: #F0E68C; color: black; font-weight: bold'

# Create styled table
quant_styled = (
    display_table
    .style
    .format({
        'CAGR': '{:.3f}',
        'Volatility': '{:.3f}',
        'Max Drawdown': '{:.3f}',
        'Sharpe': '{:.3f}',
        'Quant Score': '{:.2f}'
    })
    .map(quant_score_style, subset=['Quant Score'])
    .set_table_styles([
        {'selector': 'th', 
         'props': [('background-color', 'steelblue'), 
                   ('color', 'white'), 
                   ('font-weight', 'bold')]}
    ])
    .hide(axis='index')
)

quant_styled
ticker CAGR Volatility Max Drawdown Sharpe Return_score Sharpe_score Vol_score DD_score Quant Score
NVDA 0.651 0.516 -0.663 1.222 1.000000 1.000000 0.198000 0.291000 7.69
WMT 0.218 0.208 -0.257 0.952 0.532000 0.859000 0.890000 0.878000 7.37
JPM 0.232 0.243 -0.388 0.875 0.548000 0.819000 0.812000 0.689000 6.96
GOOGL 0.293 0.308 -0.443 0.885 0.613000 0.824000 0.665000 0.609000 6.86
JNJ 0.121 0.168 -0.184 0.602 0.427000 0.676000 0.981000 0.984000 6.68
KO 0.107 0.159 -0.173 0.543 0.411000 0.646000 1.000000 1.000000 6.58
AAPL 0.169 0.278 -0.334 0.535 0.479000 0.641000 0.733000 0.767000 6.07
MSFT 0.144 0.260 -0.371 0.475 0.452000 0.610000 0.773000 0.713000 5.90
BAC 0.156 0.269 -0.466 0.504 0.465000 0.625000 0.752000 0.575000 5.82
V 0.096 0.227 -0.286 0.334 0.400000 0.537000 0.848000 0.836000 5.74
MA 0.103 0.242 -0.283 0.345 0.408000 0.542000 0.815000 0.841000 5.73
HD 0.100 0.236 -0.347 0.341 0.405000 0.540000 0.829000 0.748000 5.64
PG 0.049 0.173 -0.238 0.167 0.349000 0.449000 0.970000 0.906000 5.59
META 0.188 0.432 -0.767 0.389 0.500000 0.565000 0.386000 0.140000 4.61
AMZN 0.083 0.349 -0.561 0.182 0.386000 0.457000 0.575000 0.438000 4.50
NFLX 0.089 0.429 -0.759 0.160 0.392000 0.446000 0.393000 0.152000 3.84
UNH -0.024 0.307 -0.614 -0.144 0.270000 0.287000 0.667000 0.362000 3.64
TSLA 0.141 0.604 -0.736 0.200 0.448000 0.467000 0.000000 0.185000 3.38
DIS -0.070 0.298 -0.607 -0.301 0.221000 0.205000 0.689000 0.372000 3.25
ADBE -0.109 0.351 -0.605 -0.366 0.179000 0.171000 0.568000 0.375000 2.74
PYPL -0.274 0.423 -0.865 -0.695 0.000000 0.000000 0.407000 0.000000 0.81
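
As a sanity check, NVDA's top score can be reproduced by hand from its normalized component scores in the table above:

```python
# NVDA's normalized scores as displayed in the ranking table
return_score, sharpe_score, vol_score, dd_score = 1.0, 1.0, 0.198, 0.291

# Apply the 40/30/20/10 weights and rescale to 0-10
quant_score = 10 * (0.4 * return_score +
                    0.3 * sharpe_score +
                    0.2 * vol_score +
                    0.1 * dd_score)

print(round(quant_score, 2))  # matches the 7.69 shown above
```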

Key takeaway:

NVDA leads due to its extraordinary growth, while WMT ranks second because of its outstanding balance between solid returns and very low risk.

Part 3. Machine Learning

3.1. Objective

The objective of this section is to develop a robust machine learning framework that complements our technical and quantitative analysis. Rather than attempting to predict exact price levels — a notoriously difficult task — I focus on two practical goals:

  • Forecasting the expected 60-day forward return (future_return_60d)

  • Predicting the direction of the next 60 trading days (Up or Down)

By combining regression and classification models, I generate high-confidence, actionable trading signals that can be directly integrated into our overall scoring system. This hybrid approach allows us to estimate both the magnitude of potential returns and the probability of a positive outcome, significantly improving decision-making for medium-term investment strategies.

3.2. Feature Engineering

Feature engineering is the foundation of any successful financial machine learning model. I construct a feature set that captures momentum, trend strength, volatility, liquidity, and cross-variable interactions.

Key Feature Categories:

  • Lagged Returns: 1, 3, 5, 10, 20, and 40-day returns

  • Technical Indicators: RSI(14), MACD, MACD Histogram, ATR(14)

  • Oscillators & Bands: Stochastic %K, Bollinger Band position

  • Trend Strength: Price/SMA200 ratio, RSI Z-score

  • Liquidity & Volume: 5-day volume change, volume Z-score

  • Interaction Terms: RSI × MACD Histogram, Stochastic × BB position, Return × Volume (these significantly boost directional accuracy)

Code
# Load data from Part 1 (df_tech already contains RSI, MACD, ATR, SMA)
df_ml = df_tech.copy()
df_ml = df_ml.sort_values(['ticker', 'date']).reset_index(drop=True)

# -----------------------------
# 1. Lagged Returns
# -----------------------------
for lag in [1, 3, 5, 10, 20, 40]:
    df_ml[f'return_{lag}d'] = df_ml.groupby('ticker')['close'].pct_change(lag)

# -----------------------------
# 2. Technical Indicators
# -----------------------------
# RSI (manual calculation; rolling means grouped per ticker so that
# averages never cross ticker boundaries)
delta = df_ml.groupby('ticker')['close'].diff()
gain = delta.clip(lower=0).groupby(df_ml['ticker']).transform(lambda x: x.rolling(14).mean())
loss = (-delta.clip(upper=0)).groupby(df_ml['ticker']).transform(lambda x: x.rolling(14).mean())
rs = gain / loss
df_ml['rsi'] = 100 - (100 / (1 + rs))

# MACD
ema12 = df_ml.groupby('ticker')['close'].transform(lambda x: x.ewm(span=12, adjust=False).mean())
ema26 = df_ml.groupby('ticker')['close'].transform(lambda x: x.ewm(span=26, adjust=False).mean())
df_ml['macd'] = ema12 - ema26
df_ml['macd_hist'] = df_ml['macd'] - df_ml['macd'].groupby(df_ml['ticker']).transform(lambda x: x.ewm(span=9, adjust=False).mean())

# ATR (simplified; previous close taken per ticker to avoid
# mixing the last close of one ticker with the first bar of the next)
prev_close = df_ml.groupby('ticker')['close'].shift(1)
tr = pd.concat([df_ml['high'] - df_ml['low'],
                abs(df_ml['high'] - prev_close),
                abs(df_ml['low'] - prev_close)], axis=1).max(axis=1)
df_ml['atr'] = tr.groupby(df_ml['ticker']).rolling(14).mean().reset_index(0, drop=True)

# -----------------------------
# 3. New Features for Higher Directional Accuracy
# -----------------------------
df_ml['stoch_k'] = 100 * (
    df_ml['close'] - df_ml.groupby('ticker')['low'].transform(lambda x: x.rolling(14).min())
) / (
    df_ml.groupby('ticker')['high'].transform(lambda x: x.rolling(14).max()) - 
    df_ml.groupby('ticker')['low'].transform(lambda x: x.rolling(14).min())
)

df_ml['bb_mid'] = df_ml.groupby('ticker')['close'].transform(lambda x: x.rolling(20).mean())
df_ml['bb_std'] = df_ml.groupby('ticker')['close'].transform(lambda x: x.rolling(20).std())
df_ml['bb_position'] = (df_ml['close'] - (df_ml['bb_mid'] - 2 * df_ml['bb_std'])) / (4 * df_ml['bb_std'])

df_ml['price_sma200_ratio'] = df_ml['close'] / df_ml.groupby('ticker')['close'].transform(lambda x: x.rolling(200).mean())
df_ml['rsi_zscore'] = df_ml.groupby('ticker')['rsi'].transform(lambda x: (x - x.rolling(60).mean()) / x.rolling(60).std())

df_ml['volume_change_5d'] = df_ml.groupby('ticker')['volume'].pct_change(5)
df_ml['volume_zscore'] = df_ml.groupby('ticker')['volume'].transform(lambda x: (x - x.rolling(60).mean()) / x.rolling(60).std())

# -----------------------------
# 4. Interaction Features (Key to improving directional accuracy)
# -----------------------------
df_ml['rsi_macd_interact'] = df_ml['rsi'] * df_ml['macd_hist']
df_ml['stoch_bb_interact'] = df_ml['stoch_k'] * df_ml['bb_position']
df_ml['return_volume_interact'] = df_ml['return_5d'] * df_ml['volume_change_5d']
df_ml['rsi_momentum'] = df_ml['rsi'] * df_ml['return_10d']

# -----------------------------
# 5. Targets
# -----------------------------
df_ml['future_return_60d'] = df_ml.groupby('ticker')['close'].shift(-60) / df_ml['close'] - 1
df_ml['direction_60d'] = (df_ml['future_return_60d'] > 0).astype(int)

df_ml = df_ml.dropna().reset_index(drop=True)

print(f"Final ML dataset: {df_ml.shape[0]:,} rows across {df_ml['ticker'].nunique()} tickers")
Final ML dataset: 21,756 rows across 21 tickers

3.3. Model Architecture

To capture both the potential size and the likelihood of future price movements, this analysis employs a dual-model architecture utilizing XGBoost. Specifically, two complementary models are trained:

  • XGBoost Regressor (Magnitude Prediction): This model is tasked with forecasting the continuous numerical value of the 60-day forward return. It answers the question of how much a stock is expected to gain or lose.

  • XGBoost Classifier (Directional Probability): This model treats forecasting as a binary classification problem (positive vs. negative return). It calculates the statistical probability that the stock will yield a positive return over the 60-day window, effectively serving as a confidence metric for the trade direction.

Both models use the same feature set and are trained with time-series aware splitting to prevent leakage.
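
The report uses a single chronological 80/20 split; a common extension (not used here) is walk-forward validation with scikit-learn's TimeSeriesSplit, which yields several leakage-free folds. A minimal sketch on toy data — the real frame would first need to be sorted by date:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy chronological data standing in for the real (date-sorted) feature matrix
X_demo = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_demo)):
    # Each training window ends strictly before its test window,
    # so no future information leaks into the fit
    assert train_idx.max() < test_idx.min()
    print(f"fold {fold}: train rows 0-{train_idx.max()}, "
          f"test rows {test_idx.min()}-{test_idx.max()}")
```

Averaging metrics across such folds gives a more stable estimate of out-of-sample performance than any single cut.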

3.4. Training & Evaluation

Code
# Modeling imports
from xgboost import XGBRegressor, XGBClassifier

# Feature list
features = [
    'return_1d', 'return_3d', 'return_5d', 'return_10d', 'return_20d', 'return_40d',
    'rsi', 'macd', 'macd_hist', 'atr', 'stoch_k', 'bb_position',
    'price_sma200_ratio', 'rsi_zscore', 'volume_change_5d', 'volume_zscore',
    'rsi_macd_interact', 'stoch_bb_interact', 'return_volume_interact', 'rsi_momentum'
]

X = df_ml[features]
y_reg = df_ml['future_return_60d']
y_clf = df_ml['direction_60d']

# Time-based split (80% train, 20% test)
# Note: df_ml is sorted by (ticker, date); a strictly chronological split
# would require re-sorting the frame by date first
split_idx = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_reg_train, y_reg_test = y_reg.iloc[:split_idx], y_reg.iloc[split_idx:]
y_clf_train, y_clf_test = y_clf.iloc[:split_idx], y_clf.iloc[split_idx:]

# Train models
reg_model = XGBRegressor(n_estimators=600, max_depth=5, learning_rate=0.03, 
                         subsample=0.85, colsample_bytree=0.8, random_state=42)
clf_model = XGBClassifier(n_estimators=500, max_depth=4, learning_rate=0.04, 
                          subsample=0.85, colsample_bytree=0.8, random_state=42)

reg_model.fit(X_train, y_reg_train)
clf_model.fit(X_train, y_clf_train)

# Evaluation
reg_pred = reg_model.predict(X_test)
clf_pred = clf_model.predict(X_test)
clf_prob = clf_model.predict_proba(X_test)[:, 1]
Code
# Evaluation imports
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             r2_score, explained_variance_score,
                             accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# -----------------------------
# 1. Regression Metrics
# -----------------------------
rmse = np.sqrt(mean_squared_error(y_reg_test, reg_pred))
mae  = mean_absolute_error(y_reg_test, reg_pred)
r2   = r2_score(y_reg_test, reg_pred)
ev   = explained_variance_score(y_reg_test, reg_pred)

# -----------------------------
# 2. Classification Metrics
# -----------------------------
accuracy = accuracy_score(y_clf_test, clf_pred)
precision = precision_score(y_clf_test, clf_pred)
recall    = recall_score(y_clf_test, clf_pred)
f1        = f1_score(y_clf_test, clf_pred)
auc       = roc_auc_score(y_clf_test, clf_prob)

# -----------------------------
# 3. Strategy Metrics
# -----------------------------
win_rate = (clf_pred == y_clf_test).mean()  # identical to directional accuracy by construction

# -----------------------------
# 4. Display performance table
# -----------------------------
performance_data = {
    'Metric': [
        'Regression RMSE',
        'Regression MAE',
        'R² Score',
        'Explained Variance',
        'Directional Accuracy',
        'Precision (Up)',
        'Recall (Up)',
        'F1-Score',
        'ROC-AUC',
        'Win Rate'
    ],
    'Value': [
        f"{rmse:.4f}",
        f"{mae:.4f}",
        f"{r2:.4f}",
        f"{ev:.4f}",
        f"{accuracy:.1%}",
        f"{precision:.1%}",
        f"{recall:.1%}",
        f"{f1:.4f}",
        f"{auc:.4f}",
        f"{win_rate:.1%}"
    ]
}

perf_df = pd.DataFrame(performance_data)

# Display table
styled_perf = perf_df.style\
    .set_table_styles([
        {'selector': 'th', 'props': [
            ('background-color', 'steelblue'),
            ('color', 'white'),
            ('font-weight', 'bold'),
            ('text-align', 'center')
        ]},
        {'selector': 'td', 'props': [('text-align', 'center')]}
    ])\
    .hide(axis='index')

print("="*70)
print("MODEL PERFORMANCE - FULL EVALUATION")
print("="*70)
styled_perf
======================================================================
MODEL PERFORMANCE - FULL EVALUATION
======================================================================
Metric Value
Regression RMSE 0.1831
Regression MAE 0.1242
R² Score -0.0648
Explained Variance -0.0615
Directional Accuracy 57.1%
Precision (Up) 63.3%
Recall (Up) 71.5%
F1-Score 0.6716
ROC-AUC 0.5258
Win Rate 57.1%

The regression results show limited ability to predict the exact magnitude of returns, as indicated by a negative R² and explained variance. This suggests the model is not suitable for precise return forecasting.

However, directional performance is more encouraging. The model achieves 57.1% directional accuracy, which is meaningfully above the random 50% baseline. Precision (63.3%) and recall (71.5%) for upward moves indicate a solid ability to identify positive trends.

Although the ROC-AUC (0.5258) reflects only a slim classification edge, the overall results suggest the presence of a modest predictive signal.

In conclusion, the model is better suited for directional ranking within a multi-factor framework rather than standalone return prediction.
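
A back-of-envelope check can gauge whether 57.1% accuracy is distinguishable from coin-flipping. The sketch below approximates the test-set size from the 80/20 split of 21,756 rows; note that overlapping 60-day targets violate the independence assumption, so the true significance is weaker than this suggests:

```python
import math

n_test = int(21_756 * 0.2)   # approximate test-set size implied by the 80/20 split
acc = 0.571                  # reported directional accuracy

# Normal approximation to the binomial under H0: accuracy = 0.5
se = math.sqrt(0.25 / n_test)  # std. error of a fair-coin hit rate
z = (acc - 0.5) / se
print(f"n = {n_test}, z = {z:.1f}")  # z far above 3 -> unlikely to be pure chance
```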

Code
# -----------------------------
# 1. Prepare Data for Both Models
# -----------------------------
# Regressor Data
imp_reg = pd.Series(reg_model.feature_importances_, index=features).sort_values(ascending=True)

# Classifier Data
imp_clf = pd.Series(clf_model.feature_importances_, index=features).sort_values(ascending=True)

# -----------------------------
# 2. Create Figure & Add Traces
# -----------------------------
fig_imp = go.Figure()

# Trace 1: Regressor 
fig_imp.add_trace(
    go.Bar(
        x=imp_reg.values,
        y=imp_reg.index,
        orientation='h',
        name='Regressor',
        marker_color='steelblue'
    )
)

# Trace 2: Classifier 
fig_imp.add_trace(
    go.Bar(
        x=imp_clf.values,
        y=imp_clf.index,
        orientation='h',
        name='Classifier',
        marker_color='mediumseagreen', 
        visible=False
    )
)

# -----------------------------
# 3. Add Dropdown Menu & Update Layout
# -----------------------------
fig_imp.update_layout(
    updatemenus=[
        dict(
            active=0, 
            buttons=list([
                # 1: Regressor
                dict(label="XGBoost Regressor (Magnitude)",
                     method="update",
                     args=[{"visible": [True, False]},
                           {"title.text": "Feature Importance - XGBoost Regressor (60-day Return Prediction)"}]),
                
                # 2: Classifier
                dict(label="XGBoost Classifier (Direction)",
                     method="update",
                     args=[{"visible": [False, True]},
                           {"title.text": "Feature Importance - XGBoost Classifier (Probability of Up)"}]), 
            ]),
            direction="down",
            showactive=True,
            x=0,
            xanchor="left",
            y=1.3, 
            yanchor="top"
        )
    ],
    title=dict(
            text="Feature Importance - XGBoost Regressor (60-day Return Prediction)",
            y=0.85, 
            x=0.05, 
            xanchor="left",
            yanchor="top"
    ),
    height=750, 
    xaxis_title="Importance Score",
    yaxis_title="Feature",
    template="plotly_white",
    showlegend=False,
    plot_bgcolor='white',
    paper_bgcolor='white',
    margin=dict(t=80) 
)

# -----------------------------
# 4. Show Chart
# -----------------------------
fig_imp.show()

The charts above show the relative importance of each feature. The top contributors are ATR, Price/SMA200 ratio, and 40-day return - confirming that volatility control and long-term trend structure are key drivers of 60-day performance.

3.5. Signal Generation & ML Scoring

To create a single, interpretable score that combines both models, the following weighted formula is applied:

ML Score = 0.3 × (Z-scored Predicted Return) + 0.7 × (Z-scored Probability of Up)

  • 30% Z-Scored Predicted Return: Captures the relative expected magnitude of gain across stocks on the same date. Using cross-sectional z-scores ensures comparability and reduces scale distortion.

  • 70% Z-Scored Probability of Up: Reflects the model’s confidence in a positive move. Given the stronger directional performance relative to magnitude forecasting, probability is assigned a higher weight.

Both components are standardized cross-sectionally (per date) to preserve ranking quality and avoid scale bias. The combined score is then rescaled to a 0-10 range for interpretability and integration with Technical and Quantitative factors.

This approach ensures the ML Score remains smooth, continuous, and ranking-oriented, making it suitable for multi-factor portfolio construction rather than threshold-based signal generation.

Code
from sklearn.preprocessing import MinMaxScaler

# =========================================
# Add predictions
# (note: X spans both train and test periods, so these
#  in-sample scores are used for ranking, not evaluation)
# =========================================

df_ml['pred_return_60d'] = reg_model.predict(X)
df_ml['pred_prob_up'] = clf_model.predict_proba(X)[:, 1]

# =========================================
# Cross-sectional Z-score (per date)
# =========================================

df_ml['z_return'] = df_ml.groupby('date')['pred_return_60d'] \
    .transform(lambda x: (x - x.mean()) / (x.std() + 1e-9))

df_ml['z_prob'] = df_ml.groupby('date')['pred_prob_up'] \
    .transform(lambda x: (x - x.mean()) / (x.std() + 1e-9))

# =========================================
# Weighted ML Raw Score (0.3 / 0.7)
# =========================================

df_ml['ml_raw'] = (
    0.3 * df_ml['z_return'] +
    0.7 * df_ml['z_prob']
)

# =========================================
# Rescale to 0-10 for interpretability
# =========================================

scaler = MinMaxScaler()
df_ml['ml_score'] = scaler.fit_transform(df_ml[['ml_raw']]) * 10

# =========================================
# Create Summary Table
# =========================================

ml_table = (df_ml.groupby('ticker')
            .agg({
                'pred_return_60d': 'mean',
                'pred_prob_up': 'mean',
                'ml_score': 'mean'
            })
            .round(3)
            .sort_values('ml_score', ascending=False)
            .reset_index())

ml_table.columns = ['Ticker', 'Pred. 60d Return', 'Prob Up', 'ML Score']

# =========================================
# Styling
# =========================================

styled_table = (
    ml_table.style
    .format({
        'Pred. 60d Return': '{:.1%}',
        'Prob Up': '{:.0%}',
        'ML Score': '{:.2f}'
    })
    .set_table_styles([
        {'selector': 'th',
         'props': [
             ('background-color', 'steelblue'),
             ('color', 'white'),
             ('font-weight', 'bold'),
             ('text-align', 'center'),
             ('font-size', '14px')
         ]},
        {'selector': 'td',
         'props': [
             ('text-align', 'center'),
             ('font-size', '13px')
         ]}
    ])
    .hide(axis='index')
)

styled_table
Ticker Pred. 60d Return Prob Up ML Score
NVDA 14.0% 67% 6.38
NFLX 5.9% 66% 5.83
BAC 4.9% 65% 5.80
KO 3.3% 65% 5.64
WMT 4.0% 63% 5.57
GOOGL 3.8% 57% 5.10
JNJ 2.6% 58% 5.04
V 2.9% 58% 5.04
AAPL 3.2% 57% 5.03
AMZN 2.8% 57% 5.02
JPM 2.9% 57% 4.99
PG 2.2% 57% 4.92
META 4.5% 55% 4.87
HD 2.2% 55% 4.86
MSFT 2.1% 55% 4.78
DIS 2.2% 54% 4.77
TSLA 2.7% 53% 4.73
MA 1.4% 54% 4.70
PYPL -0.3% 52% 4.43
UNH -1.0% 48% 4.00
ADBE -2.0% 45% 3.76

The table ranks all 21 tickers by ML Score. NVDA leads with the strongest combination of expected return and high probability of upside.

Part 4. Comprehensive Synthesis and Strategic Recommendations

4.1. Objective and Methodology

The final phase of this report synthesizes the findings from all three analytical lenses - Technical, Quantitative, and Machine Learning - into a single, actionable framework. Relying on a single method can expose investors to blind spots; for instance, a stock might have great historical returns (Quantitative) but currently be in a severe downtrend (Technical), or it might look technically strong but lack predictive forward momentum (ML).

To resolve this, we create a Final Composite Score (Scale: 0 - 10) that blends these perspectives using a weighted approach:

  • Quantitative Score (40% Weight): Carries the heaviest weight because it reflects the structural foundation of the stock - its historical ability to generate risk-adjusted returns and manage drawdowns.

  • Technical Score (30% Weight): Represents the current market reality, trend strength, and capital flow. (Note: Since the original Technical Score is out of 8, it is normalized to a 10-point scale before weighting).

  • Machine Learning Score (30% Weight): Provides the forward-looking predictive edge, forecasting the probability and magnitude of upward movement over the next medium-term window (e.g., 60 days).

Based on the Final Composite Score, each stock is assigned a strategic recommendation for the medium-to-long term:

  • Strong Buy / Core Holding (Score >= 7.0): Exceptional balance of historical performance, current trend, and future predictive momentum. These should form the foundation of a long-term portfolio.

  • Accumulate / Buy (Score 6.0 - 6.99): Solid companies with positive setups. Suitable for adding to positions, especially during minor pullbacks.

  • Hold / Neutral (Score 4.0 - 5.99): Mixed signals. May have good fundamentals but weak technicals, or vice versa. Best to hold existing positions but wait for better clarity before deploying new capital.

  • Avoid / Sell (Score < 4.0): Poor risk-adjusted returns, broken technical trends, and bearish ML predictions. Capital should be reallocated elsewhere.

4.2. Data Aggregation and Final Scoring

Below is the code to merge the results from the previous three sections, calculate the Final Composite Score, and generate the ultimate recommendation table.

Code
# 1. Prepare the DataFrames

# Standardize column names for merging
df_tech = technical_ranking[['ticker', 'Technical_Score']].copy()
df_tech.columns = ['Ticker', 'Tech_Score']

df_quant = score_df[['ticker', 'Quant Score']].copy()
df_quant.columns = ['Ticker', 'Quant_Score']

df_ml = ml_table[['Ticker', 'ML Score']].copy()
df_ml.columns = ['Ticker', 'ML_Score']

# Convert ML_Score to float if it's currently a string from formatting
if df_ml['ML_Score'].dtype == 'O':
    df_ml['ML_Score'] = df_ml['ML_Score'].astype(float)

# 2. Merge the three tables
final_df = df_tech.merge(df_quant, on='Ticker', how='inner').merge(df_ml, on='Ticker', how='inner')

# 3. Calculate Normalized Scores and Final Composite Score
# Normalize Technical Score from a 0-8 scale to a 0-10 scale
final_df['Tech_Score_Norm'] = (final_df['Tech_Score'] / 8.0) * 10.0

# Apply Weights: 30% Tech, 40% Quant, 30% ML
final_df['Final_Score'] = (
    0.30 * final_df['Tech_Score_Norm'] + 
    0.40 * final_df['Quant_Score'] + 
    0.30 * final_df['ML_Score']
)

# 4. Generate Strategic Recommendations
def get_recommendation(score):
    if score >= 7.0:
        return 'Strong Buy'
    elif score >= 6:
        return 'Accumulate'
    elif score >= 4.0:
        return 'Hold / Neutral'
    else:
        return 'Avoid / Sell'

final_df['Recommendation'] = final_df['Final_Score'].apply(get_recommendation)

# Sort by Final Score in descending order
final_df = final_df.sort_values('Final_Score', ascending=False).reset_index(drop=True)

# 5. Styling the Final Table
def highlight_recommendation(val):
    if val == 'Strong Buy':
        return 'background-color: #006400; color: white; font-weight: bold'
    elif val == 'Accumulate':
        return 'background-color: #90EE90; color: black; font-weight: bold'
    elif val == 'Hold / Neutral':
        return 'background-color: #F0E68C; color: black; font-weight: bold'
    else:
        return 'background-color: #FFB6C1; color: black; font-weight: bold'

final_table_styled = (
    final_df[['Ticker', 'Tech_Score', 'Quant_Score', 'ML_Score', 'Final_Score', 'Recommendation']]
    .style
    .format({
        'Tech_Score': '{:.0f}/8',
        'Quant_Score': '{:.2f}',
        'ML_Score': '{:.2f}',
        'Final_Score': '{:.2f}'
    })
    .map(highlight_recommendation, subset=['Recommendation'])
    .set_table_styles([
        {'selector': 'th',
         'props': [('background-color', '#2c3e50'), 
                   ('color', 'white'), 
                   ('font-weight', 'bold'),
                   ('text-align', 'center'),
                   ('font-size', '14px')]},
        {'selector': 'td',
         'props': [('text-align', 'center'), ('font-size', '13px')]}
    ])
    .hide(axis='index')
)

# Display the table
display(final_table_styled)
Ticker Tech_Score Quant_Score ML_Score Final_Score Recommendation
NVDA 7/8 7.69 6.38 7.61 Strong Buy
GOOGL 7/8 6.86 5.10 6.90 Accumulate
WMT 6/8 7.36 5.57 6.87 Accumulate
BAC 7/8 5.82 5.80 6.69 Accumulate
KO 6/8 6.58 5.64 6.57 Accumulate
AAPL 7/8 6.07 5.03 6.56 Accumulate
JPM 6/8 6.96 4.99 6.53 Accumulate
JNJ 6/8 6.68 5.04 6.44 Accumulate
AMZN 8/8 4.50 5.02 6.31 Accumulate
PG 6/8 5.59 4.92 5.96 Hold / Neutral
HD 5/8 5.65 4.86 5.59 Hold / Neutral
META 5/8 4.61 4.87 5.18 Hold / Neutral
TSLA 5/8 3.38 4.73 4.65 Hold / Neutral
MSFT 2/8 5.90 4.78 4.54 Hold / Neutral
MA 2/8 5.73 4.70 4.45 Hold / Neutral
NFLX 3/8 3.84 5.83 4.41 Hold / Neutral
V 1/8 5.74 5.04 4.18 Hold / Neutral
UNH 3/8 3.64 4.00 3.78 Avoid / Sell
DIS 2/8 3.25 4.77 3.48 Avoid / Sell
ADBE 2/8 2.74 3.76 2.98 Avoid / Sell
PYPL 3/8 0.81 4.43 2.78 Avoid / Sell
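
As a consistency check, NVDA's composite score can be reproduced from its three component scores in the table above (up to rounding of the displayed inputs):

```python
# NVDA's component scores as displayed
tech_score, quant_score, ml_score = 7, 7.69, 6.38

# Normalize the 0-8 technical score to 0-10, then apply the 30/40/30 weights
final_score = (0.30 * (tech_score / 8 * 10)
               + 0.40 * quant_score
               + 0.30 * ml_score)

print(round(final_score, 2))  # close to the 7.61 shown above
```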

4.3. Strategic Recommendation

Based on the integrated data from our Technical framework, Quantitative risk-return metrics, and the latest 60-day Machine Learning predictions, several clear investment narratives emerge for the medium-to-long term.

  1. The Undisputed Growth Leader: NVDA

Nvidia (NVDA) continues to dominate across all analytical dimensions. Quantitatively, it boasts the highest historical returns and Sharpe ratio. Technically, it maintains a 7/8 score. Validating this historical strength, our ML model projects a 14.0% expected return over the next 60 days with a 67% probability of upward movement, earning it the highest ML Score of 6.38.

Recommendation: NVDA remains a Strong Buy and a core growth holding. However, given its high historical volatility, investors should size positions appropriately.

  2. The Balanced Core Performer: GOOGL

Alphabet (GOOGL) emerges as a highly compelling, well-rounded asset, securing the second-highest Final Score (6.90) in our composite ranking. Technically, it demonstrates strong bullish momentum with a near-perfect Technical Score of 7/8. Quantitatively, it delivers an impressive balance of growth and risk-adjusted efficiency, featuring a robust historical CAGR of 29.3% and a high Sharpe ratio of 0.885. While its ML predicted 60-day return (3.8%) is more measured than our aggressive growth leaders, its 57% probability of upward movement confirms a steady, reliable trajectory.

Recommendation: Accumulate. GOOGL acts as a powerful bridge between aggressive growth and defensive stability. It is an excellent core holding for the portfolio, providing strong, consistent risk-adjusted returns without the extreme volatility of higher-beta tech stocks.

  3. The Defensive Anchor: WMT

Walmart (WMT) highlights the importance of risk-adjusted efficiency. While its predicted 60-day return is more modest (4.0%), its upward probability remains highly reliable (63%). WMT was also a standout in our Quantitative Analysis due to its exceptionally low maximum drawdown and high Sharpe ratio.

Recommendation: Hold / Accumulate. This ticker serves as the ultimate defensive anchor. It provides steady, positive expectancy while protecting the portfolio against broader market volatility.

Overall Portfolio Strategy:

The data suggests a modified “barbell” approach anchored by a strong core for the upcoming 60-day horizon. We recommend establishing a solid foundation with balanced, highly efficient performers like GOOGL. Around this core, the portfolio should overweight high-conviction, aggressive growth names like NVDA to capture alpha, while heavily utilizing low-drawdown defensive staples like WMT to cushion against unexpected systemic shocks.

Limitations

While this report provides a comprehensive framework for stock market forecasting, it is subject to several inherent limitations that investors must consider:

  1. Reliance on Historical Data (Past Performance vs. Future Results)

The Technical, Quantitative, and Machine Learning models built in this report rely entirely on historical price and volume data from December 2020 to February 2026. Financial markets are dynamic and subject to regime changes. Patterns that generated strong predictive signals in the past may deteriorate or completely reverse in future market conditions.

  2. Absence of Fundamental and Macroeconomic Variables

The current Machine Learning architecture and Quantitative scoring system are strictly price-derived. The models do not account for critical fundamental data (e.g., earnings per share, P/E ratios, revenue growth) or macroeconomic indicators (e.g., inflation rates, Federal Reserve interest rate decisions, GDP growth). Consequently, the models may misprice a stock if a sudden shift occurs in the underlying company’s financial health or the broader economic environment.

  3. Vulnerability to “Black Swan” Events and News Shocks

Quantitative and Machine Learning algorithms cannot predict exogenous shocks. Geopolitical conflicts, sudden regulatory crackdowns, global pandemics, or unexpected executive changes (e.g., a sudden CEO resignation) can cause immediate and drastic price movements that bypass all technical support levels and ML predictions.

  4. Exclusion of Transaction Costs and Slippage

The projected returns, CAGR, and Machine Learning expected returns discussed in this report assume frictionless trading. In reality, actual portfolio performance will be lower due to transaction fees, bid-ask spreads, liquidity constraints, and slippage (the difference between the expected price of a trade and the price at which the trade is executed).
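
A rough illustration of this drag, using entirely hypothetical figures: rebalancing a 60-day-horizon strategy roughly four times a year at an assumed 10-basis-point round-trip cost compounds into a visible haircut on the annual return:

```python
gross_cagr = 0.22          # hypothetical gross annual return
cost_per_trip = 0.0010     # assumed 10 bps round-trip cost (fees + spread + slippage)
trips_per_year = 252 / 60  # ~4.2 rebalances per year for a 60-day horizon

# Each round trip scales wealth by (1 - cost); the drag compounds with trading frequency
net_cagr = (1 + gross_cagr) * (1 - cost_per_trip) ** trips_per_year - 1

print(f"gross {gross_cagr:.1%} -> net {net_cagr:.2%}")
```

Higher-turnover variants of the same signals would widen this gap further, which is why the report's frictionless figures should be read as an upper bound.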