Project — Nov 2025 to Present

Python Equity Research Pipeline

A self-built quantitative research system integrating LSTM forecasting, mean-variance optimization, and systematic backtesting.

Type: Quantitative Research Project
Timeline: Nov 2025 – Present
Benchmark: QQQ (Nasdaq-100)
Stack: Python, pandas, NumPy, CVXPY

Overview

This project is a self-built quantitative equity research pipeline in Python that combines LSTM-based return forecasting, mean-variance portfolio optimization, and systematic backtesting. The pipeline ingests daily OHLCV data, generates forward return estimates, constructs optimized portfolio allocations, and evaluates performance against the Nasdaq-100 (QQQ) benchmark.

The project reflects the intersection of my CS background and finance interest, translating academic concepts from modern portfolio theory into a functional, end-to-end research tool. It also served as the practical foundation for the quantitative framing in the Alphabet equity research brief, specifically the beta and correlation analysis underpinning the Nasdaq-100 proxy thesis.

Data Ingestion and Storage

The ingestion layer pulls daily OHLCV data and caches it in Parquet format, enabling fast reads and reproducible data snapshots across research runs. Run manifests and data fingerprints are stored alongside outputs so that each result can be traced back to the data and settings that produced it.
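As an illustration of the manifest-and-fingerprint idea, here is a minimal sketch using only the Python standard library. The function names (`fingerprint_file`, `write_manifest`), the manifest layout, and the example parameters are hypothetical and are not taken from the project's code; the cached file is a placeholder rather than real Parquet data.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(run_dir: Path, data_files: list[Path], params: dict) -> Path:
    """Write a JSON manifest recording input fingerprints and run parameters."""
    manifest = {
        "params": params,
        "data": {p.name: fingerprint_file(p) for p in sorted(data_files)},
    }
    out = run_dir / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return out


# Usage: fingerprint a stand-in cache file and record it next to the run output.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    cache = run_dir / "prices.parquet"  # placeholder bytes, not real Parquet
    cache.write_bytes(b"placeholder bytes standing in for cached OHLCV data")
    manifest_path = write_manifest(
        run_dir, [cache], {"horizon_days": 5, "top_n": 10}  # hypothetical params
    )
    manifest = json.loads(manifest_path.read_text())
```

Two runs whose manifests carry identical fingerprints and parameters read the same inputs, which is the property that makes a result reproducible.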

LSTM Return Forecasting

LSTM-based models forecast 5-day forward equity returns, with data handling in pandas and NumPy. Predictions feed a weekly Top-N long-only portfolio construction framework with T+1 execution, which keeps the pipeline operationally realistic rather than purely theoretical. Turnover-based transaction costs and volatility targeting are applied so that backtest results stay conservative.
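The mechanics around the model (label construction, weekly Top-N selection, delayed execution, and turnover costs) can be sketched in pandas as below. This is an illustration under stated assumptions, not the project's implementation: `scores` stands in for the LSTM predictions, T+1 is read as "signal after the close of day t, fill at the close of day t+1", the 10 bps cost is a placeholder, and volatility targeting is left out for brevity. All function names are hypothetical.

```python
import numpy as np
import pandas as pd


def forward_returns(close: pd.DataFrame, horizon: int = 5) -> pd.DataFrame:
    """Return from day t to day t+horizon, aligned to day t (the training label)."""
    return close.shift(-horizon) / close - 1.0


def top_n_weights(scores: pd.DataFrame, n: int) -> pd.DataFrame:
    """Equal-weight the n highest-scoring tickers on each date (long-only)."""
    ranks = scores.rank(axis=1, ascending=False, method="first")
    picks = (ranks <= n).astype(float)
    return picks.div(picks.sum(axis=1), axis=0)


def simulate(
    close: pd.DataFrame,
    scores: pd.DataFrame,
    n: int = 2,
    rebalance_every: int = 5,   # roughly weekly on business days
    exec_lag: int = 1,          # assumption: fill one close after the signal
    cost_bps: float = 10.0,     # placeholder cost per unit of turnover
) -> pd.DataFrame:
    daily_ret = close.pct_change().fillna(0.0)

    # Decide weights only on rebalance dates and hold them in between.
    target = top_n_weights(scores, n).iloc[::rebalance_every]
    target = target.reindex(close.index).ffill()

    # A signal from day t is filled at the close of day t+exec_lag, so it first
    # earns the close-to-close return of day t+exec_lag+1.
    held = target.shift(1 + exec_lag).fillna(0.0)

    gross = (held * daily_ret).sum(axis=1)
    # Turnover is the absolute change in held weights; the cost is charged on
    # the first day the new weights are held.
    turnover = held.diff().abs().sum(axis=1)
    net = gross - turnover * cost_bps / 1e4
    return pd.DataFrame({"gross": gross, "net": net, "turnover": turnover})
```

Shifting by `1 + exec_lag` rather than by 1 is the piece that models the execution delay: a plain one-day shift would assume the trade is filled at the same close that produced the signal.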

Portfolio Construction

Mean-variance optimization is implemented using CVXPY, incorporating return forecasts, covariance estimates, and position constraints to generate optimal long-only portfolio weights. The optimizer is configurable: maximum position sizes, sector constraints, and turnover limits can all be adjusted to simulate different mandate profiles.

Backtesting and Evaluation

Performance is evaluated through multi-year out-of-sample backtests, with primary metrics including Sharpe ratio, maximum drawdown, and annualized alpha relative to the QQQ benchmark. CI tests are included to catch data leakage and logic regressions as the codebase evolves.
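The three headline metrics can be computed from daily return series with a few lines of NumPy. The sketch below assumes 252 trading days per year, a zero risk-free rate by default, and alpha defined as the annualized intercept of an ordinary least squares regression of strategy returns on benchmark returns; the project's exact conventions may differ, and the function names are hypothetical.

```python
import numpy as np

TRADING_DAYS = 252  # assumed annualization factor


def sharpe_ratio(returns, rf_daily: float = 0.0) -> float:
    """Annualized Sharpe ratio of daily returns."""
    excess = np.asarray(returns, dtype=float) - rf_daily
    return float(np.sqrt(TRADING_DAYS) * excess.mean() / excess.std(ddof=1))


def max_drawdown(returns) -> float:
    """Largest peak-to-trough decline of the equity curve, as a negative number."""
    growth = np.cumprod(1.0 + np.asarray(returns, dtype=float))
    equity = np.concatenate([[1.0], growth])  # include the starting capital
    peak = np.maximum.accumulate(equity)
    return float((equity / peak - 1.0).min())


def annualized_alpha(returns, benchmark) -> tuple[float, float]:
    """Return (annualized alpha, beta) from r = alpha + beta * b + noise."""
    r = np.asarray(returns, dtype=float)
    b = np.asarray(benchmark, dtype=float)
    beta, alpha_daily = np.polyfit(b, r, 1)  # slope first, then intercept
    return float(alpha_daily * TRADING_DAYS), float(beta)
```

Prepending the starting capital to the equity curve matters: without it, a loss on the very first day would not register as a drawdown.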

Infrastructure

The pipeline is developed in a Unix-based environment, with Bash scripts handling scheduling and execution automation. Profiling outputs are logged alongside results to identify computational bottlenecks as data and model complexity scale.

Tags: Python, LSTM, CVXPY, pandas, NumPy, Parquet, Bash, Unix, vs. QQQ Benchmark