Baseline MLP agent with simple P&L reward. Buy/Sell/Hold one unit at a time.
→ open notebook
about --me
$ whoami
ShubbhRM
$ cat bio.txt
ML / AI Engineer with a focus on Reinforcement Learning and Financial Markets. Building trading agents that learn from market data using Deep RL — DQN and PPO — across multiple neural architectures.
$ cat interests.txt
- → Deep Reinforcement Learning (DQN, PPO, DDPG)
- → Financial Markets & Algorithmic Trading
- → Neural Architecture Design (CNN, LSTM, Hybrid)
- → Reward Function Engineering & Sharpe Optimization
- → PyTorch & Stable-Baselines3 ecosystem
$ cat status.txt
Available for collaboration & research
$ cat project.md
# Finance-RL: RL Trading Agents
A systematic study comparing Deep RL algorithms and neural architectures for stock trading. Each experiment varies one axis — architecture, algorithm, or reward signal — to isolate the effect on trading performance.
## Research Questions
- → Does CNN beat MLP for price pattern recognition?
- → Does LSTM memory improve sequential decisions?
- → Does Sharpe reward outperform raw P&L?
- → DQN vs. PPO: how do off-policy and on-policy learning trade off?
## Next: DDPG continuous control | Sentiment signals | Multi-stock
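The reward-signal question above can be made concrete with two candidate step rewards. A minimal sketch (function names are illustrative, not the repo's actual reward code):

```python
from statistics import mean, pstdev

def pnl_reward(prev_net_worth: float, net_worth: float) -> float:
    """Raw P&L reward: the change in portfolio value over one step."""
    return net_worth - prev_net_worth

def sharpe_reward(step_returns: list[float], eps: float = 1e-8) -> float:
    """Risk-adjusted reward: Sharpe ratio over a window of recent step returns.

    Penalizes a volatile equity curve even when its mean return is positive.
    """
    return mean(step_returns) / (pstdev(step_returns) + eps)

# Two paths with identical total P&L but very different volatility:
steady = [0.01] * 10          # slow, smooth climb
choppy = [0.10, -0.08] * 5    # same net gain, large swings
assert sharpe_reward(steady) > sharpe_reward(choppy)
```

The Sharpe variant pays less for the choppy path despite equal total P&L, which is the behavioral difference the Sharpe-reward experiments probe.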
skills --tech-stack
> Core competencies across languages, deep learning, financial ML, and tooling.
experiments --list
> 14 experiments sampled from a grid of 3 architectures × 2 strategies × 3 reward signals × 2 algorithms (not every combination is run). Use filters to navigate.
MLP agent optimizing total portfolio net worth rather than per-trade P&L.
→ open notebook
AllIn strategy — agent commits 100% of portfolio to a position. Maximum aggression.
→ open notebook
AllIn strategy with net-worth-centric reward — tracks total portfolio wealth.
→ open notebook
Ablation study: MLP with feature normalization/scaling enabled. Tests whether scaled inputs improve convergence.
→ open notebook
Custom CNN feature extractor for temporal price pattern recognition. Conv1D layers over OHLCV sequences.
→ open notebook
CNN + Sharpe Ratio reward — convolutional features combined with risk-adjusted optimization.
→ open notebook
CNN with aggressive AllIn trading strategy. Tests whether spatial features support high-conviction entries.
→ open notebook
Maximum-risk environment: CNN agent making all-in decisions with a Sharpe-penalized reward signal.
→ open notebook
Hybrid CNN-LSTM: local Conv1D pattern detection feeding into LSTM(256) for trend memory.
→ open notebook
Full sequential CNN-LSTM architecture with aggressive AllIn position sizing.
→ open notebook
PPO baseline: on-policy proximal policy optimization vs. off-policy DQN. MLP policy network.
→ open notebook
PPO variant / checkpoint run; a further iteration of the on-policy experiment.
→ open notebook
Broader ML-modifications experiment — architecture search, hyperparameter tuning, and policy comparisons.
→ open notebook
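The CNN experiments plug a Conv1D extractor into the agent's policy. A minimal sketch in plain PyTorch, assuming a (5 OHLCV channels × window) observation; in a Stable-Baselines3 setup this would subclass `BaseFeaturesExtractor` from `stable_baselines3.common.torch_layers` and be wired in via `policy_kwargs=dict(features_extractor_class=...)`. Layer sizes here are assumptions, not the notebooks' exact configuration:

```python
import torch
import torch.nn as nn

class PriceCNN(nn.Module):
    """Conv1D feature extractor over an OHLCV window, input shape (batch, 5, window)."""

    def __init__(self, n_channels: int = 5, window: int = 30, features_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=3, padding=1),  # local price patterns
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1),          # wider receptive field
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * window, features_dim),                 # compact feature vector
            nn.ReLU(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

extractor = PriceCNN()
batch = torch.randn(8, 5, 30)  # 8 samples, OHLCV channels, 30-step window
features = extractor(batch)
assert features.shape == (8, 64)
```

With DQN's default MlpPolicy the flattened observation feeds the Q-network directly; swapping in an extractor like this is the single-axis change the CNN experiments isolate.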
results --summary
> Experiment matrix. Populate sharpe and pnl fields in _data/experiments.yml after running notebooks.
| ID | Architecture | Algorithm | Strategy | Reward | Badge |
|---|---|---|---|---|---|
| EXP_001 | MLP | DQN | OneStockPolicy | PnL | BASELINE |
| EXP_002 | MLP | DQN | OneStockPolicy | NetWorth | — |
| EXP_003 | MLP | DQN | AllInPolicy | PnL | — |
| EXP_004 | MLP | DQN | AllInPolicy | NetWorth | — |
| EXP_005 | MLP | DQN | OneStockPolicy | PnL | ABLATION |
| EXP_006 | CNN | DQN | OneStockPolicy | PnL | — |
| EXP_007 | CNN | DQN | OneStockPolicy | Sharpe | RISK-ADJUSTED |
| EXP_008 | CNN | DQN | AllInPolicy | PnL | — |
| EXP_009 | CNN | DQN | AllInPolicy | Sharpe | ADVANCED |
| EXP_010 | CNN-LSTM | DQN | OneStockPolicy | PnL | HYBRID |
| EXP_011 | CNN-LSTM | DQN | AllInPolicy | PnL | — |
| EXP_012 | MLP | PPO | AllInPolicy | PnL | PPO |
| EXP_013 | MLP | PPO | AllInPolicy | PnL | PPO |
| EXP_014 | MLP | DQN | OneStockPolicy | PnL | RESEARCH |
> To populate metrics: extract final Sharpe and P&L values from each notebook and add sharpe: 1.23 and pnl: "+12.4%" fields to _data/experiments.yml.
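A small helper can render entries in that shape. A sketch assuming _data/experiments.yml holds a list of per-experiment mappings (the schema beyond the sharpe/pnl fields shown above is a guess, and pulling the numbers out of each notebook is left to you):

```python
def format_experiment_entry(exp_id: str, sharpe: float, pnl_pct: float) -> str:
    """Render one _data/experiments.yml list entry with sharpe/pnl filled in.

    Only handles the YAML formatting; how the final backtest metrics are
    extracted from each notebook is a separate step.
    """
    sign = "+" if pnl_pct >= 0 else ""  # negative values carry their own "-"
    return (
        f"- id: {exp_id}\n"
        f"  sharpe: {sharpe:.2f}\n"
        f'  pnl: "{sign}{pnl_pct:.1f}%"\n'
    )

print(format_experiment_entry("EXP_001", 1.234, 12.44))
# - id: EXP_001
#   sharpe: 1.23
#   pnl: "+12.4%"
```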