Baseline MLP agent with simple P&L reward. Buy/Sell/Hold one unit at a time.
→ open notebook
about --me
$ whoami
ShubbhRM
$ cat bio.txt
ML / AI Engineer with a focus on Reinforcement Learning and Financial Markets. Building trading agents that learn from market data using Deep RL — DQN and PPO — across multiple neural architectures.
$ cat interests.txt
- → Deep Reinforcement Learning (DQN, PPO, DDPG)
- → Financial Markets & Algorithmic Trading
- → Neural Architecture Design (CNN, LSTM, Hybrid)
- → Reward Function Engineering & Sharpe Optimization
- → PyTorch & Stable-Baselines3 ecosystem
$ cat status.txt
Available for collaboration & research
$ cat project.md
# Finance-RL: RL Trading Agents
A systematic study comparing Deep RL algorithms and neural architectures for stock trading. Each experiment varies one axis — architecture, algorithm, or reward signal — to isolate the effect on trading performance.
## Research Questions
- → Does CNN beat MLP for price pattern recognition?
- → Does LSTM memory improve sequential decisions?
- → Does Sharpe reward outperform raw P&L?
- → DQN vs. PPO: how do off-policy and on-policy learning trade off?
## Next: DDPG continuous control | Sentiment signals | Multi-stock
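The reward-signal question above can be made concrete with two candidate step rewards. A minimal sketch (function names are illustrative, not the repo's actual reward code):

```python
from statistics import mean, pstdev

def pnl_reward(prev_net_worth: float, net_worth: float) -> float:
    """Raw P&L reward: the change in portfolio value over one step."""
    return net_worth - prev_net_worth

def sharpe_reward(step_returns: list[float], eps: float = 1e-8) -> float:
    """Risk-adjusted reward: Sharpe ratio over a window of recent step returns.

    Penalizes a volatile equity curve even when its mean return is positive.
    """
    return mean(step_returns) / (pstdev(step_returns) + eps)

# Two paths with identical total P&L but very different volatility:
steady = [0.01] * 10          # slow, smooth climb
choppy = [0.10, -0.08] * 5    # same net gain, large swings
assert sharpe_reward(steady) > sharpe_reward(choppy)
```

The Sharpe variant pays less for the choppy path despite equal total P&L, which is the behavioral difference the Sharpe-reward experiments probe.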
skills --tech-stack
> Core competencies across languages, deep learning, financial ML, and tooling.
experiments --list
> 14 experiments sampled from a grid of 3 architectures × 2 strategies × 3 reward signals × 2 algorithms (not every combination is run). Use filters to navigate.
MLP agent optimizing total portfolio net worth rather than per-trade P&L.
→ open notebook
AllIn strategy — agent commits 100% of portfolio to a position. Maximum aggression.
→ open notebook
AllIn strategy with net-worth-centric reward — tracks total portfolio wealth.
→ open notebook
Ablation study: MLP with feature normalization/scaling enabled. Tests whether scaled inputs improve convergence.
→ open notebook
Custom CNN feature extractor for temporal price pattern recognition. Conv1D layers over OHLCV sequences.
→ open notebook
CNN + Sharpe Ratio reward — convolutional features combined with risk-adjusted optimization.
→ open notebook
CNN with aggressive AllIn trading strategy. Tests whether spatial features support high-conviction entries.
→ open notebook
Maximum-risk environment: CNN agent making all-in decisions with a Sharpe-penalized reward signal.
→ open notebook
Hybrid CNN-LSTM: local Conv1D pattern detection feeding into LSTM(256) for trend memory.
→ open notebook
Full sequential CNN-LSTM architecture with aggressive AllIn position sizing.
→ open notebook
PPO baseline: on-policy proximal policy optimization vs. off-policy DQN. MLP policy network.
→ open notebook
PPO variant / checkpoint run; a further iteration of the on-policy experiment.
→ open notebook
Broader ML-modifications experiment — architecture search, hyperparameter tuning, and policy comparisons.
→ open notebook
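The CNN experiments plug a Conv1D extractor into the agent's policy. A minimal sketch in plain PyTorch, assuming a (5 OHLCV channels × window) observation; in a Stable-Baselines3 setup this would subclass `BaseFeaturesExtractor` from `stable_baselines3.common.torch_layers` and be wired in via `policy_kwargs=dict(features_extractor_class=...)`. Layer sizes here are assumptions, not the notebooks' exact configuration:

```python
import torch
import torch.nn as nn

class PriceCNN(nn.Module):
    """Conv1D feature extractor over an OHLCV window, input shape (batch, 5, window)."""

    def __init__(self, n_channels: int = 5, window: int = 30, features_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=3, padding=1),  # local price patterns
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1),          # wider receptive field
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * window, features_dim),                 # compact feature vector
            nn.ReLU(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

extractor = PriceCNN()
batch = torch.randn(8, 5, 30)  # 8 samples, OHLCV channels, 30-step window
features = extractor(batch)
assert features.shape == (8, 64)
```

With DQN's default MlpPolicy the flattened observation feeds the Q-network directly; swapping in an extractor like this is the single-axis change the CNN experiments isolate.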
results --summary
> Experiment matrix. Populate sharpe and pnl fields in _data/experiments.yml after running notebooks.
| ID | Architecture | Algorithm | Strategy | Reward | Badge |
|---|---|---|---|---|---|
| EXP_001 | MLP | DQN | OneStockPolicy | PnL | BASELINE |
| EXP_002 | MLP | DQN | OneStockPolicy | NetWorth | — |
| EXP_003 | MLP | DQN | AllInPolicy | PnL | — |
| EXP_004 | MLP | DQN | AllInPolicy | NetWorth | — |
| EXP_005 | MLP | DQN | OneStockPolicy | PnL | ABLATION |
| EXP_006 | CNN | DQN | OneStockPolicy | PnL | — |
| EXP_007 | CNN | DQN | OneStockPolicy | Sharpe | RISK-ADJUSTED |
| EXP_008 | CNN | DQN | AllInPolicy | PnL | — |
| EXP_009 | CNN | DQN | AllInPolicy | Sharpe | ADVANCED |
| EXP_010 | CNN-LSTM | DQN | OneStockPolicy | PnL | HYBRID |
| EXP_011 | CNN-LSTM | DQN | AllInPolicy | PnL | — |
| EXP_012 | MLP | PPO | AllInPolicy | PnL | PPO |
| EXP_013 | MLP | PPO | AllInPolicy | PnL | PPO |
| EXP_014 | MLP | DQN | OneStockPolicy | PnL | RESEARCH |
> To populate metrics: extract final Sharpe and P&L values from each notebook and add sharpe: 1.23 and pnl: "+12.4%" fields to _data/experiments.yml.
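A small helper can render entries in that shape. A sketch assuming _data/experiments.yml holds a list of per-experiment mappings (the schema beyond the sharpe/pnl fields shown above is a guess, and pulling the numbers out of each notebook is left to you):

```python
def format_experiment_entry(exp_id: str, sharpe: float, pnl_pct: float) -> str:
    """Render one _data/experiments.yml list entry with sharpe/pnl filled in.

    Only handles the YAML formatting; how the final backtest metrics are
    extracted from each notebook is a separate step.
    """
    sign = "+" if pnl_pct >= 0 else ""  # negative values carry their own "-"
    return (
        f"- id: {exp_id}\n"
        f"  sharpe: {sharpe:.2f}\n"
        f'  pnl: "{sign}{pnl_pct:.1f}%"\n'
    )

print(format_experiment_entry("EXP_001", 1.234, 12.44))
# - id: EXP_001
#   sharpe: 1.23
#   pnl: "+12.4%"
```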