| Hyperparameter | PPO | TD3 | SAC |
|---|---|---|---|
| Network Architecture | [64, 64] | [256, 256] | [256, 256] |
| Activation | ReLU | ReLU | ReLU |
| Optimizer | Adam | Adam | Adam |
| Learning Rate | 0.0003 | 0.001 | 0.0003 |
| Target Update Rate | 2048 Steps | 1 Episode | 1 Episode |
| Batch Size | 64 | 100 | 256 |
| Epochs | 10 | - | - |
| Discount Factor (γ) | 0.99 | 0.99 | 0.99 |
| Replay Buffer Size | - | 10^6 | 10^6 |
| Clip Range (ε) | 0.2 | - | - |
| GAE (λ) | 0.95 | - | - |
| Soft Update Coefficient (τ) | - | 0.005 | 0.005 |
| Target Entropy (α) | - | - | Auto |
| Action Noise | - | | - |
| Policy Delay | - | 2 | - |
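
The settings above can be collected into a single configuration object. The following is a minimal sketch in plain Python, not tied to any particular library: the key names (`net_arch`, `update_interval`, `entropy_target`, and so on) and the `soft_update` helper are hypothetical and chosen only for illustration. It reads the replay buffer size as 10^6 and leaves out the action noise entry, because the table does not give a recoverable value for it.

```python
# Settings shared by all three algorithms (taken from the table).
SHARED = {"activation": "relu", "optimizer": "adam", "gamma": 0.99}

# Per-algorithm settings. Key names are illustrative assumptions,
# not the parameter names of any specific library.
HYPERPARAMS = {
    "PPO": {
        **SHARED,
        "net_arch": [64, 64],
        "learning_rate": 3e-4,
        "update_interval": "2048 steps",
        "batch_size": 64,
        "n_epochs": 10,
        "clip_range": 0.2,
        "gae_lambda": 0.95,
    },
    "TD3": {
        **SHARED,
        "net_arch": [256, 256],
        "learning_rate": 1e-3,
        "update_interval": "1 episode",
        "batch_size": 100,
        "buffer_size": 10**6,  # assumes the table's value is 10^6
        "tau": 0.005,
        "policy_delay": 2,
    },
    "SAC": {
        **SHARED,
        "net_arch": [256, 256],
        "learning_rate": 3e-4,
        "update_interval": "1 episode",
        "batch_size": 256,
        "buffer_size": 10**6,  # assumes the table's value is 10^6
        "tau": 0.005,
        "entropy_target": "auto",
    },
}


def soft_update(target, online, tau):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]


if __name__ == "__main__":
    for name, cfg in HYPERPARAMS.items():
        print(name, cfg)
```

With τ = 0.005, each soft update moves a target-network weight only 0.5% of the way toward the corresponding online weight, which is why the TD3 and SAC target networks change slowly.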