| Hyperparameter              | PPO        | TD3        | SAC        |
|-----------------------------|------------|------------|------------|
| Network Architecture        | [64, 64]   | [256, 256] | [256, 256] |
| Activation                  | ReLU       | ReLU       | ReLU       |
| Optimizer                   | Adam       | Adam       | Adam       |
| Learning Rate               | 0.0003     | 0.001      | 0.0003     |
| Target Update Rate          | 2048 Steps | 1 Episode  | 1 Episode  |
| Batch Size                  | 64         | 100        | 256        |
| Epochs                      | 10         | -          | -          |
| Discount Factor (γ)         | 0.99       | 0.99       | 0.99       |
| Replay Buffer Size          | -          | 10^6       | 10^6       |
| Clip Range (ε)              | 0.2        | -          | -          |
| GAE (λ)                     | 0.95       | -          | -          |
| Soft Update Coefficient (τ) | -          | 0.005      | 0.005      |
| Target Entropy (α)          | -          | -          | Auto       |
| Action Noise                | -          | N(0, 0.1)  | -          |
| Policy Delay                | -          | 2          | -          |
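
As a rough illustration, the settings listed above could be collected into a plain configuration mapping. The sketch below is only one possible transcription: the key names and the `soft_update` helper are hypothetical and not tied to any particular library, entries marked "-" are simply omitted, and the replay buffer size is read as 10^6. The helper shows the standard Polyak averaging rule that a soft update coefficient τ usually refers to in TD3 and SAC; whether the source implementation uses exactly this form is an assumption.

```python
# Hyperparameters transcribed from the listing above.
# Key names are hypothetical; entries marked "-" are left out.
COMMON = {"activation": "ReLU", "optimizer": "Adam", "gamma": 0.99}

HYPERPARAMS = {
    "PPO": {
        **COMMON,
        "net_arch": [64, 64],
        "learning_rate": 3e-4,
        "target_update_rate": "2048 steps",  # recorded verbatim
        "batch_size": 64,
        "epochs": 10,
        "clip_range": 0.2,   # epsilon
        "gae_lambda": 0.95,  # lambda
    },
    "TD3": {
        **COMMON,
        "net_arch": [256, 256],
        "learning_rate": 1e-3,
        "target_update_rate": "1 episode",  # recorded verbatim
        "batch_size": 100,
        "replay_buffer_size": 10**6,
        "tau": 0.005,
        # N(0, 0.1): the listing does not say whether 0.1 is a
        # standard deviation or a variance, so both numbers are kept raw.
        "action_noise": ("normal", 0.0, 0.1),
        "policy_delay": 2,
    },
    "SAC": {
        **COMMON,
        "net_arch": [256, 256],
        "learning_rate": 3e-4,
        "target_update_rate": "1 episode",  # recorded verbatim
        "batch_size": 256,
        "replay_buffer_size": 10**6,
        "tau": 0.005,
        "target_entropy": "auto",
    },
}


def soft_update(target, online, tau):
    """Polyak averaging: target <- tau * online + (1 - tau) * target.

    Both arguments are flat lists of floats standing in for network
    weights; a real implementation would operate on tensors.
    """
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]


if __name__ == "__main__":
    tau = HYPERPARAMS["TD3"]["tau"]
    # With tau = 0.005 the target moves 0.5% of the way toward the
    # online weights on each update.
    print(soft_update([0.0, 2.0], [1.0, 2.0], tau))
```

Keeping the shared values in one `COMMON` mapping makes it easy to see which settings differ between the three algorithms.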