In policy gradient methods, how does parametrizing the policy with a deep neural network enable these methods to scale to extremely large state and action spaces (including continuous actions)? How does deep learning (or any function approximator) make learning in large state-action spaces tractable, and how does this compare to non-neural-network methods (e.g. tabular ones) that would be intractable?
- Why is the state-action space so large in the first place (examples?)
- How does the neural network make the state-action space “small” (or tractable/learnable)?
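To make the second sub-question concrete, here is a minimal sketch (plain NumPy, with hypothetical layer sizes) of a Gaussian policy parametrized by a small neural network. The point it illustrates is that the number of learnable parameters is fixed by the network architecture and independent of how many distinct states exist, whereas a tabular policy would need one entry per state-action pair and is undefined for continuous states:

```python
import numpy as np

def init_policy(state_dim, hidden_dim, action_dim, seed=0):
    # A fixed-size parameter set: its size depends only on the layer widths,
    # not on the number of states in the environment.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0, 0.1, (hidden_dim, state_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0, 0.1, (action_dim, hidden_dim)),
        "b2": np.zeros(action_dim),
        "log_std": np.zeros(action_dim),  # state-independent std, a common choice
    }

def policy(params, state):
    # Map ANY continuous state vector to the parameters of a Gaussian
    # distribution over continuous actions.
    h = np.tanh(params["W1"] @ state + params["b1"])
    mean = params["W2"] @ h + params["b2"]
    std = np.exp(params["log_std"])
    return mean, std

def num_params(params):
    return sum(p.size for p in params.values())

# Hypothetical sizes: 17-dim state, 6-dim continuous action.
params = init_policy(state_dim=17, hidden_dim=64, action_dim=6)
mean, std = policy(params, np.zeros(17))
print(num_params(params))  # 1548 weights cover an uncountable state space
```

So gradient ascent only has to adjust these few thousand weights, and generalization across similar states comes from the smoothness of the network, not from visiting every state.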