If I have a complete set of past data (observations) and a list of the actions taken by some agent (or human), could I update my policy from that data instead of running my simulated environment dynamics?
I have a DQN agent that was initially trained on simulated data. As usual, the agent chose actions following some policy and some action-selection method (in my case, epsilon-greedy). Now I would like to update my Q-network with real-world past data. How could that be done?
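To make the question concrete, here is a minimal sketch of the kind of update I have in mind, written in PyTorch. `QNet`, the layer sizes, and the randomly generated batch are all placeholders for my actual network and logged transitions; the point is that the target is computed entirely from the recorded data, with no environment call:

```python
import torch
import torch.nn as nn

# Placeholder Q-network; my real architecture differs.
class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.net(x)

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = QNet(obs_dim, n_actions)
target_net = QNet(obs_dim, n_actions)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Dummy batch of logged transitions (s, a, r, s', done);
# in practice these would come from the recorded real-world data.
batch = 32
obs = torch.randn(batch, obs_dim)
actions = torch.randint(0, n_actions, (batch, 1))  # the *logged* actions
rewards = torch.randn(batch, 1)
next_obs = torch.randn(batch, obs_dim)
dones = torch.zeros(batch, 1)

# Standard DQN target, computed purely from the logged data --
# no env.step() and no epsilon-greedy selection involved.
with torch.no_grad():
    next_q = target_net(next_obs).max(dim=1, keepdim=True).values
    target = rewards + gamma * (1.0 - dones) * next_q

# Q-value of the action that was actually taken in the log.
q_taken = q_net(obs).gather(1, actions)

loss = nn.functional.mse_loss(q_taken, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```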
I don't seem to be able to pass the action as an input to the step function (I could modify it afterwards, but then the agent would be evaluating the wrong action). Is there a way to "force" the action value at the input of the step function, so that the system evaluates that action instead of the one selected by my current exploration/exploitation method?
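For reference, this is roughly what I am trying to do, assuming a Gym/Gymnasium-style `env.step(action)` API; `env`, `logged_actions`, and `replay_buffer` are placeholders for my actual setup:

```python
# Hypothetical replay loop: instead of letting the agent pick an
# action, feed the logged action straight into the step function.
obs, _ = env.reset()  # Gymnasium-style reset returning (obs, info)
for logged_action in logged_actions:
    # step() just takes an action as its argument, so passing the
    # recorded one "forces" it, bypassing epsilon-greedy entirely.
    next_obs, reward, terminated, truncated, _ = env.step(logged_action)
    replay_buffer.append(
        (obs, logged_action, reward, next_obs, terminated or truncated)
    )
    obs = next_obs
    if terminated or truncated:
        break
```

Note this still replays the actions through my simulator. If the log also records rewards and next observations, I suppose the transitions could be assembled directly from the log without calling step() at all, which is essentially what I am asking about.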