Reading documentation I find that "For agents with a critic, Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. As training progresses, if the critic is well designed. Episode Q0 approaches the true discounted long-term reward"
But I cannot grasp exactly what is Q0 because, except in a few examples (like this one) where it "converges" to some value rather quickly, I have seen Q0 value do different things and I cannot understad or interpret them (like the two examples shown here). I also don't understand what "true discounted reward" means exaclty. Is it for each episode, average or something cumulative? In this answer it is suggested that Q0 should track the average episode reward, but I don't see that in the examples.
For example, in the cartpole example, if one continues the training for more episodes (changing the stop training criteria to avoid stopping for average reward), Q0 value reaches very high values that have nothing to do with the average reward or the episodes. I simulated 1000 episodes for the cartpole example and Q0 values even mess up the scale because they go way too high. The agent seams too learn properly and it even manages to get out of some local minimums sucessfully but still, I cannot grasp what information Q0 yields
I have not found Q0 defined in Reinforcement Learning bibliography either. Could you please clarify a bit or give some bibliogtaphy where I can read further on this specific parameter?