espadrine 2 days ago

This is a topic I love to study. The mathematical analysis is reasonable; the policy gradient is a classic approach, and I love Sutton’s RL book on it: http://incompleteideas.net/book/RLbook2020.pdf That said, nowadays many people prefer to use cross-entropy for training RL, which I believe leads to more stable training.
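
For readers who want to see what a policy-gradient method looks like in the bandit setting, here is a minimal sketch of the gradient bandit algorithm from the bandit chapter of that book; the arm rewards below are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    K = 5                                 # number of arms
    true_means = rng.normal(0.0, 1.0, K)  # hypothetical true arm rewards
    H = np.zeros(K)                       # action preferences
    baseline, alpha = 0.0, 0.1            # average-reward baseline, step size

    for t in range(1, 2001):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                    # softmax policy over arms
        a = rng.choice(K, p=pi)
        r = rng.normal(true_means[a], 1.0)
        baseline += (r - baseline) / t    # incremental mean of rewards
        # Raise the preference of the played arm when r beats the baseline,
        # lower it otherwise; the other arms move the opposite way.
        H += alpha * (r - baseline) * ((np.arange(K) == a) - pi)

    print("best arm:", true_means.argmax(), "most preferred:", H.argmax())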

Some of the equations feel a bit imprecise. I prefer to use random variables (rather than the U term), so that uncertainty permeates every value. Relatedly, the epsilon-t formula seems like an imperfect fit: it goes negative past a certain t, which is unrealistic.
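
To make that concrete: any linearly decaying schedule of the form epsilon_t = epsilon_0 - c*t crosses zero in finite time, which is why such schedules are usually clamped or replaced with a decay that stays positive. The post's actual formula may differ; these are illustrative sketches:

    # Hypothetical schedules; the post's actual epsilon-t formula may differ.
    def eps_linear(t, eps0=1.0, c=0.001):
        return eps0 - c * t                  # negative once t > eps0 / c

    def eps_clamped(t, eps0=1.0, c=0.001, eps_min=0.01):
        return max(eps_min, eps0 - c * t)    # floor preserves some exploration

    def eps_inverse(t, eps0=1.0):
        return eps0 / (1 + t)                # always positive, decays to zero

    print(eps_linear(2000), eps_clamped(2000), eps_inverse(2000))
    # -1.0  0.01  0.0005 (approximately)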

nmca 2 days ago

A wonderful treatise on the same topic, “Reinforcement Learning Bit by Bit”, for anyone looking for a more advanced treatment of explore/exploit.

https://arxiv.org/abs/2103.04047

bob1029 2 days ago

I've been using evolutionary techniques with Pareto front optimization to deal with this tradeoff. When objectives are in direct conflict, candidates that don't dominate one another are simply all kept. The problem then becomes one of making sure you have enough resources to maintain the frontier across the generations. If you do have to collapse the frontier, you can use something like a crowding score to maintain diversity.

Example use case: minimizing total neuron activations in a spiking neural simulation while simultaneously maximizing activation coverage and output correctness scores. You don't necessarily know whether the cheaper-to-simulate, but less correct, candidate represents a better evolutionary path until you go down it a few steps.

This also gives you an idea of what your horizon looks like on the fitness landscape. If your Pareto front is small, you are typically deep into exploitation: a simple forward path. If it is large, you are putting more resources into exploration, but you aren't giving up exploitation; it just takes longer to get through the jungle bits. This can be used as a heuristic to inform restarting or other strategic corrective measures in the evolutionary algorithm. If we are stuck in the jungle for too long, we might decide to abandon the run and pick new hyperparameters.
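
A minimal sketch of the non-dominated filter at the core of this approach (assuming all objectives are maximized; crowding-distance pruning, as in NSGA-II, is left out):

    from typing import Sequence

    def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
        """a dominates b: at least as good everywhere, strictly better somewhere."""
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def pareto_front(pop: list) -> list:
        """Keep every candidate that no other candidate dominates."""
        return [c for i, c in enumerate(pop)
                if not any(dominates(o, c) for j, o in enumerate(pop) if j != i)]

    # Toy scores: (activation economy, coverage, correctness), higher is better.
    scores = [(0.9, 0.4, 0.7), (0.5, 0.8, 0.8), (0.4, 0.7, 0.6)]
    print(pareto_front(scores))  # third is dominated by the second; first two stay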

  • nzhaa 2 days ago

    I think the way life frames many of its problems forces you to simply run into the jungle, cutting through the shrubbery. At least for deeply technical projects, it's difficult to use a Pareto front optimization system, as learning something fully requires long stretches of complete focus; you lose a lot of cached information if you're always switching between contexts. So I guess it depends on the type of end goal you hope to achieve, but I personally wouldn't use Pareto front optimization (maybe I only think this because I am relatively deep into exploitation, though).

FailMore 2 days ago

Thanks for writing this; it's a problem I struggle with. I think confidence in oneself impacts the decision. You seem to be highly employable, which means you can err on the side of exploration. I'm less convinced of that for myself, which makes me anxious about continuing to explore. Although I find it very difficult to resist that part of my nature!

  • nzhaa 2 days ago

    I personally think that if you aren't employable, you should probably explore more to find a good arm to begin to exploit (supposing you are young, with a long-range T, and able to take advantage of learning opportunities once you're given them). There's usually a lot of variance in life's arms, so it's good to try several, as long as exploring doesn't ruin your other possibilities (and you don't have other responsibilities). Otherwise, you would be stuck exploiting an overall not-so-beneficial arm.

throwpoaster 2 days ago

A simplified heuristic[0] for those interested in personal applications: it works out that you should explore about 37% of the search space without committing, then commit to the first option that beats everything you've seen so far.

[0]: https://en.m.wikipedia.org/wiki/Secretary_problem
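
A quick Monte Carlo sanity check of that 1/e rule (a sketch; candidates are a random permutation, and success means ending up with the single best one):

    import math, random

    def trial(n, cutoff):
        """Skip the first `cutoff` candidates, then take the first one that
        beats everything seen so far. Return True iff we got the best."""
        ranks = list(range(n))
        random.shuffle(ranks)                       # higher rank = better
        best_seen = max(ranks[:cutoff], default=-1)
        for r in ranks[cutoff:]:
            if r > best_seen:
                return r == n - 1
        return ranks[-1] == n - 1                   # forced to take the last one

    n, runs = 100, 20_000
    cutoff = round(n / math.e)                      # ~37
    wins = sum(trial(n, cutoff) for _ in range(runs))
    print(f"P(best) with cutoff {cutoff}: {wins / runs:.3f}")  # ~0.37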

  • GuB-42 2 days ago

    Note that this holds only if you want the single best candidate, all others being considered equally bad. If, instead, you are content with a merely "good" candidate and mainly want to avoid picking a terrible one, then a proposed solution is to set the cutoff to sqrt(n), where n is the number of candidates.

    See "Cardinal payoff variant" in the Wikipedia page.

  • schneems 2 days ago

    With the caveat that this number is for situations where you cannot backtrack (i.e., go back and choose a previously seen option that isn't the current one).

    There is a great book, Algorithms to Live By, that goes over this explore/exploit problem and more. I recommend it.

    • Y_Y 2 days ago

      I second that recommendation. There's some really good discussion on when and how to apply heuristics like this to real-life decisions.

matheist 2 days ago

See also Thompson sampling[+] for a different approach to multi-armed bandits that doesn't depend on an explicit explore/exploit distinction.

[+] https://en.wikipedia.org/wiki/Thompson_sampling
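
For the curious, a minimal Beta-Bernoulli Thompson sampling sketch (the arm success rates below are made up): each step samples a plausible success rate for every arm from its posterior and plays the argmax, so exploration falls out of posterior uncertainty rather than an explicit epsilon.

    import random

    probs = [0.3, 0.5, 0.7]          # hypothetical true success rates
    wins = [1] * len(probs)          # Beta(1, 1) prior per arm
    losses = [1] * len(probs)

    for _ in range(5000):
        # One posterior sample per arm; uncertain arms sometimes win the argmax.
        draws = [random.betavariate(w, l) for w, l in zip(wins, losses)]
        a = draws.index(max(draws))
        if random.random() < probs[a]:
            wins[a] += 1
        else:
            losses[a] += 1

    pulls = [w + l - 2 for w, l in zip(wins, losses)]
    print("pulls per arm:", pulls)   # the 0.7 arm should get most of the pulls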

  • nzhaa 2 days ago

    I've read about this -- I decided not to include it in the blog because it wasn't easy to analogize to the problem in an especially unique way. Still interesting, though!

bsaul 2 days ago

Genuine question (the maths lost me): this seems like a philosophical problem. Did translating that dilemma into mathematical language bring any interesting result, compared to using regular English?

  • nabla9 2 days ago

    It's a dynamic problem. The correct choice depends on the dynamics of the system itself.

    The explore/exploit dilemma is studied and simulated in computational biology. The correct strategy depends on the environment.