Introduction to Multi-Armed Bandits

Epoch - IIT Hyderabad

Introduction

What’s the Deal with Reinforcement Learning?

Reinforcement learning, a fascinating branch of AI, focuses on teaching agents to make decisions by interacting with their environment and maximizing cumulative rewards. This trial-and-error approach has gained immense popularity, powering applications in robotics, gaming, and recommendation systems.

Imagine training a dog to perform a trick. Each time the trick is executed correctly, the dog earns a treat. Over time, the dog associates the action with the reward, performing the trick more often. Here, the dog is the agent, the trick is the action, the treat is the reward, and the training area is the environment.

Similarly, in reinforcement learning, an agent learns to take optimal actions by exploring its environment and receiving feedback. The ultimate goal? To discover the best set of actions, or policy, that maximizes the expected reward.

An agent interacts with the environment, navigating through states, actions, and rewards

In the reinforcement learning cycle, at each time step the agent takes action Aₜ in its current state Sₜ. The environment responds with a reward Rₜ₊₁ and transitions the agent to a new state Sₜ₊₁. That new state and reward then play the role of Sₜ and Rₜ for the next step, and the cycle continues.
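
As a rough sketch of this loop in code (with a toy environment and a purely random agent standing in for the real thing; neither corresponds to any particular library), the cycle looks something like this:

```python
import random

class Environment:
    """Hypothetical stand-in environment, not a real library API."""
    def step(self, action):
        # Return (next_state, reward); both are random placeholders here.
        next_state = random.randint(0, 4)
        reward = random.random()
        return next_state, reward

class Agent:
    """Hypothetical agent that picks actions at random (no learning yet)."""
    def act(self, state):
        return random.choice([0, 1])

env, agent = Environment(), Agent()
state, total_reward = 0, 0.0
for t in range(10):
    action = agent.act(state)         # take action A_t in state S_t
    state, reward = env.step(action)  # environment returns S_{t+1} and R_{t+1}
    total_reward += reward
print("cumulative reward after 10 steps:", round(total_reward, 2))
```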

What are Multi-Armed Bandits?

The term bandit originates from the slot machine, often referred to as a one-armed bandit in a casino: a machine that takes your money and returns a small, uncertain reward. In machine learning, a multi-armed bandit refers to a problem where an agent must choose from a set of options (or "arms"), each with an uncertain reward.

The agent’s challenge is to balance exploration (trying new options to learn their rewards) and exploitation (choosing the best-known option to maximize the total reward).

Slot machines: A classic analogy for the multi-armed bandit problem, where each machine represents an arm with unknown rewards

The Problem

The multi-armed bandit problem is a classic reinforcement learning problem where an agent must choose between multiple actions (or arms), each associated with an unknown probabilistic reward distribution, to maximize its total reward over time. At each time step, the agent selects an action/arm and observes the corresponding reward. The objective is to maximize the total reward over a fixed number of time steps while simultaneously learning the reward distributions of the actions/arms.

Example:

Suppose you’re an online advertising company with several ad designs to show users. Each ad has an unknown click-through rate (CTR), representing the probability of a user clicking on the ad. Your goal is to maximize the total number of clicks over a fixed number of impressions (i.e., times the ads are shown) while learning the CTRs of the various ad designs. Each time a user visits your website, you can choose an ad to display and observe whether the user clicks on it.
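
To make this concrete, below is a minimal simulation of the ad scenario, assuming three hypothetical ads with made-up click-through rates and a naive policy that picks an ad uniformly at random. The names and numbers are purely illustrative; the point is only the shape of the problem: pick an arm, observe a 0/1 click.

```python
import random

# Hidden click-through rates for three hypothetical ads (unknown to the agent).
true_ctrs = [0.05, 0.03, 0.08]

def show_ad(ad_index):
    """Simulate one impression: return 1 if the user clicks, else 0."""
    return 1 if random.random() < true_ctrs[ad_index] else 0

clicks = [0, 0, 0]       # observed clicks per ad
impressions = [0, 0, 0]  # times each ad was shown

for _ in range(10_000):
    ad = random.randrange(len(true_ctrs))  # naive policy: pick an ad at random
    impressions[ad] += 1
    clicks[ad] += show_ad(ad)

estimated_ctrs = [c / n if n else 0.0 for c, n in zip(clicks, impressions)]
print("estimated CTRs:", estimated_ctrs)  # should approach the hidden true_ctrs
```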

The multi-armed bandit problem appears in numerous real-world applications involving multiple choices and limited resources. Balancing exploration and exploitation to identify the best option while learning each option’s reward distribution is a significant challenge in fields like marketing, finance, and healthcare.

Interaction loop in a multi-armed bandit problem: The learner takes actions and receives rewards from the environment, refining its strategy over time.

Types of Multi-Armed Bandits

When it comes to the multi-armed bandit problem, things can get pretty interesting (and a little tricky). Depending on how the rewards behave or how much information you have, the problem can take on different flavors. Let’s break down the most common types of multi-armed bandit problems—with some fun examples along the way!

  1. Stochastic Multi-Armed Bandits:
    Here, each arm has a fixed but unknown reward distribution. The agent learns the expected rewards of each arm to maximize total reward.
    Example: Imagine playing multiple slot machines, each with a fixed chance of winning. Over time, you figure out which machine pays out the most while minimizing attempts on the less rewarding ones.
  2. Adversarial Multi-Armed Bandits:
    In this case, rewards are controlled by an adversary who adjusts them to hinder the agent. The agent aims to minimize regret: the gap between the reward it actually collects and the best reward it could have collected.
    Example: Choosing doors in an online game where an adversary moves prizes to trick you. You adapt your strategy based on past outcomes.
  3. Contextual Multi-Armed Bandits:
    Arms are associated with contextual features, and rewards depend on the context. The agent learns to map context to rewards for better decisions.
    Example: Recommending ice cream flavors based on customer preferences (e.g., kids liking chocolate). Over time, you learn which flavor works best for each customer type.
  4. Semi-Bandits:
    The agent selects a combination of arms at once and observes the reward of each chosen arm individually, not just their combined total, allowing for more informed decisions.
    Example: A fruit shop where customers pick combinations of fruits. You learn profits from each type (e.g., apples $5, bananas $3), helping optimize future choices.

These different types of multi-armed bandit problems require different strategies and algorithms to solve. Depending on the context and available information, you might even combine a few approaches to find the best solution.
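
To make the contextual case (type 3 above) a bit more tangible, here is a minimal sketch assuming two made-up customer types and two flavours, where the chance of a sale depends on both the context and the chosen arm. The probabilities and the simple averaging strategy are illustrative assumptions, not a prescribed algorithm.

```python
import random

# Hypothetical sale probabilities: reward depends on (context, arm).
#                  chocolate  vanilla
prob = {"kid":    [0.70,      0.20],
        "adult":  [0.30,      0.60]}

def pull(context, arm):
    """Serve `arm` to a customer of type `context`; return 1 on a sale."""
    return 1 if random.random() < prob[context][arm] else 0

# Naive strategy: keep per-context running averages, usually pick the best-looking arm.
counts = {c: [0, 0] for c in prob}
values = {c: [0.0, 0.0] for c in prob}

for _ in range(5_000):
    context = random.choice(["kid", "adult"])
    if random.random() < 0.1:                     # occasionally try something else
        arm = random.randrange(2)
    else:                                         # otherwise use the current best arm
        arm = max(range(2), key=lambda a: values[context][a])
    reward = pull(context, arm)
    counts[context][arm] += 1
    n = counts[context][arm]
    values[context][arm] += (reward - values[context][arm]) / n  # incremental average

print(values)  # per-context reward estimates, roughly matching `prob`
```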

The Exploration vs. Exploitation Trade-off

Ah, the classic exploration vs. exploitation trade-off — one of the trickiest challenges in multi-armed bandit problems. The agent is constantly torn between two conflicting goals:

  • Should it explore new options to uncover potentially better rewards?
  • Or should it exploit the best-known option to rake in those sweet, immediate rewards?

Balancing exploration and exploitation: A delicate trade-off to optimize decisions and rewards.

Exploration

Exploration involves trying new options with uncertain rewards to gather information about their payoff potential. While exploration can lead to better long-term decisions, it often comes at the cost of lower immediate rewards.

Exploitation

Exploitation focuses on choosing the option with the highest known reward based on current information. This strategy maximizes immediate rewards but risks missing out on better options in the long run.

An Example

Imagine you’re a marketer deciding which ad to show to users. Each ad has an unknown click-through rate (CTR)—essentially its reward potential.

  • During the exploration phase, you display ads randomly to gather data on their CTRs. This helps you estimate how well each ad performs.
  • Once you’ve got enough data, you switch to exploitation, focusing on displaying the ad with the highest CTR to maximize immediate clicks (and maybe impress your boss).
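
Here is a minimal sketch of that explore-then-exploit recipe, reusing the hypothetical ad CTRs from earlier. The 3,000/7,000 split between the two phases is arbitrary and only meant to illustrate the idea.

```python
import random

true_ctrs = [0.05, 0.03, 0.08]   # hidden CTRs, same hypothetical setup as before

def show_ad(ad):
    return 1 if random.random() < true_ctrs[ad] else 0

clicks, impressions, total_clicks = [0, 0, 0], [0, 0, 0], 0

# Phase 1 - exploration: show ads uniformly at random to estimate their CTRs.
for _ in range(3_000):
    ad = random.randrange(3)
    r = show_ad(ad)
    impressions[ad] += 1
    clicks[ad] += r
    total_clicks += r

# Phase 2 - exploitation: commit to the ad with the best estimated CTR.
best_ad = max(range(3), key=lambda a: clicks[a] / max(impressions[a], 1))
for _ in range(7_000):
    total_clicks += show_ad(best_ad)

print("best ad found:", best_ad, "| total clicks:", total_clicks)
```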

But here’s the catch — what if the environment changes? Suddenly, the “best” ad isn’t performing as well anymore. That’s when you might need to explore again to reassess those CTRs and adapt to the new situation.

The real challenge? Balancing exploration and exploitation effectively to maximize overall rewards. Explore too much, and you lose out on immediate gains. Exploit too much, and you might miss the golden opportunities hiding in the shadows. It’s like walking a tightrope, except the rope keeps shifting under your feet.

Closing Thoughts…

Multi-armed bandits offer a powerful framework for tackling decision-making problems in diverse real-world scenarios. This article introduced the concept of multi-armed bandits, explored their significance in reinforcement learning, and discussed their various types. We also examined the exploration-exploitation trade-off, a key challenge where the agent balances trying new options and maximizing immediate rewards.

In the next article, we’ll dive into popular algorithms for solving multi-armed bandit problems, such as Epsilon-Greedy, Upper Confidence Bound (UCB), and Thompson Sampling, exploring their mechanisms and applications.

Still have any queries? Feel free to comment :)

Compiled by: Sriram Gonella

Refined by: Shravan Badgujar
