BanditUCBLearner

class coba.learners.BanditUCBLearner

Select the action with the highest upper confidence bound estimate.

This learner implements the UCB1-Tuned algorithm of Auer et al. (2002). The regret bounds proven in the paper assume only that rewards have support in [0,1].
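To make the selection rule concrete, the following is a minimal sketch of the UCB1-Tuned index as given in Auer et al. (2002). It is not coba's source code. The function name `ucb1_tuned_index` and its parameters are hypothetical and chosen for illustration only.

```python
import math

def ucb1_tuned_index(mean: float, mean_sq: float, pulls: int, total_pulls: int) -> float:
    """UCB1-Tuned index for one action (illustrative helper, not part of coba).

    mean        -- average reward observed for this action
    mean_sq     -- average squared reward observed for this action
    pulls       -- number of times this action has been played
    total_pulls -- number of plays across all actions
    """
    if pulls == 0:
        # Assumption: unplayed actions get an infinite index so they are tried first.
        return math.inf
    log_n = math.log(total_pulls)
    # Upper confidence bound on the variance of this action's reward.
    # max(0.0, ...) guards against tiny negative values from rounding.
    variance_bound = max(0.0, mean_sq - mean**2 + math.sqrt(2 * log_n / pulls))
    # 1/4 is the largest possible variance of a Bernoulli reward.
    return mean + math.sqrt((log_n / pulls) * min(0.25, variance_bound))
```

With equal average rewards, an action that has been played less often receives a larger index, which is what drives exploration.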

References

Auer, Peter, Nicolo Cesa-Bianchi, and Paul Fischer. “Finite-time analysis of the multiarmed bandit problem.” Machine learning 47.2-3 (2002): 235-256.

Constructors

__init__(seed: int = 1)

Instantiate a BanditUCBLearner.

Parameters:

seed – The seed used to select actions in predict.

Methods

learn(context: Context, action: Action, reward: float, probability: float) -> None

Learn about the action taken in the context.

Parameters:
  • context – The context in which the action was taken.

  • action – The action that was taken.

  • reward – The reward for the given context and action (feedback for IGL problems).

  • probability – The probability the given action was taken.

  • **kwargs – Optional information returned during prediction.
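As a rough illustration of what learning can look like for a multi-armed bandit, the sketch below keeps running statistics per action. It is an assumption about one reasonable approach, not coba's implementation. `ArmStats`, `stats`, and `record_reward` are hypothetical names, and the sketch ignores the context and probability arguments.

```python
from dataclasses import dataclass
from typing import Dict, Hashable

@dataclass
class ArmStats:
    """Running statistics for one action (hypothetical helper, not a coba class)."""
    pulls: int = 0
    mean: float = 0.0     # running average of rewards
    mean_sq: float = 0.0  # running average of squared rewards

    def update(self, reward: float) -> None:
        self.pulls += 1
        # Incremental average: new = old + (x - old) / n
        self.mean += (reward - self.mean) / self.pulls
        self.mean_sq += (reward**2 - self.mean_sq) / self.pulls

# Assumption: actions are hashable so they can be used as dictionary keys.
stats: Dict[Hashable, ArmStats] = {}

def record_reward(action: Hashable, reward: float) -> None:
    """Fold one observed reward into the statistics for the given action."""
    stats.setdefault(action, ArmStats()).update(reward)

record_reward("a", 1.0)
record_reward("a", 0.0)
```

The count, mean, and mean of squares are the three quantities a UCB1-Tuned style index needs, so nothing else has to be stored per action.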

predict(context: None | str | Number | Sequence | Mapping, actions: None | Sequence[Action]) -> Tuple[Action, Prob]

Predict which action to take in the context.

Parameters:
  • context – The current context. It will either be None (multi-armed bandit), a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).

  • actions – The current set of actions to choose from in the given context. Each action will either be a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).

Returns:

A prediction. Several prediction formats are supported; see the return type hint for these.
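The sketch below illustrates how a seeded random generator can be used to pick among actions that share the highest index, returning an (action, probability) pair. Uniform tie-breaking is an assumption made for this example. `pick_action` and its parameters are hypothetical and are not part of coba.

```python
import random
from typing import Any, Sequence, Tuple

def pick_action(actions: Sequence[Any], indexes: Sequence[float],
                rng: random.Random) -> Tuple[Any, float]:
    """Pick an action with the highest index (illustrative helper).

    indexes[i] is the upper confidence bound estimate for actions[i].
    Assumption: ties are broken uniformly at random.
    """
    best = max(indexes)
    tied = [a for a, i in zip(actions, indexes) if i == best]
    # Each tied action is chosen with probability 1/len(tied).
    return rng.choice(tied), 1 / len(tied)

rng = random.Random(1)  # a fixed seed makes the tie-breaking reproducible
action, prob = pick_action(["a", "b", "c"], [0.2, 0.9, 0.9], rng)
```

Re-creating the generator with the same seed yields the same sequence of tie-breaking choices, which is the usual reason for exposing a seed parameter.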

score(context: None | str | Number | Sequence | Mapping, actions: None | Sequence[Action], action: str | Number | Sequence | Mapping) -> float

Compute the propensity score of an action.

Parameters:
  • context – The current context.

  • actions – The current set of actions that can be chosen.

  • action – The action to propensity score.

Returns:

The propensity score of the given action. That is, P(action|context,actions).
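As an illustration of P(action|context,actions) for a greedy-on-index policy, the sketch below assigns equal probability to every action that attains the highest index and zero to the rest. This assumes uniform tie-breaking and is not taken from coba's source. `propensity` is a hypothetical name.

```python
from typing import Any, Sequence

def propensity(actions: Sequence[Any], indexes: Sequence[float], action: Any) -> float:
    """Probability that `action` is selected (illustrative helper).

    indexes[i] is the upper confidence bound estimate for actions[i].
    Assumption: ties for the highest index are broken uniformly at random.
    """
    best = max(indexes)
    tied = [a for a, i in zip(actions, indexes) if i == best]
    return 1 / len(tied) if action in tied else 0.0

actions = ["a", "b", "c"]
indexes = [0.2, 0.9, 0.9]
scores = [propensity(actions, indexes, a) for a in actions]
```

Under this assumption the scores over all available actions sum to one, so they form a valid probability distribution.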

Attributes

params