LinUCBLearner

class coba.learners.LinUCBLearner

A contextual bandit learner using upper confidence bounds to explore.

This is an implementation of the Chu et al. (2011) LinUCB algorithm. The Sherman-Morrison formula is used to update the matrix inverse incrementally rather than recomputing it at every step. Expected reward is modeled as a linear function of context and action features.

Remarks: The Sherman-Morrison implementation used below is given in long form here.
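
As an illustration of the rank-one update that the Sherman-Morrison formula provides, the following is a minimal sketch in plain Python. It is not the library's code: the function name `sherman_morrison_update`, the list-based matrix types, and the identity initialization are all assumptions made for this example, and the shortcut used here relies on the matrix being symmetric.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]

def sherman_morrison_update(A_inv: Matrix, x: Vector) -> Matrix:
    """Return (A + x x^T)^-1 given A^-1, assuming A is symmetric (hypothetical helper)."""
    n = len(x)
    # w = A^-1 x
    w = [sum(A_inv[i][j] * x[j] for j in range(n)) for i in range(n)]
    # v = x^T A^-1 x, a scalar
    v = sum(x[i] * w[i] for i in range(n))
    # (A + x x^T)^-1 = A^-1 - (w w^T) / (1 + v)
    return [[A_inv[i][j] - w[i] * w[j] / (1.0 + v) for j in range(n)]
            for i in range(n)]

# Start from the identity matrix (an assumed initialization for this sketch).
A_inv = [[1.0, 0.0], [0.0, 1.0]]
A_inv = sherman_morrison_update(A_inv, [1.0, 2.0])
```

Each update costs O(n^2) operations for n features, whereas inverting the matrix from scratch would cost O(n^3).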

References

Chu, Wei, Lihong Li, Lev Reyzin, and Robert Schapire. “Contextual bandits with linear payoff functions.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208-214. JMLR Workshop and Conference Proceedings, 2011.

Constructors

__init__(alpha: float = 1, features: Sequence[str] = [1, 'a', 'ax'], seed: int = 1) → None

Instantiate a LinUCBLearner.

Parameters:

alpha – This parameter controls the exploration rate of the algorithm. A value of 0 causes actions to be selected based on the current best point estimate (i.e., no exploration), while a value of inf means that actions are selected based solely on the estimated upper bound for each action (i.e., the action with the largest upper bound on its point estimate is always taken).
features – Feature set interactions to use when calculating action value estimates. Context features are indicated by x’s while action features are indicated by a’s. For example, xaa means to cross the context features with the action features and then with the action features again.
seed – A seed for random number generation.
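
To make the features notation concrete, here is a rough sketch of how interaction terms such as 1, 'a' and 'ax' could be expanded into a single feature vector. The helpers `cross_features` and `featurize` are hypothetical names introduced for this example and are not part of the library. The sketch also assumes dense (sequence) features and keeps every ordered combination, so a term like 'aa' would contain duplicate products; how the library's own encoder handles that is not covered here.

```python
from itertools import product
from typing import List, Sequence, Union

def cross_features(term: Union[int, str], x: Sequence[float], a: Sequence[float]) -> List[float]:
    """Expand one interaction term (1, 'a', 'ax', 'xaa', ...) into feature values."""
    if term == 1:
        return [1.0]  # constant (intercept) feature
    # Choose the source vector for each letter: 'x' is context, anything else is action.
    sources = [x if letter == "x" else a for letter in str(term)]
    features = []
    # One feature per combination of one value drawn from each source.
    for combo in product(*sources):
        value = 1.0
        for c in combo:
            value *= c
        features.append(value)
    return features

def featurize(terms: Sequence[Union[int, str]], x: Sequence[float], a: Sequence[float]) -> List[float]:
    """Concatenate the expansions of all terms into one feature vector."""
    out: List[float] = []
    for term in terms:
        out.extend(cross_features(term, x, a))
    return out

context = [2.0, 3.0]  # two context features
action = [1.0, 0.0]   # a one-hot encoded action
vector = featurize([1, "a", "ax"], context, action)
```

With the default terms, the vector holds the constant, the two action features, and the four action-by-context products, in that order.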

Methods

learn(context: Context, action: Action, reward: float, probability: float) → None

Learn about the action taken in the context.

Parameters:

context – The context in which the action was taken.
action – The action that was taken.
reward – The reward for the given context and action (feedback for IGL problems).
probability – The probability the given action was taken.
**kwargs – Optional information returned during prediction.

predict(context: Context, actions: Actions) → Tuple[Action, Prob]

Predict which action to take in the context.

Parameters:

context – The current context. It will either be None (multi-armed bandit), a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).
actions – The current set of actions to choose from in the given context. Each action will either be a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).

Returns:

A Prediction. Several prediction formats are supported. See the type-hint for these.
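
The role of alpha in choosing an action can be sketched as follows. This is an illustrative reading of the LinUCB rule (point estimate plus alpha times a confidence width), not the library's implementation; `ucb_values`, `theta` and `A_inv` are names assumed for this example, and each action is assumed to already be encoded as a dense feature vector.

```python
import math
from typing import List, Sequence

def ucb_values(theta: Sequence[float],
               A_inv: Sequence[Sequence[float]],
               action_features: Sequence[Sequence[float]],
               alpha: float) -> List[float]:
    """Return estimate + alpha * sqrt(f^T A^-1 f) for each action's feature vector f."""
    values = []
    for f in action_features:
        n = len(f)
        # Point estimate of the reward: theta . f
        estimate = sum(theta[i] * f[i] for i in range(n))
        # Squared confidence width: f^T A^-1 f
        width_sq = sum(f[i] * A_inv[i][j] * f[j] for i in range(n) for j in range(n))
        values.append(estimate + alpha * math.sqrt(width_sq))
    return values

theta = [1.0, 0.5]                 # assumed learned weights
A_inv = [[0.1, 0.0], [0.0, 1.0]]   # first feature is well explored, second is not
features = [[1.0, 0.0], [0.0, 1.0]]

greedy = ucb_values(theta, A_inv, features, alpha=0.0)
explore = ucb_values(theta, A_inv, features, alpha=1.0)
greedy_choice = greedy.index(max(greedy))
explore_choice = explore.index(max(explore))
```

In this toy setup, alpha = 0 picks the first action because it has the higher point estimate, while alpha = 1 picks the second action because its wider confidence interval lifts its upper bound above the first.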

score(context: Context, actions: Actions, action: Action) → Prob

Propensity score an action.

Parameters:

context – The current context.
actions – The current set of actions that can be chosen.
action – The action to propensity score.

Returns:

The propensity score of the given action. That is, P(action|context,actions).
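
One way to think about the propensity score for a deterministic rule like LinUCB is sketched below. It assumes that ties among the actions with the largest upper bound are broken uniformly at random; that tie-breaking rule, and the helper name `propensity`, are assumptions of this example rather than documented behavior.

```python
from typing import Sequence

def propensity(values: Sequence[float], index: int) -> float:
    """Probability of picking actions[index], assuming uniform tie-breaking among maxima."""
    best = max(values)
    # Indexes of every action whose upper bound equals the maximum.
    tied = [i for i, v in enumerate(values) if v == best]
    return 1.0 / len(tied) if index in tied else 0.0

upper_bounds = [1.5, 1.5, 0.2]  # hypothetical upper bounds for three actions
p_first = propensity(upper_bounds, 0)
p_last = propensity(upper_bounds, 2)
```

Under this assumption the scores over all actions sum to 1, and any action that is not among the maxima receives a score of 0.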

Attributes

params