CorralLearner

class coba.learners.CorralLearner

A contextual bandit learner that optimizes over a collection of base learners.

This is an implementation of the Agarwal et al. (2017) Corral algorithm and requires that the reward is always in [0,1].
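Because rewards outside [0,1] are not supported, rewards on another scale need to be mapped into that range before they reach the learner. A minimal sketch of one way to do this, assuming the reward bounds are known in advance; `scale_reward` is a hypothetical helper and is not part of coba:

```python
def scale_reward(reward: float, low: float, high: float) -> float:
    """Map a reward from the known range [low, high] onto [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    # Clip first so that out-of-range rewards cannot leave [0, 1].
    clipped = min(max(reward, low), high)
    return (clipped - low) / (high - low)


# Rewards originally on a [-5, 5] scale; 9.0 is out of range and gets clipped.
scaled = [scale_reward(r, low=-5.0, high=5.0) for r in (-5.0, 0.0, 2.5, 9.0)]
print(scaled)
```

Clipping changes the reward signal for out-of-range values, so it is only a reasonable choice when such values are rare or known to be noise.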

References

Agarwal, Alekh, Haipeng Luo, Behnam Neyshabur, and Robert E. Schapire. “Corralling a band of bandit algorithms.” In Conference on Learning Theory, pp. 12-38. PMLR, 2017.

Constructors

__init__(learners: Sequence[Learner], eta: float = 0.075, T: float = inf, mode: Literal['importance', 'off-policy'] = 'importance', seed: int = 1) → None

Instantiate a CorralLearner.

Parameters:

learners – The collection of base learners.
eta – The learning rate. This controls how quickly Corral settles on the best base learner.
T – The number of interactions expected during the learning process. A small T will cause the learning rate to shrink towards 0 quickly while a large value for T will cause the learning rate to shrink towards 0 slowly. A value of inf means that the learning rate will remain constant.
mode – Determines how feedback is provided to the base learners. The original paper uses importance sampling ('importance'). Off-policy feedback ('off-policy') is also supported.
seed – A seed for random number generation.
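This page does not spell out what each mode passes to the base learners. The sketch below shows one plausible reading, assuming 'importance' uses the standard importance-sampling estimator and 'off-policy' forwards the observed action and reward unchanged. Both function names are hypothetical and are not part of coba.

```python
from typing import Hashable, Tuple


def importance_feedback(reward: float, taken_action: Hashable,
                        base_action: Hashable, probability: float) -> float:
    """Assumed 'importance' feedback: an importance-weighted reward estimate.

    A base learner receives reward / probability when its own proposed action
    matches the action that was actually taken, and 0 otherwise.
    """
    matched = 1.0 if base_action == taken_action else 0.0
    return reward * matched / probability


def off_policy_feedback(reward: float, taken_action: Hashable,
                        probability: float) -> Tuple[Hashable, float, float]:
    """Assumed 'off-policy' feedback: the observed triple, passed through as-is."""
    return taken_action, reward, probability


# The action "a" was taken with probability 0.25 and earned a reward of 0.5.
agreeing = importance_feedback(0.5, "a", base_action="a", probability=0.25)
disagreeing = importance_feedback(0.5, "a", base_action="b", probability=0.25)
passthrough = off_policy_feedback(0.5, "a", probability=0.25)
print(agreeing, disagreeing, passthrough)
```

Under these assumptions the importance-weighted value can exceed 1 even though the observed reward is in [0,1], which is the usual variance cost of importance sampling.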

Methods

learn(context: Context, action: Action, reward: float, probability: float, info) → None

Learn about the action taken in the context.

Parameters:

context – The context in which the action was taken.
action – The action that was taken.
reward – The reward for the given context and action (feedback for IGL problems).
probability – The probability the given action was taken.
info – Optional information returned during prediction.

predict(context: Context, actions: Actions) → Tuple[Action, Prob, Kwargs]

Predict which action to take in the context.

Parameters:

context – The current context. It will either be None (multi-armed bandit), a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).
actions – The current set of actions to choose from in the given context. Each action will either be a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).

Returns:

A Prediction. Several prediction formats are supported. See the type-hint for these.
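The predict and learn signatures above suggest a round trip: predict returns an action, its probability, and extra information, and that information is handed back to learn. A minimal sketch of that loop, assuming the third element of the prediction is what arrives as `info`; `UniformSketchLearner` is a hypothetical stand-in that only mirrors the documented method shapes and does not use coba:

```python
import random
from typing import Any, Dict, Hashable, Sequence, Tuple


class UniformSketchLearner:
    """Hypothetical learner with the same predict/learn shapes as documented."""

    def __init__(self, seed: int = 1) -> None:
        self._rng = random.Random(seed)
        self.total_reward = 0.0

    def predict(self, context: Any,
                actions: Sequence[Hashable]) -> Tuple[Hashable, float, Dict[str, Any]]:
        index = self._rng.randrange(len(actions))
        # Return the action, the probability it was chosen with, and extra info.
        return actions[index], 1.0 / len(actions), {"index": index}

    def learn(self, context: Any, action: Hashable, reward: float,
              probability: float, info: Dict[str, Any]) -> None:
        # A real learner would update its policy here; this one only tallies.
        self.total_reward += reward


learner = UniformSketchLearner(seed=1)
# One context of each documented form: None, a value, dense, and sparse.
contexts = [None, 3.2, (1.0, 0.0), {"age": 31}]
actions = ["left", "right"]
history = []

for context in contexts:
    action, prob, info = learner.predict(context, actions)
    reward = 1.0 if action == "left" else 0.0  # toy reward, already in [0, 1]
    learner.learn(context, action, reward, prob, info)
    history.append((action, prob, info))
```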

score(context: Context, actions: Actions, action: Action) → Prob

Compute the propensity score of an action.

Parameters:

context – The current context.
actions – The current set of actions that can be chosen.
action – The action to propensity score.

Returns:

The propensity score of the given action. That is, P(action|context,actions).
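For a learner that mixes several base learners, P(action|context,actions) can be viewed as a mixture. The sketch below assumes each base learner proposes exactly one action and that one base learner is then selected according to a weight vector; the page does not confirm that CorralLearner computes its score this way, and `mixture_propensity` is a hypothetical helper.

```python
from typing import Hashable, Sequence


def mixture_propensity(weights: Sequence[float],
                       base_actions: Sequence[Hashable],
                       action: Hashable) -> float:
    """Total weight of the base learners whose proposed action equals `action`."""
    if len(weights) != len(base_actions):
        raise ValueError("weights and base_actions must have the same length")
    return sum(w for w, a in zip(weights, base_actions) if a == action)


# Three base learners with selection weights that sum to 1.
weights = [0.5, 0.25, 0.25]
base_actions = ["a", "b", "a"]  # the action each base learner proposed

p_a = mixture_propensity(weights, base_actions, "a")
p_b = mixture_propensity(weights, base_actions, "b")
p_c = mixture_propensity(weights, base_actions, "c")
print(p_a, p_b, p_c)
```

An action that no base learner proposes has a propensity of 0 under this view, so it can never be selected.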

Attributes

params