VowpalRegcbLearner

class coba.learners.VowpalRegcbLearner

RegCB exploration with a VW contextual bandit learner.

For more information on this algorithm see Foster et al. (2018).

References

Foster, D., Agarwal, A., Dudik, M., Luo, H. & Schapire, R. (2018). Practical Contextual Bandits with Regression Oracles. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:1539-1548.

Constructors

__init__(mode: Literal['optimistic', 'elimination'] = 'elimination', features: Sequence[str] = (1, 'a', 'ax', 'axx'), seed: int | None = 1, **kwargs) → None

Instantiate a VowpalRegcbLearner.

Parameters:

mode – Whether exploration should always predict the action with the optimal upper bound ('optimistic') or should use an elimination technique that removes actions that no longer seem plausible and picks randomly from the remaining actions ('elimination').
features – A list of namespaces and interactions to use when learning reward functions.
seed – The seed used by VW to generate any necessary random numbers.
kwargs – Additional keyword args are passed on as VW CLI arguments (unless removed in the function).
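
The difference between the two mode values can be sketched in plain Python. This is a simplified illustration only, not VW's implementation: the per-action confidence intervals, the helper names, and the rule for what counts as "plausible" (upper bound at least the best lower bound) are assumptions made for the example.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical per-action reward confidence intervals: action -> (lower, upper).
Intervals = Dict[str, Tuple[float, float]]

def optimistic_choice(intervals: Intervals) -> str:
    """'optimistic' sketch: pick the action with the highest upper bound."""
    return max(intervals, key=lambda a: intervals[a][1])

def plausible_actions(intervals: Intervals) -> List[str]:
    """Keep actions whose upper bound still reaches the best lower bound."""
    best_lower = max(low for low, _ in intervals.values())
    return [a for a, (_, high) in intervals.items() if high >= best_lower]

def elimination_choice(intervals: Intervals, rng: random.Random) -> str:
    """'elimination' sketch: pick at random among non-eliminated actions."""
    return rng.choice(plausible_actions(intervals))

intervals = {"a1": (0.2, 0.9), "a2": (0.5, 0.7), "a3": (0.0, 0.4)}

print(optimistic_choice(intervals))   # a1
print(plausible_actions(intervals))   # ['a1', 'a2']
print(elimination_choice(intervals, random.Random(1)))
```

In this toy setting a3 is dropped because its upper bound (0.4) is below the best lower bound (0.5), so elimination only ever returns a1 or a2.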

Methods

finish(): Finish all pending work (e.g., write buffers to disk).

learn(context: Context, action: Action, reward: float, probability: float, actions: Actions | None = None) → None

Learn about the action taken in the context.

Parameters:

context – The context in which the action was taken.
action – The action that was taken.
reward – The reward for the given context and action (feedback for IGL problems).
probability – The probability the given action was taken.
**kwargs – Optional information returned during prediction.
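
A minimal sketch of how predict and learn might be called in sequence during one interaction loop. UniformStubLearner is a hypothetical stand-in written for this example so that it runs without coba or VW installed; only the shape of the calls mirrors the signatures documented on this page. The contexts, rewards, and the "n_actions" key are made up.

```python
import random

class UniformStubLearner:
    """Hypothetical stand-in exposing the same predict/learn call shape."""

    def __init__(self, seed: int = 1) -> None:
        self._rng = random.Random(seed)
        self.history = []

    def predict(self, context, actions):
        # Returns (action, probability, kwargs), like the documented tuple.
        action = self._rng.choice(list(actions))
        return action, 1 / len(actions), {"n_actions": len(actions)}

    def learn(self, context, action, reward, probability, **kwargs):
        # A real learner would update its model; the stub just records the call.
        self.history.append((context, action, reward, probability))

learner = UniformStubLearner()
rounds = [({"age": 0.3}, ["left", "right"]), ({"age": 0.8}, ["left", "right"])]

for context, actions in rounds:
    action, probability, info = learner.predict(context, actions)
    reward = 1.0 if action == "left" else 0.0  # made-up reward signal
    # Whatever predict returned as extra information is handed back to learn.
    learner.learn(context, action, reward, probability, **info)

print(len(learner.history))  # 2
```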

predict(context: Context, actions: Actions) → Tuple[Action, Prob, Kwargs]

Predict which action to take in the context.

Parameters:

context – The current context. It will either be None (multi-armed bandit), a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).
actions – The current set of actions to choose from in the given context. Each action will either be a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).

Returns:

A Prediction. Several prediction formats are supported. See the type-hint for these.
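
The context shapes listed above can be illustrated with plain Python values. describe_context is a hypothetical helper written for this example and is not part of coba; treating strings as single values rather than sequences is an assumption of the sketch.

```python
from collections.abc import Mapping, Sequence

def describe_context(context) -> str:
    """Classify a context by the shapes described in the parameter list."""
    if context is None:
        return "multi-armed bandit"
    if isinstance(context, Mapping):
        return "sparse features"
    # Assumption: a string counts as one value, not a sequence of characters.
    if isinstance(context, Sequence) and not isinstance(context, str):
        return "dense features"
    return "single feature"

examples = [None, 3.5, "sunny", (0.1, 0.2, 0.3), {"age": 0.3, "is_member": 1}]

for ctx in examples:
    print(repr(ctx), "->", describe_context(ctx))
```

Each action in actions can take the same non-None shapes, so the same check would apply to individual actions.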

score(context: Context, actions: Actions, action: Action) → Prob

Propensity score an action.

Parameters:

context – The current context.
actions – The current set of actions that can be chosen.
action – The action to propensity score.

Returns:

The propensity score of the given action. That is, P(action|context,actions).
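
To make P(action|context,actions) concrete, the sketch below scores actions under a toy policy that picks uniformly from a subset of plausible actions. The function name and the uniform-over-subset policy are assumptions for illustration; how VW computes these probabilities internally is not shown here.

```python
from typing import Hashable, Sequence

def uniform_subset_score(
    actions: Sequence[Hashable],
    plausible: Sequence[Hashable],
    action: Hashable,
) -> float:
    """Probability that a uniform-over-plausible policy picks `action`."""
    if action not in actions:
        raise ValueError("action must be one of the available actions")
    if not plausible:
        raise ValueError("at least one action must remain plausible")
    return 1 / len(plausible) if action in plausible else 0.0

actions = ["a1", "a2", "a3"]
plausible = ["a1", "a2"]  # hypothetical set left after elimination

scores = {a: uniform_subset_score(actions, plausible, a) for a in actions}
print(scores)  # {'a1': 0.5, 'a2': 0.5, 'a3': 0.0}
```

Whatever the policy, the scores over all available actions should sum to 1, which is a quick sanity check on any propensity function.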

Attributes

params