VowpalSquarecbLearner
- class coba.learners.VowpalSquarecbLearner
SquareCB exploration with a VW contextual bandit learner.
For more information on this algorithm, see Foster and Rakhlin (2020).
References
Foster, D. & Rakhlin, A. (2020). Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:3199-3210.
Constructors
- __init__(mode: Literal['standard', 'elimination'] = 'standard', gamma_scale: float = 10, features: Sequence[str] = (1, 'a', 'ax', 'axx'), seed: int | None = 1, **kwargs) None
Instantiate a VowpalSquarecbLearner.
- Parameters:
mode – Indicates whether all actions should be considered for exploration on each step ('standard') or whether actions that no longer seem plausible should be eliminated ('elimination').
gamma_scale – Controls how quickly SquareCB exploration converges to a greedy policy. The larger the gamma_scale, the faster the algorithm converges to a greedy policy. This value is the same as gamma in the original paper.
features – A list of namespaces and interactions to use when learning reward functions.
seed – The seed used by VW to generate any necessary random numbers.
kwargs – Additional keyword arguments are passed on as VW CLI arguments (unless removed in the function).
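To give a feel for what gamma_scale controls, the following is a minimal pure-Python sketch of the inverse gap weighting rule described by Foster and Rakhlin (2020). It is not the code VW or coba runs. The function name inverse_gap_weights and the example scores are hypothetical, and gamma is assumed to be held fixed rather than scheduled over time.

```python
from typing import List, Sequence


def inverse_gap_weights(scores: Sequence[float], gamma: float) -> List[float]:
    """Turn predicted rewards (higher is better) into action probabilities.

    Illustrative sketch only: every non-greedy action gets probability
    1 / (K + gamma * gap), where gap is how far its predicted reward is
    below the best one. The greedy action receives the remaining mass.
    """
    k = len(scores)
    best = max(range(k), key=lambda i: scores[i])
    probs = [0.0] * k
    for i in range(k):
        if i != best:
            probs[i] = 1.0 / (k + gamma * (scores[best] - scores[i]))
    probs[best] = 1.0 - sum(probs)
    return probs


# Hypothetical predicted rewards for three actions.
scores = [0.9, 0.5, 0.1]
for gamma in (1, 10, 1000):
    print(gamma, [round(p, 3) for p in inverse_gap_weights(scores, gamma)])
```

With gamma set to 0 the rule reduces to uniform exploration, and as gamma grows nearly all probability moves to the action with the highest predicted reward, which matches the description of gamma_scale above.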
Methods
- finish()
Finish all pending work (e.g., write buffers to disk).
- learn(context: Context, action: Action, reward: float, probability: float, actions: Actions | None = None) None
Learn about the action taken in the context.
- Parameters:
context – The context in which the action was taken.
action – The action that was taken.
reward – The reward for the given context and action (feedback for IGL problems).
probability – The probability the given action was taken.
**kwargs – Optional information returned during prediction.
- predict(context: Context, actions: Actions) Tuple[Action, Prob, Kwargs]
Predict which action to take in the context.
- Parameters:
context – The current context. It will either be None (multi-armed bandit), a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).
actions – The current set of actions to choose from in the given context. Each action will either be a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).
- Returns:
A Prediction. Several prediction formats are supported. See the type-hint for these.
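The four context shapes listed above can be easier to keep straight with a small example. The helper below, to_sparse, is hypothetical and not part of coba. It only sketches one way a caller might bring None, a single value, a dense sequence, and a sparse dictionary into a common dictionary form, assuming positional indices are acceptable feature names for dense input.

```python
from typing import Any, Dict, Hashable, Mapping, Sequence


def to_sparse(context: Any) -> Dict[Hashable, Any]:
    """Normalize a context into a feature-name -> value dictionary (sketch)."""
    if context is None:
        # Multi-armed bandit: no context features at all.
        return {}
    if isinstance(context, Mapping):
        # Sparse features: already keyed by feature name.
        return dict(context)
    if isinstance(context, Sequence) and not isinstance(context, str):
        # Dense features: key each value by its position.
        return {i: v for i, v in enumerate(context)}
    # A single feature value.
    return {0: context}


for ctx in (None, 3.5, [1.0, 2.0], {"age": 30}):
    print(repr(ctx), "->", to_sparse(ctx))
```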
- score(context: Context, actions: Actions, action: Action) Prob
Propensity score an action.
- Parameters:
context – The current context.
actions – The current set of actions that can be chosen.
action – The action to propensity score.
- Returns:
The propensity score of the given action. That is, P(action|context,actions).
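The relationship between predict, learn, and score may be clearer in a small interaction loop. The class below, ToyLearner, is a hypothetical stand-in written with the standard library only. It borrows the method names documented here but is not coba's implementation: it ignores the context, tracks a running mean reward per action, and assumes actions are hashable values.

```python
import random
from typing import Any, Dict, Hashable, List, Sequence, Tuple


class ToyLearner:
    """Hypothetical learner mirroring the predict/learn/score method names."""

    def __init__(self, gamma: float = 10.0, seed: int = 1) -> None:
        self._gamma = gamma
        self._rng = random.Random(seed)
        self._sums: Dict[Hashable, float] = {}
        self._counts: Dict[Hashable, int] = {}

    def _estimate(self, action: Hashable) -> float:
        n = self._counts.get(action, 0)
        return self._sums.get(action, 0.0) / n if n else 0.0

    def _probs(self, actions: Sequence[Hashable]) -> List[float]:
        # Inverse gap weighting over the running mean rewards.
        est = [self._estimate(a) for a in actions]
        k = len(actions)
        best = max(range(k), key=lambda i: est[i])
        probs = [
            0.0 if i == best else 1.0 / (k + self._gamma * (est[best] - est[i]))
            for i in range(k)
        ]
        probs[best] = 1.0 - sum(probs)
        return probs

    def score(self, context: Any, actions: Sequence[Hashable], action: Hashable) -> float:
        # P(action | context, actions) under the current estimates.
        return self._probs(actions)[list(actions).index(action)]

    def predict(self, context: Any, actions: Sequence[Hashable]) -> Tuple[Hashable, float, dict]:
        probs = self._probs(actions)
        i = self._rng.choices(range(len(actions)), weights=probs)[0]
        return actions[i], probs[i], {}

    def learn(self, context: Any, action: Hashable, reward: float,
              probability: float, **kwargs: Any) -> None:
        self._sums[action] = self._sums.get(action, 0.0) + reward
        self._counts[action] = self._counts.get(action, 0) + 1


learner = ToyLearner(gamma=10.0, seed=1)
actions = ["a", "b", "c"]
true_reward = {"a": 0.2, "b": 0.8, "c": 0.5}  # made-up rewards for the demo

for _ in range(200):
    action, prob, info = learner.predict(None, actions)
    learner.learn(None, action, true_reward[action], prob, **info)

print({a: round(learner.score(None, actions, a), 3) for a in actions})
```

In this sketch the probability returned by predict is the same number score would report for the chosen action before the next call to learn, and the scores over all actions sum to one.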
Attributes
- params