VowpalRndLearner

class coba.learners.VowpalRndLearner

RND exploration with a VW contextual bandit learner.

Inspired by Random Network Distillation, this explorer constructs an auxiliary prediction problem whose expected target value is zero and uses the magnitude of the resulting prediction to build a confidence interval. In the contextual bandit case this is equivalent to a randomized approximation to the LinUCB bound.

For more information see the wiki.
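The idea behind the description above can be illustrated with a small, standard-library-only sketch. It does not use coba or VW and is not their implementation: it assumes a single one-hot feature, ridge-style auxiliary predictors with a closed-form solution, and a hypothetical ridge strength `lam`. The names `rnd_bonus` and `linucb_width` are introduced only for this illustration. Under those assumptions, the root-mean-square prediction of predictors trained on zero-mean noise tracks the LinUCB width `1 / sqrt(count + lam)`.

```python
import math
import random


def linucb_width(count: int, lam: float = 1.0) -> float:
    """LinUCB confidence width for a one-hot feature seen `count` times."""
    return 1.0 / math.sqrt(count + lam)


def rnd_bonus(count: int, n_predictors: int = 200, lam: float = 1.0, seed: int = 1) -> float:
    """Randomized estimate of the same width from zero-mean auxiliary targets.

    Each auxiliary predictor is a ridge regressor on a one-hot feature, so its
    prediction is (sum of noisy targets + prior noise) / (count + lam).
    The expected target is zero, so the prediction magnitude only reflects
    how little data the predictor has seen.
    """
    rng = random.Random(seed)
    squared = 0.0
    for _ in range(n_predictors):
        # Zero-mean targets observed for this feature.
        noise = sum(rng.gauss(0.0, 1.0) for _ in range(count))
        # Random prior term so that unseen features still get a nonzero bonus.
        noise += math.sqrt(lam) * rng.gauss(0.0, 1.0)
        prediction = noise / (count + lam)
        squared += prediction ** 2
    # Root mean square over the ensemble of predictors.
    return math.sqrt(squared / n_predictors)


rare = rnd_bonus(count=1)
common = rnd_bonus(count=100)
print(f"rare feature bonus:   {rare:.3f} (LinUCB width {linucb_width(1):.3f})")
print(f"common feature bonus: {common:.3f} (LinUCB width {linucb_width(100):.3f})")
```

The sketch uses 200 predictors so that the estimate is stable. The learner documented here defaults to `rnd=3`, which gives a noisier estimate at a lower cost.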

Constructors

__init__(rnd: int = 3, features: Sequence[str] = (1, 'a', 'ax', 'axx'), epsilon: float | None = 0.025, rnd_alpha: float | None = None, rnd_invlambda: float | None = None, seed: int | None = 1, **kwargs) -> None

Instantiate a VowpalRndLearner.

Parameters:
  • rnd – Number of predictors

  • features – A list of namespaces and interactions to use when learning reward functions

  • epsilon – Uniform exploration term for stabilization

  • rnd_alpha – Increase for more exploration on a repeated example

  • rnd_invlambda – Increase for more exploration on examples with new features/actions

  • seed – The seed used by VW to generate any necessary random numbers

  • kwargs – Additional keyword arguments are passed on as VW CLI arguments (unless removed in the function).

Methods

finish()

Finish all pending work (e.g., write buffers to disk).

learn(context: Context, action: Action, reward: float, probability: float, actions: Actions | None = None) -> None

Learn about the action taken in the context.

Parameters:
  • context – The context in which the action was taken.

  • action – The action that was taken.

  • reward – The reward for the given context and action (feedback for IGL problems).

  • probability – The probability the given action was taken.

  • **kwargs – Optional information returned during prediction.

predict(context: Context, actions: Actions) -> Tuple[Action, Prob, Kwargs]

Predict which action to take in the context.

Parameters:
  • context – The current context. It will either be None (multi-armed bandit), a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).

  • actions – The current set of actions to choose from in the given context. Each action will either be a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).

Returns:

A Prediction. Several prediction formats are supported; see the return type hint for the accepted forms.

score(context: Context, actions: Actions, action: Action) -> Prob

Compute the propensity score of an action.

Parameters:
  • context – The current context.

  • actions – The current set of actions that can be chosen.

  • action – The action to propensity score.

Returns:

The propensity score of the given action. That is, P(action|context,actions).
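How `predict`, `learn`, and `score` fit together can be sketched with a stand-in class that mirrors the three method signatures listed above. This is a hypothetical illustration, not `VowpalRndLearner` itself: `EpsilonGreedyStandIn` uses plain epsilon-greedy over running reward means instead of VW, it ignores the context, it assumes hashable actions, and its `learn` accepts `**kwargs` as an assumption about how prediction information is passed back.

```python
import random
from typing import Any, Dict, Hashable, Sequence, Tuple


class EpsilonGreedyStandIn:
    """Hypothetical stand-in with predict/learn/score shaped like the methods above."""

    def __init__(self, epsilon: float = 0.025, seed: int = 1) -> None:
        self._epsilon = epsilon
        self._rng = random.Random(seed)
        self._sums: Dict[Hashable, float] = {}
        self._counts: Dict[Hashable, int] = {}

    def _mean(self, action: Hashable) -> float:
        # Running mean reward of an action (0.0 if it was never taken).
        n = self._counts.get(action, 0)
        return self._sums.get(action, 0.0) / n if n else 0.0

    def score(self, context: Any, actions: Sequence[Hashable], action: Hashable) -> float:
        # P(action | context, actions): uniform mass plus the greedy mass.
        greedy = max(actions, key=self._mean)
        uniform = self._epsilon / len(actions)
        return uniform + (1.0 - self._epsilon if action == greedy else 0.0)

    def predict(self, context: Any, actions: Sequence[Hashable]) -> Tuple[Hashable, float, Dict[str, Any]]:
        # Sample an action from the same distribution that score() reports.
        probs = [self.score(context, actions, a) for a in actions]
        action = self._rng.choices(actions, weights=probs, k=1)[0]
        return action, self.score(context, actions, action), {}

    def learn(self, context: Any, action: Hashable, reward: float, probability: float, **kwargs: Any) -> None:
        # Update the running reward statistics for the action that was taken.
        self._sums[action] = self._sums.get(action, 0.0) + reward
        self._counts[action] = self._counts.get(action, 0) + 1


learner = EpsilonGreedyStandIn()
actions = ["a", "b", "c"]
learner.learn(None, "b", 1.0, 1 / 3)
action, prob, info = learner.predict(None, actions)
total = sum(learner.score(None, actions, a) for a in actions)
print(action, prob, total)
```

The point of the sketch is the contract between the methods: the probability returned by `predict` for the chosen action matches what `score` reports for it, and the scores over all available actions sum to one.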

Attributes

params