LinTSLearner

class coba.learners.LinTSLearner

A contextual bandit learner using Thompson Sampling for exploration.

This is an implementation of the Agrawal and Goyal (2013) Thompson Sampling algorithm. The Sherman-Morrison formula is used to update the inverse matrix iteratively rather than recomputing it after every observation. Expected reward is modeled as a linear function of context and action features.
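For reference, the quantities mentioned above can be written out as follows. This is a sketch in the notation of the cited paper, not a description of the library's internals: the symbols B, f and φ are assumptions introduced here, with φ(x, a) standing for the feature vector built from context x and action a, and v for the exploration parameter.

```latex
\begin{align*}
% Linear reward model
\mathbb{E}[r \mid x, a] &= \phi(x, a)^{\top} \theta \\
% Thompson Sampling step: draw weights from the posterior, then act greedily on the draw
\tilde{\theta} &\sim \mathcal{N}\!\left(\hat{\theta},\; v^{2} B^{-1}\right),
  \qquad \hat{\theta} = B^{-1} f \\
% Update after observing reward r for features \phi (B starts as the identity, f as zero)
B &\leftarrow B + \phi \phi^{\top}, \qquad f \leftarrow f + r\,\phi \\
% Sherman-Morrison: rank-one update of the inverse without a full inversion
\left(B + \phi \phi^{\top}\right)^{-1}
  &= B^{-1} - \frac{B^{-1} \phi\, \phi^{\top} B^{-1}}{1 + \phi^{\top} B^{-1} \phi}
\end{align*}
```

The last identity lets each update run in time quadratic in the number of features, since only matrix-vector products are needed.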

Remarks:

A small note on the stability of the Sherman-Morrison formula can be found here.

References

Agrawal, Shipra, and Navin Goyal. “Thompson sampling for contextual bandits with linear payoffs.” International conference on machine learning. PMLR, 2013.

Constructors

__init__(v: float = 1, features: Sequence[str] = [1, 'a', 'ax'], seed: int = 1) None

Instantiate a LinTSLearner.

Parameters:
  • v – Modify the exploration rate of the algorithm. A value of 0 will not explore, while a value of inf will explore uniformly forever. The appropriate setting of v depends to some degree on the scale of the given feature vectors and rewards.

  • features – Feature set interactions to use when calculating action value estimates. Context features are indicated by x’s and action features by a’s. For example, ax crosses the action features with the context features, and xaa crosses the context features with the action features twice.

  • seed – A seed for random number generation.
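To make the interaction notation concrete, the sketch below expands the default feature set [1, 'a', 'ax'] for one context and one action. The helpers `interact` and `featurize` are hypothetical names and not part of coba, and the full outer product used here is an assumption; the library's own encoder may order or deduplicate terms differently.

```python
from itertools import product
from math import prod


def interact(term, x, a):
    """Hypothetical helper: expand one interaction term into a flat list.

    `term` is either the constant 1 or a string over {'x', 'a'}.
    """
    if term == 1:
        return [1.0]  # intercept feature
    namespaces = {"x": x, "a": a}
    factors = [namespaces[ch] for ch in term]
    # Multiply one value from each named factor, for every combination.
    return [float(prod(combo)) for combo in product(*factors)]


def featurize(features, x, a):
    """Concatenate the expansion of every term in `features`."""
    out = []
    for term in features:
        out.extend(interact(term, x, a))
    return out


x = [2.0, 3.0]  # context features
a = [1.0, 0.0]  # action features (one-hot here)

phi = featurize([1, "a", "ax"], x, a)
print(phi)  # [1.0, 1.0, 0.0, 2.0, 3.0, 0.0, 0.0]
```

Under this assumption a term such as xaa grows multiplicatively (2 × 2 × 2 = 8 values for the inputs above), so higher-order interactions quickly enlarge the matrix that the learner has to maintain.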

Methods

learn(context: Context, action: Action, reward: float, probability: float) None

Learn about the action taken in the context.

Parameters:
  • context – The context in which the action was taken.

  • action – The action that was taken.

  • reward – The reward for the given context and action (feedback for IGL problems).

  • probability – The probability the given action was taken.

  • **kwargs – Optional information returned during prediction.

predict(context: Context, actions: Actions) Tuple[Action, Prob]

Predict which action to take in the context.

Parameters:
  • context – The current context. It will either be None (multi-armed bandit), a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).

  • actions – The current set of actions to choose from in the given context. Each action will either be a value (a single feature), a sequence of values (dense features), or a dictionary (sparse features).

Returns:

A Prediction. Several prediction formats are supported. See the type-hint for these.
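The interplay between predict and learn can be illustrated with a small, self-contained sketch of linear Thompson Sampling. It is not the library's implementation: the class `ToyLinTS` and its methods are hypothetical, it takes precomputed feature vectors instead of a context and actions, and its predict returns only the index of the chosen action rather than an action and probability pair.

```python
import math
import random


class ToyLinTS:
    """Illustrative linear Thompson Sampling (hypothetical, not coba's code)."""

    def __init__(self, dim, v=1.0, seed=1):
        self.v = v
        self.rng = random.Random(seed)
        # Inverse of B = I + sum(phi phi^T); starts as the identity.
        self.B_inv = [[float(i == j) for j in range(dim)] for i in range(dim)]
        self.f = [0.0] * dim  # running sum of reward * phi

    @staticmethod
    def _matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    @staticmethod
    def _cholesky(M):
        # Lower-triangular L with L L^T = M (M must be positive definite).
        n = len(M)
        L = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1):
                s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
                L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
        return L

    def posterior_mean(self):
        return self._matvec(self.B_inv, self.f)

    def predict(self, action_features):
        # Draw weights from N(mean, v^2 * B_inv), then pick the best action.
        mean = self.posterior_mean()
        z = [self.rng.gauss(0.0, 1.0) for _ in mean]
        noise = self._matvec(self._cholesky(self.B_inv), z)
        w = [m + self.v * n for m, n in zip(mean, noise)]
        scores = [sum(wi * pi for wi, pi in zip(w, phi)) for phi in action_features]
        return max(range(len(scores)), key=scores.__getitem__)

    def learn(self, phi, reward):
        # Sherman-Morrison rank-one update of B_inv (B_inv is symmetric).
        Bp = self._matvec(self.B_inv, phi)
        denom = 1.0 + sum(p * b for p, b in zip(phi, Bp))
        n = len(phi)
        self.B_inv = [[self.B_inv[i][j] - Bp[i] * Bp[j] / denom for j in range(n)]
                      for i in range(n)]
        self.f = [fi + reward * p for fi, p in zip(self.f, phi)]


# Two actions with one-hot features; only the second action pays off.
actions = [[1.0, 0.0], [0.0, 1.0]]
learner = ToyLinTS(dim=2, v=0.5, seed=1)
for _ in range(200):
    i = learner.predict(actions)
    learner.learn(actions[i], reward=float(i == 1))

mean = learner.posterior_mean()
print(mean[1] > mean[0])
```

In this sketch, setting v to 0 removes the sampling noise entirely, so the learner acts greedily on its current estimate and, in the example above, never tries the second action.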

score(context: Context, actions: Actions, action: Action) Prob

Propensity score an action.

Parameters:
  • context – The current context.

  • actions – The current set of actions that can be chosen.

  • action – The action to propensity score.

Returns:

The propensity score of the given action. That is, P(action|context,actions).

Attributes

params