RejectionCB

class coba.evaluators.RejectionCB

Rejection-sampling evaluation for contextual bandit (CB) learners.

This evaluator uses rejection sampling to simulate on-policy learner performance using only off-policy logged data. For this to work, each interaction must have ‘actions’, ‘action’, ‘reward’, and ‘probability’.

This gives an unbiased estimate of on-policy performance, assuming two conditions hold:

  1. The reward distribution of each interaction is stationary.

  2. The cpct parameter of the evaluator is set to 0.
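As a minimal sketch of the general rejection-sampling idea (not coba's actual implementation), a logged interaction can be kept with probability c*on_prob/log_prob, where on_prob is the probability the learner gives to the logged action and log_prob is the logged ‘probability’. The policies, the helper names accept and simulate, and the choice of c below are all assumptions made for illustration.

```python
import random

def accept(on_prob, log_prob, c, rng):
    """Keep a logged interaction with probability min(1, c * on_prob / log_prob)."""
    return rng.random() < min(1.0, c * on_prob / log_prob)

def simulate(n=100_000, seed=1):
    rng = random.Random(seed)
    log_policy = [0.5, 0.5]  # hypothetical logging policy over two actions
    on_policy = [0.8, 0.2]   # hypothetical learner policy over the same actions
    # Largest c with c * on_prob / log_prob <= 1 for every action.
    c = 1.0 / max(p / q for p, q in zip(on_policy, log_policy))
    kept = [0, 0]
    for _ in range(n):
        # Draw the logged action from the logging policy.
        action = 0 if rng.random() < log_policy[0] else 1
        if accept(on_policy[action], log_policy[action], c, rng):
            kept[action] += 1
    # Share of kept interactions whose action is 0.
    return c, kept[0] / sum(kept)

c, share_of_action_0 = simulate()
```

With these assumed policies, the kept interactions should contain action 0 at a rate close to the learner's 0.8 rather than the logging policy's 0.5, which is why the kept data can stand in for on-policy data.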

Remarks:

This is an implementation of Dudík et al. (2012). The cpct parameter of our implementation is what Dudík et al. call q, and cinit is what they call c1. To use doubly robust off-policy estimation, as Dudík et al. do, also set ope to ‘dr’.

References

  • Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. 2012. Sample-efficient nonstationary policy evaluation for contextual bandits. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI’12). AUAI Press, Arlington, Virginia, USA, 247-254.

Constructors

__init__(record: Sequence[Literal['context', 'actions', 'action', 'reward', 'probability', 'time']] = ['reward'], ope: Literal['ips', 'dr', 'dm'] | None = None, cpct: float = 0.005, cmax: float = 1.0, cinit: float | None = None, seed: float | None = None) None

Instantiate a RejectionCB evaluator.

Parameters:
  • record – The datapoints to record for each interaction.

  • ope – Indicates whether off-policy estimates should be included from rejected training examples.

  • cpct – Corresponds to q in Dudík et al. (2012). The unbiased case is cpct = 0. Smaller values give better estimates but reject more data.

  • cmax – The maximum value that the evaluator is allowed to use for c (the rejection sampling multiplier). To get an unbiased estimate we need a value of c such that c*on_prob/log_prob <= 1 for every on_prob/log_prob ratio. The value cmax caps how large c can be so that the resulting estimate stays unbiased. In practice, it is often better to leave this value unchanged and instead change cpct to control the bias of the estimate.

  • cinit – The initial value to use for c (the rejection sampling multiplier). If left as None, a very conservative, data-adaptive estimate is used to initialize c. Without prior knowledge of the data, leaving this as None is likely the best course of action.

  • seed – Provide an explicit seed to use during evaluation. If not provided a default is used.
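To make the relationship between c, cmax, and cpct concrete, the sketch below picks c from a batch of on_prob/log_prob ratios. It is an illustration only: the helper pick_c is hypothetical, and treating cpct as the fraction of ratios allowed to violate c*on_prob/log_prob <= 1 is an assumption about the role of q, not a description of how coba computes c.

```python
import math

def pick_c(ratios, cmax=1.0, cpct=0.0):
    """Hypothetical helper: largest c <= cmax such that at most a cpct
    fraction of the given ratios satisfy c * ratio > 1."""
    ordered = sorted(ratios)
    # Index of the (1 - cpct) quantile; cpct = 0 selects the maximum ratio.
    k = max(0, math.ceil((1.0 - cpct) * len(ordered)) - 1)
    bound = ordered[k]
    if bound <= 0:
        return cmax
    return min(cmax, 1.0 / bound)

ratios = [0.5, 1.0, 1.6, 4.0]            # assumed on_prob/log_prob values
c_unbiased = pick_c(ratios)              # no ratio may violate the constraint
c_relaxed = pick_c(ratios, cpct=0.25)    # one of the four ratios may violate it
c_capped = pick_c(ratios, cmax=0.5, cpct=0.25)
```

Under this assumption, a larger cpct yields a larger c, so fewer interactions are rejected, at the cost of some ratios exceeding the bound needed for an unbiased estimate.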

Methods

evaluate(environment: Environment | None, learner: Learner | None) Iterable[Mapping[Any, Any]]

Evaluate the learner on the given interactions.

Parameters:
  • environment – The Environment we want to evaluate against.

  • learner – The Learner that we wish to evaluate.

Returns:

Evaluation results

Attributes

params