Sampling

Module in charge of functionalities related to the sampling of alternatives

biogeme.sampling module

author:

Michel Bierlaire

date:

Wed Sep 7 15:54:55 2022

class biogeme.sampling.StratumTuple(subset, sample_size)

Bases: NamedTuple

sample_size: int

Alias for field number 1

subset: Set[int]

Alias for field number 0
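As an illustration of how a partition is expressed with these two fields, here is a minimal sketch. To stay self-contained it defines a stand-in NamedTuple with the same documented fields instead of importing biogeme.sampling.StratumTuple; the alternative IDs and sample sizes are hypothetical.

```python
from typing import NamedTuple, Set


class StratumTuple(NamedTuple):
    """Stand-in mirroring the documented fields of biogeme.sampling.StratumTuple."""

    subset: Set[int]
    sample_size: int


# Hypothetical partition of six alternatives into two strata.
partition = (
    StratumTuple(subset={1, 2, 3, 4}, sample_size=2),
    StratumTuple(subset={5, 6}, sample_size=1),
)

first = partition[0]
# Field 0 is `subset` and field 1 is `sample_size`, as documented above.
total_sample_size = sum(stratum.sample_size for stratum in partition)
```

With the real class, the same tuple of strata would be passed as the partition argument of the functions below.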

biogeme.sampling.mev_cnl_sampling(V, availability, sampling_log_probability, nests)

Generate the expression of the CNL G_i function in the context of sampling of alternatives.

It is assumed that the following variables are available in the data: for each nest m and each alternative i, a variable m_i that is the level of membership of alternative i in nest m.

Parameters:
  • V (dict(int:biogeme.expressions.expr.Expression)) – dict of objects representing the utility functions of each alternative, indexed by numerical ids.

  • availability (dict(int:biogeme.expressions.expr.Expression)) – dict of objects representing the availability of each alternative, indexed by numerical ids. Must be consistent with V, or None. If None, all alternatives are assumed to be always available.

  • sampling_log_probability (dict(int: biogeme.expressions.Expression)) – if not None, the choice set is a subset that has been sampled from the full choice set. In that case, this is a dictionary mapping each alternative to the logarithm of its probability of being selected in the sample.

  • nests (dict(str: biogeme.expressions.Beta)) – a dictionary where the keys are the names of the nests, and the values are the nest parameters.

biogeme.sampling.sample_alternatives(alternatives, id_column, partition, chosen=None)

Perform the sampling of alternatives.

Parameters:
  • alternatives (pandas.DataFrame) – Pandas data frame containing all the alternatives as rows. One column must contain a unique ID identifying the alternatives. The other columns contain variables to include in the data file.

  • id_column (str) – name of the column where the IDs of the alternatives are stored.

  • partition (tuple(StratumTuple)) – each StratumTuple contains a set of IDs characterizing the subset, and the sample size, that is the number of alternatives to randomly draw from the subset.

  • chosen (int) – ID of the chosen alternative, that must be included in the choice set. If None, no alternative is added deterministically to the choice set.

Raises:
  • BiogemeError – if one alternative belongs to several subsets of the partition.

  • BiogemeError – if a set in the partition is empty.

  • BiogemeError – if the chosen alternative is unknown.

  • BiogemeError – if the requested sample size for a stratum is larger than the size of the stratum.

  • BiogemeError – if some alternatives do not appear in the partition.
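The checks and the stratified draw described above can be sketched in plain pandas. This is not Biogeme's implementation: the function name sketch_sample_alternatives, the seed argument, the use of ValueError in place of BiogemeError, and the rule that the chosen alternative counts toward the sample size of its own stratum are all assumptions made for illustration.

```python
import random
from typing import NamedTuple, Set

import pandas as pd


class StratumTuple(NamedTuple):
    """Stand-in for biogeme.sampling.StratumTuple (same documented fields)."""

    subset: Set[int]
    sample_size: int


def sketch_sample_alternatives(alternatives, id_column, partition, chosen=None, seed=None):
    """Illustrative stratified draw; not the library's code."""
    rng = random.Random(seed)
    all_ids = set(alternatives[id_column])
    seen = set()
    for stratum in partition:
        if not stratum.subset:
            raise ValueError('A set in the partition is empty.')
        if seen & stratum.subset:
            raise ValueError('An alternative belongs to several subsets.')
        if stratum.sample_size > len(stratum.subset):
            raise ValueError('Sample size is larger than the stratum.')
        seen |= stratum.subset
    if all_ids - seen:
        raise ValueError('Some alternatives do not appear in the partition.')
    if chosen is not None and chosen not in all_ids:
        raise ValueError('The chosen alternative is unknown.')

    # Assumption: the chosen alternative comes first and counts toward
    # the sample size of the stratum it belongs to.
    drawn = [] if chosen is None else [chosen]
    for stratum in partition:
        candidates = sorted(stratum.subset - {chosen})
        size = max(stratum.sample_size - (1 if chosen in stratum.subset else 0), 0)
        drawn.extend(rng.sample(candidates, size))

    # Return the sampled rows in the order in which they were drawn.
    return alternatives.set_index(id_column).loc[drawn].reset_index()


# Hypothetical data: six alternatives described by a single variable.
alts = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6], 'cost': [10, 20, 30, 40, 50, 60]})
partition = (StratumTuple({1, 2, 3, 4}, 2), StratumTuple({5, 6}, 1))
sample = sketch_sample_alternatives(alts, 'ID', partition, chosen=5, seed=0)
```

Here the sample holds three rows: the chosen alternative 5 first, followed by two alternatives drawn from the first stratum.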

biogeme.sampling.sampling_of_alternatives(partition, individuals, choice_column, alternatives, id_column, always_include_chosen=True)

Generation of databases with samples of alternatives

Parameters:
  • partition (tuple(StratumTuple)) – each StratumTuple contains a set of IDs characterizing the subset, and the sample size, that is the number of alternatives to randomly draw from the subset.

  • individuals (pandas.DataFrame) – Pandas data frame containing all the individuals as rows. One column must contain the choice of each individual.

  • choice_column (str) – name of the column containing the choice of each individual.

  • alternatives (pandas.DataFrame) – Pandas data frame containing all the alternatives as rows. One column must contain a unique ID identifying the alternatives. The other columns contain variables to include in the data file.

  • id_column (str) – name of the column containing the IDs of the alternatives.

  • always_include_chosen (bool) – if True, the chosen alternative is always included in the choice set with label 0.

Returns:

data frame containing the data ready for Biogeme.

Return type:

pandas.DataFrame
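To make the idea of one row per individual with a sampled choice set concrete, here is a rough sketch in plain pandas. It is not Biogeme's implementation, and the layout of the returned data frame is not specified above: the function name, the seed argument, the `<variable>_<label>` column naming, integer IDs, and the rule that the chosen alternative counts toward its stratum's sample size are all assumptions. The partition is given as plain (subset, sample_size) pairs to keep the example short.

```python
import random

import pandas as pd


def sketch_sampling_of_alternatives(
    partition, individuals, choice_column, alternatives, id_column, seed=None
):
    """Illustrative only: one row per individual, chosen alternative at label 0."""
    rng = random.Random(seed)
    by_id = alternatives.set_index(id_column)
    rows = []
    for _, person in individuals.iterrows():
        chosen = int(person[choice_column])  # assumption: integer IDs
        ids = [chosen]  # label 0 is the chosen alternative
        for subset, sample_size in partition:
            candidates = sorted(set(subset) - {chosen})
            size = max(sample_size - (1 if chosen in subset else 0), 0)
            ids.extend(rng.sample(candidates, size))
        row = person.to_dict()
        for label, alt_id in enumerate(ids):
            # Hypothetical naming scheme: <variable>_<label>.
            row[f'{id_column}_{label}'] = alt_id
            for name, value in by_id.loc[alt_id].items():
                row[f'{name}_{label}'] = value
        rows.append(row)
    return pd.DataFrame(rows)


# Hypothetical data: two individuals choosing among six alternatives.
individuals = pd.DataFrame({'person': [1, 2], 'choice': [2, 6]})
alts = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6], 'cost': [10, 20, 30, 40, 50, 60]})
partition = (({1, 2, 3, 4}, 2), ({5, 6}, 1))
result = sketch_sampling_of_alternatives(
    partition, individuals, 'choice', alts, 'ID', seed=0
)
```

Each row of result keeps the individual's own columns and adds the ID and variables of three sampled alternatives, with the chosen one under label 0.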