Biogeme

The core routines of Biogeme.

biogeme.biogeme module

Implementation of the main Biogeme class

author: Michel Bierlaire
date: Tue Mar 26 16:45:15 2019

It combines the database and the model specification.

class biogeme.biogeme.BIOGEME(database, formulas, userNotes=None, parameter_file=None, skip_audit=False, **kwargs)[source]

Bases: object

Main class that combines the database and the model specification.

It works in two modes: estimation and simulation.

__init__(database, formulas, userNotes=None, parameter_file=None, skip_audit=False, **kwargs)[source]

Constructor

Parameters:

database (biogeme.database.Database) – choice data.
formulas (biogeme.expressions.Expression, or dict(biogeme.expressions.Expression)) – expression or dictionary of expressions that define the model specification. Each expression is applied to each entry of the database. The keys of the dictionary provide a name for each formula. In the estimation mode, two formulas are used, with the keys ‘loglike’ and ‘weight’. If only one formula is provided, it is associated with the label ‘loglike’. If no formula is labeled ‘weight’, the weight of each observation is assumed to be 1.0. In the simulation mode, the label of each formula is used as the label of the corresponding column in the resulting database.
userNotes (str) – these notes will be included in the report file.
parameter_file (str) – name of the .toml file from which the parameters are read.

Raises:

BiogemeError – an audit of the formulas is performed. If a formula has issues, an error is detected and an exception is raised.
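
The labeling rules described for the formulas argument in the estimation mode can be sketched in plain Python. This is only an illustration: label_formulas is a hypothetical helper, not part of Biogeme, and plain strings stand in for biogeme.expressions.Expression objects.

```python
def label_formulas(formulas):
    """Hypothetical helper mimicking the labeling rules of the estimation mode."""
    # A single formula is associated with the label 'loglike'.
    if not isinstance(formulas, dict):
        formulas = {"loglike": formulas}
    labeled = dict(formulas)
    # Without a 'weight' formula, each observation is given the weight 1.0.
    labeled.setdefault("weight", 1.0)
    return labeled


# Strings are placeholders for Biogeme expressions.
single = label_formulas("logprob")
both = label_formulas({"loglike": "logprob", "weight": "sample_weight"})
```

In this sketch a single expression and a dictionary with only the ‘loglike’ key lead to the same labeled specification.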

property algorithm_name: Name of the optimization algorithm

argument_warning()[source]: Displays a deprecation warning when parameters are provided as arguments.

bestIteration: Store the best iteration found so far.

beta_values_dict_to_list(beta_dict=None)[source]

Transforms a dict with the names of the betas associated with their values, into a list consistent with the numbering of the ids.

Parameters:

beta_dict (dict(str: float)) – dict with the values of the parameters

Raises:

BiogemeError – if the parameter is not a dict
BiogemeError – if a parameter is missing in the dict
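
The transformation can be pictured with the following minimal sketch. It assumes that the numbering of the ids is given by an ordered list of parameter names. The function dict_to_list and its free_beta_names argument are illustrative and are not part of Biogeme, and built-in exceptions are used in place of BiogemeError.

```python
def dict_to_list(beta_dict, free_beta_names):
    """Order the values of beta_dict according to free_beta_names.

    Illustrative only: Biogeme raises BiogemeError in both error cases.
    """
    if not isinstance(beta_dict, dict):
        raise TypeError("a dict is expected")
    missing = [name for name in free_beta_names if name not in beta_dict]
    if missing:
        raise KeyError(f"missing parameters: {missing}")
    return [beta_dict[name] for name in free_beta_names]


# Hypothetical parameter names, listed in the order of their ids.
names = ["asc_car", "b_cost", "b_time"]
values = dict_to_list({"b_time": -1.2, "asc_car": 0.3, "b_cost": -0.8}, names)
```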

bootstrap_results: Results of the bootstrap calculation.

bootstrap_time: Time needed to calculate the bootstrap standard errors

calculateInitLikelihood()[source]

Calculate the value of the log likelihood function

The default values of the parameters are used.

Returns: value of the log likelihood.
Return type: float.

calculateLikelihood(x, scaled, batch=None)[source]

Calculates the value of the log likelihood function

Parameters:

x (list(float)) – vector of values for the parameters.
scaled (bool) – if True, the value is divided by the number of observations used to calculate it. In this case, the values with different sample sizes are comparable.
batch (float) – if not None, calculates the likelihood on a random sample of the data. The value of the parameter must be strictly between 0 and 1, and represents the share of the data that will be used. Default: None

Returns:

the calculated value of the log likelihood

Return type:

float.

Raises:

ValueError – if the length of the list x is incorrect.
BiogemeError – if calculation with batch is requested
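
The role of the scaled argument can be illustrated with a short sketch that does not use Biogeme. It assumes that the log likelihood is the sum of the logarithms of the probabilities of the observations. The function log_likelihood and the sample values are hypothetical.

```python
import math


def log_likelihood(probabilities, scaled):
    """Sum of log probabilities, divided by the sample size if scaled is True."""
    total = sum(math.log(p) for p in probabilities)
    return total / len(probabilities) if scaled else total


small = [0.5, 0.25]  # probabilities of the observed choices, 2 observations
large = small * 10   # same pattern, 20 observations

scaled_small = log_likelihood(small, scaled=True)
scaled_large = log_likelihood(large, scaled=True)
unscaled_small = log_likelihood(small, scaled=False)
unscaled_large = log_likelihood(large, scaled=False)
```

The scaled values of the two samples coincide, while the unscaled value grows with the number of observations.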

calculateLikelihoodAndDerivatives(x, scaled, hessian=False, bhhh=False, batch=None)[source]

Calculate the value of the log likelihood function and its derivatives.

Parameters:

x (list(float)) – vector of values for the parameters.
scaled (bool) – if True, the results are divided by the number of observations.
hessian (bool) – if True, the hessian is calculated. Default: False.
bhhh (bool) – if True, the BHHH matrix is calculated. Default: False.
batch (float) – if not None, calculates the likelihood on a random sample of the data. The value of the parameter must be strictly between 0 and 1, and represents the share of the data that will be used. Default: None

Returns:

f, g, h, bh where

f is the value of the function (float)
g is the gradient (numpy.array)
h is the hessian (numpy.array)
bh is the BHHH matrix (numpy.array)

Return type:

tuple(float, numpy.array, numpy.array, numpy.array)

Raises:

ValueError – if the length of the list x is incorrect
BiogemeError – if the norm of the gradient is not finite, an error is raised.
BiogemeError – if calculation with batch is requested
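
As a reminder of what the BHHH matrix contains, the sketch below builds it as the sum of the outer products of the gradients of the individual observations, which is the usual definition. It is written with plain lists, the function bhhh_matrix and the numbers are hypothetical, and it says nothing about how Biogeme performs the calculation internally.

```python
def bhhh_matrix(scores):
    """Sum of the outer products of the per-observation gradients."""
    n = len(scores[0])
    bh = [[0.0] * n for _ in range(n)]
    for s in scores:
        for i in range(n):
            for j in range(n):
                bh[i][j] += s[i] * s[j]
    return bh


# One gradient per observation, two parameters.
scores = [[1.0, 2.0], [0.5, -1.0]]
bh = bhhh_matrix(scores)
```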

calculateNullLoglikelihood(avail)[source]

Calculate the log likelihood of the null model that predicts equal probability for each alternative

Parameters: avail (list of biogeme.expressions.Expression) – list of expressions to evaluate the availability conditions for each alternative. If None, all alternatives are always available.
Returns: value of the log likelihood
Return type: float
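
A minimal sketch of this quantity, assuming that the availability conditions have already been evaluated into 0/1 flags: each observation contributes the logarithm of one over the number of available alternatives. The function null_log_likelihood and the data are hypothetical.

```python
import math


def null_log_likelihood(availability):
    """availability: one list of 0/1 flags per observation."""
    # Equal probability 1/J for each of the J available alternatives.
    return sum(-math.log(sum(row)) for row in availability)


avail = [
    [1, 1, 1],  # three alternatives available
    [1, 1, 0],  # two alternatives available
    [1, 1, 1],
]
null_ll = null_log_likelihood(avail)
```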

changeInitValues(betas)[source]

Modifies the initial values of the parameters in all formulas.

Parameters: betas (dict(string:float)) – dictionary where the keys are the names of the parameters, and the values are the new values for the parameters.

checkDerivatives(beta, verbose=False)[source]

Verifies the implementation of the derivatives.

It compares the analytical version with the finite differences approximation.

Parameters:

beta (list(float)) – vector of values for the parameters.
verbose (bool) – if True, the comparisons are reported. Default: False.

Return type:

tuple.

Returns:

f, g, h, gdiff, hdiff where

f is the value of the function,
g is the analytical gradient,
h is the analytical hessian,
gdiff is the difference between the analytical and the finite differences gradient,
hdiff is the difference between the analytical and the finite differences hessian.
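
The kind of comparison reported in gdiff can be sketched on a toy function. The example uses forward finite differences and a hypothetical concave function f. The step size and the helper names are assumptions made for the illustration, not the ones used by Biogeme.

```python
def f(x):
    """Toy concave function standing in for a log likelihood."""
    return -((x[0] - 1.0) ** 2) - 2.0 * (x[1] + 0.5) ** 2


def analytical_gradient(x):
    return [-2.0 * (x[0] - 1.0), -4.0 * (x[1] + 0.5)]


def finite_difference_gradient(func, x, step=1e-6):
    """Forward finite differences approximation of the gradient."""
    base = func(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += step
        grad.append((func(shifted) - base) / step)
    return grad


x = [0.2, 0.3]
g = analytical_gradient(x)
# Difference between the analytical and the finite differences gradient
gdiff = [a - b for a, b in zip(g, finite_difference_gradient(f, x))]
```

Small entries in gdiff suggest that the analytical derivatives are consistent with the function, while large entries point to an implementation error.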

confidenceIntervals(betaValues, intervalSize=0.9)[source]

Calculate confidence intervals on the simulated quantities

Parameters:

betaValues (list(dict(str: float))) – array of parameters values to be used in the calculations. Typically, it is a sample drawn from a distribution.
intervalSize (float) – size of the reported confidence interval, expressed as a fraction between 0 and 1. If it is denoted by s, the interval is calculated for the quantiles (1-s)/2 and (1+s)/2. The default (0.9) corresponds to the quantiles 0.05 and 0.95.

Returns:

two pandas data frames ‘left’ and ‘right’ with the same dimensions. Each row corresponds to a row in the database, and each column to a formula. ‘left’ contains the left value of the confidence interval, and ‘right’ the right value

Example:

import biogeme.results as res

# Read the estimation results from a file
results = res.bioResults(pickleFile='myModel.pickle')

# Retrieve the names of the beta parameters that have been estimated
# (biogeme is an existing instance of the BIOGEME class)
betas = biogeme.freeBetaNames()

# Draw 100 realizations from the distribution of the estimators
b = results.getBetasForSensitivityAnalysis(betas, size=100)

# Simulate the formulas using the estimated values
# (assumes that bioResults provides getBetaValues)
betaValues = results.getBetaValues()
simulatedValues = biogeme.simulate(betaValues)

# Calculate the confidence intervals for each formula
left, right = biogeme.confidenceIntervals(b, 0.9)

Return type:

tuple of two Pandas dataframes.
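
The relation between intervalSize and the two quantiles can be checked with a few lines of plain Python. The empirical quantile below uses linear interpolation between order statistics, which is an assumption made for the illustration and not necessarily the rule applied by Biogeme. All names are hypothetical.

```python
def interval_quantiles(interval_size):
    """Quantile levels (1-s)/2 and (1+s)/2 for an interval of size s."""
    return (1.0 - interval_size) / 2.0, (1.0 + interval_size) / 2.0


def empirical_quantile(sample, q):
    """Quantile by linear interpolation between order statistics."""
    ordered = sorted(sample)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


low_q, high_q = interval_quantiles(0.9)

# Simulated values of one formula for one row: 0.0, 1.0, ..., 100.0
simulated = [float(v) for v in range(101)]
left_bound = empirical_quantile(simulated, low_q)
right_bound = empirical_quantile(simulated, high_q)
```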

database: biogeme.database.Database object

property dogleg: getter for the parameter

drawsProcessingTime: Time needed to generate the draws.

property enlarging_factor: getter for the parameter

estimate(recycle=False, bootstrap=0, **kwargs)[source]

Estimate the parameters of the model(s).

Parameters:

recycle (bool) – if True, the results are read from the pickle file, if it exists. If False, the estimation is performed.
bootstrap (int) – number of bootstrap resamples used to calculate the variance-covariance matrix. If the number is 0, bootstrapping is not applied. Default: 0.

Returns:

object containing the estimation results.

Return type:

biogeme.results.bioResults

Example:

import biogeme.biogeme as bio

# Create an instance of biogeme
biogeme = bio.BIOGEME(database, logprob)

# Give a name to the model
biogeme.modelName = 'mymodel'

# Estimate the parameters
results = biogeme.estimate()

Raises: BiogemeError – if no expression has been provided for the likelihood

estimate_catalog(selected_configurations=None, quick_estimate=False, recycle=False, bootstrap=0)[source]

Estimate all or selected versions of a model with Catalogs, corresponding to multiple specifications.

Parameters:

selected_configurations (set(biogeme.pareto.SetElement)) – set of configurations. If None, all configurations are considered.
quick_estimate (bool) – if True, the final statistics are not calculated.
recycle (bool) – if True, the results are read from the pickle file, if it exists. If False, the estimation is performed.
bootstrap (int) – number of bootstrap resamples used to calculate the variance-covariance matrix. If the number is 0, bootstrapping is not applied. Default: 0.

Returns:

object containing the estimation results associated with the name of each specification, as well as a description of each configuration

Return type:

dict(str: bioResults)

files_of_type(extension, all_files=False)[source]

Identify the list of files with a given extension in the local directory

Parameters:

extension (str) – extension of the requested files (without the dot): ‘pickle’, or ‘html’
all_files (bool) – if all_files is False, only files containing the name of the model are identified. If all_files is True, all files with the requested extension are identified.

Returns:

list of files with the requested extension.

Return type:

list(str)
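
The filtering rule controlled by all_files can be sketched with the standard library. The helper list_files_of_type is hypothetical: unlike the method documented above, it takes the directory and the model name as explicit arguments, and the matching rule on the model name is an assumption.

```python
import tempfile
from pathlib import Path


def list_files_of_type(directory, extension, model_name, all_files=False):
    """List the files with the given extension (without the dot)."""
    if all_files:
        pattern = f"*.{extension}"
    else:
        # Only files whose name contains the name of the model
        pattern = f"*{model_name}*.{extension}"
    return sorted(p.name for p in Path(directory).glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    for name in ("mymodel.html", "mymodel~00.html", "other.html", "mymodel.pickle"):
        (Path(tmp) / name).touch()
    model_only = list_files_of_type(tmp, "html", "mymodel")
    everything = list_files_of_type(tmp, "html", "mymodel", all_files=True)
```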

formulas: Dictionary containing Biogeme formulas of type biogeme.expressions.Expression. The keys are the names of the formulas.

freeBetaNames()[source]

Returns the names of the parameters that must be estimated

Returns: list of names of the parameters
Return type: list(str)

property generateHtml: Boolean variable, True if the HTML file with the results must be generated.

property generatePickle: Boolean variable, True if the PICKLE file with the results must be generated.

property generate_html: Boolean variable, True if the HTML file with the results must be generated.

property generate_pickle: Boolean variable, True if the PICKLE file with the results must be generated.

getBoundsOnBeta(betaName)[source]

Returns the bounds on the parameter as defined by the user.

Parameters: betaName (string) – name of the parameter
Returns: lower bound, upper bound
Return type: tuple
Raises: BiogemeError – if the name of the parameter is not found.

property identification_threshold: Threshold for the eigenvalue to trigger an identification warning

property infeasible_cg: getter for the parameter

initLogLike: Init value of the likelihood function

property initial_radius: getter for the parameter

lastSample: keeps track of the sample of data used to calculate the stochastic gradient / hessian

likelihoodFiniteDifferenceHessian(x)[source]

Calculate the hessian of the log likelihood function using finite differences.

May be useful when the analytical hessian has numerical issues.

Parameters: x (list(float)) – vector of values for the parameters.
Returns: finite differences approximation of the hessian.
Return type: numpy.array
Raises: ValueError – if the length of the list x is incorrect
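
One common way to obtain such an approximation is to apply forward finite differences to the gradient, column by column. The sketch below does this on a toy quadratic function. The helper names and the step size are assumptions, and the scheme is not claimed to be the one implemented in Biogeme.

```python
def gradient(x):
    """Gradient of the toy function f(x) = -x0^2 - x0*x1 - 2*x1^2."""
    return [-2.0 * x[0] - x[1], -x[0] - 4.0 * x[1]]


def finite_difference_hessian(grad, x, step=1e-6):
    """Approximate the hessian by differencing the gradient."""
    n = len(x)
    base = grad(x)
    hessian = [[0.0] * n for _ in range(n)]
    for j in range(n):
        shifted = list(x)
        shifted[j] += step
        moved = grad(shifted)
        for i in range(n):
            hessian[i][j] = (moved[i] - base[i]) / step
    return hessian


h = finite_difference_hessian(gradient, [0.5, -0.25])
```

For this quadratic function the exact hessian is constant, with rows (-2, -1) and (-1, -4), so the approximation can be compared with it directly.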

loglike: Object of type biogeme.expressions.Expression calculating the formula for the loglikelihood

loglikeName: Keyword used for the name of the loglikelihood formula. Default: ‘loglike’

loglikeSignatures: Internal signature of the formula for the loglikelihood.

property maximum_number_catalog_expressions: Maximum number of multiple expressions when Catalogs are used.

property maxiter: getter for the parameter

property missingData: Code for missing data

property missing_data: Code for missing data

modelName: Name of the model. Default: ‘biogemeModelDefaultName’

monteCarlo: monteCarlo is True if one of the expressions involves a Monte-Carlo integration.

nullLogLike: Log likelihood of the null model

property numberOfDraws: Number of draws for Monte-Carlo integration.

property numberOfThreads: Number of threads used for parallel computing. Default: the number of available CPUs.

property number_of_draws: Number of draws for Monte-Carlo integration.

property number_of_threads: Number of threads used for parallel computing. Default: the number of available CPUs.

property only_robust_stats: True if only the robust statistics need to be reported. If False, the statistics from the Rao-Cramer bound are also reported.

optimizationMessages: Information provided by the optimization algorithm after completion.

optimize(startingValues=None)[source]

Calls the optimization algorithm. The function self.algorithm is called.

Parameters:

startingValues (list(float)) – starting point for the algorithm

Returns:

x, messages

x is the solution generated by the algorithm,
messages is a dictionary reporting information about the algorithm

Return type:

numpy.array, dict(str:object)

Raises:

BiogemeError – an error is raised if no algorithm is specified.

quickEstimate(**kwargs)[source]

Estimate the parameters of the model. Same as estimate, where any extra calculation is skipped (init loglikelihood, t-statistics, etc.)

Returns: object containing the estimation results.
Return type: biogeme.results.bioResults

Example:

import biogeme.biogeme as bio

# Create an instance of biogeme
biogeme = bio.BIOGEME(database, logprob)

# Give a name to the model
biogeme.modelName = 'mymodel'

# Estimate the parameters
results = biogeme.quickEstimate()

Raises: BiogemeError – if no expression has been provided for the likelihood

reset_id_manager()[source]: Reset all the ids of the elementary expressions in the formulas

property saveIterations: If True, the current iterate is saved after each iteration, in a file named __[modelName].iter, where [modelName] is the name given to the model. If such a file exists, the starting values for the estimation are replaced by the values saved in the file.

property save_iterations: Same as saveIterations, with another syntax

property second_derivatives: getter for the parameter

property seed_param: getter for the parameter

setRandomInitValues(defaultBound=100.0)[source]

Modifies the initial values of the parameters in all formulas, using randomly generated values. The value is drawn from a uniform distribution on the interval defined by the bounds.

Parameters: defaultBound (float) – If the upper bound is missing, it is replaced by this value. If the lower bound is missing, it is replaced by the opposite of this value. Default: 100.
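
The rule for missing bounds can be sketched for a single parameter as follows. The function random_init_value is hypothetical, and None is assumed to represent a missing bound.

```python
import random


def random_init_value(lower, upper, default_bound=100.0, rng=random):
    """Draw a value uniformly on [lower, upper], filling in missing bounds."""
    low = -default_bound if lower is None else lower
    high = default_bound if upper is None else upper
    return rng.uniform(low, high)


rng = random.Random(42)  # fixed seed, so that the sketch is reproducible
bounded = random_init_value(0.0, 1.0, rng=rng)
unbounded = random_init_value(None, None, rng=rng)
half_bounded = random_init_value(None, 0.0, default_bound=10.0, rng=rng)
```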

short_names: biogeme.tools.ModelNames

simulate(theBetaValues)[source]

Applies the formulas to each row of the database.

Parameters: theBetaValues (dict(str, float)) – values of the parameters to be used in the calculations. The values must be provided explicitly: None is not accepted.
Returns: a pandas data frame with the simulated values. Each row corresponds to a row in the database, and each column to a formula.
Return type: Pandas data frame

Example:

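import biogeme.results as res
# Read the estimation results from a file
results = res.bioResults(pickleFile='myModel.pickle')
# Estimated parameter values (assumes bioResults provides getBetaValues)
betaValues = results.getBetaValues()
# Simulate the formulas using these values
simulatedValues = biogeme.simulate(betaValues)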

Raises:

BiogemeError – if the number of parameters is incorrect
BiogemeError – if theBetaValues is None.

property steptol: getter for the parameter

property tolerance: getter for the parameter

userNotes: User notes

validate(estimationResults, validationData)[source]

Perform out-of-sample validation.

The function performs the following tasks:

each slice defines a validation set (the slice itself) and an estimation set (the rest of the data),

the model is re-estimated on the estimation set,

the estimated model is applied on the validation set,

the value of the log likelihood for each observation is reported.

Parameters:

estimationResults (biogeme.results.bioResults) – results of the model estimation based on the full data.
validationData (list(tuple(pandas.DataFrame, pandas.DataFrame))) – list of estimation and validation data sets

Returns:

a list containing as many items as slices. Each item is the result of the simulation on the validation set.

Return type:

list(pandas.DataFrame)

Raises:

BiogemeError – An error is raised if the database is structured as panel data.
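
The structure expected for validationData, a list of (estimation set, validation set) pairs with one pair per slice, can be sketched with plain lists in place of pandas data frames. The function split_into_slices and its assignment of rows to slices are assumptions made for the illustration.

```python
def split_into_slices(rows, slices):
    """Return a list of (estimation, validation) pairs, one per slice."""
    pairs = []
    for k in range(slices):
        # The slice itself is the validation set
        validation = rows[k::slices]
        # The rest of the data is the estimation set
        estimation = [r for i, r in enumerate(rows) if i % slices != k]
        pairs.append((estimation, validation))
    return pairs


rows = list(range(10))  # stand-in for the rows of a database
pairs = split_into_slices(rows, 5)
```

Each row appears in exactly one validation set, and the estimation and validation sets of a pair never overlap.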

weight: Object of type biogeme.expressions.Expression calculating the weight of each observation in the sample.

weightName: Keyword used for the name of the weight formula. Default: ‘weight’

weightSignatures: Internal signature of the formula for the weight.

class biogeme.biogeme.OldNewParamTuple(old, new, section)

Bases: tuple

new: Alias for field number 1

old: Alias for field number 0

section: Alias for field number 2