Biogeme

The core routines of Biogeme.

biogeme.biogeme module

Implementation of the main Biogeme class that combines the database and the model specification.

author: Michel Bierlaire
date: Tue Mar 26 16:45:15 2019

class biogeme.biogeme.BIOGEME(database, formulas, userNotes=None, numberOfThreads=None, numberOfDraws=1000, seed=None, skipAudit=False, suggestScales=True, missingData=99999)[source]

Bases: object

Main class that combines the database and the model specification.

It works in two modes: estimation and simulation.

__init__(database, formulas, userNotes=None, numberOfThreads=None, numberOfDraws=1000, seed=None, skipAudit=False, suggestScales=True, missingData=99999)[source]

Constructor

Parameters

database (biogeme.database.Database) – choice data.
formulas (biogeme.expressions.Expression, or dict(biogeme.expressions.Expression)) – expression or dictionary of expressions that define the model specification. Each expression is applied to each entry of the database. The keys of the dictionary give a name to each formula. In the estimation mode, two formulas are needed, with the keys ‘loglike’ and ‘weight’. If only one formula is provided, it is associated with the label ‘loglike’. If no formula is labeled ‘weight’, the weight of each observation is assumed to be 1.0. In the simulation mode, the label of each formula is used as the label of the corresponding column of the resulting database.
userNotes (str) – these notes will be included in the report file.
numberOfThreads (int) – multi-threading can be used for estimation. This parameter defines the number of threads to be used. If the parameter is set to None, the number of available threads is calculated using cpu_count(). Ignored in simulation mode. Default: None.
numberOfDraws (int) – number of draws used for Monte-Carlo integration. Default: 1000.
seed (int) – seed used for the pseudo-random number generation. It is useful only when each run should generate the exact same result. If None, a new seed is used at each run. Default: None.
skipAudit (bool) – if True, does not check the validity of the formulas. It may save a significant amount of time for large models and large data sets. Default: False.
suggestScales (bool) – if True, Biogeme suggests the scaling of the variables in the database. Default: True. See also biogeme.database.Database.suggestScaling()
missingData (float) – if a variable has this value, the corresponding data item is assumed to be missing and an exception is triggered. Default: 99999.

Raises

biogemeError – an audit of the formulas is performed. If a formula has issues, an error is detected and an exception is raised.
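
The labeling rules for the formulas argument can be illustrated with a minimal sketch. It is not Biogeme's implementation: the helpers normalize_formulas and observation_weight are hypothetical, and plain Python callables stand in for Biogeme expressions.

```python
def normalize_formulas(formulas):
    """Return a dict of labeled formulas, following the rules described above."""
    if not isinstance(formulas, dict):
        # A single expression is associated with the label 'loglike'.
        formulas = {'loglike': formulas}
    return dict(formulas)


def observation_weight(formulas, row):
    """Weight of one database row; 1.0 when no 'weight' formula is given."""
    weight = formulas.get('weight')
    return 1.0 if weight is None else weight(row)


# Hypothetical stand-ins for Biogeme expressions: callables applied to a row.
single = normalize_formulas(lambda row: row['logprob'])
both = normalize_formulas({'loglike': lambda row: row['logprob'],
                           'weight': lambda row: row['w']})
row = {'logprob': -0.7, 'w': 2.0}
```

With these definitions, observation_weight(single, row) falls back to 1.0, while observation_weight(both, row) evaluates the ‘weight’ formula on the row.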

algoParameters: Parameters to be transferred to the optimization algorithm

algorithm: Optimization algorithm

bestIteration: Store the best iteration found so far.

bootstrap_results: Results of the bootstrap calculation.

bootstrap_time: Time needed to calculate the bootstrap standard errors

calculateInitLikelihood()[source]

Calculate the value of the log likelihood function

The default values of the parameters are used.

Returns: value of the log likelihood.
Return type: float.

calculateLikelihood(x, scaled, batch=None)[source]

Calculates the value of the log likelihood function

Parameters

x (list(float)) – vector of values for the parameters.
scaled (bool) – if True, the value is divided by the number of observations used to calculate it. In this case, the values with different sample sizes are comparable. Default: True
batch (float) – if not None, calculates the likelihood on a random sample of the data. The value of the parameter must be strictly between 0 and 1, and represents the share of the data that will be used. Default: None

Returns

the calculated value of the log likelihood

Return type

float.

Raises

ValueError – if the length of the list x is incorrect.
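
The effect of scaled and batch can be sketched on a toy example. The function below is a hypothetical illustration, not Biogeme's code: it takes the probabilities of the chosen alternatives directly, rather than a vector of parameters, and the rounding of the batch size is an assumption.

```python
import math
import random


def log_likelihood(probabilities, scaled, batch=None, rng=None):
    """Sum of log probabilities, optionally on a random share of the data."""
    if batch is not None:
        if not 0.0 < batch < 1.0:
            raise ValueError('batch must be strictly between 0 and 1')
        rng = rng or random.Random()
        # Assumption: the batch size is the share of the data, rounded.
        size = max(1, round(batch * len(probabilities)))
        probabilities = rng.sample(probabilities, size)
    total = sum(math.log(p) for p in probabilities)
    # Scaling makes values computed on different sample sizes comparable.
    return total / len(probabilities) if scaled else total


chosen_probabilities = [0.5, 0.25, 0.5, 0.25]
full = log_likelihood(chosen_probabilities, scaled=False)
per_observation = log_likelihood(chosen_probabilities, scaled=True)
half = log_likelihood(chosen_probabilities, scaled=True, batch=0.5,
                      rng=random.Random(0))
```

Because half is scaled, it remains on the same per-observation scale as per_observation even though it uses only two of the four observations.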

calculateLikelihoodAndDerivatives(x, scaled, hessian=False, bhhh=False, batch=None)[source]

Calculate the value of the log likelihood function and its derivatives.

Parameters

x (list(float)) – vector of values for the parameters.
scaled (bool) – if True, the results are divided by the number of observations.
hessian (bool) – if True, the hessian is calculated. Default: False.
bhhh (bool) – if True, the BHHH matrix is calculated. Default: False.
batch (float) – if not None, calculates the likelihood on a random sample of the data. The value of the parameter must be strictly between 0 and 1, and represents the share of the data that will be used. Default: None

Returns

f, g, h, bh where

f is the value of the function (float)
g is the gradient (numpy.array)
h is the hessian (numpy.array)
bh is the BHHH matrix (numpy.array)

Return type

tuple float, numpy.array, numpy.array, numpy.array

Raises

ValueError – if the length of the list x is incorrect
biogemeError – if the norm of the gradient is not finite, an error is raised.
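
The text above does not define the BHHH matrix. Under its usual definition (the sum, over observations, of the outer product of each observation's gradient with itself), it can be sketched as follows. The per-observation gradients are hypothetical values, and the function is an illustration rather than Biogeme's implementation.

```python
def bhhh_matrix(gradients):
    """Sum over observations of the outer product of the gradient with itself."""
    size = len(gradients[0])
    matrix = [[0.0] * size for _ in range(size)]
    for g in gradients:
        for i in range(size):
            for j in range(size):
                matrix[i][j] += g[i] * g[j]
    return matrix


# Hypothetical per-observation gradients of the log likelihood (2 parameters).
observation_gradients = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.0]]
bh = bhhh_matrix(observation_gradients)
```

The result is symmetric by construction and only requires first derivatives, which is why it is often used as a cheaper substitute for the hessian.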

calculateNullLoglikelihood(avail)[source]

Calculate the log likelihood of the null model that predicts equal probability for each alternative

Parameters: avail (list of biogeme.expressions.Expression) – list of expressions to evaluate the availability conditions for each alternative. If None, all alternatives are always available.
Returns: value of the log likelihood
Return type: float
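
As an illustration of the null model, the sketch below assigns probability 1/J to each of the J alternatives available to an observation and sums the logarithms. The availability flags are hypothetical data given as plain lists, not Biogeme expressions.

```python
import math


def null_log_likelihood(availability):
    """Log likelihood when each available alternative is equally likely."""
    total = 0.0
    for flags in availability:
        # With J available alternatives, the chosen one has probability 1/J.
        total += -math.log(sum(flags))
    return total


# One row per observation, one 0/1 flag per alternative (hypothetical data).
availability = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
null_ll = null_log_likelihood(availability)
```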

changeInitValues(betas)[source]

Modifies the initial values of the parameters in all formulas

Parameters: betas (dict(string:float)) – dictionary where the keys are the names of the parameters, and the values are the new value for the parameters.

checkDerivatives(verbose=False)[source]

Verifies the implementation of the derivatives.

It compares the analytical version with the finite differences approximation.

Parameters

verbose (bool) – if True, the comparisons are reported. Default: False.

Return type

tuple.

Returns

f, g, h, gdiff, hdiff where

f is the value of the function,
g is the analytical gradient,
h is the analytical hessian,
gdiff is the difference between the analytical and the finite differences gradient,
hdiff is the difference between the analytical and the finite differences hessian,
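
The comparison behind gdiff can be sketched as follows. This is a minimal illustration with forward differences on a hypothetical test function; the step size and the differencing scheme are assumptions, not necessarily those used by Biogeme.

```python
def finite_difference_gradient(f, x, step=1e-7):
    """Forward finite differences approximation of the gradient of f at x."""
    fx = f(x)
    gradient = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += step
        gradient.append((f(shifted) - fx) / step)
    return gradient


def f(x):
    # Hypothetical function standing in for the log likelihood.
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def analytical_gradient(x):
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


point = [1.0, 2.0]
gdiff = [a - n for a, n in zip(analytical_gradient(point),
                               finite_difference_gradient(f, point))]
```

Entries of gdiff close to zero suggest that the analytical derivatives are implemented correctly; large entries point to the parameter whose derivative should be inspected.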

columnForBatchSamplingWeights: Name of the column defining weights for batch sampling in stochastic optimization.

confidenceIntervals(betaValues, intervalSize=0.9)[source]

Calculate confidence intervals on the simulated quantities

Parameters

betaValues (list(dict(str: float))) – list of parameter values to be used in the calculations. Typically, it is a sample drawn from a distribution.
intervalSize (float) – size of the reported confidence interval, expressed as a fraction between 0 and 1. If it is denoted by s, the interval is calculated for the quantiles (1-s)/2 and (1+s)/2. The default (0.9) corresponds to the quantiles 0.05 and 0.95.

Returns

two pandas data frames ‘left’ and ‘right’ with the same dimensions. Each row corresponds to a row in the database, and each column to a formula. ‘left’ contains the left value of the confidence interval, and ‘right’ the right value

Example:

# Assumes that 'biogeme' is an existing BIOGEME instance
import biogeme.results as res

# Read the estimation results from a file
results = res.bioResults(pickleFile='myModel.pickle')
# Retrieve the names of the beta parameters that have been
# estimated
betas = biogeme.freeBetaNames()

# Draw 100 realizations of the distribution of the estimators
b = results.getBetasForSensitivityAnalysis(betas, size=100)

# Simulate the formulas using the nominal values
betaValues = results.getBetaValues()
simulatedValues = biogeme.simulate(betaValues)

# Calculate the confidence intervals for each formula
left, right = biogeme.confidenceIntervals(b, 0.9)

Return type

tuple of two Pandas dataframes.
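
The quantile rule for intervalSize can be sketched for a single simulated quantity. The helper names and the linear interpolation between order statistics are assumptions made for illustration; Biogeme works on whole data frames.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return (sorted_values[lower] * (1.0 - fraction)
            + sorted_values[upper] * fraction)


def confidence_interval(simulated, interval_size=0.9):
    """Bounds at the quantiles (1-s)/2 and (1+s)/2 of the simulated values."""
    ordered = sorted(simulated)
    return (quantile(ordered, (1.0 - interval_size) / 2.0),
            quantile(ordered, (1.0 + interval_size) / 2.0))


# Hypothetical simulated values of one formula for one row of the database,
# one value per draw of the parameters: 0.0, 1.0, ..., 100.0.
values = [float(v) for v in range(101)]
left_bound, right_bound = confidence_interval(values, 0.9)
```

On this evenly spaced sample, the two bounds land on the 5% and 95% quantiles, that is, approximately 5 and 95.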

createLogFile(verbosity=3)[source]

Creates a log file with the messages produced by Biogeme.

The name of the file is the name of the model with an extension .log

Parameters

verbosity (int) –

types of messages to be captured

0: no output
1: warnings
2: only general information
3: more verbose
4: debug messages

Default: 3.

database: biogeme.database.Database object

drawsProcessingTime: Time needed to generate the draws.

estimate(recycle=False, bootstrap=0, algorithm=<function simpleBoundsNewtonAlgorithmForBiogeme>, algoParameters=None)[source]

Estimate the parameters of the model.

Parameters

recycle (bool) – if True, the results are read from the pickle file, if it exists. If False, the estimation is performed.
bootstrap (int) – number of bootstrap resamples used to calculate the variance-covariance matrix by bootstrapping. If the number is 0, bootstrapping is not applied. Default: 0.
algorithm (function) – optimization algorithm to use for the maximum likelihood estimation. Default: Biogeme’s Newton’s algorithm with simple bounds.
algoParameters (dict) – parameters to transfer to the optimization algorithm

Returns

object containing the estimation results.

Return type

biogeme.results.bioResults

Example:

# Assumes that 'database' and 'logprob' have been defined
import biogeme.biogeme as bio

# Create an instance of biogeme
biogeme = bio.BIOGEME(database, logprob)

# Give a name to the model
biogeme.modelName = 'mymodel'

# Estimate the parameters
results = biogeme.estimate()

Raises: biogemeError – if no expression has been provided for the likelihood
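
The principle behind the bootstrap argument can be sketched on a toy estimator. Here the sample mean stands in for the full model estimation, and bootstrap_variance is a hypothetical helper: each resample draws as many observations as the original data, with replacement, and the variance is taken across the resulting estimates.

```python
import random
import statistics


def bootstrap_variance(data, estimator, resamples, seed=None):
    """Variance of an estimator across bootstrap samples of the data."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(resamples):
        # Each bootstrap sample draws len(data) observations with replacement.
        sample = rng.choices(data, k=len(data))
        estimates.append(estimator(sample))
    return statistics.variance(estimates)


observations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
variance = bootstrap_variance(observations, statistics.mean,
                              resamples=200, seed=1)
```

Since every resample requires a new estimation, the computing time grows roughly linearly with the number of resamples.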

files_of_type(extension, all_files=False)[source]

Identify the list of files with a given extension in the local directory

Parameters

extension (str) – extension of the requested files (without the dot): ‘pickle’, or ‘html’
all_files (bool) – if all_files is False, only files containing the name of the model are identified. If all_files is True, all files with the requested extension are identified.

Returns

list of files with the requested extension.

Return type

list(str)
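
The role of all_files can be sketched with pathlib. The function below is a hypothetical stand-in that takes the directory and the model name as explicit arguments; the file names are made up for the illustration.

```python
import tempfile
from pathlib import Path


def list_files_of_type(directory, model_name, extension, all_files=False):
    """List the files with a given extension, optionally for one model only."""
    pattern = f'*.{extension}' if all_files else f'*{model_name}*.{extension}'
    return sorted(path.name for path in Path(directory).glob(pattern))


with tempfile.TemporaryDirectory() as folder:
    for name in ('mymodel.pickle', 'mymodel_v2.pickle',
                 'other.pickle', 'mymodel.html'):
        (Path(folder) / name).touch()
    model_files = list_files_of_type(folder, 'mymodel', 'pickle')
    every_pickle = list_files_of_type(folder, 'mymodel', 'pickle',
                                      all_files=True)
```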

formulas: Dictionary containing Biogeme formulas of type biogeme.expressions.Expression. The keys are the names of the formulas.

freeBetaNames()[source]

Returns the names of the parameters that must be estimated

Returns: list of names of the parameters
Return type: list(str)

generateHtml: Boolean variable, True if the HTML file with the results must be generated.

generatePickle: Boolean variable, True if the pickle file with the results must be generated.

getBoundsOnBeta(betaName)[source]

Returns the bounds on the parameter as defined by the user.

Parameters: betaName (string) – name of the parameter
Returns: lower bound, upper bound
Return type: tuple
Raises: biogemeError – if the name of the parameter is not found.

initLogLike: Init value of the likelihood function

lastSample: keeps track of the sample of data used to calculate the stochastic gradient / hessian

likelihoodFiniteDifferenceHessian(x)[source]

Calculate the hessian of the log likelihood function using finite differences.

May be useful when the analytical hessian has numerical issues.

Parameters: x (list(float)) – vector of values for the parameters.
Returns: finite differences approximation of the hessian.
Return type: numpy.array
Raises: ValueError – if the length of the list x is incorrect
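
One way to build such an approximation is to difference the gradient, as in the sketch below. Whether Biogeme differences the gradient or the function itself is not stated here, so the scheme, the step size, and the test function are all assumptions.

```python
def finite_difference_hessian(gradient, x, step=1e-6):
    """Approximate the hessian by forward differences of the gradient."""
    g0 = gradient(x)
    size = len(x)
    hessian = [[0.0] * size for _ in range(size)]
    for j in range(size):
        shifted = list(x)
        shifted[j] += step
        gj = gradient(shifted)
        for i in range(size):
            hessian[i][j] = (gj[i] - g0[i]) / step
    # Symmetrize, since rounding makes the approximation slightly asymmetric.
    return [[0.5 * (hessian[i][j] + hessian[j][i]) for j in range(size)]
            for i in range(size)]


def gradient(x):
    # Gradient of -(x0 - 1)^2 - 2 * (x1 + 0.5)^2 - x0 * x1
    return [-2.0 * (x[0] - 1.0) - x[1], -4.0 * (x[1] + 0.5) - x[0]]


h = finite_difference_hessian(gradient, [0.3, -0.2])
```

For this quadratic function the exact hessian is constant, with rows (-2, -1) and (-1, -4), so the approximation can be checked directly.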

loglike: Object of type biogeme.expressions.Expression calculating the formula for the loglikelihood

loglikeName: Keyword used for the name of the loglikelihood formula. Default: ‘loglike’

loglikeSignatures: Internal signature of the formula for the loglikelihood.

missingData: code for missing data

modelName: Name of the model. Default: ‘biogemeModelDefaultName’

monteCarlo: monteCarlo is True if one of the expressions involves a Monte-Carlo integration.

nullLogLike: Log likelihood of the null model

numberOfDraws: Number of draws for Monte-Carlo integration.

numberOfThreads: Number of threads used for parallel computing. Default: the number of available CPU.

optimizationMessages: Information provided by the optimization algorithm after completion.

optimize(startingValues=None)[source]

Calls the optimization algorithm. The function self.algorithm is called.

Parameters

startingValues (list(float)) – starting point for the algorithm

Returns

x, messages

x is the solution generated by the algorithm,
messages is a dictionary reporting information about the run of the algorithm

Return type

numpy.array, dict(str:object)

Raises

biogemeError – an error is raised if no algorithm is specified.

quickEstimate(algorithm=<function simpleBoundsNewtonAlgorithmForBiogeme>, algoParameters=None)[source]

Estimate the parameters of the model. Same as estimate, except that any extra calculation is skipped (initial log likelihood, t-statistics, etc.).

Parameters

algorithm (function) – optimization algorithm to use for the maximum likelihood estimation. Default: Biogeme’s Newton’s algorithm with simple bounds.
algoParameters (dict) – parameters to transfer to the optimization algorithm

Returns

object containing the estimation results.

Return type

biogeme.results.bioResults

Example:

# Assumes that 'database' and 'logprob' have been defined
import biogeme.biogeme as bio

# Create an instance of biogeme
biogeme = bio.BIOGEME(database, logprob)

# Give a name to the model
biogeme.modelName = 'mymodel'

# Estimate the parameters
results = biogeme.quickEstimate()

Raises: biogemeError – if no expression has been provided for the likelihood

reset_id_manager()[source]: Reset all the ids of the elementary expression in the formulas

saveIterations: If True, the current iterate is saved after each iteration, in a file named __[modelName].iter, where [modelName] is the name given to the model. If such a file exists, the starting values for the estimation are replaced by the values saved in the file.

setRandomInitValues(defaultBound=100.0)[source]

Modifies the initial values of the parameters in all formulas, using randomly generated values. The value is drawn from a uniform distribution on the interval defined by the bounds.

Parameters: defaultBound (float) – If the upper bound is missing, it is replaced by this value. If the lower bound is missing, it is replaced by the opposite of this value. Default: 100.
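
The handling of missing bounds can be sketched as follows. The helper random_init_value and the parameter names are hypothetical; None represents a missing bound.

```python
import random


def random_init_value(lower, upper, default_bound=100.0, rng=random):
    """Draw uniformly between the bounds, replacing the missing ones."""
    if lower is None:
        lower = -default_bound
    if upper is None:
        upper = default_bound
    return rng.uniform(lower, upper)


rng = random.Random(42)
# Hypothetical parameters with their (lower, upper) bounds.
bounds = {'beta_time': (None, 0.0),
          'beta_cost': (None, None),
          'mu': (1.0, 10.0)}
init_values = {name: random_init_value(lo, up, rng=rng)
               for name, (lo, up) in bounds.items()}
```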

simulate(theBetaValues=None)[source]

Applies the formulas to each row of the database.

Parameters: theBetaValues (dict(str, float)) – values of the parameters to be used in the calculations. If None, the default values are used. Default: None.
Returns: a pandas data frame with the simulated value. Each row corresponds to a row in the database, and each column to a formula.
Return type: Pandas data frame

Example:

# Read the estimation results from a file
results = res.bioResults(pickleFile='myModel.pickle')
# Retrieve the estimated values of the parameters
betaValues = results.getBetaValues()
# Simulate the formulas using the nominal values
simulatedValues = biogeme.simulate(betaValues)

Raises: biogemeError – if the number of parameters is incorrect

userNotes: User notes

validate(estimationResults, validationData)[source]

Perform out-of-sample validation.

The function performs the following tasks:

each slice defines a validation set (the slice itself) and an estimation set (the rest of the data),

the model is re-estimated on the estimation set,

the estimated model is applied on the validation set,

the value of the log likelihood for each observation is reported.

Parameters

estimationResults (biogeme.results.bioResults) – results of the model estimation based on the full data.
validationData (list(tuple(pandas.DataFrame, pandas.DataFrame))) – list of estimation and validation data sets

Returns

a list containing as many items as slices. Each item is the result of the simulation on the validation set.

Return type

list(pandas.DataFrame)

Raises

biogemeError – An error is raised if the database is structured as panel data.
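
The structure expected for validationData, a list of (estimation set, validation set) pairs, can be sketched with plain lists in place of pandas data frames. The helper and the way rows are assigned to slices are assumptions made for the illustration.

```python
def estimation_validation_pairs(rows, slices):
    """Split rows into slices; pair each slice with the rest of the data."""
    pairs = []
    for k in range(slices):
        # The slice itself is the validation set.
        validation = rows[k::slices]
        # The rest of the data is the estimation set.
        estimation = [row for i, row in enumerate(rows) if i % slices != k]
        pairs.append((estimation, validation))
    return pairs


rows = list(range(10))  # hypothetical observations
pairs = estimation_validation_pairs(rows, slices=5)
```

Each observation appears in exactly one validation set, so every observation ends up evaluated by a model that was estimated without it.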

weight: Object of type biogeme.expressions.Expression calculating the weight of each observation in the sample.

weightName: Keyword used for the name of the weight formula. Default: ‘weight’

weightSignatures: Internal signature of the formula for the weight.

biogeme.biogeme.logger = <biogeme.messaging.bioMessage object>: Logger that controls the output of messages to the screen and log file. Type: class biogeme.messaging.bioMessage.

class biogeme.biogeme.negLikelihood(like, like_deriv, scaled)[source]

Bases: functionToMinimize

Provides the value of the function to be minimized, as well as its derivatives. To be used by the optimization package.

__init__(like, like_deriv, scaled)[source]: Constructor

batch: Value betwen 0 and 1 defining the size of the batch, that is the percentage of the data that should be used to approximate the log likelihood.

bhhhv: BHHH matrix

f(batch=None)[source]

Calculate the value of the function

Parameters: batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None.
Returns: value of the function
Return type: float

f_g(batch=None)[source]

Calculate the value of the function and the gradient

Parameters: batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None.
Returns: value of the function and the gradient
Return type: tuple float, numpy.array

f_g_bhhh(batch=None)[source]

Calculate the value of the function, the gradient and the BHHH matrix

Parameters: batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None.
Returns: value of the function, the gradient and the BHHH
Return type: tuple float, numpy.array, numpy.array

f_g_h(batch=None)[source]

Calculate the value of the function, the gradient and the Hessian

Parameters: batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None.
Returns: value of the function, the gradient and the Hessian
Return type: tuple float, numpy.array, numpy.array

fv: value of the function

gv: vector with the gradient

hv: second derivatives matrix

like: function calculating the log likelihood

like_deriv: function calculating the log likelihood and its derivatives.

recalculate: True if the log likelihood must be recalculated

scaled: if True, the value of the log likelihood is divided by the number of observations used to calculate it. In this case, the values with different sample sizes are comparable.

setVariables(x)[source]

Set the values of the variables for which the function has to be calculated.

Parameters: x (numpy.array) – values

x: Vector of unknown parameters values