Biogeme

The core routines of Biogeme.

biogeme.biogeme module

Implementation of the main Biogeme class that combines the database and the model specification.

author

Michel Bierlaire

date

Tue Mar 26 16:45:15 2019

class biogeme.biogeme.BIOGEME(database, formulas, userNotes=None, numberOfThreads=None, numberOfDraws=1000, seed=None, skipAudit=False, suggestScales=True, missingData=99999)[source]

Bases: object

Main class that combines the database and the model specification.

It works in two modes: estimation and simulation.

__init__(database, formulas, userNotes=None, numberOfThreads=None, numberOfDraws=1000, seed=None, skipAudit=False, suggestScales=True, missingData=99999)[source]

Constructor

Parameters
  • database (biogeme.database.Database) – choice data.

  • formulas (biogeme.expressions.Expression, or dict(biogeme.expressions.Expression)) – expression or dictionary of expressions that define the model specification. Each expression is applied to each row of the database. The keys of the dictionary provide a name for each formula. In estimation mode, two formulas are used, with the keys ‘loglike’ and ‘weight’. If only one formula is provided, it is associated with the label ‘loglike’. If no formula is labeled ‘weight’, the weight of each observation is assumed to be 1.0. In simulation mode, the label of each formula is used as the label of the corresponding column of the resulting database.

  • userNotes (str) – these notes will be included in the report file.

  • numberOfThreads (int) – multi-threading can be used for estimation. This parameter defines the number of threads to be used. If the parameter is set to None, the number of available threads is calculated using cpu_count(). Ignored in simulation mode. Default: None.

  • numberOfDraws (int) – number of draws used for Monte-Carlo integration. Default: 1000.

  • seed (int) – seed used for the pseudo-random number generation. It is useful only when each run should generate the exact same result. If None, a new seed is used at each run. Default: None.

  • skipAudit (bool) – if True, does not check the validity of the formulas. It may save a significant amount of time for large models and large data sets. Default: False.

  • suggestScales (bool) – if True, Biogeme suggests the scaling of the variables in the database. Default: True. See also biogeme.database.Database.suggestScaling().

  • missingData (float) – if a variable has this value, the data is assumed to be missing and an exception is raised. Default: 99999.

Raises

biogemeError – an audit of the formulas is performed. If a formula has issues, an error is detected and an exception is raised.
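
The labelling rules for the formulas argument can be illustrated with a minimal sketch. The helpers label_formulas and weight_of are hypothetical and are not part of Biogeme; plain strings stand in for biogeme.expressions.Expression objects.

```python
# Hypothetical helpers, not part of Biogeme: they only mirror the
# labelling rules described for the `formulas` argument.
def label_formulas(formulas):
    """Return a dict of formulas keyed by label."""
    if isinstance(formulas, dict):
        return dict(formulas)
    # A single formula is associated with the label 'loglike'.
    return {'loglike': formulas}


def weight_of(labeled, default=1.0):
    """Return the 'weight' formula, or 1.0 when none is labeled 'weight'."""
    return labeled.get('weight', default)


# Strings stand in for Expression objects in this illustration.
single = label_formulas('logprob')
both = label_formulas({'loglike': 'logprob', 'weight': 'w'})
print(single, weight_of(single))
print(both, weight_of(both))
```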

algoParameters

Parameters to be transferred to the optimization algorithm

algorithm

Optimization algorithm

bestIteration

Store the best iteration found so far.

bootstrap_results

Results of the bootstrap calculation.

bootstrap_time

Time needed to calculate the bootstrap standard errors

calculateInitLikelihood()[source]

Calculate the value of the log likelihood function

The default values of the parameters are used.

Returns

value of the log likelihood.

Return type

float.

calculateLikelihood(x, scaled, batch=None)[source]

Calculates the value of the log likelihood function

Parameters
  • x (list(float)) – vector of values for the parameters.

  • scaled (bool) – if True, the value is divided by the number of observations used to calculate it. In this case, the values with different sample sizes are comparable. Default: True

  • batch (float) – if not None, calculates the likelihood on a random sample of the data. The value of the parameter must be strictly between 0 and 1, and represents the share of the data that will be used. Default: None

Returns

the calculated value of the log likelihood

Return type

float.

Raises

ValueError – if the length of the list x is incorrect.
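
The effect of scaled and batch can be sketched on a list of per-observation contributions. This is an illustration under stated assumptions, not Biogeme's implementation: the function log_likelihood, its seed argument and the way the random sample is drawn are all hypothetical.

```python
import math
import random


def log_likelihood(contributions, scaled, batch=None, seed=None):
    """Sum per-observation log likelihood contributions.

    contributions: list of floats, one per observation (illustrative input).
    """
    if batch is not None:
        if not 0.0 < batch < 1.0:
            raise ValueError('batch must be strictly between 0 and 1')
        # Assumption: the batch is a random sample without replacement.
        size = max(1, round(batch * len(contributions)))
        contributions = random.Random(seed).sample(contributions, size)
    total = sum(contributions)
    # When scaled, divide by the number of observations actually used.
    return total / len(contributions) if scaled else total


probabilities = [0.5, 0.25, 0.8, 0.4]
contribs = [math.log(p) for p in probabilities]
full = log_likelihood(contribs, scaled=False)
per_obs = log_likelihood(contribs, scaled=True)
half = log_likelihood(contribs, scaled=True, batch=0.5, seed=1)
print(full, per_obs, half)
```

Because the scaled value is an average per observation, per_obs and half are on the same scale even though they use different sample sizes.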

calculateLikelihoodAndDerivatives(x, scaled, hessian=False, bhhh=False, batch=None)[source]

Calculate the value of the log likelihood function and its derivatives.

Parameters
  • x (list(float)) – vector of values for the parameters.

  • scaled (bool) – if True, the results are divided by the number of observations.

  • hessian (bool) – if True, the hessian is calculated. Default: False.

  • bhhh (bool) – if True, the BHHH matrix is calculated. Default: False.

  • batch (float) – if not None, calculates the likelihood on a random sample of the data. The value of the parameter must be strictly between 0 and 1, and represents the share of the data that will be used. Default: None

Returns

f, g, h, bh where

  • f is the value of the function (float)

  • g is the gradient (numpy.array)

  • h is the hessian (numpy.array)

  • bh is the BHHH matrix (numpy.array)

Return type

tuple float, numpy.array, numpy.array, numpy.array

Raises
  • ValueError – if the length of the list x is incorrect

  • biogemeError – if the norm of the gradient is not finite, an error is raised.

calculateNullLoglikelihood(avail)[source]

Calculate the log likelihood of the null model that predicts equal probability for each alternative

Parameters

avail (list of biogeme.expressions.Expression) – list of expressions to evaluate the availability conditions for each alternative. If None, all alternatives are always available.

Returns

value of the log likelihood

Return type

float
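
If every available alternative is equally likely, an observation with J available alternatives contributes -log(J) to the log likelihood. The sketch below applies this to already evaluated availability flags; the function null_log_likelihood and its input layout are assumptions made for illustration, and observation weights are ignored.

```python
import math


def null_log_likelihood(availability):
    """Log likelihood of the equal-probability model.

    availability: one list of 0/1 flags per observation, with one flag
    per alternative (illustrative layout, not Biogeme's).
    """
    total = 0.0
    for flags in availability:
        n_available = sum(1 for flag in flags if flag)
        # Each available alternative has probability 1 / n_available.
        total -= math.log(n_available)
    return total


avail = [
    [1, 1, 1],  # three alternatives available
    [1, 0, 1],  # two alternatives available
    [1, 1, 1],
]
null_ll = null_log_likelihood(avail)
print(null_ll)
```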

changeInitValues(betas)[source]

Modifies the initial values of the parameters in all formulas

Parameters

betas (dict(string:float)) – dictionary where the keys are the names of the parameters, and the values are the new value for the parameters.

checkDerivatives(verbose=False)[source]

Verifies the implementation of the derivatives.

It compares the analytical version with the finite differences approximation.

Parameters

verbose (bool) – if True, the comparisons are reported. Default: False.

Return type

tuple.

Returns

f, g, h, gdiff, hdiff where

  • f is the value of the function,

  • g is the analytical gradient,

  • h is the analytical hessian,

  • gdiff is the difference between the analytical and the finite differences gradient,

  • hdiff is the difference between the analytical and the finite differences hessian.
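
The comparison between an analytical gradient and its finite differences approximation can be sketched on a small function. The function f, the step size and the forward-difference scheme are illustrative choices and are not taken from Biogeme.

```python
import math


def f(x):
    """Illustrative function: f(x) = x0^2 * x1 + sin(x1)."""
    return x[0] ** 2 * x[1] + math.sin(x[1])


def analytical_gradient(x):
    """Gradient of f derived by hand."""
    return [2.0 * x[0] * x[1], x[0] ** 2 + math.cos(x[1])]


def finite_difference_gradient(func, x, step=1e-7):
    """Forward finite differences approximation of the gradient."""
    fx = func(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += step
        grad.append((func(shifted) - fx) / step)
    return grad


x = [1.5, -0.5]
g = analytical_gradient(x)
g_fd = finite_difference_gradient(f, x)
gdiff = [a - b for a, b in zip(g, g_fd)]
print(gdiff)
```

Entries of gdiff that are large compared with the step size usually point to a mistake in the analytical derivative.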

columnForBatchSamplingWeights

Name of the column defining weights for batch sampling in stochastic optimization.

confidenceIntervals(betaValues, intervalSize=0.9)[source]

Calculate confidence intervals on the simulated quantities

Parameters
  • betaValues (list(dict(str: float))) – array of parameters values to be used in the calculations. Typically, it is a sample drawn from a distribution.

  • intervalSize (float) – size of the reported confidence interval, as a fraction between 0 and 1. If it is denoted by s, the interval is calculated for the quantiles (1-s)/2 and (1+s)/2. The default (0.9) corresponds to the quantiles 0.05 and 0.95.

Returns

two pandas data frames ‘left’ and ‘right’ with the same dimensions. Each row corresponds to a row in the database, and each column to a formula. ‘left’ contains the left value of the confidence interval, and ‘right’ the right value

Example:

import biogeme.results as res

# Here, biogeme is an existing instance of the BIOGEME class.
# Read the estimation results from a file
results = res.bioResults(pickleFile='myModel.pickle')
# Retrieve the names of the beta parameters that have been
# estimated
betas = biogeme.freeBetaNames()

# Draw 100 realizations of the distribution of the estimators
b = results.getBetasForSensitivityAnalysis(betas, size=100)

# Simulate the formulas using the nominal values
# (getBetaValues is assumed to return them as a dict)
betaValues = results.getBetaValues()
simulatedValues = biogeme.simulate(betaValues)

# Calculate the confidence intervals for each formula
left, right = biogeme.confidenceIntervals(b, 0.9)

Return type

tuple of two Pandas dataframes.
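
The relation between intervalSize and the two quantiles can be sketched as follows. The helper names and the linear-interpolation quantile are assumptions made for illustration; the interpolation rule used by Biogeme is not stated here.

```python
def interval_quantiles(interval_size):
    """Quantile levels (1-s)/2 and (1+s)/2 for an interval of size s."""
    return (1.0 - interval_size) / 2.0, (1.0 + interval_size) / 2.0


def empirical_quantile(values, q):
    """Quantile of a list of floats by linear interpolation, 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Simulated values of one formula for one row: 0.0, 1.0, ..., 100.0
simulated = [float(v) for v in range(101)]
low_q, high_q = interval_quantiles(0.9)
left = empirical_quantile(simulated, low_q)
right = empirical_quantile(simulated, high_q)
print(low_q, high_q, left, right)
```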

createLogFile(verbosity=3)[source]

Creates a log file with the messages produced by Biogeme.

The name of the file is the name of the model with the extension .log

Parameters

verbosity (int) –

types of messages to be captured

  • 0: no output

  • 1: warnings

  • 2: only general information

  • 3: more verbose

  • 4: debug messages

Default: 3.

database

biogeme.database.Database object

drawsProcessingTime

Time needed to generate the draws.

estimate(recycle=False, bootstrap=0, algorithm=<function simpleBoundsNewtonAlgorithmForBiogeme>, algoParameters=None)[source]

Estimate the parameters of the model.

Parameters
  • recycle (bool) – if True, the results are read from the pickle file, if it exists. If False, the estimation is performed.

  • bootstrap (int) – number of bootstrap resamplings used to calculate the variance-covariance matrix. If the number is 0, bootstrapping is not applied. Default: 0.

  • algorithm (function) – optimization algorithm to use for the maximum likelihood estimation. Default: Biogeme’s Newton’s algorithm with simple bounds.

  • algoParameters (dict) – parameters to transfer to the optimization algorithm

Returns

object containing the estimation results.

Return type

biogeme.results.bioResults

Example:

import biogeme.biogeme as bio

# Create an instance of biogeme
# (database and logprob are assumed to be defined beforehand)
biogeme = bio.BIOGEME(database, logprob)

# Give a name to the model
biogeme.modelName = 'mymodel'

# Estimate the parameters
results = biogeme.estimate()

Raises

biogemeError – if no expression has been provided for the likelihood
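
The idea behind the bootstrap argument can be sketched on a toy estimator: the data are resampled with replacement, the estimator is recomputed on each resample, and the spread of the estimates approximates its variance. The function bootstrap_variance is hypothetical, and the sample mean stands in for the maximum likelihood estimator.

```python
import random
import statistics


def bootstrap_variance(data, estimator, n_resamples, seed=None):
    """Variance of an estimator across bootstrap resamples of the data."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Draw len(data) observations with replacement.
        resample = [rng.choice(data) for _ in data]
        estimates.append(estimator(resample))
    return statistics.variance(estimates)


data = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9]
var_mean = bootstrap_variance(data, statistics.mean, n_resamples=200, seed=42)
print(var_mean)
```

With a model, each resample requires a full re-estimation, which is why the calculation time grows with the number of resamplings.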

files_of_type(extension, all_files=False)[source]

Identify the list of files with a given extension in the local directory

Parameters
  • extension (str) – extension of the requested files (without the dot): ‘pickle’, or ‘html’

  • all_files (bool) – if all_files is False, only files containing the name of the model are identified. If all_files is True, all files with the requested extension are identified.

Returns

list of files with the requested extension.

Return type

list(str)
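
The filtering controlled by all_files can be sketched with the standard library. This version is an assumption-laden illustration: it takes the directory and the model name as explicit arguments, which the method documented here does not.

```python
import glob
import os
import tempfile


def files_of_type(directory, model_name, extension, all_files=False):
    """List the files of a directory that have the requested extension."""
    if all_files:
        pattern = f'*.{extension}'
    else:
        # Keep only the files whose name contains the model name.
        pattern = f'*{model_name}*.{extension}'
    paths = glob.glob(os.path.join(directory, pattern))
    return sorted(os.path.basename(path) for path in paths)


# Demonstration in a temporary directory with empty files.
with tempfile.TemporaryDirectory() as folder:
    names = ('mymodel.pickle', 'mymodel~00.pickle', 'other.pickle', 'mymodel.html')
    for name in names:
        open(os.path.join(folder, name), 'w').close()
    own = files_of_type(folder, 'mymodel', 'pickle')
    everything = files_of_type(folder, 'mymodel', 'pickle', all_files=True)

print(own)
print(everything)
```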

formulas

Dictionary containing Biogeme formulas of type biogeme.expressions.Expression. The keys are the names of the formulas.

freeBetaNames()[source]

Returns the names of the parameters that must be estimated

Returns

list of names of the parameters

Return type

list(str)

generateHtml

Boolean variable, True if the HTML file with the results must be generated.

generatePickle

Boolean variable, True if the pickle file with the results must be generated.

getBoundsOnBeta(betaName)[source]

Returns the bounds on the parameter as defined by the user.

Parameters

betaName (string) – name of the parameter

Returns

lower bound, upper bound

Return type

tuple

Raises

biogemeError – if the name of the parameter is not found.

initLogLike

Initial value of the likelihood function

lastSample

Keeps track of the sample of data used to calculate the stochastic gradient / hessian

likelihoodFiniteDifferenceHessian(x)[source]

Calculate the hessian of the log likelihood function using finite differences.

May be useful when the analytical hessian has numerical issues.

Parameters

x (list(float)) – vector of values for the parameters.

Returns

finite differences approximation of the hessian.

Return type

numpy.array

Raises

ValueError – if the length of the list x is incorrect
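
A finite differences hessian can be obtained by differencing the gradient, one coordinate at a time. The sketch below uses a hand-written gradient of a concave quadratic function; the function names and the step size are illustrative assumptions.

```python
def gradient(x):
    """Gradient of f(x) = -(x0 - 1)^2 - 2 (x1 + 0.5)^2 - x0 x1."""
    return [-2.0 * (x[0] - 1.0) - x[1], -4.0 * (x[1] + 0.5) - x[0]]


def finite_difference_hessian(grad, x, step=1e-6):
    """Approximate the hessian by forward differences of the gradient."""
    g0 = grad(x)
    n = len(x)
    hessian = [[0.0] * n for _ in range(n)]
    for j in range(n):
        shifted = list(x)
        shifted[j] += step
        gj = grad(shifted)
        for i in range(n):
            # Column j holds the change of the gradient along coordinate j.
            hessian[i][j] = (gj[i] - g0[i]) / step
    return hessian


h_fd = finite_difference_hessian(gradient, [0.3, -0.2])
print(h_fd)
```

For this quadratic function the exact hessian is [[-2, -1], [-1, -4]], so the approximation can be checked entry by entry.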

loglike

Object of type biogeme.expressions.Expression calculating the formula for the loglikelihood

loglikeName

Keyword used for the name of the loglikelihood formula. Default: ‘loglike’

loglikeSignatures

Internal signature of the formula for the loglikelihood.

missingData

code for missing data

modelName

Name of the model. Default: ‘biogemeModelDefaultName’

monteCarlo

monteCarlo is True if one of the expressions involves a Monte-Carlo integration.

nullLogLike

Log likelihood of the null model

numberOfDraws

Number of draws for Monte-Carlo integration.

numberOfThreads

Number of threads used for parallel computing. Default: the number of available CPUs.

optimizationMessages

Information provided by the optimization algorithm after completion.

optimize(startingValues=None)[source]

Calls the optimization algorithm. The function self.algorithm is called.

Parameters

startingValues (list(float)) – starting point for the algorithm

Returns

x, messages

  • x is the solution generated by the algorithm,

  • messages is a dictionary with information about the run of the algorithm

Return type

numpy.array, dict(str:object)

Raises

biogemeError – an error is raised if no algorithm is specified.

quickEstimate(algorithm=<function simpleBoundsNewtonAlgorithmForBiogeme>, algoParameters=None)[source]

Estimate the parameters of the model. Same as estimate, except that any extra calculation is skipped (init loglikelihood, t-statistics, etc.)

Parameters
  • algorithm (function) – optimization algorithm to use for the maximum likelihood estimation. Default: Biogeme’s Newton’s algorithm with simple bounds.

  • algoParameters (dict) – parameters to transfer to the optimization algorithm

Returns

object containing the estimation results.

Return type

biogeme.results.bioResults

Example:

import biogeme.biogeme as bio

# Create an instance of biogeme
# (database and logprob are assumed to be defined beforehand)
biogeme = bio.BIOGEME(database, logprob)

# Give a name to the model
biogeme.modelName = 'mymodel'

# Estimate the parameters
results = biogeme.quickEstimate()

Raises

biogemeError – if no expression has been provided for the likelihood

reset_id_manager()[source]

Reset all the ids of the elementary expressions in the formulas

saveIterations

If True, the current iterate is saved after each iteration, in a file named __[modelName].iter, where [modelName] is the name given to the model. If such a file exists, the starting values for the estimation are replaced by the values saved in the file.

setRandomInitValues(defaultBound=100.0)[source]

Modifies the initial values of the parameters in all formulas, using randomly generated values. The value is drawn from a uniform distribution on the interval defined by the bounds.

Parameters

defaultBound (float) – If the upper bound is missing, it is replaced by this value. If the lower bound is missing, it is replaced by the opposite of this value. Default: 100.
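
The role of defaultBound can be sketched as follows: missing bounds are replaced before a uniform draw is made for each parameter. The function random_init_values, the layout of the bounds dictionary and the parameter names are hypothetical.

```python
import random


def random_init_values(bounds, default_bound=100.0, seed=None):
    """Draw one uniform value per parameter within its bounds.

    bounds: dict mapping a parameter name to (lower, upper), where None
    marks a missing bound (illustrative layout).
    """
    rng = random.Random(seed)
    values = {}
    for name, (lower, upper) in bounds.items():
        if lower is None:
            lower = -default_bound
        if upper is None:
            upper = default_bound
        values[name] = rng.uniform(lower, upper)
    return values


bounds = {'B_TIME': (None, 0.0), 'B_COST': (None, None), 'MU': (1.0, 10.0)}
draws = random_init_values(bounds, seed=0)
print(draws)
```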

simulate(theBetaValues=None)[source]

Applies the formulas to each row of the database.

Parameters

theBetaValues (dict(str, float)) – values of the parameters to be used in the calculations. If None, the default values are used. Default: None.

Returns

a pandas data frame with the simulated value. Each row corresponds to a row in the database, and each column to a formula.

Return type

Pandas data frame

Example:

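import biogeme.results as res

# Here, biogeme is an existing instance of the BIOGEME class.
# Read the estimation results from a file
results = res.bioResults(pickleFile='myModel.pickle')
# Simulate the formulas using the nominal values (assuming getBetaValues returns them)
betaValues = results.getBetaValues()
simulatedValues = biogeme.simulate(betaValues)

Raises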

biogemeError – if the number of parameters is incorrect

userNotes

User notes

validate(estimationResults, validationData)[source]

Perform out-of-sample validation.

The function performs the following tasks:

  • each slice defines a validation set (the slice itself) and an estimation set (the rest of the data),

  • the model is re-estimated on the estimation set,

  • the estimated model is applied on the validation set,

  • the value of the log likelihood for each observation is reported.

Parameters
  • estimationResults (biogeme.results.bioResults) – results of the model estimation based on the full data.

  • validationData (list(tuple(pandas.DataFrame, pandas.DataFrame))) – list of estimation and validation data sets

Returns

a list containing as many items as slices. Each item is the result of the simulation on the validation set.

Return type

list(pandas.DataFrame)

Raises

biogemeError – An error is raised if the database is structured as panel data.
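
The structure expected for validationData, a list of (estimation set, validation set) pairs, can be sketched with plain lists of row identifiers. The function split_slices is hypothetical; real data would be pandas data frames rather than lists of integers.

```python
import random


def split_slices(rows, n_slices, seed=None):
    """Build (estimation, validation) pairs, one pair per slice."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Slice i takes every n_slices-th row, starting at position i.
    slices = [shuffled[i::n_slices] for i in range(n_slices)]
    pairs = []
    for k, validation in enumerate(slices):
        # The estimation set is the rest of the data.
        estimation = [row for j, s in enumerate(slices) if j != k for row in s]
        pairs.append((estimation, validation))
    return pairs


rows = list(range(10))
pairs = split_slices(rows, 5, seed=3)
for estimation, validation in pairs:
    print(sorted(validation), len(estimation))
```

Each row appears in exactly one validation set, so every observation is predicted by a model that was estimated without it.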

weight

Object of type biogeme.expressions.Expression calculating the weight of each observation in the sample.

weightName

Keyword used for the name of the weight formula. Default: ‘weight’

weightSignatures

Internal signature of the formula for the weight.

biogeme.biogeme.logger = <biogeme.messaging.bioMessage object>

Logger that controls the output of messages to the screen and log file. Type: class biogeme.messaging.bioMessage.

class biogeme.biogeme.negLikelihood(like, like_deriv, scaled)[source]

Bases: functionToMinimize

Provides the value of the function to be minimized, as well as its derivatives. To be used by the optimization package.

__init__(like, like_deriv, scaled)[source]

Constructor

batch

Value between 0 and 1 defining the size of the batch, that is, the share of the data that should be used to approximate the log likelihood.

bhhhv

BHHH matrix

f(batch=None)[source]

Calculate the value of the function

Parameters

batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None.

Returns

value of the function

Return type

float

f_g(batch=None)[source]

Calculate the value of the function and the gradient

Parameters

batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None.

Returns

value of the function and the gradient

Return type

tuple float, numpy.array

f_g_bhhh(batch=None)[source]

Calculate the value of the function, the gradient and the BHHH matrix

Parameters

batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None.

Returns

value of the function, the gradient and the BHHH

Return type

tuple float, numpy.array, numpy.array

f_g_h(batch=None)[source]

Calculate the value of the function, the gradient and the Hessian

Parameters

batch (float) – for data-driven functions (such as a log likelihood function), it is possible to approximate the value of the function using a sample of the data called a batch. This argument is a value between 0 and 1 representing the share of the data that should be used for the random batch. If None, the full data set is used. Default: None.

Returns

value of the function, the gradient and the Hessian

Return type

tuple float, numpy.array, numpy.array

fv

value of the function

gv

vector with the gradient

hv

second derivatives matrix

like

function calculating the log likelihood

like_deriv

function calculating the log likelihood and its derivatives.

recalculate

True if the log likelihood must be recalculated

scaled

if True, the value of the log likelihood is divided by the number of observations used to calculate it. In this case, the values with different sample sizes are comparable.

setVariables(x)[source]

Set the values of the variables for which the function has to be calculated.

Parameters

x (numpy.array) – values

x

Vector of unknown parameters values
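
The general pattern behind such a class can be sketched as a wrapper that flips the sign of the log likelihood, so that a minimization routine maximizes it, and that caches the last value. The class NegativeLogLikelihood below is a hypothetical stand-in, not the negLikelihood class itself; in particular, the caching rule is an assumption.

```python
class NegativeLogLikelihood:
    """Hypothetical stand-in: sign flip and caching of the last value."""

    def __init__(self, like, scaled):
        self.like = like          # function calculating the log likelihood
        self.scaled = scaled
        self.x = None
        self.fv = None
        self.recalculate = True   # True if the value must be recalculated

    def setVariables(self, x):
        self.x = list(x)
        self.recalculate = True

    def f(self, batch=None):
        if self.x is None:
            raise ValueError('setVariables must be called first')
        if self.recalculate or batch is not None:
            # Minimizing the opposite of the log likelihood maximizes it.
            self.fv = -self.like(self.x, self.scaled, batch)
            # A value computed on a random batch is never reused.
            self.recalculate = batch is not None
        return self.fv


calls = []


def toy_like(x, scaled, batch):
    """Toy log likelihood, maximal at x0 = 2."""
    calls.append(tuple(x))
    return -((x[0] - 2.0) ** 2)


objective = NegativeLogLikelihood(toy_like, scaled=False)
objective.setVariables([0.0])
first = objective.f()
second = objective.f()  # served from the cache, toy_like is not called
objective.setVariables([2.0])
at_optimum = objective.f()
print(first, second, at_optimum, len(calls))
```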