Biogeme
The core routines of Biogeme.
biogeme.biogeme module
Implementation of the main Biogeme class
- author:
Michel Bierlaire
- date:
Tue Mar 26 16:45:15 2019
It combines the database and the model specification.
- class biogeme.biogeme.BIOGEME(database, formulas, userNotes=None, parameter_file=None, skip_audit=False, **kwargs)[source]
Bases:
object
Main class that combines the database and the model specification.
It works in two modes: estimation and simulation.
- __init__(database, formulas, userNotes=None, parameter_file=None, skip_audit=False, **kwargs)[source]
Constructor
- Parameters:
database (biogeme.database.Database) – choice data.
formulas (biogeme.expressions.Expression, or dict(biogeme.expressions.Expression)) – expression or dictionary of expressions that define the model specification. Each expression is applied to each entry of the database. The keys of the dictionary give a name to each formula. In the estimation mode, two formulas are used, with the keys ‘loglike’ and ‘weight’. If only one formula is provided, it is associated with the label ‘loglike’. If no formula is labeled ‘weight’, the weight of each observation is assumed to be 1.0. In the simulation mode, the labels of the formulas are used as column labels of the resulting database.
userNotes (str) – these notes will be included in the report file.
parameter_file (str) – name of the .toml file from which the parameters are read.
- Raises:
BiogemeError – an audit of the formulas is performed. If a formula has issues, an error is detected and an exception is raised.
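The labeling rule for the formulas can be sketched in plain Python. This is only an illustration under stated assumptions, not Biogeme's implementation: the helper `normalize_formulas` is hypothetical, and plain strings stand in for `biogeme.expressions.Expression` objects.

```python
def normalize_formulas(formulas, loglike_name="loglike", weight_name="weight"):
    """Return (formulas_dict, loglike, weight) following the labeling rule.

    Hypothetical helper: any non-dict object stands in for a single expression.
    """
    if not isinstance(formulas, dict):
        # A single formula is associated with the label 'loglike'.
        formulas = {loglike_name: formulas}
    loglike = formulas.get(loglike_name)
    # Without a 'weight' formula, each observation is given the weight 1.0.
    weight = formulas.get(weight_name, 1.0)
    return formulas, loglike, weight


single = normalize_formulas("logprob")
both = normalize_formulas({"loglike": "logprob", "weight": "w"})
print(single)
print(both)
```

In the simulation mode, the keys of the dictionary would simply become the column labels of the output.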
- property algorithm_name
Name of the optimization algorithm
- argument_warning()[source]
Displays a deprecation warning when parameters are provided as arguments.
- bestIteration
Store the best iteration found so far.
- beta_values_dict_to_list(beta_dict=None)[source]
Transforms a dict with the names of the betas associated with their values, into a list consistent with the numbering of the ids.
- Parameters:
beta_dict (dict(str: float)) – dict with the values of the parameters
- Raises:
BiogemeError – if the parameter is not a dict
BiogemeError – if a parameter is missing in the dict
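The conversion and its two error cases can be illustrated with a minimal sketch. It assumes that the ordering of the ids is available as a list of parameter names; the function `beta_dict_to_list` and the local `BiogemeError` class are stand-ins defined here so that the example runs on its own.

```python
class BiogemeError(Exception):
    """Stand-in for Biogeme's exception class (assumption for this sketch)."""


def beta_dict_to_list(beta_dict, ordered_names):
    """Order the values of beta_dict according to ordered_names."""
    if not isinstance(beta_dict, dict):
        raise BiogemeError(
            f"A dictionary must be provided, not {type(beta_dict).__name__}"
        )
    missing = [name for name in ordered_names if name not in beta_dict]
    if missing:
        raise BiogemeError(f"Missing parameter(s): {missing}")
    return [beta_dict[name] for name in ordered_names]


# Hypothetical parameter names, listed in the order of their ids.
values = beta_dict_to_list({"B_TIME": -1.2, "ASC_CAR": 0.5}, ["ASC_CAR", "B_TIME"])
print(values)
```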
- bootstrap_results
Results of the bootstrap calculation.
- bootstrap_time
Time needed to calculate the bootstrap standard errors
- calculateInitLikelihood()[source]
Calculate the value of the log likelihood function
The default values of the parameters are used.
- Returns:
value of the log likelihood.
- Return type:
float.
- calculateLikelihood(x, scaled, batch=None)[source]
Calculates the value of the log likelihood function
- Parameters:
x (list(float)) – vector of values for the parameters.
scaled (bool) – if True, the value is divided by the number of observations used to calculate it. In this case, the values with different sample sizes are comparable. Default: True
batch (float) – if not None, calculates the likelihood on a random sample of the data. The value of the parameter must be strictly between 0 and 1, and represents the share of the data that will be used. Default: None
- Returns:
the calculated value of the log likelihood
- Return type:
float.
- Raises:
ValueError – if the length of the list x is incorrect.
BiogemeError – if calculation with batch is requested
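The effect of the `scaled` argument can be seen in a small stand-alone sketch. It assumes that the log likelihood is the sum of the logarithms of the choice probabilities; the function `log_likelihood` is hypothetical and does not call Biogeme.

```python
import math


def log_likelihood(probabilities, scaled=True):
    """Sum of log probabilities, divided by the sample size if scaled."""
    total = sum(math.log(p) for p in probabilities)
    return total / len(probabilities) if scaled else total


small_sample = [0.5, 0.25]
large_sample = small_sample * 10  # same observations repeated ten times

# The scaled values are comparable across sample sizes,
# the unscaled values are not.
print(log_likelihood(small_sample), log_likelihood(large_sample))
print(log_likelihood(small_sample, scaled=False),
      log_likelihood(large_sample, scaled=False))
```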
- calculateLikelihoodAndDerivatives(x, scaled, hessian=False, bhhh=False, batch=None)[source]
Calculate the value of the log likelihood function and its derivatives.
- Parameters:
x (list(float)) – vector of values for the parameters.
scaled (bool) – if True, the results are divided by the number of observations.
hessian (bool) – if True, the hessian is calculated. Default: False.
bhhh (bool) – if True, the BHHH matrix is calculated. Default: False.
batch (float) – if not None, calculates the likelihood on a random sample of the data. The value of the parameter must be strictly between 0 and 1, and represents the share of the data that will be used. Default: None
- Returns:
f, g, h, bh where
f is the value of the function (float)
g is the gradient (numpy.array)
h is the hessian (numpy.array)
bh is the BHHH matrix (numpy.array)
- Return type:
tuple float, numpy.array, numpy.array, numpy.array
- Raises:
ValueError – if the length of the list x is incorrect
BiogemeError – if the norm of the gradient is not finite, an error is raised.
BiogemeError – if calculation with batch is requested
- calculateNullLoglikelihood(avail)[source]
Calculate the log likelihood of the null model that predicts equal probability for each alternative
- Parameters:
avail (list of biogeme.expressions.Expression) – list of expressions to evaluate the availability conditions for each alternative. If None, all alternatives are always available.
- Returns:
value of the log likelihood
- Return type:
float
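For intuition, a null model with equal probabilities assigns 1/J to each of the J available alternatives, so each observation contributes -log(J). The sketch below assumes unweighted observations and availability given as rows of 0/1 flags; `null_log_likelihood` is a hypothetical helper, not the Biogeme routine.

```python
import math


def null_log_likelihood(availability_rows):
    """Each row holds one 0/1 availability flag per alternative."""
    total = 0.0
    for row in availability_rows:
        n_available = sum(1 for flag in row if flag)
        # Equal probability 1 / n_available for the chosen alternative.
        total += -math.log(n_available)
    return total


rows = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
value = null_log_likelihood(rows)
print(value)  # equals -(log 3 + log 2 + log 3)
```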
- changeInitValues(betas)[source]
Modifies the initial values of the parameters in all formulas
- Parameters:
betas (dict(string:float)) – dictionary where the keys are the names of the parameters, and the values are the new value for the parameters.
- checkDerivatives(beta, verbose=False)[source]
Verifies the implementation of the derivatives.
It compares the analytical version with the finite differences approximation.
- Parameters:
beta (list(float)) – vector of values for the parameters.
verbose (bool) – if True, the comparisons are reported. Default: False.
- Return type:
tuple.
- Returns:
f, g, h, gdiff, hdiff where
f is the value of the function,
g is the analytical gradient,
h is the analytical hessian,
gdiff is the difference between the analytical and the finite differences gradient,
hdiff is the difference between the analytical and the finite differences hessian.
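The comparison behind gdiff can be illustrated on a toy function. This is a sketch under assumptions: a forward-difference scheme with a fixed step is used here, which may differ from the scheme used by Biogeme, and the functions `f`, `analytical_gradient` and `finite_difference_gradient` are defined only for the example.

```python
def f(x):
    """Toy function standing in for the log likelihood."""
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def analytical_gradient(x):
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


def finite_difference_gradient(func, x, step=1e-6):
    """Forward differences: (f(x + step * e_i) - f(x)) / step."""
    fx = func(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += step
        grad.append((func(shifted) - fx) / step)
    return grad


point = [1.0, 2.0]
g = analytical_gradient(point)
gdiff = [a - n for a, n in zip(g, finite_difference_gradient(f, point))]
print(g, gdiff)
```

Entries of gdiff that are large compared to the step size would point to an error in the analytical derivatives.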
- confidenceIntervals(betaValues, intervalSize=0.9)[source]
Calculate confidence intervals on the simulated quantities
- Parameters:
betaValues (list(dict(str: float))) – list of parameter values to be used in the calculations. Typically, it is a sample drawn from a distribution.
intervalSize (float) – size of the reported confidence interval, expressed as a fraction between 0 and 1. If it is denoted by s, the interval is calculated for the quantiles (1-s)/2 and (1+s)/2. The default (0.9) corresponds to the quantiles 0.05 and 0.95.
- Returns:
two pandas data frames ‘left’ and ‘right’ with the same dimensions. Each row corresponds to a row in the database, and each column to a formula. ‘left’ contains the left value of the confidence interval, and ‘right’ the right value
Example:
import biogeme.results as res
# Read the estimation results from a file
results = res.bioResults(pickleFile='myModel.pickle')
# Retrieve the names of the beta parameters that have been estimated
betas = biogeme.freeBetaNames()
# Draw 100 realizations of the distribution of the estimators
b = results.getBetasForSensitivityAnalysis(betas, size=100)
# Simulate the formulas using the nominal values
betaValues = results.getBetaValues()
simulatedValues = biogeme.simulate(betaValues)
# Calculate the confidence intervals for each formula
left, right = biogeme.confidenceIntervals(b, 0.9)
- Return type:
tuple of two Pandas dataframes.
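The mapping from the interval size s to the quantiles (1-s)/2 and (1+s)/2 can be sketched without Biogeme. The example assumes that one simulated value is available per draw, and it uses a simple linear interpolation between order statistics, which is not necessarily the quantile rule applied by the library. Both helper functions are hypothetical.

```python
def interval_quantile_levels(interval_size=0.9):
    """Quantile levels (1 - s) / 2 and (1 + s) / 2 for an interval of size s."""
    return (1.0 - interval_size) / 2.0, (1.0 + interval_size) / 2.0


def empirical_quantile(sorted_values, level):
    """Linear interpolation between the two closest order statistics."""
    position = level * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


# Hypothetical simulated values of one formula for one row, one per draw.
draws = sorted(float(i) for i in range(101))
low_level, high_level = interval_quantile_levels(0.9)
left = empirical_quantile(draws, low_level)
right = empirical_quantile(draws, high_level)
print(low_level, high_level, left, right)
```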
- database
biogeme.database.Database object
- property dogleg
getter for the parameter
- drawsProcessingTime
Time needed to generate the draws.
- property enlarging_factor
getter for the parameter
- estimate(recycle=False, bootstrap=0, **kwargs)[source]
Estimate the parameters of the model(s).
- Parameters:
recycle (bool) – if True, the results are read from the pickle file, if it exists. If False, the estimation is performed.
bootstrap (int) – number of bootstrap resamples used to calculate the variance-covariance matrix using bootstrapping. If the number is 0, bootstrapping is not applied. Default: 0.
- Returns:
object containing the estimation results.
- Return type:
biogeme.results.bioResults
Example:
import biogeme.biogeme as bio
# Create an instance of biogeme (database and logprob are assumed to be defined)
biogeme = bio.BIOGEME(database, logprob)
# Give a name to the model
biogeme.modelName = 'mymodel'
# Estimate the parameters
results = biogeme.estimate()
- Raises:
BiogemeError – if no expression has been provided for the likelihood
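The idea behind the bootstrap argument can be sketched on a toy estimator. The example below is an assumption-laden illustration: it resamples a list of numbers with replacement and re-computes a sample mean, whereas Biogeme re-estimates the full model on each resample. The function `bootstrap_variance` is hypothetical.

```python
import random
import statistics


def bootstrap_variance(data, estimator, n_bootstrap, seed=0):
    """Variance of the estimator across resamples drawn with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_bootstrap):
        # Each resample has the same size as the original data.
        resample = [rng.choice(data) for _ in data]
        estimates.append(estimator(resample))
    return statistics.variance(estimates)


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
var_of_mean = bootstrap_variance(data, statistics.mean, n_bootstrap=200)
print(var_of_mean)
```

With several parameters, the same procedure yields a variance-covariance matrix instead of a single variance.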
- estimate_catalog(selected_configurations=None, quick_estimate=False, recycle=False, bootstrap=0)[source]
Estimate all or selected versions of a model with Catalogs, corresponding to multiple specifications.
- Parameters:
selected_configurations (set(biogeme.pareto.SetElement)) – set of configurations. If None, all configurations are considered.
quick_estimate (bool) – if True, the final statistics are not calculated.
recycle (bool) – if True, the results are read from the pickle file, if it exists. If False, the estimation is performed.
bootstrap (int) – number of bootstrap resamples used to calculate the variance-covariance matrix using bootstrapping. If the number is 0, bootstrapping is not applied. Default: 0.
- Returns:
object containing the estimation results associated with the name of each specification, as well as a description of each configuration
- Return type:
dict(str: bioResults)
- files_of_type(extension, all_files=False)[source]
Identify the list of files with a given extension in the local directory
- Parameters:
extension (str) – extension of the requested files (without the dot): ‘pickle’, or ‘html’
all_files (bool) – if all_files is False, only files containing the name of the model are identified. If all_files is True, all files with the requested extension are identified.
- Returns:
list of files with the requested extension.
- Return type:
list(str)
- formulas
Dictionary containing Biogeme formulas of type biogeme.expressions.Expression. The keys are the names of the formulas.
- freeBetaNames()[source]
Returns the names of the parameters that must be estimated
- Returns:
list of names of the parameters
- Return type:
list(str)
- property generateHtml
Boolean variable, True if the HTML file with the results must be generated.
- property generatePickle
Boolean variable, True if the PICKLE file with the results must be generated.
- property generate_html
Boolean variable, True if the HTML file with the results must be generated.
- property generate_pickle
Boolean variable, True if the PICKLE file with the results must be generated.
- getBoundsOnBeta(betaName)[source]
Returns the bounds on the parameter as defined by the user.
- Parameters:
betaName (string) – name of the parameter
- Returns:
lower bound, upper bound
- Return type:
tuple
- Raises:
BiogemeError – if the name of the parameter is not found.
- property identification_threshold
Threshold for the eigenvalue to trigger an identification warning
- property infeasible_cg
getter for the parameter
- initLogLike
Init value of the likelihood function
- property initial_radius
getter for the parameter
- lastSample
keeps track of the sample of data used to calculate the stochastic gradient / hessian
- likelihoodFiniteDifferenceHessian(x)[source]
Calculate the hessian of the log likelihood function using finite differences.
May be useful when the analytical hessian has numerical issues.
- Parameters:
x (list(float)) – vector of values for the parameters.
- Returns:
finite differences approximation of the hessian.
- Return type:
numpy.array
- Raises:
ValueError – if the length of the list x is incorrect
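A finite differences Hessian can be sketched as follows. The example assumes a central-difference scheme with a fixed step, which may differ from the scheme implemented in Biogeme, and a concave quadratic function stands in for the log likelihood. All names are hypothetical.

```python
def finite_difference_hessian(func, x, step=1e-4):
    """Central differences for each second-order partial derivative."""
    n = len(x)
    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                # Evaluate func after moving coordinates i and j.
                y = list(x)
                y[i] += di
                y[j] += dj
                return func(y)

            hessian[i][j] = (
                shifted(step, step)
                - shifted(step, -step)
                - shifted(-step, step)
                + shifted(-step, -step)
            ) / (4.0 * step * step)
    return hessian


def toy_loglike(x):
    """Concave quadratic with Hessian [[-2, -1], [-1, -4]]."""
    return -((x[0] - 1.0) ** 2) - 2.0 * (x[1] + 0.5) ** 2 - x[0] * x[1]


h = finite_difference_hessian(toy_loglike, [0.3, 0.2])
print(h)
```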
- loglike
Object of type biogeme.expressions.Expression calculating the formula for the loglikelihood
- loglikeName
Keyword used for the name of the loglikelihood formula. Default: ‘loglike’
- loglikeSignatures
Internal signature of the formula for the loglikelihood.
- property maximum_number_catalog_expressions
Maximum number of multiple expressions when Catalogs are used.
- property maxiter
getter for the parameter
- property missingData
Code for missing data
- property missing_data
Code for missing data
- modelName
Name of the model. Default: ‘biogemeModelDefaultName’
- monteCarlo
monteCarlo is True if one of the expressions involves a Monte-Carlo integration.
- nullLogLike
Log likelihood of the null model
- property numberOfDraws
Number of draws for Monte-Carlo integration.
- property numberOfThreads
Number of threads used for parallel computing. Default: the number of available CPU.
- property number_of_draws
Number of draws for Monte-Carlo integration.
- property number_of_threads
Number of threads used for parallel computing. Default: the number of available CPU.
- property only_robust_stats
True if only the robust statistics need to be reported. If False, the statistics from the Rao-Cramer bound are also reported.
- optimizationMessages
Information provided by the optimization algorithm after completion.
- optimize(startingValues=None)[source]
Calls the optimization algorithm. The function self.algorithm is called.
- Parameters:
startingValues (list(float)) – starting point for the algorithm
- Returns:
x, messages
x is the solution generated by the algorithm,
messages is a dictionary with information about the run of the algorithm
- Return type:
numpy.array, dict(str:object)
- Raises:
BiogemeError – an error is raised if no algorithm is specified.
- quickEstimate(**kwargs)[source]
Estimate the parameters of the model. Same as estimate, where any extra calculation is skipped (init loglikelihood, t-statistics, etc.)
- Returns:
object containing the estimation results.
- Return type:
biogeme.results.bioResults
Example:
import biogeme.biogeme as bio
# Create an instance of biogeme (database and logprob are assumed to be defined)
biogeme = bio.BIOGEME(database, logprob)
# Give a name to the model
biogeme.modelName = 'mymodel'
# Estimate the parameters
results = biogeme.quickEstimate()
- Raises:
BiogemeError – if no expression has been provided for the likelihood
- property saveIterations
If True, the current iterate is saved after each iteration, in a file named __[modelName].iter, where [modelName] is the name given to the model. If such a file exists, the starting values for the estimation are replaced by the values saved in the file.
- property save_iterations
Same as saveIterations, with another syntax
- property second_derivatives
getter for the parameter
- property seed_param
getter for the parameter
- setRandomInitValues(defaultBound=100.0)[source]
Modifies the initial values of the parameters in all formulas, using randomly generated values. The value is drawn from a uniform distribution on the interval defined by the bounds.
- Parameters:
defaultBound (float) – If the upper bound is missing, it is replaced by this value. If the lower bound is missing, it is replaced by the opposite of this value. Default: 100.
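The rule for missing bounds can be illustrated with a short sketch. It assumes that bounds are given as (lower, upper) pairs where None marks a missing bound; the function `random_init_values` and the parameter names are hypothetical.

```python
import random


def random_init_values(bounds, default_bound=100.0, seed=None):
    """Draw each value uniformly between its (possibly defaulted) bounds."""
    rng = random.Random(seed)
    values = {}
    for name, (lower, upper) in bounds.items():
        if lower is None:
            # Missing lower bound: use the opposite of the default bound.
            lower = -default_bound
        if upper is None:
            # Missing upper bound: use the default bound.
            upper = default_bound
        values[name] = rng.uniform(lower, upper)
    return values


bounds = {"ASC_CAR": (None, None), "B_TIME": (None, 0.0), "MU": (1.0, 10.0)}
init = random_init_values(bounds, seed=42)
print(init)
```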
- short_names
- simulate(theBetaValues)[source]
Applies the formulas to each row of the database.
- Parameters:
theBetaValues (dict(str: float)) – values of the parameters to be used in the calculations. This argument is required: if it is None, a BiogemeError is raised.
- Returns:
a pandas data frame with the simulated value. Each row corresponds to a row in the database, and each column to a formula.
- Return type:
Pandas data frame
Example:
import biogeme.results as res
# Read the estimation results from a file
results = res.bioResults(pickleFile='myModel.pickle')
# Simulate the formulas using the nominal values
betaValues = results.getBetaValues()
simulatedValues = biogeme.simulate(betaValues)
- Raises:
BiogemeError – if the number of parameters is incorrect
BiogemeError – if theBetaValues is None.
- property steptol
getter for the parameter
- property tolerance
getter for the parameter
- userNotes
User notes
- validate(estimationResults, validationData)[source]
Perform out-of-sample validation.
The function performs the following tasks:
each slice defines a validation set (the slice itself) and an estimation set (the rest of the data),
the model is re-estimated on the estimation set,
the estimated model is applied on the validation set,
the value of the log likelihood for each observation is reported.
- Parameters:
estimationResults (biogeme.results.bioResults) – results of the model estimation based on the full data.
validationData (list(tuple(pandas.DataFrame, pandas.DataFrame))) – list of estimation and validation data sets
- Returns:
a list containing as many items as slices. Each item is the result of the simulation on the validation set.
- Return type:
list(pandas.DataFrame)
- Raises:
BiogemeError – An error is raised if the database is structured as panel data.
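The structure expected for validationData, a list of (estimation set, validation set) pairs, can be sketched with plain lists. This is an illustration only: the rows are integers instead of pandas data frames, the slicing rule is an arbitrary choice, and `split_into_slices` is a hypothetical helper.

```python
def split_into_slices(rows, n_slices):
    """Return a list of (estimation_set, validation_set) pairs."""
    slices = [rows[i::n_slices] for i in range(n_slices)]
    pairs = []
    for k, validation in enumerate(slices):
        # The estimation set is everything outside of the current slice.
        estimation = [row for j, s in enumerate(slices) if j != k for row in s]
        pairs.append((estimation, validation))
    return pairs


rows = list(range(10))
pairs = split_into_slices(rows, 5)
for estimation, validation in pairs:
    print(validation, estimation)
```

Each pair is then used once: the model is re-estimated on the first element and applied to the second.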
- weight
Object of type biogeme.expressions.Expression calculating the weight of each observation in the sample.
- weightName
Keyword used for the name of the weight formula. Default: ‘weight’
- weightSignatures
Internal signature of the formula for the weight.