PythonBiogeme: a short introduction

The package Biogeme (biogeme.epfl.ch) is designed to estimate the parameters of various models using maximum likelihood estimation. It is particularly designed for discrete choice models. In this document, we present step by step how to specify a simple model, estimate its parameters and interpret the output of the software package. We assume that the reader is already familiar with discrete choice models, and has successfully installed PythonBiogeme. This document has been written using PythonBiogeme 2.5, but should remain valid for future versions.

1 The data file

Biogeme assumes that the data file contains in its first line a list of labels corresponding to the available data, and that each subsequent line contains the exact same number of numerical data, each row corresponding to an observation. Delimiters can be tabs or spaces. The tool biopreparedata can be used to transform a file in Comma Separated Version (CSV) into the required format. The tool biocheckdata verifies if the data file complies with the required format.

The data file used for this example is swissmetro.dat. Note that the first time a data file is used by Biogeme, it is compressed and saved in binary format in a file. The name of this file is the same as the original file, preceeded by __bin_. In our example, the binary file is __bin_optima.dat. If the original text file is modifed, the binary file must be erased from the directory in order to account for the changes. The name of the file that has actually been used is reported in the output file.

Biogeme is available in two versions. BisonBiogeme is designed to estimate the parameters of a list of predetermined discrete choice models such as logit, binary probit, nested logit, cross-nested logit, multivariate extreme value models, discrete and continuous mixtures of multivariate extreme value models, models with nonlinear utility functions, models designed for panel data, and heteroscedastic models. It is based on a formal and simple language for model specification. PythonBiogeme is designed for general purpose parametric models. The specification of the model and of the likelihood function is based on an extension of the python programming language. A series of discrete choice models are precoded for an easy use.

2 The model

The model is a logit model with 3 alternatives: train, Swissmetro and car. The utility functions are defined as:

where TRAIN_TT_SCALED, TRAIN_COST_SCALED, SM_TT_SCALED, SM_COST_SCALED, CAR_TT_SCALED, CAR_CO_SCALED are variables, and ASC_TRAIN, ASC_SM, ASC_CAR, B_TIME, B_COST are parameters to be estimated. Note that it is not possible to identify all alternative specific constants ASC_TRAIN, ASC_SM, ASC_CAR from data. Consequently, ASC_SM is normalized to 0.

The availability of an alternative i is determined by the variable y_i, i=1,...3, which is equal to 1 if the alternative is available, 0 otherwise. The probability of choosing an available alternative i is given by the logit model:

3 Model specification: PythonBiogeme

The model specification file must have an extension .py. The file 01logit.py is reported in Section A.1. We describe here its content.

THe objective is to provide to PythonBiogeme the formula of the log likelihood function to maximize, using a syntax based on the Python programming language, and extended for the specific needs of Biogeme. The file can contain comments, designed to document the specification. Comments are included using the characters #, consistently with the Python syntax. All characters after this command, up to the end of the current line, are ignored by PythonBiogeme. In our example, the file starts with comments describing the name of the file, its author and the date when it was created. A short description of its content is also provided.

These comments are completely ignored by PythonBiogeme. However, it is recommended to use many comments to describe the model specification, for future reference, or to help other persons to understand the specification.

The specification file must start by loading the Python libraries needed by PythonBiogeme. Two libraries are mandatory biogeme and headers. The first includes the extension of the PYthon programming language needed by PythonBiogeme. The second imports the names of the headers of the data file, so that they can be directly used in the specification of the model. In this example, an additional library is loaded as well: statistics. It implements some functions that report statistics about the data file.

The next statements use the function Beta to define the parameters to be estimated. For each parameter, the following information must be mentioned:

Note that, in Python, case sensitivity is enforced, so that varname and Varname would represent two different variables. In our example, the default value of each parameter is 0. If a previous estimation had been performed before, we could have used the previous estimates as default value. Note that, for the parameters that are estimated by PythonBiogeme, the default value is used as the starting value for the optimization algorithm. For the parameters that are not estimated, the default value is used throughout the estimation process. In our example, the parameter ASC_SM is not estimated (as specified by the 1 in the fifth argument on the corresponding line), and its value is fixed to 0. A lower bound and an upper bound must be specified. By default, we suggest to use -1000 and 1000. If the estimated value of the parameter happens to equal to one of these bounds, it is a sign that the bounds are too tight and larger values should be provided. However, most of the time, if a coefficient reaches the value 1000 or -1000, it means that its variable is poorly scaled, and that its units should be changed.

Note that none of the Python variables is used by PythonBiogeme. They are used only to simplify the writing of the formula. Therefore, nothing prevents to write

and to use car_cte later in the specification. The variable car_cte will be unknown by PythonBiogeme and will not appear in any reporting file. We strongly advise against this practice, and suggest to use the exact same name for the Python variable on the left hand side, and for the PythonBiogeme variable, appearing as the first argument of the function, as illustrated in this example.

It is possible to define new variables in addition to the variables defined in the data files. It can be done either by defining Python variables using the Python syntax:

It can also be done by defining PythonBiogeme variables, using the function DefineVariable.

The latter definition is equivalent to add a column with the specified header to the data file. It means that the value of the new variables for each observation is calculated once before the estimation starts. On the contrary, with the method based on Python variable, the calculation will be applied again and again, each time it is needed by the algorithm. For small models, it may not make any difference, and the first method may be more readable. But for models requiring a significant amount of time to be estimated, the time savings may be substantial.

When boolean expressions are involved, the value TRUE is represented by 1, and the value FALSE is represented by 0. Therefore, a multiplication involving a boolean expression is equivalent to a “AND” operator. The above code is interpreted in the following way:

Variables can be also be rescaled. For numerical reasons, it is good practice to scale the data so that the values of the estimated parameters are around 1.0. A previous estimation with the unscaled data has generated parameters around -0.01 for both cost and time. Therefore, time and cost are divided by 100.

We need to associate each utility function with the number of the alternative, using the numering convention in the data file. In this example, the convention is described in Table 1. To do this, we use a Python dictionary:

We use also a dictionary to describe the availability conditions of each alternative:

We now define the choice model. The function bioLogLogit provides the logarithm of the choice probability of the logit model. It takes three arguments:

and define the ESTIMATE variable of the BIOGEME_OBJECT with the formula of the log likelihood function:

Other variables can be defined in the BIOGEME_OBJECT. In particular, the EXCLUDE variable allows to ignore some observations in the data file. It contains a boolean expression that is evaluated for each observation in the data file. Each observation such that this expression is “true” is discarded from the sample. In our example, the modeler has developed the model only for work trips, so that every observation such that the trip purpose is not 1 or 3 is removed. Observations such that the dependent variable CHOICE is 0 are also removed. Remember the convention that “false” is represented by 0, and “true” by 1, so that the ‘*’ can be interpreted as a “and”, and the ‘+’ as a “or”. Note also that the result of the ‘+’ can be 2, so that we test is the result is equal to 0 or not. The exclude condition in our example is therefore interpreted as: either (PURPOSE different from 1 and PURPOSE different from 3), or CHOICE equal to 0.

Note that we have conveniently used an intermediary Python variable exclude in this example. It is not necessary. The above statement is completely equivalent to

The variable PARAMETERS allows to define various parameters controlling the configuration of PythonBiogeme. In this example, we have selected to use the optimization algorithm BIO using the following syntax.

The variable FORMULAS is used to select the parts of the model specification that are reported in the report file. In general, the formula of the log likelihood function is too complicated to be readable, and it is preferred to report only the specification of the utility functions, as in this example.

Finally, we request PythonBiogeme to calculate some statistics about the null log likelihood, the log likelihood of a model with constants only, and statistics about the availability of the alternatives.

The function nullLoglikelihood computes the null loglikelihood from the sample and ask PythonBiogeme to include it in the output file. The first argument is a dictionary mapping each alternative ID with its availability condition. The second is an iterator on the data file. The result is the log likelihood of a model where the choice probability for observation n is given by is 1∕J_n, where J_n is the number of available alternatives, i.e.

The function cteLoglikelihood computes the constant loglikelihood from the sample and ask PythonBiogeme to include it in the output file. It assumes that the full choice set is available for each observation. The first argument is a list containing the alternatives in the choice set. The second argument is the choice expression producing the id of the chosen alternative. The third argument is an iterator on the data file. The result is the log likelihood of a logit model where the only parameters are the alternative specific constants. If n_i is the number of times alternative i is chosen, then it is given by

The function availabilityStatistics computes the number of times each alternative is declared available in the data set and ask PythonBiogeme to include it in the output file. The first argument is a dictionary containing for each alternative the expression for its availability. The second is an iterator on the data file. The result is a dictionary D with an entry D[i] for each alternative i containing the number of times it is available.

4 Running PythonBiogeme

In order to avoid erasing previously generated results, the name of the files may vary from one run to the next. Therefore, PythonBiogeme explicitly mentions the name of the main files that have been generated.

5 PythonBiogeme: the report file

The report file generated by PythonBiogeme gathers various information about the result of the estimation. First, some information about the version of Biogeme, and some links to relevant URLs is provided. Next, the name of the report file and the sample file are reported.

If some formulas have been requested to be reported, it is done in the next section. After is a list of statistics requested in the model specification file. The estimation report follows, including

The following section reports the estimates of the parameters of the utility function, together with some statistics. For each parameter β_k, the following is reported:

The final line reports the value of the smallest singular value of the second derivatives matrix. A value close to zero is a sign of singularity, that may be due to a lack of variation in the data or an unidentified model.

A Complete specification file

A.1 01logit.py

1########################################
2#
3# @file 01logit.py
4# @author: Michel Bierlaire, EPFL
5# @date: Wed Dec 21 13:23:27 2011
6#
7# Logit model
8# Three alternatives: Train, Car and Swissmetro
9# SP data
10#
11#######################################
12
13from biogeme import *
14from headers import *
15from statistics import *
16
17#Parameters to be estimated
18# Arguments:
19#   - 1  Name for report; Typically, the same as the variable.
20#   - 2  Starting value.
21#   - 3  Lower bound.
22#   - 4  Upper bound.
23#   - 5  0: estimate the parameter, 1: keep it fixed.
24#
25ASC_CAR = Beta(’ASC_CAR’,0,-1000,1000,0,’Car cte.’)
26ASC_TRAIN = Beta(’ASC_TRAIN’,0,-1000,1000,0,’Train cte.’)
27ASC_SM = Beta(’ASC_SM’,0,-1000,1000,1,’Swissmetro cte.’)
28B_TIME = Beta(’B_TIME’,0,-1000,1000,0,’Travel time’)
29B_COST = Beta(’B_COST’,0,-1000,1000,0,’Travel cost’)
30
31# Utility functions
32
33#If the person has a GA (season ticket) her incremental cost
34#is actually 0 rather than the cost value gathered from the
35# network data.
36
37SM_COST =  SM_CO   * (  GA   ==  0  )
38TRAIN_COST =  TRAIN_CO   * (  GA   ==  0  )
39
40# For numerical reasons, it is good practice to scale the data to
41# that the values of the parameters are around 1.0.
42# A previous estimation with the unscaled data has generated
43# parameters around -0.01 for both cost and time. Therefore, time
44# and cost are divided by 100.
45
46# The following statements are designed to preprocess the data.
47# It is like creating a new columns in the data file. This should
48# be preferred to the statement like
49# TRAIN_TT_SCALED = TRAIN_TT / 100.0
50# which causes the division to be reevaluated again and again,
51# throuh the iterations. For models taking a long time to
52# estimate, it may make a significant difference.
53
54TRAIN_TT_SCALED = DefineVariable(’TRAIN_TT_SCALED’,\
55   TRAIN_TT / 100.0)
56TRAIN_COST_SCALED = DefineVariable(’TRAIN_COST_SCALED’,\
57   TRAIN_COST / 100)
58SM_TT_SCALED = DefineVariable(’SM_TT_SCALED’, SM_TT / 100.0)
59SM_COST_SCALED = DefineVariable(’SM_COST_SCALED’, SM_COST / 100)
60CAR_TT_SCALED = DefineVariable(’CAR_TT_SCALED’, CAR_TT / 100)
61CAR_CO_SCALED = DefineVariable(’CAR_CO_SCALED’, CAR_CO / 100)
62
63V1 = ASC_TRAIN + \
64     B_TIME * TRAIN_TT_SCALED + \
65     B_COST * TRAIN_COST_SCALED
66V2 = ASC_SM + \
67     B_TIME * SM_TT_SCALED + \
68     B_COST * SM_COST_SCALED
69V3 = ASC_CAR + \
70     B_TIME * CAR_TT_SCALED + \
71     B_COST * CAR_CO_SCALED
72
73# Associate utility functions with the numbering of alternatives
74V = {1: V1,
75     2: V2,
76     3: V3}
77
78# Associate the availability conditions with the alternatives
79CAR_AV_SP =  DefineVariable(’CAR_AV_SP’,CAR_AV  * (  SP   !=  0  ))
80TRAIN_AV_SP =  DefineVariable(’TRAIN_AV_SP’,TRAIN_AV  * (  SP   !=  0  ))
81
82av = {1: TRAIN_AV_SP,
83      2: SM_AV,
84      3: CAR_AV_SP}
85
86# The choice model is a logit, with availability conditions
87logprob = bioLogLogit(V,av,CHOICE)
88
89# Defines an itertor on the data
90rowIterator(’obsIter’)
91
92# DEfine the likelihood function for the estimation
93BIOGEME_OBJECT.ESTIMATE = Sum(logprob,’obsIter’)
94
95# All observations verifying the following expression will not be
96# considered for estimation
97# The modeler here has developed the model only for work trips.
98# Observations such that the dependent variable CHOICE is 0
99# are also removed.
100exclude = (( PURPOSE != 1 ) * (  PURPOSE   !=  3  ) + \
101   ( CHOICE == 0 )) > 0
102
103BIOGEME_OBJECT.EXCLUDE = exclude
104
105
106BIOGEME_OBJECT.PARAMETERS[’optimizationAlgorithm’] = "IPOPT"
107BIOGEME_OBJECT.PARAMETERS[’biogemeDisplay’] = "3"
108
109BIOGEME_OBJECT.FORMULAS[’Train utility’] = V1
110BIOGEME_OBJECT.FORMULAS[’Swissmetro utility’] = V2
111BIOGEME_OBJECT.FORMULAS[’Car utility’] = V3
112
113# Statistics
114
115nullLoglikelihood(av,’obsIter’)
116choiceSet = [1,2,3]
117cteLoglikelihood(choiceSet,CHOICE,’obsIter’)
118availabilityStatistics(av,’obsIter’)
119
120#BIOGEME_OBJECT.PARAMETERS[’printGradient’] = "1"
121#BIOGEME_OBJECT.PARAMETERS[’forceScientificNotation’] = "0"
122
123#BIOGEME_OBJECT.PARAMETERS[’precisionParameters’] = "3"
124#BIOGEME_OBJECT.PARAMETERS[’precisionStatistics’] = "3"
125#BIOGEME_OBJECT.PARAMETERS[’precisionTStats’] = "14"
126
127#BIOGEME_OBJECT.PARAMETERS[’bootstrapStdErr’] = "100"

B Estimation of the variance-covariance matrix

Under relatively general conditions, the asymptotic variance-covariance matrix of the maximum likelihood estimates of the vector of parameters θ ∈ℝ^K is given by the Cramer-Rao bound

The term in square brackets is the matrix of the second derivatives of the log likelihood function with respect to the parameters evaluated at the true parameters. Thus the entry in the kth row and the ℓth column is

Since we do not know the actual values of the parameters at which to evaluate the second derivatives, or the distribution of x_in and x_jn over which to take their expected value, we estimate the variance-covariance matrix by evaluating the second derivatives at the estimated parameters

and the sample distribution of x_in and x_jn instead of their true distribution. Thus we use

Denote this matrix as Â. Note that, from the second order optimality conditions of the optimization problem, this matrix is negative semi-definite, which is the algebraic equivalent of the local concavity of the log likelihood function. If the maximum is unique, the matrix is negative definite, and the function is locally strictly concave.

If the matrix Â is negative definite then -Â is invertible and the Cramer-Rao bound is positive definite.

Another consistent estimator of the (negative of the) second derivatives matrix can be obtained by the matrix of the cross-products of first derivatives as follows:

is the gradient vector of the likelihood of observation n. This approximation is employed by the BHHH algorithm, from the work by Berndt et al. (1974). Therefore, an estimate of the variance-covariance matrix is given by

although it is rarely used. Instead,

is used to derive a third consistent estimator of the variance-covariance matrix of the parameters, defined as

It is called the robust estimator, or sometimes the sandwich estimator, due to the form of equation (19). Biogeme reports statistics based on both the Cramer-Rao estimate (15) and the robust estimate (19).

When the true likelihood function is maximized, these estimators are asymptotically equivalent, and the Cramer-Rao bound should be preferred (Kauermann and Carroll, 2001). When other consistent estimators are used, the robust estimator must be used (White, 1982). Consistent non-maximum likelihood estimators, known as pseudo maximum likelihood estimators, are often used when the true likelihood function is unknown or difficult to compute. In such cases, it is often possible to obtain consistent estimators by maximizing an objective function based on a simplified probability distribution.