PYTHON BIOGEME Walkthrough
Python Biogeme is a version of Biogeme based on the Python language (see python.org). It lets the user write the model and the likelihood function explicitly, which makes it possible to estimate a wide variety of models, including some that are not available in the traditional Biogeme software. The development of Python Biogeme was motivated by the need to estimate discrete choice models with latent classes and latent variables. However, it is (in principle) able to estimate the parameters of any model using maximum likelihood estimation.
In order to introduce the syntax of Python Biogeme, we explain in detail an example where a logit model with 3 alternatives is estimated. Two files are necessary to run the example: the data file and the model specification file, both described below.
The model
The model is a logit model with 3 alternatives. The utility functions are defined as:
V_1 = V_TRAIN = ASC_TRAIN + B_TIME * TRAIN_TT_SCALED + B_COST * TRAIN_COST_SCALED
V_2 = V_SM = ASC_SM + B_TIME * SM_TT_SCALED + B_COST * SM_COST_SCALED
V_3 = V_CAR = ASC_CAR + B_TIME * CAR_TT_SCALED + B_COST * CAR_CO_SCALED
where TRAIN_TT_SCALED, TRAIN_COST_SCALED, SM_TT_SCALED, SM_COST_SCALED, CAR_TT_SCALED, CAR_CO_SCALED are variables, and ASC_TRAIN, ASC_SM, ASC_CAR, B_TIME, B_COST are parameters to be estimated. Note that it is not possible to identify all alternative specific constants ASC_TRAIN, ASC_SM, ASC_CAR from data. Consequently, ASC_SM is normalized to 0.
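The need for a normalization can be checked numerically. The following sketch is written in plain Python, not in Biogeme syntax, and uses hypothetical utility values. It shows that adding the same constant to all three utilities (as would happen if all three alternative specific constants were increased together) leaves the logit probabilities unchanged, which is why one constant has to be fixed.

```python
import math

def logit_probabilities(utilities):
    """Return logit choice probabilities for a dict {alternative: utility}."""
    denominator = sum(math.exp(v) for v in utilities.values())
    return {i: math.exp(v) / denominator for i, v in utilities.items()}

# Hypothetical utility values, for illustration only.
V = {1: -0.5, 2: 0.0, 3: 0.3}

# Shift every utility by the same constant, as if all ASCs were increased by 2.
shift = 2.0
V_shifted = {i: v + shift for i, v in V.items()}

p = logit_probabilities(V)
p_shifted = logit_probabilities(V_shifted)

# Largest difference between the two sets of probabilities (rounding error only).
max_difference = max(abs(p[i] - p_shifted[i]) for i in V)
```

Since the data cannot distinguish between the two sets of constants, fixing one of them (here ASC_SM) removes the ambiguity.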
The availability of an alternative i is determined by the variable av_i, i = 1, ..., 3, which is equal to 1 if the alternative is available, 0 otherwise. The probability of choosing an available alternative i is given by the logit model:
P(i) = exp(V_i) / (av_1 exp(V_1) + av_2 exp(V_2) + av_3 exp(V_3)).
Given a sample of N observations, the loglikelihood of the sample is
L = Σ_n log P(i_n),
where i_n is the alternative actually chosen by individual n.
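As an illustration of these two formulas, here is a minimal sketch in plain Python (not Biogeme syntax). The sample, the utility values and the function name logit_probability are hypothetical and serve only to show how the availabilities enter the denominator and how the loglikelihood is accumulated over observations.

```python
import math

def logit_probability(V, av, chosen):
    """Logit probability of the chosen alternative.

    The chosen alternative is assumed to be available (av[chosen] == 1).
    """
    denominator = sum(av[i] * math.exp(V[i]) for i in V)
    return math.exp(V[chosen]) / denominator

# Hypothetical sample: utilities, availabilities and observed choice.
sample = [
    {"V": {1: -0.4, 2: 0.0, 3: 0.2}, "av": {1: 1, 2: 1, 3: 1}, "choice": 2},
    {"V": {1: -0.1, 2: 0.0, 3: 0.5}, "av": {1: 1, 2: 1, 3: 0}, "choice": 1},
]

# Loglikelihood of the sample: sum over observations of log P(i_n).
loglikelihood = sum(
    math.log(logit_probability(obs["V"], obs["av"], obs["choice"]))
    for obs in sample
)
```

In the second observation, alternative 3 is unavailable, so its utility does not contribute to the denominator.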
The data file
Biogeme assumes that the data file contains in its first line a list of labels corresponding to the available data, and that each subsequent line contains the exact same number of numerical data, each row corresponding to an observation. Delimiters can be tabs or spaces.
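The expected layout can be checked before running an estimation. The sketch below is plain Python and independent of Biogeme; the function name read_data_file and the three-column excerpt are hypothetical. It reads the labels from the first line and verifies that every subsequent row contains the same number of numerical values. Skipping blank lines is an assumption of this sketch, not a documented behavior of Biogeme.

```python
import io

def read_data_file(handle):
    """Read a whitespace-delimited file: one line of labels, then numeric rows."""
    labels = handle.readline().split()
    rows = []
    for line_number, line in enumerate(handle, start=2):
        if not line.strip():
            continue  # assumption of this sketch: blank lines are skipped
        values = [float(field) for field in line.split()]
        if len(values) != len(labels):
            raise ValueError(
                f"line {line_number}: expected {len(labels)} values, "
                f"got {len(values)}"
            )
        rows.append(dict(zip(labels, values)))
    return labels, rows

# Hypothetical excerpt, mixing tabs and spaces as delimiters.
example = io.StringIO("CHOICE\tTRAIN_TT SM_TT\n2 112\t63\n1 103 60\n")
labels, rows = read_data_file(example)
```

Calling split() without arguments treats any run of tabs or spaces as a single delimiter, which matches the two delimiters mentioned above.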
Warning: the name of a variable cannot be a reserved keyword of the Python language, such as del or class. In that case, pythonbiogeme will trigger a syntax error:
File "headers.py", line 82
del=Variable('del')
^
SyntaxError: invalid syntax
or
File "headers.py", line 101
class=Variable('class')
^
SyntaxError: invalid syntax
An easy workaround is to rename your variables _del and _class.
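Problematic labels can be detected in advance with the keyword module of the Python standard library. The sketch below is plain Python; the helper name safe_variable_name and the list of labels are hypothetical. It applies the underscore renaming suggested above to any label that clashes with a reserved keyword.

```python
import keyword

def safe_variable_name(label):
    """Prefix a label with an underscore if it is a reserved Python keyword."""
    return "_" + label if keyword.iskeyword(label) else label

# Hypothetical header labels, two of which clash with Python keywords.
labels = ["ID", "del", "class", "CHOICE"]
renamed = [safe_variable_name(label) for label in labels]
```

The renamed labels would then have to be written back to the first line of the data file.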
The data file used for this example can be downloaded here.
The model specification file
We explain here, line by line, the model specification file written in Python. It is probably useful to become familiar with the syntax of the Python language first.
Libraries
The first lines import the modules necessary to run Biogeme.
# Import modules
from biogeme import *
from headers import *
from loglikelihood import *
from statistics import *
The # sign indicates a line of comments, which is ignored.
- The module biogeme contains additions to the Python language needed to run Biogeme.
- The module headers gives access to the labels in the first line of the data file, so that their names can be used directly.
- The module loglikelihood contains definitions of standard likelihood functions.
- The module statistics contains definitions of standard statistics.
Parameters
Each parameter to be estimated must be declared using the function Beta. It takes 5 arguments:
- the name of the parameter
- the default value
- a lower bound
- an upper bound
- a flag that indicates if the parameter must be estimated (0) or if it keeps its default value (1).
The Beta function is not a built-in Python function. It is part of Biogeme. The list of all Python functions specific to Biogeme is available here.
ASC_CAR = Beta('Car cte.',0,-10,10,0)
ASC_TRAIN = Beta('Train cte.',0,-10,10,0)
ASC_SM = Beta('Swissmetro cte.',0,-10,10,1)
B_TIME = Beta('Travel time',0,-10,10,0)
B_COST = Beta('Travel cost',0,-10,10,0)
Note that the last argument of the function for ASC_SM is 1, as we want to keep it at its default value, that is 0.
Equations defining the model
The equations defining the model can be written using the Python syntax. In this example, we first define some new variables.
- If the decision maker has a GA (season ticket), her incremental cost is actually 0 rather than the cost value gathered from the network data:
SM_COST = SM_CO * ( GA == 0 )
TRAIN_COST = TRAIN_CO * ( GA == 0 )
- For numerical reasons, it is good practice to scale the data so that the values of the parameters are around 1.0. A previous estimation with the unscaled data has generated parameters around -0.01 for both cost and time. Therefore, time and cost are multiplied by 0.01.
The following statements are designed to preprocess the data. It is like creating new columns in the data file. This should be preferred to a statement like
TRAIN_TT_SCALED = TRAIN_TT / 100.0
which will cause the division to be reevaluated again and again, through the iterations. For models taking a long time to estimate, it may make a significant difference.
TRAIN_TT_SCALED = DefineVariable('TRAIN_TT_SCALED', TRAIN_TT / 100.0)
TRAIN_COST_SCALED = DefineVariable('TRAIN_COST_SCALED', TRAIN_COST / 100)
SM_TT_SCALED = DefineVariable('SM_TT_SCALED', SM_TT / 100.0)
SM_COST_SCALED = DefineVariable('SM_COST_SCALED', SM_COST / 100)
CAR_TT_SCALED = DefineVariable('CAR_TT_SCALED', CAR_TT / 100)
CAR_CO_SCALED = DefineVariable('CAR_CO_SCALED', CAR_CO / 100)
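The effect of the season ticket condition and of the scaling can be illustrated on plain numbers. The sketch below is ordinary Python, not Biogeme syntax, and the two observations are hypothetical. It relies on the fact that a comparison such as GA == 0 yields True or False, which behave as 1 and 0 when multiplied, and it computes the scaled values once per observation, before any estimation loop would start.

```python
# Hypothetical raw values for two travelers, for illustration only.
observations = [
    {"GA": 0, "SM_CO": 52.0, "SM_TT": 63.0},
    {"GA": 1, "SM_CO": 48.0, "SM_TT": 60.0},
]

for obs in observations:
    # (GA == 0) is True or False; multiplying treats it as 1 or 0,
    # so the cost is set to zero for season ticket holders.
    obs["SM_COST"] = obs["SM_CO"] * (obs["GA"] == 0)
    # Scaling is done once per observation, like adding new columns.
    obs["SM_TT_SCALED"] = obs["SM_TT"] / 100.0
    obs["SM_COST_SCALED"] = obs["SM_COST"] / 100.0
```

The second traveler holds a GA, so her scaled cost is 0 regardless of the value found in the data.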
V1 = ASC_TRAIN + B_TIME * TRAIN_TT_SCALED + B_COST * TRAIN_COST_SCALED
V2 = ASC_SM + B_TIME * SM_TT_SCALED + B_COST * SM_COST_SCALED
V3 = ASC_CAR + B_TIME * CAR_TT_SCALED + B_COST * CAR_CO_SCALED
Note that all Python variables used in these expressions must have been defined before, in three possible ways:
- as a Biogeme variable, in the header of the data file,
- as a parameter, using the Beta function,
- as a Python variable.
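We now associate each utility function with its alternative using a Python dictionary. In each entry, the value on the left of the colon (:) is the identifier of the associated alternative. The expression on the right-hand side describes the associated utility function.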
V = {1: V1,
2: V2,
3: V3}
Note that we could have merged the two previous steps and written
V = {1: ASC_TRAIN + B_TIME * TRAIN_TT_SCALED + B_COST * TRAIN_COST_SCALED,
2: ASC_SM + B_TIME * SM_TT_SCALED + B_COST * SM_COST_SCALED,
3: ASC_CAR + B_TIME * CAR_TT_SCALED + B_COST * CAR_CO_SCALED}
We use the same tool (a Python dictionary) to associate each alternative in the choice set with an availability condition.
The convention is that zero is treated as "false", and one is treated as "true". Actually, any value different from zero is considered as "true".
av = {1: TRAIN_AV_SP,
2: SM_AV,
3: CAR_AV_SP}
We illustrate two ways of coding the logit model: with the predefined function, or by writing the choice model explicitly. In practice, it is advised to use the predefined function, as it makes the estimation significantly faster. Here, writing the model illustrates some useful features of the syntax.
(i) Predefined function
We compute the logit probability and its logarithm:
prob = bioLogit(V,av,CHOICE)
logP = log(prob)
(ii) Writing the model
We use the Biogeme function Elem (documented here) to identify the utility of the chosen alternative, and store it in the variable Vchosen.
Vchosen = Elem(V,CHOICE)
We compute the denominator, using the shifted utilities. This uses the loop syntax of Python (documented here). Note that, in Python, the indentation of the code is very important and must be respected (see Section 2.1.8 of the Python manual).
den = 0
for i,v in V.items() :
    den += av[i] * exp(v-Vchosen)
And we finally compute the logarithm of the probability.
logP = -log(den)
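The equivalence between the shifted formulation and the logit formula can be verified numerically. The sketch below is plain Python using the math module instead of the Biogeme operators, and the utilities, availabilities and choice are hypothetical. Dividing the numerator and the denominator of the logit probability by exp(Vchosen) gives a numerator of 1, so the logarithm of the probability is minus the logarithm of the shifted denominator.

```python
import math

# Hypothetical utilities, availabilities and observed choice.
V = {1: -0.4, 2: 0.0, 3: 0.2}
av = {1: 1, 2: 1, 3: 0}
choice = 1

# Direct formula: log of exp(V_chosen) over the sum of available exp(V_i).
direct = math.log(
    math.exp(V[choice]) / sum(av[i] * math.exp(V[i]) for i in V)
)

# Shifted formula: subtract the chosen utility before exponentiating.
den = 0.0
for i, v in V.items():
    den += av[i] * math.exp(v - V[choice])
shifted = -math.log(den)
```

Both values agree up to rounding error.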
The likelihood function
We first define an iterator on the data file using the following statement. It is an object able to scan each row of the file.
rowIterator('obsIter')
Note that the name obsIter is between quotes.
The likelihood function is defined and communicated to Biogeme using the following statement.
BIOGEME_OBJECT.ESTIMATE = Sum(logP,'obsIter')
Excluding observations
We define a boolean expression that is evaluated for each observation of the data file. Each observation for which this expression is "true" is discarded from the sample. Here, the modeler has developed the model only for work trips, so observations whose PURPOSE is neither 1 nor 3 are excluded. Observations for which the dependent variable CHOICE is 0 are also removed.
exclude = (( PURPOSE != 1 ) * ( PURPOSE != 3 ) + ( CHOICE == 0 )) > 0
BIOGEME_OBJECT.EXCLUDE = exclude
Note that a valid biogeme expression is needed. The statement
BIOGEME_OBJECT.EXCLUDE = 0
will produce the following error message:
AttributeError: 'int' object has no attribute 'operatorIndex'
FATAL ERROR: [10:52:55]bioModelParser.cc:579 Unable to get the object operatorIndex
because "0" is not a valid biogeme expression. Instead, use
BIOGEME_OBJECT.EXCLUDE = Numeric(0)
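The logic of the exclusion expression can be tested on plain numbers. The sketch below is ordinary Python, not a Biogeme expression, and the (PURPOSE, CHOICE) pairs are hypothetical. The product of the two comparisons acts as a logical "and", the sum acts as a logical "or", and the final comparison with 0 turns the result back into a boolean.

```python
def is_excluded(purpose, choice):
    """Mirror of the exclusion expression, evaluated on plain numbers."""
    return ((purpose != 1) * (purpose != 3) + (choice == 0)) > 0

# Hypothetical (PURPOSE, CHOICE) pairs.
observations = [(1, 2), (3, 1), (2, 1), (1, 0)]

# Keep only the observations for which the expression is false.
kept = [obs for obs in observations if not is_excluded(*obs)]
```

The third pair is dropped because its purpose is neither 1 nor 3, and the fourth because its choice is 0.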
Computing statistics
It is possible to program Biogeme to compute some statistics. Here we compute the log likelihood of the model where all parameters are zero. In this case, each available alternative has the exact same probability 1/J_n of being chosen, where J_n is the number of alternatives available for observation n. Therefore, for each observation, the number of available alternatives J_n is first established, and the contribution to the log likelihood function is -ln(J_n). The code is:
total = 0
for i,a in av.items() :
    total += (a != 0)
nl = -Sum(log(total),'obsIter')
BIOGEME_OBJECT.STATISTICS['Null loglikelihood'] = nl
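The same quantity can be computed by hand on a small example. The sketch below is plain Python with hypothetical availability indicators for three observations. For each observation it counts the available alternatives and adds -ln(J_n) to the total.

```python
import math

# Hypothetical availability indicators for three observations.
availabilities = [
    {1: 1, 2: 1, 3: 1},
    {1: 1, 2: 1, 3: 0},
    {1: 0, 2: 1, 3: 1},
]

null_loglikelihood = 0.0
for av in availabilities:
    # J_n: number of available alternatives for this observation.
    total = sum(a != 0 for a in av.values())
    # Each available alternative has probability 1/J_n, so the
    # contribution to the log likelihood is -ln(J_n).
    null_loglikelihood += -math.log(total)
```

With one observation facing three alternatives and two facing two, the result is -(ln 3 + 2 ln 2).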
Note that a valid biogeme expression is needed. The statement
BIOGEME_OBJECT.STATISTICS['Zero'] = 0
will produce the following error message:
AttributeError: 'int' object has no attribute 'operatorIndex'
FATAL ERROR: [10:52:55]bioModelParser.cc:579 Unable to get the object operatorIndex
because "0" is not a valid biogeme expression. Instead, use
BIOGEME_OBJECT.STATISTICS['Zero'] = Numeric(0)
The text Null loglikelihood will appear in the output file. This statistic, as well as others, is standard and has already been coded in Python Biogeme. The example contains three of these statistics:
nullLoglikelihood(av,'obsIter')
choiceSet = [1,2,3]
cteLoglikelihood(choiceSet,CHOICE,'obsIter')
availabilityStatistics(av,'obsIter')
Parallel computing
If your computer has several cores, Biogeme can be executed faster using multithreading. The following statement tells Biogeme how many threads must be used. It is basically the maximum number of processors that will be used for the estimation.
BIOGEME_OBJECT.PARAMETERS['numberOfThreads'] = "2"
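To choose a sensible value, the number of cores reported by the operating system can be queried with the Python standard library. This is a sketch in plain Python, outside of the model specification file; the variable names are hypothetical, and whether using all reported cores is the best setting for a given model is an assumption to be checked in practice.

```python
import os

# os.cpu_count() may return None when the count cannot be determined.
available_cores = os.cpu_count() or 1

# The parameter value in the statement above is a string, so the
# number is converted before it is used.
number_of_threads = str(available_cores)
```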
Optimization algorithm
The optimization algorithm is determined by the following statement:
BIOGEME_OBJECT.PARAMETERS['optimizationAlgorithm'] = "CFSQP"
Valid entries are BIO, CFSQP, DONLP2 and SOLVOPT. These algorithms are documented here.
Running biogeme
If Biogeme has been installed properly, the estimation is started with the following statement:
pythonbiogeme 01logit swissmetro.dat
A lot of information is displayed on the screen, which can be ignored (if everything goes well). An output file with the results is generated in HTML format.