Estimating discrete choice models with
Biogeme
Start

If you have not used PandasBiogeme yet, here are some tips to get you started.

Install

pip install biogeme

More detailed instructions here.

Get started

Read the primer.

Watch the video.

Examples

Examples are available on Github

Users group

Post your questions on the users group: groups.google.com/d/forum/biogeme.

About

Biogeme

Biogeme is an open source Python package designed for the maximum likelihood estimation of parametric models in general, with a special emphasis on discrete choice models. It relies on Pandas, the Python Data Analysis Library.

It is developed and maintained by Prof. Michel Bierlaire, Transport and Mobility Laboratory, Ecole Polytechnique Fédérale de Lausanne, Switzerland.

Biogeme used to be a stand-alone software package, written in C++. All the material related to the previous versions of Biogeme is available on the old webpage.

What's new in Biogeme 3.2.10?

Note: versions 3.2.9 and 3.2.10 are identical. Therefore, version 3.2.9 has been removed from the official distribution platform.

New syntax for DefineVariable

DefineVariable actually defines a new column in the database. The old syntax was:

myvar = DefineVariable('myvar', x * y + 2, database)

The new syntax is:

myvar = database.DefineVariable('myvar', x * y + 2)

Likelihood ratio test
It is now possible to perform a likelihood ratio test directly from the estimation results. See documentation here. It relies on a function that can be used in more general context. See documentation here.
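As a rough illustration of what such a test computes, here is a generic sketch that does not use Biogeme's own function. The name likelihood_ratio_test, the hard-coded 5% critical values and the log likelihood values are assumptions made for this example. The statistic is minus two times the difference between the final log likelihood of the restricted model and that of the unrestricted model, and it is compared with a chi-square critical value.

```python
# Generic likelihood ratio test sketch (not Biogeme's API).
# 5% critical values of the chi-square distribution for 1 to 5 degrees of freedom.
CHI2_CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def likelihood_ratio_test(ll_restricted, k_restricted, ll_unrestricted, k_unrestricted):
    """Return (statistic, threshold, reject) at the 5% level."""
    if k_unrestricted <= k_restricted:
        raise ValueError('The unrestricted model must have more parameters.')
    statistic = -2.0 * (ll_restricted - ll_unrestricted)
    degrees_of_freedom = k_unrestricted - k_restricted
    threshold = CHI2_CRITICAL_5PCT[degrees_of_freedom]
    return statistic, threshold, statistic > threshold


# Hypothetical final log likelihoods and numbers of parameters.
stat, threshold, reject = likelihood_ratio_test(-5331.25, 4, -5315.39, 6)
print(f'LR = {stat:.2f}, threshold = {threshold}, reject restricted model: {reject}')
```

If the statistic exceeds the threshold, the restricted model is rejected in favor of the unrestricted one.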
Comparing several models
It is now possible to compile the estimation results from several models into a single data frame. See documentation here.
Automatic segmentation
It is now possible to define a parameter such that it has a different value for each segment in the population. See the example 01logitBis.py.
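The idea behind segmentation can be sketched in plain Python, independently of Biogeme: the parameter takes a reference value, plus one shift for each segment other than the reference. The function and the parameter names below are hypothetical, and this is not necessarily how Biogeme implements the feature internally.

```python
def segmented_value(reference, shifts, segment):
    """Value of a parameter for a given segment.

    reference: value for the reference segment.
    shifts: dict mapping each non-reference segment to its shift.
    segment: segment of the current observation.
    """
    return reference + shifts.get(segment, 0.0)


# Hypothetical cost coefficient segmented by trip purpose.
beta_cost_ref = -1.2
beta_cost_shifts = {'business': 0.5, 'leisure': -0.3}

for purpose in ('commute', 'business', 'leisure'):
    print(purpose, segmented_value(beta_cost_ref, beta_cost_shifts, purpose))
```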
Simulation of panel data
It is now possible to use Biogeme in simulation mode for panel data. See the following example: 13panel_simul.py.
Flattening panel data
This new feature transforms a database organized in panel mode (that is, one row per observation) into a database organized in normal mode (that is, one row per individual, and the observations of each individual across columns). See documentation here and here.
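The transformation can be pictured with the following sketch, written with plain Python dictionaries rather than with the Biogeme database object. The function flatten_panel and the column naming convention (observation index, underscore, original column name) are assumptions made for this illustration and may differ from what Biogeme produces.

```python
from collections import defaultdict


def flatten_panel(rows, id_column):
    """Turn one-row-per-observation data into one row per individual."""
    per_individual = defaultdict(list)
    for row in rows:
        per_individual[row[id_column]].append(row)

    flat_rows = []
    for individual, observations in per_individual.items():
        flat = {id_column: individual}
        for index, obs in enumerate(observations, start=1):
            for name, value in obs.items():
                if name != id_column:
                    # Hypothetical naming convention: <observation index>_<column>.
                    flat[f'{index}_{name}'] = value
        flat_rows.append(flat)
    return flat_rows


# Individual 1 has two observations, individual 2 has one.
panel = [
    {'ID': 1, 'CHOICE': 2, 'TT': 35},
    {'ID': 1, 'CHOICE': 1, 'TT': 40},
    {'ID': 2, 'CHOICE': 3, 'TT': 20},
]
for row in flatten_panel(panel, 'ID'):
    print(row)
```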
Covariance and correlation matrix of the nested and the cross-nested logit models
These new functions calculate the covariance and the correlation matrix of the error terms of a cross-nested logit model from the estimated parameters. See documentation here, here, here and here.
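For the simpler nested logit case, the textbook result is that the error terms of two alternatives in the same nest have correlation 1 - (mu / mu_m)^2, where mu is the overall scale and mu_m the nest parameter, and that alternatives in different nests are uncorrelated. The sketch below applies that formula directly. It does not call the Biogeme functions, the names are hypothetical, and the cross-nested case, which is more involved, is not covered.

```python
def nested_logit_correlation(nest_of, mu_nest, mu=1.0):
    """Correlation matrix of the error terms of a nested logit model.

    nest_of: dict mapping each alternative to the name of its nest.
    mu_nest: dict mapping each nest to its nest parameter (>= mu).
    mu: overall scale parameter, normalized to 1 by default.
    """
    alternatives = sorted(nest_of)
    matrix = []
    for i in alternatives:
        row = []
        for j in alternatives:
            if i == j:
                row.append(1.0)
            elif nest_of[i] == nest_of[j]:
                row.append(1.0 - (mu / mu_nest[nest_of[i]]) ** 2)
            else:
                row.append(0.0)
        matrix.append(row)
    return alternatives, matrix


# Hypothetical specification: bus and train share the nest 'public'.
names, corr = nested_logit_correlation(
    {'car': 'private', 'train': 'public', 'bus': 'public'},
    {'private': 1.0, 'public': 2.0},
)
print(names)
for row in corr:
    print(row)
```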
Recycling estimation results
It is now possible to skip estimation and read the estimation results from the pickle file by setting the parameter recycle=True. See the online documentation [here].
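The logic of recycling can be sketched as follows, using only the standard library. The function estimate_or_recycle, the file name and the parameter values are hypothetical; the sketch only mimics the behavior described above and is not Biogeme's implementation.

```python
import os
import pickle
import tempfile


def estimate_or_recycle(pickle_file, estimate, recycle=False):
    """Read saved results if recycle is True and the file exists; otherwise estimate and save."""
    if recycle and os.path.exists(pickle_file):
        with open(pickle_file, 'rb') as f:
            return pickle.load(f)
    results = estimate()
    with open(pickle_file, 'wb') as f:
        pickle.dump(results, f)
    return results


calls = []


def fake_estimation():
    """Stand-in for a costly estimation (hypothetical values)."""
    calls.append(1)
    return {'B_TIME': -1.28, 'B_COST': -1.08}


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'my_model.pickle')
    first = estimate_or_recycle(path, fake_estimation, recycle=True)   # file missing: estimates
    second = estimate_or_recycle(path, fake_estimation, recycle=True)  # file present: reads it

print(first == second, len(calls))
```

The estimation function is called only once: the second call reads the pickle file written by the first.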
The feature removing unused variables has been canceled.
The parameters removeUnusedVariables and displayUsedVariables in the BIOGEME constructor have been removed.
More functionalities for the mathematical expressions.
The expressions have now been designed to also be available outside of the BIOGEME class. A detailed illustration of the functionalities is available [Click here].
New syntax for the assisted specification algorithm
The new syntax involves NamedTuple to make the code more readable. Refer to the examples, such as optima.py.

Conditions of use

BIOGEME is distributed free of charge. We ask each user

Author

Biogeme has been developed by Michel Bierlaire, Ecole Polytechnique Fédérale de Lausanne, Switzerland.

Acknowledgments

I would like to thank the following persons who played various roles in the development of Biogeme along the years. The list is certainly not complete, and I apologize for those who are omitted: Alexandre Alahi, Nicolas Antille, Gianluca Antonini, Cristian Arteaga, Kay Axhausen, John Bates, Denis Bolduc, David Bunch, Andrew Daly, Anna Fernandez Antolin, Mamy Fetiarison, Mogens Fosgerau, Emma Frejinger, Carmine Gioia, Marie-Hélène Godbout, Stephane Hess, Tim Hillel, Richard Hurni, Eva Kazagli, Jasper Knockaert, Xinjun Lai, Gael Lederrey, Virginie Lurkin, Nicholas Molyneaux, Nicola Ortelli, Carolina Osorio, Meritxell Pacheco Paneque, Thomas Robin, Pascal Scheiben, Matteo Sorci, Ewout ter Hoeven, Michael Thémans, Joan Walker.

I would like to give special thanks to Moshe Ben-Akiva and Daniel McFadden for their friendship, and for the immense influence that they had and still have on my work.

Install Biogeme

Install Python

Biogeme is an open source Python package that relies on version 3 of Python. Make sure that Python 3.x is installed on your computer. If you have never used Python before, you may want to consider a complete platform such as Anaconda.

If Python is already installed on your computer, verify the version. Two versions of Python are distributed: version 2 and version 3. Biogeme works only with version 3.
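One simple way to verify the version is to ask the interpreter itself. The short script below is only a convenience sketch: it stops with an error under Python 2 and prints the version otherwise.

```python
import sys

# Biogeme requires Python 3; stop early with a clear message otherwise.
if sys.version_info.major < 3:
    raise RuntimeError('Biogeme requires Python 3.')

print('Python version:', '.'.join(str(n) for n in sys.version_info[:3]))
```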

Installing PandasBiogeme on MacOSX

Installing PandasBiogeme on Windows

Install Biogeme from pip

Biogeme is distributed using the pip package manager. There are several tutorials available on the internet such as this one or this one.

The command to install is simply

pip install biogeme

Depending on your OS and the version of Python, pip will either directly install the executable (it is called a "wheel"), or attempt to compile the package from sources.

In the latter case, it requires a proper environment to compile C++ code. In general, it is readily available on Linux, and MacOSX (if Xcode has been installed). It may be more complicated on Windows.

Biogeme on Github

The source code of Biogeme is available on GitHub. There are several tutorials available on the internet such as this one or this one.

The command to install Biogeme from source is

pip install -ve .

that must be executed in the directory containing the file setup.py.

Note that it requires a proper environment to compile C++ code. In general, it is readily available on Linux, and MacOSX (if Xcode has been installed).

On Windows,

  1. Install MSYS2.
  2. Add c:\msys64\mingw64\bin in the Windows PATH.
  3. Install using the following command:
    pip install --global-option build_ext --global-option --compiler=mingw32

Check the installation

To verify that Biogeme is correctly installed, you can print its version. To do so, execute the following commands in Python:

  • Import the package:
    import biogeme.version as ver
  • Print the version information:
    print(ver.getText())
The result should look like the following:
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import biogeme.version as ver
>>> print(ver.getText())
biogeme 3.2.9 [2022-08-19]
Version entirely written in Python
Home page: http://biogeme.epfl.ch
Submit questions to https://groups.google.com/d/forum/biogeme
Michel Bierlaire, Transport and Mobility Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL)

Getting help

Biogeme users group

If you need help, submit your questions to the users' group:

groups.google.com/d/forum/biogeme

The forum is moderated. Please keep the following in mind before posting a question:

  • Check that the same question has not already been addressed on the forum.
  • Try to submit only questions about the software.
  • Make sure to read the documentation completely and to try the examples before submitting a question.
  • Do not submit large files (typically, data files) to the forum.

Frequently asked questions

Note that versions 3.2.7 and 3.2.8 are almost identical. The description below compares them to version 3.2.6.

Assisted specification
The assisted specification algorithm by Ortelli et al. (2021) is now available.
Optimization
The optimization algorithms have been organized into two modules. The module algorithms.py contains generic optimization algorithms. The module optimization.py contains the functions that can be called directly by Biogeme [Click here for the documentation of the estimate function]. [Click here for an example.]
CFSQP
The CFSQP algorithm has been removed from the distribution.
Null log likelihood
The null log likelihood is now calculated. The null model predicts equal probability for each alternative.
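Under that definition, each observation contributes the log of one over the number of alternatives. A minimal sketch, assuming that the number of alternatives in the choice set is known for each observation (the function name is hypothetical and this is not Biogeme's code):

```python
import math


def null_log_likelihood(number_of_alternatives):
    """Log likelihood of a model giving equal probability to each alternative.

    number_of_alternatives: one entry per observation, the number of
    alternatives in the choice set of that observation.
    """
    return sum(-math.log(j) for j in number_of_alternatives)


# Three observations with three alternatives each: 3 * ln(1/3).
print(null_log_likelihood([3, 3, 3]))
```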
Saved iterations
Iterations are saved in a file with extension .iter. If the file exists, Biogeme will initialize the parameters from this file, and ignore the starting values provided. To turn this feature off, set biogeme.saveIterations=False.
Random starting values
It is possible to modify the initial values of the parameters in all formulas, using randomly generated values. Each value is drawn from a uniform distribution on the interval defined by the bounds (by default, [-100, 100]). [Click here for the documentation].
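The following sketch shows the kind of draw described above, using the standard library only. The function name is hypothetical, and the handling of a missing bound (replaced here by the corresponding default, -100 or 100) is an assumption about what "by default" means.

```python
import random


def random_starting_values(bounds, default=(-100.0, 100.0), seed=None):
    """Draw each value uniformly between its bounds.

    bounds: dict mapping a parameter name to (lower, upper); None means
    that the bound is missing and the default is used instead.
    """
    rng = random.Random(seed)
    values = {}
    for name, (lower, upper) in bounds.items():
        low = default[0] if lower is None else lower
        high = default[1] if upper is None else upper
        values[name] = rng.uniform(low, high)
    return values


# Hypothetical parameters: an unbounded coefficient and a bounded nest parameter.
print(random_starting_values({'B_TIME': (None, None), 'MU': (1.0, 10.0)}, seed=42))
```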
Sensitivity analysis
The betas for sensitivity analysis are now generated by bootstrapping. [Click here for the documentation].
Box-Cox
The implementation of the Box-Cox transform was incorrect and has been corrected.
Validation
The out-of-sample validation has been improved. [Click here for the documentation]. It has to be combined with the split function of the database object.
Statistics about chosen alternatives
It is now possible to calculate the number of times each alternative is chosen and available in the sample. [Click here for the documentation].
Validity check for the nests
The validity of the specification of the nests for nested and cross-nested logit models is now checked.
ALOGIT file
Output files in F12 format compatible with ALOGIT can now be produced. [Click here for the documentation].
Likelihood ratio test
A function to perform the likelihood ratio test has been implemented. [Click here for the documentation].

Optimization
New optimization algorithms are available for estimation. See the documentation of the estimate function, and the optimization module. See also an example.
Stochastic log likelihood
It is now possible to calculate the log likelihood function on a sample (a batch) of the full data file. This is particularly useful with large databases. It can be used in the implementation of a stochastic gradient algorithm, for instance. See documentation.
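To illustrate the principle, the sketch below evaluates the log likelihood of a small binary logit model on the full sample and on a random batch. It is a stand-alone example: the model, the data and the function names are hypothetical, and Biogeme's own batch mechanism is not used.

```python
import math
import random


def logit_log_likelihood(observations, beta):
    """Log likelihood of a binary logit model with utilities beta * x."""
    total = 0.0
    for x_chosen, x_other in observations:
        # P(chosen) = 1 / (1 + exp(beta * (x_other - x_chosen)))
        total += -math.log1p(math.exp(beta * (x_other - x_chosen)))
    return total


def batch_log_likelihood(observations, beta, batch_fraction, seed=None):
    """Log likelihood evaluated on a random batch of the sample."""
    rng = random.Random(seed)
    size = max(1, round(batch_fraction * len(observations)))
    batch = rng.sample(observations, size)
    return logit_log_likelihood(batch, beta)


# Hypothetical data: (attribute of the chosen alternative, attribute of the other one).
data = [(10.0, 12.0), (8.0, 7.5), (15.0, 20.0), (9.0, 9.5), (11.0, 10.0)]
print(logit_log_likelihood(data, -0.1))
print(batch_log_likelihood(data, -0.1, batch_fraction=0.4, seed=1))
```

A stochastic gradient algorithm would draw a new batch at each iteration instead of using the full sample.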
User's notes
It is possible to include your own notes in the HTML file using the userNotes parameter of the biogeme object. See documentation. See example.
Scaling
It is possible to have Biogeme suggest the scales of the variables in the database using the suggestScales parameter of the biogeme object. See documentation.
Estimation
A new function quickEstimate performs the estimation of the parameters, and skips the calculation of the statistics. See documentation.
Validation
A new function in the database module splits the database into an estimation set and a validation set, for out-of-sample validation. See documentation. It is used by the new function validate in the biogeme module. See documentation. See example.
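The kind of split involved can be sketched as follows. This is a generic illustration with a hypothetical function name, not the function of the database module: the rows are shuffled and cut into slices, and each slice serves once as the validation set while the others form the estimation set.

```python
import random


def split_for_validation(rows, slices, seed=None):
    """Return a list of (estimation, validation) pairs, one per slice."""
    if slices < 2:
        raise ValueError('At least two slices are needed.')
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::slices] for i in range(slices)]
    pairs = []
    for i, validation in enumerate(folds):
        estimation = [row for j, fold in enumerate(folds) if j != i for row in fold]
        pairs.append((estimation, validation))
    return pairs


# Ten hypothetical observations, identified by their index, in five slices.
pairs = split_for_validation(range(10), slices=5, seed=0)
for estimation, validation in pairs:
    print(len(estimation), len(validation))
```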
Messages
A new function extracts all the messages generated during a run. See documentation. See example. It is also possible to make the logger temporarily silent using the functions temporarySilence and resume.

In order to comply better with good programming practice in Python, the syntax to import the variable names from the data file has been modified since version 3.2.5. The file headers.py is not generated anymore. The best practice is to declare every variable explicitly:

PURPOSE = Variable('PURPOSE')
CHOICE = Variable('CHOICE')
GA = Variable('GA')
TRAIN_CO = Variable('TRAIN_CO')
CAR_AV = Variable('CAR_AV')
SP = Variable('SP')
TRAIN_AV = Variable('TRAIN_AV')
TRAIN_TT = Variable('TRAIN_TT')

If, for any reason, this explicit declaration is not desired, it is possible to replace the statement

from headers import *

by

globals().update(database.variables)

where database is the object containing the database, created as follows:

import pandas as pd
import biogeme.database as db
df = pd.read_csv('swissmetro.dat', sep='\t')
database = db.Database('swissmetro', df)

Also, in order to avoid any ambiguity, the operators used by Biogeme must be explicitly imported. For instance:

from biogeme.expressions import Beta, bioDraws, PanelLikelihoodTrajectory, MonteCarlo, log

Note that it is also possible to import all of them using the following syntax

from biogeme.expressions import *

although this is not a good Python programming practice.

If you have the results of a previous estimation, it may be a good idea to use the estimated values as a starting point for the estimation of similar models. If not, it depends on the nature of the parameters:
  • If the parameter is a coefficient (traditionally denoted by β), the value 0 is appropriate.
  • If the parameter is a nest parameter of a nested or cross-nested logit model (traditionally denoted by μ), the value 1 is appropriate. Make sure to define the lower bound of the parameter to 1.
  • If the parameter is the nest membership coefficient of a cross-nested logit model (traditionally denoted by α), the value 0.5 is appropriate. Make sure to define the lower bound to 0 and the upper bound to 1.
  • If the parameter captures the membership to a class of a latent class model, the value 0.5 is appropriate. Make sure to define the lower bound to 0 and the upper bound to 1.
  • If the parameter is the scale of an error component in a mixture of logit model (traditionally denoted by σ), the value must be sufficiently large so that the likelihood of each observation is not too close to zero. It is suggested to try first with the value one. If there are numerical issues, try a larger value, such as 10. See Section 7 in the report Estimating choice models with latent variables with PandasBiogeme for a detailed discussion.
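The suggestions above can be collected in a small table, sketched here with a NamedTuple. The class Start and the dictionary SUGGESTED are hypothetical helpers for this illustration. In a Biogeme script, such values are usually given when the parameter is declared with Beta (name, starting value, lower bound, upper bound, status).

```python
from typing import NamedTuple, Optional


class Start(NamedTuple):
    """Suggested starting value and bounds (None means no bound)."""

    value: float
    lower: Optional[float]
    upper: Optional[float]


# Suggested starting value and bounds by type of parameter, as listed above.
SUGGESTED = {
    'coefficient': Start(0.0, None, None),
    'nest parameter': Start(1.0, 1.0, None),
    'nest membership': Start(0.5, 0.0, 1.0),
    'class membership': Start(0.5, 0.0, 1.0),
    # Try 1 first; a larger value, such as 10, if there are numerical issues.
    'error component scale': Start(1.0, None, None),
}

for kind, start in SUGGESTED.items():
    print(f'{kind}: value={start.value}, bounds=[{start.lower}, {start.upper}]')
```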

Yes. It is actually the default behavior. At each iteration, Biogeme creates a file __myModel.iter. This file will be read the next time Biogeme tries to estimate the same model. If you want to turn this feature off, set the BIOGEME class variable saveIterations to False.

Yes. See example 04validation.py on Github.

If the model returns a probability 0 for the chosen alternative for at least one observation in the sample, then the likelihood is 0, and the log likelihood is minus infinity. For the sake of robustness, Biogeme assigns the value -1.797693e+308 to the log likelihood in this context.

A possible reason is when the initial value of a scale parameter is too close to zero.

But there are many other possible reasons. The best way to investigate the source of the problem is to use Biogeme in simulation mode, and report the probability of the chosen alternative for each observation. Once you have identified the problematic entries, it is easier to investigate the reason why the model returns a probability of zero.
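The value -1.797693e+308 matches the most negative floating point number that Python can represent. The sketch below reproduces that convention and shows how the problematic observations can be listed once the probability of the chosen alternative has been simulated for each observation. The probabilities and the function names are hypothetical; in practice the probabilities would come from Biogeme in simulation mode.

```python
import math
import sys


def log_likelihood(chosen_probabilities):
    """Sum of log probabilities, with the most negative float instead of minus infinity."""
    if any(p <= 0.0 for p in chosen_probabilities):
        return -sys.float_info.max
    return sum(math.log(p) for p in chosen_probabilities)


def problematic_observations(chosen_probabilities, threshold=0.0):
    """Indices of the observations whose chosen alternative has probability <= threshold."""
    return [i for i, p in enumerate(chosen_probabilities) if p <= threshold]


# Hypothetical simulated probabilities of the chosen alternative.
simulated = [0.62, 0.0, 0.35, 0.0, 0.91]
print('%e' % log_likelihood(simulated))      # prints -1.797693e+308
print(problematic_observations(simulated))   # prints [1, 3]
```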

The issue is that in Python 3.8 and newer on Windows, DLLs are loaded from trusted locations only (see this). It is necessary to add the path of the DLLs. Here is a way proposed by Facundo Storani, University of Salerno:
  • Search the DLLs folder of anaconda3. It may be similar to: C:\Users\[USER_NAME]\anaconda3\DLLs or C:\ProgramData\Anaconda3\DLLs.
  • Click the Start button, type "environment properties" into the search bar and hit Enter.
  • In the System Properties window, click "Environment Variables."
  • Select "Path" on the users' list, and modify.
  • Add the path of the DLLs folder to the list. It may be similar to: C:\Users\[USER_NAME]\anaconda3\DLLs or C:\ProgramData\Anaconda3\DLLs.
(credit: Facundo Storani)

On Mac OSX, the following error is sometimes generated:
ImportError:
dlopen(/Users/~/anaconda3/lib/python3.6/site-packages/biogeme/cbiogeme.cpython-36m-darwin.so,
2): Symbol not found:
__ZNSt15__exception_ptr13exception_ptrD1Ev

It is likely to be due to a conflict of versions of Python packages. The best way to deal with it is to reinstall Biogeme using the following steps:

  • First, make sure that you have the latest version of pip:
    pip install --upgrade pip
    
  • Uninstall biogeme:
    pip uninstall biogeme
    
  • Install cython:
    pip install --upgrade cython
    
  • Reinstall biogeme, without using the cache:
    pip install biogeme --no-cache-dir
    
If it does not work, try first to install gcc:
conda install gcc
If it does not work, try creating a new conda environment:
conda create -n python310 python=3.10 pip
conda activate python310
pip install biogeme
If it does not work... I don't know :-(

On Mac OSX and Windows, the procedure is designed to install from binaries, not sources. If you get messages that look like the following, it means that pip is trying to compile from sources. And it will most certainly fail as the environment must be properly configured.
Running setup.py install for biogeme ... error
Complete output from command
c:\users\willi\anaconda3\python.exe -u -c "import setuptools,
tokenize;
__file__='C:\Users\willi\AppData\Local\Temp\pip-install-iaflhasr\biogeme\setup.py';
f=getattr(tokenize, 'open', open)(__file__);
code=f.read().replace('\r\n', '\n');
f.close();
exec(compile(code, __file__, 'exec'))" install --record C:\Users\willi\AppData\Local\Temp\pip-record-v6_zn0ff\install-record.txt --single-version-externally-managed --compile:
Using Cython
Please put "# distutils: language=c++" in your .pyx or .pxd file(s)
running install
It means that there are no binaries available for your version of Python. To check which versions are supported, go to the repository

pypi.org/project/biogeme/

For instance, the following files are available for version 3.2.10:

biogeme-3.2.10.tar.gz
biogeme-3.2.10-cp310-cp310-win_amd64.whl
biogeme-3.2.10-cp310-cp310-macosx_10_9_x86_64.whl
biogeme-3.2.10-cp39-cp39-win_amd64.whl
biogeme-3.2.10-cp39-cp39-macosx_10_9_x86_64.whl
biogeme-3.2.10-cp38-cp38-win_amd64.whl
biogeme-3.2.10-cp38-cp38-macosx_10_9_x86_64.whl
biogeme-3.2.10-cp37-cp37m-win_amd64.whl
biogeme-3.2.10-cp37-cp37m-macosx_10_9_x86_64.whl
biogeme-3.2.10-cp36-cp36m-macosx_10_9_x86_64.whl
It means that you can use Python 3.7, 3.8, 3.9 and 3.10 on both platforms, while the version for Python 3.6 is only available on MacOSX.
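The same conclusion can be derived mechanically from the file names, since a wheel is named name-version-pythontag-abitag-platform.whl. The sketch below parses the list given above; the function name is hypothetical.

```python
from collections import defaultdict

# File names listed above for version 3.2.10.
FILES = [
    'biogeme-3.2.10.tar.gz',
    'biogeme-3.2.10-cp310-cp310-win_amd64.whl',
    'biogeme-3.2.10-cp310-cp310-macosx_10_9_x86_64.whl',
    'biogeme-3.2.10-cp39-cp39-win_amd64.whl',
    'biogeme-3.2.10-cp39-cp39-macosx_10_9_x86_64.whl',
    'biogeme-3.2.10-cp38-cp38-win_amd64.whl',
    'biogeme-3.2.10-cp38-cp38-macosx_10_9_x86_64.whl',
    'biogeme-3.2.10-cp37-cp37m-win_amd64.whl',
    'biogeme-3.2.10-cp37-cp37m-macosx_10_9_x86_64.whl',
    'biogeme-3.2.10-cp36-cp36m-macosx_10_9_x86_64.whl',
]


def platforms_per_python(files):
    """Map each Python tag (e.g. 'cp310') to the set of systems with a wheel."""
    support = defaultdict(set)
    for name in files:
        if not name.endswith('.whl'):
            continue  # source archive, not a binary
        _, _, python_tag, _, platform_tag = name[: -len('.whl')].split('-')
        # Keep only the system part of the platform tag: 'win' or 'macosx'.
        support[python_tag].add(platform_tag.split('_')[0])
    return dict(support)


support = platforms_per_python(FILES)
both = sorted(tag for tag, systems in support.items() if systems == {'win', 'macosx'})
mac_only = sorted(tag for tag, systems in support.items() if systems == {'macosx'})
print('Wheels for both platforms:', both)
print('MacOSX only:', mac_only)
```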

Documentation

My first choice model with PandasBiogeme

Resources

EPFL Winter Course

Click here for information about the course

EPFL proposes a 5-day short course entitled "Discrete Choice Analysis: Predicting Individual Behavior and Market Demand". It is organized every year in March (occasionally in February).

Content:

  1. Fundamental methodology, e.g. the foundations of individual choice modeling, random utility models, discrete choice models (binary, multinomial, nested, cross-nested logit models, MEV models, probit models, and hybrid choice models such as logit kernel and mixed logit);
  2. Data collection issues, e.g. choice-based samples, enriched samples, stated preferences surveys, conjoint analysis, panel data;
  3. Model design issues, e.g. specification of utility functions, generic and alternative specific variables, joint discrete/continuous models, dynamic choice models;
  4. Model estimation issues, e.g. statistical estimation, testing procedures, software packages, estimation with individual and grouped data, Bayesian estimation;
  5. Forecasting techniques, e.g. aggregate predictions, sample enumeration, micro-simulation, elasticities, pivot-point predictions and transferability of parameters;
  6. Examples and case studies, including marketing (e.g., brand choice), housing (e.g., residential location), telecommunications (e.g., choice of residential telephone service), energy (e.g., appliance type), transportation (e.g., mode of travel).

Lecturers:
Prof. Moshe Ben-Akiva, Massachusetts Institute of Technology, Cambridge, Ma (USA)
Prof. Daniel McFadden, University of Southern California [Nobel Prize Laureate, 2000]
Prof. Michel Bierlaire, Ecole Polytechnique Fédérale de Lausanne, Switzerland

Online courses

An online course entitled "Introduction to Discrete Choice Models" is available on the following platforms:

MIT Summer Course

Click here for information about the course

MIT proposes a 5-day short course entitled "Discrete Choice Analysis: Predicting demand and market shares". It is organized every year in June.

Lecturer: Prof. Moshe Ben-Akiva, Massachusetts Institute of Technology, Cambridge, Ma (USA)

Other software packages

mixl
Simulated Maximum Likelihood Estimation of Mixed Logit Models for Large Datasets, by Joseph Molloy.
LARCH
LARCH: A Freeware Package for Estimating Discrete Choice Models, by Jeffrey Newman.
Apollo
Apollo: a flexible, powerful and customisable freeware package for choice model estimation and application, by Stephane Hess and David Palma.
Pylogit
PyLogit is a Python package developed by Timothy Brathwaite for performing maximum likelihood estimation of conditional logit models and similar discrete choice models.

Archives

PythonBiogeme Version 2.4

PythonBiogeme Version 2.3

PythonBiogeme Version 2.2

Data

We provide here some choice data sets that can be used for research and education.

Airline itinerary
SP data collected by Boeing Commercial Airplanes in 2004 and 2005.

Mode choice in the Netherlands
RP data collected in 1987 for the Netherlands Railways. Mode choice.

Mode choice in Switzerland
RP data collected between 2009 and 2010 for CarPostal, a public transportation operator.

Parking choice in Spain
SP data

Swissmetro
SP data involving a hypothetical mode of transportation

Telephone data
RP data collected in 1984 in Pennsylvania

London Passenger Mode Choice
81 086 trips from the London Travel Demand Survey from April 2012 to March 2015

History of Biogeme

Several versions of Biogeme have been developed over the years. Several names of animals appear: Gnu, Bison, Python, and now, Pandas.