class Database: examples of use of each function

This page is for programmers who need usage examples for the functions of the Database class. The examples only illustrate the syntax and do not correspond to any meaningful model. For examples of models, visit biogeme.epfl.ch.

In [1]:
import datetime
print(datetime.datetime.now())
2019-12-29 21:16:33.735984
In [2]:
import biogeme.version as ver
print(ver.getText())
biogeme 3.2.5 [2019-12-29]
Version entirely written in Python
Home page: http://biogeme.epfl.ch
Submit questions to https://groups.google.com/d/forum/biogeme
Michel Bierlaire, Transport and Mobility Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL)

In [3]:
import biogeme.database as db
import pandas as pd
import numpy as np
from biogeme.expressions import Variable, exp, bioDraws
np.random.seed(90267) 

Create a database from a pandas data frame

In [4]:
df = pd.DataFrame({'Person':[1,1,1,2,2],'Exclude':[0,0,1,0,1],'Variable1':[1,2,3,4,5],'Variable2':[10,20,30,40,50],'Choice':[1,2,3,1,2],'Av1':[0,1,1,1,1],'Av2':[1,1,1,1,1],'Av3':[0,1,1,1,1]})
myData = db.Database('test',df)
print(myData)
biogeme database test:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3
0       1        0          1         10       1    0    1    0
1       1        0          2         20       2    1    1    1
2       1        1          3         30       3    1    1    1
3       2        0          4         40       1    1    1    1
4       2        1          5         50       2    1    1    1

valuesFromDatabase

Evaluates an expression for each entry of the database.

    Args:
       expression: object of type biogeme.expressions. 

    Returns: 
       pandas series, as long as the number of entries in the database, containing the calculated quantities.
In [5]:
Variable1=Variable('Variable1')
Variable2=Variable('Variable2')
expr = Variable1 + Variable2
result = myData.valuesFromDatabase(expr)
print(result)
0    11
1    22
2    33
3    44
4    55
dtype: int64
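
For a simple expression such as the one above, the same quantities can be recomputed directly with pandas, which gives a convenient cross-check. The sketch below rebuilds the two columns of the example data frame and does not call Biogeme; the name check is only illustrative.

```python
import pandas as pd

# Same two columns as in the example data frame above.
df = pd.DataFrame({'Variable1': [1, 2, 3, 4, 5],
                   'Variable2': [10, 20, 30, 40, 50]})

# Row-by-row evaluation of Variable1 + Variable2.
check = df['Variable1'] + df['Variable2']
print(check.tolist())
```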

checkAvailabilityOfChosenAlt

Check if the chosen alternative is available for each entry in the database.

    Args: 
        avail: dict of biogeme.expressions, indexed by the ids of the alternatives, to evaluate the availability conditions for each alternative.
        choice: biogeme.expressions to evaluate the chosen alternative.

    Returns:
       pandas series of bool, as long as the number of entries in the database, containing True if the chosen alternative is available, False otherwise.
In [6]:
Av1=Variable('Av1')
Av2=Variable('Av2')
Av3=Variable('Av3')
Choice=Variable('Choice')
avail = {1:Av1,2:Av2,3:Av3}
result = myData.checkAvailabilityOfChosenAlt(avail,Choice)
print(result)
0    False
1     True
2     True
3     True
4     True
dtype: bool
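
The logic of this check can be sketched in plain pandas: for each row, look up the availability column of the alternative that was chosen. This is only an illustration of the idea, not Biogeme's implementation, and the names avail_columns and chosen_available are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Choice': [1, 2, 3, 1, 2],
                   'Av1': [0, 1, 1, 1, 1],
                   'Av2': [1, 1, 1, 1, 1],
                   'Av3': [0, 1, 1, 1, 1]})

# Map each alternative id to the column holding its availability.
avail_columns = {1: 'Av1', 2: 'Av2', 3: 'Av3'}

# For each row, test the availability of the chosen alternative.
chosen_available = [
    bool(row[avail_columns[int(row['Choice'])]])
    for _, row in df.iterrows()
]
print(chosen_available)
```

Only the first observation fails the check: alternative 1 is chosen although Av1 is 0.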

sumFromDatabase

Calculates the value of an expression for each entry in the database, and returns the sum.

    Args:
        expression: object of type biogeme.expressions 

    Returns:
        Sum of the expressions over the database.
In [7]:
Variable1=Variable('Variable1')
Variable2=Variable('Variable2')
expression = Variable2 / Variable1
result = myData.sumFromDatabase(expression)
print(result)
50.0

suggestScaling

Suggest a scaling of the variables in the database

Returns: 

    A Pandas dataframe where each row contains the name of
    the variable and the suggested scale s. Ideally, the column
    should be multiplied by s.
In [8]:
myData.suggestScaling()
Out[8]:
      Column  Scale
2  Variable1   0.10
3  Variable2   0.01
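
One rule that reproduces the scales shown above is to take the power of ten that brings the largest absolute value of a column close to 1. Whether Biogeme uses exactly this rule is an assumption; the function suggested_scale below is a hypothetical helper written in plain Python.

```python
import math

columns = {
    'Person': [1, 1, 1, 2, 2],
    'Variable1': [1, 2, 3, 4, 5],
    'Variable2': [10, 20, 30, 40, 50],
}

def suggested_scale(values):
    """Hypothetical rule: power of ten bringing the largest magnitude near 1.

    Assumes that the column contains at least one nonzero value.
    """
    largest = max(abs(v) for v in values)
    return 10.0 ** (-round(math.log10(largest)))

scales = {name: suggested_scale(values) for name, values in columns.items()}
print(scales)
```

Under this rule the column Person gets a scale of 1, which would be consistent with it not being listed in the output above.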

scaleColumn

Multiply an entire column by a scale value

       Args:
          column: name of the column

          scale: value of the scale. All values of the column will
          be multiplied by that scale.
In [9]:
myData.data
Out[9]:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3
0       1        0          1         10       1    0    1    0
1       1        0          2         20       2    1    1    1
2       1        1          3         30       3    1    1    1
3       2        0          4         40       1    1    1    1
4       2        1          5         50       2    1    1    1
In [10]:
myData.scaleColumn('Variable2',0.01)
In [11]:
myData.data
Out[11]:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3
0       1        0          1        0.1       1    0    1    0
1       1        0          2        0.2       2    1    1    1
2       1        1          3        0.3       3    1    1    1
3       2        0          4        0.4       1    1    1    1
4       2        1          5        0.5       2    1    1    1

addColumn

Add a new column in the database, calculated from an expression.

    Args:
       expression: object of type biogeme.expressions describing the expression to evaluate
       column: name of the column to add.

    Returns:
       nothing

    Raises:
          ValueError: if the column name already exists.
In [12]:
Variable1=Variable('Variable1')
Variable2=Variable('Variable2')
expression = Variable2 * Variable1
result = myData.addColumn(expression,'NewVariable')
print(myData.data['NewVariable'].tolist())
[0.1, 0.4, 0.8999999999999999, 1.6, 2.5]

count

Counts the number of observations that have a specific value in a given column.

    Args:
        columnName: name of the column.
        value: value that is sought.

    Returns: 
        Number of times that the value appears in the column.
In [13]:
# Count the number of entries for individual 1.
myData.count('Person',1)
Out[13]:
3

remove

Removes from the database all entries such that the value of the expression is not 0.

    Args:
       expression: object of type biogeme.expressions describing the expression to evaluate
    Returns:
       Nothing.
In [14]:
Exclude=Variable('Exclude')
myData.remove(Exclude)
myData.data
Out[14]:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
0       1        0          1        0.1       1    0    1    0          0.1
1       1        0          2        0.2       2    1    1    1          0.4
3       2        0          4        0.4       1    1    1    1          1.6
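
The effect of remove on the example data corresponds to a boolean filter in pandas: rows where the expression is nonzero are dropped, and the remaining rows keep their original labels. The sketch below illustrates this on two columns only and does not call Biogeme.

```python
import pandas as pd

df = pd.DataFrame({'Person': [1, 1, 1, 2, 2],
                   'Exclude': [0, 0, 1, 0, 1]})

# Keep only the rows where the exclusion condition evaluates to 0.
kept = df[df['Exclude'] == 0]
print(kept.index.tolist())
```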

dumpOnFile

Dumps the database in a CSV formatted file.

    Returns:  name of the file
In [15]:
myData.dumpOnFile()
Out[15]:
'test_dumped~12.dat'
In [16]:
%%bash
cat test_dumped.dat
__rowId	Person	Exclude	Variable1	Variable2	Choice	Av1	Av2	Av3	NewVariable
0	1	0	1	0.1	1	0	1	0	0.1
1	1	0	2	0.2	2	1	1	1	0.4
3	2	0	4	0.4	1	1	1	1	1.6

generateDraws

Generate draws for each variable.

    Args:
         types:
             A dict indexed by the names of the variables,
             describing the types of draws. Each of them can be a
             native type or any type defined by the function
             database.setRandomNumberGenerators

         names: 
             the list of names of the variables that require draws to be generated.
         numberOfDraws: 
             number of draws to generate.

    Returns: 
         a 3-dimensional table with draws. The 3 dimensions are
          1. number of individuals
          2. number of draws
          3. number of variables

List native types and their description

In [17]:
myData.descriptionOfNativeDraws()
Out[17]:
['UNIFORM: Uniform U[0,1]',
 'UNIFORM_ANTI: Antithetic uniform U[0,1]',
 'UNIFORM_HALTON2: Halton draws with base 2, skipping the first 10',
 'UNIFORM_HALTON3: Halton draws with base 3, skipping the first 10',
 'UNIFORM_HALTON5: Halton draws with base 5, skipping the first 10',
 'UNIFORM_MLHS: Modified Latin Hypercube Sampling on [0,1]',
 'UNIFORM_MLHS_ANTI: Antithetic Modified Latin Hypercube Sampling on [0,1]',
 'UNIFORMSYM: Uniform U[-1,1]',
 'UNIFORMSYM_ANTI: Antithetic uniform U[-1,1]',
 'UNIFORMSYM_HALTON2: Halton draws on [-1,1] with base 2, skipping the first 10',
 'UNIFORMSYM_HALTON3: Halton draws on [-1,1] with base 3, skipping the first 10',
 'UNIFORMSYM_HALTON5: Halton draws on [-1,1] with base 5, skipping the first 10',
 'UNIFORMSYM_MLHS: Modified Latin Hypercube Sampling on [-1,1]',
 'UNIFORMSYM_MLHS_ANTI: Antithetic Modified Latin Hypercube Sampling on [-1,1]',
 'NORMAL: Normal N(0,1) draws',
 'NORMAL_ANTI: Antithetic normal draws',
 'NORMAL_HALTON2: Normal draws from Halton base 2 sequence',
 'NORMAL_HALTON3: Normal draws from Halton base 3 sequence',
 'NORMAL_HALTON5: Normal draws from Halton base 5 sequence',
 'NORMAL_MLHS: Normal draws from Modified Latin Hypercube Sampling',
 'NORMAL_MLHS_ANTI: Antithetic normal draws from Modified Latin Hypercube Sampling']
In [18]:
randomDraws1 = bioDraws('randomDraws1','NORMAL_MLHS_ANTI')
randomDraws2 = bioDraws('randomDraws2','UNIFORM_MLHS_ANTI')
randomDraws3 = bioDraws('randomDraws3','UNIFORMSYM_MLHS_ANTI')
# We build an expression that involves the three random variables
x = randomDraws1 + randomDraws2 + randomDraws3
types = x.dictOfDraws()
print(types)
{'randomDraws1': 'NORMAL_MLHS_ANTI', 'randomDraws2': 'UNIFORM_MLHS_ANTI', 'randomDraws3': 'UNIFORMSYM_MLHS_ANTI'}
In [19]:
theDrawsTable = myData.generateDraws(types,
                                     ['randomDraws1','randomDraws2','randomDraws3'],
                                     10)
theDrawsTable
Out[19]:
array([[[-1.12573246,  0.93262683,  0.70857146],
        [-1.92470618,  0.13390071, -0.55461327],
        [-0.74491474,  0.71369664,  0.50241637],
        [ 1.33576487,  0.96002502,  0.8463998 ],
        [-0.03641225,  0.01997585,  0.25911123],
        [ 1.12573246,  0.06737317, -0.70857146],
        [ 1.92470618,  0.86609929,  0.55461327],
        [ 0.74491474,  0.28630336, -0.50241637],
        [-1.33576487,  0.03997498, -0.8463998 ],
        [ 0.03641225,  0.98002415, -0.25911123]],

       [[-0.54516235,  0.78855215,  0.04809018],
        [-0.99629534,  0.45168533, -0.20964249],
        [ 0.56701476,  0.27813892, -0.09528222],
        [ 0.13318064,  0.3898702 , -0.83334911],
        [-0.31319   ,  0.48227605,  0.93865361],
        [ 0.54516235,  0.21144785, -0.04809018],
        [ 0.99629534,  0.54831467,  0.20964249],
        [-0.56701476,  0.72186108,  0.09528222],
        [-0.13318064,  0.6101298 ,  0.83334911],
        [ 0.31319   ,  0.51772395, -0.93865361]],

       [[-0.24097234,  0.06821618, -0.67465188],
        [ 1.06841955,  0.58471984,  0.42538689],
        [ 0.83603143,  0.61288538, -0.4196572 ],
        [ 0.30449664,  0.22061102,  0.11025077],
        [ 1.5535143 ,  0.82074407, -0.95678773],
        [ 0.24097234,  0.93178382,  0.67465188],
        [-1.06841955,  0.41528016, -0.42538689],
        [-0.83603143,  0.38711462,  0.4196572 ],
        [-0.30449664,  0.77938898, -0.11025077],
        [-1.5535143 ,  0.17925593,  0.95678773]]])
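
In the output above, the last five draws of each series mirror the first five: the normal and symmetric uniform columns change sign, and each uniform draw u is paired with 1 - u. This is the signature of antithetic draws. The sketch below shows the idea with plain uniform draws in numpy; it is not the MLHS procedure used by Biogeme, the helper antithetic_uniform is hypothetical, and it assumes an even number of draws.

```python
import numpy as np

def antithetic_uniform(sample_size, number_of_draws, rng):
    """Hypothetical helper: U[0,1] draws followed by their mirror images 1 - u."""
    half = number_of_draws // 2
    u = rng.random((sample_size, half))
    # The second half of each series is the reflection of the first half.
    return np.concatenate([u, 1.0 - u], axis=1)

rng = np.random.default_rng(90267)
draws = antithetic_uniform(3, 10, rng)
print(draws.shape)
```

Each draw and its mirror image sum to 1, so the sample mean of every series is exactly 0.5.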

setRandomNumberGenerators

Defines user-defined random number generators.

    Args:

       rng: a dictionary of generators. The keys of the dictionary
       characterize the name of the generators, and must be
       different from the pre-defined generators in Biogeme:
       NORMAL, UNIFORM and UNIFORMSYM. As in the example below,
       each element of the dictionary is a tuple containing a
       function and a text description. The function takes two
       arguments: the number of series to generate (typically,
       the size of the database), and the number of draws per series.

    Returns: 
         nothing.
In [20]:
# We first define functions returning draws, given the number of observations, and the number of draws
def logNormalDraws(sampleSize,numberOfDraws):
    return np.exp(np.random.randn(sampleSize,numberOfDraws))

def exponentialDraws(sampleSize,numberOfDraws):
    return -1.0 * np.log(np.random.rand(sampleSize,numberOfDraws))

# We associate these functions with a name
# (the name rng is used to avoid shadowing the built-in dict)
rng = {'LOGNORMAL':(logNormalDraws,'Draws from lognormal distribution'),'EXP':(exponentialDraws,'Draws from exponential distributions')}
myData.setRandomNumberGenerators(rng)

# We can now generate draws from these distributions
randomDraws1 = bioDraws('randomDraws1','LOGNORMAL')
randomDraws2 = bioDraws('randomDraws2','EXP')
x = randomDraws1 + randomDraws2
types = x.dictOfDraws()
theDrawsTable = myData.generateDraws(types,['randomDraws1','randomDraws2'],10)
print(theDrawsTable)
[[[2.6541064  1.72040543]
  [1.02928192 0.62725734]
  [2.15336577 0.35541854]
  [0.92036707 0.38330687]
  [1.35125462 2.83842826]
  [0.27817501 0.46249413]
  [0.5007549  0.6961861 ]
  [1.11902088 1.05840875]
  [0.6539865  0.15909907]
  [0.11955894 0.38886736]]

 [[0.60108954 0.40525196]
  [3.93153651 0.35868107]
  [4.60723253 0.18021421]
  [1.27062239 2.2373742 ]
  [2.73460167 1.17203962]
  [5.61600938 1.8920716 ]
  [2.54756523 0.07930524]
  [0.77284243 2.56028383]
  [5.16153268 0.59225528]
  [0.58972275 0.67940422]]

 [[0.88324351 0.63497716]
  [3.67625403 3.030641  ]
  [2.24536739 0.70518133]
  [0.46930501 0.67990918]
  [4.86579395 0.4097506 ]
  [2.14129298 0.8086017 ]
  [0.20614091 0.06963184]
  [0.2096891  0.02382351]
  [1.70933977 0.78170648]
  [0.63660909 1.83653019]]]
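
Before registering a generator, it can be useful to test it on its own. The sketch below assumes, based on the description of the arguments above, that a generator should return an array with one row per series and one column per draw. It is a variant of exponentialDraws that applies the inverse transform to 1 - U instead of U, because np.random.rand can return exactly 0 and the logarithm would then be infinite.

```python
import numpy as np

def exponentialDraws(sampleSize, numberOfDraws):
    # Inverse transform: if U is uniform on [0,1), then -log(1 - U)
    # follows an exponential distribution with rate 1.
    # 1 - U lies in (0,1], so the logarithm is always finite.
    return -np.log(1.0 - np.random.rand(sampleSize, numberOfDraws))

np.random.seed(90267)
draws = exponentialDraws(3, 10)

# One row per series, one column per draw.
print(draws.shape)
```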

sampleWithReplacement

Extract a random sample from the database, with replacement. Useful for bootstrapping.

    Args:
        size: size of the sample. If None, a sample of the same size as the database will be generated.

    Returns:
        pandas dataframe with the sample.
In [21]:
myData.sampleWithReplacement()
Out[21]:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
1       1        0          2        0.2       2    1    1    1          0.4
0       1        0          1        0.1       1    0    1    0          0.1
0       1        0          1        0.1       1    0    1    0          0.1
In [22]:
myData.sampleWithReplacement(6)
Out[22]:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable
3       2        0          4        0.4       1    1    1    1          1.6
3       2        0          4        0.4       1    1    1    1          1.6
1       1        0          2        0.2       2    1    1    1          0.4
3       2        0          4        0.4       1    1    1    1          1.6
0       1        0          1        0.1       1    0    1    0          0.1
0       1        0          1        0.1       1    0    1    0          0.1
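
A comparable bootstrap sample can be drawn from any pandas data frame with DataFrame.sample. Whether Biogeme relies on this call internally is an assumption; the sketch only shows why row labels repeat in the outputs above and why the sample can be larger than the database.

```python
import pandas as pd

# Three rows labelled 0, 1 and 3, as in the database after the call to remove.
df = pd.DataFrame({'Person': [1, 1, 2],
                   'Variable1': [1, 2, 4]},
                  index=[0, 1, 3])

# Draw 6 rows with replacement: the same row can be selected several times.
bootstrap = df.sample(n=6, replace=True, random_state=90267)
print(bootstrap.index.tolist())
```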

panel

Defines the data as panel data

    Args:
       columnName: name of the column that identifies individuals.
In [23]:
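# Note: as the output of the next cell shows, df reflects the changes made
# through myData above (scaled column, new column, removed rows).
myPanelData = db.Database('test',df)
# Data is not considered panel yet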
myPanelData.isPanel()
Out[23]:
False
In [24]:
myPanelData.panel('Person')
# Now it is panel
print(myPanelData.isPanel())
print(myPanelData)
True
biogeme database test:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  NewVariable  \
0       1        0          1        0.1       1    0    1    0          0.1   
1       1        0          2        0.2       2    1    1    1          0.4   
2       2        0          4        0.4       1    1    1    1          1.6   

   _biogroups  
0           1  
1           1  
2           2  
Panel data
   0  1
1  0  1
2  2  2
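
The table printed under "Panel data" appears to give, for each individual, the first and the last row of that individual's block of observations: individual 1 occupies rows 0 to 1, and individual 2 only row 2. Assuming this interpretation, such a map can be sketched with pandas as follows; the column name row and the variable individual_map are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Person': [1, 1, 2],
                   'Variable1': [1, 2, 4]})

# Make the observations of each individual contiguous, then renumber the rows.
df = df.sort_values('Person', kind='mergesort').reset_index(drop=True)
df['row'] = range(len(df))

# First and last row number of each individual's block.
individual_map = df.groupby('Person')['row'].agg(['min', 'max'])
print(individual_map)
```

The map has one row per individual (2 here), while the data frame has one row per observation (3 here).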

When draws are generated for panel data, a set of draws is generated per person, not per observation.

In [25]:
randomDraws1 = bioDraws('randomDraws1','NORMAL')
randomDraws2 = bioDraws('randomDraws2','UNIFORM_HALTON3')
# We build an expression that involves the two random variables
x = randomDraws1 + randomDraws2
types = x.dictOfDraws()
theDrawsTable = myPanelData.generateDraws(types,['randomDraws1','randomDraws2'],10)
print(theDrawsTable)
[[[ 0.87940272  0.33333333]
  [-0.41629853  0.66666667]
  [-1.57792232  0.11111111]
  [ 0.10870961  0.44444444]
  [ 0.05140378  0.77777778]
  [ 1.800922    0.22222222]
  [-1.85148982  0.55555556]
  [ 0.87938314  0.88888889]
  [ 1.353763    0.03703704]
  [-0.46741631  0.37037037]]

 [[-1.09546279  0.7037037 ]
  [-0.09265338  0.14814815]
  [ 1.92991243  0.48148148]
  [-0.29388122  0.81481481]
  [-0.49084943  0.25925926]
  [ 0.2439256   0.59259259]
  [ 0.42498657  0.92592593]
  [-2.72496968  0.07407407]
  [ 2.0755831   0.40740741]
  [ 0.44793057  0.74074074]]]

getNumberOfObservations

Reports the number of observations in the database. Note that it returns the same value, irrespective of whether the database contains panel data or not.

    Returns:
        Number of observations.

    See:  getSampleSize()
In [26]:
myData.getNumberOfObservations()
Out[26]:
3
In [27]:
myPanelData.getNumberOfObservations()
Out[27]:
3

getSampleSize

Reports the size of the sample. If the data is cross-sectional, it is the number of observations in the database. If the data is panel, it is the number of individuals.

    Returns: 
       Sample size.

    See: getNumberOfObservations()
In [28]:
myData.getSampleSize()
Out[28]:
3
In [29]:
myPanelData.getSampleSize()
Out[29]:
2

sampleIndividualMapWithReplacement

Extract a random sample of the individual map from a panel data database, with replacement. Useful for bootstrapping.

    Args:
        size: size of the sample. If None, a sample of the same size as the database will be generated.

    Returns:
        pandas dataframe with the sample.
In [30]:
myPanelData.sampleIndividualMapWithReplacement(10)
Out[30]:
   0  1
1  0  1
1  0  1
1  0  1
2  2  2
2  2  2
1  0  1
1  0  1
1  0  1
1  0  1
1  0  1

sampleWithoutReplacement

It is possible as well to sample without replacement. This is typically useful for stochastic algorithms that use only part of the database.

In [31]:
df = pd.DataFrame({'Person':[1,1,1,2,2],'Exclude':[0,0,1,0,1],'Variable1':[1,2,3,4,5],'Variable2':[10,20,30,40,50],'Choice':[1,2,3,1,2],'Av1':[0,1,1,1,1],'Av2':[1,1,1,1,1],'Av3':[0,1,1,1,1],'Weight':[1,5,1,1,5]})
myData = db.Database('test',df)
myData.data
Out[31]:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  Weight
0       1        0          1         10       1    0    1    0       1
1       1        0          2         20       2    1    1    1       5
2       1        1          3         30       3    1    1    1       1
3       2        0          4         40       1    1    1    1       1
4       2        1          5         50       2    1    1    1       5
In [32]:
myData.data.Choice.value_counts()
Out[32]:
2    2
1    2
3    1
Name: Choice, dtype: int64
In [33]:
myData.sampleWithoutReplacement(0.7)
myData.data
Out[33]:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  Weight
1       1        0          2         20       2    1    1    1       5
4       2        1          5         50       2    1    1    1       5
3       2        0          4         40       1    1    1    1       1
0       1        0          1         10       1    0    1    0       1
In [34]:
myData.data.Choice.value_counts()
Out[34]:
2    2
1    2
Name: Choice, dtype: int64

The sampling does not have to be uniform. Here, the column Weight is used to oversample the observations corresponding to Choice = 2.

In [35]:
myData.sampleWithoutReplacement(0.7,'Weight')
myData.data
Out[35]:
   Person  Exclude  Variable1  Variable2  Choice  Av1  Av2  Av3  Weight
1       1        0          2         20       2    1    1    1       5
0       1        0          1         10       1    0    1    0       1
4       2        1          5         50       2    1    1    1       5
2       1        1          3         30       3    1    1    1       1
In [36]:
myData.data.Choice.value_counts()
Out[36]:
2    2
3    1
1    1
Name: Choice, dtype: int64
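
Weighted sampling without replacement can also be reproduced with pandas alone. The sketch below uses DataFrame.sample with the Weight column as selection weights; whether Biogeme's sampleWithoutReplacement is implemented this way is an assumption.

```python
import pandas as pd

df = pd.DataFrame({'Choice': [1, 2, 3, 1, 2],
                   'Weight': [1, 5, 1, 1, 5]})

# Draw about 70% of the rows without replacement. Rows with a larger
# Weight are more likely to be selected.
sample = df.sample(frac=0.7, replace=False, weights='Weight', random_state=90267)
print(sample['Choice'].value_counts())
```

Because the two rows with Choice = 2 carry a weight of 5, they are more likely to be drawn than the other rows.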