2.4 Datasets and data transformations

The fundamental data type in BayES is the matrix, and the BayES language is designed to work primarily with matrices. However, to facilitate statistical analysis and the reporting of results, BayES provides an additional type that can store data: the dataset. For practical purposes, a dataset in BayES is a special type of matrix in which each column stores the data on a particular variable and has an associated name: the variable’s name.

Typically, datasets are defined by importing data into BayES using a statement like:

myData = import("c:/myFiles/mydata/dataset1.csv", ",");

This statement reads the data from the CSV file dataset1.csv, located at "c:/myFiles/mydata", into a dataset with id value myData, using a comma as the field separator. Each column of dataset1.csv must contain the data on one variable, with the first row specifying the variable names.

Datasets can also be constructed in BayES by turning a matrix into a dataset. For example, if X is a matrix in the current workspace, then the statement:

myData = dataset(X);

will create a dataset with id value myData by treating each column of X as a variable. Because variables in a dataset must have an associated name, this statement names the variables in myData _V1, _V2, etc. Alternatively, variable names can be provided as an optional argument to the dataset() function, or the variables in myData can be renamed using the rename() function after the dataset has been constructed (see section B.13 for details).
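
As a brief sketch of the first alternative, and assuming that X has exactly three columns, the variable names can be listed in a second argument to dataset(); the names var1, var2 and var3 are chosen here purely for illustration:

myData = dataset(X, {var1,var2,var3});

Section B.13 documents the exact form this optional argument may take.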

Once a dataset item is constructed, the functions that operate on datasets can be used. These functions cover tasks such as printing summary statistics, creating, deleting and renaming variables, and sorting data, and are documented in section B.13. The sample file "2Working with datasets.bsf", located at "$BayESHOME/Samples/1GettingStarted", demonstrates many of BayES’ capabilities in working with datasets.

Because datasets in BayES are simply special matrices, all operations and functions defined for matrices also work on datasets. Importantly, the indexing methods described in section 2.3.2 can be used to extract or alter the values stored in a dataset. For example, if myData is a dataset with three variables named var1, var2 and var3, then the statement:

X = myData(:,3);

will construct a matrix X that stores the values of var3 (third column of the dataset). Variables in a dataset can also be referred to by their name by using the ‘.’ operator. For example, an equivalent way of constructing X from the values of var3 would be:

X = myData.var3;

The ‘.’ operator can also be used to define new variables for a dataset. Continuing with the myData dataset, the following statement creates a new variable named logvar2 with values equal to the natural logarithm of var2:

myData.logvar2 = log(myData.var2);

Indexing and range operations can also be used on the left-hand side of assignment statements involving datasets. For example, the statement:

myData(1:3,1) = zeros(3,1);

makes the first three rows of the first variable in myData equal to zero. As long as var1 is the first variable in myData, this statement is equivalent to:

myData.var1(1:3) = zeros(3,1);

where, instead of using a column index on the left-hand side of the statement, the column of the dataset is referenced by the variable name.
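
The same pattern should carry over to the other variables. As a sketch, assuming that var1, var2 and var3 are still the first three variables in myData, in that order, either of the following statements would be expected to set the first three rows of var2 to zero:

myData(1:3,2) = zeros(3,1);
myData.var2(1:3) = zeros(3,1);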

When mathematical operations or operations that involve indexing are performed repeatedly on a dataset (for example, within a loop), it is usually faster to turn the dataset into a matrix, perform the operations on the matrix and then turn the final matrix back into a dataset. For example, the following statements:⁴

X = myData; 
for (i=1:rows(myData)) 
    X(i,1) = i; 
myData = dataset(X, {var1,var2,var3});

copy the data in myData into a matrix X, alter the values in the first column of X using a loop, and replace the original dataset with a new one constructed from the values in X. The same task could be accomplished by:

for (i=1:rows(myData)) 
    myData(i,1) = i; 

but the second set of statements could be slower, especially if myData contains many observations.

⁴ See section 2.9 for a general discussion on program flow and subsection 2.9.3 for a description of for loops.

© 2016–20 Grigorios Emvalomatis