### B.13 Statements for working with datasets

For all practical purposes, datasets in BayES are matrices with additional structure. This means that if, for example, D is a dataset, then indexing operations and functions that operate on matrices work on D in the same way as if D were a matrix. There are, however, some additional functions and statements that work on datasets but not on matrices. These are documented in the following table.

| Syntax | Arguments and performed function |
| --- | --- |
| `X = D.varname;` | X is a column vector with entries equal to the values of variable varname in dataset D.<br>• D must be a dataset<br>• varname must be the name of a variable contained in D |
| `D.varname = <value>;` | Creates a new variable called varname (defined by `<value>`) and adds it to dataset D. If D already has a variable called varname, its values are replaced.<br>• D must be a dataset<br>• varname must be a valid id value<br>• `<value>` can be any mathematical expression that returns a scalar or a column vector with number of rows equal to the number of observations in D<br>• when `<value>` returns a scalar, its value is expanded to match the number of rows of D (note that this is the only instance where BayES will expand a scalar on the right-hand side of an assignment statement to match the dimensions of the left-hand side item) |
| `clear(D.varname);` | Deletes the variable called varname from dataset D.<br>• D must be a dataset<br>• varname must be the name of a variable contained in D |
| `D = dataset(A [, {ID1, ID2, ...}]);` | D is a dataset constructed from the data contained in matrix A. ID1, ID2, ... is a list of id values (id values inside curly brackets) to be used as the variable names. If variable names are not provided, the variables are named `_V1`, `_V2`, etc.<br>• A must be a matrix<br>• ID1, ID2, ... must be distinct id values<br>• the number of id values provided must be equal to the number of columns in A<br>• see also import |
| `rename(D.oldname, newname);` | Renames variable oldname in dataset D to newname.<br>• D must be a dataset<br>• oldname must be a variable within D<br>• newname must be an id value |
| `keepif(<condition> [, "dataset"=D]);` | Keeps the observations in dataset D that satisfy the logical `<condition>`. The remaining observations (those that do not satisfy `<condition>`) are permanently deleted from D. If D is not provided, the function operates on the first dataset available in the current workspace. The statement has no return value and the data in D are overwritten.<br>• D must be a dataset<br>• `<condition>` can be a simple or compound logical condition, for example `D.var1 >= D.var2` or `D.var1 >= D.var2 \| D.var3 == 0`<br>• see also dropif and dropmissing |
| `dropif(<condition> [, "dataset"=D]);` | Drops (permanently deletes) the observations in dataset D that satisfy the logical `<condition>`. The remaining observations (those that do not satisfy `<condition>`) are retained. If D is not provided, the function operates on the first dataset available in the current workspace. The statement has no return value and the data in D are overwritten.<br>• D must be a dataset<br>• `<condition>` can be a simple or compound logical condition, for example `D.var1 >= D.var2` or `D.var1 >= D.var2 \| D.var3 == 0`<br>• see also keepif and dropmissing |
| `W = dropmissing(X);` | W is a dataset constructed by reading the entries of X, row by row, but skipping any rows in X that contain at least one missing value. An error is produced if an empty dataset results from dropping the rows of X with missing values.<br>• X must be a matrix or dataset<br>• when the argument provided to dropmissing() is a matrix, the function returns a matrix; therefore, this function is also documented in Section B.5<br>• see also dropif and keepif |
| `sortd(<variable list> [, "dataset"=D]);` | Sorts the data in dataset D in ascending order according to the values of the variables whose names are provided in `<variable list>`. If D is not provided, the function operates on the first dataset available in the current workspace. The statement has no return value and the data in D are overwritten.<br>• D must be a dataset<br>• `<variable list>` can be either a single variable name (id value), ex: `myVariable`, or a list of variable names (id values inside curly brackets and separated by commas), ex: `{variable1, variable2}`<br>• in the latter case the data are sorted first according to variable1 and, within each group of duplicate values of variable1, according to variable2<br>• see also sort and sortrows |
| `summary(<variable list> [, "dataset"=D]);` | Calculates and prints summary statistics of the variables in dataset D. If D is not provided, the function operates on the first dataset available in the current workspace.<br>• D must be a dataset<br>• `<variable list>` defines the variables for which summary statistics are calculated and can be one of the following: a single variable name, ex: `myVariable`; a list of variable names (id values inside curly brackets and separated by commas), ex: `{variable1, variable2}`; or the keyword `all`, which requests calculation of summary statistics for all variables in D |
| `set_ts(tid [, "format"=s, "dataset"=D]);` | Declares the dataset as time series, with tid being the variable that identifies time periods. If D is not provided, the function operates on the first dataset available in the current workspace.<br>• D must be a dataset<br>• tid must be the name of a variable contained in D; this variable can have either numeric or string values<br>• s must be one of the following strings: `"index"` (integer values), `"yyyyqx"` (quarterly data), `"yyyymxx"` (monthly data), `"yyyy/mm/dd"` (daily data)<br>• the default value for s is `"index"`, in which case the magnitude of the integer values indicates the time ordering and spacing of the observations<br>• each observation in D must have a unique value for the tid variable |
| `set_pd(tid, pid [, "format"=s, "dataset"=D]);` | Declares the dataset as a panel, with tid being the variable that identifies time periods and pid the variable that identifies groups. If D is not provided, the function operates on the first dataset available in the current workspace.<br>• D must be a dataset<br>• tid must be the name of a variable contained in D; this variable can have either numeric or string values<br>• pid must be the name of a variable contained in D; its values must be integers, with equal magnitude for observations that belong to the same group<br>• s must be one of the following strings: `"index"` (integer values), `"yyyyqx"` (quarterly data), `"yyyymxx"` (monthly data), `"yyyy/mm/dd"` (daily data)<br>• the default value for s is `"index"`, in which case the magnitude of the integer values indicates the time ordering and spacing of the observations<br>• each observation in D must have a unique value for the tid-pid pair of variables |
| `set_cs(["dataset"=D]);` | Clears any time structure from dataset D that was set by a call to either the set_ts or set_pd functions. If D is not provided, the function operates on the first dataset available in the current workspace.<br>• D must be a dataset |
| `X = lag(varname [, l, "dataset"=D]);` | X is a column vector obtained by taking lags of length l on the variable with name varname from dataset D. If D is not provided, the function operates on the first dataset available in the current workspace.<br>• D must be a dataset, previously declared either as a time-series or panel dataset<br>• varname must be the name of a variable contained in D<br>• l must be an integer; the default value for l is 1 |
| `X = diff(varname [, o, l, "dataset"=D]);` | X is a column vector obtained by taking seasonal differences of order o and seasonal length l on the variable with name varname from dataset D. If D is not provided, the function operates on the first dataset available in the current workspace.<br>• D must be a dataset, previously declared either as a time-series or panel dataset<br>• varname must be the name of a variable contained in D<br>• o must be a positive integer; the default value for o is 1<br>• l must be an integer; the default value for l is 1 |
| `X = groupmeans(varname [, "dataset"=D]);` | X is a column vector obtained by calculating the arithmetic mean per group of the variable with name varname from dataset D. X has the same length as the number of observations in D, and missing values are generated if varname has only missing values for a particular group. If D is not provided, the function operates on the first dataset available in the current workspace.<br>• D must be a dataset, previously declared as a panel dataset<br>• varname must be the name of a variable contained in D |
| `X = groupvars(varname [, "dataset"=D]);` | X is a column vector obtained by calculating the variance per group of the variable with name varname from dataset D. X has the same length as the number of observations in D, and missing values are generated if varname has only missing values for a particular group. If D is not provided, the function operates on the first dataset available in the current workspace.<br>• D must be a dataset, previously declared as a panel dataset<br>• varname must be the name of a variable contained in D |
| `X = groupsds(varname [, "dataset"=D]);` | X is a column vector obtained by calculating the standard deviation per group of the variable with name varname from dataset D. X has the same length as the number of observations in D, and missing values are generated if varname has only missing values for a particular group. If D is not provided, the function operates on the first dataset available in the current workspace.<br>• D must be a dataset, previously declared as a panel dataset<br>• varname must be the name of a variable contained in D |
| `X = groupmedians(varname [, "dataset"=D]);` | X is a column vector obtained by calculating the median per group of the variable with name varname from dataset D. X has the same length as the number of observations in D, and missing values are generated if varname has only missing values for a particular group. If D is not provided, the function operates on the first dataset available in the current workspace.<br>• D must be a dataset, previously declared as a panel dataset<br>• varname must be the name of a variable contained in D |
| `X = groupsums(varname [, "dataset"=D]);` | X is a column vector obtained by calculating the sum per group of the variable with name varname from dataset D. X has the same length as the number of observations in D, and missing values are generated if varname has only missing values for a particular group. If D is not provided, the function operates on the first dataset available in the current workspace.<br>• D must be a dataset, previously declared as a panel dataset<br>• varname must be the name of a variable contained in D |
| `X = groupcounts(varname [, "dataset"=D]);` | X is a column vector obtained by calculating the number of observations per group of the variable with name varname from dataset D. X has the same length as the number of observations in D, and the number of observations per group excludes any observations with missing values on varname. If D is not provided, the function operates on the first dataset available in the current workspace.<br>• D must be a dataset, previously declared as a panel dataset<br>• varname must be the name of a variable contained in D |
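
As a rough illustration of how several of these statements might be combined, the sketch below uses only syntax documented in this section. It is an assumption-laden example rather than a tested script: it presumes that a dataset named Data already exists in the workspace, that it contains the hypothetical variables year, firm and output (with firm taking integer values and each firm-year pair appearing only once), and that `//` starts a comment.

```
// Hypothetical setup: dataset Data with variables year, firm and output.

// Add a constant term; the scalar is expanded to one value per observation.
Data.constant = 1;

// Permanently delete the observations that fail the condition.
keepif(Data.output >= 0, "dataset"=Data);

// Sort by firm and, within each firm, by year.
sortd({firm, year}, "dataset"=Data);

// Declare the panel structure: year identifies periods, firm identifies groups.
set_pd(year, firm, "dataset"=Data);

// One-period lag and first difference (o = 1, l = 1) of output.
X = lag(output, 1, "dataset"=Data);
Z = diff(output, 1, 1, "dataset"=Data);

// The per-group mean has one entry per observation, so it can be stored in Data.
Data.meanOutput = groupmeans(output, "dataset"=Data);

// Print summary statistics for the selected variables.
summary({output, meanOutput}, "dataset"=Data);
```

Passing "dataset"=Data explicitly in every call avoids relying on which dataset happens to be first in the workspace, which matters once more than one dataset has been created.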