The apop_data structure represents a data set. It joins together a gsl_vector, a gsl_matrix, an apop_name, and a table of strings. It tries to be lightweight, so you can use it everywhere you would use a gsl_matrix or a gsl_vector.
Here is a diagram showing a sample data set with all of the elements in place. Together, they represent a data set where each row is an observation, which includes both numeric and text values, and where each row/column may be named.
Rowname
Vector
Matrix
Text
Weights
"Steven"
"Sandra"
"Joe"
Outcome
1
0
1
Age
Weight (kg)
Height (cm)
32
65
175
41
61
165
40
73
181
Sex
State
Male
Alaska
Female
Alabama
Male
Alabama
1
3.2
2.4
In a regression, the vector would be the dependent variable, and the other columns (after factor-izing the text) the independent variables. Or think of the apop_data set as a partitioned matrix, where the vector is column -1, and the first column of the matrix is column zero. Here is some sample code to print the vector and matrix, starting at column -1 (but you can use apop_data_print to do this).
Most functions assume that each row represents one observation, so the data vector, data matrix, and text have the same row count: data->vector->size==data->matrix->size1 and data->vector->size==*data->textsize. This means that the apop_name structure doesn't have separate vector_names, row_names, or text_row_names elements: the rownames are assumed to apply for all.
See below for notes on managing the text element and the row/column names.
Pages
The apop_data set includes a more pointer, which will typically be NULL, but may point to another apop_data set. This is intended for a main data set and a second or third page with auxiliary information, such as estimated parameters on the front page and their covariance matrix on page two, or predicted data on the front page and a set of prediction intervals on page two.
The more pointer is not intended for a linked list for millions of data points. In such situations, you can often improve efficiency by restructuring your data to use a single table (perhaps via apop_data_pack and apop_data_unpack).
Most functions, such as apop_data_copy and apop_data_free, will handle all the pages of information. For example, an optimization search over multi-page parameter sets would search the space given by all pages.
Pages may also be appended as output or auxiliary information, such as covariances, and an MLE would not search over these elements. Any page with a name in XML-ish brackets, such as <Covariance>, is considered information about the data, not data itself, and therefore ignored by search routines, missing data routines, et cetera. This is achieved by a rule in apop_data_pack and apop_data_unpack.
Here is a toy example that establishes a baseline data set, adds a page, modifies it, and then later retrieves it.
apop_data *d = apop_data_alloc(10, 10, 10); //the base data set, a 10-item vector + 10x10 matrix
Apophenia builds upon the GSL, but it would be inappropriate to redundantly replicate the GSL's documentation here. Meanwhile, here are prototypes for a few common functions. The GSL's naming scheme is very consistent, so a simple reminder of the function name may be sufficient to indicate how they are used.
gsl_matrix_swap_rows (gsl_matrix * m, size_t i, size_t j)
gsl_matrix_swap_columns (gsl_matrix * m, size_t i, size_t j)
gsl_matrix_swap_rowcol (gsl_matrix * m, size_t i, size_t j)
The apop_text_to_data() function takes in the name of a text file with a grid of data in (comma|tab|pipe|whatever)-delimited format and reads it to a matrix. If there are names in the text file, they are copied in to the data set. See Input text file formatting for the full range and details of what can be read in.
If you have any columns of text, then you will need to read in via the database: use apop_text_to_db() to convert your text file to a database table, do any database-appropriate cleaning of the input data, then use apop_query_to_data() or apop_query_to_mixed_data() to pull the data to an apop_data set.
You may not need to use these functions often, given that apop_query_to_data, apop_text_to_data, and many transformation functions will auto-allocate apop_data sets for you.
The apop_data_alloc function allocates a vector, a matrix, or both. After this call, the structure will have blank names, NULLtext element, and NULLweights. See Name handling for discussion of filling the names. Use apop_text_alloc to allocate the text grid. The weights are a simple gsl_vector, so allocate a 100-unit weights vector via allocated_data_set->weights = gsl_vector_alloc(100).
Examples of use can be found throughout the documentation; for example, see A quick overview.
A second set of macros have a slightly different syntax, taking the name of the object to be declared as the last argument. These can not be used as expressions such as function arguments.
The view is an automatic variable, not a pointer, and therefore disappears at the end of the scope in which it is declared. If you want to retain the data after the function exits, copy it to another vector:
Curly braces always delimit scope, not just at the end of a function. When program evaluation exits a given block, all variables in that block are erased. Here is some sample code that won't work:
apop_data_print(outdata); //breaks: outdata points to out-of-scope variables.
For this if/then statement, there are two sets of local variables generated: one for the if block, and one for the else block. By the last line, neither exists. You can get around the problem here by making sure to not put the macro declaring new variables in a block. E.g.:
int apop_name_add(apop_name *n, char const *add_me, char type)
Definition apop_name.c:42
The versions that take a column/row name use apop_name_find for the search; see notes there on the name matching rules.
For those that take a column number, column -1 is the vector element.
For those that take a column name, I will search the vector last—if I don't find the name among the matrix columns, but the name matches the vector name, I return column -1.
If you give me both a .row and a .rowname, I go with the name; similarly for .col and .colname.
Numeric values default to zero, which is how the examples above that treated the apop_data set as a vector or scalar could do so relatively gracefully. So apop_data_get(dataset, 1) gets item (1, 0) from the matrix element of dataset. But as a do-what-I-mean exception, if there is no matrix element but there is a vector, then this form will get vector element 1. Relying on this DWIM exception is useful iff you can guarantee that a data set will have only a vector or a matrix but not both. Otherwise, be explicit and use apop_data_get(dataset, 1, -1).
The apop_data_ptr function follows the lead of gsl_vector_ptr and gsl_matrix_ptr, and like those functions, returns a pointer to the appropriate double. For example, to increment the (3,7)th element of an apop_data set:
gsl_matrix_get_row (gsl_vector * v, const gsl_matrix * m, size_t i)
gsl_matrix_get_col (gsl_vector * v, const gsl_matrix * m, size_t j)
gsl_matrix_set_row (gsl_matrix * m, size_t i, const gsl_vector * v)
gsl_matrix_set_col (gsl_matrix * m, size_t j, const gsl_vector * v)
Map/apply
These functions allow you to send each element of a vector or matrix to a function, either producing a new matrix (map) or transforming the original (apply). The ..._sum functions return the sum of the mapped output.
There are two types, which were developed at different times. The apop_map and apop_map_sum functions use variadic function inputs to cover a lot of different types of process depending on the inputs. Other functions with types in their names, like apop_matrix_map and apop_vector_apply, may be easier to use in some cases. They use the same routines internally, so use whichever type is convenient.
You can do many things quickly with these functions.
Given your log likelihood function, which acts on a apop_data set with only one row, and a data set where each row of the matrix is an observation, find the total log likelihood via:
The following program randomly generates a data set where each row is a list of numbers with a different mean. It then finds the statistic for each row, and the confidence with which we reject the claim that the statistic is less than or equal to zero.
Notice how the older apop_vector_apply uses file-global variables to pass information into the functions, while the apop_map uses a pointer to send parameters to the functions.
double sum = apop_map_sum(d, .fn_r = weight_given_even);
assert(sum == 49*25*2);
}
If the number of threads is greater than one, then the matrix will be broken into chunks and each sent to a different thread. Notice that the GSL is generally threadsafe, and SQLite is threadsafe conditional on several commonsense caveats that you'll find in the SQLite documentation. See apop_rng_get_thread() to use the GSL's RNGs in a threaded environment.
The ...sum functions are convenience functions that call ...map and then add up the contents. Thus, you will need to have adequate memory for the allocation of the temp matrix/vector.
If you generate your data set via apop_text_to_data or from the database via apop_query_to_data (or apop_query_to_text or apop_query_to_mixed_data) then column names appear as expected. Set apop_opts.db_name_column to the name of a column in your query result to use that column name for row names.
The apop_data set includes a grid of strings, named text, for holding text data.
Text should be encoded in UTF-8. ASCII is a subset of UTF-8, so that's OK too.
There are a few simple forms for handling the text element of an apop_data set.
Use apop_text_alloc to allocate the block of text. It is actually a realloc function, which you can use to resize an existing block without leaks. See the example below.
Use apop_text_set to write text elements. It replaces any existing text in the given slot without memory leaks.
The number of rows of text data in tdata is tdata->textsize[0]; the number of columns is tdata->textsize[1].
Refer to individual elements using the usual 2-D array notation, tdata->text[row][col].
x[0] can always be written as *x, which may save some typing. The number of rows is *tdata->textsize. If you have a single column of text data (i.e., all data is in column zero), then item i is *tdata->text[i]. If you know you have exactly one cell of text, then its value is **tdata->text.
After apop_text_alloc, all elements are the empty string "", which you can check via
if (!strlen(dataset->text[i][j])) printf("<blank>")
//or
if (!*dataset->text[i][j]) printf("<blank>")
For the sake of efficiency when dealing with large, sparse data sets, all blank cells point to the same static empty string, meaning that freeing cells must be done with care. Your best bet is to rely on apop_text_set, apop_text_alloc, and apop_text_free to do the memory management for you.
Here is a sample program that uses these forms, plus a few text-handling functions.
#include <apop.h>
int main(){
apop_query("create table data (name, city, state);"
"insert into data values ('Mike Mills', 'Rockville', 'MD');"
"insert into data values ('Bill Berry', 'Athens', 'GA');"
"insert into data values ('Michael Stipe', 'Decatur', 'GA');");
apop_data *tdata = apop_query_to_text("select name, city, state from data");
apop_data_transpose() : also transposes the text data. Say that you use dataset = apop_query_to_text("select onecolumn from data"); then you have a sequence of strings, d->text[0][0], d->text[1][0], .... After apop_data
dt = apop_data_transpose(dataset), you will have a single list of strings, dt->text[0], which is often useful as input to list-of-strings handling functions.
Factor is jargon for a numbered category. Number-crunching programs prefer integers over text, so we need a function to produce a one-to-one mapping from text categories into numeric factors.
A dummy is a variable that is either one or zero, depending on membership in a given group. Some methods (typically when the variable is an input or independent variable in a regression) prefer dummies; some methods (typically for outcome or dependent variables) prefer factors. The functions that generate factors and dummies will add an informational page to your apop_data set with a name like <categories
for your_column> listing the conversion from the artificial numeric factor to the original data. Use apop_data_get_factor_names to get a pointer to that page.
You can use the factor table to translate from numeric categories back to text (though you probably have the original text column in your data anyway).
Having the factor list in an auxiliary table makes it easy to ensure that multiple apop_data sets use the same single categorization scheme. Generate factors in the first set, then copy the factor list to the second, then run apop_data_to_factors on the second: