This reference section describes the assumptions made by apop_text_to_db and apop_text_to_data.
Each row of the file will be converted to one record in the database or one row in the matrix. Values on one row are separated by delimiters. Fixed-width input is also OK; see below.
By default, the delimiters are set to "|,\t", meaning that a pipe, comma, or tab will delimit separate entries. To change the default, use an argument to apop_text_to_db or apop_text_to_data like .delimiters=" \t" or .delimiters="|".
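To make the idea of a delimiter set concrete, here is a minimal sketch in plain C. The function count_fields is a hypothetical helper written for this illustration, not part of Apophenia, and it assumes a row with no quoting, escapes, or comments. It only shows that each character in the delimiter string is treated as a separator on its own.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Apophenia: count the fields in one row,
   treating every character in `delims` as a field separator.
   Simplification: quoting, backslash escapes, and comments are ignored. */
size_t count_fields(const char *row, const char *delims)
{
    size_t n = 1;                        /* a row has at least one field */
    for (const char *p = row; *p != '\0'; p++)
        if (strchr(delims, *p) != NULL)  /* *p is one of the delimiters */
            n++;
    return n;
}
```

Under these assumptions, the row a|b,c<tab>d has four fields with the default set "|,\t", but only two if the set is narrowed to "|".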
The input text file must be encoded as UTF-8 or traditional ASCII. Delimiters must be ASCII characters. If your data is in another encoding, try the POSIX-standard iconv program to filter the data to UTF-8.
- The character after a backslash is read as a normal character, even if it is a delimiter, #, or ".
- If a field contains several such special characters, surround it with "s. The surrounding marks are stripped and the text read verbatim.
- Text does not need to be delimited by quotes (unless there are special characters). If a text field is quote-delimited, I'll strip the quotes. E.g., "Males, 30-40", is an OK column name, as is "Males named \"Joe\"".
- Everything after an unprotected # is taken to be comments and ignored.
- Blank lines (empty or consisting only of white space) are also ignored.
- If you are reading into the gsl_matrix element of an apop_data set, all text fields are taken as zeros. You will be warned of such substitutions unless you set apop_opts.verbose=0 beforehand. For mixed text/numeric data, try using apop_text_to_db and then apop_query_to_mixed_data.
- There are often two delimiters in a row, e.g., "23, 32,, 12". When it's two commas like this, the user typically means that there is a missing value and the system should insert a NaN; when it is two tabs in a row, this is typically just a formatting glitch. Thus, if there are multiple delimiters in a row, I check whether the second (and each subsequent one) is a space or a tab; if it is, then it is ignored, and if it is any other delimiter (including the end of the line) then a NaN is inserted.
If this rule doesn't work for your situation, you can explicitly insert a note that there is a missing data point. E.g., try:
perl -pi.bak -e 's/,,/,NaN,/g' data_file
If your data uses a marker for missing values, you will need to set apop_opts.nan_string to text that matches it. E.g.,
apop_opts.nan_string = "NaN";
apop_opts.nan_string = "Missing";
apop_opts.nan_string = ".";
apop_opts.nan_string = NULL;
SQLite stores these NaN-type values internally as NULL; that means that functions like apop_query_to_data will convert both your nan_string string and NULL to NaN.
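The repeated-delimiter rule above can be sketched in plain C. This is an illustration of the rule as stated, not Apophenia's own parser: parse_row is a hypothetical helper, and it assumes numeric rows with no quoting, escapes, or comments.

```c
#include <ctype.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not Apophenia's parser: read one row of numbers into
   out[] (at most max values) and return how many values were stored.
   Quoting, escapes, and comments are left out; a leading delimiter is not
   treated as a missing field. */
size_t parse_row(const char *row, const char *delims, double *out, size_t max)
{
    size_t n = 0;
    int after_delim = 0;   /* 1 while the last thing seen was a delimiter */
    const char *p = row;

    while (*p != '\0' && n < max) {
        if (strchr(delims, *p) != NULL) {
            /* A repeated delimiter: a space or tab is ignored, anything
               else marks a missing value. */
            if (after_delim && *p != ' ' && *p != '\t')
                out[n++] = NAN;
            after_delim = 1;
            p++;
        } else if (isspace((unsigned char)*p)) {
            p++;           /* white space around a field is ignored */
        } else {
            char *end;
            out[n++] = strtod(p, &end);      /* unparseable text yields 0 */
            p = end + strcspn(end, delims);  /* skip to the next delimiter */
            after_delim = 0;
        }
    }
    /* A delimiter followed directly by the end of the line also counts. */
    if (after_delim && n < max)
        out[n++] = NAN;
    return n;
}
```

With the default delimiters, this sketch reads "23, 32,, 12" as four values with a NaN in the third slot, and reads two tabs in a row as a single separator.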
- The system uses the standards for C's atof() function for floating-point numbers: INFINITY, -INFINITY, and NaN work as expected.
- If there are row names and column names, then the input will not be perfectly square: the row of column names should have no entry above the column of row names. That is, for a 100x100 data set with row and column names, there are 100 names in the top row, and 101 entries in each subsequent row (name plus 100 data points).
- White space before or after a field is ignored. So 1, 2,3, 4 , 5, " six ",7 is equivalent to 1,2,3,4,5," six ",7.
- NUL characters ('\0') are treated as white space, so if your fields have NULs as padding, you should have no problem. A NUL inside a string terminates the string, as it always does in C.
- Fixed-width formats are supported (for plain ASCII encoding only), but you have to provide a list of field ending positions. For example, given a file whose first row is NUMLEOL and .field_ends=(int[]){3, 5, 7}, we have three columns, named NUM, LE, and OL. The names can be read from the first row by setting .has_col_names='y'.
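To show how a list of field ending positions carves up a row, here is a minimal sketch in plain C. The function fixed_width_field is a hypothetical helper, not part of Apophenia, and the row NUMLEOL used with it is an assumed header line chosen to be consistent with the column names NUM, LE, and OL given above. The sketch assumes each entry of field_ends is the 1-based position of the last character in its field.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Apophenia: copy field number i (counting
   from 0) of a fixed-width row into buf, which holds buf_len bytes.
   field_ends[i] is taken to be the 1-based position of the last character
   of field i. Returns the field's length, or 0 if the row is too short or
   buf is too small. */
size_t fixed_width_field(const char *row, const int *field_ends, size_t i,
                         char *buf, size_t buf_len)
{
    size_t start = (i == 0) ? 0 : (size_t)field_ends[i - 1];
    size_t end = (size_t)field_ends[i];
    size_t row_len = strlen(row);

    if (buf_len == 0)
        return 0;
    buf[0] = '\0';
    if (end <= start || end > row_len || end - start >= buf_len)
        return 0;
    memcpy(buf, row + start, end - start);  /* characters start..end-1 */
    buf[end - start] = '\0';
    return end - start;
}
```

Called with the ends {3, 5, 7} on the row NUMLEOL, fields 0, 1, and 2 come out as NUM, LE, and OL respectively.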