What's New In khmer 2.0?
New behavior
Streaming I/O from Unix Pipes
All scripts now accept input from named pipes (like /dev/stdin, or one created
using <( list ) process substitution) and unnamed pipes (like output piped in
from another program with |). The STDIN stream can also be specified using
a single dash: -.
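The dash convention described above can be sketched in a few lines of Python. This is only an illustration of the idea, not khmer's own code; the helper names open_input and count_lines are hypothetical.

```python
import sys


def open_input(filename):
    """Return a readable text handle; a single dash means standard input.

    Hypothetical helper, shown only to illustrate the "-" convention.
    """
    if filename == "-":
        return sys.stdin
    # Named pipes such as /dev/stdin, or the path produced by a <( list )
    # process substitution, are opened like ordinary files.
    return open(filename, "r")


def count_lines(filename):
    """Count the lines available from a file, a named pipe, or stdin."""
    handle = open_input(filename)
    try:
        return sum(1 for _ in handle)
    finally:
        # Never close the interpreter's own stdin.
        if handle is not sys.stdin:
            handle.close()
```

A script built this way treats "-" and /dev/stdin identically, which is why either form can be used on the command line.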
New parameter for memory usage, and/or tablesize/number of table parameters.
There is now a -M/--max-memory-usage parameter that sets the number of tables
(-N/--n_tables) and tablesize (-x/--max-tablesize) parameters automatically to
match the desired memory usage.
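As a rough sketch of how a memory budget could be turned into a table size, the arithmetic below divides the budget evenly across the tables. It assumes one byte per bin for a countgraph and one bit per bin for a nodegraph; those bin widths and the function name are assumptions made for illustration, not something this document states.

```python
def tablesize_for_memory(max_memory_bytes, n_tables, bits_per_bin):
    """Return the number of bins per table that fits the memory budget.

    Hypothetical helper: bits_per_bin is assumed to be 8 for a countgraph
    and 1 for a nodegraph.
    """
    total_bins = max_memory_bytes * 8 // bits_per_bin
    return total_bins // n_tables


# A 1 GB budget split across 4 tables.
countgraph_bins = tablesize_for_memory(10**9, n_tables=4, bits_per_bin=8)
nodegraph_bins = tablesize_for_memory(10**9, n_tables=4, bits_per_bin=1)
```

Under these assumptions the same budget buys eight times as many bins for a presence-only structure as for a counting one.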
Digital normalization script now supports mixed paired and unpaired read input
normalize-by-median.py now supports mixed paired and unpaired (or
"broken-paired") input. Behavior can be forced to either treat all
reads as singletons or to require all reads be properly paired using
--force_single or --paired, respectively. If --paired is set,
--unpaired-reads can be used to include a file of unpaired reads. The unpaired
reads will be examined after all of the other sequence files.
normalize-by-median.py --quiet can be used to reduce the amount of
diagnostic output.
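The three pairing modes described above (mixed, forced single, and strictly paired) can be sketched as a small generator. This is an illustration rather than khmer's reader: it works on read names only, and it assumes mates share a name and end in /1 and /2, which is just one of several naming conventions in use.

```python
def is_pair(name1, name2):
    """Assumed convention: mates share a name and end in /1 and /2."""
    return (name1.endswith("/1") and name2.endswith("/2")
            and name1[:-2] == name2[:-2])


def broken_paired_reader(names, force_single=False, require_paired=False):
    """Yield (first, second_or_None) tuples from a stream of read names."""
    prev = None
    for name in names:
        if force_single:
            # Treat every read as a singleton.
            yield (name, None)
            continue
        if prev is None:
            prev = name
            continue
        if is_pair(prev, name):
            yield (prev, name)
            prev = None
        else:
            if require_paired:
                raise ValueError("unpaired read: " + prev)
            # prev has no mate; emit it alone and keep looking.
            yield (prev, None)
            prev = name
    if prev is not None:
        if require_paired:
            raise ValueError("unpaired read: " + prev)
        yield (prev, None)
```

With the default arguments, the input a/1, a/2, b/1, c/1, c/2 yields the pair for a, the orphan b/1, and the pair for c.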
Mixed-pair sequence file format support
split-paired-reads.py --output-orphaned/-0 has been added so that orphaned
reads are accepted in the input and written to a file of their own.
Scripts now output columnar data in CSV format by default
All scripts that output any kind of columnar data now do so in CSV format,
with headers. Previously this had to be enabled with --csv.
(Affects abundance-dist-single.py, abundance-dist.py,
count-median.py, and count-overlap.py.)
normalize-by-median.py --report also now outputs in CSV format.
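Because the output now carries a header row, downstream code can address columns by name instead of by position. The sketch below uses Python's standard csv module; the column names and values in the sample are made up for illustration and are not taken from any khmer script.

```python
import csv
import io


def read_columns(handle):
    """Return (header, rows) from CSV text whose first line is a header."""
    reader = csv.DictReader(handle)
    return reader.fieldnames, list(reader)


# Hypothetical sample data standing in for a script's CSV output.
sample = io.StringIO("abundance,count\n1,10\n2,5\n")
header, rows = read_columns(sample)
total = sum(int(row["count"]) for row in rows)
```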
Reservoir sampling script extracts paired reads by default
sample-reads-randomly.py now retains pairs in the output, by
default. This can be overridden to match previous behavior
with --force_single.
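To show why sampling whole pairs keeps mates together, here is a minimal reservoir sampler (Algorithm R) in which each record is a (read1, read2) tuple. It is a generic sketch of the technique, not the script's implementation, and the record names are hypothetical.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly from a stream of any length."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Each record is a whole pair, so mates are kept or dropped together.
pairs = [("r%d/1" % i, "r%d/2" % i) for i in range(100)]
sample = reservoir_sample(pairs, 10, seed=1)
```

Sampling individual reads instead would correspond to passing a flat stream of reads as the records.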
New scripts
Estimate number of unique kmers
unique-kmers.py estimates the k-mer cardinality of a dataset using the HyperLogLog probabilistic data structure. This allows very low memory consumption, which can be configured through an expected error rate. Even with a low error rate (and therefore higher memory consumption), it is still much more efficient than exact counting and alternative methods. It supports multicore processing (using OpenMP) and streaming, and so can be used in conjunction with other scripts (like normalize-by-median.py and filter-abund.py). This is the work of Luiz Irber and it is the subject of a paper in draft.
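For readers unfamiliar with HyperLogLog, the following is a minimal pure-Python sketch of the estimator. It is not khmer's implementation: the hash function (SHA-1 truncated to 64 bits), the register count, and all names are assumptions chosen for clarity, and it counts k-mers on one strand only. The standard error of such an estimator is roughly 1.04 / sqrt(m) for m registers, which is the trade-off between error rate and memory mentioned above.

```python
import hashlib
import math


def kmers(sequence, k):
    """Yield every k-length substring of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


class HyperLogLog:
    """Minimal HyperLogLog with 2**p registers (p >= 7 assumed)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(item.encode("ascii")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        width = 64 - self.p
        index = h >> width                          # top p bits pick a register
        rest = h & ((1 << width) - 1)               # remaining bits
        rank = width - rest.bit_length() + 1        # position of first 1-bit
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        alpha = 0.7213 / (1.0 + 1.079 / self.m)    # bias constant, m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def estimate_unique_kmers(sequences, k, p=12):
    """Estimate the number of distinct k-mers across all sequences."""
    hll = HyperLogLog(p)
    for sequence in sequences:
        for kmer in kmers(sequence, k):
            hll.add(kmer)
    return hll.estimate()
```

The sketch keeps only 4096 small integers regardless of how many k-mers it sees, whereas exact counting would have to store every distinct k-mer.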
Incompatible changes
New datastructure and script names
For clarity, the Count-Min Sketch based data structure previously known as
"counting_hash" or "counting_table" and variations of these is now known as
countgraph. Likewise, the Bloom Filter based data structure previously
known as "hashbits", "presence_table" and variations of these is now known as
nodegraph. Many options relating to table have been changed to
graph.
Binary file formats have changed
All binary khmer formats (presence tables, counting tables, tag sets,
stop tags, and partition subsets) have changed. Files are now
prefixed with the string OXLI to indicate that they are from
this project.
Files of the above types made in previous versions of khmer are not compatible with v2.0; the reverse is also true.
In addition to the OXLI string, the Nodegraph and Countgraph file format
now includes the number of occupied bins. See khmer/Oxli Binary File Formats
for details.
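A quick way to tell whether a file was written in the new format is to look for the leading OXLI string. The sketch below checks only those first four bytes; it does not parse the rest of the header, and the function name is hypothetical.

```python
def has_oxli_signature(path):
    """Return True if the file starts with the four bytes b"OXLI"."""
    with open(path, "rb") as handle:
        return handle.read(4) == b"OXLI"
```

A file for which this returns False is a likely candidate for having been produced by an earlier, incompatible version.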
load-graph.py no longer appends .pt to the specified filename
Previously, load-graph.py appended a .pt extension to the
specified output filename and partition-graph.py appended a .pt
to the given input filename. Now, load-graph.py writes to the
specified output filename and partition-graph.py does not append a
.pt to the given input filename.
Some reporting options have been turned always on
The total number of unique k-mers will always be reported every time a new
countgraph is made. The --report-total-kmers option has been removed from
abundance-dist-single.py, filter-abund-single.py, and
normalize-by-median.py to reflect this. Likewise,
--write-fp-rate has been removed from load-into-counting.py and
load-graph.py; the false positive rate will always be
written to the .info files.
An uncommon error recovery routine was removed
To simplify the codebase, --save-on-failure and its helper option
--dump-frequency have been removed from normalize-by-median.py.
Single file output option names have been normalized
--out is now --output for both normalize-by-median.py and trim-low-abund.py.
Miscellaneous changes
The common option --min-tablesize was renamed to --max-tablesize to reflect
this more desirable behavior.
In conjunction with the new split-paired-reads.py --output-orphaned
option, the option --force-paired/-p has been eliminated.
As CSV format is now the default, the --csv option has been removed.
Removed script
count-overlap.py has been removed.