[gt-users] Sorted vs. unsorted streams (was: New feature announcement)
Gordon Gremme
gremme at gmail.com
Wed Jan 28 16:19:41 CET 2009
>>> This approach is very memory efficient, because you do not have to
>>> read the sometimes rather large annotation files all at once. But,
>>> since they are sequential, you have to read the whole file every time
>>> you process it.
>>
>> except if gff file is sorted? or does it read the entire file anyway?
>
> AFAIK, the file is streamed to its end completely. You can query the
> sortedness of a GtNodeStream implementation via
> gt_node_stream_is_sorted() so it _may_ be possible to implement this but
> I am not too involved with the internals of the stream system to give a
> definitive answer to this.
Maybe some explanations will be helpful here.
A GtGFF3InStream created with gt_gff3_in_stream_new_unsorted() or
gt_gff3_in_stream_new_sorted() reads the input files always from the
beginning and till the end (unless you delete the stream object before
the last GtGenomeNode was returned).
The memory consumption of this two input streams is mainly determined
by the underlying GFF3 parser which can only return new GtGenomeNodes
if a ### line has been encountered or a feature line does not have an
ID attribute. Because otherwise it would still be allowed to have new
subfeatures in the file which refer to a previously introduced ID (see
the GFF3 specification for details).
That is, in the worst case of an GFF3 file without any ### lines, the
GFF3 parser used by the GtGFF3InStream has to read in the whole file
and hold it in memory before it can start handing out GtGenomeNodes.
For some stream-based applications, it makes sense that the adjacent
features on the genome flow in an ordered fashion through the stream.
For this reason, sorted streams have been introduced.
As was already mentioned by Sascha, the sortedness of a GtNodeStream
implementation can be checked via gt_node_stream_is_sorted().
To sort an unsorted stream, a GtSortStream (not part of the 'official'
API yet) can be used.
This stream has a memory requirement of O(filesize), because it has to
hold all features in memory, before it can sort them and start handing
them out.
A way around this for huge GFF3 files is to split them smaller pieces,
sort them separately and merge them with a MergeStream (as it is done
by `gt merge`).
Now back to the topic of multiple range queries. Currently there are
only two options: Either read the entire file multiple times
sequentially and retaining the desired features (slow) or loading the
entire file into an GtFeatureIndexMemory and querying it (O(filesize)
memory consumption).
If one wants to combine the best of both approaches (reading only once
and low memory consumption) a new FeatureIndex implementation which
stores the features in a new index structure on the hard disk would
probably be the best approach.
While it would be technically possible to seek around sorted GFF3
files on the hard disk to find the desired features, this approach
wouldn't be very efficient and most likely very ugly to implement.
Gordon
More information about the gt-users
mailing list