[gt-users] New feature announcement
Sascha Steinbiss
steinbiss at zbh.uni-hamburg.de
Wed Jan 28 13:21:46 CET 2009
Brent Pedersen wrote:
> this example, and the shell commands below clear it up a lot. i've been only
> looking around in the python stuff. specifically the filtering -- which i had
> been doing by looping over an entire stream. there are some cases where i
> dont want to pull an entire genome gff into memory.
At the moment, the only options are to use the FeatureIndex (for
which there is only a memory-based implementation) or to stream the whole
file through a node stream for sequential processing. See below for more.
> i just tried to implement a FilterStream class in python, but
> gt_filter_stream_new relies on #define'd stuff in
> undef.h and i dont know how to get that into python ctypes. (sascha, is
> that possible?)
It does? As far as I can see, it only uses GtNodeStream, GtStr, GtRange,
GtStrand and bool (which are equivalent to int), unsigned long and
double, all of which are available from Python.
Maybe you mean that the range positions must be set to UNDEF_ULONG and
the strands to GT_NUM_OF_STRAND_TYPES if range or strand filter
conditions should be ignored? I do not know how to get these from Python
out-of-the-box. For this the ctypes_configure software could really make
sense. However, it should be compatible with the BSD license so it can
be distributed with GenomeTools (we usually do not like having
dependencies that require prior installation). Alternatively, you could
write your own script that defines gtlib.UNDEF_* constants from the
limits.h on the target system. This could be called from the Makefile,
for example.
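
A minimal sketch of that second option, done with ctypes itself rather than
by parsing limits.h. It assumes that UNDEF_ULONG is simply the largest value
an unsigned long can hold (ULONG_MAX) on the target system; this should be
checked against undef.h before relying on it. GT_NUM_OF_STRAND_TYPES is an
enum value and cannot be derived this way, so it is left out here.

```python
import ctypes

# Assumption: UNDEF_ULONG equals the largest unsigned long (ULONG_MAX from
# limits.h). Verify this against undef.h in your GenomeTools version.
# c_ulong does no overflow checking, so -1 wraps around to the maximum value.
UNDEF_ULONG = ctypes.c_ulong(-1).value

# Sanity check: every bit of the platform's unsigned long must be set.
ULONG_BITS = 8 * ctypes.sizeof(ctypes.c_ulong)
assert UNDEF_ULONG == 2 ** ULONG_BITS - 1
```

Because the value is computed at import time on the target machine, it
follows the platform's width of unsigned long (32 or 64 bit) without a
separate build step.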
Any better ideas to do this, anyone?
>> This approach is very memory efficient, because you do not have to
>> read the sometimes rather large annotation files all at once. But,
>> since they are sequential, you have to read the whole file every time
>> you process it.
>
> except if gff file is sorted? or does it read the entire file anyway?
AFAIK, the file is always streamed completely to its end. You can query the
sortedness of a GtNodeStream implementation via
gt_node_stream_is_sorted(), so it _may_ be possible to implement this, but
I am not involved enough with the internals of the stream system to give a
definitive answer.
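
To illustrate what stopping early on sorted input could look like in
principle, here is a plain Python sketch. It does not use the GenomeTools
API; the (seqid, start, end) tuples and the function name are hypothetical,
and it assumes that features are grouped by sequence and sorted by start
position within each sequence.

```python
from typing import Iterable, Iterator, Tuple

Feature = Tuple[str, int, int]  # hypothetical layout: (seqid, start, end)


def features_in_range(sorted_features: Iterable[Feature],
                      seqid: str, start: int, end: int) -> Iterator[Feature]:
    """Yield features on seqid overlapping [start, end], stopping early."""
    seen_seqid = False
    for feat in sorted_features:
        f_seqid, f_start, f_end = feat
        if f_seqid != seqid:
            if seen_seqid:
                break  # we have left the block of the requested sequence
            continue
        seen_seqid = True
        if f_start > end:
            break  # sorted by start: no later feature can overlap
        if f_end >= start:
            yield feat


feats = [("chr1", 1, 100), ("chr1", 50, 200), ("chr1", 500, 600),
         ("chr1", 900, 1000), ("chr2", 1, 10)]
hits = list(features_in_range(feats, "chr1", 150, 550))
```

Everything before the requested region still has to be read, so this only
saves the tail of the file, not the head.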
>> So now we covered the GenomeNode interface (to represent annotations),
>> the NodeStream interface (to process annotations sequentially), and
>> the FeatureIndex interface (to process annotations with random
>> access).
>
> yes, makes sense, though much of the time, my use case is in between,
> where i want like 500K basepairs of a single chromosome or just a single
> chromosome. and if i understand correctly, that's the usecase for a filterstream
> so at least it's filtering in c.
Actually, doing efficient range overlap queries to pull out features in
a specific range (e.g. in your case all features in a 500K region) or a
single chromosome (via the sequence region) is more what the
FeatureIndex is intended to do. As stated above, it must be read into
memory though (but may be kept around for multiple queries).
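
As a rough illustration of the idea (keep everything in memory once, then
answer many range queries), here is a toy index in plain Python. This is
not how the FeatureIndex is implemented and it does not use the GenomeTools
API; the class name, method names and tuple layout are made up for the
example.

```python
import bisect
from collections import defaultdict


class SimpleFeatureIndex:
    """Toy in-memory index: seqid -> features sorted by start position."""

    def __init__(self):
        self._by_seqid = defaultdict(list)

    def add(self, seqid, start, end, name):
        # Keep each per-sequence list ordered by (start, end, name).
        bisect.insort(self._by_seqid[seqid], (start, end, name))

    def query(self, seqid, start, end):
        """Return all features on seqid that overlap [start, end]."""
        feats = self._by_seqid.get(seqid, [])
        # Features starting after `end` cannot overlap the query range.
        hi = bisect.bisect_right(feats, (end, float("inf"), ""))
        # Of the remaining ones, keep those that end at or after `start`.
        return [f for f in feats[:hi] if f[1] >= start]


index = SimpleFeatureIndex()
index.add("chr1", 100, 200, "a")
index.add("chr1", 150, 700000, "b")
index.add("chr1", 600000, 600100, "c")
index.add("chr2", 1, 50, "d")
region = index.query("chr1", 1, 500000)  # a 500K region on one chromosome
```

The query still scans every feature that starts before the end of the
range, so a real implementation would more likely use an interval tree or
a similar structure.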
This request has come up before, which shows that a more
memory-efficient FeatureIndex implementation with a more flexible query
strategy is needed. It is likely to come at some point, but at
the moment nothing like this is in active development. We have to
think about what exactly is needed and what typical use cases would look
like. The more input I get on this matter, the better...
> both sascha and gordon, thanks for taking the time to help me
> figure this out.
> i'm having fun tinkering with gt (and using it for some actual work) so far.
I am happy to hear that and always glad to receive feedback,
recommendations or contributions. Unfortunately, the example URL from
your earlier email does not work anymore as the server seems to be down.
Can we look forward to some GenomeTools-based software at some point? ;)
> -b
Sascha
--
Sascha Steinbiss
Center for Bioinformatics
University of Hamburg
Bundesstr. 43
20146 Hamburg
Germany
Email: steinbiss at zbh.uni-hamburg.de
URL: http://www.zbh.uni-hamburg.de/steinbiss
Phone: +49 (40) 42838 7322
FAX: +49 (40) 42838 7312