[gt-users] New feature announcement

Sascha Steinbiss steinbiss at zbh.uni-hamburg.de
Wed Jan 28 13:21:46 CET 2009


Brent Pedersen wrote:
> this example, and the shell commands below clear it up a lot. i've been only
> looking around in the python stuff. specifically the filtering -- which i had
> been doing by looping over an entire stream. there are some cases where i
> dont want to pull an entire genome gff into memory.

At the moment, there are only two options: either use the FeatureIndex 
(for which there is only a memory-based implementation) or stream the 
whole file through a node stream for sequential processing. See below 
for more.

> i just tried to implement a FilterStream class in python, but
> gt_filter_stream_new relies on #define'd stuff in
> undef.h and i dont know how to get that into python ctypes. (sascha, is
> that possible?)

It does? As far as I can see, it only uses GtNodeStream, GtStr, GtRange, 
GtStrand and bool (which are equivalent to int), unsigned long and 
double, all of which are available from Python.
Maybe you mean that the range positions must be set to UNDEF_ULONG and 
the strands to GT_NUM_OF_STRAND_TYPES if range or strand filter 
conditions should be ignored? I do not know how to get these from Python 
out-of-the-box. For this, the ctypes_configure software could really make 
sense. However, it would have to be compatible with the BSD license so 
that it can be distributed with GenomeTools (we usually do not like 
having dependencies that require prior installation). Alternatively, you 
could write your own script that defines gtlib.UNDEF_* constants from 
the limits.h on the target system. This could be called from the 
Makefile, for example.
Does anyone have a better idea of how to do this?
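
A minimal sketch of the "own script" idea, using ctypes alone instead of 
parsing limits.h. It assumes that undef.h defines UNDEF_ULONG as the 
largest unsigned long value (~0UL); this should be checked against the 
header before relying on it. GT_NUM_OF_STRAND_TYPES is left out on 
purpose, because its value depends on the GtStrand enum and has to be 
taken from the C source.

```python
import ctypes

# Assumption: undef.h defines UNDEF_ULONG as ~0UL, i.e. the largest value
# an unsigned long can hold on the target platform.
# c_ulong does no overflow checking, so -1 wraps around to that maximum.
UNDEF_ULONG = ctypes.c_ulong(-1).value

if __name__ == "__main__":
    print("UNDEF_ULONG = %d" % UNDEF_ULONG)
```

Because the value is computed at import time, it follows the width of 
unsigned long on whatever system the bindings run on.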

>> This approach is very memory efficient, because you do not have to
>> read the sometimes rather large annotation files all at once. But,
>> since they are sequential, you have to read the whole file every time
>> you process it.
> 
> except if gff file is sorted? or does it read the entire file anyway?

AFAIK, the file is always streamed completely to its end. You can query 
the sortedness of a GtNodeStream implementation via 
gt_node_stream_is_sorted(), so it _may_ be possible to implement this, 
but I am not involved enough with the internals of the stream system to 
give a definitive answer.
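
To illustrate why sortedness would allow stopping early, here is a 
stand-alone sketch in plain Python. It does not use the GenomeTools API; 
the function name and the (seqid, start, end) tuples are made up for 
illustration, and it assumes the input is sorted by seqid and then by 
start position.

```python
def features_in_range(features, seqid, start, end):
    """Yield (seqid, start, end) tuples overlapping [start, end] on seqid.

    Assumes `features` is sorted by seqid and then by start position, so
    iteration can stop as soon as no later feature can overlap.
    """
    seen_seqid = False
    for f_seqid, f_start, f_end in features:
        if f_seqid != seqid:
            if seen_seqid:
                break  # left the target sequence region: done
            continue
        seen_seqid = True
        if f_start > end:
            break  # sorted by start: nothing later can overlap
        if f_end >= start:
            yield (f_seqid, f_start, f_end)
```

On unsorted input neither `break` would be safe, which is presumably 
why a generic stream has to be read to its end.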

>> So now we covered the GenomeNode interface (to represent annotations),
>> the NodeStream interface (to process annotations sequentially), and
>> the FeatureIndex interface (to process annotations with random
>> access).
> 
> yes, makes sense, though much of the time, my use case is in between,
> where i want like 500K basepairs of a single chromosome or just a single
> chromosome. and if i understand correctly, that's the usecase for a filterstream
> so at least it's filtering in c.

Actually, doing efficient range overlap queries to pull out features in 
a specific range (e.g. in your case all features in a 500K region) or a 
single chromosome (via the sequence region) is more what the 
FeatureIndex is intended to do. As stated above, it must be read into 
memory though (but may be kept around for multiple queries).
This request has come up before, which shows that a more 
memory-efficient FeatureIndex implementation with a more flexible query 
strategy is needed. It is likely to come at some point, but at the 
moment nothing like this is in active development. We have to think 
about what exactly is needed and what typical use cases would look 
like. The more input I get on this matter, the better...
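
The trade-off described above (load once into memory, then answer many 
range queries) can be sketched with a toy index in plain Python. This is 
not the GenomeTools FeatureIndex; the class and method names are 
hypothetical, and a real implementation would use a more efficient 
overlap structure than a linear scan per sequence region.

```python
from collections import defaultdict


class ToyFeatureIndex(object):
    """Hypothetical in-memory index: features grouped by sequence region."""

    def __init__(self):
        self._by_seqid = defaultdict(list)

    def add(self, seqid, start, end, name):
        # Everything is kept in memory, which is the cost of random access.
        self._by_seqid[seqid].append((start, end, name))

    def features_for_range(self, seqid, start, end):
        # Two closed intervals overlap if each starts before the other ends.
        return [f for f in self._by_seqid.get(seqid, [])
                if f[0] <= end and f[1] >= start]


index = ToyFeatureIndex()
index.add("chr1", 100, 500, "geneA")
index.add("chr1", 400000, 600000, "geneB")
index.add("chr2", 100, 500, "geneC")
hits = index.features_for_range("chr1", 1, 500000)
```

Only features on the requested sequence region are examined, and the 
same index object can be reused for any number of queries.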

> both sascha and gordon, thanks for taking the time to help me
> figure this out.
> i'm having fun tinkering with gt (and using it for some actual work) so far.

I am happy to hear that and always glad to receive feedback, 
recommendations or contributions. Unfortunately, the example URL from 
your earlier email does not work anymore as the server seems to be down. 
Can we look forward to some GenomeTools-based software at some point? ;)

> -b

Sascha

-- 
Sascha Steinbiss
Center for Bioinformatics
University of Hamburg
Bundesstr. 43
20146 Hamburg
Germany

Email:  steinbiss at zbh.uni-hamburg.de
URL:    http://www.zbh.uni-hamburg.de/steinbiss
Phone:  +49 (40) 42838 7322
FAX:    +49 (40) 42838 7312


