[gt-users] New feature announcement

Brent Pedersen bpederse at gmail.com
Thu Jan 29 05:34:29 CET 2009


On Wed, Jan 28, 2009 at 4:21 AM, Sascha Steinbiss
<steinbiss at zbh.uni-hamburg.de> wrote:
> Brent Pedersen wrote:
>> this example, and the shell commands below clear it up a lot. i've been only
>> looking around in the python stuff. specifically the filtering -- which i had
>> been doing by looping over an entire stream. there are some cases where i
>> don't want to pull an entire genome gff into memory.
>
> At the moment, all that is possible is either use the FeatureIndex (for
> which there is only a memory-based implementation) or stream the file
> completely through a stream for sequential processing. See below for more.
>
>> i just tried to implement a FilterStream class in python, but
>> gt_filter_stream_new relies on #define'd stuff in
>> undef.h and i don't know how to get that into python ctypes. (sascha, is
>> that possible?)
>
> It does? As far as I can see, it only uses GtNodeStream, GtStr, GtRange,
> GtStrand and bool (which are equivalent to int), unsigned long and
> double, all of which are available from Python.
> Maybe you mean that the range positions must be set to UNDEF_ULONG and
> the strands to GT_NUM_OF_STRAND_TYPES if range or strand filter
> conditions should be ignored? I do not know how to get these from Python
> out-of-the-box. For this the ctypes_configure software could really make

yes, that's what i meant. i tried just using large and small values and
that didn't seem to work (no error, the filter just didn't give anything
back). so i figured it was something with the constants.
here's what i tried: http://rafb.net/p/zkghfW39.html
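
for reference, a minimal sketch of getting the "undefined" range value
into python without parsing the header. it assumes UNDEF_ULONG is defined
in undef.h as ~0UL (the largest unsigned long on the platform), which
should be checked against the header of the installed version. the strand
constant GT_NUM_OF_STRAND_TYPES is an enum value and is not derived here;
it would have to be read from the strand header.

```python
import ctypes

# Assumption: undef.h defines UNDEF_ULONG as ~0UL, i.e. all bits set.
# ctypes does no overflow checking, so -1 wraps to the platform's ULONG_MAX.
UNDEF_ULONG = ctypes.c_ulong(-1).value

# Cross-check against the size of unsigned long on this platform.
ULONG_BITS = 8 * ctypes.sizeof(ctypes.c_ulong)
assert UNDEF_ULONG == 2 ** ULONG_BITS - 1
```

an arbitrary "large" value is unlikely to equal this sentinel, which would
explain a filter that silently returns nothing.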

> sense.  However, it should be compatible with the BSD license so it can
> be distributed with GenomeTools (we usually do not like having
> dependencies that require prior installation). Alternatively, you could
> write your own script that defines gtlib.UNDEF_* constants from the
> limits.h on the target system. This could be called from the Makefile,
> for example.
> Any better ideas to do this, anyone?
>
>>> This approach is very memory efficient, because you do not have to
>>> read the sometimes rather large annotation files all at once. But,
>>> since they are sequential, you have to read the whole file every time
>>> you process it.
>>
>> except if the gff file is sorted? or does it read the entire file anyway?
>
> AFAIK, the file is streamed to its end completely. You can query the
> sortedness of a GtNodeStream implementation via
> gt_node_stream_is_sorted() so it _may_ be possible to implement this but
> I am not too involved with the internals of the stream system to give a
> definitive answer to this.
>
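
to illustrate why sortedness could allow stopping early, here is a
pure-python sketch that does not use the GenomeTools API at all. features
are hypothetical (seqid, start, end) tuples, and the input is assumed to
be sorted by seqid and then by start position.

```python
def features_in_range(sorted_features, seqid, start, end):
    """Yield features on `seqid` that overlap the closed range [start, end].

    Assumes `sorted_features` yields (seqid, start, end) tuples sorted by
    (seqid, start). Reading stops at the first feature past the range.
    """
    for f_seqid, f_start, f_end in sorted_features:
        if f_seqid < seqid:
            continue  # sequence region not reached yet
        if f_seqid > seqid or f_start > end:
            break  # sorted input: nothing further can overlap
        if f_end >= start:
            yield (f_seqid, f_start, f_end)


features = iter([
    ("chr1", 1, 10),
    ("chr1", 50, 60),
    ("chr1", 100, 200),
    ("chr2", 1, 5),
])
hits = list(features_in_range(features, "chr1", 40, 70))
```

the loop stops at the first feature beyond the range, so the rest of the
input is never read. on unsorted input this shortcut would miss features.
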
>>> So now we covered the GenomeNode interface (to represent annotations),
>>> the NodeStream interface (to process annotations sequentially), and
>>> the FeatureIndex interface (to process annotations with random
>>> access).
>>
>> yes, makes sense, though much of the time, my use case is in between,
>> where i want like 500K basepairs of a single chromosome or just a single
>> chromosome. and if i understand correctly, that's the use case for a filterstream
>> so at least it's filtering in c.
>
> Actually, doing efficient range overlap queries to pull out features in
> a specific range (e.g. in your case all features in a 500K region) or a
> single chromosome (via the sequence region) is more what the
> FeatureIndex is intended to do. As stated above, it must be read into
> memory though (but may be kept around for multiple queries).
> This request has come up before and this shows that a more
> memory-efficient FeatureIndex implementation with a more flexible query
> strategy is needed and is likely to come at some point in time, but at
> the moment something like this is not in active development. We have to
> think about what exactly is needed and how typical use cases would look
> like. The more input I get in this matter, the better...

actually, the FeatureIndexMemory does work quite well. my brain may be too
accustomed to relational db's. the only place it really matters is for
a web-app.
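
as a rough picture of what an in-memory index with range queries does,
here is a small sketch. TinyFeatureIndex and its methods are hypothetical
names for illustration only, not the GenomeTools FeatureIndex API.

```python
import bisect
from collections import defaultdict


class TinyFeatureIndex:
    """Hypothetical in-memory index: seqid -> features sorted by start."""

    def __init__(self):
        self._by_seqid = defaultdict(list)

    def add(self, seqid, start, end, name):
        # Keep each per-sequence list sorted by (start, end, name).
        bisect.insort(self._by_seqid[seqid], (start, end, name))

    def features_for_range(self, seqid, start, end):
        """Return features on `seqid` overlapping the closed range [start, end]."""
        feats = self._by_seqid.get(seqid, [])
        # Everything at or after `hi` starts beyond `end`.
        hi = bisect.bisect_right(feats, (end, float("inf"), ""))
        return [f for f in feats[:hi] if f[1] >= start]


idx = TinyFeatureIndex()
idx.add("chr1", 100, 200, "a")
idx.add("chr1", 600000, 600100, "b")
idx.add("chr2", 150, 160, "c")
region = idx.features_for_range("chr1", 1, 500000)
```

the index is built once and can then answer many queries, at the cost of
holding every feature in memory.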


>
>> both sascha and gordon, thanks for taking the time to help me
>> figure this out.
>> i'm having fun tinkering with gt (and using it for some actual work) so far.
>
> I am happy to hear that and always glad to receive feedback,
> recommendations or contributions. Unfortunately, the example URL from
> your earlier email does not work anymore as the server seems to be down.
> Can we look forward at some time to some GenomeTools-based software? ;)
>

i've changed offices and so have a new ip address. but i haven't had
much time to work with annotationsketch--i'd like to get the
genome-browser to which a user can add annotations, and a visual
comparative tool. hopefully something soon.

re contributions, after switching out a few simple scripts for some
shell gt calls, i found gt splitfasta. it almost did what i needed.
so i did just push a change to github that adds a -numfiles option to
splitfasta. if it's useful to anyone else, please let me know if/how i
can clean it up. but the new option does seem to work and i added a test.

thanks again,
-brent

>> -b
>
> Sascha
>
> --
> Sascha Steinbiss
> Center for Bioinformatics
> University of Hamburg
> Bundesstr. 43
> 20146 Hamburg
> Germany
>
> Email:  steinbiss at zbh.uni-hamburg.de
> URL:    http://www.zbh.uni-hamburg.de/steinbiss
> Phone:  +49 (40) 42838 7322
> FAX:    +49 (40) 42838 7312
>
> _______________________________________________
> gt-users mailing list
> gt-users at genometools.org
> http://genometools.org/mailman/listinfo/gt-users
>

