[gt-users] New feature announcement
Brent Pedersen
bpederse at gmail.com
Tue Jan 27 22:48:54 CET 2009
On Tue, Jan 27, 2009 at 12:57 PM, Brent Pedersen <bpederse at gmail.com> wrote:
> On Mon, Jan 26, 2009 at 6:13 PM, Gordon Gremme <gremme at gmail.com> wrote:
>>> i'm still foggy at best on understanding the Stream or visitor stuff.
>>
>> Let me try to shed some light on the subject ;-)
>> Parts of the following explanation are taken from another email,
>> therefore it also explains things you probably know already.
>>
>> GenomeTools uses the GenomeNode interface and most importantly its
>> FeatureNode implementation to represent all kinds of genome annotations.
>> A FeatureNode is basically a directed acyclic graph (DAG) in which each
>> node represents an annotated genomic region (e.g. an exon from position
>> 20 to 30) and each edge represents a part-of relationship (exon is
>> part-of gene). The
>> nice thing about this data structure is its versatility. You can
>> create it automatically from GFF3 files (via a GFF3InStream) or
>> manually in a script (as is done in some of the AnnotationSketch
>> examples). Once you have all your GenomeNodes you can easily store them
>> in a GFF3 file with the GFF3OutStream.
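
To make the data structure concrete, here is a minimal sketch in plain Python. It does not use the GenomeTools bindings: the Feature class and its method names are made up for illustration. Each object is an annotated region, and each parent-child link is a part-of relationship. Because the structure is a DAG and not a tree, one exon can belong to two transcripts, so the traversal has to make sure it visits each node only once.

```python
class Feature:
    """Hypothetical stand-in for a FeatureNode: a region plus part-of children."""

    def __init__(self, seqid, ftype, start, end):
        self.seqid, self.ftype = seqid, ftype
        self.start, self.end = start, end
        self.children = []  # each child is part-of this feature

    def add_child(self, child):
        self.children.append(child)
        return child

    def traverse(self):
        """Yield this feature and everything below it, each node once."""
        seen = set()
        stack = [self]
        while stack:
            node = stack.pop()
            if id(node) in seen:  # a DAG node can have several parents
                continue
            seen.add(id(node))
            yield node
            stack.extend(reversed(node.children))


gene = Feature("chr21", "gene", 1, 100)
mrna1 = gene.add_child(Feature("chr21", "mRNA", 1, 100))
mrna2 = gene.add_child(Feature("chr21", "mRNA", 1, 60))
exon = mrna1.add_child(Feature("chr21", "exon", 20, 30))
mrna2.add_child(exon)  # the same exon is part of both transcripts

# all exons shorter than 50 bases (coordinates are inclusive)
short_exons = [f for f in gene.traverse()
               if f.ftype == "exon" and f.end - f.start + 1 < 50]
```

The shared exon shows up once in short_exons even though it is reachable through both mRNAs.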
>>
>> The other implementations (RegionNode, SequenceNode, CommentNode) are
>> mostly used to represent other parts of GFF3 files (sequence-region
>> lines, embedded Fasta files, and comment lines) and are usually not
>> used if one constructs annotations manually (FeatureNodes suffice for
>> this).
>>
>> To process annotations, e.g. for retrieval of all exons which are
>> below a certain length, two basic approaches exist: Sequentially via
>> NodeStreams or randomly via a FeatureIndex.
>>
>> Sequentially via NodeStreams is the approach most GFF3-related tools
>> contained in GenomeTools take. They implement the NodeStream interface,
>> which makes it easy to plug together modules that transform
>> FeatureNodes, both on the C code level and on the shell level.
>>
>> Example: Our FilterStream filters FeatureNodes according to
>> different criteria.
>> On the C code level we create three streams and plug them together:
>> The GFF3InStream reads from a set of GFF3 files and returns
>> GenomeNodes, the FilterStream takes any NodeStream and filters the
>> nodes in accordance with its settings, and the GFF3OutStream takes any
>> NodeStream and writes its content as GFF3 output.
>> At the end we pull nodes through the GFF3OutStream, which in turn asks
>> its predecessor (which asks its predecessor, and so forth) for new
>> GenomeNodes until they are exhausted.
>>
>> gff3_in_stream = gt_gff3_in_stream_new_unsorted(argc - parsed_args,
>>                                                 argv + parsed_args);
>>
>> /* create a filter stream */
>> filter_stream = gt_filter_stream_new(gff3_in_stream, arguments->seqid, ...);
>>
>> /* create a gff3 output stream */
>> gff3_out_stream = gt_gff3_out_stream_new(arguments->targetbest
>>                                          ? targetbest_filter_stream
>>                                          : filter_stream,
>>                                          arguments->outfp);
>>
>> /* pull the features through the stream and free them afterwards */
>> while (!(had_err = gt_node_stream_next(gff3_out_stream, &gn, err)) &&
>>        gn) {
>>   gt_genome_node_delete(gn);
>> }
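
The same pull-through idea can be sketched with Python generators. This is only an illustration of the pattern, not the GenomeTools API: the three function names mirror the stream names above but are made up, and the parser is a toy that treats every line as an independent feature (a real GFF3 parser also rebuilds the parent/child structure).

```python
import io


def gff3_in_stream(handle):
    """Lazily parse feature lines; yields one dict per line."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        yield {"seqid": cols[0], "type": cols[2],
               "start": int(cols[3]), "end": int(cols[4]), "cols": cols}


def filter_stream(in_stream, seqid):
    """Pass on only the features on the given sequence."""
    for feature in in_stream:
        if feature["seqid"] == seqid:
            yield feature


def gff3_out_stream(in_stream, outfp):
    """Write each feature as it passes through, then hand it on."""
    for feature in in_stream:
        outfp.write("\t".join(feature["cols"]) + "\n")
        yield feature


data = io.StringIO(
    "##gff-version 3\n"
    "chr21\t.\tgene\t1000\t2000\t.\t+\t.\tID=g1\n"
    "chr22\t.\tgene\t500\t900\t.\t-\t.\tID=g2\n"
    "chr21\t.\texon\t1200\t1300\t.\t+\t.\tParent=g1\n"
)
out = io.StringIO()
stream = gff3_out_stream(filter_stream(gff3_in_stream(data), "chr21"), out)

# pull the features through the last stream; nothing is read before this
count = sum(1 for _ in stream)
```

Only one feature is held at a time, which is where the low memory use of the sequential approach comes from.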
>
> this example and the shell commands below clear it up a lot. i've only been
> looking around in the python stuff, specifically the filtering, which i had
> been doing by looping over an entire stream. there are some cases where i
> don't want to pull an entire genome gff into memory.
> i just tried to implement a FilterStream class in python, but
> gt_filter_stream_new relies on #define'd stuff in
> undef.h and i don't know how to get that into python ctypes. (sascha, is
> that possible?)
i think this might help:
http://codespeak.net/~fijal/configure.html
i'll have a look soon.
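
A note on the #define question: preprocessor macros are replaced at compile time and never end up as symbols in the shared library, so ctypes cannot look them up. The usual workaround is to mirror the value on the Python side. A minimal sketch, assuming the "undefined" marker is an unsigned long with all bits set; the macro name and value here are a guess and need to be checked against undef.h, and as_ulong_arg is a made-up helper:

```python
import ctypes

# Assumption: the C header defines the "undefined" marker as an unsigned
# long with all bits set (something like `#define UNDEF_ULONG ~0UL`).
# Check undef.h before relying on this.
UNDEF_ULONG = ctypes.c_ulong(-1).value  # ctypes wraps -1 around to ULONG_MAX


def as_ulong_arg(value=None):
    """Turn an optional Python int into a c_ulong, using the marker for None."""
    return ctypes.c_ulong(UNDEF_ULONG if value is None else value)
```

A wrapper could then pass as_ulong_arg() for every numeric filter option the caller leaves unset.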
> meanwhile, as of this morning i'm making use of the shell commands. i
> didn't know about filter.
>
>
>>
>> This approach is very memory efficient, because you do not have to
>> read the sometimes rather large annotation files all at once. But
>> since the streams are sequential, you have to read the whole file
>> every time you process it.
>
> except if the gff file is sorted? or does it read the entire file anyway?
>
>>
>> The sequential approach makes it easy to combine tools on the shell
>> level. Example:
>>
>> gth -gff3out -skipalignmentout ... | gt gff3 -sort - |
>>   gt filter -seqid chr21 -overlap 1200 2000 | gt sketch test.png -
>>
>> If you need random access or multiple queries to the same annotation
>> set (as we did in the context of AnnotationSketch, where we
>> have multiple range queries), the FeatureIndex interface is probably
>> the place to start. Our current implementation stores all the features
>> in main memory and allows only simple queries, but more sophisticated
>> indexing and query strategies could be implemented in another
>> FeatureIndex implementation.
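
For comparison with the streams, here is a rough Python sketch of the random-access idea: load everything once, then run as many range queries as needed. SimpleFeatureIndex and features_for_range are hypothetical names for this example, not the FeatureIndex API, and the linear scan per sequence is the simplest possible query strategy.

```python
from collections import defaultdict


class SimpleFeatureIndex:
    """Hypothetical in-memory index: features grouped by sequence ID."""

    def __init__(self):
        self._by_seqid = defaultdict(list)

    def add(self, seqid, start, end, name):
        self._by_seqid[seqid].append((start, end, name))

    def features_for_range(self, seqid, start, end):
        """Return names of features overlapping [start, end] (inclusive)."""
        return [name for (s, e, name) in self._by_seqid.get(seqid, [])
                if s <= end and e >= start]


index = SimpleFeatureIndex()
index.add("chr21", 1000, 2000, "gene1")
index.add("chr21", 5000, 6000, "gene2")
index.add("chr22", 1500, 1800, "gene3")

# several range queries against the same data, no re-reading of any file
hits_a = index.features_for_range("chr21", 1200, 2000)
hits_b = index.features_for_range("chr21", 1900, 5100)
```

The price is that all features stay in memory. A smarter implementation could keep each list sorted or use an interval tree to avoid scanning every feature on a sequence.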
>>
>> So now we covered the GenomeNode interface (to represent annotations),
>> the NodeStream interface (to process annotations sequentially), and
>> the FeatureIndex interface (to process annotations with random
>> access).
>
> yes, makes sense, though much of the time my use case is in between,
> where i want like 500K basepairs of a single chromosome or just a single
> chromosome. and if i understand correctly, that's the use case for a
> filterstream, so at least it's filtering in c.
>
>
>>
>> The remaining piece, NodeVisitor, serves mainly software engineering
>> purposes. It allows processing different implementations of the
>> GenomeNode interface without excessive downcasting. I can't describe
>> visitors better than the Design Patterns book does
>> (http://en.wikipedia.org/wiki/Design_Patterns). See also
>> http://en.wikipedia.org/wiki/Visitor_pattern, but I find the
>> explanation in the book better.
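
A small Python sketch of the visitor idea may help it digest. The class names echo the node types mentioned above, but these are plain stand-ins written for this example, not the GenomeTools classes. Each node type knows which visit method to call on the visitor, so the code doing the work never has to check or cast node types itself.

```python
class NodeVisitor:
    """Default visitor: ignore every node type unless overridden."""

    def visit_feature_node(self, node):
        pass

    def visit_region_node(self, node):
        pass

    def visit_comment_node(self, node):
        pass


class FeatureNode:
    def __init__(self, ftype, start, end):
        self.ftype, self.start, self.end = ftype, start, end

    def accept(self, visitor):
        visitor.visit_feature_node(self)


class RegionNode:
    def __init__(self, seqid, start, end):
        self.seqid, self.start, self.end = seqid, start, end

    def accept(self, visitor):
        visitor.visit_region_node(self)


class CommentNode:
    def __init__(self, text):
        self.text = text

    def accept(self, visitor):
        visitor.visit_comment_node(self)


class ExonCounter(NodeVisitor):
    """Only cares about feature nodes; other node types fall through."""

    def __init__(self):
        self.count = 0

    def visit_feature_node(self, node):
        if node.ftype == "exon":
            self.count += 1


nodes = [RegionNode("chr21", 1, 5000), CommentNode("made up example"),
         FeatureNode("gene", 1000, 2000), FeatureNode("exon", 1200, 1300)]
counter = ExonCounter()
for node in nodes:
    node.accept(counter)  # each node dispatches to the matching visit method
```

Adding a new kind of processing means writing one more visitor class, and the node classes stay untouched.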
>
> this makes a bit more sense now, though i'll have to let it digest for a while.
>
>>
>> I hope this makes things clearer. If not, please ask!
>>
>> Gordon
>
> it does. both sascha and gordon, thanks for taking the time to help me
> figure this out.
> i'm having fun tinkering with gt (and using it for some actual work) so far.
>
> -b
>