[gt-users] New feature announcement

Brent Pedersen bpederse at gmail.com
Tue Jan 27 21:57:00 CET 2009


On Mon, Jan 26, 2009 at 6:13 PM, Gordon Gremme <gremme at gmail.com> wrote:
>> i'm still foggy at best on understanding the Stream or visitor stuff.
>
> Let me try to shed some light on the subject ;-)
> Parts of the following explanation are taken from another email,
> therefore it also explains things you probably know already.
>
> GenomeTools uses the GenomeNode interface and most importantly its
> FeatureNode implementation to represent all kinds of genome annotations.
> A FeatureNode is basically a directed acyclic graph (DAG) in which each
> node represents an annotated genomic region (e.g. exon from position 20
> to 30) and each edge represents a part-of relationship (exon is part-of
> gene). The
> nice thing about this data structure is its versatility. You can
> create it automatically from GFF3 files (via a GFF3InStream) or
> manually in a script (as is done in some of the AnnotationSketch
> examples). Once you have all your GenomeNodes you can easily store them
> in a GFF3 file with the GFF3OutStream.
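To make the part-of graph concrete, here is a minimal pure-Python sketch. It does not use the gt bindings: the class `Feature` and its methods `add_child` and `traverse` are hypothetical stand-ins for a FeatureNode, not its real API, and the coordinates are made up.

```python
class Feature:
    """Hypothetical stand-in for a FeatureNode: one annotated region."""

    def __init__(self, seqid, ftype, start, end):
        self.seqid = seqid
        self.ftype = ftype
        self.start = start
        self.end = end
        self.children = []  # edges: each child is part-of this feature

    def add_child(self, child):
        self.children.append(child)
        return child

    def traverse(self):
        """Yield this feature and everything below it, depth first.

        A node reachable via several parents (legal in a DAG) is
        yielded only once.
        """
        seen = set()
        stack = [self]
        while stack:
            node = stack.pop()
            if id(node) in seen:
                continue
            seen.add(id(node))
            yield node
            stack.extend(reversed(node.children))


gene = Feature("chr21", "gene", 1, 100)
mrna1 = gene.add_child(Feature("chr21", "mRNA", 1, 100))
mrna2 = gene.add_child(Feature("chr21", "mRNA", 1, 80))
exon = mrna1.add_child(Feature("chr21", "exon", 20, 30))
mrna2.add_child(exon)  # shared exon: part-of two transcripts

types = [f.ftype for f in gene.traverse()]
```

The shared exon is why this is a DAG rather than a tree: it has two parents but is still visited once.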
>
> The other implementations (RegionNode, SequenceNode, CommentNode) are
> mostly used to represent other parts of GFF3 files (sequence-region
> lines, embedded Fasta files, and comment lines) and are usually not
> used if one constructs annotations manually (FeatureNodes suffice for
> this).
>
> To process annotations, e.g. for retrieval of all exons which are
> below a certain length, two basic approaches exist: Sequentially via
> NodeStreams or randomly via a FeatureIndex.
>
> Sequentially via NodeStreams is the approach most GFF3-related tools
> contained in GenomeTools take. They implement the NodeStream interface,
> which makes it easy to plug together modules that transform
> FeatureNodes, both on the C code level and on the shell level.
>
> Example: Our FilterStream allows filtering FeatureNodes according to
> different criteria.
> On the C code level we create three streams and plug them together:
> The GFF3InStream reads from a set of GFF3 files and returns
> GenomeNodes, the FilterStream takes any NodeStream and filters the
> nodes in accordance with its settings, and the GFF3OutStream takes any
> NodeStream and writes its content as GFF3 output.
> At the end we pull nodes through the GFF3OutStream, which in turn asks
> its predecessor (which asks its predecessor, and so forth) for new
> GenomeNodes until they are exhausted.
>
>  /* create a gff3 input stream */
>  gff3_in_stream = gt_gff3_in_stream_new_unsorted(argc - parsed_args,
>                                                 argv + parsed_args);
>
>  /* create a filter stream */
>  filter_stream = gt_filter_stream_new(gff3_in_stream, arguments->seqid, ...);
>
>  /* create a gff3 output stream
>     (targetbest_filter_stream is not defined in this excerpt) */
>  gff3_out_stream = gt_gff3_out_stream_new(arguments->targetbest
>                                          ? targetbest_filter_stream
>                                          : filter_stream,
>                                          arguments->outfp);
>
>  /* pull the features through the stream and free them afterwards */
>  while (!(had_err = gt_node_stream_next(gff3_out_stream, &gn, err)) &&
>         gn) {
>    gt_genome_node_delete(gn);
>  }

this example and the shell commands below clear it up a lot. i've only been
looking around in the python stuff, specifically the filtering, which i had
been doing by looping over an entire stream. there are some cases where i
don't want to pull an entire genome gff into memory.
i just tried to implement a FilterStream class in python, but
gt_filter_stream_new relies on #define'd stuff in undef.h and i don't know
how to get that into python ctypes. (sascha, is that possible?)
meanwhile, as of this morning i'm making use of the shell commands. i
didn't know about filter.
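
The pull-based chaining described above can also be sketched with plain python generators, without pulling a whole gff into memory. This is an illustration only, not the gt python API: `read_gff3`, `filter_seqid` and `write_gff3` are made-up names, the sample records are invented, and the parsing only handles the nine tab-separated GFF3 columns.

```python
import io


def read_gff3(lines):
    """Yield one feature dict per GFF3 feature line, lazily."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        yield {"seqid": cols[0], "type": cols[2],
               "start": int(cols[3]), "end": int(cols[4]), "raw": line}


def filter_seqid(features, seqid):
    """Pass through only features on the given sequence."""
    for feat in features:
        if feat["seqid"] == seqid:
            yield feat


def write_gff3(features, out):
    """Pull features from the predecessor, write them, return the count."""
    out.write("##gff-version 3\n")
    n = 0
    for feat in features:
        out.write(feat["raw"] + "\n")
        n += 1
    return n


# Invented sample input standing in for a real GFF3 file.
sample = io.StringIO(
    "##gff-version 3\n"
    "chr21\t.\tgene\t1000\t5000\t.\t+\t.\tID=g1\n"
    "chr22\t.\tgene\t200\t900\t.\t-\t.\tID=g2\n"
    "chr21\t.\texon\t1200\t1500\t.\t+\t.\tParent=g1\n"
)
out = io.StringIO()
written = write_gff3(filter_seqid(read_gff3(sample), "chr21"), out)
```

As in the C example, nothing is read until the last stage starts pulling, and each stage only ever holds the record it is currently passing along.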


>
> This approach is very memory efficient, because you do not have to
> read the sometimes rather large annotation files all at once. But,
> since they are sequential, you have to read the whole file every time
> you process it.

except if the gff file is sorted? or does it read the entire file anyway?

>
> The sequential approach makes it easy to combine tools on the shell
> level. Example:
>
> gth -gff3out -skipalignmentout ... | gt gff3 -sort - | gt filter
> -seqid chr21 -overlap 1200 2000 | gt sketch test.png -
>
> If you need random access or multiple queries to the same annotation
> set (as we had the need in the context of AnnotationSketch where we
> have multiple range queries), the FeatureIndex interface is probably
> the place to start. Our current implementation stores all the features
> in main memory and allows only simple queries, but more sophisticated
> indexing and query strategies could be implemented in another
> FeatureIndex implementation.
>
> So now we covered the GenomeNode interface (to represent annotations),
> the NodeStream interface (to process annotations sequentially), and
> the FeatureIndex interface (to process annotations with random
> access).

yes, makes sense, though much of the time my use case is in between,
where i want something like 500K basepairs of a single chromosome, or just
a single chromosome. and if i understand correctly, that's the use case for
a FilterStream, so at least it's filtering in c.
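
For comparison, the random-access side can be sketched as a small in-memory index: load once, then answer many range queries. This is pure python with hypothetical names (`SimpleFeatureIndex`, `add`, `query`) and invented features. It is not the gt FeatureIndex API, and the linear scan per sequence is only the simplest possible query strategy.

```python
from collections import defaultdict


class SimpleFeatureIndex:
    """Hypothetical in-memory index: features grouped by sequence id."""

    def __init__(self):
        self._by_seqid = defaultdict(list)

    def add(self, seqid, start, end, name):
        self._by_seqid[seqid].append((start, end, name))

    def query(self, seqid, start, end):
        """Return names of features overlapping [start, end] (inclusive)."""
        return [name for (s, e, name) in self._by_seqid.get(seqid, [])
                if s <= end and e >= start]


index = SimpleFeatureIndex()
index.add("chr21", 1000, 5000, "g1")
index.add("chr21", 600000, 610000, "g2")
index.add("chr22", 200, 900, "g3")

# first 500K basepairs of one chromosome
hits = index.query("chr21", 1, 500000)
```

The trade-off is the one described above: every feature stays in memory, but repeated queries do not re-read the file.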


>
> The missing NodeVisitor serves mainly software engineering purposes.
> It allows processing different implementations of the GenomeNode
> interface without excessive downcasting. I can't describe visitors
> better than it was done in the Design Patterns book
> (http://en.wikipedia.org/wiki/Design_Patterns). See also
> http://en.wikipedia.org/wiki/Visitor_pattern, but I find the
> explanation in the book better.

this makes a bit more sense now, though i'll have to let it digest for a while.
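
A minimal python sketch of the visitor idea may help it digest. The classes below are simplified stand-ins that borrow the names FeatureNode, CommentNode and NodeVisitor for readability; they are not the gt classes, and the method names (`accept`, `visit_feature_node`, `visit_comment_node`) and the `ShortExonCollector` are assumptions made for illustration.

```python
class FeatureNode:
    def __init__(self, ftype, start, end):
        self.ftype, self.start, self.end = ftype, start, end

    def accept(self, visitor):
        # each node type dispatches to "its" visitor method
        return visitor.visit_feature_node(self)


class CommentNode:
    def __init__(self, text):
        self.text = text

    def accept(self, visitor):
        return visitor.visit_comment_node(self)


class NodeVisitor:
    """Default visitor: ignores every node type."""

    def visit_feature_node(self, node):
        return None

    def visit_comment_node(self, node):
        return None


class ShortExonCollector(NodeVisitor):
    """Collects exons shorter than max_length; other nodes are ignored."""

    def __init__(self, max_length):
        self.max_length = max_length
        self.found = []

    def visit_feature_node(self, node):
        length = node.end - node.start + 1
        if node.ftype == "exon" and length < self.max_length:
            self.found.append(node)


nodes = [CommentNode("made-up input"),
         FeatureNode("exon", 20, 30),
         FeatureNode("exon", 100, 900),
         FeatureNode("gene", 1, 1000)]
collector = ShortExonCollector(max_length=50)
for node in nodes:
    node.accept(collector)
```

The loop never checks what kind of node it has. The node itself picks the right visitor method, which is the "no downcasting" point.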

>
> I hope this makes things clearer. If not, please ask!
>
> Gordon

it does. both sascha and gordon, thanks for taking the time to help me
figure this out.
i'm having fun tinkering with gt (and using it for some actual work) so far.

-b
