[gt-users] A series of questions regarding GenomeTools C and its Python interface

Gordon Gremme gremme at gmail.com
Mon Apr 27 17:23:38 CEST 2009


>> This may lead to the following question:
>> I am trying to collect some statistics from large genomewide
>> annotation GFF3 files (and some genome centers use Gene, exon and CDS
>> features, while others use mRNA, CDS and UTRs or combinations of
>> all). What would be the easiest way to add exon and intron children
>> to a FeatureNode consisting only of CDS and UTR children? (and the
>> reverse, adding UTRs to features with gene, exon and CDS)?
>> Is there a way to merge adjacent 5_prime_UTR and CDS into a third
>> child node "exon" that spans both? Could a visitor do that? And
>> should I first implement it in C, or could I do it from within the
>> Python interface?
>
> Yes, I suppose this could be done. There are multiple approaches to
> think of here:
>
> - First, by implementing a stream in C which takes input root (e.g.
> "gene") nodes from a GFF3InStream, and subsequently traverses and
> analyzes the child nodes, calculating the boundaries of the missing
> features and adding new nodes for the derived annotations on the fly.
> This stream could then be redirected into a GFF3OutStream to save the
> modified results into a file. For an example, see the
> add_introns_stream.[ch] and add_introns_visitor.[ch] in src/extended and
> the corresponding use of the add-introns stream in the
> src/tools/gt_gff3 tool.
>
> Maybe Gordon has more ideas about how to do this, as I guess he's
> munging annotation data on a more regular basis...

If you want a high-performance solution, IMHO the C approach Sascha
outlined is the way to go.
The add_introns_stream.[ch] and add_introns_visitor.[ch] files are good
starting points.
The add_introns_stream.[ch] is mostly boilerplate; all the interesting
stuff happens in the visitor.
In your case, you only have to implement the method to handle
FeatureNodes (as is done in the add_introns_visitor, too).

In contrast to the add_introns_visitor, I would use a
GtFeatureNodeIterator to iterate over the GtFeatureNodes (instead of
gt_genome_node_traverse_children()). This makes the code easier to
read and write.
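
To illustrate only the range arithmetic such a visitor would perform,
here is a minimal sketch in plain Python. It deliberately makes no
GenomeTools calls: the (type, start, end) tuples are a hypothetical
stand-in for the child feature nodes, derive_exons() and
derive_introns() are names made up for this example, and the UTR type
names are assumed to follow the Sequence Ontology spelling
(five_prime_UTR, three_prime_UTR). Coordinates are assumed to be
1-based and inclusive, as in GFF3.

```python
# Sketch only: plain tuples stand in for feature nodes; no GenomeTools API is used.
from typing import List, Tuple

Feature = Tuple[str, int, int]  # (type, start, end), 1-based inclusive

# Assumed type names for the parts that make up an exon.
EXON_PARTS = {"CDS", "five_prime_UTR", "three_prime_UTR"}


def derive_exons(children: List[Feature]) -> List[Feature]:
    """Merge overlapping or directly adjacent CDS/UTR ranges into exons."""
    parts = sorted((start, end) for ftype, start, end in children
                   if ftype in EXON_PARTS)
    merged: List[List[int]] = []
    for start, end in parts:
        # Adjacent means the next range starts right after the previous end.
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [("exon", start, end) for start, end in merged]


def derive_introns(exons: List[Feature]) -> List[Feature]:
    """Introns are the gaps between consecutive exons."""
    ordered = sorted(exons, key=lambda f: f[1])
    return [("intron", left[2] + 1, right[1] - 1)
            for left, right in zip(ordered, ordered[1:])
            if right[1] - left[2] > 1]


if __name__ == "__main__":
    children = [
        ("five_prime_UTR", 100, 149),
        ("CDS", 150, 300),
        ("CDS", 401, 500),
        ("three_prime_UTR", 501, 560),
    ]
    exons = derive_exons(children)
    print(exons)
    print(derive_introns(exons))
```

In a real visitor the same merge would run over the children of each
"mRNA" or "gene" node, and the derived ranges would be turned into new
child nodes carrying the parent's sequence ID and strand.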

Please don't hesitate to ask if you need additional guidance.

Gordon

