[gt-users] A series of questions regarding GenomeTools C and its Python interface

Sascha Steinbiss steinbiss at zbh.uni-hamburg.de
Mon Apr 27 15:56:09 CEST 2009


La Rota, Mauricio wrote:
> Hello Dr. Steinbiss

Dear Mr La Rota, (btw, the 'Dr.' is definitely a bit early as I am still
a Ph.D. student at the moment).

> I have been reading through the documentation and the code in the
> genometools website and not being a python user (as many biologists
> of the past, I also started with Perl) and having limited knowledge
> of C, I have not been able to understand the function of a Visitor
> (i.e: the GFF3Visitor class).
> Would you mind giving me a quick explanation of what a visitor does
> or has the potential to do?  As an alternative, where can I read more
> about the function of the classes

A visitor is a design pattern that packages the code acting on a data
structure into a separate object. It is used to separate data and logic in
object-oriented programs. For example, the GFF3Visitor 'visits' a
GenomeNode (or a subclass of GenomeNode, such as FeatureNode etc.) and
executes code contained in the visitor object on the currently visited
object. This is done by invoking the 'accept' method on the target
object given a Visitor instance. In the current implementation, the
GFF3Visitor is a C object which simply prints out the annotation data
represented by the visited node (and its child nodes) in GFF3 format.
The current GenomeTools Python bindings do not make use of the
GFF3Visitor; it is solely there for demonstration purposes. Most of the
visitors used in GenomeTools are written in C anyway.
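
To make the idea a bit more concrete, here is a minimal pure-Python
sketch of the pattern. It does not use the GenomeTools bindings at all:
the classes Node, GFF3LineVisitor and TypeCountVisitor are made up for
illustration, and the output is only a simplified GFF3-like line, not
what the real GFF3Visitor prints.

```python
class Node:
    """Hypothetical stand-in for an annotation node (not a GenomeTools class)."""

    def __init__(self, seqid, ftype, start, end, strand="+"):
        self.seqid = seqid
        self.ftype = ftype
        self.start = start
        self.end = end
        self.strand = strand
        self.children = []

    def add_child(self, child):
        self.children.append(child)

    def accept(self, visitor):
        # The node hands itself to the visitor, then passes the
        # visitor on to its children (depth-first).
        visitor.visit_node(self)
        for child in self.children:
            child.accept(visitor)


class GFF3LineVisitor:
    """Collects one simplified, tab-separated GFF3-like line per node."""

    def __init__(self):
        self.lines = []

    def visit_node(self, node):
        fields = [node.seqid, ".", node.ftype, str(node.start),
                  str(node.end), ".", node.strand, ".", "."]
        self.lines.append("\t".join(fields))


class TypeCountVisitor:
    """Counts how often each feature type occurs."""

    def __init__(self):
        self.counts = {}

    def visit_node(self, node):
        self.counts[node.ftype] = self.counts.get(node.ftype, 0) + 1


gene = Node("chr1", "gene", 100, 900)
gene.add_child(Node("chr1", "CDS", 150, 400))
gene.add_child(Node("chr1", "CDS", 600, 850))

printer = GFF3LineVisitor()
gene.accept(printer)      # same data structure ...
counter = TypeCountVisitor()
gene.accept(counter)      # ... different logic, no change to Node

print("\n".join(printer.lines))
print(counter.counts)
```

The point is that new behaviour (printing, counting, and so on) is added
by writing a new visitor class, while the node classes stay untouched.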

For more information on the subject please see the 'Gang of Four' book:
Gamma, Helm, Johnson, and Vlissides (1995). Design Patterns: Elements of
Reusable Object-Oriented Software. Addison-Wesley. ISBN 0-201-63361-2.

> This may lead to the following question:
> I am trying to collect some statistics from large genomewide
> annotation GFF3 files (and some genome centers use Gene, exon and CDS
> features, while others use mRNA, CDS and UTRs or combinations of
> all). What would be the easiest way to add exon and intron children
> to a FeatureNode consisting only of CDS and UTR children? (and the
> reverse, adding UTRs to features with gene, exon and CDS) ?.
> Is there a way to merge adjacent 5_prime_UTR and CDS into a third
> child node "exon" that spans both? Could a visitor do that (And
> should I first implement it in C or could I do it from within the
> Python interface).

Yes, I suppose this could be done. There are at least two approaches to
consider here:

- First, by implementing a stream in C which takes input root (e.g.
"gene") nodes from a GFF3InStream, and subsequently traverses and
analyzes the child nodes, calculating the boundaries of the missing
features and adding new nodes for the derived annotations on the fly.
This stream could then be redirected into a GFF3OutStream to save the
modified results into a file. For an example, see the
add_introns_stream.[ch] and add_introns_visitor.[ch] in src/extended and
the corresponding use of the add-introns stream in the
src/tools/gt_gff3 tool.

- In Python, it is also possible to traverse the root nodes taken from a
GFF3InStream using a FeatureNodeIterator, analyzing the child nodes this
way, and then using FeatureNode.create_new() and FeatureNode.add_child()
to attach new nodes for the derived features to the corresponding parent
nodes. This would be similar to your second idea.
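
Whichever approach you pick, the core of the task is interval
arithmetic on the child features. As a rough sketch, independent of
GenomeTools, the following plain Python merges CDS and UTR ranges that
overlap or directly abut into exons and derives introns as the gaps
between them. The function names merge_into_exons and introns_between
and the (type, start, end) tuples are assumptions made for this example;
coordinates are taken to be 1-based and inclusive, as in GFF3.

```python
def merge_into_exons(parts):
    """Merge (start, end) ranges that overlap or directly abut."""
    exons = []
    for start, end in sorted(parts):
        if exons and start <= exons[-1][1] + 1:
            # Touches or overlaps the previous range: extend it.
            exons[-1] = (exons[-1][0], max(exons[-1][1], end))
        else:
            exons.append((start, end))
    return exons


def introns_between(exons):
    """Return the gaps between consecutive (sorted) exons."""
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(exons, exons[1:])]


# Children of one hypothetical transcript: (type, start, end)
children = [
    ("five_prime_UTR", 100, 149),
    ("CDS", 150, 400),
    ("CDS", 600, 850),
    ("three_prime_UTR", 851, 900),
]

# Keep only CDS and UTR parts, whatever the exact UTR type name is.
parts = [(start, end) for ftype, start, end in children
         if ftype == "CDS" or ftype.endswith("UTR")]

exons = merge_into_exons(parts)
introns = introns_between(exons)

print(exons)    # [(100, 400), (600, 900)]
print(introns)  # [(401, 599)]
```

The resulting ranges would then be turned into new child nodes and
attached to the parent with the FeatureNode methods mentioned above.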

Maybe Gordon has more ideas about how to do this, as I guess he's
munging annotation data on a more regular basis...

> It may be that there are some C class methods within GenomeTools that
> could do that but I haven't found them yet and perhaps the python
> interface does not access them yet.

The Python bindings to the whole of GenomeTools are by no means
complete. Most of the existing bindings cover the AnnotationSketch
visualisation library and its dependencies. However, we are always glad
to accept contributions ;)

> I am about to give up and go back to
> the painfully slow BioPerl :) .

I think the best solution would be to write a new tool for this task
using the C interface, which I guess will outperform any Python-based
variant.

> Thanks,
> -Mauricio.

No problem,
Sascha

> PS:  Sorry, I see that there is a mailing list.  Maybe I will
> subscribe to it and repost if you see it necessary.

I will cross-post this reply there if you don't mind.

-- 
Sascha Steinbiss
Center for Bioinformatics
University of Hamburg
Bundesstr. 43
20146 Hamburg
Germany

Email:  steinbiss at zbh.uni-hamburg.de
URL:    http://www.zbh.uni-hamburg.de/steinbiss
Phone:  +49 (40) 42838 7322
FAX:    +49 (40) 42838 7312


