[gt-users] A series of questions regarding GenomeTools C and its Python interface

Gordon Gremme gremme at gmail.com
Tue Apr 28 16:05:22 CEST 2009


Dear Mauricio,

>> This may lead to the following question?
>> I am trying to collect some statistics from large genomewide
>> annotation GFF3 files (and some genome centers use Gene, exon and CDS
>> features, while others use mRNA, CDS and UTRs or combinations of
>> all). What would be the easiest way to add exon and intron children
>> to a FeatureNode consisting only of CDS and UTR children? (and the
>> reverse, adding UTRs to features with gene, exon and CDS) ?.
>> Is there a way to merge adjacent 5_prime_UTR and CDS into a third
>> child node "exon" that spans both? Could a visitor do that (And
>> should I first implement it in C or could I do it from within the
>> Python interface).

I thought some more about your problem and came to the conclusion that
it would be nice to have some generic NodeStreams in GenomeTools which
would make problems like this easy to solve.

I am thinking of the following three NodeStreams:
AddIntermediaryStream, DuplicateFeatureStream and MergeFeatureStream.

An AddIntermediaryStream(outside_type, intermediary_type) would add
new features of type intermediary_type between adjacent features of
type outside_type. This is a generalisation of the AddIntronsStream
(which would be equivalent to AddIntermediaryStream(exon, intron)).
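
To make the intended behaviour concrete, here is a minimal pure-Python
sketch. It does not use the GenomeTools API: the function name
add_intermediary and the (start, end, type) tuple representation are
assumptions made only for illustration, with 1-based inclusive
coordinates as in GFF3.

```python
from typing import List, Tuple

# Hypothetical stand-in for a feature: (start, end, type),
# 1-based inclusive coordinates as in GFF3. Not a GenomeTools type.
Feature = Tuple[int, int, str]


def add_intermediary(children: List[Feature], outside_type: str,
                     intermediary_type: str) -> List[Feature]:
    """Add a feature of intermediary_type into every gap between
    neighbouring features of outside_type."""
    outside = sorted(f for f in children if f[2] == outside_type)
    added = []
    for left, right in zip(outside, outside[1:]):
        # Only fill a real gap; directly adjacent features get nothing.
        if right[0] > left[1] + 1:
            added.append((left[1] + 1, right[0] - 1, intermediary_type))
    return sorted(children + added)


children = [(100, 200, "exon"), (301, 400, "exon"), (401, 500, "exon")]
result = add_intermediary(children, "exon", "intron")
```

With this input, a single intron (201-300) is added; the last two exons
touch each other, so nothing is inserted between them.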

A DuplicateFeatureStream(destination_type, source_type) would simply
duplicate features of type source_type as features of type
destination_type.
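
A minimal sketch of that idea, again in plain Python rather than the
GenomeTools API (the function name duplicate_features and the tuple
representation are hypothetical; the argument order follows the
proposal above):

```python
from typing import List, Tuple

# Hypothetical stand-in for a feature: (start, end, type).
Feature = Tuple[int, int, str]


def duplicate_features(children: List[Feature], destination_type: str,
                       source_type: str) -> List[Feature]:
    """Keep all features and append a copy of every source_type
    feature, relabelled as destination_type."""
    copies = [(start, end, destination_type)
              for (start, end, ftype) in children if ftype == source_type]
    return children + copies


children = [(100, 150, "UTR"), (151, 200, "CDS")]
result = duplicate_features(children, "exon", "CDS")
```

The original CDS stays in place and an exon with the same coordinates
is added next to it.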

A MergeFeatureStream(merge_type) would merge directly adjacent
features of type merge_type into a single one.
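
As a rough sketch (plain Python, not GenomeTools code), and assuming
that "directly adjacent" means the next feature starts exactly one
base after the previous one ends, the merge could look like this. The
name merge_features and the tuple representation are hypothetical.

```python
from typing import List, Tuple

# Hypothetical stand-in for a feature: (start, end, type),
# 1-based inclusive coordinates as in GFF3.
Feature = Tuple[int, int, str]


def merge_features(children: List[Feature],
                   merge_type: str) -> List[Feature]:
    """Merge directly adjacent features of merge_type into one."""
    others = [f for f in children if f[2] != merge_type]
    merged: List[Feature] = []
    for start, end, ftype in sorted(f for f in children
                                    if f[2] == merge_type):
        # Assumption: adjacent means start == previous end + 1.
        if merged and start == merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], end, ftype)
        else:
            merged.append((start, end, ftype))
    return sorted(others + merged)


children = [(100, 150, "exon"), (151, 200, "exon"), (301, 400, "exon")]
result = merge_features(children, "exon")
```

The first two exons touch and collapse into 100-200; the third one is
separated by a gap and is left alone.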

With these three streams, most of your problems could be solved easily
by combining them (either at the binding level or by combining the
corresponding tools via pipes).

Example (I combine symbolically via the pipe symbol here):

- "What would be the easiest way to add exon and intron children to a
FeatureNode consisting only of CDS and UTR children" could be solved
like this:

DuplicateFeatureStream(exon, CDS) | DuplicateFeatureStream(exon, UTR)
| MergeFeatureStream(exon) | AddIntermediaryStream(exon, intron)
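
The same chain can be simulated in a few lines of plain Python. This is
only a sketch of the proposed semantics, not GenomeTools code: the
functions duplicate, merge and add_intermediary are hypothetical
stand-ins for the three streams, features are (start, end, type)
tuples, and the example coordinates are made up.

```python
# Hypothetical pure-Python stand-ins for the three proposed streams;
# none of these functions exist in GenomeTools.
def duplicate(children, destination_type, source_type):
    return children + [(s, e, destination_type)
                       for (s, e, t) in children if t == source_type]


def merge(children, merge_type):
    merged = []
    for start, end, ftype in sorted(f for f in children
                                    if f[2] == merge_type):
        # Assumption: adjacent means start == previous end + 1.
        if merged and start == merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], end, ftype)
        else:
            merged.append((start, end, ftype))
    return sorted([f for f in children if f[2] != merge_type] + merged)


def add_intermediary(children, outside_type, intermediary_type):
    outside = sorted(f for f in children if f[2] == outside_type)
    gaps = [(a[1] + 1, b[0] - 1, intermediary_type)
            for a, b in zip(outside, outside[1:]) if b[0] > a[1] + 1]
    return sorted(children + gaps)


# Made-up children of one transcript, annotated with CDS and UTR only.
children = [(100, 150, "UTR"), (151, 200, "CDS"),
            (301, 400, "CDS"), (401, 450, "UTR")]

# Same order as the symbolic pipe above.
step = duplicate(children, "exon", "CDS")
step = duplicate(step, "exon", "UTR")
step = merge(step, "exon")
step = add_intermediary(step, "exon", "intron")

exons = [f for f in step if f[2] == "exon"]
introns = [f for f in step if f[2] == "intron"]
```

For this input the chain yields two exons (100-200 and 301-450) and one
intron (201-300), while the original CDS and UTR features are kept.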


- "Is there a way to merge adjacent 5_prime_UTR and CDS into a third
child node "exon" that spans both"

could be solved similarly.

The reverse ("adding UTRs to features with gene, exon and CDS") would
require an additional (more complicated) stream, but wouldn't it be
enough to map the different GFF3 usages just in one direction?

I would code this up on the C side, but help with the Python bindings
and with providing test data would be much appreciated.

What's your opinion on this?

Gordon


P.S.: Just in case you haven't noticed, there is already a StatStream
which computes some statistics, like the distributions of exon and
intron lengths. Maybe this would come in handy for your problem.

