[gt-users] dupfeat, interfeat and mergefeat tool (was: Re: A series of questions regarding GenomeTools C and its Python interface)
Gordon Gremme
gremme at gmail.com
Thu Apr 30 21:59:25 CEST 2009
Hi,
I just added three new tools to GenomeTools (dupfeat, interfeat and
mergefeat) which implement the functionality discussed below. The
corresponding streams on the C side are called GtDupFeatureStream,
GtInterFeatureStream and GtMergeFeatureStream. I have tested them a
little, but I will add more tests soon and try them out on the example
data provided by Mauricio.
I haven't had time to work on the Python bindings yet. Any volunteers?
Shouldn't be too hard.
Gordon
>>> This may lead to the following question:
>>> I am trying to collect some statistics from large genomewide
>>> annotation GFF3 files (and some genome centers use Gene, exon and CDS
>>> features, while others use mRNA, CDS and UTRs or combinations of
>>> all). What would be the easiest way to add exon and intron children
>>> to a FeatureNode consisting only of CDS and UTR children? (and the
>>> reverse, adding UTRs to features with gene, exon and CDS)?
>>> Is there a way to merge adjacent 5_prime_UTR and CDS into a third
>>> child node "exon" that spans both? Could a visitor do that? (And
>>> should I first implement it in C, or could I do it from within the
>>> Python interface?)
>
> I thought some more about your problem and came to the conclusion that
> it would be nice to have some generic NodeStreams in GenomeTools that
> would make problems like this easy to solve.
>
> I am thinking of the following three NodeStreams:
> AddIntermediaryStream, DuplicateFeatureStream and MergeFeatureStream.
>
> An AddIntermediaryStream(outside_type, intermediary_type) would add
> new features of type intermediary_type between adjacent features of
> type outside_type. This is a generalisation of the AddIntronsStream,
> which would be equivalent to AddIntermediaryStream(exon, intron).
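
To illustrate the idea behind the proposed AddIntermediaryStream, here is a minimal Python sketch. It is not the GenomeTools API: the Feature tuple and the add_intermediary function are hypothetical names made up for this example, and it assumes 1-based inclusive coordinates as in GFF3 and non-overlapping features of the outside type.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    """Hypothetical stand-in for a feature node (not a GenomeTools class)."""
    type: str
    start: int  # 1-based, inclusive (GFF3 convention)
    end: int    # 1-based, inclusive


def add_intermediary(features: List[Feature], outside_type: str,
                     intermediary_type: str) -> List[Feature]:
    """Add a feature of intermediary_type in every gap between two
    neighbouring features of outside_type."""
    outside = sorted((f for f in features if f.type == outside_type),
                     key=lambda f: (f.start, f.end))
    added = []
    for left, right in zip(outside, outside[1:]):
        # Only create a feature if there is at least one base between them.
        if right.start > left.end + 1:
            added.append(Feature(intermediary_type,
                                 left.end + 1, right.start - 1))
    return sorted(list(features) + added, key=lambda f: (f.start, f.end))


exons = [Feature("exon", 1, 100), Feature("exon", 201, 300)]
with_introns = add_intermediary(exons, "exon", "intron")
for feature in with_introns:
    print(feature.type, feature.start, feature.end)
```

With outside_type "exon" and intermediary_type "intron", this behaves like the AddIntronsStream special case mentioned above.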
>
> A DuplicateFeatureStream(destination_type, source_type) would simply
> duplicate features of type source_type as features of type
> destination_type.
>
> A MergeFeatureStream(merge_type) would merge directly adjacent
> features of type merge_type into a single one.
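
A rough Python sketch of the merging step might look like the following. Again, this is not GenomeTools code: Feature and merge_adjacent are hypothetical names, and "directly adjacent" is assumed to mean that one feature starts exactly one position after the previous one ends (1-based inclusive coordinates).

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    """Hypothetical stand-in for a feature node (not a GenomeTools class)."""
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def merge_adjacent(features: List[Feature], merge_type: str) -> List[Feature]:
    """Merge directly adjacent features of merge_type into single features.
    Features of other types are passed through unchanged."""
    to_merge = sorted((f for f in features if f.type == merge_type),
                      key=lambda f: (f.start, f.end))
    others = [f for f in features if f.type != merge_type]
    merged: List[Feature] = []
    for f in to_merge:
        # Assumption: adjacent means f starts right after the previous end.
        if merged and f.start == merged[-1].end + 1:
            merged[-1] = Feature(merge_type, merged[-1].start, f.end)
        else:
            merged.append(f)
    return sorted(others + merged, key=lambda f: (f.start, f.end))


parts = [Feature("exon", 1, 50), Feature("exon", 51, 100),
         Feature("exon", 201, 300)]
merged_parts = merge_adjacent(parts, "exon")
for feature in merged_parts:
    print(feature.type, feature.start, feature.end)
```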
>
> With these three streams, most of your problems could be solved
> easily by combining them (either at the binding level or by
> combining the corresponding tools via pipes).
>
> Example (I combine symbolically via the pipe symbol here):
>
> - "What would be the easiest way to add exon and intron children to a
> FeatureNode consisting only of CDS and UTR children" could be solved
> like this:
>
> DuplicateFeatureStream(exon, CDS) | DuplicateFeatureStream(exon, UTR)
> | MergeFeatureStream(exon) | AddIntermediaryStream(exon, intron)
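
The symbolic pipeline above can be mimicked in a few lines of plain Python. This is only a sketch under stated assumptions, not the GenomeTools bindings: Feature, duplicate, merge_adjacent and add_intermediary are hypothetical names, coordinates are 1-based and inclusive, the generic type "UTR" is used as in the pipeline above, and the example coordinates are invented.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    """Hypothetical stand-in for a feature node (not a GenomeTools class)."""
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def duplicate(features: List[Feature], dest_type: str,
              source_type: str) -> List[Feature]:
    """Copy every feature of source_type as a feature of dest_type."""
    copies = [Feature(dest_type, f.start, f.end)
              for f in features if f.type == source_type]
    return list(features) + copies


def merge_adjacent(features: List[Feature], merge_type: str) -> List[Feature]:
    """Merge features of merge_type where one starts right after another ends."""
    to_merge = sorted((f for f in features if f.type == merge_type),
                      key=lambda f: (f.start, f.end))
    merged: List[Feature] = []
    for f in to_merge:
        if merged and f.start == merged[-1].end + 1:
            merged[-1] = Feature(merge_type, merged[-1].start, f.end)
        else:
            merged.append(f)
    return [f for f in features if f.type != merge_type] + merged


def add_intermediary(features: List[Feature], outside_type: str,
                     intermediary_type: str) -> List[Feature]:
    """Fill each gap between neighbouring outside_type features."""
    outside = sorted((f for f in features if f.type == outside_type),
                     key=lambda f: (f.start, f.end))
    added = [Feature(intermediary_type, a.end + 1, b.start - 1)
             for a, b in zip(outside, outside[1:]) if b.start > a.end + 1]
    return list(features) + added


# Invented children of a single transcript that has only CDS and UTR parts.
children = [Feature("UTR", 1, 20), Feature("CDS", 21, 100),
            Feature("CDS", 201, 300), Feature("UTR", 301, 350)]

step = duplicate(children, "exon", "CDS")
step = duplicate(step, "exon", "UTR")
step = merge_adjacent(step, "exon")
result = add_intermediary(step, "exon", "intron")

exons = [(f.start, f.end) for f in result if f.type == "exon"]
introns = [(f.start, f.end) for f in result if f.type == "intron"]
print("exons:", exons)
print("introns:", introns)
```

Each function takes and returns a plain list, so the steps compose in the same order as the piped streams: the UTR and CDS copies that touch each other collapse into one exon, and the remaining gap becomes the intron.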
>
>
> - "Is there a way to merge adjacent 5_prime_UTR and CDS into a third
> child node "exon" that spans both?"
>
> could be solved similarly.
>
> The reverse ("adding UTRs to features with gene, exon and CDS") would
> require an additional (more complicated) stream, but wouldn't it be
> enough to map the different GFF3 usages just in one direction?
>
> I would code this up on the C side, but help with the Python bindings
> and with providing test data would be much appreciated.
>
> What's your opinion on this?
>
> Gordon
>
>
> P.S.: Just in case you haven't noticed, there is already a StatStream
> which computes some statistics, like the distribution of exon and
> intron length. Maybe this would come in handy for your problem.
>