[gt-users] dupfeat, interfeat and mergefeat tool (was: Re: A series of questions regarding GenomeTools C and its Python interface)

Brent Pedersen bpederse at gmail.com
Fri May 1 20:03:36 CEST 2009


On Thu, Apr 30, 2009 at 12:59 PM, Gordon Gremme <gremme at gmail.com> wrote:
> Hi,
>
> I just added three new tools to GenomeTools (dupfeat, interfeat and
> mergefeat) which implement the functionality discussed below. The
> corresponding streams on the C side are called GtDupFeatureStream,
> GtInterFeatureStream and GtMergeFeatureStream. I have tested them a
> little, but I will add some more tests soon and try them out on the
> example data provided by Mauricio.
>
> I haven't had time to work on the Python bindings yet. Any volunteers?
> Shouldn't be too hard.
>

Cool, this is useful!
I can add these. Should I add a new Python file for each, or put them
into stream_ops.py or similar?



> Gordon
>
>
>>>> This may lead to the following question:
>>>> I am trying to collect some statistics from large genome-wide
>>>> annotation GFF3 files (some genome centers use gene, exon and CDS
>>>> features, while others use mRNA, CDS and UTRs, or combinations of
>>>> all of these). What would be the easiest way to add exon and intron
>>>> children to a FeatureNode consisting only of CDS and UTR children
>>>> (and the reverse, adding UTRs to features with gene, exon and CDS)?
>>>> Is there a way to merge adjacent 5_prime_UTR and CDS into a third
>>>> child node "exon" that spans both? Could a visitor do that? And
>>>> should I first implement it in C, or could I do it from within the
>>>> Python interface?
>>
>> I thought some more about your problem and came to the conclusion that
>> it would be nice to have some generic NodeStreams in GenomeTools which
>> would make it easy to solve problems like this.
>>
>> I am thinking of the following three NodeStreams:
>> AddIntermediaryStream, DuplicateFeatureStream and MergeFeatureStream.
>>
>> An AddIntermediaryStream(outside_type, intermediary_type) would add
>> new features of type intermediary_type between adjacent features of
>> type outside_type. This is a generalisation of the AddIntronsStream
>> (which would be equivalent to AddIntermediaryStream(exon, intron)).
>>
>> A DuplicateFeatureStream(destination_type, source_type) would simply
>> duplicate features of type source_type as features of type
>> destination_type.
>>
>> A MergeFeatureStream(merge_type) would merge directly adjacent
>> features of type merge_type into a single one.
>>
>> With these three streams, most of your problems could be solved easily
>> by combining them (either at the binding level or by
>> combining the corresponding tools via pipes).
>>
>> Example (I combine symbolically via the pipe symbol here):
>>
>> - "What would be the easiest way to add exon and intron children to a
>> FeatureNode consisting only of CDS and UTR children" could be solved
>> like this:
>>
>> DuplicateFeatureStream(exon, CDS) | DuplicateFeatureStream(exon, UTR)
>> | MergeFeatureStream(exon) | AddIntermediaryStream(exon, intron)
>>
>>
>> - "Is there a way to merge adjacent 5_prime_UTR and CDS into a third
>> child node "exon" that spans both"
>>
>> could be solved similarly.
>>
>> The reverse ("adding UTRs to features with gene, exon and CDS") would
>> require an additional (more complicated) stream, but wouldn't it be
>> enough to map the different GFF3 usages just in one direction?
>>
>> I would code this up on the C side, but help with the Python bindings
>> and with providing test data would be much appreciated.
>>
>> What's your opinion on this?
>>
>> Gordon
>>
>>
>> P.S.: Just in case you haven't noticed, there is already a StatStream
>> which computes some statistics, like the distribution of exon and
>> intron length. Maybe this would come in handy for your problem.
>>
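To illustrate the kind of statistic mentioned in the P.S., here is a minimal
sketch of a length distribution per feature type. It is not the StatStream
implementation or its output format. The function name length_distribution
and the (type, start, end) tuples are assumptions made for this example, and
lengths are computed as end - start + 1 because GFF3 coordinates are 1-based
and inclusive.

```python
from collections import Counter, defaultdict


def length_distribution(features, types=("exon", "intron")):
    """Count how often each feature length occurs, per feature type.

    `features` is an iterable of hypothetical (type, start, end) tuples.
    """
    dist = defaultdict(Counter)
    for ftype, start, end in features:
        if ftype in types:
            dist[ftype][end - start + 1] += 1
    return dist


# Made-up features for illustration.
features = [
    ("exon", 100, 300),
    ("intron", 301, 400),
    ("exon", 401, 500),
    ("intron", 501, 600),
    ("exon", 601, 700),
]

dist = length_distribution(features)
for ftype in sorted(dist):
    for length, count in sorted(dist[ftype].items()):
        print(ftype, length, count)
```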