[gt-users] A series of questions regarding GenomeTools C and its Python interface
La Rota, Mauricio
MAURICIO.LAROTA at PIONEER.COM
Tue Apr 28 23:23:18 CEST 2009
Hello Gordon,
Having it done on the C side as streams would be nice. In the meantime I was able to solve the "add exons and introns" case for "FeatureNodes consisting only of CDS and UTR children" on the Python side by traversing the children twice with FeatureNodeIteratorDirect: once to fill in the missing children and a second time to collect the counting stats. Now that it works, I could probably do both in a single traversal. That might also fix an error message I am getting, which I think is related to traversing the same tree twice (please also read my response to Sascha in the related thread "bug report on release 1.3.1?").
I have not yet worked out adding UTRs for the other annotation style, but I might get it to work in a similar fashion; I just need to make sure that I test for the presence of introns inside the UTRs.
I am providing the test data in the attachments. These are excerpts from public JGI files for several genomes (in this case soybean) and their different release versions; in particular, the annotation style changes from one release to the next. But I am not confident that I can help with a robust Python binding set, as I am still learning my way around Python (I started this month!).
On a side note, in test file 1 (the one with CDS/UTRs but no exons/introns) a few of the gene models have wrongly labeled UTRs. What should be 3'UTRs have unfortunately been predicted as 5'UTRs, to the point that some gene models have 5'UTRs at both ends :)
This would be another use for the implementation you discuss here: one could repair the offending UTR labels by simply reconstructing them from the stream that already provides exons, introns and CDSs.
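A minimal sketch of that repair idea, in plain Python rather than through the GenomeTools bindings. It assumes a single coding transcript whose exons and CDS segments are given as 1-based inclusive intervals; `derive_utrs` and the `Interval` alias are hypothetical names used only for illustration. UTR labels are derived from position relative to the CDS and from the strand, so a model can never end up with 5'UTRs at both ends, and introns inside a UTR simply produce several UTR pieces.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive as in GFF3


def derive_utrs(exons: List[Interval], cds: List[Interval],
                strand: str) -> Tuple[List[Interval], List[Interval]]:
    """Return (five_prime_utrs, three_prime_utrs) for one transcript.

    Every exon part that lies outside the overall CDS span is a UTR piece.
    Which side is 5' and which is 3' depends only on the strand.
    """
    cds_start = min(start for start, _ in cds)
    cds_end = max(end for _, end in cds)
    low_side, high_side = [], []  # pieces before / after the CDS in coordinates
    for start, end in sorted(exons):
        if start < cds_start:
            low_side.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            high_side.append((max(start, cds_end + 1), end))
    if strand == "+":
        return low_side, high_side
    # On the minus strand the 5' end is at the higher coordinates.
    return high_side, low_side


# Made-up coordinates: the first exon is entirely untranslated, so the
# 5'UTR on the plus strand is split by an intron.
exons = [(100, 200), (301, 400), (501, 600)]
cds = [(351, 400), (501, 550)]
five_plus, three_plus = derive_utrs(exons, cds, "+")
five_minus, three_minus = derive_utrs(exons, cds, "-")
```

With the same coordinates, switching the strand only swaps the two labels, which is the relabeling that the wrongly predicted gene models would need.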
-Mauricio
-----Original Message-----
From: Gordon Gremme [mailto:gremme at gmail.com]
Sent: Tuesday, April 28, 2009 9:05 AM
To: GenomeTools Users
Cc: La Rota, Mauricio
Subject: Re: [gt-users] A series of questions regarding GenomeTools C and its Python interface
Dear Mauricio,
>> This may lead to the following question?
>> I am trying to collect some statistics from large genomewide
>> annotation GFF3 files (and some genome centers use Gene, exon and CDS
>> features, while others use mRNA, CDS and UTRs or combinations of
>> all). What would be the easiest way to add exon and intron children
>> to a FeatureNode consisting only of CDS and UTR children? (and the
>> reverse, adding UTRs to features with gene, exon and CDS) ?.
>> Is there a way to merge adjacent 5_prime_UTR and CDS into a third
>> child node "exon" that spans both? Could a visitor do that (And
>> should I first implement it in C or could I do it from within the
>> Python interface).
I thought some more about your problem and came to the conclusion that
it would be nice to have some generic NodeStreams in GenomeTools which
would make it easy to solve problems like this.
I am thinking of the following three NodeStreams:
AddIntermediaryStream, DuplicateFeatureStream and MergeFeatureStream.
An AddIntermediaryStream(outside_type, intermediary_type) would add
new features of type intermediary_type between adjacent features of
type outside_type, which is a generalisation of the AddIntronsStream
(which would equal AddIntermediaryStream(exon, intron)).
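As an illustration of the intended behaviour only (the stream does not exist yet, and this is not GenomeTools code), a plain-Python sketch could look as follows. `Feature` and `add_intermediary` are hypothetical names, and the sketch assumes the features are the children of one parent, with 1-based inclusive coordinates.

```python
from typing import Iterable, List, NamedTuple


class Feature(NamedTuple):
    # Hypothetical stand-in for a GFF3 feature, not the GenomeTools FeatureNode.
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def add_intermediary(features: Iterable[Feature], outside_type: str,
                     intermediary_type: str) -> List[Feature]:
    """Add an intermediary_type feature in each gap between consecutive
    features of outside_type; all other features are passed through."""
    feats = sorted(features, key=lambda f: (f.start, f.end))
    outside = [f for f in feats if f.type == outside_type]
    added = []
    for left, right in zip(outside, outside[1:]):
        if right.start > left.end + 1:  # only a real gap gets a new feature
            added.append(Feature(intermediary_type, left.end + 1, right.start - 1))
    return sorted(feats + added, key=lambda f: (f.start, f.end))


exons = [Feature("exon", 100, 200), Feature("exon", 301, 400)]
with_introns = add_intermediary(exons, "exon", "intron")
```

Called with ("exon", "intron") this reproduces what an intron-adding step does; other type pairs work the same way.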
A DuplicateFeatureStream(destination_type, source_type) would simply
duplicate features of type source_type as features of type
destination_type.
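A sketch of the duplication step in plain Python, under the same caveat: `Feature` and `duplicate_features` are hypothetical names for illustration, not part of GenomeTools.

```python
from typing import Iterable, List, NamedTuple


class Feature(NamedTuple):
    # Hypothetical stand-in for a GFF3 feature, not the GenomeTools FeatureNode.
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def duplicate_features(features: Iterable[Feature], destination_type: str,
                       source_type: str) -> List[Feature]:
    """Keep every feature and emit a copy of each source_type feature
    relabeled as destination_type, with the same coordinates."""
    out = []
    for feature in features:
        out.append(feature)
        if feature.type == source_type:
            out.append(feature._replace(type=destination_type))
    return out


children = [Feature("five_prime_UTR", 100, 150), Feature("CDS", 151, 200)]
with_exon_copy = duplicate_features(children, "exon", "CDS")
```

The original CDS is kept, so later steps can still tell coding from non-coding parts.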
A MergeFeatureStream(merge_type) would merge directly adjacent
features of type merge_type into a single one.
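The merge step could be sketched like this in plain Python. `Feature` and `merge_features` are hypothetical names; the sketch assumes that "directly adjacent" means the next feature starts at most one base after the previous one ends, so overlapping features are merged as well.

```python
from typing import Iterable, List, NamedTuple


class Feature(NamedTuple):
    # Hypothetical stand-in for a GFF3 feature, not the GenomeTools FeatureNode.
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def merge_features(features: Iterable[Feature], merge_type: str) -> List[Feature]:
    """Merge adjacent or overlapping features of merge_type into one;
    features of other types are passed through unchanged."""
    merged, others = [], []
    for feature in sorted(features, key=lambda f: (f.start, f.end)):
        if feature.type != merge_type:
            others.append(feature)
        elif merged and feature.start <= merged[-1].end + 1:
            last = merged[-1]
            merged[-1] = last._replace(end=max(last.end, feature.end))
        else:
            merged.append(feature)
    return sorted(others + merged, key=lambda f: (f.start, f.end))


children = [
    Feature("exon", 100, 150),   # copy of a 5' UTR
    Feature("exon", 151, 200),   # copy of the adjacent CDS
    Feature("CDS", 151, 200),
    Feature("exon", 301, 400),
]
merged_children = merge_features(children, "exon")
```

The two exon copies that touch at positions 150/151 become one exon, while the exon after the gap stays separate.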
With these three streams, most of your problems could be solved easily
by combining them (either at the binding level or by combining the
corresponding tools via pipes).
Example (I combine symbolically via the pipe symbol here):
- "What would be the easiest way to add exon and intron children to a
FeatureNode consisting only of CDS and UTR children" could be solved
like this:
DuplicateFeatureStream(exon, CDS) | DuplicateFeatureStream(exon, UTR)
| MergeFeatureStream(exon) | AddIntermediaryStream(exon, intron)
- "Is there a way to merge adjacent 5_prime_UTR and CDS into a third
child node "exon" that spans both?"
could be solved similarly.
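To make the symbolic pipe notation concrete, here is a self-contained plain-Python sketch that chains the three operations on the children of one mRNA. Everything in it is an assumption for illustration: features are bare (type, start, end) tuples with 1-based inclusive coordinates, the helper names are hypothetical, and "UTR" is spelled out as the two GFF3 types five_prime_UTR and three_prime_UTR. None of this is GenomeTools API.

```python
from functools import partial, reduce
from typing import List, Tuple

Feature = Tuple[str, int, int]  # (type, start, end), hypothetical stand-in


def by_position(feats: List[Feature]) -> List[Feature]:
    return sorted(feats, key=lambda f: (f[1], f[2]))


def duplicate(feats: List[Feature], dest: str, src: str) -> List[Feature]:
    # Copy every src feature as a dest feature with the same coordinates.
    return by_position(feats + [(dest, s, e) for t, s, e in feats if t == src])


def merge(feats: List[Feature], merge_type: str) -> List[Feature]:
    # Fuse adjacent or overlapping features of merge_type.
    others, merged = [], []
    for t, s, e in by_position(feats):
        if t != merge_type:
            others.append((t, s, e))
        elif merged and s <= merged[-1][2] + 1:
            merged[-1] = (t, merged[-1][1], max(merged[-1][2], e))
        else:
            merged.append((t, s, e))
    return by_position(others + merged)


def add_intermediary(feats: List[Feature], outside: str,
                     intermediary: str) -> List[Feature]:
    # Fill each gap between consecutive outside features.
    outer = [f for f in by_position(feats) if f[0] == outside]
    gaps = [(intermediary, a[2] + 1, b[1] - 1)
            for a, b in zip(outer, outer[1:]) if b[1] > a[2] + 1]
    return by_position(feats + gaps)


def pipe(*stages):
    # Left-to-right composition, mirroring the "|" in the example above.
    return lambda feats: reduce(lambda acc, stage: stage(acc), stages, feats)


add_exons_and_introns = pipe(
    partial(duplicate, dest="exon", src="CDS"),
    partial(duplicate, dest="exon", src="five_prime_UTR"),
    partial(duplicate, dest="exon", src="three_prime_UTR"),
    partial(merge, merge_type="exon"),
    partial(add_intermediary, outside="exon", intermediary="intron"),
)

mrna_children = [
    ("five_prime_UTR", 100, 150),
    ("CDS", 151, 200),
    ("CDS", 301, 380),
    ("three_prime_UTR", 381, 400),
]
result = add_exons_and_introns(mrna_children)
exons = [f for f in result if f[0] == "exon"]
introns = [f for f in result if f[0] == "intron"]
```

The order of the stages matters: the merge has to run after all duplications, and the intron step has to run after the merge, otherwise the touching UTR and CDS copies would be treated as separate exons.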
The reverse ("adding UTRs to features with gene, exon and CDS") would
require an additional (more complicated) stream, but wouldn't it be
enough to map the different GFF3 usages just in one direction?
I would code this up on the C side, but help with the Python bindings
and test data would be much appreciated.
What's your opinion on this?
Gordon
P.S.: Just in case you haven't noticed, there is already a StatStream
which computes some statistics, like the distribution of exon and
intron length. Maybe this would come in handy for your problem.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test1_no_exon.gff3.bz2
Type: application/octet-stream
Size: 10106 bytes
Desc: test1_no_exon.gff3.bz2
Url : http://genometools.org/pipermail/gt-users/attachments/20090428/dfd5e4ee/attachment-0002.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test2_no_utr.gff3.bz2
Type: application/octet-stream
Size: 8297 bytes
Desc: test2_no_utr.gff3.bz2
Url : http://genometools.org/pipermail/gt-users/attachments/20090428/dfd5e4ee/attachment-0003.obj