[gt-users] gff3 parser
Gordon Gremme
gremme at gmail.com
Fri Jan 30 14:08:22 CET 2009
>> I agree that it would be more intuitive if the IDs would be retained
>> as much as possible.
>> Just created a ticket for it, but I probably won't have time to
>> implement it myself.
>>
>> If someone wants to tackle it, just let me know and I'll give the
>> corresponding pointers on where to look in the code. It's a good task
>> to get started with the codebase.
>
> i'll give it a try then, if you show me where to look.
Ok, cool!
All the action happens in the GtGFF3Visitor (gff3_visitor.[ch]), which
is employed by the GtGFF3OutStream to show all GenomeNodes flowing
through it.
The GtFeatureNodes are processed by the method
gff3_visitor_feature_node(). It is called once for each top-level
feature, which has all its children attached to it. The parent-child
relationship is stored explicitly, but the original ID attribute is
still kept among the attributes of the node.
Therefore, you can still get it with gt_feature_node_get_attribute(fn, "ID").
There is a special case you have to consider: the so-called multi-features.
These are features which span multiple lines, but have the same ID.
Each such multi-feature has a 'representative', which can be quite useful.
To store IDs which have already been used, a GtCstrTable could be
useful. To handle ID clashes, a naming scheme has to be introduced.
Something like: If an ID was already used, append .2 (if that was also
used, .3 instead and so forth). To check whether an ID ends with a
number according to the chosen naming scheme, gt_grep() might be
helpful.
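
To make the naming scheme concrete, here is a minimal sketch in plain C.
It does not use the GenomeTools API: UsedIDs is a hypothetical stand-in
for a GtCstrTable, and make_unique_id() and the other function names are
invented for illustration only. It just tries the original ID first and
then appends .2, .3, ... until an unused ID is found.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a GtCstrTable: a growable array of used IDs. */
typedef struct {
  char **ids;
  size_t num, cap;
} UsedIDs;

/* Returns 1 if <id> has already been used, 0 otherwise (linear search). */
int used_ids_contains(const UsedIDs *u, const char *id)
{
  size_t i;
  for (i = 0; i < u->num; i++) {
    if (strcmp(u->ids[i], id) == 0)
      return 1;
  }
  return 0;
}

/* Stores a private copy of <id> in the set of used IDs. */
void used_ids_add(UsedIDs *u, const char *id)
{
  char *copy;
  if (u->num == u->cap) {
    size_t new_cap = u->cap ? u->cap * 2 : 8;
    char **tmp = realloc(u->ids, new_cap * sizeof *tmp);
    if (!tmp)
      abort();
    u->ids = tmp;
    u->cap = new_cap;
  }
  copy = malloc(strlen(id) + 1);
  if (!copy)
    abort();
  strcpy(copy, id);
  u->ids[u->num++] = copy;
}

/* Frees all stored IDs and resets the set. */
void used_ids_free(UsedIDs *u)
{
  size_t i;
  for (i = 0; i < u->num; i++)
    free(u->ids[i]);
  free(u->ids);
  u->ids = NULL;
  u->num = u->cap = 0;
}

/* Returns a newly allocated ID which has not been used before and records
   it as used. The original <id> is kept if it is still free; otherwise
   .2, .3, ... is appended until no clash remains. The caller frees it. */
char* make_unique_id(UsedIDs *u, const char *id)
{
  size_t len;
  char *candidate;
  unsigned long n;
  assert(u && id);
  len = strlen(id) + 32; /* room for '.', the number and the terminator */
  candidate = malloc(len);
  if (!candidate)
    abort();
  snprintf(candidate, len, "%s", id);
  for (n = 2; used_ids_contains(u, candidate); n++)
    snprintf(candidate, len, "%s.%lu", id, n);
  used_ids_add(u, candidate);
  return candidate;
}
```

Note that with this sketch an incoming ID which already looks like a
generated one (say, gene1.2 after gene1 was used twice) simply gets
another suffix appended. Whether that is the desired behaviour is a
design decision for the real implementation.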
It would be great if the old behaviour were still available via an
option to the GtGFF3OutStream (analogous to
gt_gff3_out_stream_set_fasta_width()).
If you encounter problems, please ask.
Gordon