[gt-users] gff3 parser
Gordon Gremme
gremme at gmail.com
Tue Feb 3 17:41:42 CET 2009
> ok. i guess and checked my way to a start. diff here:
> http://gist.github.com/57121
> there are a couple of Q: s in there that i'm not sure of.
I'll answer the questions you put in the source code first:
+ /* Q: is this const ok? */
+ const char *idstring;
Yes, it should be const, because gt_feature_node_get_attribute()
returns a const pointer.
+ /* Q: better way to test for parent ? */
+ else if ( ! gt_feature_node_get_attribute(gf, "Parent")) {
+ id = create_unique_id(gff3_visitor, gf);
+ has_id = 1;
+ }
You have this information implicitly in add_id() during the traversal
done by gt_genome_node_traverse_direct_children().
But I can't see why you would need this information here.
+ if (has_id == 1) {
+ gt_hashmap_add(gff3_visitor->gt_feature_node_to_unique_id_str, gf,
+ gt_str_ref(id));
/* for each child -> store the parent feature in the hash map */
add_id_info.gt_feature_node_to_id_array =
- gff3_visitor->gt_feature_node_to_id_array,
+ gff3_visitor->gt_feature_node_to_id_array,
add_id_info.id = gt_str_get(id);
had_err = gt_genome_node_traverse_direct_children(gn, &add_id_info, add_id,
err);
+ /* Q: needed? */
+ gt_str_delete(id);
}
Yes, in your case you need it, because you added a new reference for
the gff3_visitor->gt_feature_node_to_unique_id_str hashmap (which takes
ownership of it; see the constructor call). A simpler solution
would be to not call gt_str_ref(id) and to remove the
gt_str_delete(id).
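To illustrate the ownership point, here is a minimal sketch of a
reference-counted string. RcStr and the rc_str_* functions are
hypothetical stand-ins written for this example; they are not the
GenomeTools GtStr implementation and only mimic the ref/delete pattern
discussed above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted string (NOT the real GtStr). */
typedef struct {
  char *cstr;
  unsigned int refcount;
} RcStr;

/* Create a new string; the caller holds the first reference. */
RcStr* rc_str_new(const char *cstr)
{
  RcStr *s = malloc(sizeof *s);
  assert(s);
  s->cstr = malloc(strlen(cstr) + 1);
  assert(s->cstr);
  strcpy(s->cstr, cstr);
  s->refcount = 1;
  return s;
}

/* Take an additional reference, e.g. for a container that owns it. */
RcStr* rc_str_ref(RcStr *s)
{
  assert(s);
  s->refcount++;
  return s;
}

/* Drop one reference and free the string when the last one is gone.
   Returns the number of remaining references (for illustration). */
unsigned int rc_str_delete(RcStr *s)
{
  unsigned int remaining;
  if (!s)
    return 0;
  remaining = --s->refcount;
  if (!remaining) {
    free(s->cstr);
    free(s);
  }
  return remaining;
}
```

With this pattern there are two balanced variants: either call
rc_str_ref() when handing the string to the owning container and
rc_str_delete() on your local reference afterwards, or hand over your
only reference and call neither. Mixing the two leaks or frees too
early.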
Some formality: Please use the bool type for boolean values (instead
of int has_id).
> i didnt keep the old behavior, but can probably figure that out. just
> want to get an idea
> of whether this is the direction you had in mind. i'm not sure it is
> because i didnt do any of the
> stuff you mention in the 2nd to last paragraph above.
That is because you didn't handle ID clashes.
Let me restate the problem I was referring to with this paragraph.
If you parse two GFF3 files with one GFF3InStream (e.g., by calling
`gt gff3` with these two files), where each file contains a feature
with the same ID (let's say ID=foo), that would be perfectly legal
input, because foo is unique in each file. If you then simply reuse
the given IDs in the GFF3OutStream, you would produce illegal output
(i.e., a single file which contains the ID foo twice). That is, you
have a name clash in this case which you have to handle, for example
by renaming the second 'foo' to 'foo.1'.
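A minimal sketch of such clash handling could look like the following.
IdTable and the id_table_* functions are hypothetical names made up for
this example (they are not the GtCstrTable API), and the linear search
is only there to keep the sketch short; a real implementation would use
a hash-based table.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table of all IDs emitted so far (NOT GtCstrTable). */
typedef struct {
  char **ids;
  size_t used, allocated;
} IdTable;

/* Return 1 if id has been emitted before, 0 otherwise. */
int id_table_contains(const IdTable *t, const char *id)
{
  size_t i;
  for (i = 0; i < t->used; i++) {
    if (!strcmp(t->ids[i], id))
      return 1;
  }
  return 0;
}

/* Store a copy of id and return the copy (owned by the table). */
const char* id_table_add(IdTable *t, const char *id)
{
  char *copy;
  if (t->used == t->allocated) {
    size_t n = t->allocated ? 2 * t->allocated : 8;
    char **p = realloc(t->ids, n * sizeof *p);
    assert(p);
    t->ids = p;
    t->allocated = n;
  }
  copy = malloc(strlen(id) + 1);
  assert(copy);
  strcpy(copy, id);
  t->ids[t->used++] = copy;
  return copy;
}

/* Return id itself if it is still unused, otherwise the first unused
   "id.N" (N = 1, 2, ...). The result is recorded in the table. */
const char* id_table_make_unique(IdTable *t, const char *id)
{
  char *candidate;
  const char *stored;
  size_t len;
  unsigned long suffix;
  if (!id_table_contains(t, id))
    return id_table_add(t, id);
  len = strlen(id) + 32; /* room for '.', the number and '\0' */
  candidate = malloc(len);
  assert(candidate);
  for (suffix = 1; ; suffix++) {
    snprintf(candidate, len, "%s.%lu", id, suffix);
    if (!id_table_contains(t, candidate))
      break;
  }
  stored = id_table_add(t, candidate);
  free(candidate);
  return stored;
}

/* Free all stored IDs and reset the table. */
void id_table_free(IdTable *t)
{
  size_t i;
  for (i = 0; i < t->used; i++)
    free(t->ids[i]);
  free(t->ids);
  t->ids = NULL;
  t->used = t->allocated = 0;
}
```

Starting from an empty table (IdTable t = { NULL, 0, 0 };), the first
request for "foo" yields "foo" and the second yields "foo.1". Note that
the table keeps every ID it has ever handed out, so its size grows with
the number of features written.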
That was the problem I was thinking about when referring to the
GtCstrTable, because that's how I would store all used IDs.
Thinking about this again now, I realized that keeping IDs by default
might not be as good an idea as previously thought. Introducing such
a table would mean that memory consumption grows in O(filesize) (for
storing all seen IDs) and would no longer be O(1).
When processing huge GFF3 files this could be a problem.
I had a similar problem in the GFF3 parser, where I had to store all
seen IDs to check for uniqueness. I resolved it by making it optional
(option -checkids).
Hope that helps,
Gordon