[gt-users] gff3 parser

Brent Pedersen bpederse at gmail.com
Wed Feb 4 21:53:10 CET 2009


On Tue, Feb 3, 2009 at 8:41 AM, Gordon Gremme <gremme at gmail.com> wrote:
>> ok. i guess and checked my way to a start. diff here:
>> http://gist.github.com/57121
>> there are a couple of Q: s in there that i'm not sure of.
>
> I answer your questions stated in the source code first:
>
> +  /* Q: is this const ok? */
> +  const char *idstring;
>
> Yes, it should be const, because gt_feature_node_get_attribute()
> returns a const pointer.
>
>
> +    /* Q: better way to test for parent ? */
> +    else if ( ! gt_feature_node_get_attribute(gf, "Parent")) {
> +        id = create_unique_id(gff3_visitor, gf);
> +        has_id = 1;
> +    }
>
> You have this information implicitly in add_id() during the traversal
> done by gt_genome_node_traverse_direct_children().
> But I can't see why you would need this information here.
>
>
> +  if (has_id == 1) {
> +    gt_hashmap_add(gff3_visitor->gt_feature_node_to_unique_id_str, gf,
> +               gt_str_ref(id));
>
>     /* for each child -> store the parent feature in the hash map */
>     add_id_info.gt_feature_node_to_id_array =
> -      gff3_visitor->gt_feature_node_to_id_array,
> +       gff3_visitor->gt_feature_node_to_id_array,
>     add_id_info.id = gt_str_get(id);
>     had_err = gt_genome_node_traverse_direct_children(gn, &add_id_info, add_id,
>                                                       err);
> +    /* Q: needed? */
> +    gt_str_delete(id);
>   }
>
> In your case you need it, because you added a new reference to the
> gff3_visitor->gt_feature_node_to_unique_id_str hashmap (which takes
> the ownership of it, see the constructor call). A simpler solution
> would be not calling gt_str_ref(id) and removing the
> gt_str_delete(id).
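
to check that i follow the ownership point, here is a small sketch of the two variants. it uses a made-up Str type and a made-up one-slot container (Str, Slot, slot_put and friends are all hypothetical names, not the real GtStr or GtHashmap API); the only thing it assumes is the ref/delete counting described above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted string; NOT the real GtStr. */
typedef struct {
  char *cstr;
  unsigned int refcount;
} Str;

static Str* str_new(const char *s)
{
  Str *str = malloc(sizeof *str);
  assert(str);
  str->cstr = malloc(strlen(s) + 1);
  assert(str->cstr);
  strcpy(str->cstr, s);
  str->refcount = 1; /* the creator holds the first reference */
  return str;
}

/* Take an additional reference to the same string. */
static Str* str_ref(Str *str)
{
  str->refcount++;
  return str;
}

/* Drop one reference; returns 1 if the memory was actually freed. */
static int str_delete(Str *str)
{
  if (--str->refcount > 0)
    return 0;
  free(str->cstr);
  free(str);
  return 1;
}

/* Hypothetical one-slot container that owns the value stored in it. */
typedef struct {
  Str *value;
} Slot;

static void slot_put(Slot *slot, Str *value)
{
  slot->value = value;
}

/* The container releases its reference; returns 1 if memory was freed. */
static int slot_clear(Slot *slot)
{
  int freed = str_delete(slot->value);
  slot->value = NULL;
  return freed;
}

/* Variant A: give the container its own reference, then drop the local one. */
static int store_with_extra_ref(Slot *slot, const char *s)
{
  Str *id = str_new(s);
  slot_put(slot, str_ref(id)); /* refcount is now 2 */
  return str_delete(id);       /* back to 1, nothing freed: returns 0 */
}

/* Variant B: hand the only reference over; no str_ref(), no str_delete(). */
static void store_and_hand_over(Slot *slot, const char *s)
{
  Str *id = str_new(s);
  slot_put(slot, id);          /* the container now owns the string */
}
```

both variants leave the container holding exactly one reference, so variant B is the simpler one as long as id isn't used after it has been handed over.
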
>
>
> Some formality: Please use the bool type for boolean values (instead
> of int has_id).
>
>
>> i didn't keep the old behavior, but can probably figure that out. just
>> want to get an idea of whether this is the direction you had in mind.
>> i'm not sure it is, because i didn't do any of the stuff you mention
>> in the 2nd to last paragraph above.
>
> That is because you didn't handle ID clashes.
> Let me restate the problem I was referring to with this paragraph.
> If you parse two GFF3 files with one GFF3InStream (e.g., by calling
> `gt gff3` with these two files), where each file contains a feature
> with the same ID (let's say ID=foo), that would be perfectly legal
> input, because foo is unique in each file. If you then simply reuse
> the given IDs in the GFF3OutStream, you would produce illegal output
> (i.e., a single file which contains the ID foo twice). That is, you
> have a name clash in this case which you have to handle, for example
> by renaming the second 'foo' to 'foo.1'.
>
> That was the problem I was thinking about when referring to the
> GtCstrTable, because that's how I would store all used IDs.
>
> When I thought about this now, I realized that keeping IDs by
> default might not be as good as previously thought, because
> introducing such a table would mean that memory consumption would
> grow in O(filesize) (for storing all seen IDs) and wouldn't be O(1)
> anymore.
> When processing huge GFF3 files this could be a problem.
>
> I had a similar problem in the GFF3 parser, where I had to store all
> seen IDs to check for uniqueness. I resolved it by making it optional
> (option -checkids).
>
> Hope that helps,
>
> Gordon

ok. now i understand that problem, but not quite the solution.
where would it keep the GtCstrTable? as an additional field on the
GtGFF3Visitor struct?
or on the GtGFF3Instream?

and then, the option will be a boolean in tools/gt_gff3.c like: -keepids

so i think i can add a GtCstrTable *used_ids arg to
gt_gff3_parser_parse_genome_nodes and parse_regular_gff3_line in
gff3_parser.c. then parse_regular_gff3_line modifies the id attribute
if it's already in used_ids.
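
a rough sketch of the renaming step i have in mind, using the 'foo' -> 'foo.1' example from above. UsedIds is a made-up stand-in for the table (a plain array with linear search, not GtCstrTable), and make_unique_id is a hypothetical helper, not an existing genometools function.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string table such as GtCstrTable. */
typedef struct {
  char **ids;
  size_t size, capacity;
} UsedIds;

/* Linear search; a real table would use hashing. */
static int used_ids_contains(const UsedIds *used, const char *id)
{
  size_t i;
  for (i = 0; i < used->size; i++) {
    if (strcmp(used->ids[i], id) == 0)
      return 1;
  }
  return 0;
}

/* Store a private copy of id, growing the array when it is full. */
static void used_ids_add(UsedIds *used, const char *id)
{
  char *copy;
  if (used->size == used->capacity) {
    size_t new_capacity = used->capacity ? used->capacity * 2 : 8;
    char **grown = realloc(used->ids, new_capacity * sizeof *grown);
    assert(grown);
    used->ids = grown;
    used->capacity = new_capacity;
  }
  copy = malloc(strlen(id) + 1);
  assert(copy);
  strcpy(copy, id);
  used->ids[used->size++] = copy;
}

static void used_ids_free(UsedIds *used)
{
  size_t i;
  for (i = 0; i < used->size; i++)
    free(used->ids[i]);
  free(used->ids);
  used->ids = NULL;
  used->size = used->capacity = 0;
}

/* Write id into buf, appending ".1", ".2", ... until it is not in the
   table yet, then record the result. Returns buf, or NULL if buf is too
   small to hold the candidate. */
static const char* make_unique_id(UsedIds *used, const char *id,
                                  char *buf, size_t bufsize)
{
  unsigned long suffix = 0;
  int len = snprintf(buf, bufsize, "%s", id);
  while (len >= 0 && (size_t) len < bufsize && used_ids_contains(used, buf))
    len = snprintf(buf, bufsize, "%s.%lu", id, ++suffix);
  if (len < 0 || (size_t) len >= bufsize)
    return NULL;
  used_ids_add(used, buf);
  return buf;
}
```

the table keeps one entry per emitted ID, which is exactly the O(filesize) memory growth mentioned above, so it would only be filled when the option is set. children that point at a renamed feature via Parent would need the new ID as well; the sketch doesn't cover that.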

does that seem correct?

thanks for your patience, i usually like to say i know just enough C
to create segfaults at will.
-brent

