[gt-users] gff3 parser

Gordon Gremme gremme at gmail.com
Thu Jan 29 19:19:03 CET 2009


On Thu, Jan 29, 2009 at 6:20 PM, Brent Pedersen <bpederse at gmail.com> wrote:
> hi, so i cant see that this is a bug according to the spec, but i
> think it's unexpected.
> when i have a gff3 file like:
>
> ##gff-version 3
> 1   ucb gene    2234602 2234702 .   -   .
> ID=grape_1_2234602_2234702;match=EVM_prediction_supercontig_1.248,EVM_prediction_supercontig_1.248.mRNA
> 1   ucb gene    2300292 2302123 .   +   .
> ID=grape_1_2300292_2302123;match=EVM_prediction_supercontig_244.8
> 1   ucb gene    2303615 2303967 .   +   .
> ID=grape_1_2303615_2303967;match=EVM_prediction_supercontig_244.8
> 1   ucb gene    3596400 3596503 .   -   .
> ID=grape_1_3596400_3596503;match=evm.TU.supercontig_167.27
> 1   ucb gene    3600651 3600977 .   -   .
> ID=grape_1_3600651_3600977;match=evm.model.supercontig_1217.1,evm.model.supercontig_1217.1.mRNA
>
>
> and i run gt gff3 -sort on it, it gives:
>
> ##gff-version   3
> ##sequence-region   1 2234602 3600977
> 1       ucb     gene    2234602 2234702 .       -       .
> match=EVM_prediction_supercontig_1.248,EVM_prediction_supercontig_1.248.mRNA
> 1       ucb     gene    2300292 2302123 .       +       .
> match=EVM_prediction_supercontig_244.8
> 1       ucb     gene    2303615 2303967 .       +       .
> match=EVM_prediction_supercontig_244.8
> 1       ucb     gene    3596400 3596503 .       -       .
> match=evm.TU.supercontig_167.27
> 1       ucb     gene    3600651 3600977 .       -       .
> match=evm.model.supercontig_1217.1,evm.model.supercontig_1217.1.mRNA
>
>
> which is annoying because it removes the ID's. i think it's since they
> have no child features,
> but then when merged with another gff, it has lost vital info for the pipeline.
>
> is this intentional? or can it be filed as a bug?

It is intentional. IDs are recreated (i.e., the original IDs are lost)
in the GFF3OutStream to ensure the required uniqueness of IDs when
more than one file is processed at once.
Features without children do not get an ID at all. Otherwise a '###'
line would have to follow each such feature (to allow for the
sequential reading with low memory requirements we discussed earlier).

But I can see your point that this behaviour is unexpected from the
user perspective.

One solution would be to use Name attributes instead of ID attributes.
They don't have to be unique and are properly retained. Another
solution would be to implement the possibility to retain the ID
attributes as far as possible (uniqueness might require
modifications).
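
As a rough illustration of the first workaround, here is a minimal
Python sketch that copies each feature's ID value into a Name attribute
before the file is passed to gt gff3 -sort. The function name
copy_id_to_name is made up for this example and is not part of
GenomeTools. It assumes tab-separated GFF3 with attributes in the ninth
column, and it leaves features that already have a Name untouched.

```python
def copy_id_to_name(line):
    """Return a GFF3 line with its ID value duplicated as a Name attribute.

    Hypothetical helper for illustration only, not a GenomeTools function.
    """
    # Pass pragmas ('##...'), comments and blank lines through unchanged.
    if line.startswith("#") or not line.strip():
        return line

    body = line.rstrip("\n")
    ending = line[len(body):]  # preserve the original line ending, if any

    cols = body.split("\t")
    if len(cols) != 9:
        return line  # not a regular nine-column feature line

    attrs = [a for a in cols[8].split(";") if a]
    keys = [a.split("=", 1)[0] for a in attrs]

    # Only add a Name if the feature does not already carry one.
    if "Name" not in keys:
        for attr in attrs:
            if attr.startswith("ID="):
                attrs.append("Name=" + attr[len("ID="):])
                break

    cols[8] = ";".join(attrs)
    return "\t".join(cols) + ending
```

Applied line by line to the input file, this keeps the original
identifier available as Name=grape_1_2234602_2234702 and so on, even if
the ID attribute itself is dropped or regenerated on output.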

Any opinions on this topic?

Gordon

