[gt-users] gff3 parser

Brent Pedersen bpederse at gmail.com
Thu Jan 29 20:31:17 CET 2009


On Thu, Jan 29, 2009 at 10:19 AM, Gordon Gremme <gremme at gmail.com> wrote:
> On Thu, Jan 29, 2009 at 6:20 PM, Brent Pedersen <bpederse at gmail.com> wrote:
>> hi, so i can't see that this is a bug according to the spec, but i
>> think it's unexpected.
>> when i have a gff3 file like:
>>
>> ##gff-version 3
>> 1   ucb gene    2234602 2234702 .   -   .
>> ID=grape_1_2234602_2234702;match=EVM_prediction_supercontig_1.248,EVM_prediction_supercontig_1.248.mRNA
>> 1   ucb gene    2300292 2302123 .   +   .
>> ID=grape_1_2300292_2302123;match=EVM_prediction_supercontig_244.8
>> 1   ucb gene    2303615 2303967 .   +   .
>> ID=grape_1_2303615_2303967;match=EVM_prediction_supercontig_244.8
>> 1   ucb gene    3596400 3596503 .   -   .
>> ID=grape_1_3596400_3596503;match=evm.TU.supercontig_167.27
>> 1   ucb gene    3600651 3600977 .   -   .
>> ID=grape_1_3600651_3600977;match=evm.model.supercontig_1217.1,evm.model.supercontig_1217.1.mRNA
>>
>>
>> and i run gt gff3 -sort on it, it gives:
>>
>> ##gff-version   3
>> ##sequence-region   1 2234602 3600977
>> 1       ucb     gene    2234602 2234702 .       -       .
>> match=EVM_prediction_supercontig_1.248,EVM_prediction_supercontig_1.248.mRNA
>> 1       ucb     gene    2300292 2302123 .       +       .
>> match=EVM_prediction_supercontig_244.8
>> 1       ucb     gene    2303615 2303967 .       +       .
>> match=EVM_prediction_supercontig_244.8
>> 1       ucb     gene    3596400 3596503 .       -       .
>> match=evm.TU.supercontig_167.27
>> 1       ucb     gene    3600651 3600977 .       -       .
>> match=evm.model.supercontig_1217.1,evm.model.supercontig_1217.1.mRNA
>>
>>
>> which is annoying because it removes the IDs. i think it's because they
>> have no child features,
>> but then when merged with another gff, it has lost vital info for the pipeline.
>>
>> is this intentional? or can it be filed as a bug?
>
> It is intentional. IDs are recreated (i.e., the original IDs are lost)
> in the GFF3OutStream to ensure the required uniqueness of IDs when
> more than one file is processed at once.
> Features without children do not get an ID at all, otherwise a '###'
> line would have to follow after each such feature (to allow for the
> sequential reading with low memory requirements we discussed earlier).
>

i did expect each to be followed by '###'.
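
for reference, a minimal sketch of why '###' matters for sequential reading. in the gff3 spec, a '###' line means that all forward references to IDs seen so far have been resolved, so a reader can hand off everything it has buffered and free the memory. the function name iter_resolved_batches is made up for illustration; this is not how gt implements it.

```python
def iter_resolved_batches(lines):
    """Yield lists of feature lines, one list per '###'-terminated batch.

    Hypothetical helper: only the current batch is held in memory, which is
    what makes a '###' after each group of features useful to a streaming
    reader.
    """
    batch = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "###":
            # Everything buffered so far is complete; release it.
            if batch:
                yield batch
            batch = []
        elif line and not line.startswith("#"):
            # Skip blank lines, directives ("##...") and comments ("#...").
            batch.append(line)
    # A file may end without a final '###'.
    if batch:
        yield batch
```

with a '###' after every childless feature, each batch would hold exactly one line.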


> But I can see your point that this behaviour is unexpected from the
> user perspective.
>
> One solution would be to use Name attributes instead of ID attributes.
> They don't have to be unique and are properly retained. Another
> solution would be to implement the possibility to retain the ID
> attributes as far as possible (uniqueness might require
> modifications).

yeah, i've added an arbitrary attribute instead of using ID or Name.
as a naive user, i'd prefer that gt left attributes unchanged, but using
an attribute other than ID works fine.
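
in case it helps anyone else, a rough sketch of that workaround: copy each ID into a separate attribute before running gt gff3 -sort, so the value is still there if the ID itself gets regenerated or dropped. the function name copy_id_to_attribute and the tag orig_id are my own choices for illustration (lowercase, since the spec reserves tags starting with an uppercase letter); it assumes tab-separated 9-column lines.

```python
def copy_id_to_attribute(line, tag="orig_id"):
    """Return the GFF3 line with its ID value duplicated under `tag`.

    Directives, comments, blank lines, lines that do not have exactly
    9 tab-separated columns, and features without an ID are returned
    unchanged. `tag` is a hypothetical attribute name, not a GFF3 or
    GenomeTools convention.
    """
    if not line.strip() or line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line
    attrs = [a for a in cols[8].split(";") if a]
    for attr in attrs:
        if attr.startswith("ID="):
            # Duplicate the ID value under the custom tag and stop.
            attrs.append(f"{tag}={attr[3:]}")
            break
    else:
        return line  # no ID attribute present
    cols[8] = ";".join(attrs)
    return "\t".join(cols) + newline
```

applied line by line to the input file, this leaves the ID in place and adds e.g. orig_id=grape_1_2300292_2302123 at the end of the attributes column.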
-b


>
> Any opinions on this topic?
>
> Gordon

