[gt-users] gff3 parser
Brent Pedersen
bpederse at gmail.com
Thu Jan 29 20:31:17 CET 2009
On Thu, Jan 29, 2009 at 10:19 AM, Gordon Gremme <gremme at gmail.com> wrote:
> On Thu, Jan 29, 2009 at 6:20 PM, Brent Pedersen <bpederse at gmail.com> wrote:
>> hi, so i can't see that this is a bug according to the spec, but i
>> think it's unexpected.
>> when i have a gff3 file like:
>>
>> ##gff-version 3
>> 1 ucb gene 2234602 2234702 . - . ID=grape_1_2234602_2234702;match=EVM_prediction_supercontig_1.248,EVM_prediction_supercontig_1.248.mRNA
>> 1 ucb gene 2300292 2302123 . + . ID=grape_1_2300292_2302123;match=EVM_prediction_supercontig_244.8
>> 1 ucb gene 2303615 2303967 . + . ID=grape_1_2303615_2303967;match=EVM_prediction_supercontig_244.8
>> 1 ucb gene 3596400 3596503 . - . ID=grape_1_3596400_3596503;match=evm.TU.supercontig_167.27
>> 1 ucb gene 3600651 3600977 . - . ID=grape_1_3600651_3600977;match=evm.model.supercontig_1217.1,evm.model.supercontig_1217.1.mRNA
>>
>>
>> and i run gt gff3 -sort on it, it gives:
>>
>> ##gff-version 3
>> ##sequence-region 1 2234602 3600977
>> 1 ucb gene 2234602 2234702 . - . match=EVM_prediction_supercontig_1.248,EVM_prediction_supercontig_1.248.mRNA
>> 1 ucb gene 2300292 2302123 . + . match=EVM_prediction_supercontig_244.8
>> 1 ucb gene 2303615 2303967 . + . match=EVM_prediction_supercontig_244.8
>> 1 ucb gene 3596400 3596503 . - . match=evm.TU.supercontig_167.27
>> 1 ucb gene 3600651 3600977 . - . match=evm.model.supercontig_1217.1,evm.model.supercontig_1217.1.mRNA
>>
>>
>> which is annoying because it removes the IDs. i think it's because they
>> have no child features,
>> but then when merged with another gff, it has lost vital info for the pipeline.
>>
>> is this intentional? or can it be filed as a bug?
>
> It is intentional. IDs are recreated (i.e., the original IDs are lost)
> in the GFF3OutStream to ensure the required uniqueness of IDs when
> more than one file is processed at once.
> Features without children do not get an ID at all, otherwise a '###'
> line would have to follow after each such feature (to allow for the
> sequential reading with low memory requirements we discussed earlier).
>
i did expect each to be followed by '###'.
> But I can see your point that this behaviour is unexpected from the
> user perspective.
>
> One solution would be to use Name attributes instead of ID attributes.
> They don't have to be unique and are properly retained. Another
> solution would be to implement the possibility to retain the ID
> attributes as far as possible (uniqueness might require
> modifications).
yeah, i've added an arbitrary attribute instead of using ID or Name.
as a naive user, i'd prefer that gt left attributes unchanged, but using
an attribute other than ID works fine.
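
for anyone hitting the same thing, here is a minimal sketch of that workaround in python: rename the ID attribute to some other key before running gt gff3 -sort, since the output above shows non-ID attributes like match are kept. the function name rename_id_attribute and the key orig_id are made up for illustration, not part of gt, and it assumes tab-separated columns and features without children (renaming ID would break any Parent references).

```python
def rename_id_attribute(line, new_key="orig_id"):
    """Return a GFF3 line with its ID attribute renamed to new_key.

    Hypothetical helper, not part of GenomeTools. Only safe for features
    without children, because Parent attributes pointing at the old ID
    are not updated.
    """
    # Leave directives, comments and blank lines untouched.
    if line.startswith("#") or not line.strip():
        return line
    cols = line.rstrip("\n").split("\t")
    # A GFF3 feature line has exactly 9 tab-separated columns.
    if len(cols) != 9:
        return line
    attrs = []
    for pair in cols[8].split(";"):
        key, sep, value = pair.partition("=")
        if key == "ID":
            key = new_key
        attrs.append(key + sep + value)
    cols[8] = ";".join(attrs)
    return "\t".join(cols) + "\n"


def rename_ids(in_path, out_path, new_key="orig_id"):
    """Rewrite a whole GFF3 file, renaming ID on every feature line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(rename_id_attribute(line, new_key))
```

the rewritten file can then be sorted or merged as before, and the original identifier should still be there under the new key.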
-b
>
> Any opinions on this topic?
>
> Gordon
> _______________________________________________
> gt-users mailing list
> gt-users at genometools.org
> http://genometools.org/mailman/listinfo/gt-users
>