[gt-users] gff3 parser
Gordon Gremme
gremme at gmail.com
Thu Jan 29 19:19:03 CET 2009
On Thu, Jan 29, 2009 at 6:20 PM, Brent Pedersen <bpederse at gmail.com> wrote:
> hi, so i cant see that this is a bug according to the spec, but i
> think it's unexpected.
> when i have a fasta like:
>
> ##gff-version 3
> 1 ucb gene 2234602 2234702 . - .
> ID=grape_1_2234602_2234702;match=EVM_prediction_supercontig_1.248,EVM_prediction_supercontig_1.248.mRNA
> 1 ucb gene 2300292 2302123 . + .
> ID=grape_1_2300292_2302123;match=EVM_prediction_supercontig_244.8
> 1 ucb gene 2303615 2303967 . + .
> ID=grape_1_2303615_2303967;match=EVM_prediction_supercontig_244.8
> 1 ucb gene 3596400 3596503 . - .
> ID=grape_1_3596400_3596503;match=evm.TU.supercontig_167.27
> 1 ucb gene 3600651 3600977 . - .
> ID=grape_1_3600651_3600977;match=evm.model.supercontig_1217.1,evm.model.supercontig_1217.1.mRNA
>
>
> and i run gt gff3 -sort on it, it gives:
>
> ##gff-version 3
> ##sequence-region 1 2234602 3600977
> 1 ucb gene 2234602 2234702 . - .
> match=EVM_prediction_supercontig_1.248,EVM_prediction_supercontig_1.248.mRNA
> 1 ucb gene 2300292 2302123 . + .
> match=EVM_prediction_supercontig_244.8
> 1 ucb gene 2303615 2303967 . + .
> match=EVM_prediction_supercontig_244.8
> 1 ucb gene 3596400 3596503 . - .
> match=evm.TU.supercontig_167.27
> 1 ucb gene 3600651 3600977 . - .
> match=evm.model.supercontig_1217.1,evm.model.supercontig_1217.1.mRNA
>
>
> which is annoying because it removes the ID's. i think it's since they
> have no child features,
> but then when merged with another gff, it has lost vital info for the pipeline.
>
> is this intentional? or can it be filed as a bug?
It is intentional. IDs are recreated (i.e., the original IDs are lost)
in the GFF3OutStream to ensure the required uniqueness of IDs when
more than one file is processed at once.
Features without children do not get an ID at all, otherwise a '###'
line would have to follow after each such feature (to allow for the
sequential reading with low memory requirements we discussed earlier).
But I can see your point that this behaviour is unexpected from the
user perspective.
One solution would be to use Name attributes instead of ID attributes.
They don't have to be unique and are properly retained. Another
solution would be to implement the possibility to retain the ID
attributes as far as possible (uniqueness might require
modifications).
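
A minimal sketch of the first workaround, assuming the input is
tab-separated GFF3 with the attributes in column 9. The helper
id_to_name is hypothetical (it is not part of GenomeTools); it only
copies each feature's ID value into a Name attribute, and leaves
features that already carry a Name untouched:

```python
def id_to_name(line: str) -> str:
    """Copy the ID attribute of a GFF3 feature line into a Name attribute.

    Hypothetical helper for illustration; not part of GenomeTools.
    """
    # Pragmas ("##..."), comments and blank lines pass through unchanged.
    if line.startswith("#") or not line.strip():
        return line

    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line  # not a 9-column feature line; leave it alone

    attrs = [a for a in cols[8].split(";") if a]
    ids = [a[len("ID="):] for a in attrs if a.startswith("ID=")]
    has_name = any(a.startswith("Name=") for a in attrs)

    # Only add a Name when there is an ID to copy and no Name yet.
    if ids and not has_name:
        attrs.append("Name=" + ids[0])

    cols[8] = ";".join(attrs)
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(cols) + newline


if __name__ == "__main__":
    import sys

    # Filter stdin to stdout, one GFF3 line at a time.
    for raw in sys.stdin:
        sys.stdout.write(id_to_name(raw))
```

Run over the file before gt gff3 -sort, this keeps the original
identifier available as Name even though the ID itself is recreated
or dropped.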
Any opinions on this topic?
Gordon