[gt-users] some questions

Gordon Gremme gremme at gmail.com
Mon Aug 3 15:01:33 CEST 2009


Hi Giorgio,

On Wed, Jul 29, 2009 at 2:31 PM, Giorgio Gonnella<ggonnell at yahoo.it> wrote:
> Thank you for answering so quickly. So, now I am writing a tool that I named "gt simreads" to simulate a set of sequencing reads for a given sequence. Maybe a combination of tools (shredder, filter, mutate) could already almost do what I am going to do, but it is quite complicated, so I think this could be useful.
>
> I thought to make the tool work as follow:
> - input: a sequence
> - output: newline separated list of fragments (the simulated reads)
> - options: fragment lenght (-lenght) or lenght range (-minlenght/-maxlenght), number of fragments (-num)
>
> And here are my questions, regarding conventions in gt:
> - Are all tools that accept a sequence as input actually accepting multiple sequences? At the moment I used GtBioSeqIterator, but I don't really know what should be the expected behaviour for a genometools tool if the input is multiple (concatenate input sequences before starting? return "-num" sequences for each input sequence?). Or should I better accept only a single sequence as input?

I think accepting multiple sequences would be the better solution. The
desired semantic depends on your expected use case.
Do you want to simulate reads of a whole genome? In this case it would
be better to consider all sequences at once and not process them
consecutively.

BTW, we are working on a solution to unify the different sequence
containers (e.g., GtBioseq and Encodedsequence) under a few
interfaces, so some changes in this area will occur in the near
future. For you, the important question is: Can I process the
sequences consecutively or do I need random access? If you can process
consecutively, you should architect your code around an iterator, in
the second case you should should architect around a set-like object
like the GtBioseq.

I would not manually concatenate sequences in memory, it severely
limits the amount of sequence data you can process at once.


> - Is the output format as written up here acceptable for a genometools tool?

Please use the FASTA-format. The output is only slightly longer than
the one you described, but it is much more standard and therefore
postprocessing with other software is much easier.
You can use gt_fasta_show_entry() or gt_fasta_show_entry_generic() to
write sequences in FASTA-format.


> - I was thinking to implement an error rate option, is it ok (i.e. mutate sequences inside the tool) or is it better to chain the tool with "gt mutate"?

In your case it makes sense to mutate the sequences inside the tool.
Of course, you should reuse the code used by `gt mutate` to mutate the
sequences (and extend it if necessary).

Gordon


More information about the gt-users mailing list