[gt-users] FASTA stream input implemented?

Gordon Gremme gremme at gmail.com
Tue Feb 17 16:47:52 CET 2009


>> it's been a long time that I've written some code using the genometools.
>> And I am really impressed what is now available in the libraries for
>> implementing
>> efficient and clean C programs!! So, I tried to implement something with
>> the genometools ... :-)

Great, thanks!


>> @Sascha and Gordon:
>> Are there any functions in the genometools library that I can use for
>> reading FASTA files
>> as a stream?
>> I tried using the "core/bioseq_iterator.c" functions to read FASTA files:
> [snip]
>> Is something like a FASTA stream already implemented, and if not, is
>> there a stream implementation which I can use
>> to guide my own FASTA stream implementation?
>
> Maybe using multiple BioSeq objects (one per file) with a different
> FastaReader implementation (via gt_bioseq_new_with_fasta_reader()) may
> help you.
>
> My guess would be the one based on the GtSeqIterator
> (GtFastaReaderGtSeqIt) as it employs GtFastaBuffer which uses regular
> file operations instead of mapping the whole file into memory. Never
> tried that myself though, so let's wait what Gordon has to say about this.

Unfortunately, that's not possible. The FastaReader is only used to
construct a preprocessed index for the FASTA file, which is then
mapped completely into memory (no matter which FastaReader
implementation is used).


> Alternatively, you can try to implement a sequence stream the way the
> GtNodeStream for annotations is implemented (look at the *_stream.[ch]
> files in extended). Such a stream could pass around GtSeq objects
> instead of GtGenomeNodes. The GtSeqIterator interface looks already
> quite like that, just without the stream connection capability.

Yes, that would be the best solution (but the stream returns
GtSequenceNode objects and not GtSeq objects).
You can easily implement it yourself (let's call it FastaInStream).
The FastaInStream would implement the NodeStream interface (you can
copy the BedInStream for example).
The constructor gets the FASTA file name(s) and creates a
GtSeqIterator internally (to process the FASTA file(s)).
In the _next method, you would just call the GtSeqIterator to retrieve
the next FASTA sequence, create a GtSequenceNode and return it.
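
To make the shape of such a stream concrete, here is a rough sketch in
plain C. It deliberately does not use the GenomeTools headers: SeqNode,
FastaInStream and all function names below are hypothetical stand-ins
(for GtSequenceNode, the proposed stream and its constructor/_next
method), and a tiny hand-written parser on a FILE* takes the place of
the GtSeqIterator. A real implementation would take file name(s) and
follow an existing stream such as BedInStream; this only illustrates
the "one record per _next call, NULL at the end" idea.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a GtSequenceNode: one FASTA record. */
typedef struct {
  char *description; /* header line without the leading '>' */
  char *sequence;    /* all sequence lines, concatenated */
} SeqNode;

/* Hypothetical stand-in for the proposed FastaInStream. */
typedef struct {
  FILE *fp;        /* input, owned by the caller */
  int header_seen; /* 1 if the '>' of the next record was already read */
} FastaInStream;

/* Small growable, always NUL-terminated character buffer. */
typedef struct { char *data; size_t len, cap; } Buf;

static void buf_push(Buf *b, char c)
{
  if (b->len + 1 >= b->cap) {
    b->cap = b->cap ? b->cap * 2 : 64;
    b->data = realloc(b->data, b->cap);
    assert(b->data != NULL);
  }
  b->data[b->len++] = c;
  b->data[b->len] = '\0';
}

/* Hand the string to the caller; an empty buffer becomes "". */
static char *buf_finish(Buf *b)
{
  char *s = b->data ? b->data : calloc(1, 1);
  assert(s != NULL);
  return s;
}

FastaInStream *fasta_in_stream_new(FILE *fp)
{
  FastaInStream *s = malloc(sizeof *s);
  assert(fp != NULL && s != NULL);
  s->fp = fp;
  s->header_seen = 0;
  return s;
}

/* Reads one record per call. Returns 0 on success (*node is NULL once
   the input is exhausted) and -1 if the input does not start with '>'. */
int fasta_in_stream_next(FastaInStream *s, SeqNode **node)
{
  Buf desc = {0}, seq = {0};
  int c, at_line_start = 1;
  *node = NULL;
  if (!s->header_seen) {
    /* skip empty lines in front of the first record */
    while ((c = fgetc(s->fp)) != EOF && (c == '\n' || c == '\r'))
      ;
    if (c == EOF) return 0; /* end of stream */
    if (c != '>') return -1; /* not a FASTA header */
  }
  s->header_seen = 0;
  /* description: the rest of the header line */
  while ((c = fgetc(s->fp)) != EOF && c != '\n')
    if (c != '\r') buf_push(&desc, (char) c);
  /* sequence: everything up to the next '>' at a line start, or EOF */
  while ((c = fgetc(s->fp)) != EOF) {
    if (c == '>' && at_line_start) { s->header_seen = 1; break; }
    if (c == '\n') at_line_start = 1;
    else if (c != '\r') { buf_push(&seq, (char) c); at_line_start = 0; }
  }
  *node = malloc(sizeof **node);
  assert(*node != NULL);
  (*node)->description = buf_finish(&desc);
  (*node)->sequence = buf_finish(&seq);
  return 0;
}

void seq_node_delete(SeqNode *n)
{
  if (!n) return;
  free(n->description);
  free(n->sequence);
  free(n);
}

/* Frees the stream object only; the caller closes the FILE*. */
void fasta_in_stream_delete(FastaInStream *s)
{
  free(s);
}
```

A consumer would call fasta_in_stream_next() in a loop until it hands
back a NULL node, so only one sequence is held in memory at a time.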

I just documented the GtSeqIterator class, so make sure you have the
latest head.

Please let us know if you encounter any problems. It would be nice to
add this to GenomeTools!

Gordon

