[gt-users] Tallymer to get k-mers in a specific range?
Stefan Kurtz
stefan.kurtz at me.com
Fri Jan 15 22:38:44 CET 2010
I forgot two things:
The command is gt tallymer mkindex, not gt mkindex as I wrote below.
Please also add the option -scan, so that the suffix array is read sequentially.
To construct the suffix array with suffixerator you will probably have to use
the option -parts <numberofparts> to reduce the peak memory usage.
Bye,
Stefan
Stefan Kurtz wrote:
> Hi Dan,
>
> Dan Bolser wrote:
>> Hi All, Stefan,
>>
>> I've just read through the Tallymer manual, so apologies if this
>> question is obvious...
>>
>> I have a large set of reads, and I'd like to extract all k-mers (the
>> sequences) with occurrence between 200 and 250 (mitochondrially
>> derived).
>>
>> I can't seem to see a way to do that with the Tallymer tools...
>>
>
> With the program gt tallymer mkindex you do not actually have to create
> an index; you can directly pull out the k-mers satisfying the length and
> occurrence constraints. Once you have created the enhanced suffix array
> of your read set, you can call
>
> gt mkindex -esa myesaindex -mersize 51 -minocc 200 -maxocc 250 > out.tmp
>
> In out.tmp you find lines like
>
> 200 34234
> aatctataatggaatattaaaaaaaaaaaaaattata....agagatatata
> ...
> <34233 more lines with 51-mers>
> 201 663
> atatcaggatatgagc......aactatcgactatacgacgacgacggac
> <662 more lines with 51-mers>
> ...
>
> A simple postprocessing script should make it easy to extract the 51-mers
> with the appropriate occurrence counts. Note that the generated output
> can be huge.
>
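A minimal sketch of such a postprocessing script, assuming the output looks exactly like the excerpt above: a header line with two integers (occurrence count, then number of k-mers), followed by one k-mer per line. The function name kmers_in_range and its parameters are hypothetical and not part of GenomeTools. It streams over the lines, so a huge out.tmp does not have to fit into memory.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumed header format: "<occurrence count> <number of k-mers>", e.g. "200 34234"
HEADER = re.compile(r"^(\d+)\s+(\d+)$")


def kmers_in_range(lines: Iterable[str], lo: int = 200, hi: int = 250) -> Iterator[Tuple[int, str]]:
    """Yield (occurrence count, k-mer) for every k-mer whose count is in [lo, hi]."""
    occ = None  # occurrence count of the block currently being read
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            # A new block starts; remember its occurrence count.
            occ = int(header.group(1))
        elif occ is not None and lo <= occ <= hi:
            # Any other line inside a block is taken to be a k-mer.
            yield occ, line
```

To use it, iterate over the open file, for example: for occ, kmer in kmers_in_range(open("out.tmp")): print(occ, kmer). Since the call above already restricts the counts with -minocc and -maxocc, the range check mainly acts as a safeguard.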
>>
>> Also, I'd like to plot the k-mer frequency distribution for a given
>> k-mer length, i.e. given a k-mer length of 51, how many k-mers occur 1
>> time, 2 times, 3 times, ... x times. etc. I can't see how to do this
>> directly.
>>
> The lines containing integer pairs in the output of the call above
> should give you exactly this
> information.
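As a rough sketch, the integer pairs can be collected into a table ready for plotting. This assumes, as in the excerpt above, that each pair line reads "occurrence count, number of k-mers" and that k-mer lines never consist of two integers. The function name occurrence_distribution is hypothetical.

```python
import re
from typing import Dict, Iterable

# Assumed pair format: "<occurrence count> <number of k-mers>", e.g. "201 663"
PAIR = re.compile(r"^(\d+)\s+(\d+)$")


def occurrence_distribution(lines: Iterable[str]) -> Dict[int, int]:
    """Map each occurrence count to the number of k-mers having that count."""
    dist: Dict[int, int] = {}
    for raw in lines:
        pair = PAIR.match(raw.strip())
        if pair:
            dist[int(pair.group(1))] = int(pair.group(2))
        # All other lines (the k-mers themselves) are skipped.
    return dist
```

The table only covers the occurrence counts that the call was asked to report, so for a full distribution the -minocc and -maxocc limits would presumably have to be widened accordingly.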
>
> Hope this helps. If not, contact me again.
>>
>> The best I can come up with is creating a fake sequence file with all
>> possible k-mers in it and passing that to search, but I'm sure this
>> information should be available directly from one of the index files.
>>
>>
>> Thanks for any help with these issues.
>>
>> All the best,
>> Dan.
>> _______________________________________________
>> gt-users mailing list
>> gt-users at genometools.org
>> http://genometools.org/mailman/listinfo/gt-users
>>
>>
--
Prof. Dr. Stefan Kurtz
Zentrum fuer Bioinformatik
Universitaet Hamburg
Bundesstrasse 43
20146 Hamburg
Germany
Email: kurtz at zbh.uni-hamburg.de
URL: http://www.zbh.uni-hamburg.de/kurtz
Phone: +49 (40) 42838 7311
FAX: +49 (40) 42838 7312