[gt-users] Tallymer to get k-mers in a specific range?

Stefan Kurtz stefan.kurtz at me.com
Fri Jan 15 22:23:42 CET 2010


Hi Dan,

Dan Bolser wrote:
> Hi All, Stefan,
>
> I've just read through the Tallymer manual, so apologies if this
> question is obvious...
>
> I have a large set of reads, and I'd like to extract all k-mers (the
> sequences) with occurrence between 200 and 250 (mitochondrially
> derived).
>
> I can't seem to see a way to do that with the Tallymer tools...
>   

With the program gt tallymer mkindex you do not actually have to create
an index; you can directly pull out the k-mers satisfying the length and
occurrence constraints. Once you have created the enhanced suffix array
of your read set, you can call

gt tallymer mkindex -esa myesaindex -mersize 51 -minocc 200 -maxocc 250 > out.tmp

In out.tmp you find lines like

200 34234
aatctataatggaatattaaaaaaaaaaaaaattata....agagatatata
...
<34233 more lines with 51-mers>
201 663
atatcaggatatgagc......aactatcgactatacgacgacgacggac
<662 more lines with 51-mers>
...

A simple postprocessing script should make it easy to extract the
51-mers with the appropriate occurrence counts. Note that the output
generated can be huge.
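
Such a postprocessing script could look roughly like the sketch below.
It assumes the layout shown above: a header line with two integers
(occurrence count, number of k-mers), followed by one k-mer per line.
The function name iter_kmers and its parameters are made up for this
example and are not part of Tallymer. The file is read line by line,
so a huge out.tmp never has to fit into memory.

```python
import io


def iter_kmers(lines, min_occ=200, max_occ=250):
    """Yield (occurrence_count, kmer) pairs from Tallymer-style output lines.

    Assumed layout: a header line "<count> <number of k-mers>" followed by
    the k-mers themselves, one per line.
    """
    current = None  # occurrence count of the block being read
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split()
        # A header line consists of exactly two integers.
        if len(fields) == 2 and fields[0].isdigit() and fields[1].isdigit():
            current = int(fields[0])
            continue
        # Any other line is taken to be a k-mer of the current block.
        if current is not None and min_occ <= current <= max_occ:
            yield current, line


# Tiny made-up sample; for real data pass open("out.tmp") instead.
sample = io.StringIO("200 2\nacgt\nttga\n201 1\nggca\n")
for occ, kmer in iter_kmers(sample):
    print(occ, kmer)
```

The range check is redundant if -minocc and -maxocc were already given
on the command line, but it allows a narrower range to be selected
later without rerunning Tallymer.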

>
> Also, I'd like to plot the k-mer frequency distribution for a given
> k-mer length, i.e. given a k-mer length of 51, how many k-mers occur 1
> time, 2 times, 3 times, ... x times. etc. I can't see how to do this
> directly.
>   
The lines containing integer pairs in the output of the call above
should give you exactly this information.
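
As a rough sketch, the integer pairs could be collected like this. It
assumes that every line consisting of exactly two integers is a header
of the form "<occurrence count> <number of k-mers>", as in the example
output; the function name occurrence_distribution is made up for this
example. Presumably the pairs only cover occurrence counts inside the
range given by -minocc and -maxocc, so a wider range would be needed
for a full distribution plot.

```python
def occurrence_distribution(lines):
    """Map occurrence count -> number of distinct k-mers with that count.

    Only lines made of exactly two integers are treated as header lines;
    all other lines (the k-mers themselves) are skipped.
    """
    dist = {}
    for raw in lines:
        fields = raw.split()
        if len(fields) == 2 and fields[0].isdigit() and fields[1].isdigit():
            occ, num = int(fields[0]), int(fields[1])
            dist[occ] = dist.get(occ, 0) + num
    return dist


# Tiny made-up sample; for real data pass open("out.tmp") instead.
sample = ["200 2", "acgt", "ttga", "201 1", "ggca"]
for occ, num in sorted(occurrence_distribution(sample).items()):
    print(occ, num)
```

The resulting two columns can be passed directly to any plotting tool.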

Hope this helps. If not, contact me again.
>
> The best I can come up with is creating a fake sequence file with all
> possible k-mers in it and passing that to search, but I'm sure this
> information should be available directly from one of the index files.
>
>
> Thanks for any help with these issues.
>
> All the best,
> Dan.
>   


-- 
Prof. Dr. Stefan Kurtz
Zentrum fuer Bioinformatik
Universitaet Hamburg
Bundesstrasse 43
20146 Hamburg
Germany

Email:  kurtz at zbh.uni-hamburg.de
URL:    http://www.zbh.uni-hamburg.de/kurtz
Phone:  +49 (40) 42838 7311
FAX:    +49 (40) 42838 7312
