[gt-users] Tallymer to get k-mers in a specific range?
Dan Bolser
dan.bolser at gmail.com
Sat Jan 16 07:52:44 CET 2010
Cheers Stefan, that looks great.
I just have a question about the format of the results (inline below).
2010/1/15 Stefan Kurtz <stefan.kurtz at me.com>:
> Hi Dan,
>
> Dan Bolser wrote:
>>
>> Hi All, Stefan,
>>
>> I've just read through the Tallymer manual, so apologies if this
>> question is obvious...
>>
>> I have a large set of reads, and I'd like to extract all k-mers (the
>> sequences) with occurrence between 200 and 250 (mitochondrially
>> derived).
>>
>> I can't seem to see a way to do that with the Tallymer tools...
>>
>
> With the program gt tallymer mkindex you do not actually have to create an
> index; you can directly pull out the k-mers satisfying the occurrence
> constraints. Once you have created the enhanced suffix array of your read
> set, you can call
>
> gt tallymer mkindex -esa myesaindex -mersize 51 -minocc 200 -maxocc 250 > out.tmp
If I understand correctly this will return all 51-mers that occur
between 200 and 250 times in my read set - perfect!
> In out.tmp you find lines like
>
> 200 34234
> aatctataatggaatattaaaaaaaaaaaaaattata....agagatatata
> ...
> <34233 more lines with 51-mers>
Ahh... ok, I think I just got it. The 34234 k-mer lines under the
"200 34234" header are all the 51-mers that occur exactly 200 times in
my read set.
Thanks Stefan.
If I understand correctly, the -scan and -parts options in your
subsequent email are not mandatory, but are required if the memory of
the box is limited?
Looks like I may finally be able to assemble this Mt. sequence!
All the best,
Dan.
> 201 663
> atatcaggatatgagc......aactatcgactatacgacgacgacggac
> <662 more lines with 51-mers>
> ...
>
> A simple postprocessing script should make it easy to extract the 51-mers
> with the appropriate occurrence counts. Note that the generated output can
> be huge.
>
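A minimal sketch of such a postprocessing script, assuming out.tmp has exactly the layout shown in the example above: a header line with two integers (occurrence count, number of k-mers), followed by one k-mer per line. The function name parse_mkindex_output and the tab-separated output are illustrative choices, not part of Tallymer.

```python
import sys
from typing import Dict, Iterable, List


def parse_mkindex_output(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Group k-mers by occurrence count.

    Assumed layout (taken from the example in this thread):
      <count> <number of k-mers>
      <k-mer>
      <k-mer>
      ...
    """
    kmers_by_count: Dict[int, List[str]] = {}
    current = None  # occurrence count of the block being read
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split()
        # A header line consists of exactly two integers.
        if len(fields) == 2 and fields[0].isdigit() and fields[1].isdigit():
            current = int(fields[0])
            kmers_by_count.setdefault(current, [])
        elif current is not None:
            kmers_by_count[current].append(line)
    return kmers_by_count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python extract_kmers.py out.tmp
    with open(sys.argv[1]) as handle:
        grouped = parse_mkindex_output(handle)
    for count in sorted(grouped):
        for kmer in grouped[count]:
            print(f"{kmer}\t{count}")
```

Since the output can be huge, holding every k-mer in a dictionary may not be practical for a large read set; in that case the same header test can be used to print k-mers as they are read instead of collecting them.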
>>
>> Also, I'd like to plot the k-mer frequency distribution for a given
>> k-mer length, i.e. given a k-mer length of 51, how many k-mers occur 1
>> time, 2 times, 3 times, ... x times. etc. I can't see how to do this
>> directly.
>>
>
> The lines containing integer pairs in the output of the call above should
> give you exactly this information.
>
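As a rough sketch of how the integer-pair lines could be pulled out for plotting, assuming each such line reads "occurrence count, number of distinct k-mers" as in the example above. The function name occurrence_histogram is hypothetical. Only the occurrence range requested via -minocc and -maxocc would be covered.

```python
import sys
from typing import Dict, Iterable


def occurrence_histogram(lines: Iterable[str]) -> Dict[int, int]:
    """Map occurrence count -> number of distinct k-mers with that count.

    Only lines made of exactly two integers are used; k-mer lines
    and blank lines are skipped.
    """
    hist: Dict[int, int] = {}
    for raw in lines:
        fields = raw.split()
        if len(fields) == 2 and all(f.isdigit() for f in fields):
            hist[int(fields[0])] = int(fields[1])
    return hist


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python kmer_hist.py out.tmp > hist.tsv
    with open(sys.argv[1]) as handle:
        hist = occurrence_histogram(handle)
    for count in sorted(hist):
        print(f"{count}\t{hist[count]}")
```

The resulting two-column table can be fed to any plotting tool.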
> Hope this helps. If not, contact me again.
>>
>> The best I can come up with is creating a fake sequence file with all
>> possible k-mers in it and passing that to search, but I'm sure this
>> information should be available directly from one of the index files.
>>
>>
>> Thanks for any help with these issues.
>>
>> All the best,
>> Dan.
>> _______________________________________________
>> gt-users mailing list
>> gt-users at genometools.org
>> http://genometools.org/mailman/listinfo/gt-users
>>
>>
>
>
> --
> Prof. Dr. Stefan Kurtz
> Zentrum fuer Bioinformatik
> Universitaet Hamburg
> Bundesstrasse 43
> 20146 Hamburg
> Germany
>
> Email: kurtz at zbh.uni-hamburg.de
> URL: http://www.zbh.uni-hamburg.de/kurtz
> Phone: +49 (40) 42838 7311
> FAX: +49 (40) 42838 7312
>
>