[gt-users] gt python with (multi)processing

Sascha Steinbiss steinbiss at zbh.uni-hamburg.de
Fri Jan 22 18:59:51 CET 2010


On 01/20/2010 06:12 PM, Brent Pedersen wrote:
>> Sorry to come back to this, but I cannot see why this script uses
>> multithreading. It looks very sequential as there are no tasks
>> distributed to the workers in the pool.
> hi, yes it is just a dummy script to demonstrate the problem. that's the
> minimum required to cause problems on my machine.

Hmm. I can't reproduce that. This is what I get with a version from
before the thread-safety patches (commit 02345ac73f9b...):

$ cat seq_fi_test.py
#!/usr/bin/env python
import processing
import gt
p = processing.Pool(4)
f = gt.FeatureIndexMemory()
f.add_gff3file('./testdata/encode_known_genes_Mar07.gff3')
f.add_gff3file('./testdata/encode_known_genes_Mar07.gff3')
print f.get_seqids()

$ ./seq_fi_test.py
['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16',
'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr5', 'chr6',
'chr7', 'chr8', 'chr9', 'chrX']

This works reliably and repeatedly with several input files.

> regardless, i'll wait as the new threadsafe stuff
> progresses.

There is some news on this front. The following examples all use my
thread-safe FeatureIndex class.

- Firstly, I am afraid that the Python processing package is not the
right tool to test the thread-safety of GenomeTools, since it does not
wrap threads, but rather full-fledged processes
(http://pypi.python.org/pypi/processing).
This means that sharing the same FeatureIndex object transparently
across multiple pool workers does not work as intended, because each
worker process operates on its own copy of the object, as I had to
find out:

$ cat mp_fi_test.py
#!/usr/bin/env python
import processing
import gt
import sys
numthreads = 4
f = gt.FeatureIndexMemory()
p = processing.Pool(numthreads)
print p.map(f.add_gff3file, [sys.argv[1] for i in range(numthreads)])
print f.get_seqids()

$ ./mp_fi_test.py testdata/encode_known_genes_Mar07.gff3
[None, None, None, None]
[]
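
As a minimal sketch of the effect shown above, here is the same pattern
with the standard library's multiprocessing module (the successor to
processing, part of the standard library since Python 2.6) and a plain
Python class. ToyIndex and demo are hypothetical names used only for
illustration and are not part of the gt bindings. The sketch also
assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing


class ToyIndex:
    """Hypothetical stand-in for gt.FeatureIndexMemory (not the real API)."""

    def __init__(self):
        self.seqids = []

    def add(self, seqid):
        # Runs inside a worker process, on that worker's copy of the object.
        self.seqids.append(seqid)


def demo(num_workers=4):
    index = ToyIndex()
    # Assumption: POSIX system, so the "fork" start method exists.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(num_workers) as pool:
        # The bound method (and with it a copy of index) is sent to the workers.
        results = pool.map(index.add, ["chr1"] * num_workers)
    # The parent's object was never modified by the workers.
    return results, index.seqids


if __name__ == "__main__":
    print(demo())
```

Calling demo() returns ([None, None, None, None], []). The workers each
appended to their own copy, so the list held by the parent process
stays empty, which matches the empty seqid list in the output above.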


- Secondly, I used the following script to test whether threading works
from the Python bindings:

$ cat mp_fi_test2.py
#!/usr/bin/env python
import threading
import gt
import sys
f = gt.FeatureIndexMemory()
def get_nof_features_in_index(index):
    return \
    reduce(lambda acc, id: acc + len(index.get_features_for_seqid(id)),
                  index.get_seqids(), 0)
class TestThread(threading.Thread):
    def __init__(self, index, file, number):
        threading.Thread.__init__(self)
        self.fi = index
        self.file = file
        self.number = number
    def run(self):
        self.fi.add_gff3file(self.file)
        print ("%d finished, index now has %d features " + \
              "in %d sequences") % \
                         (self.number,
                          get_nof_features_in_index(self.fi),
                          len(self.fi.get_seqids()))
threads = []
for i in range(4):
    t = TestThread(f, sys.argv[1], i)
    threads.append(t)
    t.start()
for thread in threads:
    thread.join()
print f.get_seqids()
print "%d features in index altogether" % get_nof_features_in_index(f)

$ ./mp_fi_test2.py testdata/encode_known_genes_Mar07.gff3
2 finished, index now has 2991 features in 20 sequences
1 finished, index now has 7790 features in 20 sequences
3 finished, index now has 10624 features in 20 sequences
0 finished, index now has 11964 features in 20 sequences
['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16',
'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr5', 'chr6',
'chr7', 'chr8', 'chr9', 'chrX']
11964 features in index altogether

which is correct and now also works reliably and well with the new C
patches (in my 'mt-featureindex' branch on GitHub).
I will complete the unit tests (which are quite tedious to write if
nothing is to be missed) and then push a new version to this branch
(and announce it here too).
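
For contrast with the process-based attempt, the following is a rough
sketch of why the threaded variant sees all additions: threads share
one object in one address space. ToyIndex and load_in_threads are
hypothetical names, and the threading.Lock is only a stand-in for
whatever synchronisation the C patches actually perform, which this
sketch does not try to reproduce.

```python
import threading


class ToyIndex:
    """Hypothetical stand-in for a thread-safe feature index."""

    def __init__(self):
        self._lock = threading.Lock()
        self._features = {}

    def add_features(self, seqid, count):
        # Assumption: a simple lock; the real C code may synchronise differently.
        with self._lock:
            self._features[seqid] = self._features.get(seqid, 0) + count

    def total(self):
        with self._lock:
            return sum(self._features.values())


def load_in_threads(index, records, num_threads=4):
    """Let every thread add the same records to the one shared index."""

    def worker():
        for seqid, count in records:
            index.add_features(seqid, count)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


records = [("chr1", 100), ("chr2", 50), ("chrX", 25)]
index = ToyIndex()
load_in_threads(index, records)
print(index.total())  # 4 threads x 175 features = 700
```

All four threads write into the same object, so the final count is the
sum of every thread's additions.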

> also, i notice in recent commits, you use the @function.setter
> decorator for the
> range module. that's cool syntax i hadn't used, but it is only
> available in >=python 2.6
> i'm attaching a patch that works with (at least) 2.4 and 2.5
> as well.

Thank you very much. It's now in master.
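
For readers who have not run into this: the @<name>.setter decorator
form was added in Python 2.6, while passing the getter and setter to
property() directly also works on older versions. The following is
only a sketch of that older spelling with a hypothetical Range class.
It is not the actual patch and not the real gt range module.

```python
class Range(object):
    """Hypothetical class for illustration; not the actual gt range module."""

    def __init__(self, start, end):
        self._start = start
        self._end = end

    def _get_start(self):
        return self._start

    def _set_start(self, value):
        if value > self._end:
            raise ValueError("start must not exceed end")
        self._start = value

    # Pre-2.6 compatible spelling: hand getter and setter to property().
    start = property(_get_start, _set_start)

    # On Python >= 2.6 the same property could instead be written as:
    #   @property
    #   def start(self): ...
    #   @start.setter
    #   def start(self, value): ...


r = Range(1, 10)
r.start = 5
print(r.start)  # 5
```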

> -brent

Sascha

-- 
Sascha Steinbiss
Center for Bioinformatics
University of Hamburg
Bundesstr. 43
20146 Hamburg
Germany

Email:  steinbiss at zbh.uni-hamburg.de
URL:    http://www.zbh.uni-hamburg.de/steinbiss
Phone:  +49 (40) 42838 7322
FAX:    +49 (40) 42838 7312


