[gt-users] one more question (sorry)
Sascha Steinbiss
steinbiss at zbh.uni-hamburg.de
Mon Apr 27 16:24:50 CEST 2009
La Rota, Mauricio wrote:
> I have the following curiosity,
>
> Do you know why the following python line (identified by *** in the
> piece of code) returns the features for a given sequence ID out of
> order (even though they were already sorted within the source GFF3
> file?)
>
> =====
> ...
>
> in_stream = GFF3InStream(sys.argv[1])
> add_introns_stream = AddIntronsStream(in_stream)
> feature_index = FeatureIndexMemory()
> feature_stream = FeatureStream(add_introns_stream, feature_index)
> gn = feature_stream.next_tree()
> # fill feature index adding the introns
> while gn:
> gn = feature_stream.next_tree()
>
> seqids = feature_index.get_seqids()
>
> for seqid in seqids:
> featuresThisSeq = feature_index.get_features_for_seqid(seqid)
> for feature in featuresThisSeq: # <--- ***
> # do something with the feature.
>
> ...
> =====
This is because the FeatureIndexMemory stores the nodes in an index
structure called an interval tree for faster retrieval. During a search
in the interval tree for all nodes overlapping a region (like your whole
sequence region here), the nodes are not guaranteed to come out in a
sorted fashion. However, accessing the index using the
FeatureIndex.get_features_for_range() method will sort the nodes on the
C side (by start position), so you can use this method if you need them
sorted. You can get the range covered by a sequence region by using the
FeatureIndex.get_range_for_seqid() method, by the way.
> Also is there a way to sequentially access the "GenomeNodes" and
> directly convert them into FeatureNodes while building the index in:
>
> "gn = feature_stream.next_tree()"
>
> For this particular case I don't need random access to the features
> in the FeatureStream, so I don't even need to build an index, just
> get to each feature node (and their children) sequentially. They are
> already sorted in GFF3 file by seqid and parent feature coordinates.
Up to now, directly converting them into feature nodes is not easy
because the next_tree() method can in theory also return GenomeNodes
which are not feature nodes. I guess I will have to think about how to
detect dynamically which GenomeNode subclass comes out of the stream and
create the appropriate Python subclass object.
If you are sure that only FeatureNodes are parsed from your input, then
you can access the pointer to the C object inside the GenomeNode object
and deliberately create a new FeatureNode wrapper object around it:
in_stream = GFF3InStream(sys.argv[1])
add_introns_stream = AddIntronsStream(in_stream)
feature_index = FeatureIndexMemory()
feature_stream = FeatureStream(add_introns_stream, feature_index)
gn = feature_stream.next_tree()
while gn:
fn = FeatureNode.create_from_ptr(gn.gn)
print fn
gn = feature_stream.next_tree()
> Thanks again,
> Mauricio La Rota
Sascha
--
Sascha Steinbiss
Center for Bioinformatics
University of Hamburg
Bundesstr. 43
20146 Hamburg
Germany
Email: steinbiss at zbh.uni-hamburg.de
URL: http://www.zbh.uni-hamburg.de/steinbiss
Phone: +49 (40) 42838 7322
FAX: +49 (40) 42838 7312
More information about the gt-users
mailing list