Changing `Sequence.__contains_

Discussion:

Changing `Sequence.__contains__`

Ram Rachum

2014-07-20 22:06:33 UTC

Why does the default `Sequence.__contains__` iterate through the items
rather than use `.index`, which may sometimes be more efficient?

I suggest an implementation like this:

def __contains__(self, i):
try: self.index(i)
except ValueError: return False
else: return True

What do you think?

Steven D'Aprano

2014-07-21 01:41:43 UTC

Permalink

Post by Ram Rachum
Why does the default `Sequence.__contains__` iterate through the items
rather than use `.index`, which may sometimes be more efficient?

Because having an index() method is not a requirement to be a sequence.
It is optional. The implementation for Sequence.__contains__ which makes
the least assumptions about the class is to iterate over the items.

Post by Ram Rachum
try: self.index(i)
except ValueError: return False
else: return True
What do you think?

That now means that sequence types will have to define an index method
in order to be a sequence. Not only that, but the index method has to
follow a standard API, which not all sequence types may do.

This would be marginally better:

def __contains__(self, obj):
try:
index = type(self).index
except AttributeError:
for o in self:
if o is obj or o == obj:
return True
return False
else:
try:
index(obj)
except ValueError:
return False
else:
return True

but it has at two problems I can see:

- it's not backwards compatible with sequence types which may already
define an index attribute which does something different, e.g.:

class Book:
def index(self):
# return the index of the book
def __getitem__(self, n):
# return page n

- for a default implementation, it's too complicated.

If your sequence class has an efficient index method (or an efficient
find method, or __getitem__ method, or any other efficient way of
testing whether something exists in the sequence quickly) it's not much
more work to define a custom __contains__ to take advantage of that.
There's no need for the default Sequence fallback to try to guess what
time-saving methods you might provide.

For a historical view, you should be aware that until recently, tuples
had no index method:

[***@ando ~]$ python2.5
Python 2.5.4 (r254:67916, Nov 25 2009, 18:45:43)
[GCC 4.1.2 20070626 (Red Hat 4.1.2-14)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
py> (1, 2).index
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'index'

There's no reason to expect that all sequences will have an index
method, and certainly no reason to demand that they do.

--
Steven

Andrew Barnert

2014-07-21 01:52:58 UTC

Permalink

Post by Steven D'Aprano

Post by Ram Rachum
Why does the default `Sequence.__contains__` iterate through the items
rather than use `.index`, which may sometimes be more efficient?

Because having an index() method is not a requirement to be a sequence.
It is optional.

Sequence.__contains__ certainly can assume that your class will have an index method, because it provides one if you don't. See https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes (and you can look all the way back to 2.6 and 3.0 to verify that it's always been there).

The default implementation looks like this:

for i, v in enumerate(self):
if v == value:
return i
raise ValueError

Post by Steven D'Aprano
- it's not backwards compatible with sequence types which may already
# return the index of the book

This isn't a Sequence. You didn't inherit from collections.abc.Sequence, or even register with it. So, Sequence.__contains__ can't get called on your class in the first place.

If you _do_ make it a Sequence, then you're violating the protocol you're claiming to support, and it's your own fault if that doesn't work. You can also write a __getitem__ that requires four arguments and call yourself a Sequence, but you're going to get exceptions all over the place trying to use it.

Post by Steven D'Aprano
For a historical view, you should be aware that until recently, tuples

That was true up until the collections ABCs were aded in 2.6 and 3.0. Prior to that, yes, the "sequence protocol" was a vague thing, and you couldn't be sure that something had an index method just because it looked like a sequence—but, by the same token, prior to that, there was no Sequence ABC mixin, so the problem wasn't relevant in the first place.

Mark Lawrence

2014-07-21 02:06:59 UTC

Permalink

Post by Ram Rachum
Why does the default `Sequence.__contains__` iterate through the items
rather than use `.index`, which may sometimes be more efficient?
try: self.index(i)
except ValueError: return False
else: return True
What do you think?

I don't see how that can be more efficient than the naive

def __contains__(self, i):
for elem in self:
if elem == i:
return True
return False

What am I missing?

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

---
This email is free from viruses and malware because avast! Antivirus protection is active.
http://www.avast.com

Chris Angelico

2014-07-21 02:09:27 UTC

Permalink

Post by Mark Lawrence

I don't see how that can be more efficient than the naive
return True
return False
What am I missing?

If your sequence provides a more efficient index(), then __contains__
can take advantage of it. If not, it's a bit more indirection and the
same result.

ChrisA

Mark Lawrence

2014-07-21 02:36:52 UTC

Permalink

Post by Chris Angelico

Post by Mark Lawrence

I don't see how that can be more efficient than the naive
return True
return False
What am I missing?

If your sequence provides a more efficient index(), then __contains__
can take advantage of it. If not, it's a bit more indirection and the
same result.
ChrisA

The question was about the default sequence.__contains__, not mine or
any other sequence which may or may not provide a more efficient index().

Andrew Barnert

2014-07-21 02:18:44 UTC

Permalink

Post by Mark Lawrence

I don't see how that can be more efficient than the naive
return True
return False
What am I missing?

Consider a blist.sortedlist (http://stutzbachenterprises.com/blist/sortedlist.html), or any other such data structure built on a tree, skip list, etc.

The index method is O(log N), so Ram's __contains__ is also O(log N). But naively iterating is obviously O(N). (In fact, it could be worse—if you don't implement a custom __iter__, and your indexing is O(log N), then the naive __contains__ will be O(N log N)…)

Needless to say, blist.sortedlist implements a custom O(log N) __contains__, and so does (hopefully) every other such library on PyPI. But Ram's proposal would mean they no longer have to do so; they'll get O(log N) __contains__ for free just by implementing index.

Of course that only removes one method. For example, they still have to implement a custom count method or they'll get O(N) performance from the default version. If you look at the code for any of these types, __contains__ is a tiny percentage of the implementation. So, it's not a huge win. But it's a small one.

Mark Lawrence

2014-07-21 02:40:49 UTC

Permalink

Post by Andrew Barnert

Post by Mark Lawrence

I don't see how that can be more efficient than the naive
return True
return False
What am I missing?

Consider a blist.sortedlist (http://stutzbachenterprises.com/blist/sortedlist.html), or any other such data structure built on a tree, skip list, etc.
The index method is O(log N), so Ram's __contains__ is also O(log N). But naively iterating is obviously O(N). (In fact, it could be worse—if you don't implement a custom __iter__, and your indexing is O(log N), then the naive __contains__ will be O(N log N)…)
Needless to say, blist.sortedlist implements a custom O(log N) __contains__, and so does (hopefully) every other such library on PyPI. But Ram's proposal would mean they no longer have to do so; they'll get O(log N) __contains__ for free just by implementing index.
Of course that only removes one method. For example, they still have to implement a custom count method or they'll get O(N) performance from the default version. If you look at the code for any of these types, __contains__ is a tiny percentage of the implementation. So, it's not a huge win. But it's a small one.

What has blist.sortedlist, which IIRC is one of the data structures that
has been rejected as forming part of the standard library, got to do
with the default sequence.__contains__ ?

Andrew Barnert

2014-07-21 02:59:26 UTC

Permalink

Post by Andrew Barnert

On Sunday, July 20, 2014 7:06 PM, Mark Lawrence

Post by Mark Lawrence

Why does the default `Sequence.__contains__` iterate through the

items

Post by Mark Lawrence

rather than use `.index`, which may sometimes be more efficient?
try: self.index(i)
except ValueError: return False
else: return True
What do you think?

I don't see how that can be more efficient than the naive
return True
return False
What am I missing?

Consider a blist.sortedlist

(http://stutzbachenterprises.com/blist/sortedlist.html), or any other such data
structure built on a tree, skip list, etc.

The index method is O(log N), so Ram's __contains__ is also O(log N).

But naively iterating is obviously O(N). (In fact, it could be worse—if you
don't implement a custom __iter__, and your indexing is O(log N), then the
naive __contains__ will be O(N log N)…)

Needless to say, blist.sortedlist implements a custom O(log N)

__contains__, and so does (hopefully) every other such library on PyPI. But
Ram's proposal would mean they no longer have to do so; they'll get
O(log N) __contains__ for free just by implementing index.

Of course that only removes one method. For example, they still have to

implement a custom count method or they'll get O(N) performance from the
default version. If you look at the code for any of these types, __contains__ is
a tiny percentage of the implementation. So, it's not a huge win. But
it's a small one.
What has blist.sortedlist, which IIRC is one of the data structures that
has been rejected as forming part of the standard library, got to do
with the default sequence.__contains__ ?

I think you're missing the whole point here.

Sequence is an ABC—an Abstract Base Class—that's used (either by inheritance, or registration) by a wide variety of sequence classes—built-in, stdlib, or third-party.

Like most of the other ABCs in the Python stdlib, it's also usable as a mixin, providing default implementations for methods that you don't want to provide in terms of those that you do. Among the mixin methods it provides is __contains__, as documented at https://docs.python.org/dev/library/collections.abc.html#collections-abstract-base-classes and implemented at http://hg.python.org/cpython/file/default/Lib/_collections_abc.py#l629

I suspect the problem is that you parsed "the default Sequence.__contains__" wrong; Ram was referring to "the default implementation of __contains__ provided as Sequence.__contains__", but you thought he was referring to "the implementation of __contains__ in the default sequence", and whatever "the default sequence means" it obviously can't be a class from a third-party module, right?

Mark Lawrence

2014-07-21 17:26:47 UTC

Permalink

Post by Andrew Barnert

On Sunday, July 20, 2014 7:06 PM, Mark Lawrence

Post by Mark Lawrence

Post by Ram Rachum
Why does the default `Sequence.__contains__` iterate through the

items

Post by Mark Lawrence

Post by Ram Rachum
rather than use `.index`, which may sometimes be more efficient?
try: self.index(i)
except ValueError: return False
else: return True
What do you think?

I don't see how that can be more efficient than the naive
return True
return False
What am I missing?

Consider a blist.sortedlist

(http://stutzbachenterprises.com/blist/sortedlist.html), or any other such data
structure built on a tree, skip list, etc.

The index method is O(log N), so Ram's __contains__ is also O(log N).

But naively iterating is obviously O(N). (In fact, it could be worse—if you
don't implement a custom __iter__, and your indexing is O(log N), then the
naive __contains__ will be O(N log N)…)

Needless to say, blist.sortedlist implements a custom O(log N)

Of course that only removes one method. For example, they still have to

I think you're missing the whole point here.
Sequence is an ABC—an Abstract Base Class—that's used (either by inheritance, or registration) by a wide variety of sequence classes—built-in, stdlib, or third-party.
Like most of the other ABCs in the Python stdlib, it's also usable as a mixin, providing default implementations for methods that you don't want to provide in terms of those that you do. Among the mixin methods it provides is __contains__, as documented at https://docs.python.org/dev/library/collections.abc.html#collections-abstract-base-classes and implemented at http://hg.python.org/cpython/file/default/Lib/_collections_abc.py#l629
I suspect the problem is that you parsed "the default Sequence.__contains__" wrong; Ram was referring to "the default implementation of __contains__ provided as Sequence.__contains__", but you thought he was referring to "the implementation of __contains__ in the default sequence", and whatever "the default sequence means" it obviously can't be a class from a third-party module, right?

Thanks for the explanation and yes I did parse it incorrectly.
Strangely everything seems much clearer at 6PM rather than 3AM :)