Discussion:
str.split with empty separator
Alexandre Conrad
2010-07-29 10:41:35 UTC
Hello all,

What if str.split could take an empty separator?
>>> 'banana'.split('')
['b', 'a', 'n', 'a', 'n', 'a']
>>> list('banana')
['b', 'a', 'n', 'a', 'n', 'a']

I think that, semantically speaking, it would make sense to split where
there are no characters (in between them). Right now you can join from
an empty string:

''.join(['b', 'a', 'n', 'a', 'n', 'a'])

So why can't we split from an empty string?

This wouldn't introduce any backwards-incompatible changes, as an
empty separator currently raises an error:
>>> 'banana'.split('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: empty separator
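
For reference, a minimal sketch of how things stand today: an empty separator raises ValueError, while list() already gives the per-character result and ''.join() reverses it. The variable names are only illustrative.

```python
word = 'banana'

# str.split rejects an empty separator outright.
try:
    word.split('')
except ValueError as exc:
    error_message = str(exc)   # 'empty separator'

# list() already yields one item per character...
chars = list(word)

# ...and joining on the empty string puts the word back together.
rebuilt = ''.join(chars)
print(error_message, chars, rebuilt)
```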

I would love to see my banana actually split. :)

Regards,
--
Alex
twitter.com/alexconrad
MRAB
2010-07-29 16:28:46 UTC
Post by Alexandre Conrad
Hello all,
What if str.split could take an empty separator?
'banana'.split('')
['b', 'a', 'n', 'a', 'n', 'a']
>>> 'banana'.split('')
['', 'b', 'a', 'n', 'a', 'n', 'a', '']
>>> 'banana'.startswith('')
True
>>> 'banana'.endswith('')
True
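
The reasoning behind that result can be checked against separators that are not empty. A small sketch (the example strings are made up for illustration): when a string starts or ends with the separator, str.split reports an empty field at that edge, and every string starts and ends with ''.

```python
# A separator at either edge produces an empty field at that edge.
edge_fields = 'xbananax'.split('x')      # ['', 'banana', '']

# The empty string is found at both edges of any string.
at_start = 'banana'.startswith('')       # True
at_end = 'banana'.endswith('')           # True

# By analogy, an empty separator would yield empty fields at both ends.
by_analogy = [''] + list('banana') + ['']
print(edge_fields, at_start, at_end, by_analogy)
```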
Alexandre Conrad
2010-07-30 08:28:10 UTC
Post by MRAB
'banana'.split('')
['', 'b', 'a', 'n', 'a', 'n', 'a', '']
Humm... I believe that it may be correct. It's not what I was
expecting, but it does look accurate.
--
Alex
twitter.com/alexconrad
Greg Ewing
2010-07-30 00:33:40 UTC
Post by Alexandre Conrad
What if str.split could take an empty separator?
Do you have a use case for this?
Post by Alexandre Conrad
Right now you can join from an empty string...
So why can't we split from an empty string?
Because splitting on an empty string is ambiguous,
and nobody has so far put forward a compelling use
case that would show how the ambiguity should best
be resolved.
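
To make the ambiguity concrete, here is a sketch of the two results proposed so far in this thread. The function name split_empty and its keep_edges flag are hypothetical, not part of any Python API. Joining on '' gives back the original string either way, so the join round trip cannot settle which one is right.

```python
def split_empty(s, keep_edges):
    """Hypothetical helper: two candidate meanings for s.split('')."""
    chars = list(s)
    if keep_edges:
        # Treat the empty matches at both ends as separators too.
        return [''] + chars + ['']
    # Split only between characters.
    return chars

inner_only = split_empty('banana', keep_edges=False)
with_edges = split_empty('banana', keep_edges=True)

# Both candidates survive the round trip through ''.join().
assert ''.join(inner_only) == ''.join(with_edges) == 'banana'
print(inner_only)
print(with_edges)
```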
--
Greg
Raymond Hettinger
2010-07-30 02:16:50 UTC
Post by Alexandre Conrad
What if str.split could take an empty separator?
I propose that the semantics of str.split() never be changed.

It has been around for a long time and has a complex set of behaviors
that people have come to rely on. For years, we've answered arcane
questions about it and have made multiple revisions to the docs in a
never-ending quest to precisely describe exactly what it does without
just showing the underlying C code. Accordingly, existing uses depend
mainly on what-it-does-as-implemented and less on the various ways
it has been documented over the years.

Almost any change to str.split() would either complexify the explanation
of what it does or would change the behavior in a way that would
break somebody's code (perhaps in subtle ways that are hard to detect).

In my opinion, str.split() should never be touched again.
Instead, it may be worthwhile to develop new splitters
with precise semantics aimed at specific use cases.


Raymond
MRAB
2010-07-30 02:41:35 UTC
Post by Raymond Hettinger
Post by Alexandre Conrad
What if str.split could take an empty separator?
I propose that the semantics of str.split() never be changed.
It has been around for a long time and has a complex set of behaviors
that people have come to rely on. For years, we've answered arcane
questions about it and have made multiple revisions to the docs in a
never-ending quest to precisely describe exactly what it does without
just showing the underlying C code. Accordingly, existing uses depend
mainly on what-it-does-as-implemented and less on the various ways
it has been documented over the years.
Almost any change to str.split() would either complexify the explanation
of what it does or would change the behavior in a way that would
break somebody's code (perhaps in subtle ways that are hard to detect).
In my opinion, str.split() should never be touched again.
Instead, it may be worthwhile to develop new splitters
with precise semantics aimed at specific use cases.
Does it really have a complex set of behaviours? The only (possibly)
surprising behaviour for me is when it splits on whitespace (i.e., passing
it None as the separator). I find it very easy to understand. Or perhaps
I'm just smarter than I thought! :-)
Greg Ewing
2010-07-30 04:27:44 UTC
Post by MRAB
Does it really have a complex set of behaviours?
I think Raymond may be referring to the fact that the
behaviour of split() with and without a splitting string
differs in subtle ways with certain edge cases. It's
almost better thought of as two different functions
that happen to share a name.
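
A few cases where the two modes disagree, as a quick sketch (the sample strings are arbitrary):

```python
text = '  a  b '

# No separator (or None): runs of whitespace act as one splitter,
# and leading/trailing whitespace produces no empty fields.
no_sep = text.split()            # ['a', 'b']

# Explicit separator: every single occurrence splits,
# so empty fields appear between and around them.
with_sep = text.split(' ')       # ['', '', 'a', '', 'b', '']

# The empty string shows the difference most starkly.
empty_no_sep = ''.split()        # []
empty_with_sep = ''.split(' ')   # ['']
print(no_sep, with_sep, empty_no_sep, empty_with_sep)
```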
--
Greg
Raymond Hettinger
2010-07-30 04:51:24 UTC
Post by MRAB
Does it really have a complex set of behaviours? The only (possibly)
surprising behaviour for me is when it splits on whitespace (i.e., passing
it None as the separator). I find it very easy to understand. Or perhaps
I'm just smarter than I thought! :-)
Past bug reports and newsgroup discussions covered
a variety of misunderstandings:

* completely different algorithm when separator is None
* behavior when separator is multiple characters
(i.e. set of possible splitters vs an aggregate splitter
either with or without overlaps).
* behavior when maxsplit is zero
* behavior when string begins or ends with whitespace
* which characters count as whitespace
* behavior when a string begins or ends with a split character
* when runs of splitters are treated as a single splitter
* behavior of a zero-length splitter
* conditions under which x.join(s.split(x)) roundtrips
* algorithmic difference from re.split()
* are there invariants between s.count(x) and len(s.split(x))
so that you can correctly predict the number of fields returned
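
Several of the items above can be probed directly at the interpreter. A sketch with arbitrary sample strings; the comments record what CPython's str.split returns for these inputs:

```python
# Multi-character separator: matched as one aggregate string,
# not as a set of characters.
aggregate = 'a,b, c'.split(', ')          # ['a,b', 'c']

# Overlapping candidates are consumed left to right without overlap.
overlaps = 'aaa'.split('aa')              # ['', 'a']

# maxsplit=0 behaves differently in the two modes.
zero_with_sep = ' a b '.split(' ', 0)     # [' a b ']
zero_no_sep = ' a b '.split(None, 0)      # ['a b ']

# Round trip: holds for an explicit separator, not for whitespace mode.
s = 'a  b'
roundtrip_sep = ' '.join(s.split(' ')) == s    # True
roundtrip_none = ' '.join(s.split()) == s      # False

# Field count for an explicit, non-empty separator.
count_invariant = len('aaa'.split('aa')) == 'aaa'.count('aa') + 1   # True
print(aggregate, overlaps, zero_with_sep, zero_no_sep)
```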

It was common that people thought str.split() was easy to understand
until a corner case arose that defied their expectations. When
the experts chimed in, it became clear that almost no one in
those discussions had a clear understanding of exactly what
the implemented behaviors were and it was common to resort
to experiment to disprove various incorrect hypotheses.
We revised the docs several times and added a number of
examples and now have a pretty good description that took
years to get right.

Even now, it might be a good idea to validate the docs by
seeing if someone can use the documentation text to write
a pure Python version of str.split() that behaves exactly like the
real thing (including all corner cases).
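
As a starting point for that exercise, here is one possible pure-Python model, offered as a sketch rather than a reference implementation. The names py_split and _split_whitespace are hypothetical, it handles str only (not bytes), it does no type checking of sep, and it assumes that str.isspace() agrees with what str.split() treats as whitespace.

```python
def _split_whitespace(s, maxsplit):
    """Whitespace mode: runs of whitespace separate the fields."""
    parts = []
    i, n = 0, len(s)
    while maxsplit > 0:
        # Skip the whitespace in front of the next field.
        while i < n and s[i].isspace():
            i += 1
        if i == n:
            break
        start = i
        while i < n and not s[i].isspace():
            i += 1
        parts.append(s[start:i])
        maxsplit -= 1
    # Splits used up: drop leading whitespace, keep the rest verbatim.
    while i < n and s[i].isspace():
        i += 1
    if i < n:
        parts.append(s[i:])
    return parts


def py_split(s, sep=None, maxsplit=-1):
    """Hypothetical pure-Python model of str.split for str inputs."""
    if maxsplit < 0:
        maxsplit = len(s)   # a string never needs more splits than this
    if sep is None:
        return _split_whitespace(s, maxsplit)
    if sep == '':
        raise ValueError('empty separator')
    parts = []
    start = 0
    while maxsplit > 0:
        # Occurrences are taken left to right, without overlap.
        idx = s.find(sep, start)
        if idx < 0:
            break
        parts.append(s[start:idx])
        start = idx + len(sep)
        maxsplit -= 1
    parts.append(s[start:])   # the remainder is always one more field
    return parts
```

Comparing py_split(s, sep, maxsplit) against s.split(sep, maxsplit) over a grid of awkward inputs (empty strings, leading and trailing separators, maxsplit of 0) is a quick way to find where such a model and the real thing part ways.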

Even if you find all of the above to be easy and intuitive,
I still think it wise that we not add to the complexity of str.split()
with new or altered behaviors.


Raymond