Post by MRAB
Post by Raymond Hettinger
Post by Alexandre Conrad
What if str.split could take an empty separator?
I propose that the semantics of str.split() never be changed.
It has been around for a long time and has a complex set of behaviors
that people have come to rely on. For years, we've answered arcane
questions about it and have made multiple revisions to the docs in a
never-ending quest to describe precisely what it does without just
showing the underlying C code. Accordingly, existing uses depend
mainly on what-it-does-as-implemented and less on the various ways
it has been documented over the years.
Almost any change to str.split() would either complexify the explanation
of what it does or would change the behavior in a way that would
break somebody's code (perhaps in subtle ways that are hard to detect).
In my opinion, str.split() should never be touched again. Instead, it
may be worthwhile to develop new splitters with precise semantics aimed
at specific use cases.
Does it really have a complex set of behaviours? The only (possibly)
surprising behaviour for me is when it splits on whitespace (i.e., passing
it None as the separator). I find it very easy to understand. Or perhaps
I'm just smarter than I thought! :-)
Past bug reports and newsgroup discussions covered
a variety of misunderstandings:
* completely different algorithm when separator is None
* behavior when separator is multiple characters
(i.e. set of possible splitters vs an aggregate splitter
either with or without overlaps).
* behavior when maxsplit is zero
* behavior when string begins or ends with whitespace
* which characters count as whitespace
* behavior when a string begins or ends with a split character
* when runs of splitters are treated as a single splitter
* behavior of a zero-length splitter
* conditions under which x.join(s.split(x)) roundtrips
* algorithmic difference from re.split()
* are there invariants between s.count(x) and len(s.split(x))
so that you can correctly predict the number of fields returned
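
To make a few of the items above concrete, here is a small sketch of
checks that can be run in an interpreter. The helper names (roundtrips,
field_count_matches) are made up for illustration, and the results are
what I'd expect from current CPython 3 rather than a statement of what
any release guarantees.

```python
import re

# Separator None uses a different algorithm from an explicit separator:
# runs of whitespace collapse and leading/trailing whitespace is dropped.
assert " a  b ".split() == ["a", "b"]
assert " a  b ".split(" ") == ["", "a", "", "b", ""]

# A multi-character separator is one aggregate token, not a set of characters.
assert "a<>b<c".split("<>") == ["a", "b<c"]

# maxsplit of zero: no splitting, but the whitespace variant still
# strips leading whitespace from the single field it returns.
assert "a,b".split(",", 0) == ["a,b"]
assert "  a b  ".split(None, 0) == ["a b  "]

# The empty string gives different answers for the two algorithms.
assert "".split() == []
assert "".split(",") == [""]

# A zero-length separator is rejected rather than splitting into characters.
try:
    "abc".split("")
except ValueError:
    empty_separator_rejected = True
else:
    empty_separator_rejected = False

def roundtrips(s, x):
    """Hypothetical helper: does x.join(s.split(x)) give back s?"""
    return x.join(s.split(x)) == s

def field_count_matches(s, x):
    """Hypothetical helper: is the field count equal to s.count(x) + 1?"""
    return len(s.split(x)) == s.count(x) + 1

# With an explicit non-empty separator the roundtrip and the count
# invariant hold; with whitespace splitting the roundtrip generally fails.
assert roundtrips(",a,,b,", ",")
assert field_count_matches("aaa", "aa")
assert " ".join(" a  b ".split()) != " a  b "

# re.split() keeps the empty edge fields that whitespace splitting drops.
assert re.split(r"\s+", " a b ") == ["", "a", "b", ""]
```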
It was common that people thought str.split() was easy to understand
until a corner case arose that defied their expectations. When
the experts chimed in, it became clear that almost no one in
those discussions had a clear understanding of exactly what
the implemented behaviors were and it was common to resort
to experiment to disprove various incorrect hypotheses.
We revised the docs several times and added a number of
examples and now have a pretty good description that took
years to get right.
Even now, it might be a good idea to validate the docs by
seeing if someone can use the documentation text to write
a pure python version of str.split() that behaves exactly like the
real thing (including all corner cases).
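
As a rough starting point for that exercise, here is one possible
pure-Python attempt. It is only a sketch: the name split_sketch is made
up, it handles str only (not bytes), and it assumes that s[i].isspace()
agrees with the whitespace test used by the real str.split(). It has not
been checked against every corner case, which is rather the point of
the exercise.

```python
def split_sketch(s, sep=None, maxsplit=-1):
    """Hypothetical pure-Python approximation of str.split() for str only."""
    # A negative maxsplit means "no limit"; len(s) + 1 can never be reached.
    limit = maxsplit if maxsplit >= 0 else len(s) + 1
    fields = []

    if sep is not None:
        if sep == "":
            raise ValueError("empty separator")
        start = 0
        while limit > 0:
            idx = s.find(sep, start)
            if idx < 0:
                break
            fields.append(s[start:idx])
            start = idx + len(sep)
            limit -= 1
        # The remainder is always a field, even when it is empty.
        fields.append(s[start:])
        return fields

    # Whitespace algorithm: skip runs of whitespace, never emit empty fields.
    i, n = 0, len(s)
    while limit > 0:
        while i < n and s[i].isspace():
            i += 1
        if i == n:
            break
        j = i
        while i < n and not s[i].isspace():
            i += 1
        fields.append(s[j:i])
        limit -= 1
    # Reached with text left over only when the split limit was hit:
    # drop leading whitespace and keep the rest, trailing whitespace included.
    while i < n and s[i].isspace():
        i += 1
    if i < n:
        fields.append(s[i:])
    return fields
```

Comparing it against the built-in over a pile of awkward inputs (empty
strings, leading and trailing separators, overlapping separators, every
small maxsplit) is a quick way to find out where the sketch or one's
understanding is wrong.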
Even if you find all of the above to be easy and intuitive,
I still think it wise that we not add to the complexity of str.split()
with new or altered behaviors.
Raymond