Discussion:
Iterating non-newline-separated files should be easier
Andrew Barnert
2014-07-17 19:53:05 UTC
tl;dr: readline and friends should take an optional sep parameter (which also means adding an iterlines method).

Recently, I was trying to add -0 support to a command-line tool, which means that it reads filenames out of stdin and/or a text file with \0 separators instead of \n.

This means that my code that looked like this:

    with open(path, encoding=sys.getfilesystemencoding()) as f:
        for filename in f:
            do_stuff(filename)

… turned into this (from memory, not the exact code):

    def resplit(chunks, sep):
        buf = b''
        for chunk in chunks:
            parts = (buf+chunk).split(sep)

            yield from parts[:-1]
            buf = parts[-1]
        if buf:
            yield buf

    with open(path, 'rb') as f:
        chunks = iter(lambda: f.read(4096), b'')
        for line in resplit(chunks, b'\0'):
            filename = line.decode(sys.getfilesystemencoding())
            do_stuff(filename)

Besides being a lot more code (and involving things that a novice might have problems reading, like that two-argument iter), this also means that the file pointer is way ahead of the line that's just been iterated, that I'm inefficiently buffering everything twice, etc.
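As an aside, the two-argument form of iter mentioned above works like this (a minimal illustration, not from the original post): it calls the callable repeatedly until the sentinel value is returned.

```python
import io

f = io.BytesIO(b'abcdefgh')
# iter(callable, sentinel) calls the callable repeatedly until it
# returns the sentinel -- here, until read() returns b'' at EOF.
chunks = list(iter(lambda: f.read(3), b''))
print(chunks)  # [b'abc', b'def', b'gh']
```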

The problem is that readline is hardcoded to look for b'\n' in binary files and the smart universal-newline machinery in text files; there's no way to reuse that machinery if you want to look for something different, and no way to access the internals it uses if you want to reimplement it.

While it might be possible to fix the latter problems in some generic and flexible way, that doesn't seem all that useful; really, other than changing the way readline splits, I don't think anyone wants to hook anything else about file objects. (On the other hand, people might want to hook it in more complex ways—e.g., pass a separator function instead of a separator string? I'm probably reaching there…)

If I'm right, all that's needed is an extra sep=None keyword-only parameter to readline and friends (where None means the existing newline behavior), along with an iterlines method that's identical to __iter__ except that it has room for that new parameter.

One minor side problem: Sometimes you don't actually have a file, but some kind of file-like object. I realize that as of 3.1 or so, this is supposed to mean it actually is an io.BufferedIOBase or the like, but there are still plenty of third-party modules that just demand and/or provide "something with read(size)" or similar. In fact, that's the case with the problem I ran into above; another feature uses a third-party module to provide file-like objects for members of all kinds of uncommon archive types, and unlike zipfile, that module wasn't changed to provide io subclasses when it was ported to 3.x. So, it might be worth having adapters that make it easier (or just possible…) to wrap such a thing in the actual io interfaces. (The existing wrappers aren't adapters—BufferedReader demands readinto(buf), not read(size); TextIOWrapper can only wrap a BufferedIOBase.) But that's really a separate issue (and the answer to that one may just be to hold firm with "file-like object means IOBase" and eventually every library you care about will work that way, even if you occasionally have to fix it yourself).
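One way such an adapter could look (a hypothetical sketch, not an existing stdlib class): wrap anything exposing read(size) in a RawIOBase subclass, so BufferedReader and TextIOWrapper can layer on top of it.

```python
import io

class ReadAdapter(io.RawIOBase):
    """Hypothetical adapter: wrap any object exposing read(size)
    so it satisfies the raw-stream interface (readinto)."""
    def __init__(self, fileobj):
        self._f = fileobj

    def readable(self):
        return True

    def readinto(self, b):
        # Fill the caller's buffer from the wrapped object's read().
        data = self._f.read(len(b))
        n = len(data)
        b[:n] = data
        return n

# A duck-typed "file-like object" that only has read(size):
class OnlyRead:
    def __init__(self, data):
        self._b = io.BytesIO(data)
    def read(self, size=-1):
        return self._b.read(size)

buffered = io.BufferedReader(ReadAdapter(OnlyRead(b'spam\neggs\n')))
```

io.TextIOWrapper(buffered) would then add text decoding on top, which is exactly what such third-party objects otherwise can't participate in.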
Guido van Rossum
2014-07-17 20:48:28 UTC
I think it's fine to add something to stdlib that encapsulates your
example. (TBD: where?)

I don't think it is reasonable to add a new parameter to readline(),
because streams are widely implemented using duck typing -- every
implementation would have to be updated to support this.


On Thu, Jul 17, 2014 at 12:53 PM, Andrew Barnert <
abarnert-/***@public.gmane.org> wrote:

> tl;dr: readline and friends should take an optional sep parameter (which
> also means adding an iterlines method).
>
> Recently, I was trying to add -0 support to a command-line tool, which
> means that it reads filenames out of stdin and/or a text file with \0
> separators instead of \n.
>
> This means that my code that looked like this:
>
>     with open(path, encoding=sys.getfilesystemencoding()) as f:
>         for filename in f:
>             do_stuff(filename)
>
> … turned into this (from memory, not the exact code):
>
>     def resplit(chunks, sep):
>         buf = b''
>         for chunk in chunks:
>             parts = (buf+chunk).split(sep)
>
>             yield from parts[:-1]
>             buf = parts[-1]
>         if buf:
>             yield buf
>
>     with open(path, 'rb') as f:
>         chunks = iter(lambda: f.read(4096), b'')
>         for line in resplit(chunks, b'\0'):
>             filename = line.decode(sys.getfilesystemencoding())
>             do_stuff(filename)
>
> Besides being a lot more code (and involving things that a novice might
> have problems reading like that two-argument iter), this also means that
> the file pointer is way ahead of the line that's just been iterated, I'm
> inefficiently buffering everything twice, etc.
>
> The problem is that readline is hardcoded to look for b'\n' for binary
> files, smart-universal-newline-thingy for text files, there's no way to
> reuse its machinery if you want to look for something different, and
> there's no way to access the internals that it uses if you want to
> reimplement it.
>
> While it might be possible to fix the latter problems in some generic and
> flexible way, that doesn't seem all that useful; really, other than
> changing the way readline splits, I don't think anyone wants to hook
> anything else about file objects. (On the other hand, people might want to
> hook it in more complex ways—e.g., pass a separator function instead of a
> separator string? I'm probably reaching there…)
>
> If I'm right, all that's needed is an extra sep=None keyword-only
> parameter to readline and friends (where None means the existing newline
> behavior), along with an iterlines method that's identical to __iter__
> except that it has room for that new parameter.
>
> One minor side problem: Sometimes you don't actually have a file, but some
> kind of file-like object. I realize that as 3.1 or so, this is supposed to
> mean it actually is an io.BufferedIOBase or etc., but there are still
> plenty of third-party modules that just demand and/or provide "something
> with read(size)" or the like. In fact, that's the case with the problem I
> ran into above; another feature uses a third-party module to provide
> file-like objects for members of all kinds of uncommon archive types, and
> unlike zipfile, that module wasn't changed to provide io subclasses when it
> was ported to 3.x. So, it might be worth having adapters that make it
> easier (or just possible…) to wrap such a thing in the actual io
> interfaces. (The existing wrappers aren't adapters—BufferedReader demands
> readinto(buf), not read(size); TextIOWrapper can only wrap a
> BufferedIOBase.) But that's really a separate issue (and the answer to that
> one may just be to hold firm
> with the "file-like object means IOBase" and eventually every library you
> care about will work that way, even if you occasionally have to fix it
> yourself).
> _______________________________________________
> Python-ideas mailing list
> Python-ideas-+ZN9ApsXKcEdnm+***@public.gmane.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/




--
--Guido van Rossum (python.org/~guido)
Alexander Heger
2014-07-17 21:39:42 UTC
> I don't think it is reasonable to add a new parameter to readline(), because
> streams are widely implemented using duck typing -- every implementation
> would have to be updated to support this.

Could the "split" (or splitline) keyword-only parameter instead be
passed to the open function (and the __init__ of IOBase and be stored
there)?
Andrew Barnert
2014-07-17 22:21:25 UTC
> On Thursday, July 17, 2014 2:40 PM, Alexander Heger <***@2sn.net> wrote:

>> I don't think it is reasonable to add a new parameter to readline(), because
>> streams are widely implemented using duck typing -- every implementation
>> would have to be updated to support this.
>
> Could the "split" (or splitline) keyword-only parameter instead be
> passed to the open function (and the __init__ of IOBase and be stored
> there)?


Good idea. It's less powerful/flexible, but probably good enough for almost all use cases. (I can't think of any file where I'd need to split part of it on \0 and the rest on \n…) Also, it means you can stick with the normal __iter__ instead of needing a separate iterlines method.

And, since open/__init__/etc. isn't part of the protocol, it's perfectly fine for the builtin open, etc., to be an example or template that's generally worth following if there's no good reason not to do so, rather than a requirement that must be followed. So, if I'm getting file-like objects handed to me by some third-party library or plugin API or whatever, and I need them to be \0-separated, in many cases the problems with resplit won't be an issue so I can just use it as a workaround, and in the remaining cases, I can request that the library/app/whatever add the sep parameter to the next iteration of the API.

So, I retract my original suggestion in favor of this one. And, separately, Guido's idea of adding the helpers (or at least resplit, plus documentation on how to write the other stuff) to the stdlib somewhere.

Thanks.
Andrew Barnert
2014-07-18 00:04:00 UTC
On Thursday, July 17, 2014 3:21 PM, Andrew Barnert <***@yahoo.com> wrote:



>  On Thursday, July 17, 2014 2:40 PM, Alexander Heger <***@2sn.net> wrote:

>>  Could the "split" (or splitline) keyword-only
>> parameter instead be passed to the open function 
>> (and the __init__ of IOBase and be stored there)?
>
> Good idea. It's less powerful/flexible, but probably
> good enough for almost all use cases. (I can't think
> of any file where I'd need to split part of it on \0
> and the rest on \n…) Also, it means you can stick with
> the normal __iter__ instead of needing a separate
> iterlines method.

It turns out to be even simpler than I expected.

I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO.

For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it.

(Of course you'd also want to add it to all of the stdlib cases like zipfile.ZipFile.open/zipfile.ZipExtFile.__init__, but there aren't too many of those.)

This means that the buffer underlying a text file with a non-standard newline doesn't automatically have a matching newline. I think that's a good thing ('\r\n' and '\r' would need exceptions for backward compatibility; '\0'.encode('utf-16-le') isn't a very useful thing to split on; etc.), but doing it the other way is almost as easy, and very little code will ever care.
Steven D'Aprano
2014-07-18 03:21:00 UTC
On Thu, Jul 17, 2014 at 05:04:00PM -0700, Andrew Barnert wrote:

> It turns out to be even simpler than I expected.
>
> I reused the "newline" parameter of open and TextIOWrapper.__init__,
> adding a param of the same name to the constructors for
> BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and
> FileIO.
>
> For text files, just remove the check for newline being one of the
> standard values and it all works. For binary files, remove the check
> for truthy, make open pass each Buffered* constructor newline=(newline
> if binary else None), make each Buffered* class store it, and change
> two lines in RawIOBase.readline to use it. And that's it.

All the words are in English, but I have no idea what you're actually
saying... :-)

You seem to be talking about the implementation of the change, but what
is the interface? Having made all these changes, how does it affect
Python code? You have a use-case of splitting on something other than
the standard newlines, so how does one do that? E.g. suppose I have a
file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
character. How would I iterate over lines in this file?


> This means that the buffer underlying a text file with a non-standard
> newline doesn't automatically have a matching newline.

I don't understand what you mean by this.



--
Steven
Chris Angelico
2014-07-18 03:36:17 UTC
On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:
> On Thu, Jul 17, 2014 at 05:04:00PM -0700, Andrew Barnert wrote:
>
>> It turns out to be even simpler than I expected.
>>
>> I reused the "newline" parameter of open and TextIOWrapper.__init__,
>> adding a param of the same name to the constructors for
>> BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and
>> FileIO.
>>
>> For text files, just remove the check for newline being one of the
>> standard values and it all works. For binary files, remove the check
>> for truthy, make open pass each Buffered* constructor newline=(newline
>> if binary else None), make each Buffered* class store it, and change
>> two lines in RawIOBase.readline to use it. And that's it.
>
> All the words are in English, but I have no idea what you're actually
> saying... :-)
>
> You seem to be talking about the implementation of the change, but what
> is the interface? Having made all these changes, how does it effect
> Python code? You have a use-case of splitting on something other than
> the standard newlines, so how does one do that? E.g. suppose I have a
> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
> character. How would I iterate over lines in this file?

The way I understand it is this:

for line in open("spam.txt", newline="\u0085"):
    process(line)

If that's the case, I would be strongly in favour of this. Nice and
clean, and should break nothing; there'll be special cases for
newline=None and newline='', and the only change is that, instead of a
small number of permitted values ('\n', '\r', '\r\n'), any string (or
maybe any one-character string plus '\r\n'?) would be permitted.

Effectively, it's not "iterate over this file, divided by \0 instead
of newlines", but it's "this file uses the unusual encoding of
newline=\0, now iterate over lines in the file". Seems a smart way to
do it IMO.

ChrisA
Andrew Barnert
2014-07-18 04:23:05 UTC
On Jul 17, 2014, at 20:36, Chris Angelico <rosuav-***@public.gmane.org> wrote:

> On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:
>> You seem to be talking about the implementation of the change, but what
>> is the interface? Having made all these changes, how does it effect
>> Python code? You have a use-case of splitting on something other than
>> the standard newlines, so how does one do that? E.g. suppose I have a
>> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
>> character. How would I iterate over lines in this file?
>
> The way I understand it is this:
>
> > for line in open("spam.txt", newline="\u0085"):
> >     process(line)
>
> If that's the case, I would be strongly in favour of this. Nice and
> clean, and should break nothing; there'll be special cases for
> newline=None and newline='', and the only change is that, instead of a
> small number of permitted values ('\n', '\r', '\r\n'), any string (or
> maybe any one-character string plus '\r\n'?) would be permitted.
>
> Effectively, it's not "iterate over this file, divided by \0 instead
> of newlines", but it's "this file uses the unusual encoding of
> newline=\0, now iterate over lines in the file". Seems a smart way to
> do it IMO.

Exactly. As soon as Alexander suggested it, I immediately knew it was much better than my original idea.

(Apologies for overestimating the obviousness of that.)
Guido van Rossum
2014-07-18 04:47:06 UTC
Well, I had to look up the newline option for open(), even though I
probably invented it. :-)

Would it still apply only to text files?

On Thursday, July 17, 2014, Andrew Barnert <abarnert-/***@public.gmane.org>
wrote:

> On Jul 17, 2014, at 20:36, Chris Angelico <rosuav-***@public.gmane.org>
> wrote:
>
> > On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:
> >> You seem to be talking about the implementation of the change, but what
> >> is the interface? Having made all these changes, how does it effect
> >> Python code? You have a use-case of splitting on something other than
> >> the standard newlines, so how does one do that? E.g. suppose I have a
> >> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
> >> character. How would I iterate over lines in this file?
> >
> > The way I understand it is this:
> >
> > for line in open("spam.txt", newline="\u0085"):
> >     process(line)
> >
> > If that's the case, I would be strongly in favour of this. Nice and
> > clean, and should break nothing; there'll be special cases for
> > newline=None and newline='', and the only change is that, instead of a
> > small number of permitted values ('\n', '\r', '\r\n'), any string (or
> > maybe any one-character string plus '\r\n'?) would be permitted.
> >
> > Effectively, it's not "iterate over this file, divided by \0 instead
> > of newlines", but it's "this file uses the unusual encoding of
> > newline=\0, now iterate over lines in the file". Seems a smart way to
> > do it IMO.
>
> Exactly. As soon as Alexander suggested it, I immediately knew it was much
> better than my original idea.
>
> (Apologies for overestimating the obviousness of that.)
>
>
>


--
--Guido van Rossum (on iPad)
Andrew Barnert
2014-07-18 06:26:28 UTC
On Jul 17, 2014, at 21:47, Guido van Rossum <guido-+ZN9ApsXKcEdnm+***@public.gmane.org> wrote:

> Well, I had to look up the newline option for open(), even though I probably invented it. :-)

While we're at it, I think most places in the documentation and docstrings that refer to the parameter, except open itself, call it newlines (e.g., io.IOBase.readline), and as far as I can tell it's been like that from day one, which shows just how much people pay attention to the current feature. :)

> Would it still apply only to text files?

I think it makes sense to apply to binary files as well. Splitting binary files on \0 (or, for that matter, \r\n...) is probably at least as common a use case as text files.

Obviously the special treatment for "" (as a universal-newline-behavior flag) wouldn't carry over to b"" (which might as well just be an error, although I suppose it could also mean to split on every byte, as with bytes.split?). Also, I'm not sure if the write behavior (replace terminal "\n" with newline) should carry over from text to binary, or just ignore newline on write.

> On Thursday, July 17, 2014, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
>> On Jul 17, 2014, at 20:36, Chris Angelico <rosuav-***@public.gmane.org> wrote:
>>
>> > On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:
>> >> You seem to be talking about the implementation of the change, but what
>> >> is the interface? Having made all these changes, how does it effect
>> >> Python code? You have a use-case of splitting on something other than
>> >> the standard newlines, so how does one do that? E.g. suppose I have a
>> >> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
>> >> character. How would I iterate over lines in this file?
>> >
>> > The way I understand it is this:
>> >
>> > for line in open("spam.txt", newline="\u0085"):
>> >     process(line)
>> >
>> > If that's the case, I would be strongly in favour of this. Nice and
>> > clean, and should break nothing; there'll be special cases for
>> > newline=None and newline='', and the only change is that, instead of a
>> > small number of permitted values ('\n', '\r', '\r\n'), any string (or
>> > maybe any one-character string plus '\r\n'?) would be permitted.
>> >
>> > Effectively, it's not "iterate over this file, divided by \0 instead
>> > of newlines", but it's "this file uses the unusual encoding of
>> > newline=\0, now iterate over lines in the file". Seems a smart way to
>> > do it IMO.
>>
>> Exactly. As soon as Alexander suggested it, I immediately knew it was much better than my original idea.
>>
>> (Apologies for overestimating the obviousness of that.)
>>
>>
>
>
> --
> --Guido van Rossum (on iPad)
Andrew Barnert
2014-07-18 04:18:08 UTC
On Jul 17, 2014, at 20:21, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:

> On Thu, Jul 17, 2014 at 05:04:00PM -0700, Andrew Barnert wrote:
>
>> It turns out to be even simpler than I expected.
>>
>> I reused the "newline" parameter of open and TextIOWrapper.__init__,
>> adding a param of the same name to the constructors for
>> BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and
>> FileIO.
>>
>> For text files, just remove the check for newline being one of the
>> standard values and it all works. For binary files, remove the check
>> for truthy, make open pass each Buffered* constructor newline=(newline
>> if binary else None), make each Buffered* class store it, and change
>> two lines in RawIOBase.readline to use it. And that's it.
>
> All the words are in English, but I have no idea what you're actually
> saying... :-)
>
> You seem to be talking about the implementation of the change, but what
> is the interface?

"I reused the newline parameter."

My mistake was assuming that was so simple, nothing else needed to be said. But that only works if everyone went back and completely read the previous suggestions, which I realize nobody had any good reason to do.

Basically, the only change to the API is that it's no longer an error to pass arbitrary strings (or bytes, for binary mode) for newlines. The rules for how "\0" is handled are identical to the rules for "\r". There's almost nothing else to explain, but not quite--so, like an idiot, I dove into the minor nits in detail, skipping over the main point.

> Having made all these changes, how does it effect
> Python code?

Existing legal code does not change at all. Some code that used to be an error now does something useful (see below).

> You have a use-case of splitting on something other than
> the standard newlines, so how does one do that? E.g. suppose I have a
> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
> character. How would I iterate over lines in this file?

with open("spam.txt", newline="\u0085") as f:
    for line in f:
        process(line)

>> This means that the buffer underlying a text file with a non-standard
>> newline doesn't automatically have a matching newline.
>
> I don't understand what you mean by this.

If you write this:

with open("spam.txt", newline="\u0085") as f:
    for line in f.buffer:
        ...

The bytes you get back will be split on b"\n", not on "\u0085".encode(locale.getpreferredencoding()). The newline parameter applies only to the text file, not its underlying binary buffer. (This is exactly the same as the current behavior--if you open a file with newline='\r' in 3.4 then iterate f.buffer, it's still going to split on b'\n', not b'\r'.)
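That existing 3.4 behavior can be demonstrated directly (a sketch using a throwaway temp file, not code from the thread):

```python
import os
import tempfile

# Write a file whose "lines" end in '\r' rather than '\n'.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'w', newline='') as f:
    f.write('a\rb\rc')

# The text layer honors newline='\r' ...
with open(path, newline='\r') as f:
    text_lines = list(f)        # ['a\r', 'b\r', 'c']

# ... but the underlying binary buffer still splits on b'\n',
# so the whole file comes back as one "line".
with open(path, newline='\r') as f:
    raw_lines = list(f.buffer)  # [b'a\rb\rc']

os.remove(path)
```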
Wolfgang Maier
2014-07-18 11:53:48 UTC
On 07/18/2014 02:04 AM, Andrew Barnert wrote:
> On Thursday, July 17, 2014 3:21 PM, Andrew Barnert <***@yahoo.com> wrote:
>
>
>
>> On Thursday, July 17, 2014 2:40 PM, Alexander Heger <***@2sn.net> wrote:
>
>>> Could the "split" (or splitline) keyword-only
>>> parameter instead be passed to the open function
>>> (and the __init__ of IOBase and be stored there)?
>>
>> Good idea. It's less powerful/flexible, but probably
>> good enough for almost all use cases. (I can't think
>> of any file where I'd need to split part of it on \0
>> and the rest on \n…) Also, it means you can stick with
>> the normal __iter__ instead of needing a separate
>> iterlines method.
>
> It turns out to be even simpler than I expected.
>
> I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO.
>
> For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it.
>

You are not the first one to come up with this idea and suggest
solutions. This whole thing has been hanging around on the bug tracker
as an unresolved issue (started by Nick Coghlan) for almost a decade:

http://bugs.python.org/issue1152248

Ever since discovering it, I've been sticking to the recipe provided by
Douglas Alan:

http://bugs.python.org/issue1152248#msg109117

Not that I wouldn't like to see this feature shipping with Python,
but it may help to read through all aspects of the problem that have
been discussed before.

Best,
Wolfgang
Andrew Barnert
2014-07-18 16:43:26 UTC
Before responding to Wolfgang, something that occurred to me overnight: The only insurmountable problem with Guido's suggestion of "just unwrap and rewrap the raw or buffer in a subclass that adds this behavior" is that you can't write such a subclass of TextIOWrapper, because it has no way to either peek at or push back onto the buffer. So... Why not add one of those?

Pushing back is easier to implement (since it's already there as a private method), but a bit funky, and peeking would mean it works the same way as with buffered binary files. But I'll take a look at the idiomatic way to do similar things in other languages (C stdio, C++ iostreams, etc.), and make sure that peek is actually sensible for TextIOWrapper, before arguing for it.

While we're at it, it might be nice for the peek method to be documented as an (optional, like raw, etc.?) member of the two ABCs instead of just something that one implementation happens to have, and that the mixin code will use if it happens to be present. (Binary readline uses peek if it exists, falls back to byte by byte if not.)
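As a sketch of how peek could support custom separators at the buffered layer, here is a hypothetical read_until helper (not a stdlib method; single-byte separators only, since a multi-byte separator could straddle what peek returns):

```python
import io

def read_until(buf, sep=b'\0'):
    """Hypothetical helper: consume and return one sep-terminated
    record from a BufferedReader, using peek() to avoid overreading."""
    out = bytearray()
    while True:
        chunk = buf.peek(1)                # buffered bytes, not yet consumed
        if not chunk:
            return bytes(out)              # EOF: return whatever is left
        i = chunk.find(sep)
        if i >= 0:
            out += buf.read(i + len(sep))  # consume up to and including sep
            return bytes(out)
        out += buf.read(len(chunk))        # consume what we already inspected

buf = io.BufferedReader(io.BytesIO(b'a\0bc\0d'))
```

Because peek never advances the stream, the file position stays exactly at the end of the returned record, which is the property the resplit workaround lacks.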

On Jul 18, 2014, at 4:53, Wolfgang Maier <***@biologie.uni-freiburg.de> wrote:

> On 07/18/2014 02:04 AM, Andrew Barnert wrote:
>> On Thursday, July 17, 2014 3:21 PM, Andrew Barnert <***@yahoo.com> wrote:
>>
>>
>>
>>> On Thursday, July 17, 2014 2:40 PM, Alexander Heger <***@2sn.net> wrote:
>>
>>>> Could the "split" (or splitline) keyword-only
>>>> parameter instead be passed to the open function
>>>> (and the __init__ of IOBase and be stored there)?
>>>
>>> Good idea. It's less powerful/flexible, but probably
>>> good enough for almost all use cases. (I can't think
>>> of any file where I'd need to split part of it on \0
>>> and the rest on \n…) Also, it means you can stick with
>>> the normal __iter__ instead of needing a separate
>>> iterlines method.
>>
>> It turns out to be even simpler than I expected.
>>
>> I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO.
>>
>> For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it.
>
> You are not the first one to come up with this idea and suggesting solutions. This whole thing has been hanging around on the bug tracker as an unresolved issue (started by Nick Coghlan) since almost a decade:
>
> http://bugs.python.org/issue1152248
>
> Ever since discovering it, I've been sticking to the recipe provided by Douglas Alan:
>
> http://bugs.python.org/issue1152248#msg109117

Thanks.

Douglas's recipe is effectively the same as my resplit, except less general (since it consumes a file rather than any iterable), and some, but not all, of the limitations of that approach were mentioned. And R. David Murray's hack patch is basically the same as the text half of my patch.

The discussion there is also useful, as it raises the similar features in perl, awk, bash, etc.--all of which work by having the user change either a global or something on the file object, rather than putting it in the line-reading code, which reinforces my belief that Alexander's idea of putting the separator value in the file constructors was right, and my initially putting it in readline or a new readuntil method was wrong.

> Not that I wouldn't like to see this feature to be shipping with Python, but it may help to read through all aspects of the problem that have been discussed before.
>
> Best,
> Wolfgang
>
>
Nick Coghlan
2014-07-19 07:10:58 UTC
On 18 July 2014 12:43, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
> Before responding to Wolfgang, something that occurred to me overnight: The only insurmountable problem with Guido's suggestion of "just unwrap and rewrap the raw or buffer in a subclass that adds this behavior" is that you can't write such a subclass of TextIOWrapper, because it has no way to either peek at or push back onto the buffer. So... Why not add one of those?
>
> Pushing back is easier to implement (since it's already there as a private method), but a bit funky, and peeking would mean it works the same way as with buffered binary files. But I'll take a look at the idiomatic way to do similar things in other languages (C stdio, C++ iostreams, etc.), and make sure that peek is actually sensible for TextIOWrapper, before arguing for it.
>
> While we're at it, it might be nice for the peek method to be documented as an (optional, like raw, etc.?) member of the two ABCs instead of just something that one implementation happens to have, and that the mixin code will use if it happens to be present. (Binary readline uses peek if it exists, falls back to byte by byte if not.)

Slight tangent, but this rewrapping question also arises in the
context of changing encodings on an already open stream. See
http://bugs.python.org/issue15216 for (the gory) details.

> On Jul 18, 2014, at 4:53, Wolfgang Maier <wolfgang.maier-***@public.gmane.org> wrote:
>> You are not the first one to come up with this idea and suggesting solutions. This whole thing has been hanging around on the bug tracker as an unresolved issue (started by Nick Coghlan) since almost a decade:
>>
>> http://bugs.python.org/issue1152248
>>
>> Ever since discovering it, I've been sticking to the recipe provided by Douglas Alan:
>>
>> http://bugs.python.org/issue1152248#msg109117
>
> Thanks.
>
> Douglas's recipe is effectively the same as my resplit, except less general (since it consumes a file rather than any iterable), and some, but not all, of the limitations of that approach were mentioned. And R. David Murray's hack patch is basically the same as the text half of my patch.
>
> The discussion there is also useful, as it raises the similar features in perl, awk, bash, etc.--all of which work by having the user change either a global or something on the file object, rather than putting it in the line-reading code, which reinforces my belief that Alexander's idea of putting the separator value in the file constructors was right, and my initially putting it in readline or a new readuntil method was wrong.

I still favour my proposal there to add a separate "readrecords()"
method, rather than reusing the line based iteration methods - lines
and arbitrary records *aren't* the same thing, and I don't think we'd
be doing anybody any favours by conflating them (whether we're
confusing them at the method level or at the constructor argument
level).

While, as an implementation artifact, it may be possible to get this
"easily" by abusing the existing newline parameter, that's likely to
break a lot of assumptions in *other* code, that specifically expects
newlines to refer to actual line endings. A new separate method
cleanly isolates the feature to code that wants to use it, preventing
potentially adverse and hard to debug impacts on unrelated code that
happens to receive a file object with a custom record separator
configured.

With this kind of proposal, it isn't the "what happens when it works?"
cases that worry me - it's the cases where it *fails* and someone is
stuck with figuring out what has gone wrong. A new method fails
cleanly, but changing the semantics of *existing* arguments,
attributes and methods? That doesn't fail cleanly at all, and can also
have far reaching impacts on the correctness of all sorts of
documentation.

Attempting to wedge this functionality into *existing* constructs
means *changing* a lot of expectations that are now well established
in a Python context. By contrast, adding a *new* construct,
specifically for this purpose, means nothing needs to change with
existing constructs, we don't inadvertently introduce even more
obscure corner cases in newline handling, and there's a solid
terminology hook to hang the documentation on (iteration by line vs
iteration by record - and we can also be clear that "line buffered"
really does correspond to iteration by line, and may not be available
for arbitrary record separators).

Providing this feature as a separate method also makes it possible for
the IO ABC's to provide a default implementation (along the lines of
your resplit function), that concrete implementations can optionally
override with something more optimised. Pure ducktyped cases (not
inheriting from the ABCs) will fail with a fairly obvious error
("AttributeError: 'MyCustomFileType' object has no attribute
'readrecords'" rather than something related to unknown parameter
names or illegal argument values), while those that do inherit from
the ABCs will "just work".
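Such a default implementation might look roughly like the sketch below (`readrecords` is the proposed name, not an existing method, and a real version would also need a text-mode counterpart):

```python
def readrecords(f, sep=b'\0', chunksize=4096):
    """Sketch of a possible ABC-level default for the proposed
    readrecords(): iterate over records split on an arbitrary
    separator, buffering across read() chunk boundaries."""
    buf = b''
    while True:
        chunk = f.read(chunksize)
        if not chunk:          # EOF
            break
        parts = (buf + chunk).split(sep)
        yield from parts[:-1]  # complete records
        buf = parts[-1]        # partial record, kept for the next chunk
    if buf:
        yield buf              # trailing record with no final separator
```

A concrete implementation could override this with something that reuses its internal buffer instead of re-buffering.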

Regards,
Nick.

--
Nick Coghlan | ncoghlan-***@public.gmane.org | Brisbane, Australia
Chris Angelico
2014-07-19 07:32:53 UTC
Permalink
On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
> I still favour my proposal there to add a separate "readrecords()"
> method, rather than reusing the line based iteration methods - lines
> and arbitrary records *aren't* the same thing

But they might well be the same thing. Look at all the Unix commands
that usually separate output with \n, but can be told to separate with
\0 instead. If you're reading from something like that, it should be
just as easy to split on \n as on \0.
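For concreteness, this is the kind of consumer code in question; the helper name here is made up for illustration, and slurping the whole input is only workable when it is reasonably small:

```python
import sys

def filenames_from_null_separated(raw):
    """Split `find -print0`-style output into decoded filenames.
    (Illustrative helper, not an existing API.)"""
    encoding = sys.getfilesystemencoding()
    return [name.decode(encoding) for name in raw.split(b'\0') if name]

# Typical use in a pipeline:
#     filenames_from_null_separated(sys.stdin.buffer.read())
```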

ChrisA
Nick Coghlan
2014-07-19 08:18:35 UTC
Permalink
On 19 July 2014 03:32, Chris Angelico <rosuav-***@public.gmane.org> wrote:
> On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
>> I still favour my proposal there to add a separate "readrecords()"
>> method, rather than reusing the line based iteration methods - lines
>> and arbitrary records *aren't* the same thing
>
> But they might well be the same thing. Look at all the Unix commands
> that usually separate output with \n, but can be told to separate with
> \0 instead. If you're reading from something like that, it should be
> just as easy to split on \n as on \0.

Python isn't Unix, and Python has never supported \0 as a "line
ending". Changing the meaning of existing constructs is fraught with
complexity, and should only be done when there is absolutely no
alternative. In this case, there's an alternative: a new method,
specifically for reading arbitrary records.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan-***@public.gmane.org | Brisbane, Australia
Steven D'Aprano
2014-07-19 09:01:59 UTC
Permalink
On Sat, Jul 19, 2014 at 04:18:35AM -0400, Nick Coghlan wrote:
> On 19 July 2014 03:32, Chris Angelico <rosuav-***@public.gmane.org> wrote:
> > On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
> >> I still favour my proposal there to add a separate "readrecords()"
> >> method, rather than reusing the line based iteration methods - lines
> >> and arbitrary records *aren't* the same thing
> >
> > But they might well be the same thing. Look at all the Unix commands
> > that usually separate output with \n, but can be told to separate with
> > \0 instead. If you're reading from something like that, it should be
> > just as easy to split on \n as on \0.
>
> Python isn't Unix, and Python has never supported \0 as a "line
> ending". Changing the meaning of existing constructs is fraught with
> complexity, and should only be done when there is absolutely no
> alternative. In this case, there's an alternative: a new method,
> specifically for reading arbitrary records.

I don't have an opinion one way or the other, but I don't quite see why
you're worried about allowing the newline parameter to be set to some
arbitrary separator. The best I can come up with is a scenario something
like this:

I open a file with some record-separator

fp = open(filename, newline="\0")

then pass it to a function:

spam(fp)

which assumes that each chunk ends with a linefeed:

assert next(fp).endswith('\n')


But in a case like that, the function is already buggy. I can see at
least two problems with such an assumption:

- what if universal newlines has been turned off and you're reading
a file created under (e.g.) classic Mac OS or RISC OS?

- what if the file contains a single line which does not end with an
end of line character at all?

open('/tmp/junk', 'wb').write(b"hello world!")
next(open('/tmp/junk', 'r'))

Have I missed something?


Although I don't mind whether files grow a readrecords() method, or
re-use the readlines() method, I'm not convinced that API decisions
should be driven solely by the needs of programs which are already
buggy.



--
Steven
Nick Coghlan
2014-07-19 09:27:49 UTC
Permalink
On 19 July 2014 05:01, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:
> On Sat, Jul 19, 2014 at 04:18:35AM -0400, Nick Coghlan wrote:
> But in a case like that, the function is already buggy. I can see at
> least two problems with such an assumption:
>
> - what if universal newlines has been turned off and you're reading
> a file created under (e.g.) classic Mac OS or RISC OS?

That's exactly the point though - people *do* assume "\n", and we've
gone to great lengths to make that assumption *more correct* (even
though it's still wrong sometimes).

We can't reverse course on that, and expect the outcome to make sense
to *people*. When making use of a configurable line endings feature
breaks (and it will), they're going to be confused, and the docs
likely aren't going to help much.

> - what if the file contains a single line which does not end with an
> end of line character at all?
>
> open('/tmp/junk', 'wb').write("hello world!")
> next(open('/tmp/junk', 'r'))
>
> Have I missed something?
>
>
> Although I don't mind whether files grow a readrecords() method, or
> re-use the readlines() method, I'm not convinced that API decisions
> should be driven solely by the needs of programs which are already
> buggy.

It's not being driven by the needs of programs that are already buggy
- my preferences are driven by the fact that line endings and record
separators are *not the same thing*. Thinking that they are is a
matter of confusing the conceptual data model with the implementation
of the framing at the serialisation layer. If we *do* try to treat
them as the same thing, then we have to go find *every single
reference* to line endings in the documentation and add a caveat about
it being configurable at file object creation time, so it might
actually be based on something completely arbitrary.

Line endings are *already* confusing enough that the "universal
newlines" mechanism was added to make it so that Python level code
could mostly ignore the whole "\n" vs "\r" vs "\r\n" distinction, and
just assume "\n" everywhere.

This is why I'm a fan of keeping things comparatively simple, and just
adding a new method (if we only add an iterator version) or two (if we
add a list version as well) specifically for this use case.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan-***@public.gmane.org | Brisbane, Australia
Paul Moore
2014-07-19 09:30:38 UTC
Permalink
On 19 July 2014 10:01, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:
> I open a file with some record-separator
>
> fp = open(filename, newline="\0")
>
> then pass it to a function:
>
> spam(fp)
>
> which assumes that each chunk ends with a linefeed:
>
> assert next(fp).endswith('\n')

I will often do

    for line in fp:
        line = line.strip()

to remove the line ending ("record separator"). This fails if you have
an arbitrary separator. And for that matter, how would you remove an
arbitrary separator? Maybe line = line[:-1] works, but what if at some
point people ask for multi-character separators ("\n\n" for "paragraph
separated", for example - ignoring the universal newline complexities
in that).

A splitrecord method still needs a means for code to remove the
record separator, of course, but the above demonstrates how reusing
line separation could break the assumptions of *current* code.

Paul
Greg Ewing
2014-07-20 04:03:20 UTC
Permalink
Paul Moore wrote:
> And for that matter, how would you remove an
> arbitrary separator? Maybe line = line[:-1] works, but what if at some
> point people ask for multi-character separators

If the newline mechanism is re-used, it would
convert whatever separator is used into '\n'.

--
Greg
Andrew Barnert
2014-07-20 05:02:03 UTC
Permalink
On Saturday, July 19, 2014 9:42 PM, Greg Ewing <***@canterbury.ac.nz> wrote:

> Paul Moore wrote:
>> And for that matter, how would you remove an
>> arbitrary separator? Maybe line = line[:-1] works, but what if at some
>> point people ask for multi-character separators

You already can't use line[:-1] today, because '\r\n' is already a valid value, and always has been.

And however people deal with newline='\r\n' will work for any crazy separator you can think of. Maybe line[:-len(nl)]. Maybe line.rstrip(nl) if it's appropriate (it isn't always, either for \r\n or for some arbitrary separator).
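A concrete illustration of why rstrip() isn't always appropriate (plain Python, nothing new proposed here): rstrip() treats its argument as a *set* of characters, so it can eat more than the single trailing terminator.

```python
nl = '\r\n'

# Slicing removes exactly one terminator -- but only if it is
# guaranteed to be present:
assert 'data\r\n'[:-len(nl)] == 'data'

# rstrip strips *every* trailing character from the set {'\r', '\n'},
# which is sometimes too much:
assert 'trailing\n\n\r\n'.rstrip(nl) == 'trailing'
```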

> If the newline mechanism is re-used, it would

> convert whatever separator is used into '\n'.


No it wouldn't.

https://docs.python.org/3/library/io.html#io.TextIOWrapper

> When reading input from the stream, if newline is None, universal newlines mode is enabled… If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.

So, making '\0' a legal value just means the '\0' line endings will be returned to the caller untranslated.

Also, remember that binary files don't do universal newline translation ever, so just letting you change the separator there wouldn't add translation.
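The untranslated behaviour quoted above can be checked today with a separator that's already legal:

```python
import io

# With newline='\r', lines are terminated only by '\r' and the
# terminator comes back untranslated -- exactly the behaviour a
# hypothetical newline='\0' would extend to NUL.
f = io.TextIOWrapper(io.BytesIO(b'a\rb\rc'), encoding='utf-8', newline='\r')
assert list(f) == ['a\r', 'b\r', 'c']
```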


Of course both of those could be changed as well (although with what interface, I'm not sure…), but I don't think they should be.
Guido van Rossum
2014-07-20 05:45:04 UTC
Permalink
If and when something is decided in this thread, can someone summarize it
to me? I don't have time to read all the lengthy arguments but I do care
about the outcome.

--
--Guido van Rossum (python.org/~guido)
Andrew Barnert
2014-07-20 11:56:28 UTC
Permalink
Per Nick's suggestion, I will write up a draft PEP, and link it to issue #1152248, which should be a lot easier to follow. If you want to wait until the first round of discussion and the corresponding update to the PEP before checking in, I'll make sure it's obvious when that's happened.

Sent from a random iPhone

On Jul 19, 2014, at 22:45, Guido van Rossum <guido-+ZN9ApsXKcEdnm+***@public.gmane.org> wrote:

> If and when something is decided in this thread, can someone summarize it to me? I don't have time to read all the lengthy arguments but I do care about the outcome.
>
> --
> --Guido van Rossum (python.org/~guido)
Antoine Pitrou
2014-07-19 14:55:43 UTC
Permalink
Le 19/07/2014 05:01, Steven D'Aprano a écrit :
>
> I open a file with some record-separator
>
> fp = open(filename, newline="\0")

Hmm... newline="\0" already *looks* wrong. To me, it's a hint that
you're abusing the API.

The main advantage of it, though, is that you can use iteration in
addition to the regular readline() (or readrecord()) method.

Regards

Antoine.
MRAB
2014-07-19 16:21:33 UTC
Permalink
On 2014-07-19 10:01, Steven D'Aprano wrote:
[snip]

> - what if universal newlines has been turned off and you're reading
> a file created under (e.g.) classic Mac OS or RISC OS?
>
[snip]
FTR, the line ending in RISC OS is '\n'.
Guido van Rossum
2014-07-19 20:05:32 UTC
Permalink
I don't have time for this thread.

I never meant to suggest anything that would require pushing back data into
the buffer (you must have misread me).

I don't like changing the meaning of the newline argument to open (and it
doesn't solve enough use cases any way).

I personally think it's preposterous to use \0 as a separator for text
files (nothing screams binary data like a null byte :-).

I don't think it's a big deal if a method named readline() returns a record
that doesn't end in a \n character.

I value the equivalence of __next__() and readline().

I still think you should solve this using a wrapper class (that does its
own buffering if necessary, and implements the rest of the stream protocol
for the benefit of other consumers of some of the data).

Once a suitable wrapper class has been implemented as a 3rd party module
and is in common use you may petition to have it added to the standard
library, as a separate module/class/function.

--
--Guido van Rossum (python.org/~guido)
Wichert Akkerman
2014-07-20 07:50:10 UTC
Permalink
> On 19 Jul 2014, at 22:05, Guido van Rossum <guido-+ZN9ApsXKcEdnm+***@public.gmane.org> wrote:
>
> I don't have time for this thread.
>
> I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me).
>
> I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way).

I see another problem with doing this by modifying the open() call: it does not work for filehandles created using other methods such as pipe() or socket(), either used directly or via subprocess. There are real-world examples of situations where that is very useful. One of them was even mentioned in this discussion: processing the output of find -0.

Wichert.
Andrew Barnert
2014-07-20 11:53:01 UTC
Permalink
On Jul 20, 2014, at 0:50, Wichert Akkerman <wichert-***@public.gmane.org> wrote:

>
>> On 19 Jul 2014, at 22:05, Guido van Rossum <guido-+ZN9ApsXKcEdnm+***@public.gmane.org> wrote:
>>
>> I don't have time for this thread.
>>
>> I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me).
>>
>> I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way).
>
> I see another problem with doing this by modifying the open() call: it does not work for filehandles created using other methods such as pipe() or socket(), either used directly or via subprocess. There are real-world examples of situations where that is very useful.

A socket() is not a python file object, doesn't have a similar API, and doesn't have a readline method.

The result of calling socket.makefile, on the other hand, is a file object--and it's created by calling open.* And I'm pretty sure socket.makefile already takes a newline argument and just passes it along, in which case it will magically work with no changes at all.**

IIRC, os.pipe() just returns a pair of fds (integers), not a file object at all. It's up to you to wrap that in a file object if you want to--which you do by passing it to the open function.

So, neither of your objections works.

There are some better examples you could have raised, however. For example, a bz2.BzipFile is created with bz2.open. And, while the file delegates to a BufferedReader or TextIOWrapper, bz2.open almost certainly validates its inputs and won't pass newline on to the BufferedReader in binary mode. So, it would have to be changed to get the benefit.

However, given that there's no way to magically make every file-like object anyone has ever written automatically grow this new functionality, having the API change on the constructors, which are not part of any API and not consistent, is better than having it on the readline method. Think about where you'd get the error in each case: before even writing your code, when you look up how BzipFile instances are created and see there's no way to pass a newline argument, or deep in your code when you're using a file object that came from who knows where and its readline method doesn't like the standard, documented newline argument?


* Or maybe it's created by constructing a BufferedReader, BufferedWriter, BufferedRandom, or TextIOWrapper directly. I don't remember off hand. But it doesn't matter, because the suggestion is to put the new parameter in those constructors, and make open forward to them, so whether makefile calls them directly or via open, it gets the same effect.

** Unless it validates the arguments before passing them along. I looked over a few stdlib classes, and there was at least one that unnecessarily does the same validation open is going to do anyway, so obviously that needs to be removed before the class magically benefits.


In some cases (like tempfile.NamedTemporaryFile), even that isn't necessary, because the implementation just passes through all **kwargs that it doesn't want to handle to the open or constructor call.
Paul Moore
2014-07-20 13:42:20 UTC
Permalink
On 20 July 2014 12:53, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
> There are some better examples you could have raised, however. For example, a bz2.BzipFile is created
> with bz2.open. And, while the file delegates to a BufferedReader or TextIOWrapper, bz2.open almost
> certainly validates its inputs and won't pass newline on to the BufferedReader in binary mode.
> So, it would have to be changed to get the benefit.

The most significant example is one which has been mentioned, but you
may have missed. The motivation for this proposal is to interoperate
with the -0 flag on things like the unix find command. But that is
typically used in a pipe, which means your Python program will likely
receive \0 terminated records via sys.stdin. And sys.stdin is already
opened for you - you do not have the option to specify a newline
argument.

In actual fact, I can't think of a good example (either from my own
experience, or mentioned in this thread) where I'd expect to be
reading \0-terminated records from anything *except* sys.stdin.

Paul
Clint Hepner
2014-07-20 15:11:25 UTC
Permalink
--
Clint

> On Jul 20, 2014, at 9:42 AM, Paul Moore <p.f.moore-***@public.gmane.org> wrote:
>
> In actual fact, I can't think of a good example (either from my own
> experience, or mentioned in this thread) where I'd expect to be
> reading \0-terminated records from anything *except* sys.stdin.

Named pipes and whatever is used to implement process substitution ( < <(find ... -0) ) come to mind.
Juancarlo Añez
2014-07-19 11:49:58 UTC
Permalink
On Sat, Jul 19, 2014 at 3:48 AM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:

> Python isn't Unix, and Python has never supported \0 as a "line
> ending". Changing the meaning of existing constructs is fraught with
> complexity, and should only be done when there is absolutely no
> alternative. In this case, there's an alternative: a new method,
> specifically for reading arbitrary records.
>

"practicality beats purity."

http://legacy.python.org/dev/peps/pep-0020/


--
Juancarlo *Añez*
Andrew Barnert
2014-07-19 23:28:55 UTC
Permalink
(replies to multiple messages here)

On Saturday, July 19, 2014 1:19 AM, Nick Coghlan <***@gmail.com> wrote:


>On 19 July 2014 03:32, Chris Angelico <***@gmail.com> wrote:
>> On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <***@gmail.com> wrote:
>>> I still favour my proposal there to add a separate "readrecords()"
>>> method, rather than reusing the line based iteration methods - lines
>>> and arbitrary records *aren't* the same thing
>>
>> But they might well be the same thing. Look at all the Unix commands
>> that usually separate output with \n, but can be told to separate with
>> \0 instead. If you're reading from something like that, it should be
>> just as easy to split on \n as on \0.
>
>Python isn't Unix, and Python has never supported \0 as a "line
>ending".

Well, yeah, but Python is used on Unix, and it's used to write scripts that interoperate with other Unix command-line tools.

For the record, the reason this came up is that someone was trying to use one of my scripts in a pipeline with find -0, and he had no problem adapting the Perl scripts he's using to handle -0 output, but no clue how to do the same with my Python script. 

In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.

> Changing the meaning of existing constructs is fraught with
>complexity, and should only be done when there is absolutely no
>alternative. In this case, there's an alternative: a new method,
>specifically for reading arbitrary records.

This was basically my original suggestion, so obviously I don't think it's a terrible idea. But I don't think it's as good.

First, which of these is more readable, easier for novices to figure out how to write, etc.:

    with open(path, newline='\0') as f:
        for line in f:
            handle(line.rstrip('\0'))

    with open(path) as f:
        for line in iter(lambda: f.readrecord('\0'), ''):
            handle(line.rstrip('\0'))

Second, as Guido mentioned at the start of this thread, existing file-like object types (whether they implement BufferedIOBase or TextIOBase, or just duck-type the interfaces) are not going to have the new functionality. Construction has never been part of the interface of the file-like object API; opening a real file has always looked different from opening a member file in a zip archive or making a file-like wrapper around a socket transport or whatever. But using the resulting object has always been the same. Adding a readrecord method or changing the readline interface means that's no longer true.

There might be a good argument for making the change more visible—that is, using a different parameter on the open call instead of reusing the existing newline. (And that's what Alexander originally suggested as an alternative to my readrecord idea.) That way, it's much more obvious that spam.open or eggs.makefile or whatever doesn't support alternate line endings, without having to read its documentation on what newline means. But either way, I think it should go in the open function, not the file-object API.


On Saturday, July 19, 2014 2:28 AM, Nick Coghlan <***@gmail.com> wrote:

> - my preferences are driven by the fact that line endings and record
> separators are *not the same thing*.  Thinking that they are is a
> matter of confusing the conceptual data model with the implementation
> of the framing at the serialisation layer. 

Yes, using lines implicitly as records can lead to confusion—but people actually do that all the time; this isn't a new problem, and it's exactly the same problem with \r\n, or even \n, as with \0. When you open up TextEdit and write a grocery list with one item on each line, those newlines are not part of the items. When you pipe the output of find to a script, the newlines are not part of the filenames. When you pipe the output of find -0 to a script, the \0 terminators are not part of the filenames.

> Line endings are *already* confusing enough that the "universal
> newlines" mechanism was added to make it so that Python level code
> could mostly ignore the whole "\n" vs "\r" vs 
> "\r\n" distinction, and
> just assume "\n" everywhere.

I understand the point here. There are cases where universal newlines let you successfully ignore the confusion rather than dealing with it, and newline='\0' will not be useful in those cases.

But then newline='\r' is also never useful in those cases. The new behavior will be useful in exactly the cases where '\r' already is—no more, but no less.

> This is why I'm a fan of keeping things comparatively simple, and just
> adding a new method (if we only add an iterator version) or two (if we
> add a list version as well) specifically for this use case.

Actually, the obvious new method is neither the iterator version nor the list version, but a single-record version, readrecord. Sometimes you need readline/readrecord, and it's conceptually simpler for the user. And of course the implementation is a lot simpler; you don't need to build a new iterator object that references the file for readrecord the way you do for iterrecords. And finally, if you only have one of the two, as bad as iter(lambda: f.readrecord('\0'), '') may look to novices, next(f.iterrecords('\0')) would probably be even more confusing.

But we could also add an iterrecords, for two methods.

And as for the list-based version… well, I don't even understand why readlines still exists in 3.x (much less why the tutorial suggests it), so I'd be fine not having a readrecords, but I don't have any real objection.

On Saturday, July 19, 2014 1:06 PM, Guido van Rossum <***@python.org> wrote:

>I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me).

I get the feeling either there's a much simpler way to wrap a file object that I'm missing, or that you think there is.

In order to do the equivalent of readrecord, you have to do one of three things:

1. Read character by character, which can be incredibly slow.

2. Peek or push back on the buffer, as the io classes' readline methods do.


3. Put another buffer in front of the file, which means you have two objects both sharing the same file but with effective file pointers out of sync. And you have to reproduce all of the file-like-object API methods for your new buffered object (a lot more work, and a lot more to get wrong—effectively, it means you have to write all of BufferedReader or TextIOWrapper, but modified to wrap another buffered file instead of wrapping the lower-level thing). And no matter how you do it, it's obviously going to be less efficient.

If there's a lighter version of #3 that makes sense, I'm not seeing it. Which is probably a problem with my lack of insight, but I'd appreciate a pointer in the right direction.
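To make option 2 concrete, a peek-based readrecord for a binary BufferedReader might look like the sketch below (name and signature are hypothetical; it is restricted to single-byte separators, since a multi-byte one could straddle two peeks and be missed):

```python
def read_record(buffered, sep=b'\0'):
    """Sketch: read one record (including its separator, if present)
    from an io.BufferedReader, the way binary readline uses peek()."""
    assert len(sep) == 1
    chunks = []
    while True:
        peeked = buffered.peek(1)   # buffered bytes; b'' only at EOF
        if not peeked:
            break
        i = peeked.find(sep)
        if i >= 0:
            # Consume up to and including the separator, then stop.
            chunks.append(buffered.read(i + 1))
            break
        # No separator in the buffer: consume it all and refill.
        chunks.append(buffered.read(len(peeked)))
    return b''.join(chunks)
```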

>I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way).


Maybe using a different argument is a better answer. (That's what Alexander suggested originally.)

The reason both I and people on the bug thread suggested using newline instead is because the behavior you want from sep='\0' happens to be identical to the behavior you get from newline='\r', except with '\0' instead of '\r'.

And that's the best argument I have for reusing newline: someone has already worked out and documented all the implications of newline, and people have already learned them, so if we really want the same functionality, it makes sense to reuse it. 

But I realize that argument only goes so far. It wasn't obvious, until I looked into it, that I wanted the exact same functionality.

>I personally think it's preposterous to use \0 as a separator for text files (nothing screams binary data like a null byte :-).

Sure, it would have been a lot better for find and friends to grow a --escape parameter instead of -0, but I think that ship has sailed.

>I don't think it's a big deal if a method named readline() returns a record that doesn't end in a \n character.
>
>I value the equivalence of __next__() and readline().
>
>I still think you should solve this using a wrapper class (that does its own buffering if necessary, and implements the rest of the stream protocol for the benefit of other consumers of some of the data).

Again, I don't see any way to do this sensibly that wouldn't be a whole lot more work than just forking the io package.

But maybe that's the answer: I can write _io2 as a fork of _io with my changes, the same for _pyio2 (for PyPy), and then the only thing left to write is a __main__ for the package that wraps up _io2/_pyio2 in the io ABCs (and re-exports those ABCs).
Nick Coghlan
2014-07-19 23:49:38 UTC
Permalink
On 20 Jul 2014 09:28, "Andrew Barnert" <abarnert-/***@public.gmane.org> wrote:
>
> (replies to multiple messages here)
>
> On Saturday, July 19, 2014 1:19 AM, Nick Coghlan <ncoghlan-***@public.gmane.org>
wrote:
>
>
> >On 19 July 2014 03:32, Chris Angelico <rosuav-***@public.gmane.org> wrote:
> >> On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan-***@public.gmane.org>
wrote:
> >>> I still favour my proposal there to add a separate "readrecords()"
> >>> method, rather than reusing the line based iteration methods - lines
> >>> and arbitrary records *aren't* the same thing
> >>
> >> But they might well be the same thing. Look at all the Unix commands
> >> that usually separate output with \n, but can be told to separate with
> >> \0 instead. If you're reading from something like that, it should be
> >> just as easy to split on \n as on \0.
> >
> >Python isn't Unix, and Python has never supported \0 as a "line
> >ending".
>
> Well, yeah, but Python is used on Unix, and it's used to write scripts
that interoperate with other Unix command-line tools.
>
> For the record, the reason this came up is that someone was trying to use
one of my scripts in a pipeline with find -0, and he had no problem
adapting the Perl scripts he's using to handle -0 output, but no clue how
to do the same with my Python script.
>
> In general, it's just as easy to write Unix command-line tools in Python
as in Perl, and that's a good thing—it means I don't have to use Perl. But
as soon as -0 comes into the mix, that's no longer true. And that's a
problem.

I would find adding NULL to the potential newline set significantly less
objectionable than opening it up to arbitrary character sequences.

Adding a single possible newline character is a much simpler change, and
one likely to have far fewer odd consequences. This is especially so if
specifying NULL as the line separator is only permitted for files opened in
binary mode.

Cheers,
Nick.
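[Editorial aside: even without any io changes, the binary-mode record iteration discussed above can be approximated today with a small generator that re-splits read() chunks on NUL. This is only a sketch; the iter_records name is made up, not a proposed API.]

```python
import io

def iter_records(binary_file, sep=b'\0', chunk_size=4096):
    """Yield sep-separated records from a binary file object."""
    buf = b''
    while True:
        chunk = binary_file.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        parts = buf.split(sep)
        buf = parts.pop()          # last piece may be an incomplete record
        yield from parts
    if buf:                        # trailing record with no final separator
        yield buf

# Example with an in-memory "file":
f = io.BytesIO(b'spam.txt\0ham\neggs.txt\0')
print(list(iter_records(f)))  # [b'spam.txt', b'ham\neggs.txt']
```

This is essentially the resplit() workaround from the start of the thread, and it shows the cost the proposal is trying to avoid: double buffering, and a file position unrelated to the record just yielded.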
Chris Angelico
2014-07-19 23:51:26 UTC
Permalink
On Sun, Jul 20, 2014 at 9:49 AM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
> Adding a single possible newline character is a much simpler change, and one
> likely to have far fewer odd consequences. This is especially so if
> specifying NULL as the line separator is only permitted for files opened in
> binary mode.

U+0000 is a valid Unicode character, so I'd have no objection to, for
instance, splitting a UTF-8 encoded text file on \0.

ChrisA
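[Editorial aside: the round trip Chris alludes to does hold, since UTF-8 encodes U+0000 as the single byte 0x00. A quick sketch:]

```python
# NUL-separated text survives a UTF-8 encode/decode round trip and can
# be split at the str level, even when a name contains a newline.
names = ['spam.txt', 'ham\neggs.txt']  # second name embeds a newline
data = '\0'.join(names).encode('utf-8')
decoded = data.decode('utf-8').split('\0')
print(decoded)  # ['spam.txt', 'ham\neggs.txt']
```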
Nick Coghlan
2014-07-19 23:56:18 UTC
Permalink
On 20 Jul 2014 09:49, "Nick Coghlan" <ncoghlan-***@public.gmane.org> wrote:
>
>
> On 20 Jul 2014 09:28, "Andrew Barnert" <abarnert-/***@public.gmane.org> wrote:
> >
> > (replies to multiple messages here)
> >
> > On Saturday, July 19, 2014 1:19 AM, Nick Coghlan <ncoghlan-***@public.gmane.org>
wrote:
> >
> >
> > >On 19 July 2014 03:32, Chris Angelico <rosuav-***@public.gmane.org> wrote:
> > >> On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan-***@public.gmane.org>
wrote:
> > >>> I still favour my proposal there to add a separate "readrecords()"
> > >>> method, rather than reusing the line based iteration methods - lines
> > >>> and arbitrary records *aren't* the same thing
> > >>
> > >> But they might well be the same thing. Look at all the Unix commands
> > >> that usually separate output with \n, but can be told to separate
with
> > >> \0 instead. If you're reading from something like that, it should be
> > >> just as easy to split on \n as on \0.
> > >
> > >Python isn't Unix, and Python has never supported \0 as a "line
> > >ending".
> >
> > Well, yeah, but Python is used on Unix, and it's used to write scripts
that interoperate with other Unix command-line tools.
> >
> > For the record, the reason this came up is that someone was trying to
use one of my scripts in a pipeline with find -0, and he had no problem
adapting the Perl scripts he's using to handle -0 output, but no clue how
to do the same with my Python script.
> >
> > In general, it's just as easy to write Unix command-line tools in
Python as in Perl, and that's a good thing—it means I don't have to use
Perl. But as soon as -0 comes into the mix, that's no longer true. And
that's a problem.
>
> I would find adding NULL to the potential newline set significantly less
objectionable than opening it up to arbitrary character sequences.
>
> Adding a single possible newline character is a much simpler change, and
one likely to have far fewer odd consequences. This is especially so if
specifying NULL as the line separator is only permitted for files opened in
binary mode.

Also, the interoperability argument is a good one, as is the analogy with
'\r'. Since this does end up touching the open() builtin and the core IO
abstractions, it will need a PEP.

As far as implementation goes, I suspect a RecordIOWrapper layered IO model
inspired by the approach used for TextIOWrapper may make sense.

Cheers,
Nick.

>
> Cheers,
> Nick.
Andrew Barnert
2014-07-20 00:57:14 UTC
Permalink
On Saturday, July 19, 2014 4:49 PM, Nick Coghlan <***@gmail.com> wrote:

>On 20 Jul 2014 09:28, "Andrew Barnert" <***@yahoo.com> wrote:


>> In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.

>I would find adding NULL to the potential newline set significantly less objectionable than opening it up to arbitrary character sequences.


>Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode.


But the newline parameter is only permitted in text mode. Are you suggesting that we add newline to binary mode, but the only allowed values are None (current behavior) and b'\0', while on text files the list of allowed values stays the same as today?

Also, would you want the same semantics for newline='\0' on binary files that newline='\r' has on text files (including newline remapping on write)?

And I'm still not sure why you think this shouldn't be allowed in text mode in the first place (especially given that you suggested the same thing for text files _only_ a few years ago).

The output of find is a list of newline-separated or \0-separated filenames, in the filesystem's encoding. Why should I be able to handle the first as a text file, but have to handle the second as a binary file and then manually decode each line?


You could argue that find -print0 isn't really separating Unicode filenames with U+0000, but separating UTF-8 or Latin-1 or whatever filenames with \x00, and it's just a coincidence that they happen to match up. But it really isn't just a coincidence; it was an intentional design decision for Unicode (and UTF-8, and Latin-1) that the ASCII control characters map in the obvious way, and one that many tools and scripts take advantage of, so why shouldn't tools and scripts written in Python be able to take advantage of it?
Nick Coghlan
2014-07-20 01:23:56 UTC
Permalink
On 20 July 2014 10:57, Andrew Barnert <***@yahoo.com> wrote:
> On Saturday, July 19, 2014 4:49 PM, Nick Coghlan <***@gmail.com> wrote:
>
>>On 20 Jul 2014 09:28, "Andrew Barnert" <***@yahoo.com> wrote:
>
>
>>> In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.
>
>>I would find adding NULL to the potential newline set significantly less objectionable than opening it up to arbitrary character sequences.
>
>
>>Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode.
>
>
> But the newline parameter is only permitted in text mode. Are you suggesting that we add newline to binary mode, but the only allowed values are None (current behavior) and b'\0', while on text files the list of allowed values stays the same as today?

Actually, I temporarily forgot that newline was only handled at the
TextIOWrapper layer. All the more reason for a PEP that clearly lays
out the status quo (both Python's own newline handling and the "-0"
option for various UNIX utilities, and the way that is handled in
other scripting languages), and discusses the various options for
dealing with it (new RecordIOWrapper class with a new "open"
parameter, new methods on IO classes, new semantics on the existing
TextIOWrapper class).

If the description of the use cases is clear enough, then the "right
answer" amongst the presented alternatives (which includes "don't
change anything") may be obvious. At present, I'm genuinely unclear on
why someone would ever want to pass the "-0" option to the other UNIX
utilities, which then makes it very difficult to have a sensible
discussion on how we should address that use case in Python.

Cheers,
Nick.

--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Chris Angelico
2014-07-20 01:31:10 UTC
Permalink
On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
> At present, I'm genuinely unclear on
> why someone would ever want to pass the "-0" option to the other UNIX
> utilities, which then makes it very difficult to have a sensible
> discussion on how we should address that use case in Python.

That one's easy. What happens if you use 'find' to list files, and
those files might have \n in their names? You need another sep.

ChrisA
Nick Coghlan
2014-07-20 01:40:25 UTC
Permalink
On 20 July 2014 11:31, Chris Angelico <rosuav-***@public.gmane.org> wrote:
> On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
>> At present, I'm genuinely unclear on
>> why someone would ever want to pass the "-0" option to the other UNIX
>> utilities, which then makes it very difficult to have a sensible
>> discussion on how we should address that use case in Python.
>
> That one's easy. What happens if you use 'find' to list files, and
> those files might have \n in their names? You need another sep.

Yes, but having a newline in a filename is sufficiently weird that I
find it hard to imagine a scenario where "fix the filenames" isn't a
better answer. Hence why I think the PEP needs to explain why the UNIX
utilities considered this use case sufficiently non-obscure to add
explicit support for it, rather than just assuming that the
obviousness of the use case can be taken for granted.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan-***@public.gmane.org | Brisbane, Australia
Andrew Barnert
2014-07-20 03:58:58 UTC
Permalink
On Saturday, July 19, 2014 6:42 PM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:

> On 20 July 2014 11:31, Chris Angelico <rosuav-***@public.gmane.org> wrote:
>>  On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan-***@public.gmane.org>
> wrote:
>>>  At present, I'm genuinely unclear on
>>>  why someone would ever want to pass the "-0" option to the
>>> other UNIX
>>>  utilities, which then makes it very difficult to have a sensible
>>>  discussion on how we should address that use case in Python.
>>
>>  That one's easy. What happens if you use 'find' to list files,
>> and
>>  those files might have \n in their names? You need another sep.
>
> Yes, but having a newline in a filename is sufficiently weird that I
> find it hard to imagine a scenario where "fix the filenames" isn't
> a
> better answer. Hence why I think the PEP needs to explain why the UNIX
> utilities considered this use case sufficiently non-obscure to add
> explicit support for it, rather than just assuming that the
> obviousness of the use case can be taken for granted.


First, why is it so odd to have newlines in filenames? It used to be pretty common on Classic Mac. Sure, they're not too common nowadays, but that's because they're illegal on DOS/Windows, and because the shell on Unix systems makes them a pain to deal with, not because there's something inherently nonsensical about the idea, any more than there is about filenames with spaces, non-ASCII characters, or lengths over 255.


Second, "fix the filenames" is almost _never_ a better answer. If you're publishing a program for other people to use, you want to document that it won't work on some perfectly good files, and close their bugs as "Not a bug, rename your files if you want to use my software"? If the files are on a read-only filesystem or a slow tape backup, you really want to copy the entire filesystem over just so you can run a script on it?

Also, even if "fix the filenames" were the right answer, you need to write a tool to do that, and why shouldn't it be possible to use Python for that tool? (In fact, one of the scripts I wanted this feature for is a replacement for the traditional rename tool (http://plasmasturm.org/code/rename/). I mainly wanted to let people use regular expressions without letting them run arbitrary Perl code, as rename -e does, but also, I couldn't figure out how to rename "foo" to "Foo" on a case-preserving-but-insensitive filesystem in Perl, and I know how to do it in Python.)

At any rate, there are decades of tradition behind using -print0, and that's not going to change just because Python isn't as good as other languages at dealing with it. The GNU find documentation (http://linux.die.net/man/1/find) explicitly recommends, in multiple places, using -print0 instead of -print whenever possible. (For example, in the summary near the top, "If no expression is given, the expression -print is used (but you should probably consider using -print0 instead, anyway).")


And part of the reason for that is that many other tools, like xargs, split on any whitespace, not on newlines, if not given the -0 argument. Fortunately, all of those tools know how to handle backslash escapes, but unfortunately, find doesn't know how to emit them. (Actually, frustratingly, both BSD and SysV find have the code to do it, but not in a way you can use here.) So, if you're writing a script that uses find and might get piped to anything that handles input like xargs, you have to use -print0.

And that means, if you're writing a tool that might get find piped to it, you have to handle -print0, even if you're pretty sure nobody will ever have newlines for you to deal with, because they're probably going to want to use -print0 anyway, rather than figure out how your tool deals with other whitespace.
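[Editorial aside: handling -print0 output from Python today means splitting the raw bytes manually. A minimal sketch, assuming a GNU or BSD find is on PATH; the decoding mirrors what the proposal would do automatically:]

```python
import subprocess
import sys

# Run `find . -maxdepth 1 -print0` and split its output on NUL,
# decoding each name with the filesystem encoding (surrogateescape
# preserves names that are undecodable in the current locale).
out = subprocess.run(['find', '.', '-maxdepth', '1', '-print0'],
                     stdout=subprocess.PIPE).stdout
names = [p.decode(sys.getfilesystemencoding(), 'surrogateescape')
         for p in out.split(b'\0') if p]
print(names[:3])
```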
Nick Coghlan
2014-07-20 05:00:15 UTC
Permalink
On 20 July 2014 13:58, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
> First, why is it so odd to have newlines in filenames? It used to be pretty common on Classic Mac. Sure, they're not too common nowadays, but that's because they're illegal on DOS/Windows, and because the shell on Unix systems makes them a pain to deal with, not because there's something inherently nonsensical about the idea, any more than filenames with spaces or non-ASCII characters or >255 length.

You answered your own question: because DOS/Windows make them illegal,
and the Unix shell isn't fond of them either. I was a DOS/Windows user
for more than a decade before switching to Linux for personal use, and
in a decade of using Linux (and even going to work for a Linux
vendor), I've never encountered a filename with a newline in it. Thus
the idea that anyone *would* do such a thing, and that it would be
prevalent enough for UNIX tools to include a workaround in programs
that normally produce newline separated output is an entirely novel
concept for me. Any such file I encountered *would* be an outlier, and
I'd likely be in a position to get the offending filename fixed rather
than changing any data processing pipelines (whether written in Python
or not) to tolerate newlines in filenames (since the cost differential
between fixing one filename vs updating the data processing pipelines
would be enormous).

However, note that my attitude changed significantly once you
clarified the use case - it's clear that there *is* a use case, it's
just one that's outside my own personal experience. That's one of the
things the PEP process is for - to explain such use cases to folks
that haven't personally encountered them, and then explain why the
proposed solution addresses the use case in a way that makes sense for
the domains where the use case arises. The recent matrix
multiplication PEP was an exemplary instance of the breed.

That's what I'm asking for here: a PEP that makes sense to someone
like me for whom the idea of putting a newline in a filename is
completely alien. Yes, it's technically permitted by the underlying
operating system APIs on POSIX systems, but all the affordances at
both the console and GUI level suggest "no newlines allowed". If
you're coming from a DOS/Windows background (as I did), then the idea
that a newline is technically a permitted filename character may never
even occur to you (it certainly hadn't to me, and I'd never previously
come across anything to challenge that assumption).

Regards,
Nick.

--
Nick Coghlan | ncoghlan-***@public.gmane.org | Brisbane, Australia
Andrew Barnert
2014-07-21 00:41:32 UTC
Permalink
On Saturday, July 19, 2014 10:00 PM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:

> That's one of the
> things the PEP process is for - to explain such use cases to folks
> that haven't personally encountered them, and then explain why the
> proposed solution addresses the use case in a way that makes sense for
> the domains where the use case arises.

OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt

It's probably a lot more detailed than necessary in many areas, but I figured it was better to include too much than to leave things ambiguous; after I know which parts are not contentious, I can strip it down in the next revision.

Meanwhile, while writing it, and re-reading Guido's replies in this thread, I decided to come back to the alternative idea of exposing text files' buffers just like binary files' buffers. If done properly, that would make it much easier (still not trivial, but much easier) for users to just implement the readrecord functionality on their own, or for someone to package it up on PyPI. And I don't think the idea is as radical as it sounded at first, so I don't want it to be dismissed out of hand. So, also see http://bugs.python.org/file36009/pep-peek.txt

Finally, writing this up made me recognize a couple of minor problems with the patch I'd been writing, and I don't think I have time to clean it up and write relevant tests now, so I might not be able to upload a useful patch until next weekend. Hopefully people can still discuss the PEP without a patch to play with.
Paul Moore
2014-07-21 07:04:32 UTC
Permalink
On 21 July 2014 01:41, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
> OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt

As a suggestion, how about adding an example of a simple nul-separated
filename filter - the sort of thing that could go in a find -print0 |
xxx | xargs -0 pipeline? If I understand it, that's one of the key
motivating examples for this change, so seeing how it's done would be
a great help.

Here's the sort of thing I mean, written for newline-separated files:

import sys

def process(filename):
    """Trivial example"""
    return filename.lower()

if __name__ == '__main__':
    for filename in sys.stdin:
        filename = process(filename)
        print(filename)

This is also an example of why I'm struggling to understand how an
open() parameter "solves all the cases". There's no explicit open()
call here, so how do you specify the record separator? Seeing how you
propose this would work would be really helpful to me.

Paul
Akira Li
2014-07-22 16:05:42 UTC
Permalink
Paul Moore <p.f.moore-***@public.gmane.org> writes:

> On 21 July 2014 01:41, Andrew Barnert
> <abarnert-/***@public.gmane.org> wrote:
>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>> not a good thing to do, apologies); you can find it at
>> http://bugs.python.org/file36008/pep-newline.txt
>
> As a suggestion, how about adding an example of a simple nul-separated
> filename filter - the sort of thing that could go in a find -print0 |
> xxx | xargs -0 pipeline? If I understand it, that's one of the key
> motivating examples for this change, so seeing how it's done would be
> a great help.
>
> Here's the sort of thing I mean, written for newline-separated files:
>
> import sys
>
> def process(filename):
>     """Trivial example"""
>     return filename.lower()
>
> if __name__ == '__main__':
>     for filename in sys.stdin:
>         filename = process(filename)
>         print(filename)
>
> This is also an example of why I'm struggling to understand how an
> open() parameter "solves all the cases". There's no explicit open()
> call here, so how do you specify the record separator? Seeing how you
> propose this would work would be really helpful to me.
>

The `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
can replace the `sys.std*` streams without worrying about preserving the
`sys.__std*__` streams:

#!/usr/bin/env python
import io
import re
import sys
from pathlib import Path

def transform_filename(filename: str) -> str:  # example
    """Normalize whitespace in basename."""
    path = Path(filename)
    new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
    path.replace(new_path)  # rename on disk if necessary
    return str(new_path)

def SystemTextStream(bytes_stream, **kwargs):
    encoding = sys.getfilesystemencoding()
    return io.TextIOWrapper(bytes_stream,
                            encoding=encoding,
                            errors='surrogateescape' if encoding != 'mbcs' else 'strict',
                            **kwargs)

nl = '\0' if '-0' in sys.argv else None
sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
for line in SystemTextStream(sys.stdin.detach(), newline=nl):
    print(transform_filename(line.rstrip(nl)), end=nl)

io.TextIOWrapper() plays the role of open() in this case. The code
assumes that `newline` parameter accepts '\0'.

The example function handles Unicode whitespace to demonstrate why
opaque bytes-based cookies can't be used to represent filenames in this
case even on POSIX, though which characters are recognized depends on
sys.getfilesystemencoding().

Note:

- `end=nl` is necessary because `print()` prints '\n' by default -- it
does not use `file.newline`
- `-0` option is required in the current implementation if filenames may
have a trailing whitespace. It can be improved
- SystemTextStream() handles filenames that are undecodable in the current
locale, i.e., non-ASCII names are allowed even in the C locale (LC_CTYPE=C)
- undecodable filenames are not supported on Windows. It is not clear
how to pass an undecodable filename via a pipe on Windows -- perhaps
`GetShortPathNameW -> fsencode -> pipe` might work in some cases. It
assumes that the short path exists and it is always encodable using
mbcs. If we can control all parts of the pipeline *and* Windows API
uses proper utf-16 (not ucs-2) then utf-8 can be used to pass
filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be
tried e.g., https://github.com/Drekin/win-unicode-console


--
Akira
Paul Moore
2014-07-22 17:35:58 UTC
Permalink
On 22 July 2014 17:05, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
> The example function handles Unicode whitespace to demonstrate why
> opaque bytes-based cookies can't be used to represent filenames in this
> case even on POSIX, though which characters are recognized depends on
> sys.getfilesystemencoding().

Thanks. That's how you'd do it now.

A question for the OP: how would the proposed change improve this code?
Paul
Akira Li
2014-07-22 23:48:06 UTC
Permalink
Paul Moore <p.f.moore-***@public.gmane.org> writes:

> On 22 July 2014 17:05, Akira Li
> <4kir4.1i-***@public.gmane.org> wrote:
>> The example function handles Unicode whitespace to demonstrate why
>> opaque bytes-based cookies can't be used to represent filenames in this
>> case even on POSIX, though which characters are recognized depends on
>> sys.getfilesystemencoding().
>
> Thanks. That's how you'd do it now.

You've cut too much e.g. I wrote in [1]:

>> io.TextIOWrapper() plays the role of open() in this case. The code
>> assumes that `newline` parameter accepts '\0'.

[1] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html

> A question for the OP: how would the proposed change improve this code?
> Paul

I'm not sure who is OP in this context but I can answer: the proposed
change might allow TextIOWrapper(.., newline='\0') and the code in [1]
doesn't support `-0` command-line parameter without it.


--
Akira
Paul Moore
2014-07-23 08:11:23 UTC
Permalink
On 23 July 2014 00:48, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
> I'm not sure who is OP in this context but I can answer: the proposed
> change might allow TextIOWrapper(.., newline='\0') and the code in [1]
> doesn't support `-0` command-line parameter without it.

I see. My apologies, I read that part but didn't spot what you meant.
Thanks for clarifying.
Andrew Barnert
2014-07-23 04:40:54 UTC
Permalink
On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i-***@public.gmane.org> wrote:

> Paul Moore <p.f.moore-***@public.gmane.org> writes:
>
>> On 21 July 2014 01:41, Andrew Barnert
>> <abarnert-/***@public.gmane.org> wrote:
>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>>> not a good thing to do, apologies); you can find it at
>>> http://bugs.python.org/file36008/pep-newline.txt
>>
>> As a suggestion, how about adding an example of a simple nul-separated
>> filename filter - the sort of thing that could go in a find -print0 |
>> xxx | xargs -0 pipeline? If I understand it, that's one of the key
>> motivating examples for this change, so seeing how it's done would be
>> a great help.
>>
>> Here's the sort of thing I mean, written for newline-separated files:
>>
>> import sys
>>
>> def process(filename):
>>     """Trivial example"""
>>     return filename.lower()
>>
>> if __name__ == '__main__':
>>     for filename in sys.stdin:
>>         filename = process(filename)
>>         print(filename)
>>
>> This is also an example of why I'm struggling to understand how an
>> open() parameter "solves all the cases". There's no explicit open()
>> call here, so how do you specify the record separator? Seeing how you
>> propose this would work would be really helpful to me.
>
> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
> can replace `sys.std*` streams without worrying about preserving
> `sys.__std*__` streams:
>
> #!/usr/bin/env python
> import io
> import re
> import sys
> from pathlib import Path
>
> def transform_filename(filename: str) -> str:  # example
>     """Normalize whitespace in basename."""
>     path = Path(filename)
>     new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
>     path.replace(new_path)  # rename on disk if necessary
>     return str(new_path)
>
> def SystemTextStream(bytes_stream, **kwargs):
>     encoding = sys.getfilesystemencoding()
>     return io.TextIOWrapper(bytes_stream,
>                             encoding=encoding,
>                             errors='surrogateescape' if encoding != 'mbcs' else 'strict',
>                             **kwargs)
>
> nl = '\0' if '-0' in sys.argv else None
> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>     print(transform_filename(line.rstrip(nl)), end=nl)

Nice, much more complete example than mine. I just tried to handle the edge cases the original question asked about, but you handle everything.

> io.TextIOWrapper() plays the role of open() in this case. The code
> assumes that `newline` parameter accepts '\0'.
>
> The example function handles Unicode whitespace to demonstrate why
> opaque bytes-based cookies can't be used to represent filenames in this
> case even on POSIX, though which characters are recognized depends on
> sys.getfilesystemencoding().
>
> Note:
>
> - `end=nl` is necessary because `print()` prints '\n' by default -- it
> does not use `file.newline`

Actually, yes it does. Or, rather, print pastes on a '\n', but sys.stdout.write translates any '\n' characters to sys.stdout.writenl (a private variable that's initialized from the newline argument at construction time if it's anything other than None or '').

But of course that's the newline argument to sys.stdout, and you only changed sys.stdin, so you do need end=nl anyway. (And you wouldn't want output translation here anyway, because that could also translate '\n' characters in the middle of a line, re-creating the same problem we're trying to avoid...)

But it uses sys.stdout.newline, not sys.stdin.newline.
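[Editorial aside: the write-side translation described above is easy to verify with an in-memory buffer. A quick sketch:]

```python
import io

# With newline='\r\n' (any specific string other than '' or '\n'),
# TextIOWrapper translates '\n' on write; '\r' passes through untouched.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding='ascii', newline='\r\n')
out.write('a\nb\r')
out.flush()
print(buf.getvalue())  # b'a\r\nb\r'
```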

> - `-0` option is required in the current implementation if filenames may
> have a trailing whitespace. It can be improved
>> - SystemTextStream() handles filenames that are undecodable in the
>> current locale, i.e., non-ASCII names are allowed even in the C locale
>> (LC_CTYPE=C)
> - undecodable filenames are not supported on Windows. It is not clear
> how to pass an undecodable filename via a pipe on Windows -- perhaps
> `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It
> assumes that the short path exists and it is always encodable using
> mbcs. If we can control all parts of the pipeline *and* Windows API
> uses proper utf-16 (not ucs-2) then utf-8 can be used to pass
> filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be
> tried e.g., https://github.com/Drekin/win-unicode-console

First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on top of it guarantee that you can never get such unencodable filenames (sometimes by just pretending the file doesn't exist, but if possible by having the filesystem map it to something valid, unique, and persistent for this session, usually the short name)?

Second, trying to solve this implies that you have some other native (as opposed to Cygwin) tool that passes or accepts such filenames over simple pipes (as opposed to PowerShell typed ones). Are there any? What does, say, mingw's find do with invalid filenames if it finds them?

On Unix, of course, it's a real problem.
Akira Li
2014-07-23 12:13:06 UTC
Permalink
Andrew Barnert <abarnert-/***@public.gmane.org> writes:

> On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
>
>> Paul Moore <p.f.moore-***@public.gmane.org> writes:
>>
>>> On 21 July 2014 01:41, Andrew Barnert
>>> <abarnert-/***@public.gmane.org> wrote:
>>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>>>> not a good thing to do, apologies); you can find it at
>>>> http://bugs.python.org/file36008/pep-newline.txt
>>>
>>> As a suggestion, how about adding an example of a simple nul-separated
>>> filename filter - the sort of thing that could go in a find -print0 |
>>> xxx | xargs -0 pipeline? If I understand it, that's one of the key
>>> motivating examples for this change, so seeing how it's done would be
>>> a great help.
>>>
>>> Here's the sort of thing I mean, written for newline-separated files:
>>>
>>> import sys
>>>
>>> def process(filename):
>>>     """Trivial example"""
>>>     return filename.lower()
>>>
>>> if __name__ == '__main__':
>>>     for filename in sys.stdin:
>>>         filename = process(filename)
>>>         print(filename)
>>>
>>> This is also an example of why I'm struggling to understand how an
>>> open() parameter "solves all the cases". There's no explicit open()
>>> call here, so how do you specify the record separator? Seeing how you
>>> propose this would work would be really helpful to me.
>>
>> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
>> can replace `sys.std*` streams without worrying about preserving
>> `sys.__std*__` streams:
>>
>> #!/usr/bin/env python
>> import io
>> import re
>> import sys
>> from pathlib import Path
>>
>> def transform_filename(filename: str) -> str:  # example
>>     """Normalize whitespace in basename."""
>>     path = Path(filename)
>>     new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
>>     path.replace(new_path)  # rename on disk if necessary
>>     return str(new_path)
>>
>> def SystemTextStream(bytes_stream, **kwargs):
>>     encoding = sys.getfilesystemencoding()
>>     return io.TextIOWrapper(bytes_stream,
>>                             encoding=encoding,
>>                             errors='surrogateescape' if encoding != 'mbcs' else 'strict',
>>                             **kwargs)
>>
>> nl = '\0' if '-0' in sys.argv else None
>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
>> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>>     print(transform_filename(line.rstrip(nl)), end=nl)
>
> Nice, much more complete example than mine. I just tried to handle as
> many edge cases as the original he asked about, but you handle
> everything.
>
>> io.TextIOWrapper() plays the role of open() in this case. The code
>> assumes that `newline` parameter accepts '\0'.
>>
>> The example function handles Unicode whitespace to demonstrate why
>> opaque bytes-based cookies can't be used to represent filenames in this
>> case even on POSIX, though which characters are recognized depends on
>> sys.getfilesystemencoding().
>>
>> Note:
>>
>> - `end=nl` is necessary because `print()` prints '\n' by default -- it
>> does not use `file.newline`
>
> Actually, yes it does. Or, rather, print pastes on a '\n', but
> sys.stdout.write translates any '\n' characters to sys.stdout.writenl
> (a private variable that's initialized from the newline argument at
> construction time if it's anything other than None or '').

You are right. I stopped reading the source for the print() function at
the `PyFile_WriteString("\n", file);` line, assuming that "\n" is not
translated if newline="\0". But the current behaviour, if "\0" were in
"the other legal values" category (like "\r"), would be to translate "\n"
[1]:

When writing output to the stream, if newline is None, any '\n'
characters written are translated to the system default line
separator, os.linesep. If newline is '' or '\n', no translation takes
place. If newline is any of the other legal values, any '\n'
characters written are translated to the given string.

[1] https://docs.python.org/3/library/io.html#io.TextIOWrapper

Example:

$ ./python -c 'import sys, io;
sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n");
sys.stdout.write("\n\r\r\n")'| xxd
0000000: 0d0a 0d0d 0d0a ......

"\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r").
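The same translation can be reproduced in memory (a small sketch; io.BytesIO stands in for the real stdout of the shell example):

```python
import io

raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding='ascii', newline='\r\n')
out.write('\n\r\r\n')
out.flush()
# '\n' -> b'\r\n'; a bare '\r' passes through untouched
assert raw.getvalue() == b'\r\n\r\r\r\n'   # xxd: 0d0a 0d0d 0d0a
```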

For the newline="\0" case to work, it should instead behave like the
newline='' or newline='\n' cases, i.e., no translation should take place,
to avoid corrupting embedded "\n" and "\r" characters. My original code
works as is in this case, i.e., *end=nl is still necessary*.

> But of course that's the newline argument to sys.stdout, and you only
> changed sys.stdin, so you do need end=nl anyway. (And you wouldn't
> want output translation here anyway, because that could also translate
> \n' characters in the middle of a line, re-creating the same problem
> we're trying to avoid...)
>
> But it uses sys.stdout.newline, not sys.stdin.newline.

The code affects *both* sys.stdout and sys.stdin. Look at [2]:

>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
>> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>>     print(transform_filename(line.rstrip(nl)), end=nl)

[2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html

>> - SystemTextStream() handles undecodable in the current locale filenames
>> i.e., non-ascii names are allowed even in C locale (LC_CTYPE=C)
>> - undecodable filenames are not supported on Windows. It is not clear
>> how to pass an undecodable filename via a pipe on Windows -- perhaps
>> `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It
>> assumes that the short path exists and it is always encodable using
>> mbcs. If we can control all parts of the pipeline *and* Windows API
>> uses proper utf-16 (not ucs-2) then utf-8 can be used to pass
>> filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be
>> tried e.g., https://github.com/Drekin/win-unicode-console
>
> First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on
> top of it guarantee that you can never get such unencodable filenames
> (sometimes by just pretending the file doesn't exist, but if possible
> by having the filesystem map it to something valid, unique, and
> persistent for this session, usually the short name)?
> Second, trying to solve this implies that you have some other native
> (as opposed to Cygwin) tool that passes or accepts such filenames over
> simple pipes (as opposed to PowerShell typed ones). Are there any?
> What does, say, mingw's find do with invalid filenames if it finds
> them?

In short: I don't know :)

To be clear, I'm talking about native Windows applications (not
find/xargs on Cygwin). The goal is to robustly process *arbitrary*
filenames on Windows via a pipe (SystemTextStream()) or a network (bytes
interface).

I know that the ANSI (A) API (and therefore the "POSIX-ish layer" that
uses narrow strings, such as main(), fopen(), and fstream) is broken,
e.g., for Thai filenames on a Greek computer [3]. The Unicode (W) API
should enforce UTF-16 in principle since Windows 2000 [4]. But I expect
UCS-2 to show its ugly head in many places, due to bad programming
practices (based on the common wrong assumption that Unicode == UTF-16 ==
UCS-2) and/or bugs that were not fixed due to MS' backwards-compatibility
policies in the past [5].

[3]
http://blog.gatunka.com/2014/04/25/character-encodings-for-modern-programmers/
[4] http://en.wikipedia.org/wiki/UTF-16#Use_in_major_operating_systems_and_environments
[5] http://blogs.msdn.com/b/oldnewthing/archive/2003/10/15/55296.aspx


--
Akira
Andrew Barnert
2014-07-23 15:49:19 UTC
Permalink
On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i-***@public.gmane.org> wrote:

> Andrew Barnert <abarnert-/***@public.gmane.org> writes:
>
>> On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
>>
>>> Paul Moore <p.f.moore-***@public.gmane.org> writes:
>>>
>>>> On 21 July 2014 01:41, Andrew Barnert
>>>> <abarnert-/***@public.gmane.org> wrote:
>>>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>>>>> not a good thing to do, apologies); you can find it at
>>>>> http://bugs.python.org/file36008/pep-newline.txt
>>>>
>>>> As a suggestion, how about adding an example of a simple nul-separated
>>>> filename filter - the sort of thing that could go in a find -print0 |
>>>> xxx | xargs -0 pipeline? If I understand it, that's one of the key
>>>> motivating examples for this change, so seeing how it's done would be
>>>> a great help.
>>>>
>>>> Here's the sort of thing I mean, written for newline-separated files:
>>>>
>>>> import sys
>>>>
>>>> def process(filename):
>>>> """Trivial example"""
>>>> return filename.lower()
>>>>
>>>> if __name__ == '__main__':
>>>>
>>>> for filename in sys.stdin:
>>>> filename = process(filename)
>>>> print(filename)
>>>>
>>>> This is also an example of why I'm struggling to understand how an
>>>> open() parameter "solves all the cases". There's no explicit open()
>>>> call here, so how do you specify the record separator? Seeing how you
>>>> propose this would work would be really helpful to me.
>>>
>>> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
>>> can replace `sys.std*` streams without worrying about preserving
>>> `sys.__std*__` streams:
>>>
>>> #!/usr/bin/env python
>>> import io
>>> import re
>>> import sys
>>> from pathlib import Path
>>>
>>> def transform_filename(filename: str) -> str: # example
>>> """Normalize whitespace in basename."""
>>> path = Path(filename)
>>> new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
>>> path.replace(new_path) # rename on disk if necessary
>>> return str(new_path)
>>>
>>> def SystemTextStream(bytes_stream, **kwargs):
>>> encoding = sys.getfilesystemencoding()
>>> return io.TextIOWrapper(bytes_stream,
>>> encoding=encoding,
>>> errors='surrogateescape' if encoding != 'mbcs' else 'strict',
>>> **kwargs)
>>>
>>> nl = '\0' if '-0' in sys.argv else None
>>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
>>> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>>> print(transform_filename(line.rstrip(nl)), end=nl)
>>
>> Nice, much more complete example than mine. I just tried to handle as
>> many edge cases as the original he asked about, but you handle
>> everything.
>>
>>> io.TextIOWrapper() plays the role of open() in this case. The code
>>> assumes that `newline` parameter accepts '\0'.
>>>
>>> The example function handles Unicode whitespace to demonstrate why
>>> opaque bytes-based cookies can't be used to represent filenames in this
>>> case even on POSIX, though which characters are recognized depends on
>>> sys.getfilesystemencoding().
>>>
>>> Note:
>>>
>>> - `end=nl` is necessary because `print()` prints '\n' by default -- it
>>> does not use `file.newline`
>>
>> Actually, yes it does. Or, rather, print pastes on a '\n', but
>> sys.stdout.write translates any '\n' characters to sys.stdout.writenl
>> (a private variable that's initialized from the newline argument at
>> construction time if it's anything other than None or '').
>
> You are right. I've stopped reading the source for print() function at
> `PyFile_WriteString("\n", file);` line assuming that "\n" is not
> translated if newline="\0". But the current behaviour if "\0" were in
> "the other legal values" category (like "\r") would be to translate "\n"
> [1]:
>
> When writing output to the stream, if newline is None, any '\n'
> characters written are translated to the system default line
> separator, os.linesep. If newline is '' or '\n', no translation takes
> place. If newline is any of the other legal values, any '\n'
> characters written are translated to the given string.
>
> [1] https://docs.python.org/3/library/io.html#io.TextIOWrapper
>
> Example:
>
> $ ./python -c 'import sys, io;
> sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n");
> sys.stdout.write("\n\r\r\n")'| xxd
> 0000000: 0d0a 0d0d 0d0a ......
>
> "\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r").
>
> In order to newline="\0" case to work, it should behave similar to
> newline='' or newline='\n' case instead i.e., no translation should take
> place, to avoid corrupting embed "\n\r" characters.

The draft PEP discusses this. I think it would be more consistent to translate for \0, just like \r and \r\n.

For your script, there is no reason to pass newline=nl to the stdout replacement. The only effect that has on output is \n replacement, which you don't want. And if we removed that effect from the proposal, it would have no effect at all on output, so why pass it?

Do you have a use case where you need to pass a non-standard newline to a text file/stream, but don't want newline replacement? Or is it just a matter of avoiding confusion if people accidentally pass it for stdout when they didn't want it?

> My original code
> works as is in this case i.e., *end=nl is still necessary*.

>> But of course that's the newline argument to sys.stdout, and you only
>> changed sys.stdin, so you do need end=nl anyway. (And you wouldn't
>> want output translation here anyway, because that could also translate
>> \n' characters in the middle of a line, re-creating the same problem
>> we're trying to avoid...)
>>
>> But it uses sys.stdout.newline, not sys.stdin.newline.
>
> The code affects *both* sys.stdout/sys.stdin. Look [2]:

I didn't notice that you passed it for stdout as well--as I explained above, you don't need it, and shouldn't do it.

As a side note, I think it might have been a better design to have separate arguments for input newline, output newline, and universal newlines mode, instead of cramming them all into one argument; for some simple cases the current design makes things a little less verbose, but it gets in the way for more complex cases, even today with \r or \r\n. However, I don't think that needs to be changed as part of this proposal.

It also might be nice to have a full set of PYTHONIOFOO env variables rather than just PYTHONIOENCODING, but again, I don't think that needs to be part of this proposal. And likewise for Nick Coghlan's rewrap method proposal on TextIOWrapper and maybe BufferedFoo.

>>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
>>> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>>> print(transform_filename(line.rstrip(nl)), end=nl)
>
> [2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html
>
>>> - SystemTextStream() handles undecodable in the current locale filenames
>>> i.e., non-ascii names are allowed even in C locale (LC_CTYPE=C)
>>> - undecodable filenames are not supported on Windows. It is not clear
>>> how to pass an undecodable filename via a pipe on Windows -- perhaps
>>> `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It
>>> assumes that the short path exists and it is always encodable using
>>> mbcs. If we can control all parts of the pipeline *and* Windows API
>>> uses proper utf-16 (not ucs-2) then utf-8 can be used to pass
>>> filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be
>>> tried e.g., https://github.com/Drekin/win-unicode-console
>>
>> First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on
>> top of it guarantee that you can never get such unencodable filenames
>> (sometimes by just pretending the file doesn't exist, but if possible
>> by having the filesystem map it to something valid, unique, and
>> persistent for this session, usually the short name)?
>> Second, trying to solve this implies that you have some other native
>> (as opposed to Cygwin) tool that passes or accepts such filenames over
>> simple pipes (as opposed to PowerShell typed ones). Are there any?
>> What does, say, mingw's find do with invalid filenames if it finds
>> them?
>
> In short: I don't know :)
>
> To be clear, I'm talking about native Windows applications (not
> find/xargs on Cygwin). The goal is to process robustly *arbitrary*
> filenames on Windows via a pipe (SystemTextStream()) or network (bytes
> interface).

Yes, I assumed that, I just wanted to make that clear.

My point is that if there isn't already an ecosystem of tools that do so on Windows, or a recommended answer from Microsoft, we don't need to fit into existing practices here. (Actually, there _is_ a recommended answer from Microsoft, but it's "don't send encoded filenames over a binary stream, send them as an array of UTF-16 strings over PowerShell cmdlet typed pipes"--and, more generally, "don't use any ANSI interfaces except for backward compatibility reasons".)

At any rate, if the filenames-over-pipes encoding problem exists on Windows, and if it's solvable, it's still outside the scope of this proposal, unless you think the documentation needs a completely worked example that shows how to interact with some Windows tool, alongside one for interacting with find -print0 on Unix. (And I don't think it does. If we want a Windows example, resource compiler string input files, which are \0-terminated UTF-16, probably serve better.)

> I know that (A)nsi API (and therefore "POSIX-ish layer" that uses narrow
> strings such main(), fopen(), fstream is broken e.g., Thai filenames on
> Greek computer [3].

Yes, and broken in a way that people cannot easily work around except by using the UTF-16 interfaces. That's been Microsoft's recommended answer to the problem since NT 3.5, Win 95, and MSVCRT 3: if you want to handle all filenames, use _wmain, _wfopen, etc.--or, better, use CreateFileW instead of fopen. They never really addressed the issue of passing filenames between command-line tools at all, until PowerShell, where you pass them as a list of UTF-16 strings rather than a stream of newline-separated encoded bytes. (As a side note, I have no idea how well Python works for writing PowerShell cmdlets, but I don't think that's relevant to the current proposal.)

> Unicode (W) API should enforce utf-16 in principle
> since Windows 2000 [4]. But I expect ucs-2 shows its ugly head in many
> places due to bad programming practices (based on the common wrong
> assumption that Unicode == UTF-16 == UCS-2) and/or bugs that are not
> fixed due to MS' backwards compatibility policies in the past [5].

Yes, I've run into such bugs in the past. It's even more fun when you're dealing with unterminated strings and separate length interfaces. Fortunately, as far as I know, no such bugs affect reading and writing binary files, pipes, and sockets, so they don't affect us here.
Akira Li
2014-07-24 09:07:59 UTC
Permalink
Andrew Barnert <abarnert-/***@public.gmane.org> writes:

> On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
>> Andrew Barnert <abarnert-/***@public.gmane.org> writes:
>>> On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
>>>> Paul Moore <p.f.moore-***@public.gmane.org> writes:
>>>>> On 21 July 2014 01:41, Andrew Barnert
>>>>> <abarnert-/***@public.gmane.org> wrote:
>>>>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>>>>>> not a good thing to do, apologies); you can find it at
>>>>>> http://bugs.python.org/file36008/pep-newline.txt
>>>>>
>>>>> As a suggestion, how about adding an example of a simple nul-separated
>>>>> filename filter - the sort of thing that could go in a find -print0 |
>>>>> xxx | xargs -0 pipeline? If I understand it, that's one of the key
>>>>> motivating examples for this change, so seeing how it's done would be
>>>>> a great help.
>>>>
>>>> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
>>>> can replace `sys.std*` streams without worrying about preserving
>>>> `sys.__std*__` streams:
>>>>
>>>> #!/usr/bin/env python
>>>> import io
>>>> import re
>>>> import sys
>>>> from pathlib import Path
>>>>
>>>> def transform_filename(filename: str) -> str: # example
>>>> """Normalize whitespace in basename."""
>>>> path = Path(filename)
>>>> new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
>>>> path.replace(new_path) # rename on disk if necessary
>>>> return str(new_path)
>>>>
>>>> def SystemTextStream(bytes_stream, **kwargs):
>>>> encoding = sys.getfilesystemencoding()
>>>> return io.TextIOWrapper(bytes_stream,
>>>> encoding=encoding,
>>>> errors='surrogateescape' if encoding != 'mbcs' else 'strict',
>>>> **kwargs)
>>>>
>>>> nl = '\0' if '-0' in sys.argv else None
>>>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
>>>> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>>>> print(transform_filename(line.rstrip(nl)), end=nl)
>>>
>>> Nice, much more complete example than mine. I just tried to handle as
>>> many edge cases as the original he asked about, but you handle
>>> everything.
>>>>
>>>> io.TextIOWrapper() plays the role of open() in this case. The code
>>>> assumes that `newline` parameter accepts '\0'.
>>>>
>>>> The example function handles Unicode whitespace to demonstrate why
>>>> opaque bytes-based cookies can't be used to represent filenames in this
>>>> case even on POSIX, though which characters are recognized depends on
>>>> sys.getfilesystemencoding().
>>>>
>>>> Note:
>>>>
>>>> - `end=nl` is necessary because `print()` prints '\n' by default -- it
>>>> does not use `file.newline`
>>>
>>> Actually, yes it does. Or, rather, print pastes on a '\n', but
>>> sys.stdout.write translates any '\n' characters to sys.stdout.writenl
>>> (a private variable that's initialized from the newline argument at
>>> construction time if it's anything other than None or '').
>>
>> You are right. I've stopped reading the source for print() function at
>> `PyFile_WriteString("\n", file);` line assuming that "\n" is not
>> translated if newline="\0". But the current behaviour if "\0" were in
>> "the other legal values" category (like "\r") would be to translate "\n"
>> [1]:
>>
>> When writing output to the stream, if newline is None, any '\n'
>> characters written are translated to the system default line
>> separator, os.linesep. If newline is '' or '\n', no translation takes
>> place. If newline is any of the other legal values, any '\n'
>> characters written are translated to the given string.
>>
>> [1] https://docs.python.org/3/library/io.html#io.TextIOWrapper
>>
>> Example:
>>
>> $ ./python -c 'import sys, io;
>> sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n");
>> sys.stdout.write("\n\r\r\n")'| xxd
>> 0000000: 0d0a 0d0d 0d0a ......
>>
>> "\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r").
>>
>> In order to newline="\0" case to work, it should behave similar to
>> newline='' or newline='\n' case instead i.e., no translation should take
>> place, to avoid corrupting embed "\n\r" characters.
>
> The draft PEP discusses this. I think it would be more consistent to
> translate for \0, just like \r and \r\n.

I read the [draft]. No translation is the better choice here. Otherwise
(at the very least) it breaks the `find -print0` use case.

[draft] http://bugs.python.org/file36008/pep-newline.txt

Simple things should be simple (i.e., no translation except in special cases):

- binary file -- a stream of bytes: no structure, no translation on
read/write
- text file -- a stream of Unicode codepoints
- file with fixed-length chunks:

      for chunk in iter(partial(file.read, chunksize), EOF):
          pass

- file with variable-length records (aka lines) which end with a
  separator or EOF: no translation, no escaping (no embedded separators):

      for line in file:
          pass

  or

      line = file.readline()  # next(file)

newline in {None, '', '\r', '\r\n'} is a (very important) special case
that represents the complicated legacy behavior for text files.

newline='\0' (like '\n') should be a *much simpler* case: no
translation on read/write, no escaping (no embedded '\0'; each '\0' in
the stream is a separator).

newline='\0' is simple to explain: readline/next return everything until
the next '\0' (including it) or EOF. It is simple to implement - no
translation is required.
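The proposed semantics can be sketched in pure Python (an illustration only, not the actual io implementation; the 4096-character chunk size is arbitrary):

```python
import io

def iterlines(stream, sep='\0'):
    """Yield records terminated by `sep`; the separator is kept,
    mirroring how readline() keeps a trailing '\\n'."""
    buf = ''
    for chunk in iter(lambda: stream.read(4096), ''):
        parts = (buf + chunk).split(sep)
        for part in parts[:-1]:
            yield part + sep
        buf = parts[-1]
    if buf:                      # last record may end at EOF without sep
        yield buf

records = list(iterlines(io.StringIO('a\0b\nc\0d')))
assert records == ['a\0', 'b\nc\0', 'd']   # embedded '\n' left untranslated
```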

A readline(keep_end=True) keyword-only parameter and/or a chomp()-like
method could be added to simplify removing a trailing newline.
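A chomp()-style helper of the kind suggested could look like this (hypothetical name and signature, borrowed from Perl's chomp):

```python
def chomp(line, sep='\n'):
    # Remove exactly one trailing separator, if present.
    return line[:-len(sep)] if sep and line.endswith(sep) else line

assert chomp('name\0', sep='\0') == 'name'
assert chomp('name', sep='\0') == 'name'    # nothing to remove
assert chomp('a\0\0', sep='\0') == 'a\0'    # only one separator stripped
```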

newline in {"\N{NEL}", "\n\n", "\r\r", "\n\r"} behaves like newline="\n",
i.e., no translation. New *docs for writing text files*:

When writing output to the stream:

- if newline is None, any '\n' characters written are translated to
the system default line separator, os.linesep
- if newline is '\r' or '\r\n', any '\n' characters written are
translated to the given string
- no translation takes place for any other newline value.

The docs for binary files are simpler:

No translation takes place for any newline value. The line terminator
is newline parameter (default is b'\n').

The new *docs for reading text files*:

When reading input from the stream:

- if newline is None, universal newlines mode is enabled: lines in the
input can end in '\n', '\r', or '\r\n', and these are translated
into '\n' before being returned to the caller
- if newline is '', universal newlines mode is enabled, but line
endings are returned to the caller untranslated
- if newline is any other value, input lines are only terminated by
the given string, and the line ending is returned to the caller
untranslated.

The new behavior, while more powerful, is no more complex than the old one:
https://docs.python.org/3.4/library/io.html#io.TextIOWrapper

Backwards compatibility is preserved except that newline parameter
accepts more values.

> For the your script, there is no reason to pass newline=nl to the
> stdout replacement. The only effect that has on output is \n
> replacement, which you don't want. And if we removed that effect from
> the proposal, it would have no effect at all on output, so why pass
> it?

Keep in mind, I expect that newline='\0' does *not* translate '\n' to
'\0'. If you remove newline=nl then embedded \n characters might be
corrupted, i.e., it breaks the `find -print0` use case. Both newline=nl
for stdout and end=nl are required here. Though (optionally) it would be
nice to change `print()` so that it would use `end=file.newline or '\n'`
by default instead.
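That suggested print() default could be sketched as a wrapper (hypothetical: real file objects do not currently expose the `newline` they were constructed with, so getattr() falls back to '\n'):

```python
import io
import sys

def print_sep(*args, file=None, **kwargs):
    # Hypothetical wrapper: default `end` to the stream's record
    # separator, per the `end=file.newline or '\n'` suggestion.
    if file is None:
        file = sys.stdout
    kwargs.setdefault('end', getattr(file, 'newline', None) or '\n')
    print(*args, file=file, **kwargs)

class NulStream(io.StringIO):
    newline = '\0'          # stand-in for the proposed attribute

buf = NulStream()
print_sep('filename', file=buf)
assert buf.getvalue() == 'filename\0'
```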

There is also line_buffering parameter. From the docs:

If line_buffering is True, flush() is implied when a call to write
contains a newline character.

i.e., you might also need newline=nl to flush() the stream in time.

For example, the absence of the flush() call on a newline may lead to a
deadlock if the subprocess module is used to implement pexpect-like
behavior. There are corresponding Python issues:

- text mode http://bugs.python.org/issue21332 : add line_buffering=True
if bufsize=1, to avoid a deadlock (regression from Python 2 behavior)

- binary mode http://bugs.python.org/issue21471 : implement
line_buffering=True behavior for binary files when bufsize=1

> Do you have a use case where you need to pass a non-standard newline
> to a text file/stream, but don't want newline replacement?

`find -print0` use case that my code implements above.

> Or is it just a matter of avoiding confusion if people accidentally
> pass it for stdout when they didn't want it?

See the explanation above that starts with "Simple things should be simple."

>> My original code
>> works as is in this case i.e., *end=nl is still necessary*.
>
>>> But of course that's the newline argument to sys.stdout, and you only
>>> changed sys.stdin, so you do need end=nl anyway. (And you wouldn't
>>> want output translation here anyway, because that could also translate
>>> \n' characters in the middle of a line, re-creating the same problem
>>> we're trying to avoid...)
>>>
>>> But it uses sys.stdout.newline, not sys.stdin.newline.
>>
>> The code affects *both* sys.stdout/sys.stdin. Look [2]:
>
> I didn't notice that you passed it for stdout as well--as I explained
> above, you don't need it, and shouldn't do it.

Both newline=nl and end=nl are needed because I assume that there is no
newline translation in the newline='\0' case. See the explanation
above. Here's the same code for context:

sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
for line in SystemTextStream(sys.stdin.detach(), newline=nl):
    print(transform_filename(line.rstrip(nl)), end=nl)

[2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html

> As a side note, I think it might have been a better design to have
> separate arguments for input newline, output newline, and universal
> newlines mode, instead of cramming them all into one argument; for
> some simple cases the current design makes things a little less
> verbose, but it gets in the way for more complex cases, even today
> with \r or \r\n. However, I don't think that needs to be changed as
> part of this proposal.

Usually different objects are used for input and output, i.e., a single
newline parameter already allows input newlines to differ from output
newlines.

The newline behavior for reading and writing is different but it is
closely related. Having two parameters wouldn't make the documentation
simpler.

Separate parameters might be useful if the same file object is used for
reading and writing *and* input/output newlines are different from each
other. But I don't think it is worth it to complicate the common case
(separate objects).


--
Akira
Andrew Barnert
2014-07-25 18:29:11 UTC
Permalink
On Thursday, July 24, 2014 2:08 AM, Akira Li <4kir4.1i-***@public.gmane.org> wrote:

> Andrew Barnert <abarnert-/***@public.gmane.org> writes:
>
>> On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
>>> In order to newline="\0" case to work, it should behave similar to
>>> newline='' or newline='\n' case instead i.e., no translation should take
>>> place, to avoid corrupting embed "\n\r" characters.
>>
>> The draft PEP discusses this. I think it would be more consistent to
>> translate for \0, just like \r and \r\n.
>
> I read the [draft]. No translation is a better choice here. Otherwise
> (at the very least) it breaks `find -print0` use case.

No it doesn't. The only reason it breaks your code is that you add newline='\0' to your stdout wrapper as well as your stdin wrapper. If you just passed '', it would not do anything. And this is exactly parallel with the existing case with, e.g., trying to pass through a classic-Mac file full of '\r'-delimited strings that might contain embedded '\n' characters that you don't want to translate.

As I've said before, I don't really like the design for '\r' and '\r\n', or the fact that three separate notions (universal-newlines flag, line ending for readline, and output translation for write) are all conflated into one idea and crammed into one parameter, but I think it's probably too late and too radical to change that.

(It's less of an issue for binary files, because binary files can't take a newline parameter at all today, and because "no output translation" has been part of the definition of what "binary file" means all the way back to Python 1.x.)


> Backwards compatibility is preserved except that newline parameter
> accepts more values.

The same is true with the draft proposal. You've basically copied the exact same thing, except for what happens on output for newlines other than None, '', '\n', '\r', and '\r\n' in text files. Since that case cannot arise today, there are no backward compatibility issues. Your version is only a small change to the documentation and a small change to the code, but my version is an even smaller change to the documentation and no change to the code, so you can't argue this from a conservative point of view.

>
>> For the your script, there is no reason to pass newline=nl to the
>> stdout replacement. The only effect that has on output is \n
>> replacement, which you don't want. And if we removed that effect from
>> the proposal, it would have no effect at all on output, so why pass
>> it?
>
> Keep in mind, I expect that newline='\0' does *not* translate '\n' to
> '\0'. If you remove newline=nl then embed \n might be corrupted

No, it's only corrupted if you _pass_ newline=nl. If you instead passed, e.g., newline='', nothing could possibly be corrupted.

> i.e., it breaks `find -print0` use-case. Both newline=nl for stdout and
> end=nl are required here. Though (optionally) it would be nice to change
> `print()` so that it would use `end=file.newline or '\n'` by default
> instead.

That might be a nice change; I'll mention it in the next draft. But I think it's better to keep the changes as small and conservative as possible, so unless there's an upswell of support for it, I think anything that isn't actually necessary to solving the problem should be left out.

> There is also line_buffering parameter. From the docs:
>
>   If line_buffering is True, flush() is implied when a call to write
>   contains a newline character.

The way this is actually defined seems broken to me; IIRC (I'll check the code later) it flushes on any '\r', and on any translated '\n'. So, it's doing the wrong thing with '\r' in most modes, and with '\n' in '' mode on non-Unix systems. So my thought was, just leave it broken.

But now that I think about it, the existing code can only flush excessively, never insufficiently, and that's probably a property worth preserving. So maybe there _is_ a reason to pass newline for output without translation after all. In other words, the parameter may actually conflate _four_ things, not just three...

I'll need to think this through (and reread the code) this weekend; thanks for bringing it up.

>> Do you have a use case where you need to pass a non-standard newline
>> to a text file/stream, but don't want newline replacement?
>
> `find -print0` use case that my code implements above.
>
>> Or is it just a matter of avoiding confusion if people accidentally
>> pass it for stdout when they didn't want it?
>
> See the explanation above that starts with "Simple things should be
> simple."

I still don't understand your point here, and just repeating it isn't helping. You're making simple things _less_ simple than they are in the draft, requiring slightly more change to the documentation and to the code and slightly more for people to understand just to allow them to pass an unnecessary parameter. That doesn't sound like an argument from simplicity to me.

But line_buffering definitely might be a good argument, in which case it doesn't matter how good this one is.
Nick Coghlan
2014-07-25 23:28:23 UTC
Permalink
On 26 Jul 2014 04:33, "Andrew Barnert" <abarnert-/***@public.gmane.org>
wrote:
> As I've said before, I don't really like the design for '\r' and '\r\n',
or the fact that three separate notions (universal-newlines flag, line
ending for readline, and output translation for write) are all conflated
into one idea and crammed into one parameter, but I think it's probably too
late and too radical to change that.

It's potentially still worth spelling out that idea as a Rejected
Alternative in the PEP. A draft design that separates them may help clarify
the concepts being conflated more effectively than simply describing them,
even if your own pragmatic assessment is "too much pain for not enough
gain".

Cheers,
Nick.
Akira Li
2014-07-26 02:24:16 UTC
Permalink
Nick Coghlan <ncoghlan-***@public.gmane.org> writes:

> On 26 Jul 2014 04:33, "Andrew Barnert"
> <abarnert-/***@public.gmane.org>
> wrote:
>> As I've said before, I don't really like the design for '\r' and '\r\n',
> or the fact that three separate notions (universal-newlines flag, line
> ending for readline, and output translation for write) are all conflated
> into one idea and crammed into one parameter, but I think it's probably too
> late and too radical to change that.
>
> It's potentially still worth spelling out that idea as a Rejected
> Alternative in the PEP. A draft design that separates them may help clarify
> the concepts being conflated more effectively than simply describing them,
> even if your own pragmatic assessment is "too much pain for not enough
> gain".
>

It can't be in the rejected ideas because it is the current behavior for
io.TextIOWrapper(newline=..) and it will never change (in Python 3) due
to backward compatibility.

As I understand it, Andrew doesn't like that the *newline* parameter does too
much:

- *newline* parameter turns on/off universal newline mode
- it may specify the line separator e.g., newline='\r'
- it specifies whether newline translation happens e.g., newline=''
turns it off
- together with *line_buffering*, it may enable flush() if newline is
written
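For reference, the first three of those behaviors can be seen directly with today's io.TextIOWrapper (a small demonstration using in-memory streams):

```python
import io

# newline='' turns translation off, but readline still splits on any of \n, \r, \r\n
no_translate = list(io.TextIOWrapper(io.BytesIO(b'a\r\nb\rc\n'),
                                     encoding='ascii', newline=''))
print(no_translate)    # ['a\r\n', 'b\r', 'c\n']

# newline='\r' makes '\r' the only input line separator (still no translation)
cr_lines = list(io.TextIOWrapper(io.BytesIO(b'a\r\nb\rc\n'),
                                 encoding='ascii', newline='\r'))
print(cr_lines)        # ['a\r', '\nb\r', 'c\n']

# ...and on output, any written '\n' is translated to the given string
out = io.BytesIO()
w = io.TextIOWrapper(out, encoding='ascii', newline='\r')
w.write('x\ny\n')
w.flush()
print(out.getvalue())  # b'x\ry\r'
```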


It is unrelated to my proposal [1] that shouldn't change the old
behavior if newline in {None, '', '\n', '\r', '\r\n'}.

[1] http://bugs.python.org/issue1152248#msg224016


--
Akira
Andrew Barnert
2014-07-26 04:03:30 UTC
Permalink
On Jul 25, 2014, at 19:24, Akira Li <4kir4.1i-***@public.gmane.org> wrote:

> Nick Coghlan <ncoghlan-***@public.gmane.org> writes:
>
>> On 26 Jul 2014 04:33, "Andrew Barnert"
>> <abarnert-/***@public.gmane.org>
>> wrote:
>>> As I've said before, I don't really like the design for '\r' and '\r\n',
>> or the fact that three separate notions (universal-newlines flag, line
>> ending for readline, and output translation for write) are all conflated
>> into one idea and crammed into one parameter, but I think it's probably too
>> late and too radical to change that.
>>
>> It's potentially still worth spelling out that idea as a Rejected
>> Alternative in the PEP. A draft design that separates them may help clarify
>> the concepts being conflated more effectively than simply describing them,
>> even if your own pragmatic assessment is "too much pain for not enough
>> gain".
>
> It can't be in the rejected ideas because it is the current behavior for
> io.TextIOWrapper(newline=..) and it will never change (in Python 3) due
> to backward compatibility.

That's exactly why changing it would be a "rejected idea". It certainly doesn't hurt to document the fact that we thought about it and decided not to change it for backward compatibility reasons.

> As I understand it, Andrew doesn't like that the *newline* parameter does too
> much:
>
> - *newline* parameter turns on/off universal newline mode
> - it may specify the line separator e.g., newline='\r'
> - it specifies whether newline translation happens e.g., newline=''
> turns it off
> - together with *line_buffering*, it may enable flush() if newline is
> written

Exactly. And the fourth one only indirectly; "newline" flushing doesn't exactly mean _either_ of "\n" or the newline argument. And the related-but-definitely-not-the-same newlines attribute makes it even more confusing. (I've found bug reports with both Guido and Nick confused into thinking that newline was available as an attribute after construction; what hope do the rest of us have?)
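The mismatch is easy to demonstrate: after construction, only the read-side newlines attribute exists, and it reports the line endings actually seen so far, not the constructor argument:

```python
import io

# newline=None (the default): universal newlines with translation
f = io.TextIOWrapper(io.BytesIO(b'a\nb\r\nc'), encoding='ascii')
f.read()
print(f.newlines)             # the kinds of endings seen, e.g. ('\r\n', '\n')
print(hasattr(f, 'newline'))  # False -- the constructor argument isn't exposed
```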

But the reality is, it rarely affects real-life programs, so it's definitely not worth breaking compatibility over. And it's still a whole lot cleaner than the 2.x design despite having a lot more details to deal with.
Akira Li
2014-07-26 02:13:24 UTC
Permalink
I've added a patch that demonstrates "no translation" for alternative
newlines behavior http://bugs.python.org/issue1152248#msg224016

Andrew Barnert
<***@yahoo.com.dmarc.invalid> writes:

> On Thursday, July 24, 2014 2:08 AM, Akira Li
> <***@gmail.com> wrote:
>
>> > Andrew Barnert <***@yahoo.com> writes:
>>
>>> On Jul 23, 2014, at 5:13, Akira Li
>>> <***@gmail.com> wrote:
>>>> In order to newline="\0" case to work, it should behave
>>>> similar to newline='' or newline='\n' case instead, i.e., no
>>>> translation should take place, to avoid corrupting embedded
>>>> "\n\r" characters.
>>>
>>> The draft PEP discusses this. I think it would be more consistent to
>>> translate for \0, just like \r and \r\n.
>>
>> I read the [draft]. No translation is a better choice here. Otherwise
>> (at the very least) it breaks `find -print0` use case.
>
> No it doesn't. The only reason it breaks your code is that you add
> newline='\0' to your stdout wrapper as well as your stdin wrapper. If
> you just passed '', it would not do anything. And this is exactly
> parallel with the existing case with, e.g., trying to pass through a
> classic-Mac file full of '\r'-delimited strings that might contain
> embedded '\n' characters that you don't want to translate.

I won't repeat it several times, but as you've already found out,
newline='\0' for stdout can (at the very least) be useful for
line_buffering=True behavior.

...
>> There is also line_buffering parameter. From the docs:
>>
>>   If line_buffering is True, flush() is implied when a call to write
>>   contains a newline character.
>
> The way this is actually defined seems broken to me; IIRC (I'll check
> the code later) it flushes on any '\r', and on any translated
> '\n'. So, it's doing the wrong thing with '\r' in most modes, and with
> '\n' in '' mode on non-Unix systems. So my thought was, just leave it
> broken.

Yes. I've found at least one issue http://bugs.python.org/issue22069

> But now that I think about it, the existing code can only flush
> excessively, never insufficiently, and that's probably a property
> worth preserving. So maybe there _is_ a reason to pass newline for
> output without translation after all. In other words, the parameter
> may actually conflate _four_ things, not just three...
>
> I'll need to think this through (and reread the code) this weekend;
> thanks for bringing it up.


--
Akira
Andrew Barnert
2014-07-26 04:09:41 UTC
Permalink
On Jul 25, 2014, at 19:13, Akira Li <4kir4.1i-***@public.gmane.org> wrote:

> I've added a patch that demonstrates "no translation" for alternative
> newlines behavior http://bugs.python.org/issue1152248#msg224016

Having taken a better look at the line buffering code, I now agree with you that this is necessary; otherwise we'd have to make a much bigger change to the implementation (which I don't think we want).

When I update the draft PEP I'll change that and add a rationale (this also makes the rationale for "no translation for binary files" and for "only readnl is exposed, not writenl" a lot simpler).

I'll also change it in my C patch (which I hope to be able to clean up and upload this weekend).

> Andrew Barnert
> <abarnert-/***@public.gmane.org> writes:
>
>> On Thursday, July 24, 2014 2:08 AM, Akira Li
>> <4kir4.1i-***@public.gmane.org> wrote:
>>
>>>> Andrew Barnert <abarnert-/***@public.gmane.org> writes:
>>>
>>>> On Jul 23, 2014, at 5:13, Akira Li
>>>> <4kir4.1i-***@public.gmane.org> wrote:
>>>>> In order to newline="\0" case to work, it should behave
>>>>> similar to newline='' or newline='\n' case instead, i.e., no
>>>>> translation should take place, to avoid corrupting embedded
>>>>> "\n\r" characters.
>>>>
>>>> The draft PEP discusses this. I think it would be more consistent to
>>>> translate for \0, just like \r and \r\n.
>>>
>>> I read the [draft]. No translation is a better choice here. Otherwise
>>> (at the very least) it breaks `find -print0` use case.
>>
>> No it doesn't. The only reason it breaks your code is that you add
>> newline='\0' to your stdout wrapper as well as your stdin wrapper. If
>> you just passed '', it would not do anything. And this is exactly
>> parallel with the existing case with, e.g., trying to pass through a
>> classic-Mac file full of '\r'-delimited strings that might contain
>> embedded '\n' characters that you don't want to translate.
>
> I won't repeat it several times but as you've already found out newline='\0'
> for stdout (at the very least) can be useful for line_buffering=True
> behavior.
>
> ...
>>> There is also line_buffering parameter. From the docs:
>>>
>>> If line_buffering is True, flush() is implied when a call to write
>>> contains a newline character.
>>
>> The way this is actually defined seems broken to me; IIRC (I'll check
>> the code later) it flushes on any '\r', and on any translated
>> '\n'. So, it's doing the wrong thing with '\r' in most modes, and with
>> '\n' in '' mode on non-Unix systems. So my thought was, just leave it
>> broken.
>
> Yes. I've found at least one issue http://bugs.python.org/issue22069
>
>> But now that I think about it, the existing code can only flush
>> excessively, never insufficiently, and that's probably a property
>> worth preserving. So maybe there _is_ a reason to pass newline for
>> output without translation after all. In other words, the parameter
>> may actually conflate _four_ things, not just three...
>>
>> I'll need to think this through (and reread the code) this weekend;
>> thanks for bringing it up.
>
>
> --
> Akira
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas-+ZN9ApsXKcEdnm+***@public.gmane.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
Andrew Barnert
2014-07-23 04:24:12 UTC
Permalink

On Jul 21, 2014, at 0:04, Paul Moore <p.f.moore-***@public.gmane.org> wrote:

> On 21 July 2014 01:41, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
>> OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt
>
> As a suggestion, how about adding an example of a simple nul-separated
> filename filter - the sort of thing that could go in a find -print0 |
> xxx | xargs -0 pipeline? If I understand it, that's one of the key
> motivating examples for this change, so seeing how it's done would be
> a great help.
>
> Here's the sort of thing I mean, written for newline-separated files:
>
> import sys
>
> def process(filename):
>     """Trivial example"""
>     return filename.lower()
>
> if __name__ == '__main__':
>
>     for filename in sys.stdin:
>         filename = process(filename)
>         print(filename)

for filename in io.TextIOWrapper(sys.stdin.buffer, encoding=sys.stdin.encoding, errors=sys.stdin.errors, newline='\0'):
    filename = process(filename.rstrip('\0'))
    print(filename)

I assume you wanted an rstrip('\n') in the original, so I did the equivalent here.

If you want to pipe the result to another -0 tool, you also need to add end='\0' to the print, of course.

If we had Nick Coghlan's separate idea of adding rewrap methods to the stream classes (not part of this proposal, but I would be happy to have it), it would be even simpler:

for filename in sys.stdin.rewrap(newline='\0'):
    filename = process(filename.rstrip('\0'))
    print(filename)

Anyway, this isn't perfect if, e.g., you might have illegal-as-UTF8 Latin-1 filenames hiding in your UTF8 filesystem, but neither is your code; in fact, this does exactly the same thing, except that it takes \0 terminators (so it can handle filenames with embedded newlines, or pipelines that use -print0 just because they can't be sure which tools in the chain can handle spaces).

It's obviously a little more complicated than your code, but that's to be expected; it's a lot simpler than anything we can write today. (And it runs at the same speed as your code instead of 2x slower or worse.)

> This is also an example of why I'm struggling to understand how an
> open() parameter "solves all the cases". There's no explicit open()
> call here, so how do you specify the record separator? Seeing how you
> propose this would work would be really helpful to me.

The open function is just a shortcut to constructing a stack of io classes; you can always construct them manually. It would be nice if some cases of that were made a little easier (again, see Nick's proposal above), but it's easy enough to live with.
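For readers following along, here is roughly the stack that open(path, 'r', encoding='utf-8') builds, assembled by hand (the temp file is just scaffolding to keep the sketch runnable):

```python
import io
import os
import tempfile

# create a small file to read back
fd, path = tempfile.mkstemp()
os.write(fd, 'hello\nworld\n'.encode('utf-8'))
os.close(fd)

# roughly the stack that open(path, 'r', encoding='utf-8') assembles:
raw = io.FileIO(path, 'r')                           # OS-level unbuffered bytes
buffered = io.BufferedReader(raw)                    # adds buffering and peek()
text = io.TextIOWrapper(buffered, encoding='utf-8')  # adds decoding + newline handling

lines = list(text)
text.close()
os.remove(path)
print(lines)  # ['hello\n', 'world\n']
```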
Paul Moore
2014-07-23 08:14:31 UTC
Permalink
On 23 July 2014 05:24, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
>> This is also an example of why I'm struggling to understand how an
>> open() parameter "solves all the cases". There's no explicit open()
>> call here, so how do you specify the record separator? Seeing how you
>> propose this would work would be really helpful to me.
>
> The open function is just a shortcut to constructing a stack of io classes;

Ah, yes, I get what you're saying now. I was reading your proposal too
literally as being about "open", and forgetting you can use the
underlying classes to rewrap existing streams.

Thanks for your patience.
Paul
Greg Ewing
2014-07-20 04:16:54 UTC
Permalink
Nick Coghlan wrote:
> having a newline in a filename is sufficiently weird that I
> find it hard to imagine a scenario where "fix the filenames" isn't a
> better answer.

In Classic MacOS, the way you gave a folder an icon
was to put it in a hidden file called "Icon\r".

--
Greg
David Mertz
2014-07-20 05:58:53 UTC
Permalink
The pattern I use, by far, most often with the -0 option is:

find $path -print0 | xargs -0 some_command

Embedding a '\n' in a filename might be weird, but having whitespace in
general (i.e. spaces) really isn't uncommon. However, in this case it
doesn't really seem to matter if some_command is some_command.py. But I
still think the null byte special delimiter is plausible for similar
pipelines.


On Sat, Jul 19, 2014 at 6:40 PM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:

> On 20 July 2014 11:31, Chris Angelico <rosuav-***@public.gmane.org> wrote:
> > On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan-***@public.gmane.org>
> wrote:
> >> At present, I'm genuinely unclear on
> >> why someone would ever want to pass the "-0" option to the other UNIX
> >> utilities, which then makes it very difficult to have a sensible
> >> discussion on how we should address that use case in Python.
> >
> > That one's easy. What happens if you use 'find' to list files, and
> > those files might have \n in their names? You need another sep.
>
> Yes, but having a newline in a filename is sufficiently weird that I
> find it hard to imagine a scenario where "fix the filenames" isn't a
> better answer. Hence why I think the PEP needs to explain why the UNIX
> utilities considered this use case sufficiently non-obscure to add
> explicit support for it, rather than just assuming that the
> obviousness of the use case can be taken for granted.
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan | ncoghlan-***@public.gmane.org | Brisbane, Australia
>



--
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons. Intellectual property is
to the 21st century what the slave trade was to the 16th.
Wichert Akkerman
2014-07-20 07:58:44 UTC
Permalink
> On 20 Jul 2014, at 03:40, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
>
> On 20 July 2014 11:31, Chris Angelico <rosuav-***@public.gmane.org> wrote:
>> On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
>>> At present, I'm genuinely unclear on
>>> why someone would ever want to pass the "-0" option to the other UNIX
>>> utilities, which then makes it very difficult to have a sensible
>>> discussion on how we should address that use case in Python.
>>
>> That one's easy. What happens if you use 'find' to list files, and
>> those files might have \n in their names? You need another sep.
>
> Yes, but having a newline in a filename is sufficiently weird that I
> find it hard to imagine a scenario where "fix the filenames" isn't a
> better answer.

Because you are likely to have no control at all over what people do with filenames. Since, on POSIX at least, filenames are allowed to contain all characters other than NUL and /, you must be able to deal with that. Similar to how you must also be able to deal with a mixture of filenames using different encodings or even pure binary names.

Wichert.
Stephen J. Turnbull
2014-07-19 09:06:59 UTC
Permalink
Chris Angelico writes:

> But they might well be the same thing. Look at all the Unix commands
> that usually separate output with \n, but can be told to separate with
> \0 instead. If you're reading from something like that, it should be
> just as easy to split on \n as on \0.

Nick's point is more general, I think, but as a special case consider
a "multiline" record. What's the right behavior on output from the
application if the newline convention of this particular multiline
differs from that of the rest of the output stream? IMO this goes
beyond "consenting adults" (YMMV, of course).

Steve
Wolfgang Maier
2014-07-20 10:41:29 UTC
Permalink
On 19.07.2014 09:10, Nick Coghlan wrote:
>
> I still favour my proposal there to add a separate "readrecords()"
> method, rather than reusing the line based iteration methods - lines
> and arbitrary records *aren't* the same thing, and I don't think we'd
> be doing anybody any favours by conflating them (whether we're
> confusing them at the method level or at the constructor argument
> level).
>

Thinking about possible use-cases for my own work made me realize one
thing:
At least for text files, the distinction between records and lines, in
practical terms, is that records may have *internal structure based on
newline characters*, while lines are just lines.

If a future readrecords() method would return the record as a StringIO
or BytesIO object, this would allow nested reading of files as lines
(with full newline processing) within records:

for record in infile.readrecords():
    for line in record:
        do_something()

For me, that sort of feature is a more common requirement than being
able to retrieve single lines terminated by something other than newline
characters.
Maybe though, it's possible to have both: a readrecords method like the
one above and an extended set of "newline" tokens that can be passed to
open (at least allowing "\0" seems to make sense).
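A minimal sketch of the idea as a free function (readrecords is hypothetical; nothing like it exists in the stdlib today):

```python
import io

def readrecords(f, sep='\0', bufsize=4096):
    """Yield each sep-delimited record of a text stream as a StringIO,
    so callers can iterate lines within each record (sketch only)."""
    buf = ''
    while True:
        chunk = f.read(bufsize)
        if not chunk:
            break
        buf += chunk
        while sep in buf:
            record, buf = buf.split(sep, 1)
            yield io.StringIO(record)
    if buf:
        yield io.StringIO(buf)  # trailing record without a final separator

# nested iteration, as in the example above:
src = io.StringIO('a\nb\0c\nd\0')
records = [list(record) for record in readrecords(src)]
print(records)  # [['a\n', 'b'], ['c\n', 'd']]
```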

Best,
Wolfgang
Andrew Barnert
2014-07-17 21:59:29 UTC
Permalink
On Thursday, July 17, 2014 1:48 PM, Guido van Rossum <***@python.org> wrote:


>I think it's fine to add something to stdlib that encapsulates your example. (TBD: where?)

Good question about the where.

The resplit function seems like it could be of more general use than just this case, but I'm not sure where it belongs. Maybe itertools?

The iter(lambda: f.read(bufsize), b'') part seems too trivial to put anywhere, even just as an example in the docs—but given that it probably looks like a magic incantation to anyone who's a Python novice (even if they're a C or JS or whatever expert), maybe it is worth putting somewhere. Maybe io.iterchunks(f, 4096)?

If so, the combination of the two into something like iterlines(f, b'\0') seems like it should go right alongside iterchunks.
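Spelled out, the proposed trio might look like this (iterchunks and iterlines are the hypothetical names from the discussion; resplit is from the original post):

```python
def iterchunks(f, size=4096):
    """Yield successive size-byte reads from f until EOF."""
    return iter(lambda: f.read(size), b'')

def resplit(chunks, sep):
    """Re-split an iterable of byte chunks on sep, buffering partial records."""
    buf = b''
    for chunk in chunks:
        parts = (buf + chunk).split(sep)
        yield from parts[:-1]
        buf = parts[-1]
    if buf:
        yield buf

def iterlines(f, sep=b'\n', size=4096):
    """The two-liner combination: iterate sep-separated records of f."""
    return resplit(iterchunks(f, size), sep)
```

e.g. iterlines(open(path, 'rb'), b'\0') iterates \0-separated records; note that, unlike readline, the separators are stripped from the yielded records.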


However…


>I don't think it is reasonable to add a new parameter to readline()

The problem is that my code has significant problems for many use cases, and I don't think they can be solved.

Calling readline (or iterating the file) uses the underlying buffer (and stream decoder, for text files), keeps the file pointer in the same place, etc. My code doesn't, and no external code can. So, besides being less efficient, it leaves the file pointer in the wrong place (imagine using it to parse an RFC822 header then read() the body), doesn't properly decode files where the separator can be ambiguous with other bytes (try separating on '\0' in a UTF-16 file), etc.

Maybe if we had more powerful adapters or wrappers so I could just say "here's a pre-existing buffer plus a text-file-like object, now wrap that up as a real TextIOBase for me" it would be possible to write something that worked from outside without these problems, but as things stand, I don't see an answer.

Maybe put resplit in the stdlib, then just give iterlines as a 2-liner example (in the itertools recipes, or the file-I/O section of the tutorial?) where all these problems can be raised and not answered?
Guido van Rossum
2014-07-17 22:37:58 UTC
Permalink
On Thu, Jul 17, 2014 at 2:59 PM, Andrew Barnert <
abarnert-/***@public.gmane.org> wrote:

> On Thursday, July 17, 2014 1:48 PM, Guido van Rossum <guido-+ZN9ApsXKcEdnm+***@public.gmane.org>
> wrote:
>
>
> >I think it's fine to add something to stdlib that encapsulates your
> example. (TBD: where?)
>
> Good question about the where.
>
> The resplit function seems like it could be of more general use than just
> this case, but I'm not sure where it belongs. Maybe itertools?
>
> The iter(lambda: f.read(bufsize), b'') part seems too trivial to put
> anywhere, even just as an example in the docs—but given that it probably
> looks like a magic incantation to anyone who's a Python novice (even if
> they're a C or JS or whatever expert), maybe it is worth putting somewhere.
> Maybe io.iterchunks(f, 4096)?
>
> If so, the combination of the two into something like iterlines(f, b'\0')
> seems like it should go right alongside iterchunks.
>
>
> However

>
>
> >I don't think it is reasonable to add a new parameter to readline()
>
> The problem is that my code has significant problems for many use cases,
> and I don't think they can be solved.
>
> Calling readline (or iterating the file) uses the underlying buffer (and
> stream decoder, for text files), keeps the file pointer in the same place,
> etc. My code doesn't, and no external code can. So, besides being less
> efficient, it leaves the file pointer in the wrong place (imagine using it
> to parse an RFC822 header then read() the body), doesn't properly decode
> files where the separator can be ambiguous with other bytes (try separating
> on '\0' in a UTF-16 file), etc.
>

You can implement a subclass of io.BufferedIOBase that wraps an instance of
io.RawIOBase (I think those are the right classes) where the wrapper adds a
readuntil(separator) method. Whichever thing then wants to read the rest of
the data should call read() on the wrapper object.
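A rough sketch of such a wrapper (SeparatorReader and readuntil are made-up names; BytesIO stands in for a raw file, and the code only worries about the common case of a separator short enough not to span a buffer refill):

```python
import io

class SeparatorReader(io.BufferedReader):
    """BufferedReader with a readline-like readuntil(sep) method (sketch)."""
    def readuntil(self, sep=b'\0'):
        chunks = []
        while True:
            peeked = self.peek(1)  # whatever is currently buffered, no advance
            if not peeked:
                break              # EOF
            i = peeked.find(sep)
            if i >= 0:
                # consume up to and including the separator
                chunks.append(self.read(i + len(sep)))
                break
            # no separator buffered yet: consume what we saw, then refill
            # (a multi-byte sep spanning the refill boundary would be missed)
            chunks.append(self.read(len(peeked)))
        return b''.join(chunks)

r = SeparatorReader(io.BytesIO(b'header\0the rest of the data'))
print(r.readuntil(b'\0'))  # b'header\0'
print(r.read())            # b'the rest of the data'
```

Because readuntil consumes through the shared buffer, a plain read() afterwards picks up exactly where it left off, which is the point of Guido's suggestion.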

This still sounds a lot better to me than asking everyone to add a new
parameter to their readline() (and the implementation).

Maybe if we had more powerful adapters or wrappers so I could just say
> "here's a pre-existing buffer plus a text-file-like object, now wrap that
> up as a real TextIOBase for me" it would be possible to write something
> that worked from outside without these problems, but as things stand, I
> don't see an answer.
>

You probably have to do a separate wrapper for text streams, the types and
buffering implementation are just too different.


> Maybe put resplit in the stdlib, then just give iterlines as a 2-liner
> example (in the itertools recipes, or the file-I/O section of the
> tutorial?) where all these problems can be raised and not answered?
>

(Sorry, in a hurry / terribly distracted.)

--
--Guido van Rossum (python.org/~guido)
Andrew Barnert
2014-07-18 04:40:11 UTC
Permalink
On Jul 17, 2014, at 15:37, Guido van Rossum <guido-+ZN9ApsXKcEdnm+***@public.gmane.org> wrote:

> On Thu, Jul 17, 2014 at 2:59 PM, Andrew Barnert <abarnert-/***@public.gmane.orginvalid> wrote:
>> >I don't think it is reasonable to add a new parameter to readline()
>>
>> The problem is that my code has significant problems for many use cases, and I don't think they can be solved.
>>
>> Calling readline (or iterating the file) uses the underlying buffer (and stream decoder, for text files), keeps the file pointer in the same place, etc. My code doesn't, and no external code can. So, besides being less efficient, it leaves the file pointer in the wrong place (imagine using it to parse an RFC822 header then read() the body), doesn't properly decode files where the separator can be ambiguous with other bytes (try separating on '\0' in a UTF-16 file), etc.
>
> You can implement a subclass of io.BufferedIOBase that wraps an instance of io.RawIOBase (I think those are the right classes) where the wrapper adds a readuntil(separator) method. Whichever thing then wants to read the rest of the data should call read() on the wrapper object.
>
> This still sounds a lot better to me than asking everyone to add a new parameter to their readline() (and the implementation).

[snip]

> You probably have to do a separate wrapper for text streams, the types and buffering implementation are just too different.

The problem isn't needing two separate wrappers, it's that the text wrapper is effectively impossible.

For binary files, MyBufferedReader.readuntil is a slightly modified version of _pyio.RawIOBase.readline, which only needs to access the public interface of io.BufferedReader (peek and read).

For text files, however, it needs to access private information from TextIOWrapper that isn't exposed from C to Python. And, unlike BufferedReader, TextIOWrapper has no way to peek ahead, or push data back onto the buffer, or anything else usable as a workaround, so even if you wanted to try to take care of the decoding state problems manually, you can't, except by reading one character at a time.

There are also some minor problems even for binary files (e.g., MyBufferedReader(f.raw) has a different file position from f, so if you switch between them you'll end up skipping part of the file), but these won't affect most use cases; the text file problem is the big one.