Discussion:
Iterating non-newline-separated files should be easier
Andrew Barnert
2014-07-17 19:53:05 UTC
tl;dr: readline and friends should take an optional sep parameter (which also means adding an iterlines method).

Recently, I was trying to add -0 support to a command-line tool, which means that it reads filenames out of stdin and/or a text file with \0 separators instead of \n.

This means that my code that looked like this:

    with open(path, encoding=sys.getfilesystemencoding()) as f:
        for filename in f:
            do_stuff(filename)

… turned into this (from memory, not the exact code):

    def resplit(chunks, sep):
        buf = b''
        for chunk in chunks:
            parts = (buf+chunk).split(sep)

            yield from parts[:-1]
            buf = parts[-1]
        if buf:
            yield buf

    with open(path, 'rb') as f:
        chunks = iter(lambda: f.read(4096), b'')
        for line in resplit(chunks, b'\0'):
            filename = line.decode(sys.getfilesystemencoding())
            do_stuff(filename)

Besides being a lot more code (and involving things that a novice might have problems reading, like that two-argument iter), this also means that the file pointer is way ahead of the line that's just been iterated, that I'm inefficiently buffering everything twice, etc.
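As an aside, the two-argument form of iter mentioned above works like this (a minimal illustration, not from the original post): it calls the callable repeatedly until the sentinel value is returned.

```python
import io

f = io.BytesIO(b'abcdefgh')
# iter(callable, sentinel) calls the callable repeatedly until it
# returns the sentinel -- here, until read() returns b'' at EOF.
chunks = list(iter(lambda: f.read(3), b''))
print(chunks)  # [b'abc', b'def', b'gh']
```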

The problem is that readline is hardcoded to look for b'\n' in binary files and the smart universal-newline machinery in text files; there's no way to reuse that machinery if you want to look for something different, and no way to access the internals it uses if you want to reimplement it.

While it might be possible to fix the latter problems in some generic and flexible way, that doesn't seem all that useful; really, other than changing the way readline splits, I don't think anyone wants to hook anything else about file objects. (On the other hand, people might want to hook it in more complex ways—e.g., pass a separator function instead of a separator string? I'm probably reaching there…)

If I'm right, all that's needed is an extra sep=None keyword-only parameter to readline and friends (where None means the existing newline behavior), along with an iterlines method that's identical to __iter__ except that it has room for that new parameter.

One minor side problem: Sometimes you don't actually have a file, but some kind of file-like object. I realize that as of 3.1 or so, this is supposed to mean it actually is an io.BufferedIOBase or the like, but there are still plenty of third-party modules that just demand and/or provide "something with read(size)" or similar. In fact, that's the case with the problem I ran into above; another feature uses a third-party module to provide file-like objects for members of all kinds of uncommon archive types, and unlike zipfile, that module wasn't changed to provide io subclasses when it was ported to 3.x. So, it might be worth having adapters that make it easier (or just possible…) to wrap such a thing in the actual io interfaces. (The existing wrappers aren't adapters—BufferedReader demands readinto(buf), not read(size); TextIOWrapper can only wrap a BufferedIOBase.) But that's really a separate issue (and the answer to that one may just be to hold firm with "file-like object means IOBase" and eventually every library you care about will work that way, even if you occasionally have to fix it yourself).
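One way such an adapter could look (a hypothetical sketch, not an existing stdlib class): wrap anything exposing read(size) in a RawIOBase subclass, so BufferedReader and TextIOWrapper can layer on top of it.

```python
import io

class ReadAdapter(io.RawIOBase):
    """Hypothetical adapter: wrap any object exposing read(size)
    so it satisfies the raw-stream interface (readinto)."""
    def __init__(self, fileobj):
        self._f = fileobj

    def readable(self):
        return True

    def readinto(self, b):
        # Fill the caller's buffer from the wrapped object's read().
        data = self._f.read(len(b))
        n = len(data)
        b[:n] = data
        return n

# A duck-typed "file-like object" that only has read(size):
class OnlyRead:
    def __init__(self, data):
        self._b = io.BytesIO(data)
    def read(self, size=-1):
        return self._b.read(size)

buffered = io.BufferedReader(ReadAdapter(OnlyRead(b'spam\neggs\n')))
```

io.TextIOWrapper(buffered) would then add text decoding on top, which is exactly what such third-party objects otherwise can't participate in.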
Guido van Rossum
2014-07-17 20:48:28 UTC
I think it's fine to add something to stdlib that encapsulates your
example. (TBD: where?)

I don't think it is reasonable to add a new parameter to readline(),
because streams are widely implemented using duck typing -- every
implementation would have to be updated to support this.


On Thu, Jul 17, 2014 at 12:53 PM, Andrew Barnert <
abarnert-/***@public.gmane.org> wrote:

> tl;dr: readline and friends should take an optional sep parameter (which
> also means adding an iterlines method).
>
> Recently, I was trying to add -0 support to a command-line tool, which
> means that it reads filenames out of stdin and/or a text file with \0
> separators instead of \n.
>
> This means that my code that looked like this:
>
>     with open(path, encoding=sys.getfilesystemencoding()) as f:
>         for filename in f:
>             do_stuff(filename)
>
> … turned into this (from memory, not the exact code):
>
>     def resplit(chunks, sep):
>         buf = b''
>         for chunk in chunks:
>             parts = (buf+chunk).split(sep)
>
>             yield from parts[:-1]
>             buf = parts[-1]
>         if buf:
>             yield buf
>
>     with open(path, 'rb') as f:
>         chunks = iter(lambda: f.read(4096), b'')
>         for line in resplit(chunks, b'\0'):
>             filename = line.decode(sys.getfilesystemencoding())
>             do_stuff(filename)
>
> Besides being a lot more code (and involving things that a novice might
> have problems reading like that two-argument iter), this also means that
> the file pointer is way ahead of the line that's just been iterated, I'm
> inefficiently buffering everything twice, etc.
>
> The problem is that readline is hardcoded to look for b'\n' for binary
> files, smart-universal-newline-thingy for text files, there's no way to
> reuse its machinery if you want to look for something different, and
> there's no way to access the internals that it uses if you want to
> reimplement it.
>
> While it might be possible to fix the latter problems in some generic and
> flexible way, that doesn't seem all that useful; really, other than
> changing the way readline splits, I don't think anyone wants to hook
> anything else about file objects. (On the other hand, people might want to
> hook it in more complex ways—e.g., pass a separator function instead of a
> separator string? I'm probably reaching there…)
>
> If I'm right, all that's needed is an extra sep=None keyword-only
> parameter to readline and friends (where None means the existing newline
> behavior), along with an iterlines method that's identical to __iter__
> except that it has room for that new parameter.
>
> One minor side problem: Sometimes you don't actually have a file, but some
> kind of file-like object. I realize that as 3.1 or so, this is supposed to
> mean it actually is an io.BufferedIOBase or etc., but there are still
> plenty of third-party modules that just demand and/or provide "something
> with read(size)" or the like. In fact, that's the case with the problem I
> ran into above; another feature uses a third-party module to provide
> file-like objects for members of all kinds of uncommon archive types, and
> unlike zipfile, that module wasn't changed to provide io subclasses when it
> was ported to 3.x. So, it might be worth having adapters that make it
> easier (or just possible…) to wrap such a thing in the actual io
> interfaces. (The existing wrappers aren't adapters—BufferedReader demands
> readinto(buf), not read(size); TextIOWrapper can only wrap a
> BufferedIOBase.) But that's really a separate issue (and the answer to that
> one may just be to hold firm
> with the "file-like object means IOBase" and eventually every library you
> care about will work that way, even if you occasionally have to fix it
> yourself).
> _______________________________________________
> Python-ideas mailing list
> Python-ideas-+ZN9ApsXKcEdnm+***@public.gmane.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/




--
--Guido van Rossum (python.org/~guido)
Alexander Heger
2014-07-17 21:39:42 UTC
> I don't think it is reasonable to add a new parameter to readline(), because
> streams are widely implemented using duck typing -- every implementation
> would have to be updated to support this.

Could the "split" (or splitline) keyword-only parameter instead be
passed to the open function (and the __init__ of IOBase and be stored
there)?
Andrew Barnert
2014-07-17 22:21:25 UTC
> On Thursday, July 17, 2014 2:40 PM, Alexander Heger <***@2sn.net> wrote:

>> I don't think it is reasonable to add a new parameter to readline(), because
>> streams are widely implemented using duck typing -- every implementation
>> would have to be updated to support this.
>
> Could the "split" (or splitline) keyword-only parameter instead be
> passed to the open function (and the __init__ of IOBase and be stored
> there)?


Good idea. It's less powerful/flexible, but probably good enough for almost all use cases. (I can't think of any file where I'd need to split part of it on \0 and the rest on \n…) Also, it means you can stick with the normal __iter__ instead of needing a separate iterlines method.

And, since open/__init__/etc. isn't part of the protocol, it's perfectly fine for the builtin open, etc., to be an example or template that's generally worth following if there's no good reason not to do so, rather than a requirement that must be followed. So, if I'm getting file-like objects handed to me by some third-party library or plugin API or whatever, and I need them to be \0-separated, in many cases the problems with resplit won't be an issue so I can just use it as a workaround, and in the remaining cases, I can request that the library/app/whatever add the sep parameter to the next iteration of the API.

So, I retract my original suggestion in favor of this one. And, separately, Guido's idea of adding the helpers (or at least resplit, plus documentation on how to write the other stuff) to the stdlib somewhere.

Thanks.
Andrew Barnert
2014-07-18 00:04:00 UTC
On Thursday, July 17, 2014 3:21 PM, Andrew Barnert <***@yahoo.com> wrote:



>  On Thursday, July 17, 2014 2:40 PM, Alexander Heger <***@2sn.net> wrote:

>>  Could the "split" (or splitline) keyword-only
>> parameter instead be passed to the open function 
>> (and the __init__ of IOBase and be stored there)?
>
> Good idea. It's less powerful/flexible, but probably
> good enough for almost all use cases. (I can't think
> of any file where I'd need to split part of it on \0
> and the rest on \n…) Also, it means you can stick with
> the normal __iter__ instead of needing a separate
> iterlines method.

It turns out to be even simpler than I expected.

I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO.

For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it.

(Of course you'd also want to add it to all of the stdlib cases like zipfile.ZipFile.open/zipfile.ZipExtFile.__init__, but there aren't too many of those.)

This means that the buffer underlying a text file with a non-standard newline doesn't automatically have a matching newline. I think that's a good thing ('\r\n' and '\r' would need exceptions for backward compatibility; '\0'.encode('utf-16-le') isn't a very useful thing to split on; etc.), but doing it the other way is almost as easy, and very little code will ever care.
Steven D'Aprano
2014-07-18 03:21:00 UTC
On Thu, Jul 17, 2014 at 05:04:00PM -0700, Andrew Barnert wrote:

> It turns out to be even simpler than I expected.
>
> I reused the "newline" parameter of open and TextIOWrapper.__init__,
> adding a param of the same name to the constructors for
> BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and
> FileIO.
>
> For text files, just remove the check for newline being one of the
> standard values and it all works. For binary files, remove the check
> for truthy, make open pass each Buffered* constructor newline=(newline
> if binary else None), make each Buffered* class store it, and change
> two lines in RawIOBase.readline to use it. And that's it.

All the words are in English, but I have no idea what you're actually
saying... :-)

You seem to be talking about the implementation of the change, but what
is the interface? Having made all these changes, how does it affect
Python code? You have a use-case of splitting on something other than
the standard newlines, so how does one do that? E.g. suppose I have a
file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
character. How would I iterate over lines in this file?


> This means that the buffer underlying a text file with a non-standard
> newline doesn't automatically have a matching newline.

I don't understand what you mean by this.



--
Steven
Chris Angelico
2014-07-18 03:36:17 UTC
On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:
> On Thu, Jul 17, 2014 at 05:04:00PM -0700, Andrew Barnert wrote:
>
>> It turns out to be even simpler than I expected.
>>
>> I reused the "newline" parameter of open and TextIOWrapper.__init__,
>> adding a param of the same name to the constructors for
>> BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and
>> FileIO.
>>
>> For text files, just remove the check for newline being one of the
>> standard values and it all works. For binary files, remove the check
>> for truthy, make open pass each Buffered* constructor newline=(newline
>> if binary else None), make each Buffered* class store it, and change
>> two lines in RawIOBase.readline to use it. And that's it.
>
> All the words are in English, but I have no idea what you're actually
> saying... :-)
>
> You seem to be talking about the implementation of the change, but what
> is the interface? Having made all these changes, how does it effect
> Python code? You have a use-case of splitting on something other than
> the standard newlines, so how does one do that? E.g. suppose I have a
> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
> character. How would I iterate over lines in this file?

The way I understand it is this:

for line in open("spam.txt", newline="\u0085"):
    process(line)

If that's the case, I would be strongly in favour of this. Nice and
clean, and should break nothing; there'll be special cases for
newline=None and newline='', and the only change is that, instead of a
small number of permitted values ('\n', '\r', '\r\n'), any string (or
maybe any one-character string plus '\r\n'?) would be permitted.

Effectively, it's not "iterate over this file, divided by \0 instead
of newlines", but it's "this file uses the unusual encoding of
newline=\0, now iterate over lines in the file". Seems a smart way to
do it IMO.

ChrisA
Andrew Barnert
2014-07-18 04:23:05 UTC
On Jul 17, 2014, at 20:36, Chris Angelico <rosuav-***@public.gmane.org> wrote:

> On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:
>> You seem to be talking about the implementation of the change, but what
>> is the interface? Having made all these changes, how does it effect
>> Python code? You have a use-case of splitting on something other than
>> the standard newlines, so how does one do that? E.g. suppose I have a
>> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
>> character. How would I iterate over lines in this file?
>
> The way I understand it is this:
>
> > for line in open("spam.txt", newline="\u0085"):
> >     process(line)
>
> If that's the case, I would be strongly in favour of this. Nice and
> clean, and should break nothing; there'll be special cases for
> newline=None and newline='', and the only change is that, instead of a
> small number of permitted values ('\n', '\r', '\r\n'), any string (or
> maybe any one-character string plus '\r\n'?) would be permitted.
>
> Effectively, it's not "iterate over this file, divided by \0 instead
> of newlines", but it's "this file uses the unusual encoding of
> newline=\0, now iterate over lines in the file". Seems a smart way to
> do it IMO.

Exactly. As soon as Alexander suggested it, I immediately knew it was much better than my original idea.

(Apologies for overestimating the obviousness of that.)
Guido van Rossum
2014-07-18 04:47:06 UTC
Well, I had to look up the newline option for open(), even though I
probably invented it. :-)

Would it still apply only to text files?

On Thursday, July 17, 2014, Andrew Barnert <abarnert-/***@public.gmane.org>
wrote:

> On Jul 17, 2014, at 20:36, Chris Angelico <rosuav-***@public.gmane.org>
> wrote:
>
> > On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:
> >> You seem to be talking about the implementation of the change, but what
> >> is the interface? Having made all these changes, how does it effect
> >> Python code? You have a use-case of splitting on something other than
> >> the standard newlines, so how does one do that? E.g. suppose I have a
> >> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
> >> character. How would I iterate over lines in this file?
> >
> > The way I understand it is this:
> >
> > for line in open("spam.txt", newline="\u0085"):
> >     process(line)
> >
> > If that's the case, I would be strongly in favour of this. Nice and
> > clean, and should break nothing; there'll be special cases for
> > newline=None and newline='', and the only change is that, instead of a
> > small number of permitted values ('\n', '\r', '\r\n'), any string (or
> > maybe any one-character string plus '\r\n'?) would be permitted.
> >
> > Effectively, it's not "iterate over this file, divided by \0 instead
> > of newlines", but it's "this file uses the unusual encoding of
> > newline=\0, now iterate over lines in the file". Seems a smart way to
> > do it IMO.
>
> Exactly. As soon as Alexander suggested it, I immediately knew it was much
> better than my original idea.
>
> (Apologies for overestimating the obviousness of that.)
>
>
>


--
--Guido van Rossum (on iPad)
Andrew Barnert
2014-07-18 06:26:28 UTC
On Jul 17, 2014, at 21:47, Guido van Rossum <guido-+ZN9ApsXKcEdnm+***@public.gmane.org> wrote:

> Well, I had to look up the newline option for open(), even though I probably invented it. :-)

While we're at it, I think most places in the documentation and docstrings that refer to the parameter, except open itself, call it newlines (e.g., io.IOBase.readline), and as far as I can tell it's been like that from day one, which shows just how much people pay attention to the current feature. :)

> Would it still apply only to text files?

I think it makes sense to apply to binary files as well. Splitting binary files on \0 (or, for that matter, \r\n...) is probably at least as common a use case as text files.

Obviously the special treatment for "" (as a universal-newline-behavior flag) wouldn't carry over to b"" (which might as well just be an error, although I suppose it could also mean to split on every byte, as with bytes.split?). Also, I'm not sure if the write behavior (replace terminal "\n" with newline) should carry over from text to binary, or just ignore newline on write.

> On Thursday, July 17, 2014, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
>> On Jul 17, 2014, at 20:36, Chris Angelico <rosuav-***@public.gmane.org> wrote:
>>
>> > On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:
>> >> You seem to be talking about the implementation of the change, but what
>> >> is the interface? Having made all these changes, how does it effect
>> >> Python code? You have a use-case of splitting on something other than
>> >> the standard newlines, so how does one do that? E.g. suppose I have a
>> >> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
>> >> character. How would I iterate over lines in this file?
>> >
>> > The way I understand it is this:
>> >
>> > for line in open("spam.txt", newline="\u0085"):
>> >     process(line)
>> >
>> > If that's the case, I would be strongly in favour of this. Nice and
>> > clean, and should break nothing; there'll be special cases for
>> > newline=None and newline='', and the only change is that, instead of a
>> > small number of permitted values ('\n', '\r', '\r\n'), any string (or
>> > maybe any one-character string plus '\r\n'?) would be permitted.
>> >
>> > Effectively, it's not "iterate over this file, divided by \0 instead
>> > of newlines", but it's "this file uses the unusual encoding of
>> > newline=\0, now iterate over lines in the file". Seems a smart way to
>> > do it IMO.
>>
>> Exactly. As soon as Alexander suggested it, I immediately knew it was much better than my original idea.
>>
>> (Apologies for overestimating the obviousness of that.)
>>
>>
>
>
> --
> --Guido van Rossum (on iPad)
Andrew Barnert
2014-07-18 04:18:08 UTC
On Jul 17, 2014, at 20:21, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:

> On Thu, Jul 17, 2014 at 05:04:00PM -0700, Andrew Barnert wrote:
>
>> It turns out to be even simpler than I expected.
>>
>> I reused the "newline" parameter of open and TextIOWrapper.__init__,
>> adding a param of the same name to the constructors for
>> BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and
>> FileIO.
>>
>> For text files, just remove the check for newline being one of the
>> standard values and it all works. For binary files, remove the check
>> for truthy, make open pass each Buffered* constructor newline=(newline
>> if binary else None), make each Buffered* class store it, and change
>> two lines in RawIOBase.readline to use it. And that's it.
>
> All the words are in English, but I have no idea what you're actually
> saying... :-)
>
> You seem to be talking about the implementation of the change, but what
> is the interface?

"I reused the newline parameter."

My mistake was assuming that was so simple, nothing else needed to be said. But that only works if everyone went back and completely read the previous suggestions, which I realize nobody had any good reason to do.

Basically, the only change to the API is that it's no longer an error to pass arbitrary strings (or bytes, for binary mode) for newlines. The rules for how "\0" is handled are identical to the rules for "\r". There's almost nothing else to explain, but not quite--so, like an idiot, I dove into the minor nits in detail, skipping over the main point.

> Having made all these changes, how does it effect
> Python code?

Existing legal code does not change at all. Some code that used to be an error now does something useful (see below).

> You have a use-case of splitting on something other than
> the standard newlines, so how does one do that? E.g. suppose I have a
> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
> character. How would I iterate over lines in this file?

with open("spam.txt", newline="\u0085") as f:
    for line in f:
        process(line)

>> This means that the buffer underlying a text file with a non-standard
>> newline doesn't automatically have a matching newline.
>
> I don't understand what you mean by this.

If you write this:

with open("spam.txt", newline="\u0085") as f:
    for line in f.buffer:
        ...

The bytes you get back will be split on b"\n", not on "\u0085".encode(locale.getpreferredencoding()). The newline parameter applies only to the text file, not its underlying binary buffer. (This is exactly the same as the current behavior--if you open a file with newline='\r' in 3.4 then iterate f.buffer, it's still going to split on b'\n', not b'\r'.)
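That existing 3.4 behavior can be demonstrated directly (a sketch using a throwaway temp file, not code from the thread):

```python
import os
import tempfile

# Write a file whose "lines" end in '\r' rather than '\n'.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'w', newline='') as f:
    f.write('a\rb\rc')

# The text layer honors newline='\r' ...
with open(path, newline='\r') as f:
    text_lines = list(f)        # ['a\r', 'b\r', 'c']

# ... but the underlying binary buffer still splits on b'\n',
# so the whole file comes back as one "line".
with open(path, newline='\r') as f:
    raw_lines = list(f.buffer)  # [b'a\rb\rc']

os.remove(path)
```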
Wolfgang Maier
2014-07-18 11:53:48 UTC
On 07/18/2014 02:04 AM, Andrew Barnert wrote:
> On Thursday, July 17, 2014 3:21 PM, Andrew Barnert <***@yahoo.com> wrote:
>
>
>
>> On Thursday, July 17, 2014 2:40 PM, Alexander Heger <***@2sn.net> wrote:
>
>>> Could the "split" (or splitline) keyword-only
>>> parameter instead be passed to the open function
>>> (and the __init__ of IOBase and be stored there)?
>>
>> Good idea. It's less powerful/flexible, but probably
>> good enough for almost all use cases. (I can't think
>> of any file where I'd need to split part of it on \0
>> and the rest on \n…) Also, it means you can stick with
>> the normal __iter__ instead of needing a separate
>> iterlines method.
>
> It turns out to be even simpler than I expected.
>
> I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO.
>
> For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it.
>

You are not the first one to come up with this idea and suggest
solutions. This whole thing has been hanging around on the bug tracker
as an unresolved issue (started by Nick Coghlan) for almost a decade:

http://bugs.python.org/issue1152248

Ever since discovering it, I've been sticking to the recipe provided by
Douglas Alan:

http://bugs.python.org/issue1152248#msg109117

Not that I wouldn't like to see this feature shipping with Python,
but it may help to read through all aspects of the problem that have
been discussed before.

Best,
Wolfgang
Andrew Barnert
2014-07-18 16:43:26 UTC
Before responding to Wolfgang, something that occurred to me overnight: The only insurmountable problem with Guido's suggestion of "just unwrap and rewrap the raw or buffer in a subclass that adds this behavior" is that you can't write such a subclass of TextIOWrapper, because it has no way to either peek at or push back onto the buffer. So... Why not add one of those?

Pushing back is easier to implement (since it's already there as a private method), but a bit funky, and peeking would mean it works the same way as with buffered binary files. But I'll take a look at the idiomatic way to do similar things in other languages (C stdio, C++ iostreams, etc.), and make sure that peek is actually sensible for TextIOWrapper, before arguing for it.

While we're at it, it might be nice for the peek method to be documented as an (optional, like raw, etc.?) member of the two ABCs instead of just something that one implementation happens to have, and that the mixin code will use if it happens to be present. (Binary readline uses peek if it exists, falls back to byte by byte if not.)
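As a sketch of how peek could support custom separators at the buffered layer, here is a hypothetical read_until helper (not a stdlib method; single-byte separators only, since a multi-byte separator could straddle what peek returns):

```python
import io

def read_until(buf, sep=b'\0'):
    """Hypothetical helper: consume and return one sep-terminated
    record from a BufferedReader, using peek() to avoid overreading."""
    out = bytearray()
    while True:
        chunk = buf.peek(1)                # buffered bytes, not yet consumed
        if not chunk:
            return bytes(out)              # EOF: return whatever is left
        i = chunk.find(sep)
        if i >= 0:
            out += buf.read(i + len(sep))  # consume up to and including sep
            return bytes(out)
        out += buf.read(len(chunk))        # consume what we already inspected

buf = io.BufferedReader(io.BytesIO(b'a\0bc\0d'))
```

Because peek never advances the stream, the file position stays exactly at the end of the returned record, which is the property the resplit workaround lacks.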

On Jul 18, 2014, at 4:53, Wolfgang Maier <***@biologie.uni-freiburg.de> wrote:

> On 07/18/2014 02:04 AM, Andrew Barnert wrote:
>> On Thursday, July 17, 2014 3:21 PM, Andrew Barnert <***@yahoo.com> wrote:
>>
>>
>>
>>> On Thursday, July 17, 2014 2:40 PM, Alexander Heger <***@2sn.net> wrote:
>>
>>>> Could the "split" (or splitline) keyword-only
>>>> parameter instead be passed to the open function
>>>> (and the __init__ of IOBase and be stored there)?
>>>
>>> Good idea. It's less powerful/flexible, but probably
>>> good enough for almost all use cases. (I can't think
>>> of any file where I'd need to split part of it on \0
>>> and the rest on \n…) Also, it means you can stick with
>>> the normal __iter__ instead of needing a separate
>>> iterlines method.
>>
>> It turns out to be even simpler than I expected.
>>
>> I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO.
>>
>> For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it.
>
> You are not the first one to come up with this idea and suggesting solutions. This whole thing has been hanging around on the bug tracker as an unresolved issue (started by Nick Coghlan) since almost a decade:
>
> http://bugs.python.org/issue1152248
>
> Ever since discovering it, I've been sticking to the recipe provided by Douglas Alan:
>
> http://bugs.python.org/issue1152248#msg109117

Thanks.

Douglas's recipe is effectively the same as my resplit, except less general (since it consumes a file rather than any iterable), and some, but not all, of the limitations of that approach were mentioned. And R. David Murray's hack patch is basically the same as the text half of my patch.

The discussion there is also useful, as it raises the similar features in perl, awk, bash, etc.--all of which work by having the user change either a global or something on the file object, rather than putting it in the line-reading code, which reinforces my belief that Alexander's idea of putting the separator value in the file constructors was right, and my initially putting it in readline or a new readuntil method was wrong.

> Not that I wouldn't like to see this feature to be shipping with Python, but it may help to read through all aspects of the problem that have been discussed before.
>
> Best,
> Wolfgang
>
>
Nick Coghlan
2014-07-19 07:10:58 UTC
On 18 July 2014 12:43, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
> Before responding to Wolfgang, something that occurred to me overnight: The only insurmountable problem with Guido's suggestion of "just unwrap and rewrap the raw or buffer in a subclass that adds this behavior" is that you can't write such a subclass of TextIOWrapper, because it has no way to either peek at or push back onto the buffer. So... Why not add one of those?
>
> Pushing back is easier to implement (since it's already there as a private method), but a bit funky, and peeking would mean it works the same way as with buffered binary files. But I'll take a look at the idiomatic way to do similar things in other languages (C stdio, C++ iostreams, etc.), and make sure that peek is actually sensible for TextIOWrapper, before arguing for it.
>
> While we're at it, it might be nice for the peek method to be documented as an (optional, like raw, etc.?) member of the two ABCs instead of just something that one implementation happens to have, and that the mixin code will use if it happens to be present. (Binary readline uses peek if it exists, falls back to byte by byte if not.)

Slight tangent, but this rewrapping question also arises in the
context of changing encodings on an already open stream. See
http://bugs.python.org/issue15216 for (the gory) details.

> On Jul 18, 2014, at 4:53, Wolfgang Maier <wolfgang.maier-***@public.gmane.org> wrote:
>> You are not the first one to come up with this idea and suggesting solutions. This whole thing has been hanging around on the bug tracker as an unresolved issue (started by Nick Coghlan) since almost a decade:
>>
>> http://bugs.python.org/issue1152248
>>
>> Ever since discovering it, I've been sticking to the recipe provided by Douglas Alan:
>>
>> http://bugs.python.org/issue1152248#msg109117
>
> Thanks.
>
> Douglas's recipe is effectively the same as my resplit, except less general (since it consumes a file rather than any iterable), and some, but not all, of the limitations of that approach were mentioned. And R. David Murray's hack patch is basically the same as the text half of my patch.
>
> The discussion there is also useful, as it raises the similar features in perl, awk, bash, etc.--all of which work by having the user change either a global or something on the file object, rather than putting it in the line-reading code, which reinforces my belief that Alexander's idea of putting the separator value in the file constructors was right, and my initially putting it in readline or a new readuntil method was wrong.

I still favour my proposal there to add a separate "readrecords()"
method, rather than reusing the line based iteration methods - lines
and arbitrary records *aren't* the same thing, and I don't think we'd
be doing anybody any favours by conflating them (whether we're
confusing them at the method level or at the constructor argument
level).

While, as an implementation artifact, it may be possible to get this
"easily" by abusing the existing newline parameter, that's likely to
break a lot of assumptions in *other* code, that specifically expects
newlines to refer to actual line endings. A new separate method
cleanly isolates the feature to code that wants to use it, preventing
potentially adverse and hard to debug impacts on unrelated code that
happens to receive a file object with a custom record separator
configured.

With this kind of proposal, it isn't the "what happens when it works?"
cases that worry me - it's the cases where it *fails* and someone is
stuck with figuring out what has gone wrong. A new method fails
cleanly, but changing the semantics of *existing* arguments,
attributes and methods? That doesn't fail cleanly at all, and can also
have far reaching impacts on the correctness of all sorts of
documentation.

Attempting to wedge this functionality into *existing* constructs
means *changing* a lot of expectations that are now well established
in a Python context. By contrast, adding a *new* construct,
specifically for this purpose, means nothing needs to change with
existing constructs, we don't inadvertently introduce even more
obscure corner cases in newline handling, and there's a solid
terminology hook to hang the documentation on (iteration by line vs
iteration by record - and we can also be clear that "line buffered"
really does correspond to iteration by line, and may not be available
for arbitrary record separators).

Providing this feature as a separate method also makes it possible for
the IO ABC's to provide a default implementation (along the lines of
your resplit function), that concrete implementations can optionally
override with something more optimised. Pure ducktyped cases (not
inheriting from the ABCs) will fail with a fairly obvious error
("AttributeError: 'MyCustomFileType' object has no attribute
'readrecords'" rather than something related to unknown parameter
names or illegal argument values), while those that do inherit from
the ABCs will "just work".
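Such a default implementation might look roughly like the sketch below (`readrecords` is the proposed name, not an existing method, and a real version would also need a text-mode counterpart):

```python
def readrecords(f, sep=b'\0', chunksize=4096):
    """Sketch of a possible ABC-level default for the proposed
    readrecords(): iterate over records split on an arbitrary
    separator, buffering across read() chunk boundaries."""
    buf = b''
    while True:
        chunk = f.read(chunksize)
        if not chunk:          # EOF
            break
        parts = (buf + chunk).split(sep)
        yield from parts[:-1]  # complete records
        buf = parts[-1]        # partial record, kept for the next chunk
    if buf:
        yield buf              # trailing record with no final separator
```

A concrete implementation could override this with something that reuses its internal buffer instead of re-buffering.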

Regards,
Nick.

--
Nick Coghlan | ncoghlan-***@public.gmane.org | Brisbane, Australia
Chris Angelico
2014-07-19 07:32:53 UTC
Permalink
On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
> I still favour my proposal there to add a separate "readrecords()"
> method, rather than reusing the line based iteration methods - lines
> and arbitrary records *aren't* the same thing

But they might well be the same thing. Look at all the Unix commands
that usually separate output with \n, but can be told to separate with
\0 instead. If you're reading from something like that, it should be
just as easy to split on \n as on \0.
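For concreteness, this is the kind of consumer code in question; the helper name here is made up for illustration, and slurping the whole input is only workable when it is reasonably small:

```python
import sys

def filenames_from_null_separated(raw):
    """Split `find -print0`-style output into decoded filenames.
    (Illustrative helper, not an existing API.)"""
    encoding = sys.getfilesystemencoding()
    return [name.decode(encoding) for name in raw.split(b'\0') if name]

# Typical use in a pipeline:
#     filenames_from_null_separated(sys.stdin.buffer.read())
```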

ChrisA
Nick Coghlan
2014-07-19 08:18:35 UTC
Permalink
On 19 July 2014 03:32, Chris Angelico <rosuav-***@public.gmane.org> wrote:
> On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
>> I still favour my proposal there to add a separate "readrecords()"
>> method, rather than reusing the line based iteration methods - lines
>> and arbitrary records *aren't* the same thing
>
> But they might well be the same thing. Look at all the Unix commands
> that usually separate output with \n, but can be told to separate with
> \0 instead. If you're reading from something like that, it should be
> just as easy to split on \n as on \0.

Python isn't Unix, and Python has never supported \0 as a "line
ending". Changing the meaning of existing constructs is fraught with
complexity, and should only be done when there is absolutely no
alternative. In this case, there's an alternative: a new method,
specifically for reading arbitrary records.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan-***@public.gmane.org | Brisbane, Australia
Steven D'Aprano
2014-07-19 09:01:59 UTC
Permalink
On Sat, Jul 19, 2014 at 04:18:35AM -0400, Nick Coghlan wrote:
> On 19 July 2014 03:32, Chris Angelico <rosuav-***@public.gmane.org> wrote:
> > On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
> >> I still favour my proposal there to add a separate "readrecords()"
> >> method, rather than reusing the line based iteration methods - lines
> >> and arbitrary records *aren't* the same thing
> >
> > But they might well be the same thing. Look at all the Unix commands
> > that usually separate output with \n, but can be told to separate with
> > \0 instead. If you're reading from something like that, it should be
> > just as easy to split on \n as on \0.
>
> Python isn't Unix, and Python has never supported \0 as a "line
> ending". Changing the meaning of existing constructs is fraught with
> complexity, and should only be done when there is absolutely no
> alternative. In this case, there's an alternative: a new method,
> specifically for reading arbitrary records.

I don't have an opinion one way or the other, but I don't quite see why
you're worried about allowing the newline parameter to be set to some
arbitrary separator. The best I can come up with is a scenario something
like this:

I open a file with some record-separator

fp = open(filename, newline="\0")

then pass it to a function:

spam(fp)

which assumes that each chunk ends with a linefeed:

assert next(fp).endswith('\n')


But in a case like that, the function is already buggy. I can see at
least two problems with such an assumption:

- what if universal newlines has been turned off and you're reading
a file created under (e.g.) classic Mac OS or RISC OS?

- what if the file contains a single line which does not end with an
end of line character at all?

open('/tmp/junk', 'wb').write(b"hello world!")
next(open('/tmp/junk', 'r'))

Have I missed something?


Although I don't mind whether files grow a readrecords() method, or
re-use the readlines() method, I'm not convinced that API decisions
should be driven solely by the needs of programs which are already
buggy.



--
Steven
Nick Coghlan
2014-07-19 09:27:49 UTC
Permalink
On 19 July 2014 05:01, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:
> On Sat, Jul 19, 2014 at 04:18:35AM -0400, Nick Coghlan wrote:
> But in a case like that, the function is already buggy. I can see at
> least two problems with such an assumption:
>
> - what if universal newlines has been turned off and you're reading
> a file created under (e.g.) classic Mac OS or RISC OS?

That's exactly the point though - people *do* assume "\n", and we've
gone to great lengths to make that assumption *more correct* (even
though it's still wrong sometimes).

We can't reverse course on that, and expect the outcome to make sense
to *people*. When making use of a configurable line endings feature
breaks (and it will), they're going to be confused, and the docs
likely aren't going to help much.

> - what if the file contains a single line which does not end with an
> end of line character at all?
>
> open('/tmp/junk', 'wb').write("hello world!")
> next(open('/tmp/junk', 'r'))
>
> Have I missed something?
>
>
> Although I don't mind whether files grow a readrecords() method, or
> re-use the readlines() method, I'm not convinced that API decisions
> should be driven solely by the needs of programs which are already
> buggy.

It's not being driven by the needs of programs that are already buggy
- my preferences are driven by the fact that line endings and record
separators are *not the same thing*. Thinking that they are is a
matter of confusing the conceptual data model with the implementation
of the framing at the serialisation layer. If we *do* try to treat
them as the same thing, then we have to go find *every single
reference* to line endings in the documentation and add a caveat about
it being configurable at file object creation time, so it might
actually be based on something completely arbitrary.

Line endings are *already* confusing enough that the "universal
newlines" mechanism was added to make it so that Python level code
could mostly ignore the whole "\n" vs "\r" vs "\r\n" distinction, and
just assume "\n" everywhere.

This is why I'm a fan of keeping things comparatively simple, and just
adding a new method (if we only add an iterator version) or two (if we
add a list version as well) specifically for this use case.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan-***@public.gmane.org | Brisbane, Australia
Paul Moore
2014-07-19 09:30:38 UTC
Permalink
On 19 July 2014 10:01, Steven D'Aprano <steve-iDnA/YwAAsAk+I/***@public.gmane.org> wrote:
> I open a file with some record-separator
>
> fp = open(filename, newline="\0")
>
> then pass it to a function:
>
> spam(fp)
>
> which assumes that each chunk ends with a linefeed:
>
> assert next(fp).endswith('\n')

I will often do

    for line in fp:
        line = line.strip()

to remove the line ending ("record separator"). This fails if you have
an arbitrary separator. And for that matter, how would you remove an
arbitrary separator? Maybe line = line[:-1] works, but what if at some
point people ask for multi-character separators ("\n\n" for "paragraph
separated", for example - ignoring the universal newline complexities
in that).

A splitrecord method still needs a means for code to remove the
record separator, of course, but the above demonstrates how reusing
line separation could break the assumptions of *current* code.

Paul
Greg Ewing
2014-07-20 04:03:20 UTC
Permalink
Paul Moore wrote:
> And for that matter, how would you remove an
> arbitrary separator? Maybe line = line[:-1] works, but what if at some
> point people ask for multi-character separators

If the newline mechanism is re-used, it would
convert whatever separator is used into '\n'.

--
Greg
Andrew Barnert
2014-07-20 05:02:03 UTC
Permalink
On Saturday, July 19, 2014 9:42 PM, Greg Ewing <***@canterbury.ac.nz> wrote:

> Paul Moore wrote:
>> And for that matter, how would you remove an
>> arbitrary separator? Maybe line = line[:-1] works, but what if at some
>> point people ask for multi-character separators

You already can't use line[:-1] today, because '\r\n' is already a valid value, and always has been.

And however people deal with newline='\r\n' will work for any crazy separator you can think of. Maybe line[:-len(nl)]. Maybe line.rstrip(nl) if it's appropriate (it isn't always, either for \r\n or for some arbitrary separator).
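A concrete illustration of why rstrip() isn't always appropriate (plain Python, nothing new proposed here): rstrip() treats its argument as a *set* of characters, so it can eat more than the single trailing terminator.

```python
nl = '\r\n'

# Slicing removes exactly one terminator -- but only if it is
# guaranteed to be present:
assert 'data\r\n'[:-len(nl)] == 'data'

# rstrip strips *every* trailing character from the set {'\r', '\n'},
# which is sometimes too much:
assert 'trailing\n\n\r\n'.rstrip(nl) == 'trailing'
```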

> If the newline mechanism is re-used, it would

> convert whatever separator is used into '\n'.


No it wouldn't.

https://docs.python.org/3/library/io.html#io.TextIOWrapper

> When reading input from the stream, if newline is None, universal newlines mode is enabled… If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.

So, making '\0' a legal value just means the '\0' line endings will be returned to the caller untranslated.

Also, remember that binary files don't do universal newline translation ever, so just letting you change the separator there wouldn't add translation.
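The untranslated behaviour quoted above can be checked today with a separator that's already legal:

```python
import io

# With newline='\r', lines are terminated only by '\r' and the
# terminator comes back untranslated -- exactly the behaviour a
# hypothetical newline='\0' would extend to NUL.
f = io.TextIOWrapper(io.BytesIO(b'a\rb\rc'), encoding='utf-8', newline='\r')
assert list(f) == ['a\r', 'b\r', 'c']
```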


Of course both of those could be changed as well (although with what interface, I'm not sure…), but I don't think they should be.
Guido van Rossum
2014-07-20 05:45:04 UTC
Permalink
If and when something is decided in this thread, can someone summarize it
to me? I don't have time to read all the lengthy arguments but I do care
about the outcome.

--
--Guido van Rossum (python.org/~guido)
Andrew Barnert
2014-07-20 11:56:28 UTC
Permalink
Per Nick's suggestion, I will write up a draft PEP, and link it to issue #1152248, which should be a lot easier to follow. If you want to wait until the first round of discussion and the corresponding update to the PEP before checking in, I'll make sure it's obvious when that's happened.

Sent from a random iPhone

On Jul 19, 2014, at 22:45, Guido van Rossum <guido-+ZN9ApsXKcEdnm+***@public.gmane.org> wrote:

> If and when something is decided in this thread, can someone summarize it to me? I don't have time to read all the lengthy arguments but I do care about the outcome.
>
> --
> --Guido van Rossum (python.org/~guido)
Antoine Pitrou
2014-07-19 14:55:43 UTC
Permalink
Le 19/07/2014 05:01, Steven D'Aprano a écrit :
>
> I open a file with some record-separator
>
> fp = open(filename, newline="\0")

Hmm... newline="\0" already *looks* wrong. To me, it's a hint that
you're abusing the API.

The main advantage of it, though, is that you can use iteration in
addition to the regular readline() (or readrecord()) method.

Regards

Antoine.
MRAB
2014-07-19 16:21:33 UTC
Permalink
On 2014-07-19 10:01, Steven D'Aprano wrote:
[snip]

> - what if universal newlines has been turned off and you're reading
> a file created under (e.g.) classic Mac OS or RISC OS?
>
[snip]
FTR, the line ending in RISC OS is '\n'.
Guido van Rossum
2014-07-19 20:05:32 UTC
Permalink
I don't have time for this thread.

I never meant to suggest anything that would require pushing back data into
the buffer (you must have misread me).

I don't like changing the meaning of the newline argument to open (and it
doesn't solve enough use cases any way).

I personally think it's preposterous to use \0 as a separator for text
files (nothing screams binary data like a null byte :-).

I don't think it's a big deal if a method named readline() returns a record
that doesn't end in a \n character.

I value the equivalence of __next__() and readline().

I still think you should solve this using a wrapper class (that does its
own buffering if necessary, and implements the rest of the stream protocol
for the benefit of other consumers of some of the data).

Once a suitable wrapper class has been implemented as a 3rd party module
and is in common use you may petition to have it added to the standard
library, as a separate module/class/function.

--
--Guido van Rossum (python.org/~guido)
Wichert Akkerman
2014-07-20 07:50:10 UTC
Permalink
> On 19 Jul 2014, at 22:05, Guido van Rossum <guido-+ZN9ApsXKcEdnm+***@public.gmane.org> wrote:
>
> I don't have time for this thread.
>
> I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me).
>
> I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way).

I see another problem with doing this by modifying the open() call: it does not work for filehandles created using other methods such as pipe() or socket(), either used directly or via subprocess. There are real-world examples of situations where that is very useful. One of them was even mentioned in this discussion: processing the output of find -0.

Wichert.
Andrew Barnert
2014-07-20 11:53:01 UTC
Permalink
On Jul 20, 2014, at 0:50, Wichert Akkerman <wichert-***@public.gmane.org> wrote:

>
>> On 19 Jul 2014, at 22:05, Guido van Rossum <guido-+ZN9ApsXKcEdnm+***@public.gmane.org> wrote:
>>
>> I don't have time for this thread.
>>
>> I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me).
>>
>> I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way).
>
> I see another problem with doing this by modifying the open() call: it does not work for filehandles created using other methods such as pipe() or socket(), either used directly or via subprocess. There are real-world examples of situations where that is very useful.

A socket() is not a python file object, doesn't have a similar API, and doesn't have a readline method.

The result of calling socket.makefile, on the other hand, is a file object--and it's created by calling open.* And I'm pretty sure socket.makefile already takes a newline argument and just passes it along, in which case it will magically work with no changes at all.**

IIRC, os.pipe() just returns a pair of fds (integers), not a file object at all. It's up to you to wrap that in a file object if you want to--which you do by passing it to the open function.

So, neither of your objections works.

There are some better examples you could have raised, however. For example, a bz2.BzipFile is created with bz2.open. And, while the file delegates to a BufferedReader or TextIOWrapper, bz2.open almost certainly validates its inputs and won't pass newline on to the BufferedReader in binary mode. So, it would have to be changed to get the benefit.

However, given that there's no way to magically make every file-like object anyone has ever written automatically grow this new functionality, having the API change on the constructors, which are not part of any API and not consistent, is better than having it on the readline method. Think about where you'd get the error in each case: before even writing your code, when you look up how BzipFile instances are created and see there's no way to pass a newline argument, or deep in your code when you're using a file object that came from who knows where and its readline method doesn't like the standard, documented newline argument?


* Or maybe it's created by constructing a BufferedReader, BufferedWriter, BufferedRandom, or TextIOWrapper directly. I don't remember off hand. But it doesn't matter, because the suggestion is to put the new parameter in those constructors, and make open forward to them, so whether makefile calls them directly or via open, it gets the same effect.

** Unless it validates the arguments before passing them along. I looked over a few stdlib classes, and there was at least one that unnecessarily does the same validation open is going to do anyway, so obviously that needs to be removed before the class magically benefits.


In some cases (like tempfile.NamedTemporaryFile), even that isn't necessary, because the implementation just passes through all **kwargs that it doesn't want to handle to the open or constructor call.
Paul Moore
2014-07-20 13:42:20 UTC
Permalink
On 20 July 2014 12:53, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
> There are some better examples you could have raised, however. For example, a bz2.BzipFile is created
> with bz2.open. And, while the file delegates to a BufferedReader or TextIOWrapper, bz2.open almost
> certainly validates its inputs and won't pass newline on to the BufferedReader in binary mode.
> So, it would have to be changed to get the benefit.

The most significant example is one which has been mentioned, but you
may have missed. The motivation for this proposal is to interoperate
with the -0 flag on things like the unix find command. But that is
typically used in a pipe, which means your Python program will likely
receive \0 terminated records via sys.stdin. And sys.stdin is already
opened for you - you do not have the option to specify a newline
argument.

In actual fact, I can't think of a good example (either from my own
experience, or mentioned in this thread) where I'd expect to be
reading \0-terminated records from anything *except* sys.stdin.

Paul
Clint Hepner
2014-07-20 15:11:25 UTC
Permalink
--
Clint

> On Jul 20, 2014, at 9:42 AM, Paul Moore <p.f.moore-***@public.gmane.org> wrote:
>
> In actual fact, I can't think of a good example (either from my own
> experience, or mentioned in this thread) where I'd expect to be
> reading \0-terminated records from anything *except* sys.stdin.

Named pipes and whatever is used to implement process substitution ( < <(find ... -0) ) come to mind.
Juancarlo Añez
2014-07-19 11:49:58 UTC
Permalink
On Sat, Jul 19, 2014 at 3:48 AM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:

> Python isn't Unix, and Python has never supported \0 as a "line
> ending". Changing the meaning of existing constructs is fraught with
> complexity, and should only be done when there is absolutely no
> alternative. In this case, there's an alternative: a new method,
> specifically for reading arbitrary records.
>

"practicality beats purity."

http://legacy.python.org/dev/peps/pep-0020/


--
Juancarlo *Añez*
Andrew Barnert
2014-07-19 23:28:55 UTC
Permalink
(replies to multiple messages here)

On Saturday, July 19, 2014 1:19 AM, Nick Coghlan <***@gmail.com> wrote:


>On 19 July 2014 03:32, Chris Angelico <***@gmail.com> wrote:
>> On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <***@gmail.com> wrote:
>>> I still favour my proposal there to add a separate "readrecords()"
>>> method, rather than reusing the line based iteration methods - lines
>>> and arbitrary records *aren't* the same thing
>>
>> But they might well be the same thing. Look at all the Unix commands
>> that usually separate output with \n, but can be told to separate with
>> \0 instead. If you're reading from something like that, it should be
>> just as easy to split on \n as on \0.
>
>Python isn't Unix, and Python has never supported \0 as a "line
>ending".

Well, yeah, but Python is used on Unix, and it's used to write scripts that interoperate with other Unix command-line tools.

For the record, the reason this came up is that someone was trying to use one of my scripts in a pipeline with find -0, and he had no problem adapting the Perl scripts he's using to handle -0 output, but no clue how to do the same with my Python script. 

In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.

> Changing the meaning of existing constructs is fraught with
>complexity, and should only be done when there is absolutely no
>alternative. In this case, there's an alternative: a new method,
>specifically for reading arbitrary records.

This was basically my original suggestion, so obviously I don't think it's a terrible idea. But I don't think it's as good.

First, which of these is more readable, easier for novices to figure out how to write, etc.:

    with open(path, newline='\0') as f:
        for line in f:
            handle(line.rstrip('\0'))

    with open(path) as f:
        for line in iter(lambda: f.readrecord('\0'), ''):
            handle(line.rstrip('\0'))

Second, as Guido mentioned at the start of this thread, existing file-like object types (whether they implement BufferedIOBase or TextIOBase, or just duck-type the interfaces) are not going to have the new functionality. Construction has never been part of the interface of the file-like object API; opening a real file has always looked different from opening a member file in a zip archive or making a file-like wrapper around a socket transport or whatever. But using the resulting object has always been the same. Adding a readrecord method or changing the readline interface means that's no longer true.

There might be a good argument for making the change more visible—that is, using a different parameter on the open call instead of reusing the existing newline. (And that's what Alexander originally suggested as an alternative to my readrecord idea.) That way, it's much more obvious that spam.open or eggs.makefile or whatever doesn't support alternate line endings, without having to read its documentation on what newline means. But either way, I think it should go in the open function, not the file-object API.


On Saturday, July 19, 2014 2:28 AM, Nick Coghlan <***@gmail.com> wrote:

> - my preferences are driven by the fact that line endings and record
> separators are *not the same thing*.  Thinking that they are is a
> matter of confusing the conceptual data model with the implementation
> of the framing at the serialisation layer. 

Yes, using lines implicitly as records can lead to confusion—but people actually do that all the time; this isn't a new problem, and it's exactly the same problem with \r\n, or even \n, as with \0. When you open up TextEdit and write a grocery list with one item on each line, those newlines are not part of the items. When you pipe the output of find to a script, the newlines are not part of the filenames. When you pipe the output of find -0 to a script, the \0 terminators are not part of the filenames.

> Line endings are *already* confusing enough that the "universal
> newlines" mechanism was added to make it so that Python level code
> could mostly ignore the whole "\n" vs "\r" vs 
> "\r\n" distinction, and
> just assume "\n" everywhere.

I understand the point here. There are cases where universal newlines let you successfully ignore the confusion rather than dealing with it, and newline='\0' will not be useful in those cases.

But then newline='\r' is also never useful in those cases. The new behavior will be useful in exactly the cases where '\r' already is—no more, but no less.

> This is why I'm a fan of keeping things comparatively simple, and just
> adding a new method (if we only add an iterator version) or two (if we
> add a list version as well) specifically for this use case.

Actually, the obvious new method is neither the iterator version nor the list version, but a single-record version, readrecord. Sometimes you need readline/readrecord, and it's conceptually simpler for the user. And of course the implementation is a lot simpler; you don't need to build a new iterator object that references the file for readrecord the way you do for iterrecords. And finally, if you only have one of the two, as bad as iter(lambda: f.readrecord('\0'), '') may look to novices, next(f.iterrecords('\0')) would probably be even more confusing.

But we could also add an iterrecords, for two methods.

And as for the list-based version… well, I don't even understand why readlines still exists in 3.x (much less why the tutorial suggests it), so I'd be fine not having a readrecords, but I don't have any real objection.

On Saturday, July 19, 2014 1:06 PM, Guido van Rossum <***@python.org> wrote:

>I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me).

I get the feeling either there's a much simpler way to wrap a file object that I'm missing, or that you think there is.

In order to do the equivalent of readrecord, you have to do one of three things:

1. Read character by character, which can be incredibly slow.

2. Peek or push back on the buffer, as the io classes' readline methods do.


3. Put another buffer in front of the file, which means you have two objects both sharing the same file but with effective file pointers out of sync. And you have to reproduce all of the file-like-object API methods for your new buffered object (a lot more work, and a lot more to get wrong—effectively, it means you have to write all of BufferedReader or TextIOWrapper, but modified to wrap another buffered file instead of wrapping the lower-level thing). And no matter how you do it, it's obviously going to be less efficient.

If there's a lighter version of #3 that makes sense, I'm not seeing it. Which is probably a problem with my lack of insight, but I'd appreciate a pointer in the right direction.
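To make option 2 concrete, a peek-based readrecord for a binary BufferedReader might look like the sketch below (name and signature are hypothetical; it is restricted to single-byte separators, since a multi-byte one could straddle two peeks and be missed):

```python
def read_record(buffered, sep=b'\0'):
    """Sketch: read one record (including its separator, if present)
    from an io.BufferedReader, the way binary readline uses peek()."""
    assert len(sep) == 1
    chunks = []
    while True:
        peeked = buffered.peek(1)   # buffered bytes; b'' only at EOF
        if not peeked:
            break
        i = peeked.find(sep)
        if i >= 0:
            # Consume up to and including the separator, then stop.
            chunks.append(buffered.read(i + 1))
            break
        # No separator in the buffer: consume it all and refill.
        chunks.append(buffered.read(len(peeked)))
    return b''.join(chunks)
```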

>I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way).


Maybe using a different argument is a better answer. (That's what Alexander suggested originally.)

The reason both I and people on the bug thread suggested using newline instead is because the behavior you want from sep='\0' happens to be identical to the behavior you get from newline='\r', except with '\0' instead of '\r'.

And that's the best argument I have for reusing newline: someone has already worked out and documented all the implications of newline, and people have already learned them, so if we really want the same functionality, it makes sense to reuse it. 

But I realize that argument only goes so far. It wasn't obvious, until I looked into it, that I wanted the exact same functionality.

>I personally think it's preposterous to use \0 as a separator for text files (nothing screams binary data like a null byte :-).

Sure, it would have been a lot better for find and friends to grow a --escape parameter instead of -0, but I think that ship has sailed.

>I don't think it's a big deal if a method named readline() returns a record that doesn't end in a \n character.
>
>I value the equivalence of __next__() and readline().
>
>I still think you should solve this using a wrapper class (that does its own buffering if necessary, and implements the rest of the stream protocol for the benefit of other consumers of some of the data).

Again, I don't see any way to do this sensibly that wouldn't be a whole lot more work than just forking the io package.

But maybe that's the answer: I can write _io2 as a fork of _io with my changes, the same for _pyio2 (for PyPy), and then the only thing left to write is a __main__ for the package that wraps up _io2/_pyio2 in the io ABCs (and re-exports those ABCs).
Nick Coghlan
2014-07-19 23:49:38 UTC
Permalink
On 20 Jul 2014 09:28, "Andrew Barnert" <abarnert-/***@public.gmane.org> wrote:
>
> (replies to multiple messages here)
>
> On Saturday, July 19, 2014 1:19 AM, Nick Coghlan <ncoghlan-***@public.gmane.org>
wrote:
>
>
> >On 19 July 2014 03:32, Chris Angelico <rosuav-***@public.gmane.org> wrote:
> >> On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan-***@public.gmane.org>
wrote:
> >>> I still favour my proposal there to add a separate "readrecords()"
> >>> method, rather than reusing the line based iteration methods - lines
> >>> and arbitrary records *aren't* the same thing
> >>
> >> But they might well be the same thing. Look at all the Unix commands
> >> that usually separate output with \n, but can be told to separate with
> >> \0 instead. If you're reading from something like that, it should be
> >> just as easy to split on \n as on \0.
> >
> >Python isn't Unix, and Python has never supported \0 as a "line
> >ending".
>
> Well, yeah, but Python is used on Unix, and it's used to write scripts
that interoperate with other Unix command-line tools.
>
> For the record, the reason this came up is that someone was trying to use
one of my scripts in a pipeline with find -0, and he had no problem
adapting the Perl scripts he's using to handle -0 output, but no clue how
to do the same with my Python script.
>
> In general, it's just as easy to write Unix command-line tools in Python
as in Perl, and that's a good thing—it means I don't have to use Perl. But
as soon as -0 comes into the mix, that's no longer true. And that's a
problem.

I would find adding NULL to the potential newline set significantly less
objectionable than opening it up to arbitrary character sequences.

Adding a single possible newline character is a much simpler change, and
one likely to have far fewer odd consequences. This is especially so if
specifying NULL as the line separator is only permitted for files opened in
binary mode.

Cheers,
Nick.
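[Editorial aside: even without any io changes, the binary-mode record iteration discussed above can be approximated today with a small generator that re-splits read() chunks on NUL. This is only a sketch; the iter_records name is made up, not a proposed API.]

```python
import io

def iter_records(binary_file, sep=b'\0', chunk_size=4096):
    """Yield sep-separated records from a binary file object."""
    buf = b''
    while True:
        chunk = binary_file.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        parts = buf.split(sep)
        buf = parts.pop()          # last piece may be an incomplete record
        yield from parts
    if buf:                        # trailing record with no final separator
        yield buf

# Example with an in-memory "file":
f = io.BytesIO(b'spam.txt\0ham\neggs.txt\0')
print(list(iter_records(f)))  # [b'spam.txt', b'ham\neggs.txt']
```

This is essentially the resplit() workaround from the start of the thread, and it shows the cost the proposal is trying to avoid: double buffering, and a file position unrelated to the record just yielded.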
Chris Angelico
2014-07-19 23:51:26 UTC
Permalink
On Sun, Jul 20, 2014 at 9:49 AM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
> Adding a single possible newline character is a much simpler change, and one
> likely to have far fewer odd consequences. This is especially so if
> specifying NULL as the line separator is only permitted for files opened in
> binary mode.

U+0000 is a valid Unicode character, so I'd have no objection to, for
instance, splitting a UTF-8 encoded text file on \0.

ChrisA
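[Editorial aside: the round trip Chris alludes to does hold, since UTF-8 encodes U+0000 as the single byte 0x00. A quick sketch:]

```python
# NUL-separated text survives a UTF-8 encode/decode round trip and can
# be split at the str level, even when a name contains a newline.
names = ['spam.txt', 'ham\neggs.txt']  # second name embeds a newline
data = '\0'.join(names).encode('utf-8')
decoded = data.decode('utf-8').split('\0')
print(decoded)  # ['spam.txt', 'ham\neggs.txt']
```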
Nick Coghlan
2014-07-19 23:56:18 UTC
Permalink
On 20 Jul 2014 09:49, "Nick Coghlan" <ncoghlan-***@public.gmane.org> wrote:
>
>
> On 20 Jul 2014 09:28, "Andrew Barnert" <abarnert-/***@public.gmane.org> wrote:
> >
> > (replies to multiple messages here)
> >
> > On Saturday, July 19, 2014 1:19 AM, Nick Coghlan <ncoghlan-***@public.gmane.org>
wrote:
> >
> >
> > >On 19 July 2014 03:32, Chris Angelico <rosuav-***@public.gmane.org> wrote:
> > >> On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan-***@public.gmane.org>
wrote:
> > >>> I still favour my proposal there to add a separate "readrecords()"
> > >>> method, rather than reusing the line based iteration methods - lines
> > >>> and arbitrary records *aren't* the same thing
> > >>
> > >> But they might well be the same thing. Look at all the Unix commands
> > >> that usually separate output with \n, but can be told to separate
with
> > >> \0 instead. If you're reading from something like that, it should be
> > >> just as easy to split on \n as on \0.
> > >
> > >Python isn't Unix, and Python has never supported \0 as a "line
> > >ending".
> >
> > Well, yeah, but Python is used on Unix, and it's used to write scripts
that interoperate with other Unix command-line tools.
> >
> > For the record, the reason this came up is that someone was trying to
use one of my scripts in a pipeline with find -0, and he had no problem
adapting the Perl scripts he's using to handle -0 output, but no clue how
to do the same with my Python script.
> >
> > In general, it's just as easy to write Unix command-line tools in
Python as in Perl, and that's a good thing—it means I don't have to use
Perl. But as soon as -0 comes into the mix, that's no longer true. And
that's a problem.
>
> I would find adding NULL to the potential newline set significantly less
objectionable than opening it up to arbitrary character sequences.
>
> Adding a single possible newline character is a much simpler change, and
one likely to have far fewer odd consequences. This is especially so if
specifying NULL as the line separator is only permitted for files opened in
binary mode.

Also, the interoperability argument is a good one, as is the analogy with
'\r'. Since this does end up touching the open() builtin and the core IO
abstractions, it will need a PEP.

As far as implementation goes, I suspect a RecordIOWrapper layered IO model
inspired by the approach used for TextIOWrapper may make sense.

Cheers,
Nick.

>
> Cheers,
> Nick.
Andrew Barnert
2014-07-20 00:57:14 UTC
Permalink
On Saturday, July 19, 2014 4:49 PM, Nick Coghlan <***@gmail.com> wrote:

>On 20 Jul 2014 09:28, "Andrew Barnert" <***@yahoo.com> wrote:


>> In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.

>I would find adding NULL to the potential newline set significantly less objectionable than opening it up to arbitrary character sequences.


>Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode.


But the newline parameter is only permitted in text mode. Are you suggesting that we add newline to binary mode, but the only allowed values are None (current behavior) and b'\0', while on text files the list of allowed values stays the same as today?

Also, would you want the same semantics for newline='\0' on binary files that newline='\r' has on text files (including newline remapping on write)?

And I'm still not sure why you think this shouldn't be allowed in text mode in the first place (especially given that you suggested the same thing for text files _only_ a few years ago).

The output of find is a list of newline-separated or \0-separated filenames, in the filesystem's encoding. Why should I be able to handle the first as a text file, but have to handle the second as a binary file and then manually decode each line?


You could argue that find -print0 isn't really separating Unicode filenames with U+0000, but separating UTF-8 or Latin-1 or whatever filenames with \x00, and it's just a coincidence that they happen to match up. But it really isn't just a coincidence; it was an intentional design decision for Unicode (and UTF-8, and Latin-1) that the ASCII control characters map in the obvious way, and one that many tools and scripts take advantage of, so why shouldn't tools and scripts written in Python be able to take advantage of it?
Nick Coghlan
2014-07-20 01:23:56 UTC
Permalink
On 20 July 2014 10:57, Andrew Barnert <***@yahoo.com> wrote:
> On Saturday, July 19, 2014 4:49 PM, Nick Coghlan <***@gmail.com> wrote:
>
>>On 20 Jul 2014 09:28, "Andrew Barnert" <***@yahoo.com> wrote:
>
>
>>> In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.
>
>>I would find adding NULL to the potential newline set significantly less objectionable than opening it up to arbitrary character sequences.
>
>
>>Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode.
>
>
> But the newline parameter is only permitted in text mode. Are you suggesting that we add newline to binary mode, but the only allowed values are None (current behavior) and b'\0', while on text files the list of allowed values stays the same as today?

Actually, I temporarily forgot that newline was only handled at the
TextIOWrapper layer. All the more reason for a PEP that clearly lays
out the status quo (both Python's own newline handling and the "-0"
option for various UNIX utilities, and the way that is handled in
other scripting languages), and discusses the various options for
dealing with it (new RecordIOWrapper class with a new "open"
parameter, new methods on IO classes, new semantics on the existing
TextIOWrapper class).

If the description of the use cases is clear enough, then the "right
answer" amongst the presented alternatives (which includes "don't
change anything") may be obvious. At present, I'm genuinely unclear on
why someone would ever want to pass the "-0" option to the other UNIX
utilities, which then makes it very difficult to have a sensible
discussion on how we should address that use case in Python.

Cheers,
Nick.

--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Chris Angelico
2014-07-20 01:31:10 UTC
Permalink
On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
> At present, I'm genuinely unclear on
> why someone would ever want to pass the "-0" option to the other UNIX
> utilities, which then makes it very difficult to have a sensible
> discussion on how we should address that use case in Python.

That one's easy. What happens if you use 'find' to list files, and
those files might have \n in their names? You need another sep.

ChrisA
Nick Coghlan
2014-07-20 01:40:25 UTC
Permalink
On 20 July 2014 11:31, Chris Angelico <rosuav-***@public.gmane.org> wrote:
> On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
>> At present, I'm genuinely unclear on
>> why someone would ever want to pass the "-0" option to the other UNIX
>> utilities, which then makes it very difficult to have a sensible
>> discussion on how we should address that use case in Python.
>
> That one's easy. What happens if you use 'find' to list files, and
> those files might have \n in their names? You need another sep.

Yes, but having a newline in a filename is sufficiently weird that I
find it hard to imagine a scenario where "fix the filenames" isn't a
better answer. Hence why I think the PEP needs to explain why the UNIX
utilities considered this use case sufficiently non-obscure to add
explicit support for it, rather than just assuming that the
obviousness of the use case can be taken for granted.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan-***@public.gmane.org | Brisbane, Australia
Andrew Barnert
2014-07-20 03:58:58 UTC
Permalink
On Saturday, July 19, 2014 6:42 PM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:

> On 20 July 2014 11:31, Chris Angelico <rosuav-***@public.gmane.org> wrote:
>>  On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan-***@public.gmane.org>
> wrote:
>>>  At present, I'm genuinely unclear on
>>>  why someone would ever want to pass the "-0" option to the
>>> other UNIX
>>>  utilities, which then makes it very difficult to have a sensible
>>>  discussion on how we should address that use case in Python.
>>
>>  That one's easy. What happens if you use 'find' to list files,
>> and
>>  those files might have \n in their names? You need another sep.
>
> Yes, but having a newline in a filename is sufficiently weird that I
> find it hard to imagine a scenario where "fix the filenames" isn't
> a
> better answer. Hence why I think the PEP needs to explain why the UNIX
> utilities considered this use case sufficiently non-obscure to add
> explicit support for it, rather than just assuming that the
> obviousness of the use case can be taken for granted.


First, why is it so odd to have newlines in filenames? It used to be pretty common on Classic Mac. Sure, they're not too common nowadays, but that's because they're illegal on DOS/Windows, and because the shell on Unix systems makes them a pain to deal with, not because there's something inherently nonsensical about the idea, any more than there is about filenames with spaces, non-ASCII characters, or lengths over 255.


Second, "fix the filenames" is almost _never_ a better answer. If you're publishing a program for other people to use, you want to document that it won't work on some perfectly good files, and close their bugs as "Not a bug, rename your files if you want to use my software"? If the files are on a read-only filesystem or a slow tape backup, you really want to copy the entire filesystem over just so you can run a script on it?

Also, even if "fix the filenames" were the right answer, you need to write a tool to do that, and why shouldn't it be possible to use Python for that tool? (In fact, one of the scripts I wanted this feature for is a replacement for the traditional rename tool (http://plasmasturm.org/code/rename/). I mainly wanted to let people use regular expressions without letting them run arbitrary Perl code, as rename -e does, but also, I couldn't figure out how to rename "foo" to "Foo" on a case-preserving-but-insensitive filesystem in Perl, and I know how to do it in Python.)

At any rate, there are decades of tradition behind using -print0, and that's not going to change just because Python isn't as good as other languages at dealing with it. The GNU find documentation (http://linux.die.net/man/1/find) explicitly recommends, in multiple places, using -print0 instead of -print whenever possible. (For example, in the summary near the top, "If no expression is given, the expression -print is used (but you should probably consider using -print0 instead, anyway).")


And part of the reason for that is that many other tools, like xargs, split on any whitespace, not on newlines, if not given the -0 argument. Fortunately, all of those tools know how to handle backslash escapes, but unfortunately, find doesn't know how to emit them. (Actually, frustratingly, both BSD and SysV find have the code to do it, but not in a way you can use here.) So, if you're writing a script that uses find and might get piped to anything that handles input like xargs, you have to use -print0.

And that means, if you're writing a tool that might get find piped to it, you have to handle -print0, even if you're pretty sure nobody will ever have newlines for you to deal with, because they're probably going to want to use -print0 anyway, rather than figure out how your tool deals with other whitespace.
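[Editorial aside: handling -print0 output from Python today means splitting the raw bytes manually. A minimal sketch, assuming a GNU or BSD find is on PATH; the decoding mirrors what the proposal would do automatically:]

```python
import subprocess
import sys

# Run `find . -maxdepth 1 -print0` and split its output on NUL,
# decoding each name with the filesystem encoding (surrogateescape
# preserves names that are undecodable in the current locale).
out = subprocess.run(['find', '.', '-maxdepth', '1', '-print0'],
                     stdout=subprocess.PIPE).stdout
names = [p.decode(sys.getfilesystemencoding(), 'surrogateescape')
         for p in out.split(b'\0') if p]
print(names[:3])
```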
Nick Coghlan
2014-07-20 05:00:15 UTC
Permalink
On 20 July 2014 13:58, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
> First, why is it so odd to have newlines in filenames? It used to be pretty common on Classic Mac. Sure, they're not too common nowadays, but that's because they're illegal on DOS/Windows, and because the shell on Unix systems makes them a pain to deal with, not because there's something inherently nonsensical about the idea, any more than filenames with spaces or non-ASCII characters or >255 length.

You answered your own question: because DOS/Windows make them illegal,
and the Unix shell isn't fond of them either. I was a DOS/Windows user
for more than a decade before switching to Linux for personal use, and
in a decade of using Linux (and even going to work for a Linux
vendor), I've never encountered a filename with a newline in it. Thus
the idea that anyone *would* do such a thing, and that it would be
prevalent enough for UNIX tools to include a workaround in programs
that normally produce newline separated output is an entirely novel
concept for me. Any such file I encountered *would* be an outlier, and
I'd likely be in a position to get the offending filename fixed rather
than changing any data processing pipelines (whether written in Python
or not) to tolerate newlines in filenames (since the cost differential
between fixing one filename vs updating the data processing pipelines
would be enormous).

However, note that my attitude changed significantly once you
clarified the use case - it's clear that there *is* a use case, it's
just one that's outside my own personal experience. That's one of the
things the PEP process is for - to explain such use cases to folks
that haven't personally encountered them, and then explain why the
proposed solution addresses the use case in a way that makes sense for
the domains where the use case arises. The recent matrix
multiplication PEP was an exemplary instance of the breed.

That's what I'm asking for here: a PEP that makes sense to someone
like me for whom the idea of putting a newline in a filename is
completely alien. Yes, it's technically permitted by the underlying
operating system APIs on POSIX systems, but all the affordances at
both the console and GUI level suggest "no newlines allowed". If
you're coming from a DOS/Windows background (as I did), then the idea
that a newline is technically a permitted filename character may never
even occur to you (it certainly hadn't to me, and I'd never previously
come across anything to challenge that assumption).

Regards,
Nick.

--
Nick Coghlan | ncoghlan-***@public.gmane.org | Brisbane, Australia
Andrew Barnert
2014-07-21 00:41:32 UTC
Permalink
On Saturday, July 19, 2014 10:00 PM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:

> That's one of the
> things the PEP process is for - to explain such use cases to folks
> that haven't personally encountered them, and then explain why the
> proposed solution addresses the use case in a way that makes sense for
> the domains where the use case arises.

OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt

It's probably a lot more detailed than necessary in many areas, but I figured it was better to include too much than to leave things ambiguous; after I know which parts are not contentious, I can strip it down in the next revision.

Meanwhile, while writing it, and re-reading Guido's replies in this thread, I decided to come back to the alternative idea of exposing text files' buffers just like binary files' buffers. If done properly, that would make it much easier (still not trivial, but much easier) for users to just implement the readrecord functionality on their own, or for someone to package it up on PyPI. And I don't think the idea is as radical as it sounded at first, so I don't want it to be dismissed out of hand. So, also see http://bugs.python.org/file36009/pep-peek.txt

Finally, writing this up made me recognize a couple of minor problems with the patch I'd been writing, and I don't think I have time to clean it up and write relevant tests now, so I might not be able to upload a useful patch until next weekend. Hopefully people can still discuss the PEP without a patch to play with.
Paul Moore
2014-07-21 07:04:32 UTC
Permalink
On 21 July 2014 01:41, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
> OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt

As a suggestion, how about adding an example of a simple nul-separated
filename filter - the sort of thing that could go in a find -print0 |
xxx | xargs -0 pipeline? If I understand it, that's one of the key
motivating examples for this change, so seeing how it's done would be
a great help.

Here's the sort of thing I mean, written for newline-separated files:

import sys

def process(filename):
    """Trivial example"""
    return filename.lower()

if __name__ == '__main__':
    for filename in sys.stdin:
        filename = process(filename)
        print(filename)

This is also an example of why I'm struggling to understand how an
open() parameter "solves all the cases". There's no explicit open()
call here, so how do you specify the record separator? Seeing how you
propose this would work would be really helpful to me.

Paul
Akira Li
2014-07-22 16:05:42 UTC
Permalink
Paul Moore <p.f.moore-***@public.gmane.org> writes:

> On 21 July 2014 01:41, Andrew Barnert
> <abarnert-/***@public.gmane.org> wrote:
>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>> not a good thing to do, apologies); you can find it at
>> http://bugs.python.org/file36008/pep-newline.txt
>
> As a suggestion, how about adding an example of a simple nul-separated
> filename filter - the sort of thing that could go in a find -print0 |
> xxx | xargs -0 pipeline? If I understand it, that's one of the key
> motivating examples for this change, so seeing how it's done would be
> a great help.
>
> Here's the sort of thing I mean, written for newline-separated files:
>
> import sys
>
> def process(filename):
>     """Trivial example"""
>     return filename.lower()
>
> if __name__ == '__main__':
>     for filename in sys.stdin:
>         filename = process(filename)
>         print(filename)
>
> This is also an example of why I'm struggling to understand how an
> open() parameter "solves all the cases". There's no explicit open()
> call here, so how do you specify the record separator? Seeing how you
> propose this would work would be really helpful to me.
>

The `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
can replace the `sys.std*` streams without worrying about preserving the
`sys.__std*__` streams:

#!/usr/bin/env python
import io
import re
import sys
from pathlib import Path

def transform_filename(filename: str) -> str:  # example
    """Normalize whitespace in basename."""
    path = Path(filename)
    new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
    path.replace(new_path)  # rename on disk if necessary
    return str(new_path)

def SystemTextStream(bytes_stream, **kwargs):
    encoding = sys.getfilesystemencoding()
    return io.TextIOWrapper(bytes_stream,
                            encoding=encoding,
                            errors='surrogateescape' if encoding != 'mbcs' else 'strict',
                            **kwargs)

nl = '\0' if '-0' in sys.argv else None
sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
for line in SystemTextStream(sys.stdin.detach(), newline=nl):
    print(transform_filename(line.rstrip(nl)), end=nl)

io.TextIOWrapper() plays the role of open() in this case. The code
assumes that `newline` parameter accepts '\0'.

The example function handles Unicode whitespace to demonstrate why
opaque bytes-based cookies can't be used to represent filenames in this
case even on POSIX, though which characters are recognized depends on
sys.getfilesystemencoding().

Note:

- `end=nl` is necessary because `print()` prints '\n' by default -- it
does not use `file.newline`
- `-0` option is required in the current implementation if filenames may
have a trailing whitespace. It can be improved
- SystemTextStream() handles filenames that are undecodable in the current
locale, i.e., non-ASCII names are allowed even in the C locale (LC_CTYPE=C)
- undecodable filenames are not supported on Windows. It is not clear
how to pass an undecodable filename via a pipe on Windows -- perhaps
`GetShortPathNameW -> fsencode -> pipe` might work in some cases. It
assumes that the short path exists and it is always encodable using
mbcs. If we can control all parts of the pipeline *and* Windows API
uses proper utf-16 (not ucs-2) then utf-8 can be used to pass
filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be
tried e.g., https://github.com/Drekin/win-unicode-console


--
Akira
Paul Moore
2014-07-22 17:35:58 UTC
Permalink
On 22 July 2014 17:05, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
> The example function handles Unicode whitespace to demonstrate why
> opaque bytes-based cookies can't be used to represent filenames in this
> case even on POSIX, though which characters are recognized depends on
> sys.getfilesystemencoding().

Thanks. That's how you'd do it now.

A question for the OP: how would the proposed change improve this code?
Paul
Akira Li
2014-07-22 23:48:06 UTC
Permalink
Paul Moore <p.f.moore-***@public.gmane.org> writes:

> On 22 July 2014 17:05, Akira Li
> <4kir4.1i-***@public.gmane.org> wrote:
>> The example function handles Unicode whitespace to demonstrate why
>> opaque bytes-based cookies can't be used to represent filenames in this
>> case even on POSIX, though which characters are recognized depends on
>> sys.getfilesystemencoding().
>
> Thanks. That's how you'd do it now.

You've cut too much e.g. I wrote in [1]:

>> io.TextIOWrapper() plays the role of open() in this case. The code
>> assumes that `newline` parameter accepts '\0'.

[1] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html

> A question for the OP: how would the proposed change improve this code?
> Paul

I'm not sure who is OP in this context but I can answer: the proposed
change might allow TextIOWrapper(.., newline='\0') and the code in [1]
doesn't support `-0` command-line parameter without it.


--
Akira
Paul Moore
2014-07-23 08:11:23 UTC
Permalink
On 23 July 2014 00:48, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
> I'm not sure who is OP in this context but I can answer: the proposed
> change might allow TextIOWrapper(.., newline='\0') and the code in [1]
> doesn't support `-0` command-line parameter without it.

I see. My apologies, I read that part but didn't spot what you meant.
Thanks for clarifying.
Andrew Barnert
2014-07-23 04:40:54 UTC
Permalink
On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i-***@public.gmane.org> wrote:

> Paul Moore <p.f.moore-***@public.gmane.org> writes:
>
>> On 21 July 2014 01:41, Andrew Barnert
>> <abarnert-/***@public.gmane.org> wrote:
>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>>> not a good thing to do, apologies); you can find it at
>>> http://bugs.python.org/file36008/pep-newline.txt
>>
>> As a suggestion, how about adding an example of a simple nul-separated
>> filename filter - the sort of thing that could go in a find -print0 |
>> xxx | xargs -0 pipeline? If I understand it, that's one of the key
>> motivating examples for this change, so seeing how it's done would be
>> a great help.
>>
>> Here's the sort of thing I mean, written for newline-separated files:
>>
>> import sys
>>
>> def process(filename):
>>     """Trivial example"""
>>     return filename.lower()
>>
>> if __name__ == '__main__':
>>     for filename in sys.stdin:
>>         filename = process(filename)
>>         print(filename)
>>
>> This is also an example of why I'm struggling to understand how an
>> open() parameter "solves all the cases". There's no explicit open()
>> call here, so how do you specify the record separator? Seeing how you
>> propose this would work would be really helpful to me.
>
> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
> can replace `sys.std*` streams without worrying about preserving
> `sys.__std*__` streams:
>
> #!/usr/bin/env python
> import io
> import re
> import sys
> from pathlib import Path
>
> def transform_filename(filename: str) -> str:  # example
>     """Normalize whitespace in basename."""
>     path = Path(filename)
>     new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
>     path.replace(new_path)  # rename on disk if necessary
>     return str(new_path)
>
> def SystemTextStream(bytes_stream, **kwargs):
>     encoding = sys.getfilesystemencoding()
>     return io.TextIOWrapper(bytes_stream,
>                             encoding=encoding,
>                             errors='surrogateescape' if encoding != 'mbcs' else 'strict',
>                             **kwargs)
>
> nl = '\0' if '-0' in sys.argv else None
> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>     print(transform_filename(line.rstrip(nl)), end=nl)

Nice, much more complete example than mine. I just tried to handle the edge cases the original question asked about, but you handle everything.

> io.TextIOWrapper() plays the role of open() in this case. The code
> assumes that `newline` parameter accepts '\0'.
>
> The example function handles Unicode whitespace to demonstrate why
> opaque bytes-based cookies can't be used to represent filenames in this
> case even on POSIX, though which characters are recognized depends on
> sys.getfilesystemencoding().
>
> Note:
>
> - `end=nl` is necessary because `print()` prints '\n' by default -- it
> does not use `file.newline`

Actually, yes it does. Or, rather, print pastes on a '\n', but sys.stdout.write translates any '\n' characters to sys.stdout.writenl (a private variable that's initialized from the newline argument at construction time if it's anything other than None or '').

But of course that's the newline argument to sys.stdout, and you only changed sys.stdin, so you do need end=nl anyway. (And you wouldn't want output translation here anyway, because that could also translate '\n' characters in the middle of a line, re-creating the same problem we're trying to avoid...)

But it uses sys.stdout.newline, not sys.stdin.newline.
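[Editorial aside: the write-side translation described above is easy to verify with an in-memory buffer. A quick sketch:]

```python
import io

# With newline='\r\n' (any specific string other than '' or '\n'),
# TextIOWrapper translates '\n' on write; '\r' passes through untouched.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding='ascii', newline='\r\n')
out.write('a\nb\r')
out.flush()
print(buf.getvalue())  # b'a\r\nb\r'
```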

> - `-0` option is required in the current implementation if filenames may
> have a trailing whitespace. It can be improved
>> - SystemTextStream() handles filenames that are undecodable in the
>> current locale, i.e., non-ASCII names are allowed even in the C locale
>> (LC_CTYPE=C)
> - undecodable filenames are not supported on Windows. It is not clear
> how to pass an undecodable filename via a pipe on Windows -- perhaps
> `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It
> assumes that the short path exists and it is always encodable using
> mbcs. If we can control all parts of the pipeline *and* Windows API
> uses proper utf-16 (not ucs-2) then utf-8 can be used to pass
> filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be
> tried e.g., https://github.com/Drekin/win-unicode-console

First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on top of it guarantee that you can never get such unencodable filenames (sometimes by just pretending the file doesn't exist, but if possible by having the filesystem map it to something valid, unique, and persistent for this session, usually the short name)?

Second, trying to solve this implies that you have some other native (as opposed to Cygwin) tool that passes or accepts such filenames over simple pipes (as opposed to PowerShell typed ones). Are there any? What does, say, mingw's find do with invalid filenames if it finds them?

On Unix, of course, it's a real problem.
Akira Li
2014-07-23 12:13:06 UTC
Permalink
Andrew Barnert <abarnert-/***@public.gmane.org> writes:

> On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
>
>> Paul Moore <p.f.moore-***@public.gmane.org> writes:
>>
>>> On 21 July 2014 01:41, Andrew Barnert
>>> <abarnert-/***@public.gmane.org> wrote:
>>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>>>> not a good thing to do, apologies); you can find it at
>>>> http://bugs.python.org/file36008/pep-newline.txt
>>>
>>> As a suggestion, how about adding an example of a simple nul-separated
>>> filename filter - the sort of thing that could go in a find -print0 |
>>> xxx | xargs -0 pipeline? If I understand it, that's one of the key
>>> motivating examples for this change, so seeing how it's done would be
>>> a great help.
>>>
>>> Here's the sort of thing I mean, written for newline-separated files:
>>>
>>> import sys
>>>
>>> def process(filename):
>>>     """Trivial example"""
>>>     return filename.lower()
>>>
>>> if __name__ == '__main__':
>>>     for filename in sys.stdin:
>>>         filename = process(filename)
>>>         print(filename)
>>>
>>> This is also an example of why I'm struggling to understand how an
>>> open() parameter "solves all the cases". There's no explicit open()
>>> call here, so how do you specify the record separator? Seeing how you
>>> propose this would work would be really helpful to me.
>>
>> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
>> can replace `sys.std*` streams without worrying about preserving
>> `sys.__std*__` streams:
>>
>> #!/usr/bin/env python
>> import io
>> import re
>> import sys
>> from pathlib import Path
>>
>> def transform_filename(filename: str) -> str:  # example
>>     """Normalize whitespace in basename."""
>>     path = Path(filename)
>>     new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
>>     path.replace(new_path)  # rename on disk if necessary
>>     return str(new_path)
>>
>> def SystemTextStream(bytes_stream, **kwargs):
>>     encoding = sys.getfilesystemencoding()
>>     return io.TextIOWrapper(bytes_stream,
>>                             encoding=encoding,
>>                             errors='surrogateescape' if encoding != 'mbcs' else 'strict',
>>                             **kwargs)
>>
>> nl = '\0' if '-0' in sys.argv else None
>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
>> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>>     print(transform_filename(line.rstrip(nl)), end=nl)
>
> Nice, much more complete example than mine. I just tried to handle as
> many edge cases as the original he asked about, but you handle
> everything.
>
>> io.TextIOWrapper() plays the role of open() in this case. The code
>> assumes that `newline` parameter accepts '\0'.
>>
>> The example function handles Unicode whitespace to demonstrate why
>> opaque bytes-based cookies can't be used to represent filenames in this
>> case even on POSIX, though which characters are recognized depends on
>> sys.getfilesystemencoding().
>>
>> Note:
>>
>> - `end=nl` is necessary because `print()` prints '\n' by default -- it
>> does not use `file.newline`
>
> Actually, yes it does. Or, rather, print pastes on a '\n', but
> sys.stdout.write translates any '\n' characters to sys.stdout.writenl
> (a private variable that's initialized from the newline argument at
> construction time if it's anything other than None or '').

You are right. I stopped reading the source for the print() function at
the `PyFile_WriteString("\n", file);` line, assuming that "\n" is not
translated if newline="\0". But the current behaviour, if "\0" were in
"the other legal values" category (like "\r"), would be to translate "\n"
[1]:

When writing output to the stream, if newline is None, any '\n'
characters written are translated to the system default line
separator, os.linesep. If newline is '' or '\n', no translation takes
place. If newline is any of the other legal values, any '\n'
characters written are translated to the given string.

[1] https://docs.python.org/3/library/io.html#io.TextIOWrapper

Example:

$ ./python -c 'import sys, io;
sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n");
sys.stdout.write("\n\r\r\n")'| xxd
0000000: 0d0a 0d0d 0d0a ......

"\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r").
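The same translation can be reproduced in memory (a small sketch; io.BytesIO stands in for the real stdout of the shell example):

```python
import io

raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding='ascii', newline='\r\n')
out.write('\n\r\r\n')
out.flush()
# '\n' -> b'\r\n'; a bare '\r' passes through untouched
assert raw.getvalue() == b'\r\n\r\r\r\n'   # xxd: 0d0a 0d0d 0d0a
```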

For the newline="\0" case to work, it should instead behave like the
newline='' or newline='\n' cases, i.e., no translation should take place,
to avoid corrupting embedded "\n" and "\r" characters. My original code
works as is in this case, i.e., *end=nl is still necessary*.

> But of course that's the newline argument to sys.stdout, and you only
> changed sys.stdin, so you do need end=nl anyway. (And you wouldn't
> want output translation here anyway, because that could also translate
> \n' characters in the middle of a line, re-creating the same problem
> we're trying to avoid...)
>
> But it uses sys.stdout.newline, not sys.stdin.newline.

The code affects *both* sys.stdout and sys.stdin. Look at [2]:

>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
>> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>>     print(transform_filename(line.rstrip(nl)), end=nl)

[2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html

>> - SystemTextStream() handles undecodable in the current locale filenames
>> i.e., non-ascii names are allowed even in C locale (LC_CTYPE=C)
>> - undecodable filenames are not supported on Windows. It is not clear
>> how to pass an undecodable filename via a pipe on Windows -- perhaps
>> `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It
>> assumes that the short path exists and it is always encodable using
>> mbcs. If we can control all parts of the pipeline *and* Windows API
>> uses proper utf-16 (not ucs-2) then utf-8 can be used to pass
>> filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be
>> tried e.g., https://github.com/Drekin/win-unicode-console
>
> First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on
> top of it guarantee that you can never get such unencodable filenames
> (sometimes by just pretending the file doesn't exist, but if possible
> by having the filesystem map it to something valid, unique, and
> persistent for this session, usually the short name)?
> Second, trying to solve this implies that you have some other native
> (as opposed to Cygwin) tool that passes or accepts such filenames over
> simple pipes (as opposed to PowerShell typed ones). Are there any?
> What does, say, mingw's find do with invalid filenames if it finds
> them?

In short: I don't know :)

To be clear, I'm talking about native Windows applications (not
find/xargs on Cygwin). The goal is to robustly process *arbitrary*
filenames on Windows via a pipe (SystemTextStream()) or a network (bytes
interface).

I know that the ANSI (A) API (and therefore the "POSIX-ish layer" that
uses narrow strings, such as main(), fopen(), and fstream) is broken,
e.g., for Thai filenames on a Greek computer [3]. The Unicode (W) API
should enforce UTF-16 in principle since Windows 2000 [4]. But I expect
UCS-2 to show its ugly head in many places, due to bad programming
practices (based on the common wrong assumption that Unicode == UTF-16 ==
UCS-2) and/or bugs that were not fixed due to MS' backwards-compatibility
policies in the past [5].

[3]
http://blog.gatunka.com/2014/04/25/character-encodings-for-modern-programmers/
[4] http://en.wikipedia.org/wiki/UTF-16#Use_in_major_operating_systems_and_environments
[5] http://blogs.msdn.com/b/oldnewthing/archive/2003/10/15/55296.aspx


--
Akira
Andrew Barnert
2014-07-23 15:49:19 UTC
Permalink
On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i-***@public.gmane.org> wrote:

> Andrew Barnert <abarnert-/***@public.gmane.org> writes:
>
>> On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
>>
>>> Paul Moore <p.f.moore-***@public.gmane.org> writes:
>>>
>>>> On 21 July 2014 01:41, Andrew Barnert
>>>> <abarnert-/***@public.gmane.org> wrote:
>>>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>>>>> not a good thing to do, apologies); you can find it at
>>>>> http://bugs.python.org/file36008/pep-newline.txt
>>>>
>>>> As a suggestion, how about adding an example of a simple nul-separated
>>>> filename filter - the sort of thing that could go in a find -print0 |
>>>> xxx | xargs -0 pipeline? If I understand it, that's one of the key
>>>> motivating examples for this change, so seeing how it's done would be
>>>> a great help.
>>>>
>>>> Here's the sort of thing I mean, written for newline-separated files:
>>>>
>>>> import sys
>>>>
>>>> def process(filename):
>>>> """Trivial example"""
>>>> return filename.lower()
>>>>
>>>> if __name__ == '__main__':
>>>>
>>>> for filename in sys.stdin:
>>>> filename = process(filename)
>>>> print(filename)
>>>>
>>>> This is also an example of why I'm struggling to understand how an
>>>> open() parameter "solves all the cases". There's no explicit open()
>>>> call here, so how do you specify the record separator? Seeing how you
>>>> propose this would work would be really helpful to me.
>>>
>>> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
>>> can replace `sys.std*` streams without worrying about preserving
>>> `sys.__std*__` streams:
>>>
>>> #!/usr/bin/env python
>>> import io
>>> import re
>>> import sys
>>> from pathlib import Path
>>>
>>> def transform_filename(filename: str) -> str: # example
>>> """Normalize whitespace in basename."""
>>> path = Path(filename)
>>> new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
>>> path.replace(new_path) # rename on disk if necessary
>>> return str(new_path)
>>>
>>> def SystemTextStream(bytes_stream, **kwargs):
>>> encoding = sys.getfilesystemencoding()
>>> return io.TextIOWrapper(bytes_stream,
>>> encoding=encoding,
>>> errors='surrogateescape' if encoding != 'mbcs' else 'strict',
>>> **kwargs)
>>>
>>> nl = '\0' if '-0' in sys.argv else None
>>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
>>> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>>> print(transform_filename(line.rstrip(nl)), end=nl)
>>
>> Nice, much more complete example than mine. I just tried to handle as
>> many edge cases as the original he asked about, but you handle
>> everything.
>>
>>> io.TextIOWrapper() plays the role of open() in this case. The code
>>> assumes that `newline` parameter accepts '\0'.
>>>
>>> The example function handles Unicode whitespace to demonstrate why
>>> opaque bytes-based cookies can't be used to represent filenames in this
>>> case even on POSIX, though which characters are recognized depends on
>>> sys.getfilesystemencoding().
>>>
>>> Note:
>>>
>>> - `end=nl` is necessary because `print()` prints '\n' by default -- it
>>> does not use `file.newline`
>>
>> Actually, yes it does. Or, rather, print pastes on a '\n', but
>> sys.stdout.write translates any '\n' characters to sys.stdout.writenl
>> (a private variable that's initialized from the newline argument at
>> construction time if it's anything other than None or '').
>
> You are right. I've stopped reading the source for print() function at
> `PyFile_WriteString("\n", file);` line assuming that "\n" is not
> translated if newline="\0". But the current behaviour if "\0" were in
> "the other legal values" category (like "\r") would be to translate "\n"
> [1]:
>
> When writing output to the stream, if newline is None, any '\n'
> characters written are translated to the system default line
> separator, os.linesep. If newline is '' or '\n', no translation takes
> place. If newline is any of the other legal values, any '\n'
> characters written are translated to the given string.
>
> [1] https://docs.python.org/3/library/io.html#io.TextIOWrapper
>
> Example:
>
> $ ./python -c 'import sys, io;
> sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n");
> sys.stdout.write("\n\r\r\n")'| xxd
> 0000000: 0d0a 0d0d 0d0a ......
>
> "\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r").
>
> In order to newline="\0" case to work, it should behave similar to
> newline='' or newline='\n' case instead i.e., no translation should take
> place, to avoid corrupting embed "\n\r" characters.

The draft PEP discusses this. I think it would be more consistent to translate for \0, just like \r and \r\n.

For your script, there is no reason to pass newline=nl to the stdout replacement. The only effect that has on output is \n replacement, which you don't want. And if we removed that effect from the proposal, it would have no effect at all on output, so why pass it?

Do you have a use case where you need to pass a non-standard newline to a text file/stream, but don't want newline replacement? Or is it just a matter of avoiding confusion if people accidentally pass it for stdout when they didn't want it?

> My original code
> works as is in this case i.e., *end=nl is still necessary*.

>> But of course that's the newline argument to sys.stdout, and you only
>> changed sys.stdin, so you do need end=nl anyway. (And you wouldn't
>> want output translation here anyway, because that could also translate
>> \n' characters in the middle of a line, re-creating the same problem
>> we're trying to avoid...)
>>
>> But it uses sys.stdout.newline, not sys.stdin.newline.
>
> The code affects *both* sys.stdout/sys.stdin. Look [2]:

I didn't notice that you passed it for stdout as well--as I explained above, you don't need it, and shouldn't do it.

As a side note, I think it might have been a better design to have separate arguments for input newline, output newline, and universal newlines mode, instead of cramming them all into one argument; for some simple cases the current design makes things a little less verbose, but it gets in the way for more complex cases, even today with \r or \r\n. However, I don't think that needs to be changed as part of this proposal.

It also might be nice to have a full set of PYTHONIOFOO env variables rather than just PYTHONIOENCODING, but again, I don't think that needs to be part of this proposal. And likewise for Nick Coghlan's rewrap method proposal on TextIOWrapper and maybe BufferedFoo.

>>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
>>> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>>> print(transform_filename(line.rstrip(nl)), end=nl)
>
> [2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html
>
>>> - SystemTextStream() handles undecodable in the current locale filenames
>>> i.e., non-ascii names are allowed even in C locale (LC_CTYPE=C)
>>> - undecodable filenames are not supported on Windows. It is not clear
>>> how to pass an undecodable filename via a pipe on Windows -- perhaps
>>> `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It
>>> assumes that the short path exists and it is always encodable using
>>> mbcs. If we can control all parts of the pipeline *and* Windows API
>>> uses proper utf-16 (not ucs-2) then utf-8 can be used to pass
>>> filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be
>>> tried e.g., https://github.com/Drekin/win-unicode-console
>>
>> First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on
>> top of it guarantee that you can never get such unencodable filenames
>> (sometimes by just pretending the file doesn't exist, but if possible
>> by having the filesystem map it to something valid, unique, and
>> persistent for this session, usually the short name)?
>> Second, trying to solve this implies that you have some other native
>> (as opposed to Cygwin) tool that passes or accepts such filenames over
>> simple pipes (as opposed to PowerShell typed ones). Are there any?
>> What does, say, mingw's find do with invalid filenames if it finds
>> them?
>
> In short: I don't know :)
>
> To be clear, I'm talking about native Windows applications (not
> find/xargs on Cygwin). The goal is to process robustly *arbitrary*
> filenames on Windows via a pipe (SystemTextStream()) or network (bytes
> interface).

Yes, I assumed that, I just wanted to make that clear.

My point is that if there isn't already an ecosystem of tools that do so on Windows, or a recommended answer from Microsoft, we don't need to fit into existing practices here. (Actually, there _is_ a recommended answer from Microsoft, but it's "don't send encoded filenames over a binary stream, send them as an array of UTF-16 strings over PowerShell cmdlet typed pipes"--and, more generally, "don't use any ANSI interfaces except for backward compatibility reasons".)

At any rate, if the filenames-over-pipes encoding problem exists on Windows, and if it's solvable, it's still outside the scope of this proposal, unless you think the documentation needs a completely worked example that shows how to interact with some Windows tool, alongside one for interacting with find -print0 on Unix. (And I don't think it does. If we want a Windows example, resource compiler string input files, which are \0-terminated UTF-16, probably serve better.)

> I know that (A)nsi API (and therefore "POSIX-ish layer" that uses narrow
> strings such main(), fopen(), fstream is broken e.g., Thai filenames on
> Greek computer [3].

Yes, and broken in a way that people cannot easily work around except by using the UTF-16 interfaces. That's been Microsoft's recommended answer to the problem since NT 3.5, Win 95, and MSVCRT 3: if you want to handle all filenames, use _wmain, _wfopen, etc.--or, better, use CreateFileW instead of fopen. They never really addressed the issue of passing filenames between command-line tools at all, until PowerShell, where you pass them as a list of UTF-16 strings rather than a stream of newline-separated encoded bytes. (As a side note, I have no idea how well Python works for writing PowerShell cmdlets, but I don't think that's relevant to the current proposal.)

> Unicode (W) API should enforce utf-16 in principle
> since Windows 2000 [4]. But I expect ucs-2 shows its ugly head in many
> places due to bad programming practices (based on the common wrong
> assumption that Unicode == UTF-16 == UCS-2) and/or bugs that are not
> fixed due to MS' backwards compatibility policies in the past [5].

Yes, I've run into such bugs in the past. It's even more fun when you're dealing with unterminated strings and separate length interfaces. Fortunately, as far as I know, no such bugs affect reading and writing binary files, pipes, and sockets, so they don't affect us here.
Akira Li
2014-07-24 09:07:59 UTC
Permalink
Andrew Barnert <abarnert-/***@public.gmane.org> writes:

> On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
>> Andrew Barnert <abarnert-/***@public.gmane.org> writes:
>>> On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
>>>> Paul Moore <p.f.moore-***@public.gmane.org> writes:
>>>>> On 21 July 2014 01:41, Andrew Barnert
>>>>> <abarnert-/***@public.gmane.org> wrote:
>>>>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>>>>>> not a good thing to do, apologies); you can find it at
>>>>>> http://bugs.python.org/file36008/pep-newline.txt
>>>>>
>>>>> As a suggestion, how about adding an example of a simple nul-separated
>>>>> filename filter - the sort of thing that could go in a find -print0 |
>>>>> xxx | xargs -0 pipeline? If I understand it, that's one of the key
>>>>> motivating examples for this change, so seeing how it's done would be
>>>>> a great help.
>>>>
>>>> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
>>>> can replace `sys.std*` streams without worrying about preserving
>>>> `sys.__std*__` streams:
>>>>
>>>> #!/usr/bin/env python
>>>> import io
>>>> import re
>>>> import sys
>>>> from pathlib import Path
>>>>
>>>> def transform_filename(filename: str) -> str: # example
>>>> """Normalize whitespace in basename."""
>>>> path = Path(filename)
>>>> new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
>>>> path.replace(new_path) # rename on disk if necessary
>>>> return str(new_path)
>>>>
>>>> def SystemTextStream(bytes_stream, **kwargs):
>>>> encoding = sys.getfilesystemencoding()
>>>> return io.TextIOWrapper(bytes_stream,
>>>> encoding=encoding,
>>>> errors='surrogateescape' if encoding != 'mbcs' else 'strict',
>>>> **kwargs)
>>>>
>>>> nl = '\0' if '-0' in sys.argv else None
>>>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
>>>> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>>>> print(transform_filename(line.rstrip(nl)), end=nl)
>>>
>>> Nice, much more complete example than mine. I just tried to handle as
>>> many edge cases as the original he asked about, but you handle
>>> everything.
>>>>
>>>> io.TextIOWrapper() plays the role of open() in this case. The code
>>>> assumes that `newline` parameter accepts '\0'.
>>>>
>>>> The example function handles Unicode whitespace to demonstrate why
>>>> opaque bytes-based cookies can't be used to represent filenames in this
>>>> case even on POSIX, though which characters are recognized depends on
>>>> sys.getfilesystemencoding().
>>>>
>>>> Note:
>>>>
>>>> - `end=nl` is necessary because `print()` prints '\n' by default -- it
>>>> does not use `file.newline`
>>>
>>> Actually, yes it does. Or, rather, print pastes on a '\n', but
>>> sys.stdout.write translates any '\n' characters to sys.stdout.writenl
>>> (a private variable that's initialized from the newline argument at
>>> construction time if it's anything other than None or '').
>>
>> You are right. I've stopped reading the source for print() function at
>> `PyFile_WriteString("\n", file);` line assuming that "\n" is not
>> translated if newline="\0". But the current behaviour if "\0" were in
>> "the other legal values" category (like "\r") would be to translate "\n"
>> [1]:
>>
>> When writing output to the stream, if newline is None, any '\n'
>> characters written are translated to the system default line
>> separator, os.linesep. If newline is '' or '\n', no translation takes
>> place. If newline is any of the other legal values, any '\n'
>> characters written are translated to the given string.
>>
>> [1] https://docs.python.org/3/library/io.html#io.TextIOWrapper
>>
>> Example:
>>
>> $ ./python -c 'import sys, io;
>> sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n");
>> sys.stdout.write("\n\r\r\n")'| xxd
>> 0000000: 0d0a 0d0d 0d0a ......
>>
>> "\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r").
>>
>> In order to newline="\0" case to work, it should behave similar to
>> newline='' or newline='\n' case instead i.e., no translation should take
>> place, to avoid corrupting embed "\n\r" characters.
>
> The draft PEP discusses this. I think it would be more consistent to
> translate for \0, just like \r and \r\n.

I read the [draft]. No translation is the better choice here. Otherwise
(at the very least) it breaks the `find -print0` use case.

[draft] http://bugs.python.org/file36008/pep-newline.txt

Simple things should be simple (i.e., no translation except in special cases):

- binary file -- a stream of bytes: no structure, no translation on
read/write
- text file -- a stream of Unicode codepoints
- file with fixed-length chunks:

      for chunk in iter(partial(file.read, chunksize), EOF):
          pass

- file with variable-length records (aka lines) which end with a
  separator or EOF: no translation, no escaping (no embedded separators):

      for line in file:
          pass

  or

      line = file.readline()  # next(file)

newline in {None, '', '\r', '\r\n'} is a (very important) special case
that represents the complicated legacy behavior for text files.

newline='\0' (like '\n') should be a *much simpler* case: no
translation on read/write, no escaping (no embedded '\0'; each '\0' in
the stream is a separator).

newline='\0' is simple to explain: readline/next return everything until
the next '\0' (including it) or EOF. It is simple to implement - no
translation is required.
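The proposed semantics can be sketched in pure Python (an illustration only, not the actual io implementation; the 4096-character chunk size is arbitrary):

```python
import io

def iterlines(stream, sep='\0'):
    """Yield records terminated by `sep`; the separator is kept,
    mirroring how readline() keeps a trailing '\\n'."""
    buf = ''
    for chunk in iter(lambda: stream.read(4096), ''):
        parts = (buf + chunk).split(sep)
        for part in parts[:-1]:
            yield part + sep
        buf = parts[-1]
    if buf:                      # last record may end at EOF without sep
        yield buf

records = list(iterlines(io.StringIO('a\0b\nc\0d')))
assert records == ['a\0', 'b\nc\0', 'd']   # embedded '\n' left untranslated
```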

A readline(keep_end=True) keyword-only parameter and/or a chomp()-like
method could be added to simplify removing a trailing newline.
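A chomp()-style helper of the kind suggested could look like this (hypothetical name and signature, borrowed from Perl's chomp):

```python
def chomp(line, sep='\n'):
    # Remove exactly one trailing separator, if present.
    return line[:-len(sep)] if sep and line.endswith(sep) else line

assert chomp('name\0', sep='\0') == 'name'
assert chomp('name', sep='\0') == 'name'    # nothing to remove
assert chomp('a\0\0', sep='\0') == 'a\0'    # only one separator stripped
```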

newline in {"\N{NEL}", "\n\n", "\r\r", "\n\r"} behaves like newline="\n",
i.e., no translation. New *docs for writing text files*:

When writing output to the stream:

- if newline is None, any '\n' characters written are translated to
the system default line separator, os.linesep
- if newline is '\r' or '\r\n', any '\n' characters written are
translated to the given string
- no translation takes place for any other newline value.

The docs for binary files are simpler:

No translation takes place for any newline value. The line terminator
is newline parameter (default is b'\n').

The new *docs for reading text files*:

When reading input from the stream:

- if newline is None, universal newlines mode is enabled: lines in the
input can end in '\n', '\r', or '\r\n', and these are translated
into '\n' before being returned to the caller
- if newline is '', universal newlines mode is enabled, but line
endings are returned to the caller untranslated
- if newline is any other value, input lines are only terminated by
the given string, and the line ending is returned to the caller
untranslated.

The new behavior, while more powerful, is no more complex than the old one:
https://docs.python.org/3.4/library/io.html#io.TextIOWrapper

Backwards compatibility is preserved except that newline parameter
accepts more values.

> For the your script, there is no reason to pass newline=nl to the
> stdout replacement. The only effect that has on output is \n
> replacement, which you don't want. And if we removed that effect from
> the proposal, it would have no effect at all on output, so why pass
> it?

Keep in mind, I expect that newline='\0' does *not* translate '\n' to
'\0'. If you remove newline=nl then embedded \n characters might be
corrupted, i.e., it breaks the `find -print0` use case. Both newline=nl
for stdout and end=nl are required here. Though (optionally) it would be
nice to change `print()` so that it would use `end=file.newline or '\n'`
by default instead.
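That suggested print() default could be sketched as a wrapper (hypothetical: real file objects do not currently expose the `newline` they were constructed with, so getattr() falls back to '\n'):

```python
import io
import sys

def print_sep(*args, file=None, **kwargs):
    # Hypothetical wrapper: default `end` to the stream's record
    # separator, per the `end=file.newline or '\n'` suggestion.
    if file is None:
        file = sys.stdout
    kwargs.setdefault('end', getattr(file, 'newline', None) or '\n')
    print(*args, file=file, **kwargs)

class NulStream(io.StringIO):
    newline = '\0'          # stand-in for the proposed attribute

buf = NulStream()
print_sep('filename', file=buf)
assert buf.getvalue() == 'filename\0'
```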

There is also line_buffering parameter. From the docs:

If line_buffering is True, flush() is implied when a call to write
contains a newline character.

i.e., you might also need newline=nl to flush() the stream in time.

For example, the absence of the flush() call on a newline may lead to a
deadlock if the subprocess module is used to implement pexpect-like
behavior. There are corresponding Python issues:

- text mode http://bugs.python.org/issue21332 : add line_buffering=True
if bufsize=1, to avoid a deadlock (regression from Python 2 behavior)

- binary mode http://bugs.python.org/issue21471 : implement
line_buffering=True behavior for binary files when bufsize=1

> Do you have a use case where you need to pass a non-standard newline
> to a text file/stream, but don't want newline replacement?

`find -print0` use case that my code implements above.

> Or is it just a matter of avoiding confusion if people accidentally
> pass it for stdout when they didn't want it?

See the explanation above that starts with "Simple things should be simple."

>> My original code
>> works as is in this case i.e., *end=nl is still necessary*.
>
>>> But of course that's the newline argument to sys.stdout, and you only
>>> changed sys.stdin, so you do need end=nl anyway. (And you wouldn't
>>> want output translation here anyway, because that could also translate
>>> \n' characters in the middle of a line, re-creating the same problem
>>> we're trying to avoid...)
>>>
>>> But it uses sys.stdout.newline, not sys.stdin.newline.
>>
>> The code affects *both* sys.stdout/sys.stdin. Look [2]:
>
> I didn't notice that you passed it for stdout as well--as I explained
> above, you don't need it, and shouldn't do it.

Both newline=nl and end=nl are needed because I assume that there is no
newline translation in the newline='\0' case. See the explanation
above. Here's the same code for context:

sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
for line in SystemTextStream(sys.stdin.detach(), newline=nl):
    print(transform_filename(line.rstrip(nl)), end=nl)

[2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html

> As a side note, I think it might have been a better design to have
> separate arguments for input newline, output newline, and universal
> newlines mode, instead of cramming them all into one argument; for
> some simple cases the current design makes things a little less
> verbose, but it gets in the way for more complex cases, even today
> with \r or \r\n. However, I don't think that needs to be changed as
> part of this proposal.

Usually different objects are used for input and output, i.e., a single
newline parameter already allows input newlines to differ from output
newlines.

The newline behavior for reading and writing is different but it is
closely related. Having two parameters wouldn't make the documentation
simpler.

Separate parameters might be useful if the same file object is used for
reading and writing *and* input/output newlines are different from each
other. But I don't think it is worth it to complicate the common case
(separate objects).


--
Akira
Andrew Barnert
2014-07-25 18:29:11 UTC
Permalink
On Thursday, July 24, 2014 2:08 AM, Akira Li <4kir4.1i-***@public.gmane.org> wrote:

> Andrew Barnert <abarnert-/***@public.gmane.org> writes:
>
>> On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i-***@public.gmane.org> wrote:
>>> In order to newline="\0" case to work, it should behave similar to
>>> newline='' or newline='\n' case instead i.e., no translation should take
>>> place, to avoid corrupting embed "\n\r" characters.
>>
>> The draft PEP discusses this. I think it would be more consistent to
>> translate for \0, just like \r and \r\n.
>
> I read the [draft]. No translation is a better choice here. Otherwise
> (at the very least) it breaks `find -print0` use case.

No it doesn't. The only reason it breaks your code is that you add newline='\0' to your stdout wrapper as well as your stdin wrapper. If you just passed '', it would not do anything. And this is exactly parallel with the existing case with, e.g., trying to pass through a classic-Mac file full of '\r'-delimited strings that might contain embedded '\n' characters that you don't want to translate.

As I've said before, I don't really like the design for '\r' and '\r\n', or the fact that three separate notions (universal-newlines flag, line ending for readline, and output translation for write) are all conflated into one idea and crammed into one parameter, but I think it's probably too late and too radical to change that.

(It's less of an issue for binary files, because binary files can't take a newline parameter at all today, and because "no output translation" has been part of the definition of what "binary file" means all the way back to Python 1.x.)


> Backwards compatibility is preserved except that newline parameter
> accepts more values.

The same is true with the draft proposal. You've basically copied the exact same thing, except for what happens on output for newlines other than None, '', '\n', '\r', and '\r\n' in text files. Since that case cannot arise today, there are no backward compatibility issues. Your version is only a small change to the documentation and a small change to the code, but my version is an even smaller change to the documentation and no change to the code, so you can't argue this from a conservative point of view.

>
>> For the your script, there is no reason to pass newline=nl to the
>> stdout replacement. The only effect that has on output is \n
>> replacement, which you don't want. And if we removed that effect from
>> the proposal, it would have no effect at all on output, so why pass
>> it?
>
> Keep in mind, I expect that newline='\0' does *not* translate '\n' to
> '\0'. If you remove newline=nl then embed \n might be corrupted

No, it's only corrupted if you _pass_ newline=nl. If you instead passed, e.g., newline='', nothing could possibly be corrupted.

> i.e., it breaks `find -print0` use-case. Both newline=nl for stdout and
> end=nl are required here. Though (optionally) it would be nice to change
> `print()` so that it would use `end=file.newline or '\n'` by default
> instead.

That might be a nice change; I'll mention it in the next draft. But I think it's better to keep the changes as small and conservative as possible, so unless there's an upswell of support for it, I think anything that isn't actually necessary to solving the problem should be left out.

> There is also line_buffering parameter. From the docs:
>
>   If line_buffering is True, flush() is implied when a call to write
>   contains a newline character.

The way this is actually defined seems broken to me; IIRC (I'll check the code later) it flushes on any '\r', and on any translated '\n'. So, it's doing the wrong thing with '\r' in most modes, and with '\n' in '' mode on non-Unix systems. So my thought was, just leave it broken.

But now that I think about it, the existing code can only flush excessively, never insufficiently, and that's probably a property worth preserving. So maybe there _is_ a reason to pass newline for output without translation after all. In other words, the parameter may actually conflate _four_ things, not just three...

I'll need to think this through (and reread the code) this weekend; thanks for bringing it up.

>> Do you have a use case where you need to pass a non-standard newline
>> to a text file/stream, but don't want newline replacement?
>
> `find -print0` use case that my code implements above.
>
>> Or is it just a matter of avoiding confusion if people accidentally
>> pass it for stdout when they didn't want it?
>
> See the explanation above that starts with "Simple things should be
> simple."

I still don't understand your point here, and just repeating it isn't helping. You're making simple things _less_ simple than they are in the draft, requiring slightly more change to the documentation and to the code and slightly more for people to understand just to allow them to pass an unnecessary parameter. That doesn't sound like an argument from simplicity to me.

But line_buffering definitely might be a good argument, in which case it doesn't matter how good this one is.
Nick Coghlan
2014-07-25 23:28:23 UTC
Permalink
On 26 Jul 2014 04:33, "Andrew Barnert" <abarnert-/***@public.gmane.org>
wrote:
> As I've said before, I don't really like the design for '\r' and '\r\n',
or the fact that three separate notions (universal-newlines flag, line
ending for readline, and output translation for write) are all conflated
into one idea and crammed into one parameter, but I think it's probably too
late and too radical to change that.

It's potentially still worth spelling out that idea as a Rejected
Alternative in the PEP. A draft design that separates them may help clarify
the concepts being conflated more effectively than simply describing them,
even if your own pragmatic assessment is "too much pain for not enough
gain".

Cheers,
Nick.
Akira Li
2014-07-26 02:24:16 UTC
Permalink
Nick Coghlan <ncoghlan-***@public.gmane.org> writes:

> On 26 Jul 2014 04:33, "Andrew Barnert"
> <abarnert-/***@public.gmane.org>
> wrote:
>> As I've said before, I don't really like the design for '\r' and '\r\n',
> or the fact that three separate notions (universal-newlines flag, line
> ending for readline, and output translation for write) are all conflated
> into one idea and crammed into one parameter, but I think it's probably too
> late and too radical to change that.
>
> It's potentially still worth spelling out that idea as a Rejected
> Alternative in the PEP. A draft design that separates them may help clarify
> the concepts being conflated more effectively than simply describing them,
> even if your own pragmatic assessment is "too much pain for not enough
> gain".
>

It can't be in the rejected ideas because it is the current behavior for
io.TextIOWrapper(newline=..) and it will never change (in Python 3) due
to backward compatibility.

As I understand it, Andrew doesn't like that the *newline* parameter does too
much:

- *newline* parameter turns on/off universal newline mode
- it may specify the line separator e.g., newline='\r'
- it specifies whether newline translation happens e.g., newline=''
turns it off
- together with *line_buffering*, it may enable flush() if newline is
written
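For reference, the first three of those behaviors can be seen directly with today's io.TextIOWrapper (a small demonstration using in-memory streams):

```python
import io

# newline='' turns translation off, but readline still splits on any of \n, \r, \r\n
no_translate = list(io.TextIOWrapper(io.BytesIO(b'a\r\nb\rc\n'),
                                     encoding='ascii', newline=''))
print(no_translate)    # ['a\r\n', 'b\r', 'c\n']

# newline='\r' makes '\r' the only input line separator (still no translation)
cr_lines = list(io.TextIOWrapper(io.BytesIO(b'a\r\nb\rc\n'),
                                 encoding='ascii', newline='\r'))
print(cr_lines)        # ['a\r', '\nb\r', 'c\n']

# ...and on output, any written '\n' is translated to the given string
out = io.BytesIO()
w = io.TextIOWrapper(out, encoding='ascii', newline='\r')
w.write('x\ny\n')
w.flush()
print(out.getvalue())  # b'x\ry\r'
```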


It is unrelated to my proposal [1] that shouldn't change the old
behavior if newline in {None, '', '\n', '\r', '\r\n'}.

[1] http://bugs.python.org/issue1152248#msg224016


--
Akira
Andrew Barnert
2014-07-26 04:03:30 UTC
Permalink
On Jul 25, 2014, at 19:24, Akira Li <4kir4.1i-***@public.gmane.org> wrote:

> Nick Coghlan <ncoghlan-***@public.gmane.org> writes:
>
>> On 26 Jul 2014 04:33, "Andrew Barnert"
>> <abarnert-/***@public.gmane.org>
>> wrote:
>>> As I've said before, I don't really like the design for '\r' and '\r\n',
>> or the fact that three separate notions (universal-newlines flag, line
>> ending for readline, and output translation for write) are all conflated
>> into one idea and crammed into one parameter, but I think it's probably too
>> late and too radical to change that.
>>
>> It's potentially still worth spelling out that idea as a Rejected
>> Alternative in the PEP. A draft design that separates them may help clarify
>> the concepts being conflated more effectively than simply describing them,
>> even if your own pragmatic assessment is "too much pain for not enough
>> gain".
>
> It can't be in the rejected ideas because it is the current behavior for
> io.TextIOWrapper(newline=..) and it will never change (in Python 3) due
> to backward compatibility.

That's exactly why changing it would be a "rejected idea". It certainly doesn't hurt to document the fact that we thought about it and decided not to change it for backward compatibility reasons.

> As I understand it, Andrew doesn't like that the *newline* parameter does too
> much:
>
> - *newline* parameter turns on/off universal newline mode
> - it may specify the line separator e.g., newline='\r'
> - it specifies whether newline translation happens e.g., newline=''
> turns it off
> - together with *line_buffering*, it may enable flush() if newline is
> written

Exactly. And the fourth one only indirectly; "newline" flushing doesn't exactly mean _either_ of "\n" or the newline argument. And the related-but-definitely-not-the-same newlines attribute makes it even more confusing. (I've found bug reports with both Guido and Nick confused into thinking that newline was available as an attribute after construction; what hope do the rest of us have?)
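The mismatch is easy to demonstrate: after construction, only the read-side newlines attribute exists, and it reports the line endings actually seen so far, not the constructor argument:

```python
import io

# newline=None (the default): universal newlines with translation
f = io.TextIOWrapper(io.BytesIO(b'a\nb\r\nc'), encoding='ascii')
f.read()
print(f.newlines)             # the kinds of endings seen, e.g. ('\r\n', '\n')
print(hasattr(f, 'newline'))  # False -- the constructor argument isn't exposed
```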

But the reality is, it rarely affects real-life programs, so it's definitely not worth breaking compatibility over. And it's still a whole lot cleaner than the 2.x design despite having a lot more details to deal with.
Akira Li
2014-07-26 02:13:24 UTC
Permalink
I've added a patch that demonstrates "no translation" for alternative
newlines behavior http://bugs.python.org/issue1152248#msg224016

Andrew Barnert
<***@yahoo.com.dmarc.invalid> writes:

> On Thursday, July 24, 2014 2:08 AM, Akira Li
> <***@gmail.com> wrote:
>
>> > Andrew Barnert <***@yahoo.com> writes:
>>
>>> On Jul 23, 2014, at 5:13, Akira Li
>>> <***@gmail.com> wrote:
>>>> In order to newline="\0" case to work, it should behave
>>>> similar to newline='' or newline='\n' case instead, i.e., no
>>>> translation should take place, to avoid corrupting embedded
>>>> "\n\r" characters.
>>>
>>> The draft PEP discusses this. I think it would be more consistent to
>>> translate for \0, just like \r and \r\n.
>>
>> I read the [draft]. No translation is a better choice here. Otherwise
>> (at the very least) it breaks `find -print0` use case.
>
> No it doesn't. The only reason it breaks your code is that you add
> newline='\0' to your stdout wrapper as well as your stdin wrapper. If
> you just passed '', it would not do anything. And this is exactly
> parallel with the existing case with, e.g., trying to pass through a
> classic-Mac file full of '\r'-delimited strings that might contain
> embedded '\n' characters that you don't want to translate.

I won't repeat it several times, but as you've already found out,
newline='\0' for stdout can (at the very least) be useful for
line_buffering=True behavior.

...
>> There is also line_buffering parameter. From the docs:
>>
>>   If line_buffering is True, flush() is implied when a call to write
>>   contains a newline character.
>
> The way this is actually defined seems broken to me; IIRC (I'll check
> the code later) it flushes on any '\r', and on any translated
> '\n'. So, it's doing the wrong thing with '\r' in most modes, and with
> '\n' in '' mode on non-Unix systems. So my thought was, just leave it
> broken.

Yes. I've found at least one issue http://bugs.python.org/issue22069

> But now that I think about it, the existing code can only flush
> excessively, never insufficiently, and that's probably a property
> worth preserving. So maybe there _is_ a reason to pass newline for
> output without translation after all. In other words, the parameter
> may actually conflate _four_ things, not just three...
>
> I'll need to think this through (and reread the code) this weekend;
> thanks for bringing it up.


--
Akira
Andrew Barnert
2014-07-26 04:09:41 UTC
Permalink
On Jul 25, 2014, at 19:13, Akira Li <4kir4.1i-***@public.gmane.org> wrote:

> I've added a patch that demonstrates "no translation" for alternative
> newlines behavior http://bugs.python.org/issue1152248#msg224016

Having taken a better look at the line buffering code, I now agree with you that this is necessary; otherwise we'd have to make a much bigger change to the implementation (which I don't think we want).

When I update the draft PEP I'll change that and add a rationale (this also makes the rationale for "no translation for binary files" and for "only readnl is exposed, not writenl" a lot simpler).

I'll also change it in my C patch (which I hope to be able to clean up and upload this weekend).

> Andrew Barnert
> <abarnert-/***@public.gmane.org> writes:
>
>> On Thursday, July 24, 2014 2:08 AM, Akira Li
>> <4kir4.1i-***@public.gmane.org> wrote:
>>
>>>> Andrew Barnert <abarnert-/***@public.gmane.org> writes:
>>>
>>>> On Jul 23, 2014, at 5:13, Akira Li
>>>> <4kir4.1i-***@public.gmane.org> wrote:
>>>>> In order to newline="\0" case to work, it should behave
>>>>> similar to newline='' or newline='\n' case instead, i.e., no
>>>>> translation should take place, to avoid corrupting embedded
>>>>> "\n\r" characters.
>>>>
>>>> The draft PEP discusses this. I think it would be more consistent to
>>>> translate for \0, just like \r and \r\n.
>>>
>>> I read the [draft]. No translation is a better choice here. Otherwise
>>> (at the very least) it breaks `find -print0` use case.
>>
>> No it doesn't. The only reason it breaks your code is that you add
>> newline='\0' to your stdout wrapper as well as your stdin wrapper. If
>> you just passed '', it would not do anything. And this is exactly
>> parallel with the existing case with, e.g., trying to pass through a
>> classic-Mac file full of '\r'-delimited strings that might contain
>> embedded '\n' characters that you don't want to translate.
>
> I won't repeat it several times but as you've already found out newline='\0'
> for stdout (at the very least) can be useful for line_buffering=True
> behavior.
>
> ...
>>> There is also line_buffering parameter. From the docs:
>>>
>>> If line_buffering is True, flush() is implied when a call to write
>>> contains a newline character.
>>
>> The way this is actually defined seems broken to me; IIRC (I'll check
>> the code later) it flushes on any '\r', and on any translated
>> '\n'. So, it's doing the wrong thing with '\r' in most modes, and with
>> '\n' in '' mode on non-Unix systems. So my thought was, just leave it
>> broken.
>
> Yes. I've found at least one issue http://bugs.python.org/issue22069
>
>> But now that I think about it, the existing code can only flush
>> excessively, never insufficiently, and that's probably a property
>> worth preserving. So maybe there _is_ a reason to pass newline for
>> output without translation after all. In other words, the parameter
>> may actually conflate _four_ things, not just three...
>>
>> I'll need to think this through (and reread the code) this weekend;
>> thanks for bringing it up.
>
>
> --
> Akira
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas-+ZN9ApsXKcEdnm+***@public.gmane.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
Andrew Barnert
2014-07-23 04:24:12 UTC
Permalink

On Jul 21, 2014, at 0:04, Paul Moore <p.f.moore-***@public.gmane.org> wrote:

> On 21 July 2014 01:41, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
>> OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt
>
> As a suggestion, how about adding an example of a simple nul-separated
> filename filter - the sort of thing that could go in a find -print0 |
> xxx | xargs -0 pipeline? If I understand it, that's one of the key
> motivating examples for this change, so seeing how it's done would be
> a great help.
>
> Here's the sort of thing I mean, written for newline-separated files:
>
> import sys
>
> def process(filename):
>     """Trivial example"""
>     return filename.lower()
>
> if __name__ == '__main__':
>
>     for filename in sys.stdin:
>         filename = process(filename)
>         print(filename)

for filename in io.TextIOWrapper(sys.stdin.buffer, encoding=sys.stdin.encoding, errors=sys.stdin.errors, newline='\0'):
    filename = process(filename.rstrip('\0'))
    print(filename)

I assume you wanted an rstrip('\n') in the original, so I did the equivalent here.

If you want to pipe the result to another -0 tool, you also need to add end='\0' to the print, of course.

If we had Nick Coghlan's separate idea of adding rewrap methods to the stream classes (not part of this proposal, but I would be happy to have it), it would be even simpler:

for filename in sys.stdin.rewrap(newline='\0'):
    filename = process(filename.rstrip('\0'))
    print(filename)

Anyway, this isn't perfect if, e.g., you might have illegal-as-UTF8 Latin-1 filenames hiding in your UTF8 filesystem, but neither is your code; in fact, this does exactly the same thing, except that it takes \0 terminators (so it can handle filenames with embedded newlines, or pipelines that use -print0 just because they can't be sure which tools in the chain can handle spaces).

It's obviously a little more complicated than your code, but that's to be expected; it's a lot simpler than anything we can write today. (And it runs at the same speed as your code instead of 2x slower or worse.)

> This is also an example of why I'm struggling to understand how an
> open() parameter "solves all the cases". There's no explicit open()
> call here, so how do you specify the record separator? Seeing how you
> propose this would work would be really helpful to me.

The open function is just a shortcut to constructing a stack of io classes; you can always construct them manually. It would be nice if some cases of that were made a little easier (again, see Nick's proposal above), but it's easy enough to live with.
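For readers following along, here is roughly the stack that open(path, 'r', encoding='utf-8') builds, assembled by hand (the temp file is just scaffolding to keep the sketch runnable):

```python
import io
import os
import tempfile

# create a small file to read back
fd, path = tempfile.mkstemp()
os.write(fd, 'hello\nworld\n'.encode('utf-8'))
os.close(fd)

# roughly the stack that open(path, 'r', encoding='utf-8') assembles:
raw = io.FileIO(path, 'r')                           # OS-level unbuffered bytes
buffered = io.BufferedReader(raw)                    # adds buffering and peek()
text = io.TextIOWrapper(buffered, encoding='utf-8')  # adds decoding + newline handling

lines = list(text)
text.close()
os.remove(path)
print(lines)  # ['hello\n', 'world\n']
```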
Paul Moore
2014-07-23 08:14:31 UTC
Permalink
On 23 July 2014 05:24, Andrew Barnert <abarnert-/***@public.gmane.org> wrote:
>> This is also an example of why I'm struggling to understand how an
>> open() parameter "solves all the cases". There's no explicit open()
>> call here, so how do you specify the record separator? Seeing how you
>> propose this would work would be really helpful to me.
>
> The open function is just a shortcut to constructing a stack of io classes;

Ah, yes, I get what you're saying now. I was reading your proposal too
literally as being about "open", and forgetting you can use the
underlying classes to rewrap existing streams.

Thanks for your patience.
Paul
Greg Ewing
2014-07-20 04:16:54 UTC
Permalink
Nick Coghlan wrote:
> having a newline in a filename is sufficiently weird that I
> find it hard to imagine a scenario where "fix the filenames" isn't a
> better answer.

In Classic MacOS, the way you gave a folder an icon
was to put it in a hidden file called "Icon\r".

--
Greg
David Mertz
2014-07-20 05:58:53 UTC
Permalink
The pattern I use, by far, most often with the -0 option is:

find $path -print0 | xargs -0 some_command

Embedding a '\n' in a filename might be weird, but having whitespace in
general (i.e. spaces) really isn't uncommon. However, in this case it
doesn't really seem to matter if some_command is some_command.py. But I
still think the null byte special delimiter is plausible for similar
pipelines.


On Sat, Jul 19, 2014 at 6:40 PM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:

> On 20 July 2014 11:31, Chris Angelico <rosuav-***@public.gmane.org> wrote:
> > On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan-***@public.gmane.org>
> wrote:
> >> At present, I'm genuinely unclear on
> >> why someone would ever want to pass the "-0" option to the other UNIX
> >> utilities, which then makes it very difficult to have a sensible
> >> discussion on how we should address that use case in Python.
> >
> > That one's easy. What happens if you use 'find' to list files, and
> > those files might have \n in their names? You need another sep.
>
> Yes, but having a newline in a filename is sufficiently weird that I
> find it hard to imagine a scenario where "fix the filenames" isn't a
> better answer. Hence why I think the PEP needs to explain why the UNIX
> utilities considered this use case sufficiently non-obscure to add
> explicit support for it, rather than just assuming that the
> obviousness of the use case can be taken for granted.
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan | ncoghlan-***@public.gmane.org | Brisbane, Australia
>



--
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons. Intellectual property is
to the 21st century what the slave trade was to the 16th.
Wichert Akkerman
2014-07-20 07:58:44 UTC
Permalink
> On 20 Jul 2014, at 03:40, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
>
> On 20 July 2014 11:31, Chris Angelico <rosuav-***@public.gmane.org> wrote:
>> On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan-***@public.gmane.org> wrote:
>>> At present, I'm genuinely unclear on
>>> why someone would ever want to pass the "-0" option to the other UNIX
>>> utilities, which then makes it very difficult to have a sensible
>>> discussion on how we should address that use case in Python.
>>
>> That one's easy. What happens if you use 'find' to list files, and
>> those files might have \n in their names? You need another sep.
>
> Yes, but having a newline in a filename is sufficiently weird that I
> find it hard to imagine a scenario where "fix the filenames" isn't a
> better answer.

Because you are likely to have no control at all over what people do with filenames. Since, on POSIX at least, filenames are allowed to contain all characters other than NUL and /, you must be able to deal with that. Similar to how you must also be able to deal with a mixture of filenames using different encodings or even pure binary names.

Wichert.
Stephen J. Turnbull
2014-07-19 09:06:59 UTC
Permalink
Chris Angelico writes:

> But they might well be the same thing. Look at all the Unix commands
> that usually separate output with \n, but can be told to separate with
> \0 instead. If you're reading from something like that, it should be
> just as easy to split on \n as on \0.

Nick's point is more general, I think, but as a special case consider
a "multiline" record. What's the right behavior on output from the
application if the newline convention of this particular multiline
differs from that of the rest of the output stream? IMO this goes
beyond "consenting adults" (YMMV, of course).

Steve
Wolfgang Maier
2014-07-20 10:41:29 UTC
Permalink
On 19.07.2014 09:10, Nick Coghlan wrote:
>
> I still favour my proposal there to add a separate "readrecords()"
> method, rather than reusing the line based iteration methods - lines
> and arbitrary records *aren't* the same thing, and I don't think we'd
> be doing anybody any favours by conflating them (whether we're
> confusing them at the method level or at the constructor argument
> level).
>

Thinking about possible use-cases for my own work made me realize one
thing:
At least for text files, the distinction between records and lines, in
practical terms, is that records may have *internal structure based on
newline characters*, while lines are just lines.

If a future readrecords() method would return the record as a StringIO
or BytesIO object, this would allow nested reading of files as lines
(with full newline processing) within records:

for record in infile.readrecords():
    for line in record:
        do_something()

For me, that sort of feature is a more common requirement than being
able to retrieve single lines terminated by something other than newline
characters.
Maybe though, it's possible to have both: a readrecords method like the
one above and an extended set of "newline" tokens that can be passed to
open (at least allowing "\0" seems to make sense).
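A minimal sketch of the idea as a free function (readrecords is hypothetical; nothing like it exists in the stdlib today):

```python
import io

def readrecords(f, sep='\0', bufsize=4096):
    """Yield each sep-delimited record of a text stream as a StringIO,
    so callers can iterate lines within each record (sketch only)."""
    buf = ''
    while True:
        chunk = f.read(bufsize)
        if not chunk:
            break
        buf += chunk
        while sep in buf:
            record, buf = buf.split(sep, 1)
            yield io.StringIO(record)
    if buf:
        yield io.StringIO(buf)  # trailing record without a final separator

# nested iteration, as in the example above:
src = io.StringIO('a\nb\0c\nd\0')
records = [list(record) for record in readrecords(src)]
print(records)  # [['a\n', 'b'], ['c\n', 'd']]
```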

Best,
Wolfgang
Andrew Barnert
2014-07-17 21:59:29 UTC
Permalink
On Thursday, July 17, 2014 1:48 PM, Guido van Rossum <***@python.org> wrote:


>I think it's fine to add something to stdlib that encapsulates your example. (TBD: where?)

Good question about the where.

The resplit function seems like it could be of more general use than just this case, but I'm not sure where it belongs. Maybe itertools?

The iter(lambda: f.read(bufsize), b'') part seems too trivial to put anywhere, even just as an example in the docs—but given that it probably looks like a magic incantation to anyone who's a Python novice (even if they're a C or JS or whatever expert), maybe it is worth putting somewhere. Maybe io.iterchunks(f, 4096)?

If so, the combination of the two into something like iterlines(f, b'\0') seems like it should go right alongside iterchunks.
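Spelled out, the proposed trio might look like this (iterchunks and iterlines are the hypothetical names from the discussion; resplit is from the original post):

```python
def iterchunks(f, size=4096):
    """Yield successive size-byte reads from f until EOF."""
    return iter(lambda: f.read(size), b'')

def resplit(chunks, sep):
    """Re-split an iterable of byte chunks on sep, buffering partial records."""
    buf = b''
    for chunk in chunks:
        parts = (buf + chunk).split(sep)
        yield from parts[:-1]
        buf = parts[-1]
    if buf:
        yield buf

def iterlines(f, sep=b'\n', size=4096):
    """The two-liner combination: iterate sep-separated records of f."""
    return resplit(iterchunks(f, size), sep)
```

e.g. iterlines(open(path, 'rb'), b'\0') iterates \0-separated records; note that, unlike readline, the separators are stripped from the yielded records.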


However…


>I don't think it is reasonable to add a new parameter to readline()

The problem is that my code has significant problems for many use cases, and I don't think they can be solved.

Calling readline (or iterating the file) uses the underlying buffer (and stream decoder, for text files), keeps the file pointer in the same place, etc. My code doesn't, and no external code can. So, besides being less efficient, it leaves the file pointer in the wrong place (imagine using it to parse an RFC822 header then read() the body), doesn't properly decode files where the separator can be ambiguous with other bytes (try separating on '\0' in a UTF-16 file), etc.

Maybe if we had more powerful adapters or wrappers so I could just say "here's a pre-existing buffer plus a text-file-like object, now wrap that up as a real TextIOBase for me" it would be possible to write something that worked from outside without these problems, but as things stand, I don't see an answer.

Maybe put resplit in the stdlib, then just give iterlines as a 2-liner example (in the itertools recipes, or the file-I/O section of the tutorial?) where all these problems can be raised and not answered?
Guido van Rossum
2014-07-17 22:37:58 UTC
Permalink
On Thu, Jul 17, 2014 at 2:59 PM, Andrew Barnert <
abarnert-/***@public.gmane.org> wrote:

> On Thursday, July 17, 2014 1:48 PM, Guido van Rossum <guido-+ZN9ApsXKcEdnm+***@public.gmane.org>
> wrote:
>
>
> >I think it's fine to add something to stdlib that encapsulates your
> example. (TBD: where?)
>
> Good question about the where.
>
> The resplit function seems like it could be of more general use than just
> this case, but I'm not sure where it belongs. Maybe itertools?
>
> The iter(lambda: f.read(bufsize), b'') part seems too trivial to put
> anywhere, even just as an example in the docs—but given that it probably
> looks like a magic incantation to anyone who's a Python novice (even if
> they're a C or JS or whatever expert), maybe it is worth putting somewhere.
> Maybe io.iterchunks(f, 4096)?
>
> If so, the combination of the two into something like iterlines(f, b'\0')
> seems like it should go right alongside iterchunks.
>
>
> However

>
>
> >I don't think it is reasonable to add a new parameter to readline()
>
> The problem is that my code has significant problems for many use cases,
> and I don't think they can be solved.
>
> Calling readline (or iterating the file) uses the underlying buffer (and
> stream decoder, for text files), keeps the file pointer in the same place,
> etc. My code doesn't, and no external code can. So, besides being less
> efficient, it leaves the file pointer in the wrong place (imagine using it
> to parse an RFC822 header then read() the body), doesn't properly decode
> files where the separator can be ambiguous with other bytes (try separating
> on '\0' in a UTF-16 file), etc.
>

You can implement a subclass of io.BufferedIOBase that wraps an instance of
io.RawIOBase (I think those are the right classes) where the wrapper adds a
readuntil(separator) method. Whichever thing then wants to read the rest of
the data should call read() on the wrapper object.
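A rough sketch of such a wrapper (SeparatorReader and readuntil are made-up names; BytesIO stands in for a raw file, and the code only worries about the common case of a separator short enough not to span a buffer refill):

```python
import io

class SeparatorReader(io.BufferedReader):
    """BufferedReader with a readline-like readuntil(sep) method (sketch)."""
    def readuntil(self, sep=b'\0'):
        chunks = []
        while True:
            peeked = self.peek(1)  # whatever is currently buffered, no advance
            if not peeked:
                break              # EOF
            i = peeked.find(sep)
            if i >= 0:
                # consume up to and including the separator
                chunks.append(self.read(i + len(sep)))
                break
            # no separator buffered yet: consume what we saw, then refill
            # (a multi-byte sep spanning the refill boundary would be missed)
            chunks.append(self.read(len(peeked)))
        return b''.join(chunks)

r = SeparatorReader(io.BytesIO(b'header\0the rest of the data'))
print(r.readuntil(b'\0'))  # b'header\0'
print(r.read())            # b'the rest of the data'
```

Because readuntil consumes through the shared buffer, a plain read() afterwards picks up exactly where it left off, which is the point of Guido's suggestion.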

This still sounds a lot better to me than asking everyone to add a new
parameter to their readline() (and the implementation).

Maybe if we had more powerful adapters or wrappers so I could just say
> "here's a pre-existing buffer plus a text-file-like object, now wrap that
> up as a real TextIOBase for me" it would be possible to write something
> that worked from outside without these problems, but as things stand, I
> don't see an answer.
>

You probably have to do a separate wrapper for text streams, the types and
buffering implementation are just too different.


> Maybe put resplit in the stdlib, then just give iterlines as a 2-liner
> example (in the itertools recipes, or the file-I/O section of the
> tutorial?) where all these problems can be raised and not answered?
>

(Sorry, in a hurry / terribly distracted.)

--
--Guido van Rossum (python.org/~guido)
Andrew Barnert
2014-07-18 04:40:11 UTC
Permalink
On Jul 17, 2014, at 15:37, Guido van Rossum <guido-+ZN9ApsXKcEdnm+***@public.gmane.org> wrote:

> On Thu, Jul 17, 2014 at 2:59 PM, Andrew Barnert <abarnert-/***@public.gmane.orginvalid> wrote:
>> >I don't think it is reasonable to add a new parameter to readline()
>>
>> The problem is that my code has significant problems for many use cases, and I don't think they can be solved.
>>
>> Calling readline (or iterating the file) uses the underlying buffer (and stream decoder, for text files), keeps the file pointer in the same place, etc. My code doesn't, and no external code can. So, besides being less efficient, it leaves the file pointer in the wrong place (imagine using it to parse an RFC822 header then read() the body), doesn't properly decode files where the separator can be ambiguous with other bytes (try separating on '\0' in a UTF-16 file), etc.
>
> You can implement a subclass of io.BufferedIOBase that wraps an instance of io.RawIOBase (I think those are the right classes) where the wrapper adds a readuntil(separator) method. Whichever thing then wants to read the rest of the data should call read() on the wrapper object.
>
> This still sounds a lot better to me than asking everyone to add a new parameter to their readline() (and the implementation).

[snip]

> You probably have to do a separate wrapper for text streams, the types and buffering implementation are just too different.

The problem isn't needing two separate wrappers, it's that the text wrapper is effectively impossible.

For binary files, MyBufferedReader.readuntil is a slightly modified version of _pyio.RawIOBase.readline, which only needs to access the public interface of io.BufferedReader (peek and read).

For text files, however, it needs to access private information from TextIOWrapper that isn't exposed from C to Python. And, unlike BufferedReader, TextIOWrapper has no way to peek ahead, or push data back onto the buffer, or anything else usable as a workaround, so even if you wanted to try to take care of the decoding state problems manually, you can't, except by reading one character at a time.

There are also some minor problems even for binary files (e.g., MyBufferedReader(f.raw) has a different file position from f, so if you switch between them you'll end up skipping part of the file), but these won't affect most use cases; the text file problem is the big one.