Discussion:
Faster PyArg_ParseTupleAndKeywords kwargs
dw+
2014-05-23 19:22:20 UTC
Early on while working on py-lmdb I noticed that a huge proportion of
runtime was being lost to PyArg_ParseTupleAndKeywords, and so I
subsequently wrote a specialization for this extension module.

In the current code[0], parse_args() is much faster than
ParseTupleAndKeywords, responsible for a doubling of performance in
several of the library's faster code paths (e.g.
Cursor.put(append=True)). Ever since adding the rewrite I've wanted to
go back and either remove it or at least reduce the amount of custom
code, but it seems there really isn't a better approach to fast argument
parsing using the bare Python C API at the moment.

[0] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L833

In the append=True path, parse_args() yields a method that can complete
1.1m insertions/sec on my crappy Core 2 laptop, compared to 592k/sec
using the same method rewritten with PyArg_ParseTupleAndKeywords.

Looking to other 'fast' projects for precedent, and studying Cython's
output in particular, it seems that Cython completely ignores the
standard APIs and expends a huge amount of .text on using almost every
imaginable C performance trick to speed up parsing (actually Cython's
output is a sheer marvel of trickery, it's worth study). So it's clear
the standard APIs are somewhat non-ideal, and those concerned with
performance are taking other approaches.


ParseTupleAndKeywords is competitive for positional arguments (1.2m/sec
vs 1.5m/sec for "Cursor.put(k, v)"), but things go south when a kwarg
dict is provided.

The primary goal of parse_args() was to avoid the continuous temporary
allocations and hashing done by PyArg_ParseTupleAndKeywords by way of
PyDict_GetItemString(), which invokes PyString_FromString() internally,
in turn causing an allocation, a strlen() and a memcpy() for each
possible kwarg, on every function call.

The rewrite has been hacked over time, and honestly I'm not sure which
bits are responsible for the speed improvement, and which are totally
redundant. The tricks are:

* Intern keyword arg strings once at startup, avoiding the temporary
PyString creation and also causing their hash() to be cached across
calls. This uses an incredibly ugly pair of enum/const char *[]
static globals.[3]

[3] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L79

* Use a per-function 'static const' array of structures to describe the
expected set of arguments. Since these arrays are built at compile
time, they cannot directly reference the runtime-generated interned
PyStrings, thus the use of an enum.

A nice side effect of the array's contents being purely small integers
is that each array element is small and thus quite cache-efficient.
In the current code array elements are 4 bytes each.

* Avoid use of variable-length argument lists. I'm not sure if this
helps at all, but certainly it simplifies the parsing code and makes
the call sites much more compact.

Instead of a va_arg list of destination pointers, parsed output is
represented as a per-function structure[1][2] definition, whose
offsets are encoded into the above argspec array, and at build time.

[1] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L1265
[2] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L704

This might hurt the compiler's ability to optimize the placement of
what were previously small stack variables (e.g. I'm not sure if it
prevents the compiler making more use of registers). In any case the
overall result is much faster than before.

And most recently, giving a further 20% boost to append=True:

* Cache a dict that maps interned kwarg -> argspec array offset,
allowing the per-call kwarg dict to be iterated, and causing only one
hash lookup per supplied kwarg. Prior to the cache, presence of
kwargs would cause one hash lookup per argspec entry (e.g.
potentially 15 lookups instead of 1 or 2).


It's obvious this approach isn't generally useful, and looking at the
CPython source we can see the interning trick is already known, and
presumably not exposed in the CPython API because the method is quite
ugly. Still it seems there is room to improve the public API to include
something like this interning trick, and that's what this mail is about.



My initial thought is for a horribly macro-heavy API like:

PyObject *my_func(PyObject *self, PyObject *args, PyObject *kwargs)
{
    Py_ssize_t foo;
    const char *some_buf;
    PyObject *list;

    Py_BEGIN_ARGS
        PY_ARG("foo", PY_ARG_SSIZE_T, NULL, PY_ARG_REQUIRED),
        PY_ARG("some_buf", PY_ARG_BUFFER, NULL, PY_ARG_REQUIRED),
        PY_ARG("list", PY_ARG_OBJECT, &PyList_Type, 0)
    Py_END_ARGS

    if(! Py_PARSE_ARGS(args, kwargs, &foo, &some_buf, &list)) {
        return NULL;
    }

    /* do stuff */
}

Where:

struct py_arg_info;     /* Opaque */

struct py_arg_spec {
    const char *name;
    enum { ... } type;
    PyTypeObject *type_obj;
    int options;
};

#define PY_BEGIN_ARGS \
    static struct py_arg_info *_py_arg_info; \
    if(! _py_arg_info) { \
        static const struct py_arg_spec _py_args[] = {

#define PY_END_ARGS \
        }; \
        _Py_InitArgInfo(&_py_arg_info, _py_args, \
                        sizeof _py_args / sizeof _py_args[0]); \
    }

#define PY_ARG(name, type, type_obj, opts) {name, type, type_obj, opts}

#define Py_PARSE_ARGS(a, k, ...) \
    _Py_ParseArgsFromInfo(_py_arg_info, a, k, __VA_ARGS__)


Here some implementation-internal py_arg_info structure is built up on
first function invocation, producing the cached mapping of argument
keywords to array index, and storing a reference to the py_arg_spec
array, or some version of it that has been internally transformed to a
more useful format.

You may notice this depends on variadic macros, which at least some
Visual Studio versions don't support, so at minimum that part is broken.

The above also doesn't deal with all the cases supported by the existing
PyArg_ routines, such as setting the function name and custom error
message, or unpacking tuples (is this still even supported in Python 3?)

Another approach might be to use a PyArg_ParseTupleAndKeywords-alike
API, so that something like this was possible:

static PyObject *
my_method(PyObject *self, PyObject *args, PyObject *kwds)
{
    Py_ssize_t foo;
    const char *some_buf;
    Py_ssize_t some_buf_size;
    PyObject *list;

    static PyArgInfo arg_info;
    static char *keywords[] = {
        "foo", "some_buf", "list", NULL
    };

    if(! PyArg_FastParse(&arg_info, args, kwds, "ns#|O!", keywords,
                         &foo, &some_buf, &some_buf_size,
                         &PyList_Type, &list)) {
        return NULL;
    }

    /* do stuff */
}


In that case the API is very familiar, and PyArg_FastParse() builds the
cache itself on first invocation, but the supplied va_list is full of
noise that must somehow be carefully skipped. The work involved in doing
that skipping might introduce complexity that slows things down all over
again.

Any thoughts on a better API? Is there a need here? I'm obviously not
the first to notice PyArg_ParseTupleAndKeywords is slow, and so I wonder
how many people have sighed and brushed off the fact their module is
slower than it could be.


David
dw+
2014-05-23 20:07:17 UTC
Post by dw+
if(! PyArg_FastParse(&arg_info, args, kwds, "ns#|O!", keywords,
&foo, &some_buf, &some_buf_size,
&PyList_Type, &list)) {
return NULL;
}
Perhaps the most off-the-wall approach would be to completely preserve
the existing interface, by using a dollop of assembly to fetch the
return address, and use that to maintain some internal hash table.

That's incredibly nasty, but as a systemic speedup it might be worth it?


David
dw+
2014-05-23 21:41:35 UTC
Post by dw+
Perhaps the most off-the-wall approach would be to completely preserve
the existing interface, by using a dollop of assembly to fetch the
return address, and use that to maintain some internal hash table.
That's incredibly nasty, but as a systemic speedup it might be worth it?
Final (obvious in hindsight) suggestion: mix 'fmt' and 'keywords'
argument pointers together for use as a hash key into a table within
getargs.c, no nasty asm or interface changes necessary.


David
Nathaniel Smith
2014-05-23 20:08:28 UTC
Post by dw+
Early on while working on py-lmdb I noticed that a huge proportion of
runtime was being lost to PyArg_ParseTupleAndKeywords, and so I
subsequently wrote a specialization for this extension module.
In the current code[0], parse_args() is much faster than
ParseTupleAndKeywords, responsible for a doubling of performance in
several of the library's faster code paths (e.g.
Cursor.put(append=True)). Ever since adding the rewrite I've wanted to
go back and either remove it or at least reduce the amount of custom
code, but it seems there really isn't a better approach to fast argument
parsing using the bare Python C API at the moment.
[0] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L833
In the append=True path, parse_args() yields a method that can complete
1.1m insertions/sec on my crappy Core 2 laptop, compared to 592k/sec
using the same method rewritten with PyArg_ParseTupleAndKeywords.
Looking to other 'fast' projects for precedent, and studying Cython's
output in particular, it seems that Cython completely ignores the
standard APIs and expends a huge amount of .text on using almost every
imaginable C performance trick to speed up parsing (actually Cython's
output is a sheer marvel of trickery, it's worth study). So it's clear
the standard APIs are somewhat non-ideal, and those concerned with
performance are taking other approaches.
As another data point about PyArg_ParseTupleAndKeywords slowness,
Numpy has tons of barely-maintainable hand-written argument parsing
code. I haven't read the proposal below in detail, but anything that
helps us clean that up is ok with me...

You should check out Argument Clinic (PEP 436) if you haven't seen it.

-n
--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
dw+
2014-05-23 20:22:48 UTC
Post by Nathaniel Smith
You should check out Argument Clinic (PEP 436) if you haven't seen it.
Thanks! I'd seen this but forgotten about it. The use of a preprocessor
seems excessive, and a potential PITA when combined with other
preprocessors - e.g. Qt's moc, but the language is a very cool idea.

If the DSL definition was expressed as a string constant, that pointer
could key an internal hash table. Still not as fast as specialized code,
but perhaps an interesting middleground.


David
Nathaniel Smith
2014-05-23 20:38:40 UTC
Post by dw+
Post by Nathaniel Smith
You should check out Argument Clinic (PEP 436) if you haven't seen it.
Thanks! I'd seen this but forgotten about it. The use of a preprocessor
seems excessive, and a potential PITA when combined with other
preprocessors - e.g. Qt's moc, but the language is a very cool idea.
Yes, but OTOH it's working and shipping code with a substantial user
base (lots of the CPython implementation), so making it fast and
usable in third-party libraries might still be the most efficient
approach. And IIRC it's not (necessarily) a build-time thing, the
usual mode is for it to update your checked-in source directly, so
integration with other preprocessors might be a non-issue.

A preprocessor approach might make it easier to support older Pythons
in the generated code, compared to a library approach. (It's easier to
say "developers/the person making the source release must have Python
3 installed, but the generated code works everywhere" than to say
"this library only works on Python 3.5+ because that's the first
version that ships the new argument parsing API".)

-n
--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
Nick Coghlan
2014-05-24 01:36:29 UTC
Post by Nathaniel Smith
Post by dw+
Post by Nathaniel Smith
You should check out Argument Clinic (PEP 436) if you haven't seen it.
Thanks! I'd seen this but forgotten about it. The use of a preprocessor
seems excessive, and a potential PITA when combined with other
preprocessors - e.g. Qt's moc, but the language is a very cool idea.
Yes, but OTOH it's working and shipping code with a substantial user
base (lots of the CPython implementation), so making it fast and
usable in third-party libraries might still be the most efficient
approach. And IIRC it's not (necessarily) a build-time thing, the
usual mode is for it to update your checked-in source directly, so
integration with other preprocessors might be a non-issue.
Note there are two key goals behind Argument Clinic:

1. Add introspection metadata to functions implemented in C without further
reducing maintainability (adding an arg to a C function already touched 6
places, signature metadata would have been a 7th)

2. Eventually switch the generated code to something faster than
PyArg_ParseTupleAndKeywords.

What phase 2 actually looks like hasn't been defined yet (enabling phase 1
ended up being a big enough challenge for 3.4), but the ideas in this
thread would definitely be worth exploring further in that context.

As Nathaniel noted, once checked in, Argument Clinic code is just ordinary
C code with some funny comments, so it introduces no additional build time
dependencies.

Cheers,
Nick.
dw+
2014-05-24 01:52:11 UTC
Post by Nick Coghlan
1. Add introspection metadata to functions implemented in C without further
reducing maintainability (adding an arg to a C function already touched 6
places, signature metadata would have been a 7th)
As Nathaniel noted, once checked in, Argument Clinic code is just ordinary C
code with some funny comments, so it introduces no additional build time
dependencies.
Hadn't realized it was already in use! It isn't nearly as intrusive as I
might have expected; it seems 'preprocessor' is just a scary word. :)
Post by Nick Coghlan
2. Eventually switch the generated code to something faster than
PyArg_ParseTupleAndKeywords.
What phase 2 actually looks like hasn't been defined yet (enabling phase 1
ended up being a big enough challenge for 3.4), but the ideas in this thread
would definitely be worth exploring further in that context.
The previous mail's hint led to thinking about how to actually implement
a no-API-changes internal cache for PyArg_ParseTupleAndKeywords. While
not a perfect solution, that approach has the tremendous benefit of
backwards compatibility with every existing extension.

It seems that after 20 years' evolution, getargs.c is quite resistant to
change (read: it inflicts headaches and angst on the unwary), so
instead I've spent my Friday evening exploring a rewrite.


David
