Python Warts

Introduction

The Jargon File's definition of the term "wart" is:

A small, crocky feature that sticks out of an otherwise clean design. Something conspicuous for localized ugliness, especially a special-case exception to a general rule. ...

While I think Python has a very elegant design that successfully straddles the fine line between the minimalism of Lisp and the rococo complexities of Perl, it's certainly not perfect. There are various design features that I consider ugly, or at least suboptimal in some way. This essay will examine the most significant problems in Python as I perceive them, assisted by suggestions from the comp.lang.python crowd.

The purpose of this discussion isn't to bash Python, or to second-guess GvR; most of these problems are rather difficult to solve and don't have any obviously correct solution, even if one disregarded backward compatibility. Instead, the goal is simply to demonstrate awareness of Python's flaws, and to ask if they're fixable. One test of whether someone is a good programmer is to ask them to assess the tools they use -- languages, libraries, and operating systems. Someone who cannot perceive flaws or express an opinion about the low and high points of a design is either not accustomed to thinking analytically about the systems they encounter or is blindly partisan in the service of their chosen favorites. Computing, at least in the exploratory fields where I hang out, is more of an art than a science, and inability to critique a design is a serious liability.

So, let's proceed to examine some of Python's design flaws in no particular order. Many of these flaws are now only of historical interest, having been fixed in current versions of Python.

Scoping and Functional Programming

(Fixed in Python 2.1/2.2.)

Python has 3 built-in functions, map(), filter(), and reduce(), that provide some support for the functional style of programming. In versions of Python before 2.2, using these functions sometimes required a little hack with default arguments.

Let's choose one particular function for our example: filter(func, L) loops over the list L, returning a new list containing all the elements of L for which func(element) returned true. Take the situation where you have a list of numbers L, and want to find all the objects that are even numbers (multiples of 2). You could write this as:

L2 = filter(lambda N: (N % 2) == 0, L)

In case you're not familiar with the lambda expression: it defines a little function without a name that takes a single parameter N and returns the value of the comparison expression.

L2 is then set to a list containing all of the even numbers in L:

>>> L = [0, 5, 4, 32.5, 17, -6]
>>> L2 = filter(lambda N: (N % 2) == 0, L)
>>> print L2
[0, 4, -6]
>>>

Now, what if we want to extend this so you can find all the multiples of a given number m? In versions of Python up to 2.1, the rules for variable scoping would prevent the obvious code from working inside a function:

def f(L):
    m = 3
    L2 = filter(lambda N: (N % m) == 0, L)
    return L2
f([0, 2, 3, 17, -15])

On running this in Python 2.1, you would get a NameError for m in the second line of f(). Inside the anonymous function defined by the lambda expression, Python's scoping rules dictated that first the function's local variables and then the module-level variables were searched; f()'s local variables were never searched at all. The solution was to specify m as a default argument:

def f(L):
    m = 3
    L2 = filter(lambda N, m=m: (N % m) == 0, L)
    return L2
f([0, 2, 3, 17, -15])

If the filtering function required several of f()'s local variables, they'd all have to be passed in as default arguments. This results in the self=self, v=v that can be seen cluttering many uses of map() and filter(). Fixing this required modifying Python's scoping rules.

Python 2.1 added static scoping to the language to fix this problem. In 2.1, it had to be deliberately enabled in a module by including the directive from __future__ import nested_scopes at the top of a module, and the new scoping behaviour became the default only in Python 2.2. This was done to avoid breaking existing code whose behaviour would have changed as a result of the new scoping rules; Python 2.1 printed warnings about programs that were going to produce different results in Python 2.2 so that everyone had lots of time to fix their code. (See PEP 236 for a description of how incompatible changes are gradually introduced into Python.)

The first effect of static scoping was that the m=m default argument became unnecessary in the above example; the first version of the function now works as written. Under the new rules, when a given variable name is not assigned a value within a function (by an assignment, or the def, class, or import statements), references to the variable will be looked up in the local namespace of the enclosing scope. A more detailed explanation of the rules, and a dissection of the implementation, can be found in PEP 227: "Nested Scopes".
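As a rough illustration of the lookup rule just described, here is a minimal sketch written for a current Python 3 interpreter (an assumption; the examples in this essay use Python 2 syntax). The function name multiples_of is made up for the example, and list() is needed only because filter() returns an iterator in Python 3.

```python
def multiples_of(m, values):
    # m is never assigned inside the lambda, so the reference to it is
    # resolved in the local namespace of the enclosing function.
    return list(filter(lambda n: (n % m) == 0, values))

result = multiples_of(3, [0, 2, 3, 17, -15])
print(result)   # [0, 3, -15]
```

No m=m default argument is required; the lambda simply sees the m of the function that encloses it.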

Scoping Rules

(Fixed in Python 2.1/2.2.)

This is another demonstration of the same problem. Recursive functions, as in the following example, work fine when they're placed at the top level of a module:

def fact(n):
    if n == 1: return 1
    else: return n*fact(n-1)

However, if this definition is nested inside another function, it breaks as in the following example:

def g():
    def fact(n):
        if n == 1: return 1
        else: return n*fact(n-1)
    print fact(5)

When the nested version of fact() was called, it couldn't access its own name because the name fact was neither in the locals nor in the module-level globals:

[amk@mira amk]$ python2.1 /tmp/t.py
Traceback (innermost last):
  File "/tmp/t.py", line 8, in ?
    g()
  File "/tmp/t.py", line 6, in g
    print fact(5)
  File "/tmp/t.py", line 5, in fact
    else: return n*fact(n-1)
NameError: fact

This problem went away because of the static scoping introduced in Python 2.1 and described in the previous section.
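For comparison, a small sketch of the same nested definition under static scoping, assuming a current Python 3 interpreter (the essay's own examples are Python 2). The function returns the result instead of printing it so that it can be checked, and the base case is loosened to n <= 1 so the sketch also terminates for 0.

```python
def g():
    def fact(n):
        # 'fact' is a local variable of g(), so the recursive reference
        # is found in the enclosing scope.
        if n <= 1:
            return 1
        return n * fact(n - 1)
    return fact(5)

print(g())  # 120
```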

The Type/Class Dichotomy

(Fixed in Python 2.2.)

Down in the bowels of the C implementation of Python, types and classes are subtly different. A type is a C structure that contains several tables of pointers to C functions. For example, one table points to the functions that implement numeric operators such as + and *. Types which don't implement a function simply set the corresponding pointer to NULL; otherwise, there's a pointer to a C function that implements the corresponding operation. A class, on the other hand, is a Python object; methods with special names, such as __getitem__ and __add__ are used to add operator semantics, such as dictionary access or numeric behaviour.

The problem is that types and classes are similar in one respect, but different in others. Both types and classes provide attributes and methods, and operators can be overridden for them, but you couldn't subclass a type because types and classes are implemented differently. This meant that you couldn't subclass Python's built-in lists, dictionaries, or file objects to add a new method or different behaviour. It was possible to simulate a built-in type by providing all the required special methods -- for example, the UserList class in the standard library simulates lists by implementing __len__, __getitem__, __setitem__, __setslice__, and so forth -- but this is tedious, and the wrapper class can become out of date if future Python versions add new methods or attributes. If code did an explicit type check, as in:

def f(list_arg):
    if not isinstance(list_arg, list):
        raise TypeError, "list_arg should be a list"

then instances of your list-lookalike class would never be accepted as arguments to f(). (There are very good arguments that explicit isinstance() checks are un-Pythonic and should be avoided for just this reason; explicit checks prevent you from passing in arguments that will behave just like the desired type.)

Jython doesn't have this type/class dichotomy because it can subclass an arbitrary Java class. Jim Fulton's ExtensionClass could alleviate the problem for CPython, but it's not a standard component of Python, requires compiling a C extension, and can't simulate a class completely, so it ends up introducing various new problems that occur less often but are much more difficult to debug or work around when you encounter them. For example, with ExtensionClass you can't define a list-like class and override methods to define comparison between instances of the class and regular Python lists.

This problem can be fixed, but the flip side is that such generality will probably demand a speed penalty. The C code that makes up Python can take shortcuts if it's known that a Python object will belong to a given C type. For example, classes, instances, and modules are all implemented as namespaces using the dictionary type. If it's possible to subclass dictionaries, then it should also be possible to use a dictionary subclass to implement a class, instance, or module. (This could be used for all sorts of clever tricks by changing the semantics of the dictionary class used; for example, a read-only dictionary could prevent you from modifying instances or modules.) But this means that retrieving a dictionary key can't use special-case code that assumes the default C implementation, but instead has to call more general code. I'd like to see this problem fixed, because it would make some tasks easier and cleaner, but how much loss of speed am I willing to pay for it? I don't really know...

Python 2.2 made it possible to subclass built-in types such as lists and dictionaries, and the distinction between classes and types is greatly reduced. The changes are complex to describe, requiring the introduction of a new set of inheritance semantics and relationships, called "new-style classes" as a shorthand.

Backward compatibility was of great concern, however, so it wasn't possible to jettison the old rules (or "classic classes", as they're called). Instead the two sets of rules will coexist. If you subclass a built-in type or the new-in-2.2 object type, your class will obey the rules for new-style classes; otherwise it will follow the rules for classic classes. In some distant future version -- perhaps Python 3.0 -- the old-style rules may be discarded; how this change will be gradually introduced isn't clear.

The best way to learn about new-style classes is to read Guido van Rossum's essay. For a highly detailed dissection of the new rules, you can read the three PEPs describing them: PEP 252: "Making Types Look More Like Classes", PEP 253: "Subtyping Built-in Types", and PEP 254: "Making Classes Look More Like Types". (You'll need to be a serious Python wizard to find the PEPs very helpful; they are deep magic, indeed.)
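To make the effect concrete, here is a minimal sketch of subclassing a built-in list, assuming a current Python 3 interpreter where every class is new-style. LoggingList is a hypothetical class invented for this example; f() mirrors the type-checking function shown earlier in this section.

```python
class LoggingList(list):
    """Hypothetical list subclass that records every appended item."""

    def __init__(self, *args):
        super().__init__(*args)
        self.log = []

    def append(self, item):
        self.log.append(item)
        super().append(item)


def f(list_arg):
    # The explicit check now accepts subclasses of list as well.
    if not isinstance(list_arg, list):
        raise TypeError("list_arg should be a list")
    return len(list_arg)


items = LoggingList([1, 2])
items.append(3)
print(f(items), items.log)  # 3 [3]
```

Because LoggingList really is a list, it inherits every list method and passes the isinstance() check, which a UserList-style wrapper could not do.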

No do statement

Python has a while statement which has the loop test at the beginning of the loop, but there's no variant which has the test at the end. As a result, you commonly see code like this:

# Read lines until a blank line is found
while 1:
    line = sys.stdin.readline()
    if line == "\n": 
        break

The while 1: ... if (condition): break idiom is often seen in Python code. The code might be clearer if you could write:

# Read lines until a blank line is found
do:
    line = sys.stdin.readline()
while line != "\n"

Adding this new control structure to Python would be pretty straightforward, though any existing code that used "do" as a variable name would be broken by the introduction of a new do keyword. PEP 315: "Enhanced While Loop" is a detailed proposal for adding a do construct, but at this point no ruling on it has been made.

The addition of iterators in Python 2.2 has made it possible to write many such loops by using the for statement instead of while, an idiom that you might find preferable:

for line in sys.stdin:
    if line == '\n':
        break
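Another way to avoid the explicit break, sketched here for a current Python 3 interpreter (an assumption), is the two-argument form of the built-in iter(), which takes a callable and a sentinel value. An io.StringIO object stands in for sys.stdin so that the example is self-contained.

```python
import io

# io.StringIO stands in for sys.stdin in this sketch.
stream = io.StringIO("first\nsecond\n\nnot read\n")

lines = []
# iter(callable, sentinel) calls stream.readline() repeatedly and stops
# as soon as it returns the sentinel, here a blank line.
for line in iter(stream.readline, "\n"):
    lines.append(line)

print(lines)  # ['first\n', 'second\n']
```

One caveat: at end of file readline() returns an empty string, which never equals the sentinel, so input without a blank line would make this loop run forever. Real code would have to handle that case.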

Local Variable Optimization

This flaw bites people fairly often. Consider the following function; what do you think happens when you run it?

i=1
def f():
    print "i=",i
    i = i + 1 
f()

You might expect it to print "i=1" and increase the value of i to 2. In fact, you get this:

i=
Traceback (innermost last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 2, in f
NameError: i

What's going on here?

Python's source-to-bytecode compiler tries to optimize accesses to local variables. It decides that a variable is local if it's ever assigned a value in a function. Without the assignment i = i + 1, Python would assume that i is a global and generate code that accessed the variable as a global at the module level. When the assignment is present, the bytecode compiler generates different code that assumes i is a local variable; local variables are assigned consecutive numbers in an array so they can be retrieved more quickly. The print statement, therefore, gets compiled to look for the local i, which doesn't exist yet, and dies with the NameError exception. (In Python 1.6 and later, this raises a different exception, UnboundLocalError; it's hoped that this makes the problem a bit clearer.)

The fix for this problem is to declare i as a global in your function, like this:

i=1
def f():
    global i
    print "i=",i
    i = i + 1

I view this as an optimization becoming user-visible by breaking code; the right solution would be to do a basic-block analysis and, when this situation occurs, either report an error or automatically add a global declaration for the variable in question.

This wart usually doesn't affect people because module-level variables are commonly used for constants and therefore code never assigns a new value to them. Occasionally someone would write code that triggered this and assume it was some sort of interpreter bug, but the change to raising UnboundLocalError seems to have alleviated the problem, and puzzled questions about this are no longer a common sight in comp.lang.python.

r-strings

r-strings is short for raw-strings. They're string literals in which no processing of \-escapes is performed, and were added primarily to make regular expression patterns more readable. Python's string literals use \ for special sequences, as practically all C-derived languages do, but this collides with the frequent use of \ in regular expressions. The two layers of quoting were often confusing; to write a pattern that matches the TeX command construct '\break', you needed the regular expression \\break, and therefore the Python string literal "\\\\break". r-strings remove one layer of quoting, so you can write r"\\break", making the regex pattern clearer. This works nicely, but is perhaps a bit of a hack.

One fix would be to add regular expressions to the language core and have special syntax for them, as Perl and Ruby have done. However, I don't like this solution because Python is a general-purpose language, and regular expressions are used for one application domain, that of text processing. For other application domains, regular expressions may be of no interest and you might want to remove them from your interpreter to save code size. It's useful to have regular expressions in a module of their own where they can easily be removed from the interpreter binary if desired.

Another solution would be to introduce special forms of function arguments in which some argument would be left intact by the parser. This would be similar to special forms in Lisp, where some arguments may be evaluated and others aren't. For example, the call might be something like:

pattern = re.compile((break|par), re.M)

Making this work, however, would be difficult; how would the parser figure out where the pattern ended? r-strings at least have the saving grace of being simple and easy to understand. In short, while I don't find the idea of r-strings very clean, they certainly do solve the problem that inspired them, and I can't think of a better solution.
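The two spellings of the '\break' pattern can be compared directly. This is a small sketch, assuming a current Python 3 interpreter with the standard re module.

```python
import re

plain = "\\\\break"   # ordinary literal: every backslash is doubled
raw = r"\\break"      # raw literal: backslashes are left alone

print(plain == raw)   # True
print(len(raw))       # 7

# The regular expression \\break matches the literal text \break.
pattern = re.compile(raw)
print(pattern.match("\\break") is not None)  # True
```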

Calling Base Class Methods

The syntax for calling a base class method is icky. Consider a class C with a method f(). If you subclass C, override f(), and need to invoke the original method, the invocation syntax is C.f(self,x). If f() takes keyword arguments that need to be passed to the original version, things get worse in Python versions before 2.0 because you have to use the apply() built-in function: apply(C.f, (self,), kwdict). (Python 2.0 added f(arg1, arg2, **kwdict) to the language syntax.)

Python doesn't automatically call base class constructors, so you have to call them explicitly from subclasses:

class Base:
    def __init__(self,x,y,**kw):
        ...
class Derived(Base):
    def __init__(self,x,y,z,**kw):
        # The base class constructor is not called automatically.
        Base.__init__(self,x,y,**kw)
        ...

Michael Hudson called this "one of the ugliest things I've ever seen in Python that I couldn't think of a prettier way round".

Python 2.2's new-style classes add a built-in super() function that provides a tidier way to call superclass methods. In 2.2 you can write this:

class Derived(Base):
    def __init__(self,x,y,z,**kw):
        # self is supplied by super(); Base must be a new-style class.
        super(Derived, self).__init__(x,y,**kw)
        ...

This usage of super() will also be correct when the Derived class inherits from multiple base classes and some or all of them have __init__ methods.
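A minimal sketch of the multiple-inheritance case, assuming a current Python 3 interpreter where all classes are new-style. Mixin and its tag parameter are hypothetical names introduced only for this example, and each __init__ is written to pass leftover keyword arguments up the chain.

```python
class Base:
    def __init__(self, x, y, **kw):
        self.x = x
        self.y = y
        super().__init__(**kw)     # continue along the method resolution order


class Mixin:
    def __init__(self, tag="none", **kw):
        self.tag = tag
        super().__init__(**kw)


class Derived(Base, Mixin):
    def __init__(self, x, y, z, **kw):
        # self is passed implicitly; only the remaining arguments are listed.
        super(Derived, self).__init__(x, y, **kw)
        self.z = z


d = Derived(1, 2, 3, tag="demo")
print(d.x, d.y, d.z, d.tag)  # 1 2 3 demo
```

Each __init__ runs exactly once, in the order Derived, Base, Mixin, without Derived having to name either base class in the call.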

Integers and Floats

The results of integer operations can sometimes be surprising; the most notable example is that 7 / 2 returns 3 instead of 3.5. This comes from Python's C ancestry; integer operations return integers as their result, not floating point values. Randy Pausch, who used a modified version of Python in the Alice system for teaching programming to non-programmers, found this to be a constant source of confusion. Programmers used to C semantics, though, may be used to this truncating behaviour.

This is in the process of being fixed, but a change to such a fundamental operator needs to be phased in very carefully. The process began in Python 2.2 with the introduction of a new operator, represented by // and called the floor-division operator, that always rounds the quotient down to a whole number, as / already does for two integers. Existing programs that rely on the old integer behaviour can simply use // instead of /.

In 2.2, / has the same semantics as before. Adding a from __future__ import division directive will cause / to perform true division; 1/2 is then 0.5, not 0. The Python interpreter supports command-line switches which will warn at run-time about division applied to two integers; these switches can be used to find affected code and fix it. The meaning of / will not change until Python 3.0.
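The two operators can be compared side by side. This sketch assumes a Python 3 interpreter, where / already performs true division; in Python 2.2 the same results for / require the from __future__ import division directive.

```python
print(7 / 2)        # 3.5   true division
print(7 // 2)       # 3     floor division
print(-7 // 2)      # -4    floors toward negative infinity
print(int(-7 / 2))  # -3    truncation toward zero, for comparison
print(7.0 // 2)     # 3.0   // also works on floats
```

The difference between flooring and truncating only shows up when the operands have different signs.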

Catching Multiple Exceptions

You can catch multiple exceptions in a single try...except statement, but the syntax makes it easy to slip up. The first argument to the except clause can be a tuple of exceptions, but it's easy to write code like this:

try:
    ... whatever ...
except NameError, OverflowError:  # this is the line
    ... something else ...

If a NameError is raised, this binds the exception object to the name 'OverflowError'. Later on, if you try to catch OverflowError in the same namespace, it won't work because the name 'OverflowError' will no longer be bound to the correct exception object. The right syntax is as follows:

except (NameError, OverflowError):

Tim Peters points out "Nobody defended that, but it's also not a common mistake (trying to catch N exceptions is unusual for 1 < N < infinity). So while it's a sin, it's barely worth mentioning."

On a related note, a try block can be followed by except clauses or by a finally clause, but not by both in the same statement, as Java allows. This is because getting the code generation right for all these cases would be really difficult, so Guido punted and imposed this constraint.
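A short sketch of the tuple form in use, assuming a current Python 3 interpreter, where the exception object is named with the as keyword. The helper names classify, use_undefined_name, and overflow are made up for the example.

```python
import math

def classify(func):
    try:
        func()
    except (NameError, OverflowError) as exc:   # one clause, two exception types
        return type(exc).__name__
    return "no error"

def use_undefined_name():
    return undefined_name   # hypothetical name, deliberately not defined

def overflow():
    return math.exp(1000)   # too large for a float

print(classify(use_undefined_name))  # NameError
print(classify(overflow))            # OverflowError
print(classify(lambda: None))        # no error
```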

Explicit self in Methods

It's been suggested that the requirement to use self. to access attributes within class methods is tolerable but awkward, and the implied this from C++ and Java would be better. Perhaps this is a matter of preference; my Java code is instantly recognizable by the constant explicit use of this.attribute all over the place. Many Java or C++ coding standards dictate that object attributes should have a special prefix (e.g. m_) to distinguish them from locals; perhaps those who forget self are doomed to reinvent it.

If self. is too much typing for you, you can use a shorter variable name instead of self:

def method(s, arg1, arg2=None):
    s.attr1 = arg1
    if arg2 is not None:
        s.other_meth(arg2)

Using s instead of self doesn't follow the normal Python coding conventions, but few people will have difficulty adapting to the change.

Doubled Underscores for Private Variables

As a simple way to approximate private variables, beginning an attribute name with double underscores ('__') causes it to be mangled by Python's bytecode compiler. For example, an attribute named __value of a class C has its name mangled to _C__value. This certainly prevents name collisions between private variables used by a class and its subclass. But it's a hack and a kludge; making privacy depend on an unrelated property such as the attribute's name is clumsy. At least this ugliness is limited to one specific and little-used case; few Python programmers ever bother to use this private variable feature.
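The mangling can be observed by inspecting an instance. This is a minimal sketch, assuming a current Python 3 interpreter; the subclass D is added only to show that the two __value attributes do not collide.

```python
class C:
    def __init__(self):
        self.__value = 1      # stored as _C__value


class D(C):
    def __init__(self):
        super().__init__()
        self.__value = 2      # stored as _D__value, no clash with C


d = D()
print(sorted(vars(d)))        # ['_C__value', '_D__value']
print(d._C__value)            # 1
```

As the last line shows, the mangled name is still reachable from outside the class, so this is an approximation of privacy and not an enforcement of it.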

The .join() String Method

Python 2.0 introduced methods for string and Unicode objects. For example, 'abcdef'.replace('bcd', 'ZZZ') returns the string 'aZZZef'. Strings are still immutable, so the methods always return a new string instead of magically changing the contents of the existing string object. Most of the functions in the string module are now also available as string methods, and the intention is to encourage people to use string methods instead of the string module.

For many methods, there's no great argument about the appropriateness of the string method: the fact that s.upper() returns an uppercase version of the string is fairly clear and uncontroversial. The great exception is string.join(seq, sep), which takes a sequence seq containing strings and concatenates them, inserting the string sep between each element. The string method version of this is sep.join(seq), which seems backwards to many people. You can argue that it's strange to think of the separator as the actor in this situation; instead people think of the sequence as the primary actor and expect seq.join(sep), where join() is a method of the sequence. It's been pointed out that the string method is a bit clearer if you use it like this:

space = ' '
newstring = space.join(sequence)

A fair number of people find the above idiom unconvincing, calling it no more natural just because the string object is accessed as a variable value instead of a string literal. GvR argues against adding join() to sequences because then every sequence type would have to grow this new method; Python contains three sequence types (strings, tuples, lists) and many user-defined classes also behave like sequences.

This would be a minor point -- if you find the join() method on strings confusing, you could simply not use it -- if it weren't for the fact that the string module will be removed completely in Python 3.0, and this means string.join() would go away completely. This has been the point of much contention on comp.lang.python and on the python-dev list. An alternative resolution might be to add a join() built-in that will join any sequence of strings, but a counterargument is that joining sequences isn't so common a task that it deserves a built-in function.
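For reference, here is how the string method behaves with several kinds of sequence, as a sketch for a current Python 3 interpreter. The free function join() at the end is purely hypothetical, written to show what the proposed built-in might look like; it is not part of Python.

```python
words = ["spam", "eggs", "ham"]

print(" ".join(words))                      # spam eggs ham
print(", ".join(tuple(words)))              # spam, eggs, ham
print("-".join(w.upper() for w in words))   # SPAM-EGGS-HAM


def join(seq, sep=" "):
    """Hypothetical stand-in for the join() built-in discussed above."""
    return sep.join(seq)


print(join(words, sep="/"))                 # spam/eggs/ham
```

Because the method lives on the separator, it accepts a list, a tuple, or a generator without any of those types needing a join() method of their own.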

print >>

Another feature added in Python 2.0 was the ability to redirect the output of a print statement to a given file object, using a syntax inspired by the Unix shell redirection syntax. The Python usage looks like this:

import sys
print >>sys.stderr, "warning: ..."

Many people didn't like the idea of adding more punctuation to Python for this one particular case; it's not so difficult to produce a string and use the file object's write() method. On the other hand, the functionality is awfully handy for sending a warning to a CGI script's error log or whatever, and no one came up with a clearer syntax. Python tries to innovate as little as possible and many people are familiar with the >> syntax through the shell and shell-derived languages such as Perl, so re-using this syntax seems really the best choice. It's still possible to maintain that the functionality should simply have been left out, though.

Strangest of all is the case of print >>None. This was debated on python-dev for a while; the reasonable options were to report an error, simply discard the output (so print >>None would be like writing to /dev/null), or choose some default file object to use in this case. GvR selected the strangest and most magical behaviour; print >>None sends its output to standard output, sys.stdout.
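Python 3 replaced the print statement with a print() function whose file keyword argument plays the role of >>. The sketch below assumes a Python 3 interpreter and uses io.StringIO objects in place of real files so that the output can be inspected; it also shows that passing None still means standard output.

```python
import contextlib
import io

buffer = io.StringIO()
print("warning: ...", file=buffer)    # counterpart of: print >>buffer, "warning: ..."
print(repr(buffer.getvalue()))        # 'warning: ...\n'

# file=None falls back to whatever sys.stdout currently is.
captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    print("to stdout", file=None)
print(repr(captured.getvalue()))      # 'to stdout\n'
```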

Conclusion

These are the features of Python which have annoyed some people. Some of these features have been fixed in current versions, but others have not and personally I agree that some of them are irritating. So why do I still use Python?

When viewed next to the large number of things which Python gets right -- a small language core, strong but dynamic typing, reuse of the idea of namespaces, always using call by reference, indentation instead of delimiters -- these flaws are small ones, and I can easily live with them. Perhaps some of them can be fixed in future point releases of Python, or in the backward-incompatible Python 3.0, but even if they're not, they're relatively minor blemishes on an otherwise elegant design.

Acknowledgments

My thanks to the following people for comments and suggestions for this article: Scott Daniels, Bruce Eckel, John Farrell, Michael Hudson, Tim Peters, Reuben Sumner, Jarno Virtanen.