Solving Problems Badly

Sat 03 October 2015 by Peter Ward

Perfect is the enemy of good.

I have an interesting story to tell today, and the title of this post probably surprised you. Well, it’s not satire or schadenfreude, but genuinely serious advice. This is the story of some code I wrote (back in 2013) to solve a problem, where I deliberately chose a “bad” solution.

One of the things I’ve done for many years now is tutor high school students to learn programming as part of the NCSS Challenge. Each week, for five weeks, students are given a set of problems to solve, accompanied by notes explaining all the Python features and concepts they need to solve them. Many high school computing teachers sign up their class (sometimes using it as an assessment task), but many students also sign up themselves. In addition to the notes, students can also chat and discuss the problems (but not share code) on forums, or seek help privately from our volunteer tutors.

The challenge uses Discourse for the forums, which works well, but also for the private messaging (with some custom modifications), which works less well. There are two main problems, which the following sequence illustrates:

  1. Student 1 sends a message (sending a notification to all tutors).
  2. Tutor A starts writing a response.
  3. Tutor B starts writing a response.
  4. Tutor A posts their response.
  5. Tutor B either doesn’t notice (meaning the student gets two responses, which might attempt to solve the problem in different ways), or does notice and is annoyed that their time has been wasted.
  6. Tutor A goes back to the notifications, and starts clicking through the unread posts, but they’ve all been answered by other tutors.

To solve the problem of multiple tutors answering the same post, we set up an IRC channel that tutors joined, so they could announce which threads they were going to work on. Various tutors wrote IRC bots to automate bits of this, and then I combined all the functionality into a single bot which kept track of what state (unreplied/replied) each thread was in, and which tutor was currently assigned to answer it.

And now you have enough background to get onto what my bad solution was. One of the parts of this IRC bot was the “threads backend”, which stores thread metadata: author, title, unreplied/replied state, and the tutor assigned to it. Now the obvious answer to this is to use a database: SQLite if it’s a toy problem, and PostgreSQL if it’s big enough.

Instead of doing that, however, I went with something far more crude:

import json
import os

class Thread(object):
    def __init__(self, title=None, ...):
        self.title = title
        ...

    def to_json(self):
        return {
            'title': self.title,
            ...
        }

def load_threads():
    with open('threads.json', 'r') as f:
        threads = json.load(f)
    return {
        key: Thread(**kwargs)
        for key, kwargs in threads.iteritems()
    }

def save_threads(threads):
    with open('threads.json.new', 'w') as f:
        # If this crashes, we end up with an empty threads.json.new file,
        # but threads.json is still intact.
        json.dump({
            key: thread.to_json()
            for key, thread in threads.iteritems()
        }, f)
    os.rename('threads.json.new', 'threads.json')

...

def request_handler(request):
    threads = load_threads()

    if request.action == 'get_post':
        return get_post(threads, request)

    elif request.action == 'new_post':
        response = new_post(threads, request)
        save_threads(threads)
        return response
    ...

It’s terrible! On every request, the entire database is read from disk, and whenever a thread gets modified, the entire database has to be written back to disk (not just the parts which changed). There are some other problems too: it doesn’t let you handle requests concurrently, and it doesn’t let you create secondary indices to speed up queries. It’s also missing an os.fsync() call to ensure the data is actually written to disk, but let’s not talk about that one. 😉

The point about secondary indices isn’t just a throwaway comment either: one of the operations this backend supports is filtering threads by various attributes. Obviously, the only way this backend can support that is by doing a linear scan over all threads, and returning the ones which match.

But actually, despite the number of raised eyebrows, it’s not actually quite as terrible as it first seems. Firstly, it’s a backend to an IRC bot, which means that any response it gives is going to be seen on an IRC channel, which is (sort of) a serialised output: so not handling concurrent requests isn’t a big deal. It also doesn’t have to handle a lot of traffic (much less than one request per second), and even a 1s latency wouldn’t be obviously bad on IRC.

The other thing to consider is that the database isn’t actually that big! In the first year, there were ~500 threads: even with a generous 1KiB of metadata per thread, that’s still less than a mebibyte, which is easily written to disk in under a second. It’s also small enough to fit in the OS buffer cache, so reading the database from “disk” isn’t expensive. And at that size, those linear scans aren’t expensive either.
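The back-of-envelope arithmetic, with the figures from the paragraph above, is worth spelling out:

```python
threads = 500
bytes_per_thread = 1024                # a generous 1 KiB of metadata each
total_bytes = threads * bytes_per_thread
print(total_bytes / 2**20)             # 0.48828125 MiB: well under a mebibyte
```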

Furthermore, the code has some advantages. It’s simple: there’s full visibility into the internal state and the serialisation and deserialisation processes (unlike many ORMs, e.g. SQLAlchemy). It’s debuggable: it’s just JSON, so it’s incredibly easy to inspect manually (after pretty-printing with python -m json.tool) or to write custom tools to analyse. Schema migrations are easy: tell Thread.__init__ about the new key (providing a default value), and have Thread.to_json emit whatever schema you want stored: on the next database write, all records will be rewritten to the new schema.
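As a concrete example, here’s what migrating in a hypothetical priority field would look like (the field names are illustrative, not the real schema):

```python
class Thread(object):
    def __init__(self, title=None, state='unreplied', priority='normal'):
        # Old records have no 'priority' key, so the default fills it in
        # when load_threads() calls Thread(**kwargs).
        self.title = title
        self.state = state
        self.priority = priority

    def to_json(self):
        # Always emits the new schema: the next save_threads() call
        # rewrites every record to include 'priority'.
        return {
            'title': self.title,
            'state': self.state,
            'priority': self.priority,
        }
```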

This system performed adequately for the first year it was used. There were many schema changes, but it never corrupted data and never caused an outage. Its performance was good enough: by the end, some filtering requests took just over a second to respond to, but it was never bad enough for users to notice.

The next year, I rewrote it to use SQLAlchemy backed by SQLite, and added indices to optimise the most frequent filter requests. But I didn’t see that as admitting defeat. On the contrary: by starting with a terrible implementation, I learnt exactly where the bottlenecks were, and could then make informed decisions when designing the next version.
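The core of that optimisation can be sketched without SQLAlchemy, using only the standard-library sqlite3 module (the table and column names here are illustrative, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE threads (
        id INTEGER PRIMARY KEY,
        title TEXT,
        state TEXT,
        tutor TEXT
    )
''')
# A secondary index on the most frequently filtered column turns the
# linear scan into an index lookup.
conn.execute('CREATE INDEX idx_threads_state ON threads (state)')
conn.executemany(
    'INSERT INTO threads (title, state, tutor) VALUES (?, ?, ?)',
    [('Help with loops', 'unreplied', None),
     ('Syntax error in question 3', 'replied', 'tutor_a')],
)
unreplied = conn.execute(
    'SELECT title FROM threads WHERE state = ?', ('unreplied',)
).fetchall()
```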


Perhaps you weren’t surprised I went with a “bad” solution first: the idea is certainly not new. It’s an interesting strategy, but you need to be careful with it: if you don’t make it bad enough, you won’t be motivated enough to rewrite it in the future, and if you make it too bad, then it’ll crash and burn before you’re ready to rewrite.

Even if this story hasn’t fully convinced you that deliberately choosing imperfect solutions is a good idea, I hope you can see it’s a viable strategy in some cases.


Learning Go with AppEngine

Sat 06 April 2013 by Peter Ward

I’ve been meaning to learn Go for a while, and I had a nice opportunity today to do that. But before I talk about Go, let me explain why I was writing a Go program in the first place.

Last night, I saw that Alex (a friend of mine …

read more

PSA: Hotmail is Borken

Sat 09 February 2013 by Peter Ward

I’ve just debugged an issue where someone was trying to send an email to my domain, and got back an error message, which they took to mean that the email address no longer existed.

Of the very little debugging information I managed to get out, was this snippet (some …

read more

Restarting this blog

Wed 07 November 2012 by Peter Ward

Welcome to my blog! Apparently I think these things are still useful for something (I may be slightly delusional on this point).

I’m using Pelican, and have imported all the content from my old blog(s), but there are still some formatting issues with the import which I need …

read more

Keys

Sun 07 August 2011 by Peter Ward

Just a quick note for the people who care about trivial things like privacy. ;)

I’m decommissioning my four year old 1024-bit DSA key (1024D/D1ED1C9F), and replacing it with a new 4096-bit RSA key (4096R/3906C4AF). The fingerprint of the new key is 410D 28F0 1DFB 0DC4 0E0C  D663 …

read more

Mouse activity

Wed 12 May 2010 by Peter Ward

As seen on Planet Debian.

“IOGraph” is a small Java program that tracks and draws the status of your mouse over a period of time. Lines are movement, circles are places where the mouse has stopped… The bigger the circle, the longer it was there.

I suspect the large density …

read more

Hello Planet NCSS!

Thu 01 April 2010 by Peter Ward
print "Hello Planet NCSS!"

Thanks to Tim for making this happen so quickly!

read more

Test post

Wed 11 November 2009 by Peter Ward

Congratulations, you've won a prize! Nope...nothing to see here, move along.

read more

Links to stuff

Sun 04 October 2009 by Peter Ward

Erm, whoops!

Sun 12 April 2009 by Peter Ward
flowblok@aurora:~/git/cgit-0.8.2.1 % make
zsh: command not found: make

Clearly, I should have installed build-essentials a long time ago! ;-) More about what this "aurora" machine is later.

read more