Solving Problems Badly

Perfect is the enemy of good.

I have an interesting story to tell today, and the title of this post probably surprised you. Well, it’s not satire or schadenfreude, but actually serious advice. This is the story of some code I wrote (back in 2013) to solve a problem, where I deliberately chose a “bad” solution.

One of the things I’ve done for many years now is tutor high school students learning programming as part of the NCSS Challenge. Each week, for five weeks, students are given a set of problems to solve, accompanied by notes explaining all the Python features and concepts they need to solve them. Many high school computing teachers sign up their class (sometimes using it as an assessment task), but many students also sign up themselves. In addition to the notes, students can chat and discuss the problems (but not share code) on forums, or seek help privately from our volunteer tutors.

The challenge uses Discourse for the forums, which works well, but also for the private messaging (with some custom modifications), which works less well. There are two main problems, which the following sequence illustrates:

  1. Student 1 sends a message (sending a notification to all tutors).
  2. Tutor A starts writing a response.
  3. Tutor B starts writing a response.
  4. Tutor A posts their response.
  5. Tutor B either doesn’t notice (meaning the student gets two responses, which might attempt to solve the problem in different ways), or does notice and is annoyed that their time has been wasted.
  6. Tutor A goes back to the notifications, and starts clicking through the unread posts, but they’ve all been answered by other tutors.

To solve the problem of multiple tutors answering the same post, we set up an IRC channel that tutors joined, so they could announce which threads they were going to work on. Various tutors wrote IRC bots to automate bits of this, and then I combined all the functionality into a single bot that kept track of what state (unreplied/replied) each thread was in, and which tutor was currently assigned to answer it.

And now you have enough background to get onto what my bad solution was. One of the parts of this IRC bot was the “threads backend”, which stores thread metadata: author, title, unreplied/replied state, and the tutor assigned to it. Now the obvious answer to this is to use a database: SQLite if it’s a toy problem, and PostgreSQL if it’s big enough.

Instead of doing that, however, I went with something far more crude:

import json
import os

class Thread(object):
    def __init__(self, title=None, ...):
        self.title = title
        ...

    def to_json(self):
        return {
            'title': self.title,
            ...
        }

def load_threads():
    with open('threads.json', 'r') as f:
        threads = json.load(f)
    return {
        key: Thread(**kwargs)
        for key, kwargs in threads.iteritems()
    }

def save_threads(threads):
    with open('threads.json.new', 'w') as f:
        # If this crashes, we end up with an empty threads.json.new file,
        # but threads.json is still intact.
        json.dump({
            key: thread.to_json()
            for key, thread in threads.iteritems()
        }, f)
    os.rename('threads.json.new', 'threads.json')

...

def request_handler(request):
    threads = load_threads()

    if request.action == 'get_post':
        return get_post(threads, request)

    elif request.action == 'new_post':
        response = new_post(threads, request)
        save_threads(threads)
        return response
    ...

It’s terrible! On every request, the entire database is read from disk, and whenever a thread gets modified, the entire database has to be written back to disk (not just the parts which changed). There are some other problems too: it doesn’t let you handle requests concurrently, and it doesn’t let you create secondary indices to speed up queries. It’s also missing an os.fsync() call to ensure the data is actually written to disk before the rename, but let’s not talk about that one. 😉
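
For completeness, the durable version only needs a couple of extra lines before the rename. This is a sketch of what that fix would look like, not what the bot actually did:

def save_threads(threads):
    with open('threads.json.new', 'w') as f:
        json.dump({
            key: thread.to_json()
            for key, thread in threads.iteritems()
        }, f)
        # Flush Python's buffers, then ask the OS to push the bytes to disk,
        # so the rename below only ever replaces threads.json with a
        # fully-written file.
        f.flush()
        os.fsync(f.fileno())
    os.rename('threads.json.new', 'threads.json')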

The point about secondary indices isn’t just a throwaway comment either: one of the operations this backend supports is filtering threads by various attributes. Obviously, the only way this backend can support that is by doing a linear scan over all threads and returning the ones which match.
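
Concretely, every filter request boils down to something like the following (the function name and keyword-argument interface here are illustrative, not the bot’s actual API):

def filter_threads(threads, **criteria):
    # Linear scan: test every thread against every requested attribute.
    return {
        key: thread
        for key, thread in threads.iteritems()
        if all(getattr(thread, attr) == value
               for attr, value in criteria.items())
    }

# e.g. all unreplied threads that nobody has claimed yet:
# filter_threads(threads, replied=False, assignee=None)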

But despite the number of raised eyebrows, it’s actually not quite as terrible as it first seems. Firstly, it’s a backend to an IRC bot, which means that any response it gives is going to be seen on an IRC channel, which is (sort of) a serialised output: so not handling concurrent requests isn’t a big deal. It also doesn’t have to handle a lot of traffic (much less than one request per second), and even a 1s latency wouldn’t be obviously bad on IRC.

The other thing to consider is that the database isn’t actually that big! In the first year, there were ~500 threads: even with a generous 1KiB of metadata per thread, that’s still less than a mebibyte, which is easily written to disk in less than a second. It’s also small enough to fit in the OS buffer cache, so reading the database from “disk” isn’t expensive either. And with only a few hundred records, those linear scans aren’t that expensive after all.

Furthermore, the code has some advantages. It’s simple: there’s full visibility into the internal state and the serialisation and deserialisation processes (unlike many ORMs, e.g. SQLAlchemy). It’s debuggable: it’s just JSON, so it’s incredibly easy to manually inspect (after pretty-printing with python -m json.tool) or to write custom tools to analyse. Schema migrations are easy: tell Thread.__init__ about the new key (providing a default value), and have Thread.to_json emit whatever schema you want stored: on the next database write, all records will be rewritten to the new schema.
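
For example, adding a hypothetical priority field would look something like this (the field names here are illustrative, not the actual schema):

class Thread(object):
    def __init__(self, title=None, author=None, replied=False,
                 assignee=None, priority='normal'):
        self.title = title
        self.author = author
        self.replied = replied
        self.assignee = assignee
        # New field: old records without 'priority' pick up the default
        # when loaded via Thread(**kwargs).
        self.priority = priority

    def to_json(self):
        # Whatever is returned here is the new on-disk schema; every record
        # gets rewritten to it on the next save_threads().
        return {
            'title': self.title,
            'author': self.author,
            'replied': self.replied,
            'assignee': self.assignee,
            'priority': self.priority,
        }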

This system performed adequately for the first year it was used. There were many schema changes, but it never corrupted data and never caused an outage. Its performance was good enough: by the end, some filtering requests took just over a second to respond to, but it was never bad enough for users to notice.

The next year, I rewrote it to use SQLAlchemy backed by SQLite, and added indices to optimise the most frequent filter requests. But I didn’t see that as admitting defeat: on the contrary, by starting with a terrible implementation, I learnt exactly what the bottlenecks were, and could then make informed decisions when designing the next version.
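
Roughly, the new backend was shaped like this (the column names and index choices here are illustrative, not the actual schema):

from sqlalchemy import Boolean, Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Thread(Base):
    __tablename__ = 'threads'
    id = Column(Integer, primary_key=True)
    title = Column(String)
    author = Column(String)
    # Secondary indices on the attributes the most frequent filters hit.
    replied = Column(Boolean, index=True)
    assignee = Column(String, index=True)

engine = create_engine('sqlite:///threads.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

# e.g. all unreplied threads that nobody has claimed yet:
# Session().query(Thread).filter_by(replied=False, assignee=None).all()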


Perhaps you weren’t surprised that I went with a “bad” solution first: the idea is certainly not new. It’s an interesting strategy, but you need to be careful with it: if you don’t make it bad enough, you won’t be motivated enough to rewrite it in the future, and if you make it too bad, then it’ll crash and burn before you’re ready to rewrite it.

Even if this story hasn’t fully convinced you that deliberately choosing imperfect solutions is a good idea, I hope you can see it’s a viable strategy in some cases.
