Chipping Away at Monoliths

My favorite way to design a program is to distill its requirements into the functionality that makes it unique, and then only build that. The other requirements can usually be satisfied by other programs that have satisfied them before. I have a lot more fun writing programs that work with other programs because I get to focus on the actual problem. That focus pays dividends later too, when I go back to read the code.

Programs working together is a great example of software composability, and in the context of a UNIX system it’s an example of the software tool philosophy, which promotes composability of programs by assuming that the output of one program will be used as input to another. The benefits of composability aren’t monopolized by the FP crowd and shell adepts; other programming cultures are coming to prefer composition too.

Better Ingredients, Better Programs

I mostly really like Taco Bell Programming. It’s an article about a number of things: Getting to the point. Solving the problem. The state of system administration. Franchise logistics. What I picked up on a recent re-read though is that it’s also implicitly about how to write a good program, specifically by following the tool philosophy. The example command from the article processes files downloaded from a web crawler. It uses standard programs to list the files and control concurrency, and uses a purpose-built program to implement the processing logic:

find crawl_dir/ -type f -print0 | xargs -n1 -0 -P32 ./process

What this command does is find every regular file under the crawl_dir directory and run the ./process program once per file, keeping up to 32 instances running in parallel. Despite its simplicity, and partly as a consequence of it, the command can provoke strong negative reactions from some programmers, who warn against shipping something like it and cite concerns about maintainability. Others are simply skeptical, wondering how robust a solution it could be.

Well, we can’t know how maintainable or robust the command is without looking into the implementation of the process program, and I suspect some of the concerns come from seeing how it gets used without seeing how it works. It might be a Go binary or a node.js script. Maybe it gracefully handles failure, can be retried, and has meticulous logging. Or maybe it does none of those things and fails spectacularly when it is run by anyone except the original author in their particular environment. Since we can’t open up this Schrödinger’s program, I’ll assume it’s as robust as it needs to be to satisfy everyone on the team. However, we don’t need to open it to observe how it excels as a tool:

It can be run on one file at a time: ./process filename. This is a huge boon to iterative development and testing.
It can be run on multiple files at once: ./process crawl_dir/*. Again great for testing, also good for analyzing subsets of data and for checking resource usage (CPU, RAM, disk space, etc.) as inputs grow.
It can be parallelized across cores on a single instance with xargs, like in the example pipeline. If the problem is CPU-bound, and we have enough resources on the instance, we can tune the parallelization to speed it up.
It can be parallelized across nodes with parallel. If we don’t have enough resources on one instance, we can split up the input files and utilize multiple instances without changing the program.

In the Shadow of the Monolith

Just for fun, let’s look at a monolithic implementation of the example command from Taco Bell Programming. It could be written in any language, as long as it’s an exact recreation of the original pipeline. I picked Python because it was the next easiest way for me to implement it after the pipeline version.

It is sculpture in code, marmoreal and beautiful. But it is inflexible, and it is not a tool. It is a monolith.

from multiprocessing import Process, Queue
from os import path, walk


def process(filename):
    "This is where the magic happens."
    pass


def worker(workqueue):
    for filename in iter(workqueue.get, None):
        process(filename)


def main(worker_count, directory):
    workqueue = Queue()
    for root, _, filenames in walk(directory):
        for filename in filenames:
            workqueue.put(path.join(root, filename))

    processes = []
    for _ in range(worker_count):
        p = Process(target=worker, args=(workqueue,))
        p.start()
        processes.append(p)
        workqueue.put(None)

    for p in processes:
        p.join()

if __name__ == '__main__':
    main(32, 'crawl_dir')

It exhibits great Python style. Its intent and operation are clear. It also has some weaknesses in its design:

It can only process files found in a directory called crawl_dir in the current working directory.
It always spins up 32 processes no matter how many files need to be processed.

To address these weaknesses we need to add even more code. If we want to be able to change the input directory, we need to add some code to handle that as an option. Even then, the program is still inflexible because it can only take a directory to walk as input. That makes it hard to test or re-run on individual files because the user has to create a directory, then populate it, and then finally run the program. If we wanted to be able to control the parallelism, we’d have to add an option for that too.
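To make that cost concrete, here is a minimal sketch of the option handling the monolith would pick up, using argparse from the standard library. The parse_args helper, the positional directory argument, and the --workers flag are all names I made up for illustration; nothing in the original command defines them.

```python
import argparse


def parse_args(argv=None):
    """Parse the options the monolith would need (all names hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Process every file under a directory in parallel."
    )
    # Hypothetical positional argument replacing the hard-coded 'crawl_dir'.
    parser.add_argument(
        "directory",
        nargs="?",
        default="crawl_dir",
        help="directory to walk (default: crawl_dir)",
    )
    # Hypothetical flag replacing the hard-coded worker count of 32.
    parser.add_argument(
        "-w",
        "--workers",
        type=int,
        default=32,
        help="number of worker processes (default: 32)",
    )
    return parser.parse_args(argv)


# In the monolith, the entry point would then become something like:
#     args = parse_args()
#     main(args.workers, args.directory)
```

That is about twenty lines that have nothing to do with processing a file, and the program still can’t be pointed at a single file.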

We can simplify. Use our tools to chip away at the monolith. Replace our parallelization code with xargs or parallel when necessary. Replace our filesystem walking code with taking filenames as arguments so that we can use shell globbing or find. With the knowledge of what’s already in the toolbox, we can whittle our implementation down into a workable tool:

import sys


def process(filename):
    "This is where the magic happens."
    pass


def main(filenames):
    for filename in filenames:
        process(filename)

if __name__ == '__main__':
    main(sys.argv[1:])

That is what I imagine the process program might look like. It shows that designing programs as tools can do things like remove boilerplate outer loops, obviate inflexible input methods and complicated option handling, and bring focus to business logic.
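One thing a tool like this can add cheaply is an honest exit status, so that whatever runs it can tell when a file failed. The following is only a sketch under assumptions: it supposes that process raises an exception on failure (the original leaves its behavior unspecified), the body of process is a stand-in that just reads the file, and the error message format is my own choice.

```python
import sys


def process(filename):
    """Stand-in for the real work; assumed to raise on failure."""
    with open(filename, "rb") as f:
        f.read()


def main(filenames):
    """Process each file, report failures on stderr, return an exit status."""
    failures = 0
    for filename in filenames:
        try:
            process(filename)
        except Exception as exc:
            # Keep going so one bad file doesn't block the rest of the batch.
            print(f"{filename}: {exc}", file=sys.stderr)
            failures += 1
    # Zero means every file succeeded; nonzero means at least one failed.
    return 1 if failures else 0
```

Wired up with sys.exit(main(sys.argv[1:])) under the usual __main__ guard, a failed file makes the program exit nonzero. xargs in turn exits nonzero when any invocation does, so the failure surfaces at the end of the pipeline without any extra plumbing.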

Erosion and Accretion

The benefits of composable software are most apparent when the systems it is part of are changing. Command line tools help to perform frequent interactive and exploratory tasks in the shell by allowing smaller sub-tasks to be combined. Components in systems like React help manage the effects of rapid development and frequent product changes in apps, in part by enabling simpler reorganization.

When a system stops changing, it’s not uncommon for the components to accrete back into a monolith. This tendency is explored in Microservices and the Migrating UNIX Philosophy. It argues that components fuse when composability isn’t needed anymore, usually when the system goes into stasis. That is true especially when the system actually is finished and the benefits of composability start diminishing in the face of potential optimizations. In some cases though the accretion is a matter of perspective and packaging. If I develop a set of tools to perform an analysis and then send the final packaged version to a colleague, that program might appear monolithic to them depending on how they invoke it.