Finite automata 101

Almost every software starts very small. They gain complexity over time. A red flag of "technical debt" is the growing difficulty to control the program state.

Imagine a little program that runs in a self-service kiosk. Once upon a time it just printed a numbered token when a button was pressed, a sort of "electronic line manager" so people wait around until their number is called up. The only external control was a button, and the program used to have just two states.

        (button pressed)
Idle    ---------------> Printing
        <---------------
        (printer done)

Since it was so simple, the state control was just a boolean variable named "is_printing".

Then, the kiosk gets additional features: a numeric keyboard and a small screen. The user can now type the token number to query how many people are in the line, the estimated time to be called up, etc. To handle this feature, we add two more booleans: "is_typing", and "query_on_screen". The print button should be unresposive while typing or showing information.

The next day after the new kiosk is commissioned, someone types a number but leaves before pressing ENTER, and not pressing ESC. The following users kept pushing the original button to get tokens, to no avail. To fix this case, a software update: a timer that clears "is_typing" and "query_on_screen" after a short time of inactivity.

But the programmer naively employed a thread to implement this timer and forgot to handle the case when the user does press ESC, and another user asks for a token right after. The timer triggers in the middle of printing, leaving a half-printed token. Since threads cannot be cancelled, simply testing the "is_typing" variable is not enough, since a cancelled query may be followed by a new query, instead of a token request. Our rock star developer adds yet another boolean "is_timer_cancelled".

Yet another problem: someone pressed a numeric key while the token was printed. The keyboard event made "is_typing" true before checking first if "is_printing" was true. The answer: a bunch of if's here and there to avoid this pitfall.

At this point, we already have 4 boolean variables that together represent the app state. There are 16 possible states with 256 possible transitions from a given state to the next. But, out of these, only 4 states and 8 transitions are actually valid:

Event                      Transition
-----------------------    ------------------------
Token button               Idle -> Printing
Printing finished          Printing -> Idle
First digit typed          Idle -> Typing
Other digit typed          Typing -> Typing
ENTER                      Typing -> Query
ESC or timer               Typing -> Idle
ENTER or button            Query -> Idle

The gut reaction of every developer is to add lots of if's here and there to avoid the "forbidden" states e.g. "is_printing" and "is_typing" may never be true at the same time, and also to avoid forbidden transitions (e.g. not to interrupt a print when a key is pressed). The code becomes a spaghetti, and the implementation of every new feature is a nightmare because possible states and transitons grow exponentially.

A step to the right direction is the replacement of all booleans by a single state number. Some developers will be kind enough to use an enumeration. At least the forbidden states are no longer a menace, but forbidden transitions are still possible. Every event handler still needs to watch out for them. The spaghetti carbonara became spaghetti al sugo — leaner, but still a spaghetti.

The solution for this mess is one of the oldest design patterns around: the finite automaton state machine. Despite its age, it is shocking how much code does not leverage on it.

I guess state machines are taught in comp sci classes as purely theoertical gimmicks, together with partial functions, cardinality of sets, P vs. NP and other things that most programmers forget as soon as start writing real software. (Like that joke where a programmer must invert a binary tree on the job interview, and his first task on job is to vertically center a CSS div...)

For the kiosk example, the Right Thing™ would be to use a state machine from day one. But yes, I am a lazy programmer myself and I also start my projects the wrong way, and then I bring myself to refactor the code as a state machine when things get out of control, and testing becomes a bigger nuisance than paying the technical debt.

Thinking over the state machine

I like to use pen and paper to draft a finite automaton before I start the implementation. The first draft is generally awful, and next drafts get better and better:

Figure 1: First draft of a finite automaton

Figure 2: Third draft of a finite automaton

To define a finite automaton, we must define a) the states, b) the valid transitions between states, c) the events that trigger the transitions, and d) what task must be run when the transition happens. Everyone can draw these elements as they please. I like to scribble the triggering events near the arrows that represent transitions.

Here we have the first advantage of using a state machine: the very act of thinking over how the application should work. This exercise alone avoids many bug, or highlights many bugs when existing code is being refactored.

Implementing the finite automaton

First off, you should use a ready-made library. There are many of them for every programming language. But it is so easy to craft one, that I find myself implementing them from scratch every now and then.

I will show a sample implementation in Swift language. I start with an associative array and a state variable. A list with human-readable state names is also handy for debug messages.

typealias SMTrans = (AnyObject?) -> Bool

enum SM: Int {
    case start = 0
    case idle
    ...
}

var SMnames = ["START", ...]

var _state = SM.start
var sm: [SM:[SM:SMTrans]] = [:]

The trans function below is called when some event requests the machine to change its state:

func trans(_ newstate: SM, graceful: Bool, cargo: AnyObject?) -> Bool
{
    let oldstate = self._state
    if self.sm[oldstate] == nil || self.sm[oldstate]![newstate] == nil
        if !graceful {
            NSLog("########### invalid transition %@ -> %@",
                SMnames[oldstate.rawValue],
                SMnames[newstate.rawValue])
        }
        return false
    }
        
    NSLog("Trans %@ -> %@",
              SMnames[oldstate.rawValue],
              SMnames[newstate.rawValue])

    self._state = newstate
    self.sm[oldstate]![newstate]!(cargo)

    return true
}

In my implementation, if the transition is valid, the code associated with the transition is run. Otherwise, an error is logged. (I will explain graceful and cargo later.)

Using a finite automaton elides all those if's that used to be scattered throughout the code to avoid forbidden transitions. Any event handler can try to change the state, but the change is "accepted" only if the transition is indeed valid. Of course, tried invalid transitions are most probably bugs; that's why they are logged, to be spotted and fixed. But, even in production, a refused transition is way better than letting code run at the wrong moment.

The machine itself has been implemented. Now we add the valid state transitions, one per graph arrow, and we attach the code that should run when they happen.

sm[SM.start] = [:]
...

sm[SM.start]![SM.idle] = { _ in
    last_token = 1
    ... rest of the boot code ..
}

sm[SM.idle]![SM.printing] = { _ in
    printToken(last_token)
    last_token += 1
}

sm[SM.printing]![SM.idle] = { _ in
    // does nothing
}

Some finite automata implementations allow to attach "observers" to each transition, so more than one piece of code is run when the transition happens. I prefer that all affairs of a given transaction are laid on a single place.

This automaton is all we need to implement the original kiosk, that just printed tokens. In order to use it, we just need to trigger the transitions:

func program_boot()
{
	trans(SM.idle, graceful: false, cargo: nil)
}

func button_pressed() {
	trans(SM.printing, graceful: true, cargo: nil)
}

func finished_printing() {
	trans(SM.idle, graceful: false, cargo: nil)
}

Now it's time to explain graceful parameter, When the button is pressed, graceful is true because users can keep pressing the button while the printer is still working. The transition must not happen while the printer is active, but this refusal is not a bug, it is an expected situation. On the other hand, the transitions motivated by boot and by the printer should only happen when expected, otherwise they should be logged for further debugging.

Now, let's implement the additional state transitions for the query. In some transitions I employ the cargo parameter to transport keystokes:

sm[SM.idle]![SM.typing] = { cargo in
	let c = cargo as! String
	if is_digit(c) {
		restart_timer()
		typed_number = c 
	} else {
		// go back
		trans(SM.idle)
	}
}

sm[SM.typing]![SM.typing] = { cargo in
	let c = cargo as! String
	if is_digit(c) {
		restart_timer()
		typed_number += c
	} else if is_ENTER(c) {
		trans(SM.show_query)
	} else if is_ESC(caracter) {
		// quits
		trans(SM.idle)
	}
}

sm[SM.typing]![SM.show_query] = { _ in
	restart_timer()
	... show query on screen ...
}

sm[SM.typing]![SM.idle] = { _ in
	cancel_timer()
}

sm[SM.show_query]![SM.idle] = { _ in
	... clear screen ...
	cancel_timer()
}

At key press event handler, I pass the key as cargo through trans() call, sparing yet another global variable.

func keypress(key: String)
{
	if ! trans(SM.typing, graceful: true,
			cargo: key as AnyObject) {
		trans(SM.idle, graceful: true, cargo: nil)
	}
}

func timer_alarm()
{
	trans(SM.idle, graceful: true, cargo: nil)
}

(By the way, the cargo name pays homage to Clipper 5 which used this name for ancilliary data passed through callbacks. It can be also understood as "contraband" since it is a feature that theoretical, "pure" state machines don't have.)

In the function keypress I have used another dirty trick. Since trans() returns false when the transition is invalid, I employ this return as a way to choose between two possible transitions. If the transition to SM.typing fails, it must be because the kiosk is showing a query, and then I try transitioning to SM.idle.

But the implementation above has at least one bug and a risk.

The bug: if the machine is printing, the transition to idle is valid. But, as it is, pressing any key also transitions to idle and that would interrupt the printer. Finite automata do not have a way to check whether a particular event can happen at a particular state. They only check if the transition is valid.

The fix is to create an additional printed state. Only the printer can transition from printing to printing. The attached transition code just fast-forwards to idle state. A keypress is no longer able to interrupt the printer, because the transition printing-idle is now forbidden.

The risk: the timer handling is still dirty. If we forget to cancel the timer, it can bring problems. Again, an additional state timeout removes the risk so elegantly that we don't even need to cancel the timer anymore:

func timer_alarm()
{
	trans(SM.timeout, graceful: true, cargo: nil)
}

sm[SM.typing]![SM.idle] = { _ in
	// does nothing
}

sm[SM.typing]![SM.timeout] = { _ in
	trans(SM.idle, graceful: false, cargo: nil)
}

sm[SM.timeout][SM.idle] = { _ in
	// does nothing
}

sm[SM.show_query]![SM.idle] = { _ in
	... clear screen ...
}

sm[SM.show_query]![SM.timeout] = { _ in
	... clear screen ...
	trans(SM.idle, graceful: false, cargo: nil)
}

The extemporaneous timer is no longer a problem because only the typing-timeout and query-timeout transitions are valid. If the machine is at any other state, the timeout is ignored. If a query is cancelled and another one is started, the active timer is not a problem because the first thing the transition idle-typing does is to reset the timer.

In a sense, these new states timeout and printed are pseudo-states, or fake states, because they don't map directly to real-world events. They have been created just to make our code fit better within the constraints of a state machine.

So yeah, the code style does change when using finite automata. The code is no longer written top-down; each fragment is associated to a particular transition. Each piece reacts to some state change, that is triggered by some event. The final result looks a lot like asynchronous programming, a style that many developers dislike.

Finite automata are pretty limited, and in the example we had to create intermediate states in order to safely handle timeout and printing.

When the automaton gets hairy

A state machine can grow too complex, of course. One remedy is to break it down in multiple smaller automata, with or without hierarchical relationship between them. Our kiosk could have three state machines, one main machine and two ancilliary machines:

Figure 3: Kiosk modeled as three finite automata

In this proposal, the two ancilliary automata possess a terminal state, in which they "die". The death triggers tha main machine back to idle state. This approach also resolved that problems with extemporaneous keypresses and timeouts, without the need of pseudo-states.

For a finite automaton, the ideal world is 1:1 relationship between events and transitions. It is difficult to model transitions that depend on multiple, independent external events. For example, suppose a bomb that sets off when three buttons (A, B, C) are pressed in any order, but only once each. Modeling this sequence as a state machine is very confusing:

Figure 4: ABC bomb as a finite automaton

Perhaps we could use the cargo value in a real implementation to memorize the keys already presses, but let's face the truth, it is not the right tool for the job. A Petri net is the right abstract machine for this case.

Reactive programming

Part of the allure of the reactive programming, found in newer frameworks like React and Vue.js, is derived from the finite automata.

Theory

Among the abstract state machines, the finite automaton is the most "stupid". Its only "memory" is the state itself. Most systems are too complex for it to grasp. The Petri net is more powerful and quite popular among network protocol scientists. The most powerful machine is, of course, the Turing Machine, whom a computer is equivalent to.

A regular expression can be converted to a finite automaton and vice-versa, and the share the same limitations. For example, it is next to impossible to test whether an e-mail address is valid using only a regular expression, and it is equally silly to do this using finite automata.

A side benefit of writing code following a theoretical model as finite automata or Petri nets, is the possibility of automatic/mechanical consistency checking. There are ready-made tools for the task, and it is not that difficult to write tests that e.g. find unused states, unused transitions, etc. that are likely bugs.