2023-10-27

Edge Cases

Sometimes, software developers seem tempted to treat the part of a problem they like as the important part, and everything else they must get right as "edge cases".

Edge cases are related to corner cases and special cases, and these are all names for various kinds of pathological cases. Except, sometimes, the pathological cases really are part of the core problem, not some exotic side quest.

Pretend for a second that you're designing a very basic cache. A cache will have an immutable lookup key and an eviction policy. Entries might be evicted when they get too old, or when something invalidates them. A cache lookup can result in a miss, in which case the system performs the expensive operation and inserts the fresh result into the cache. A cache lookup can also locate an existing entry, but it still needs to validate the entry before handing it off to the user (for instance, to ensure that the entry didn't expire).
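That basic, single-threaded version can be sketched in a few lines. This is only an illustration of the lookup/validate/miss flow described above; the class and parameter names (`Cache`, `ttl_seconds`, `compute`) are my own, not something the post prescribes:

```python
import time
from typing import Any, Callable


class Cache:
    """A minimal single-threaded cache sketch with TTL-based expiry."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.entries: dict[Any, tuple[Any, float]] = {}  # key -> (value, inserted_at)

    def get(self, key: Any, compute: Callable[[], Any]) -> Any:
        entry = self.entries.get(key)
        if entry is not None:
            value, inserted_at = entry
            # Validate before handing the entry to the caller:
            # an expired entry counts as a miss.
            if time.monotonic() - inserted_at < self.ttl:
                return value
            del self.entries[key]  # evict the stale entry
        # Miss: perform the expensive operation and cache the fresh result.
        value = compute()
        self.entries[key] = (value, time.monotonic())
        return value
```

Note that even this toy version already folds validation into the lookup: an expired hit is just a miss.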

Now, pretend that multiple processes can use this cache, and that the expensive operation whose result must be cached takes minutes or hours, and considerable machine resources.

What might be special cases under one set of requirements can be a part of the core problem under more demanding requirements. For instance, if the cache serves multiple client processes at the same time, it now needs to deal with a set of transactional requirements: What atomicity, consistency and isolation guarantees does it provide? (It's a cache, so we can relax the durability requirements. As far as correctness goes, evicting random entries at random times is a perfectly acceptable behavior for a cache!)

It's not an edge case that two clients can look up the same key at the same time, and miss. In that case, for efficiency's sake, you might define a requirement that the two clients don't both execute the expensive operation at the same time. Instead, you might want to handle more complex states for each cache entry, such as a "populating" state. Not so basic anymore! The cache now provides mutual exclusion around the fill operation.
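Here is one way the "populating" state and the mutual exclusion around the fill might look for threads within one process (the multi-process or distributed version is harder, as the next paragraph explains). This is a sketch under my own assumptions, not the post's design; the names (`SingleFlightCache`, `get`) are illustrative:

```python
import threading
from typing import Any, Callable


class SingleFlightCache:
    """Sketch: only one thread runs the expensive fill for a given key;
    concurrent missers wait on the entry's "populating" state instead of
    duplicating the work."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries: dict[Any, Any] = {}            # key -> value
        self._populating: dict[Any, threading.Event] = {}  # key -> fill-done event

    def get(self, key: Any, compute: Callable[[], Any]) -> Any:
        while True:
            with self._lock:
                if key in self._entries:
                    return self._entries[key]
                event = self._populating.get(key)
                if event is None:
                    # We won the race: mark the entry as "populating".
                    event = threading.Event()
                    self._populating[key] = event
                    we_fill = True
                else:
                    we_fill = False
            if we_fill:
                try:
                    value = compute()  # expensive operation, outside the lock
                    with self._lock:
                        self._entries[key] = value
                    return value
                finally:
                    # Leave the "populating" state even if the fill failed,
                    # so a waiter can retry instead of blocking forever.
                    with self._lock:
                        del self._populating[key]
                    event.set()
            else:
                # Another thread is filling; wait, then re-check from the top.
                event.wait()
```

Even this small sketch shows the state explosion starting: the fill can fail, waiters must re-check rather than trust the wakeup, and the check-and-mark step must be atomic.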

If the clients and the cache are part of a distributed system, various services can fail. The fill operation might become a lot more complex, and so would the mutual exclusion that the cache now promises. Between the state explosion introduced by concurrency, and the additional states introduced by faults, if you try to reason about everything that can happen in your system as "edge cases", you'll definitely miss a few. This approach is fundamentally flawed, and never works out.

Practitioners choose the requirements they want their systems to meet, and define safety properties for them. Design a system with a manageable number of states and transitions, and prove that all the safety invariants hold before and after every transition. Basic software engineering practices such as encapsulation (information hiding), building complex components out of simpler ones, and writing self-checking code help enormously with managing state.
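One concrete form of self-checking code is to spell out the legal state transitions and assert on every move. A minimal sketch, with states and transitions of my own choosing (loosely based on the cache-entry states discussed earlier):

```python
# States an entry can be in, and the legal transitions between them.
EMPTY, POPULATING, VALID = "empty", "populating", "valid"
LEGAL_TRANSITIONS = {
    (EMPTY, POPULATING),    # a miss starts a fill
    (POPULATING, VALID),    # the fill completes
    (POPULATING, EMPTY),    # the fill fails
    (VALID, EMPTY),         # eviction or expiry
}


class EntryStateMachine:
    """Self-checking sketch: every transition is validated against the
    declared set of legal transitions, so an illegal move fails loudly
    at the transition instead of corrupting state silently."""

    def __init__(self):
        self.state = EMPTY

    def transition(self, new_state: str) -> None:
        assert (self.state, new_state) in LEGAL_TRANSITIONS, (self.state, new_state)
        self.state = new_state
```

Keeping the transition table explicit is what makes "prove that all the safety invariants hold before and after every transition" tractable: there is a finite list to check.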

Advanced practitioners might use a model checker such as Spin, TLA+ or other alternatives. You can get very, very far with very basic engineering practices before you need to reach for model checkers (if ever), but if you can climb their learning curve, model checkers can simplify your world view a lot, and check it for you, too.
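The core idea behind those tools fits in a few lines: enumerate every reachable state and assert the invariant in each one. The toy explorer below is my own illustration of that idea, not how Spin or TLA+ actually work internally, and the model (two clients racing to fill one cache entry) is deliberately tiny:

```python
from typing import Any, Callable, Iterable


def check(initial: Any,
          transitions: list[Callable[[Any], Iterable[Any]]],
          invariant: Callable[[Any], bool]) -> set:
    """Tiny explicit-state exploration: visit every reachable state,
    asserting the invariant in each one."""
    seen = {initial}
    frontier = [initial]
    while frontier:
        state = frontier.pop()
        assert invariant(state), f"invariant violated in {state}"
        for step in transitions:
            for nxt in step(state):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen


# Illustrative model: one cache entry, any number of clients.
# State: (entry_state, fills_in_flight).
def start_fill(s):
    entry, fills = s
    if entry == "empty":            # a fill may only start on a miss
        yield ("populating", fills + 1)

def finish_fill(s):
    entry, fills = s
    if entry == "populating":
        yield ("valid", fills - 1)

def evict(s):
    entry, fills = s
    if entry == "valid":
        yield ("empty", fills)

# Safety property: at most one fill in flight at any time.
states = check(("empty", 0), [start_fill, finish_fill, evict],
               lambda s: s[1] <= 1)
```

Real model checkers add temporal properties, fairness, symmetry reduction and much more, but the payoff is the same: the tool enumerates the states you would otherwise have to imagine as "edge cases".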

But, whatever you do, please don't try to sweep under the rug parts of the core problem as "edge cases". When dealing with concurrency and distributed systems, that never goes well.
