
Duplication and other Heuristics
Pictured: two mugs that I made for a friend who really loves pigeons
Throwing pots
About a year ago, I started throwing pots for the very first time. Throwing pots is a complex skill, and I had to struggle quite a bit to get started.
One dimension of what makes pottery complex is that you need to master a number of different sub-skills. To throw my first set of pots, I needed to execute all of the following steps:
- Wedging: Working a spiral core into raw clay and dividing it into pieces to be thrown
- Coning / Mastering: Raising the clay into a tower and flattening it back out again, so as to further align the core of the clay into a spiral
- Opening up the clay / pulling up walls: Turning the lump of clay from a ball into a vessel like a mug or bowl
- Glazing: Covering the clay with pigment so that it looks nice
Each of these steps had its own set of intricacies to come to terms with, which made my first few sessions overwhelming - it was a struggle just to remember all the steps, and mid-step, it was really hard for me to tell if I was doing things correctly 1.
heuristic Irregular formation from Ancient Greek εὑρίσκω (heurískō, “I find, discover”) (compare the proper Greek term εὑρετικός (heuretikós)). https://en.wiktionary.org/wiki/heuristic
Heuristics
Enter Heuristics. From the Greek heurískō, meaning “I find” [^irregular_formation], a heuristic is a signal we use to measure something. Heuristics are practical yet imperfect, and so offer an approximate measurement.
Pottery studios catering to students often display a set of cutout forms that show what a piece of pottery on the wheel should look and feel like at each stage. This offers a heuristic, which allows the potter to answer the question “am I executing this stage correctly?”
Duplication
And this brings us to duplication. Duplication is the first heuristic I learned as a builder of software. As a heuristic, it’s quite seductive because of how easy it is to make sense of, and how easy it is to apply. Plus, deduplication had a pithy motto; one could declare “this code needs to be more DRY” and immediately leave the conversation, feeling very smug.
When I was a teenager studying for A’ levels, I thought of studying as a chore. Reading my books for an hour didn’t help me very much, but it left me with a strong feeling of accomplishment2. Deduplicating my code left me with the same feeling of accomplishment. The software I built wasn’t terribly improved by the deduplication, though. If anything, every round of DRYing out my code made the logic harder to follow and less clear.
Here are two examples of how attempts to reduce duplication can make your codebase worse (Thanks, Claude!):
- The Premature Shared Component - A developer notices two UI elements that look similar and creates a shared component to reduce duplication. However, over time, the requirements for these elements diverge, requiring extensive configuration options and conditional logic. The shared component becomes bloated with props and complexity, making it harder to understand and maintain than having two separate components would have been.
- The Over-generalized Utility Function - A developer spots similar data processing logic in multiple places and creates a generic utility function. However, each use case has subtle differences in requirements. The function grows to include numerous parameters and conditional branches to handle all cases, becoming more complex and brittle than the original duplication. When requirements change for one use case, developers must carefully modify the shared function without breaking other usages.
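To make the second example concrete, here is a minimal Python sketch. Every name in it (`format_record`, `format_invoice_line`, `format_audit_line`) is hypothetical and invented for illustration. It shows one flag-steered shared function next to the two small functions it replaced; both produce the same output, but only one of them is easy to read.

```python
from datetime import date


# The "deduplicated" version: one function serving two callers, steered by flags.
def format_record(name, amount, when, *, currency=None, uppercase=False,
                  include_date=True, date_style="iso"):
    label = name.upper() if uppercase else name
    value = f"{currency}{amount:.2f}" if currency else str(amount)
    parts = [label, value]
    if include_date:
        parts.append(when.isoformat() if date_style == "iso"
                     else when.strftime("%d/%m/%Y"))
    return " | ".join(parts)


# The "duplicated" version: two small functions that are free to diverge.
def format_invoice_line(name, amount, when):
    return f"{name} | ${amount:.2f} | {when.strftime('%d/%m/%Y')}"


def format_audit_line(name, amount):
    return f"{name.upper()} | {amount}"


when = date(2024, 1, 31)
invoice = format_record("Mug", 12.5, when, currency="$", date_style="dmy")
audit = format_record("Mug", 12.5, when, uppercase=True, include_date=False)
```

Each new requirement for one caller adds another keyword argument and another branch to `format_record`, while the two standalone functions can change without knowing about each other.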
Moving beyond duplication
It’s difficult to point my finger exactly to when and what changed my mind about duplication, but two key resources come to mind:
- I read Clean Code by Uncle Bob
- I found this HN post about this post on Sandi Metz’ blog
I was surprised when I read Clean Code’s chapter on Functions. It described a bunch of reasons to create functions, but it didn’t talk about duplication at all. This blew my mind - it wasn’t just that duplication wasn’t the primary reason to break logic out into functions; it turned out that while there were a bunch of reasons to write functions, duplication was so far down the list it didn’t even make it into the chapter. Wow. 3
Sandi Metz’ post talked about an arcane idea called “The right abstraction”. I found this appealing but hard to grasp: how could you tell if your logic was using the right abstraction? How could you know what the right abstraction was?
The Hacker News post introduced two new heuristics:
- Coupling: Coupling is the extent to which your implementation forces you to create and use certain modules together even though those modules should be useable independently and testable in isolation. If changing an implementation detail inside module A requires you to change a corresponding implementation detail in module B, then that indicates module A and B are strongly coupled. 45
- Global State: When a system has a lot of state, and the functioning of the system depends on the interplay of lots of stateful components, that makes the system harder to reason about. Therefore, making the system as stateless as possible makes it simpler and easier to work on.
My current set
All this brings me to my current set of heuristics, in order of priority:
- Keep things smol 6
- Is this logic using the right abstraction?
- Is it coupled too tightly to another part of the implementation?
- Is there too much global state in my system?
- Is there too much duplication?
Keep things Smol
If this point seems like it’s been pasted into this post after the fact, that’s because it was. This one actually popped into my head the day after I published this post!
At RailsConf in 2014, Sandi Metz said
“when people ask me now how to write object-oriented code, I give them one small piece of advice. I say make smaller things, that’s all it is. Make smaller classes, make smaller methods, and let them know as little about each other as possible. ” 7
It’s a fantastic point, and Sandi makes it well. If you haven’t seen the video, you should! I included a link in the footnotes.
This brings me to arborists. An arborist cares for and studies individual trees 8. I’ve lived my whole life in cities, so I’ve never had the opportunity to be responsible for a tree, but I find that being an Arborist is a useful metaphor for building software.
An arborist looks at a young tree and imagines the different ways it can grow out into a mature tree. Using a combination of experience and imagination, they then decide which branches to prune so that the tree grows into the right shape over time.
Building software is building trees of abstractions. Just like trees in an orchard, source trees grow over time. Good codebases / modules / abstractions are like well-nurtured trees: they are neither too wide nor too deep - instead, their breadth and depth are balanced, making them easy to comprehend and manipulate. In addition to a balanced breadth and depth, well-designed modules are organized in ways that are pleasant and predictable, just like well-cared-for trees. A well-cared-for tree (I imagine) will have a single main trunk. A wild tree might have several, twisting and snarling about each other in a way that’s difficult to follow and quite hard to pick from. Source code can grow in similar ways.
To me, this sense of nurturing abstractions so that they are balanced is a part of keeping things small. Software engineers exist to solve hard problems, and hard problems can rarely be solved by a single line of code. So we do need to manage complexity and keep things small - how do we solve hard problems and keep our solutions small? I think the answer is to think like an arborist: use your imagination and experience to envision the shape your source tree will grow into over time, and act in ways that keep its depth and breadth balanced and keep it looking pleasant and predictable.
The right abstraction
This is a difficult concept to describe. Like the advanced pottery techniques that rely on a keen eye for subtle movements in clay, my explanation must rest on the shoulders of your previous experience.
That said, here goes:
When we build software, we build mental models of entities that exist in the real-world, and then match them with abstractions in software. When our existing software systems don’t readily offer up abstractions that match the entities we want to model, we sometimes create new abstractions.
There is a tension between finding the best abstraction to model the real-world entity we seek to represent, and finding the most convenient abstraction to use given the layout of the codebase.
“Finding the right abstraction” is the act of navigating this tension. A codebase that hews too closely to the best abstraction for each individual entity being modeled will be chock full of abstractions, and the different abstractions being used will not work with each other very well. A codebase that hews too closely to the most convenient abstractions is also bad - in that scenario, the abstractions being used will be ineffective because they won’t represent the real-world entities they should represent closely enough to be useful.
There are certain abstractions that are quite good at solving specific problems. For example, SQL is an excellent abstraction for representing database queries, and HTML is an excellent abstraction for defining interfaces. While they have their warts, they are undoubtedly good at what they are designed to do. Therefore, whenever we introduce abstractions that distance our logic from these “golden” abstractions, we introduce room for issues.
Sometimes, in our enthusiasm to DRY our code base, we will introduce new abstractions purely to reduce duplication. In doing so we increase the distance between the things we are modeling and the golden abstractions, thus introducing room for error.
There are rare times when we can build on top of golden abstractions with equally golden, generalizable abstractions. React is an example of this - React does add distance between the logic we write and HTML / JS, but it is a better abstraction than raw HTML / JS for building interactive systems. Then again, React - quality abstractions are rare, and no one comes up with something that good every single week.
Going back to the Premature Shared Component example above, prematurely introducing an abstraction purely to reduce duplication between multiple UI components is a poor decision precisely because it introduces the wrong abstraction - it increases the distance between the logic you write and the golden abstraction.
Tight coupling to another part of the system
I once worked on a ticketing system with a microservices architecture. Each major part of the system, like the order processing system, the integrations engine, and the part of the system that dispatched emails, had its own repository and was executed in its own runtime environment.
The tricky part about this system was that the data model used to define the APIs for all these microservices was represented in a single repository - the model repository. All the microservices depended on this repository, and a change to the API entity model often required touching every microservice.
I didn’t know it at the time, but this was a textbook example of friction caused by tight coupling - by defining all API entities in a single repository, we coupled all the services together and thus made it very difficult for us to make changes to the system.
I like this example because it’s a case where the right solution introduces duplication: because all the microservices operate in the same business domain, they naturally share the same domain model - in Domain Driven Design terms, these three services all share the same ubiquitous language9. Therefore, in the interest of minimizing duplication, it makes sense to represent the model in a single place, define it once and have all services refer to it.
In practice, however, this causes a great deal of friction by making it quite difficult to modify different parts of the system in isolation. In a microservice environment, it is useful to cascade (i.e. progressively propagate) large changes across the system by changing one service at a time. But because the model repository approach couples all the services together, it complicates the process of making changes to specific models and releasing them across the system. As a result, my solution in similar situations is to have each service define its own model. This approach introduces duplication, but in return it makes the system much more flexible and easy to work in.
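Here is a rough Python sketch of what "each service defines its own model" might look like. The payload and the class names (`ProcessingOrder`, `EmailOrder`) are assumptions made up for this illustration; they are not taken from the ticketing system described above.

```python
from dataclasses import dataclass


# Hypothetical wire payload for an order event; field names are illustrative.
payload = {"order_id": "A-17", "email": "sam@example.com",
           "total_cents": 4200, "seat": "12C"}


# Order processing keeps only the fields it needs.
@dataclass(frozen=True)
class ProcessingOrder:
    order_id: str
    total_cents: int

    @classmethod
    def from_payload(cls, data):
        return cls(order_id=data["order_id"], total_cents=data["total_cents"])


# The email dispatcher defines its own, separate view of the same entity.
@dataclass(frozen=True)
class EmailOrder:
    order_id: str
    email: str

    @classmethod
    def from_payload(cls, data):
        return cls(order_id=data["order_id"], email=data["email"])


processing_order = ProcessingOrder.from_payload(payload)
email_order = EmailOrder.from_payload(payload)
```

The `order_id` field is declared twice, which is the duplication. In exchange, a new field such as `seat` can appear in the payload without forcing either service to change, and each service can adopt it on its own schedule.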
Too much global state
A well architected system is easy to reason about because it constrains the flow of data - architectural patterns often dictate what components of a system can communicate with each other, and in which direction the communication happens. By introducing these constraints, the architectural pattern makes the system simpler and easier to reason about.
Global state undermines this - global state changes happen across the entire system at once, and parts of the system can communicate with each other using global state in a chaotic and ungovernable way. As a result, adding in global state makes the system more difficult to reason about.
This is why modern engineers share the heuristic that “global variables are bad.”
This heuristic seems quite independent of duplication - I can’t think of a time when I’ve tried to reduce duplication by introducing global state. However, it’s a worthy member of my basic toolkit of heuristics.
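A small Python sketch of why global state is harder to reason about, using made-up names (`discount_percent`, `apply_promo`, `price_with_global`, `price_with_percent`) and integer cents to keep the arithmetic exact:

```python
# Global state: any function anywhere can change the discount, at any time.
discount_percent = 0


def apply_promo():
    global discount_percent
    discount_percent = 20


def price_with_global(amount_cents):
    # The result depends on a value that is invisible at the call site.
    return amount_cents * (100 - discount_percent) // 100


# Explicit state: everything that affects the result is in the signature.
def price_with_percent(amount_cents, percent):
    return amount_cents * (100 - percent) // 100


before = price_with_global(1000)
apply_promo()  # somewhere else in the system...
after = price_with_global(1000)
explicit = price_with_percent(1000, 20)
```

The two calls to `price_with_global(1000)` look identical but return different values, because something elsewhere mutated the global in between. The explicit version returns the same result for the same arguments every time.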
Duplication
At last, we have duplication. Unchecked duplication is genuinely quite bad for a codebase; I have seen places where the same abstraction is implemented multiple times in a codebase - often this happens with utility methods: for example, we might find a getHostURL() method duplicated in multiple places in the same codebase, where each instance is hidden in a different utils folder in a different branch of the source tree.
An engineer finds one getHostURL() definition, wraps up a big change that modifies this and moves on to the next task, all the while without noticing the second instance lurking elsewhere in the codebase.
Naturally, this leads to bugs.
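As an illustration, here is a hedged Python sketch of that situation. The two namespaces stand in for two hypothetical utils modules, and the host names are placeholders; none of this comes from a real codebase.

```python
from types import SimpleNamespace


# Stand-in for a utils module in one branch of the source tree.
def _billing_get_host_url(env):
    # Updated during the "big change": production now lives on a new host.
    hosts = {"prod": "https://app.example.com", "dev": "http://localhost:8000"}
    return hosts[env]


# Stand-in for a second utils module in a different branch.
def _reports_get_host_url(env):
    # The forgotten copy: still returns the old production host.
    hosts = {"prod": "https://www.example.com", "dev": "http://localhost:8000"}
    return hosts[env]


billing_utils = SimpleNamespace(get_host_url=_billing_get_host_url)
reports_utils = SimpleNamespace(get_host_url=_reports_get_host_url)

same_in_dev = billing_utils.get_host_url("dev") == reports_utils.get_host_url("dev")
same_in_prod = billing_utils.get_host_url("prod") == reports_utils.get_host_url("prod")
```

In this sketch the two copies still agree in development and only disagree in production, which is part of what makes this kind of bug easy to miss.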
Duplication is bad, and should be avoided. However, a certain amount of duplication should be tolerated in a codebase, and learning about the heuristics I prioritize over duplication helped me understand why.
In conclusion
I’ve been throwing pots for about a year now, and I’m getting better. I can throw consistently and my end products are things that I am happy with. I’ve grown much better fine motor control, and I’ve grown a deeper and more subtle theory of how and why it works.
I still reach for my basic set of heuristics, though - I still think through the stages as I mould clay. When a piece collapses, my reasoning is often constructed on the heuristics themselves: for instance, I say “ah, I didn’t center correctly, so there was a small wobble in the clay that got worse and it led to a wall that was overly thin, thus leading to collapse.”
I’ve been building software for nearly twenty years, and I’ve gotten paid for it for about thirteen of those years. When things go wrong, or when I build a codebase that doesn’t smell right, my reasoning as to where things fell apart still relies on the basic heuristics I laid out above.
So let me leave you with this - heuristics are important, and you should pay attention to yours. Interrogate them every so often, and be wary of them. Uninterrogated heuristics can be deadly and make you write worse code and feel smug about it.
Footnotes
-
As a sidenote, over time I found that the techniques the instructors used to introduce me to the various stages of pottery were often not the best techniques that were known and practiced. Instead, the techniques I was taught were the easiest techniques to teach. More effective techniques were radically difficult to communicate with words, and often relied on the ability to see subtle nuances of movement in clay and replicate them with very fine movements. The techniques that I was initially taught were only moderately effective when it came to making a finished product, but they were easier to describe with words, and required less subtle observation to duplicate, and relied on less fine motor control. This feels very similar to the approach used in a coding bootcamp. ↩
-
My A’ level results were fine but not terribly impressive. This is probably because I wasn’t studying effectively at all. ↩
-
C., Martin Robert. Clean Code: A Handbook of Agile Software Craftsmanship (Robert C. Martin Series) (p. 102). Pearson Education. Kindle Edition. ↩
-
A formal definition of coupling (along with several approaches to mathematically modelling coupling) and an introduction to its more arcane siblings Cohesion and Connascence 10 can be found in Richards, Mark; Ford, Neal. Fundamentals of Software Architecture: An Engineering Approach (p. 46). O’Reilly Media. Kindle Edition. It has a fantastic technical overview of these ideas, and the striking illustration of a parrot on the cover is a nice bonus. ↩
-
As a sidenote, poor tests can add coupling to a system. In particular, testing a system that uses mocks to simulate implementation details can cause a huge amount of coupling. Ever felt reluctant to make changes to a system purely because you dread changing all the tests involved? That’s a symptom that you’ve fallen into this trap. Using mocks to simulate dependencies that are well-separated from your implementation (typically via dependency injection) can make life easier - this leaves you reasonably free to change the implementation of each module in your system without needing to make major changes to tests. ↩
-
i.e: small ↩
-
“RailsConf 2014 - All the Little Things by Sandi Metz” posted May 21, 2014, by Confreaks, YouTube, 0 min., 55 sec., https://youtu.be/8bZh5LMaSmE?si=58_6k26KfUUwfSdP ↩
-
Eric Evans. Domain-Driven Design: Tackling Complexity in the Heart of Software (p. 24). Addison-Wesley Professional. 1st Edition ↩
-
Yes - the definition of coupling I offered above is technically a definition of Connascence. Technically correct is the best kind of correct, and so, by making it to this footnote, you have earned a cookie. ↩