- A short story about chaos
- A few facts about people and the knowledge
- Crazy bus driver
- Increasing bus factor
A short story about chaos
Mark is a software engineer at a tech company that helps people order solar panels. He has been there for a year. He developed the account management system for users and integrated the payment gateways that let them pay for their orders. He previously worked as a software engineer at a fin-tech startup, so he knows a bit about integrating payment gateways. He is satisfied with the code he has written here.
Mark also has a teammate, Tim. Tim is responsible for the core of the product. He spent hours talking to civil engineers, trying to understand the rules of photovoltaics: how you can place solar cells on a roof and what the constraints are, including wind zones and technical standards such as ASCE 7-16.
It’s a beautiful morning at the end of April. Tim tells Mark he has accepted a position as a tech lead at a pretty famous food delivery startup. Good salary and interesting technical challenges to solve. He starts at the beginning of June, and before that he’s going to visit Copenhagen and Stockholm. He still has 15 unused vacation days.
Joshua, their product manager, presented the product roadmap a couple of hours earlier. The company has great ideas for how to dominate the market and overtake the competition. One of them involves improving the solar cells editor. After he finds out about Tim, he asks Mark to get as much knowledge from Tim as he can, because it may be useful when developing the new solution. They have new engineers in the recruitment pipeline, but it takes time to sign the contracts.
And here it starts. Mark spends 5 days with Tim trying to understand everything about the solution Tim has developed over a year. And then Tim is gone. Mark gets his first story to implement. Immediately he gets lost in all the constraints and validation rules, and confused by the terms used by civil engineers. One new feature equals five new bugs. He drowns in quick fixes and fires to put out. He’s frustrated and tired. The code works, but he doesn’t really understand how. New developers come in and take over some of these responsibilities. They in turn start to introduce bug after bug. After a couple of months the app works and the demo goes well, but Mark understands that the solution is fragile as hell. He tells Joshua that he needs to clean up the mess. Joshua understands, but asks if they could first add one more thing, because they have a great business opportunity and it would be a shame to pass it up. “Sure, no problem, but we need to fix it eventually.” And it takes longer than initially expected. Mark is frustrated. In the meantime, he gets a job offer. A good one. He quits. The code is left as is, in a state of utter mess.
The story shows a company that seems somewhat irresponsible. But the company is successful. It dominates the market and the charts go up year after year. And this is not an uncommon scenario. Does it always have to be like that? Do we have to work in chaos?
Let’s look at a few facts:
A few facts about people and the knowledge
People come and go
People change jobs for various reasons: money, satisfaction, personal circumstances, a move to a different country, or the wish to try something new. An organization can’t (and shouldn’t) control that. It may try to keep people, but it will not always succeed.
Organization’s knowledge accumulates
A company achieves success because many people work together. Everyone brings some new knowledge. After they leave, the knowledge they brought stays in the company, and new people come in. Over time the knowledge accumulates: people make mistakes, fix them, and learn from them.
The domain is unclear at the beginning
It’s not uncommon to make wrong assumptions about topics we don’t know yet. As young people we don’t know what we want to study, where we want to work, because we simply know too little about different options. The same happens in business. You start a company because you make a bet that your idea is worth someone else’s money. But it doesn’t have to be true. You have to verify the hypothesis and you learn new things in the process.
Think about Netflix, which was founded in 1997. Its current core business, video streaming, wasn’t even a thing back then, so the founders could not have known from the beginning that this was exactly what they wanted to do.
People have technical skills but they don’t have the company specific skills
Team Topologies (Skelton, Pais, 2019) talks about three kinds of cognitive load:
- Intrinsic cognitive load – relates to aspects of the task fundamental to the problem space (e.g., “What is the structure of a Java class?”)
- Extraneous cognitive load – relates to the environment in which the task is being done (e.g., “How do I set up a testing environment?”, “How do I deploy a service in our system?”)
- Germane cognitive load – relates to aspects of the task that need special attention for learning or high performance (e.g., “How do I determine a wind zone in a solar panel system?”)
Of these three, we can make sure we hire people who can deal with the first one: we can verify that candidates know Java if we know the system is written in Java.
The other two can only be mitigated. For example, a candidate may have proven experience in setting up test environments, so you, as a hiring manager, can assume that the person will learn how to do it in your company as well. The same goes for germane cognitive load: if you are a fin-tech company, you can hire people with prior experience in the finance industry. But there’s no way you will find a candidate who knows from day one how to deploy services in your system and knows everything about the domain the company focuses on.
The faster team members understand what they are doing, the better for the organization
When you hire a new engineer, you cannot assume that they will deliver results from day one. They have to understand the ecosystem they are in: the technology stack, the company values and mission, the team structure and dynamics, the domain they are going to work with, and the workflow. How long this takes depends on the size of the organization, the complexity of the topics, and the person’s motivation and capacity to digest new information.
And while a new joiner doesn’t know what to do, the company loses money. So from an economic point of view it’s obvious that companies want to shorten this period to a minimum.
Crazy bus driver
There is a term in risk management: the bus factor. It boils down to the number of people who would have to be hit by a bus before the project stalls due to a lack of competency. Increasing the bus factor could be a cure for the problems described in the story of Mark and Tim. I would like to propose a couple of ideas that I believe help increase the bus factor within an organization.
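To make the definition concrete, here is a minimal sketch in Python. It assumes you can list, for each area of the system, the people who are able to work on it on their own. The `bus_factor` and `riskiest_areas` helpers, the area names, and the sample data based on the story of Mark and Tim are illustrative assumptions, not a standard formula.

```python
def bus_factor(knowledge: dict[str, set[str]]) -> int:
    """Smallest number of people whose departure leaves some area uncovered.

    `knowledge` maps an area of the system to the people who can work on it
    without help. Losing everyone from the thinnest area stalls the project,
    so the bus factor is the size of that smallest group.
    """
    if not knowledge:
        raise ValueError("at least one area is required")
    return min(len(people) for people in knowledge.values())


def riskiest_areas(knowledge: dict[str, set[str]]) -> list[str]:
    """Areas covered by exactly `bus_factor` people, sorted by name."""
    factor = bus_factor(knowledge)
    return sorted(area for area, people in knowledge.items() if len(people) == factor)


# Hypothetical snapshot of the team from the story, before Tim leaves.
before = {
    "account management": {"Mark"},
    "payment gateways": {"Mark"},
    "solar cells editor": {"Tim"},
}

# The same team if both engineers could work on every area.
after = {
    "account management": {"Mark", "Tim"},
    "payment gateways": {"Mark", "Tim"},
    "solar cells editor": {"Mark", "Tim"},
}

print(bus_factor(before), riskiest_areas(before))
print(bus_factor(after))
```

In the first snapshot every area depends on a single person, so the bus factor is 1. In the second snapshot it is 2, because no area is lost when one person leaves.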
Increasing bus factor
Books are an amazing invention of humanity. They let us accumulate knowledge across generations. They have two important properties:
- they are persistent – facts written in a book can be read for as long as people have access to the book. It’s like a time machine: you can get into Marcus Aurelius’s head even though he lived almost 2000 years ago.
- they are reproducible – facts written in a book can be read by any number of people, as long as they have access to the book.
Documentation builds on the success of books and allows knowledge to spread. For example, one software engineer might have spent weeks trying to understand complex accounting concepts in order to design a system that automates the process. Afterwards he could write it all down in a language that is easy to understand for other software engineers who have no prior experience in accounting. And we can take advantage of both properties of books here: you write it once, and any number of software engineers who join the project can read the document and quickly get up to speed with the core ideas and terminology.
Tips for writing documentation
Here is some advice that has proven useful to me when writing docs in my teams.
- Support text with images and diagrams that show the flow (be it the user flow, the request flow, and so on). There are open source tools like Mermaid that let you define sequence diagrams as code. It’s good to keep the diagram source together with the rendered images, because over time you may need to make small changes to the diagram itself.
- Split the high-level docs from the implementation details. Think about the pace of change for each of them. Non-technical aspects and general concepts tend to change more slowly than the concrete technical solution or screenshots of the app. Different document types have different goals. A high-level overview should explain why the thing is needed in the first place, provide a general glossary, and list the business use cases it is supposed to cover. An API reference or implementation notes, on the other hand, should focus on details and edge cases, not the general approach.
- Write the docs right after you finish implementing a solution. Don’t postpone it; treat it as an acceptance criterion of the story. Right after you finish is the moment when your memory of how the solution works is freshest and you are still in the context.
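As an illustration of the first tip, the sketch below builds the source of a Mermaid sequence diagram from a list of steps, so the diagram can live next to the code and change with it. The `sequence_diagram` helper, the participant names, and the payment flow are made up for this example; only the `sequenceDiagram` keyword and the `->>` arrow are Mermaid syntax.

```python
def sequence_diagram(steps: list[tuple[str, str, str]]) -> str:
    """Render (sender, receiver, message) steps as Mermaid sequence diagram source."""
    lines = ["sequenceDiagram"]
    for sender, receiver, message in steps:
        lines.append(f"    {sender}->>{receiver}: {message}")
    return "\n".join(lines)


# Hypothetical payment flow. Participant names are kept free of spaces
# so they can be used directly as Mermaid identifiers.
payment_flow = [
    ("User", "Shop", "Place order"),
    ("Shop", "PaymentGateway", "Request payment"),
    ("PaymentGateway", "Shop", "Payment confirmed"),
    ("Shop", "User", "Show order summary"),
]

print(sequence_diagram(payment_flow))
```

The printed text can be pasted into any tool that renders Mermaid, and a change in the flow becomes a one-line change in the list of steps.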
👉 There’s also a post Types of software documentation that can help you decide on the best document type to keep the documentation up-to-date.
Some topics are better shown in the form of a video. Think about setting up the work environment, custom IDE plugins, or the end-to-end flow of an implemented solution. Recording a 5-minute video where you go step by step, click through the UI, and talk about what you’re doing is much quicker than spending 2 hours writing it all down. Videos have two main benefits. First, as with written documentation, you record it once and people can watch it many times. Second, a video captures implicit details. If you were to describe the IDE setup in writing, you would go step by step, window by window, describing what you click and what you select. When watching a video, you see the full context that a writer could simply forget to mention, e.g. “oh, you also have env variables set up in the IDE, ok, maybe that’s needed too”, or “ok, now I see, you’re showing the user flow in the staging environment, that’s why I cannot see this and that option.”
I can’t count how many times I have heard software engineers ask questions like “why is it done like this?” or “why was this decision made?”. A solution for that could be an ADR (Architecture Decision Record). It’s essentially a structured document template that lets you describe the rationale for the solution, the thought process, and the trade-offs considered. The important thing is that each record should be anchored to a specific moment in time, so it’s easier to understand which decision follows which.
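A decision record does not need special tooling. The sketch below models one as a small Python data class and renders it as plain text. The section names follow a commonly used ADR layout (status, context, decision, consequences); the `DecisionRecord` class, the numbering scheme, and the sample decision with its date are assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DecisionRecord:
    number: int
    title: str
    decided_on: date  # anchors the decision to a specific moment in time
    status: str
    context: str
    decision: str
    consequences: str

    def render(self) -> str:
        """Return the record as plain text, one titled section per field."""
        header = [
            f"ADR-{self.number:04d}: {self.title}",
            f"Date: {self.decided_on.isoformat()}",
            f"Status: {self.status}",
        ]
        sections = [
            ("Context", self.context),
            ("Decision", self.decision),
            ("Consequences", self.consequences),
        ]
        body = [f"\n{name}\n{text}" for name, text in sections]
        return "\n".join(header + body)


# Hypothetical record; the content is invented for the example.
record = DecisionRecord(
    number=7,
    title="Validate wind zones on the server",
    decided_on=date(2024, 4, 29),
    status="Accepted",
    context="Validation rules changed twice and the editor drifted out of sync.",
    decision="The backend is the single source of truth for wind zone rules.",
    consequences="The editor needs a network call before it can confirm a layout.",
)

print(record.render())
```

Keeping the date and a sequential number in every record is what makes it possible to read the decisions in the order they were made.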
When you have a growing team, you may decide to split it into multiple teams. Sometimes the boundaries of the teams don’t follow the boundaries of the system. (According to Conway’s law they eventually will, but it happens over time and may cause pushback at the beginning – Team Topologies, 2019.) So you may end up in an interim phase where people from team A are specialists in team B’s domain. A knowledge matrix can help you identify who knows what best and spot the gaps.
👉 In case you would like to know more about Knowledge Matrix or see examples you can check a post about this topic:
Team Knowledge Matrix
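A knowledge matrix can be as simple as a table of people, topics, and self-assessed levels. The sketch below is one possible way to find the gaps in such a table. The 0 to 3 scale, the thresholds, the `find_gaps` helper, and the sample data (including the new engineer Anna) are assumptions for illustration only.

```python
# Hypothetical matrix: person -> topic -> level (0 = none, 3 = expert).
matrix = {
    "Mark": {"payment gateways": 3, "account management": 3, "solar cells editor": 1},
    "Tim": {"payment gateways": 0, "account management": 1, "solar cells editor": 3},
    "Anna": {"payment gateways": 2, "account management": 1, "solar cells editor": 0},
}


def find_gaps(
    matrix: dict[str, dict[str, int]],
    min_level: int = 2,
    min_people: int = 2,
) -> dict[str, list[str]]:
    """Return topics known well by fewer than `min_people` people.

    The value for each topic lists the people who do reach `min_level`,
    which shows who could run a workshop on it.
    """
    topics = sorted({topic for skills in matrix.values() for topic in skills})
    gaps = {}
    for topic in topics:
        experts = sorted(
            person
            for person, skills in matrix.items()
            if skills.get(topic, 0) >= min_level
        )
        if len(experts) < min_people:
            gaps[topic] = experts
    return gaps


print(find_gaps(matrix))
```

With this sample data, account management and the solar cells editor each depend on a single person, while payment gateways are covered by two people.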
After you have identified the gaps, you can schedule some time to reduce them. Organize a workshop on a given topic and announce it widely, so that everyone interested can join. Schedule it in advance so people can prepare questions.
Two people working on the same thing may seem counterintuitive. Why should I pay two people to do the job of one? The benefits are real, but they are not immediately visible. During pair programming, ideas are challenged right away. An engineer doesn’t have to wait for a code review after the pull request is opened. This is the idea of “shifting left”: you detect problems as early as possible (Software Engineering at Google, 2020). Initially this may slow down progress, but contrived and overcomplicated code may slow people down even more in the future.
Another benefit is that two people get to know the solution right away. Otherwise you would spend time developing a solution and then spend more time explaining it to the other developers who will have to work with it later on. With pairing, this happens more or less automatically.
An organization may be successful, and it may seem that everything is ok until the first person with all the project knowledge in the world decides to leave (think Brent, if you’ve read The Phoenix Project, 2013). After that, chaos starts to creep in faster and faster. But if you’re smart enough, you act proactively and mitigate that. You can try to increase the bus factor.
You can achieve it by writing good documentation that focuses on the right scope and abstraction level. You can also record videos that capture many tiny details without you explicitly talking about them. Writing down decision records will help you understand the thought process behind the bigger decisions. A knowledge matrix will help you identify the biggest gaps in knowledge sharing, and dedicated workshops will help you reduce these gaps. And last but not least, pair programming will act as a tool for challenging ideas and spreading knowledge between the engineers working on the code together.
- Pais, M., Skelton, M., Team Topologies (IT Revolution Press, 2019), chapter 3.
- Behr, K., Kim, G., Spafford, G., The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win (IT Revolution Press, 2013).
- Manshreck, T., Winters, T., Wright, H., Software Engineering at Google (O’Reilly Media, Inc., 2020), chapter 2.
Note on ASCE 7-16: it doesn’t really matter what this is. The idea was to name something very specific to the industry.