Incident Report Writing: What to Do If Things Go Off the Rails in Production
- Sep 12, 2018
Let’s face it: any application can go off the rails.
The world’s largest technology companies such as Google and Amazon pour billions of dollars into their hardware and software, and even they aren’t immune to incidents. Even if a software development company employs passionate and experienced developers and DevOps engineers, there’s still a chance that their software products will run into trouble.
Needless to say, a software development team should fix everything as soon as possible. But there’s one more important thing to do: write an incident report. This step might seem pointless, but it plays a key role in incident management.
In this article, we tell you why incident reports matter and how we write them at RubyGarage.
What is an incident report in software development?
A software development incident report documents an event that has fully or partially disrupted the normal operation of a live production server. In other words, this report describes a serious incident that has had a significant negative impact on an application. It doesn’t matter whether the incident has hit hardware or software, as in both cases the application can’t function normally. Such situations require urgent action to bring the application back into operation.
If this definition sounds vague, take a look at several situations in which writing an incident report is a must:
- Downtime − a production server (and an application) hasn’t been available for a certain period of time
- Loss of data − some information in an application’s database has been fully or partially lost or damaged
- Security breach − some confidential information (such as login credentials or credit card numbers) has been exposed
- Bulk mailing − an application has sent bulk email messages by mistake
- Interruption of normal operation − some critical functionality (such as log-in or sign-up) hasn’t been working
Why incident reports matter
Most software development teams have faced serious incidents and successfully overcome them. This is completely normal, as nothing is perfect in this world. But team members might wonder if there’s any point in writing lengthy documentation on an incident that they’ve already handled.
Yet incident report writing has nothing to do with producing documentation no one ever reads. When it comes to incidents in production, software development teams need to think big and consider the people impacted by an incident and the negative consequences it brings:
- End users are the first who suffer from any incident, as they can’t use an application and their personal information might be at risk.
- Business owners lose money and reputation if their application goes offline or malfunctions.
- The software development team bears responsibility for technical incidents, so team members need to properly handle them in the shortest time possible.
It goes without saying that incidents are unpleasant and upsetting for everyone: end users, businesses, and software development teams. To ensure customer satisfaction, minimize losses, and build trust, software development teams need to not only handle incidents quickly and professionally but also carry out a comprehensive analysis of incidents and establish effective communication with business owners. Incident report writing helps development teams achieve all of these goals.
Benefits of incident report writing for software development teams
Now that you understand why incident reporting matters, let’s go over the advantages it provides to software development teams. At RubyGarage, we see four main benefits of incident documentation:
Provides a comprehensive description of an incident to a business owner
First and foremost, a software development incident report is a comprehensive explanation of an incident to a business owner. It helps software development teams provide full details, specifying the timeline of an incident, its cause, and the measures the team has taken. Moreover, an incident report allows development teams to prove their technical expertise and trustworthiness.
Incident documentation isn’t a mere explanation. It’s also an important source of information for a business owner who may use it to instruct their customer support team or to prepare a statement for customers.
Helps a software development team analyze an incident
Having documented an incident, a software development team can thoroughly analyze it later in order to find out what caused the incident, which of the team’s measures were effective and which weren’t, and so on. This analysis helps reveal flaws in the software system as well as in the team’s actions or skills so that they can make improvements and prevent similar incidents in future. Though any incident is unpleasant, it’s best for teams to treat incidents not as failures but as opportunities for improvement.
For example, an analysis may show that the team found out about an incident too late, which indicates that the application needs an error tracking tool (such as Rollbar). Or if production servers experienced downtime without the team knowing, it’s a clear sign that the application needs a proper availability monitoring tool (such as Pingdom).
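To make the idea of finding out about an incident in time more concrete, here is a minimal sketch of the kind of check an error tracking tool performs when it flags a spike. It is not how Rollbar or Pingdom work internally; the function name `is_error_spike` and the `factor` and `min_errors` thresholds are hypothetical and chosen only for illustration.

```python
from statistics import mean
from typing import Sequence


def is_error_spike(errors_per_minute: Sequence[int],
                   factor: float = 5.0,
                   min_errors: int = 10) -> bool:
    """Return True if the latest minute is far above the recent baseline.

    `errors_per_minute` holds error counts for consecutive minutes,
    oldest first. The thresholds are illustrative assumptions.
    """
    if len(errors_per_minute) < 2:
        return False  # not enough history to compare against
    *history, latest = errors_per_minute
    baseline = mean(history)
    # Require both an absolute minimum and a relative jump, so that
    # going from 1 error to 3 errors does not trigger an alert.
    return latest >= min_errors and latest > factor * max(baseline, 1.0)


quiet = [2, 3, 1, 2, 4]
spike = [2, 3, 1, 2, 40]
print(is_error_spike(quiet))  # False
print(is_error_spike(spike))  # True
```

In practice a hosted tool is usually the better choice, since it also handles alert delivery and keeps working when the application itself is down.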
Allows team members to share experience
Often, not all members of a software development team are present or participate in fixing an incident. Some of them may have had a day off or been on sick leave, and incident documentation helps teams keep all teammates fully informed about every incident. At RubyGarage, we encourage sharing incident reports among all our developers, not only those directly involved in the project.
Fosters continuous improvement
For software development companies, incident documentation helps not only with analyzing each incident separately but also with accumulating experiences and making continuous improvements to systems, processes, and workflows.
For example, if several development teams within a company face similar incidents, this means something is wrong with the company’s development flow or processes.
The structure of an incident report
There are lots of different incident report templates; some of them are really brief, while others are detailed and take a lot of time to produce. At RubyGarage, we use a simple and intuitive structure for incident reports. It includes the following five sections:
Summary
This is a brief and to-the-point description of the incident that contains general information about it, including the reasons for the incident, the time when it happened, the time it was resolved, and its consequences.
Timeline
This section provides exact times of all events related to the incident, including the time the incident started, the times at which actions were taken to resolve it, and ultimately the time when it was resolved.
Root cause
This is a detailed description of what caused the incident. It’s important to provide as many details as possible.
Resolution and recovery
This is a detailed description of all actions taken by the software development team to resolve the incident, along with the time each action was taken and its result. It covers every action, even those that turned out to be incorrect or ineffective.
Corrective and preventive measures
In this section, software development teams need to list all measures that should be taken to prevent similar incidents in future. These measures may include any necessary improvements to the project, development procedures, DevOps flow, etc.
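The five-section structure above can also be captured as a small data structure, which makes it easy to check that a report is complete and to derive figures such as time to resolution from the timeline. The following is a minimal sketch, not a tool we describe in this article: the class and field names are hypothetical, the labels used for the first three sections are our own shorthand, and the year in the example timestamps is an assumption. The example data comes from the sample report later in this article.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class TimelineEntry:
    at: datetime       # when the event happened
    description: str   # what happened or what the team did


@dataclass
class IncidentReport:
    title: str
    summary: str = ""
    root_cause: str = ""
    timeline: List[TimelineEntry] = field(default_factory=list)
    resolution_steps: List[str] = field(default_factory=list)
    preventive_measures: List[str] = field(default_factory=list)

    def time_to_resolve(self) -> timedelta:
        """Time between the earliest and the latest timeline entry."""
        if not self.timeline:
            raise ValueError("timeline is empty")
        times = [entry.at for entry in self.timeline]
        return max(times) - min(times)

    def missing_sections(self) -> List[str]:
        """Names of the sections that have not been filled in yet."""
        sections = {
            "Summary": self.summary,
            "Timeline": self.timeline,
            "Root cause": self.root_cause,
            "Resolution and recovery": self.resolution_steps,
            "Corrective and preventive measures": self.preventive_measures,
        }
        return [name for name, content in sections.items() if not content]


# Example data taken from the sample report; the year is an assumption.
report = IncidentReport(
    title="Database failure incident report",
    summary="A faulty script removed products from the production database.",
    root_cause="An incorrectly written script run during a planned update.",
    timeline=[
        TimelineEntry(datetime(2018, 6, 16, 21, 24), "Spike in errors detected"),
        TimelineEntry(datetime(2018, 6, 16, 21, 28), "Maintenance mode enabled"),
        TimelineEntry(datetime(2018, 6, 16, 22, 23), "Database fully restored"),
    ],
    resolution_steps=["Restored the production database"],
)

print(report.time_to_resolve())   # 0:59:00
print(report.missing_sections())  # ['Corrective and preventive measures']
```

Keeping the timeline as structured entries rather than free text means durations are computed rather than typed by hand, which removes one source of inconsistencies between the summary and the timeline.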
RubyGarage incident report writing tips
Knowing the structure of a typical incident report is like reading a recipe from a cookbook − you may know what ingredients to add, but this doesn’t mean your dish will be delicious. There’s always something more than just the theory. Here are some helpful tips on incident documentation that we follow at RubyGarage:
#1 Write an incident report after handling an incident
It’s best to write a report only after resolving the incident. Writing it in the course of the incident makes no sense, as development teams need to act quickly and figure out how to resolve it. At this stage, teams can simply brief a business owner about the incident and keep them updated about the progress.
#2 No finger pointing
Write incident reports on behalf of the whole software development team. There’s no need to specify who exactly is to blame for an incident. Teams write incident reports not to point out and punish culprits but to carefully analyze the problem, fix it, and prevent it from happening again. Moreover, incidents often happen due to poor systems or processes, and development teams should focus on finding flaws in them rather than blaming team members.
#3 Don’t exaggerate or conceal anything
Development teams shouldn’t exaggerate or conceal anything in their incident reports. All people make mistakes, and that’s why reports exist − they help others avoid falling into the same hole. Moreover, not telling the truth in incident documentation undermines a business owner’s trust in a team.
RubyGarage incident report sample
We’ve examined incident reports from A to Z and there’s just one thing left: an example. To show you how we write incident reports at RubyGarage, we’ve composed a brief sample based on the structure we’ve described. This incident report sample was made up specifically for this guide and isn’t connected to any real project.
Database failure incident report
Saturday 16 June
Summary
At 9:24 PM Pacific Time (PT), the error tracking tool (Rollbar) that we use on the project reported an abrupt spike in errors on the live production server. The issue affected critical functionality of the website, so at 9:28 PM our team switched the website to maintenance mode, making it unavailable to end users. The root cause of the incident turned out to be a database failure caused by an incorrectly written script. The issue was fully resolved by 10:23 PM, and the website was fully functional again.
Timeline
- 9:23 PM PT − The update of the production database was finished.
- 9:24 PM PT − Our error tracking tool detected an abrupt spike in errors.
- 9:26 PM PT − Our team started checking logs to find out what had caused the incident.
- 9:28 PM PT − We moved the website into maintenance mode.
- 9:37 PM PT − We initiated the task that would restore the production database.
- 9:43 PM PT − We optimized Sidekiq workers to speed up database restoration.
- 9:55 PM PT − We switched off maintenance mode; at the time, 80% of all products were listed and 20% were still missing.
- 10:23 PM PT − The production database was fully restored and the incident was resolved.
Root cause
During a planned update to the production database, a newly added script that worked with the database structure turned out to have been written incorrectly. As a result, the script removed a large number of products from the item_datafeed table, which contains the URLs that lead to retailers’ online stores.
Resolution and recovery
At 9:24 PM PT our team received lots of alerts from our error tracking software, and we immediately started investigating the issue.
At 9:26 PM PT, we started to check logs and quickly found out that the issue was caused by a production database failure. To minimize customer dissatisfaction, at 9:28 PM PT we decided to move the website into maintenance mode. Afterward, our team initiated the task to restore the production database, and we also optimized Sidekiq workers to speed up database restoration.
When 80% of products in the database had been restored (at 9:55 PM PT), our team decided to switch off maintenance mode and make the website available again. At 10:23 PM PT, all products were listed on the website and available to customers.
Corrective and preventive measures
Having analyzed the incident in-depth, our team came up with the following measures for preventing similar incidents in future:
- Although we already rely heavily on tests in our development workflow, we will also test all changes related to database updates on the staging server before rolling them out on the production server.
- We will back up all databases used on the project on a regular basis so that we can roll them back easily in case of incidents. Apart from automated scheduled backups, our team will introduce database backups before all major changes.
- We will increase test coverage of the application to ensure that all critical parts of functionality are 100% covered with automated tests.
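One more safeguard that fits this kind of incident is running a risky database script inside a transaction and rolling it back if it removes far more rows than expected. The sketch below illustrates the idea only and is not part of the sample report. It uses Python with an in-memory SQLite database so that it is self-contained, whereas a real project would use its own database and migration tooling. The `run_guarded` helper, the `TooManyRowsRemoved` exception, the `url` column, and the 5% threshold are all hypothetical.

```python
import sqlite3
from typing import Callable, Tuple


class TooManyRowsRemoved(Exception):
    """Raised when a change shrinks a table more than allowed."""


def run_guarded(conn: sqlite3.Connection,
                table: str,
                change: Callable[[sqlite3.Connection], object],
                max_loss_ratio: float = 0.05) -> Tuple[int, int]:
    """Apply `change`, rolling back if `table` loses too many rows.

    `table` must be a trusted identifier, since it is interpolated
    into the SQL text. Returns (rows_before, rows_after) on success.
    """
    count_sql = f'SELECT COUNT(*) FROM "{table}"'
    before = conn.execute(count_sql).fetchone()[0]
    try:
        change(conn)  # the DELETE/UPDATE opens a transaction implicitly
        after = conn.execute(count_sql).fetchone()[0]
        if before and (before - after) / before > max_loss_ratio:
            raise TooManyRowsRemoved(
                f"{table}: {before - after} of {before} rows would be removed"
            )
        conn.commit()
    except Exception:
        conn.rollback()  # undo everything the change did
        raise
    return before, after


# Demo setup: 100 products in a stand-in for the item_datafeed table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item_datafeed (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO item_datafeed (url) VALUES (?)",
    [(f"https://example.com/{i}",) for i in range(100)],
)
conn.commit()


def faulty_change(c: sqlite3.Connection) -> None:
    # A deliberately wrong script that would delete 80 of 100 rows.
    c.execute("DELETE FROM item_datafeed WHERE id > 20")


try:
    run_guarded(conn, "item_datafeed", faulty_change)
except TooManyRowsRemoved as exc:
    print(f"Rolled back: {exc}")

remaining = conn.execute("SELECT COUNT(*) FROM item_datafeed").fetchone()[0]
print(remaining)  # 100
```

A check like this complements staging tests and backups rather than replacing them: it only catches the specific failure it looks for, in this case unexpected row loss in one table.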
Incident reporting might seem unpleasant, but it’s an important part of the software development process. In this article, we’ve shown you how we handle incident documentation at RubyGarage. If you want to know more about the development practices we follow, subscribe to our blog.