Incident Report Writing: What to Do If Things Go Off the Rails in Production

Sep 12, 2018

Gleb B.

Copywriter

Vlad V.

Chief Executive Officer

Tags:

Trends

Let’s face it: any application can go off the rails.

The world’s largest technology companies such as Google and Amazon pour billions of dollars into their hardware and software, and even they aren’t immune to incidents. Even if a software development company employs passionate and experienced developers and DevOps engineers, there’s still a chance that their software products will run into trouble.

Needless to say, a software development team should fix everything as soon as possible. But there’s one more important thing to do: write an incident report. This step might seem pointless, but it plays a key role in incident management.

In this article, we tell you why incident reports matter and how we write them at RubyGarage.

What is an incident report in software development?

A software development incident report documents an event that has fully or partially disrupted the normal operation of a live production server. In other words, this report describes a serious incident that has had a significant negative impact on an application. It doesn’t matter whether the incident hit hardware or software, as in both cases the application can’t function normally and the team has to act urgently to bring it back into operation.

If this definition sounds vague, take a look at several situations in which writing an incident report is a must:

Downtime − a production server (and an application) hasn’t been available for a certain period of time
Loss of data − some information in an application’s database has been fully or partially lost or damaged
Security breach − some confidential information (such as login credentials or credit card numbers) has been exposed
Bulk mailing − an application has sent bulk email messages by mistake
Interruption of normal operation − some critical functionality (such as log-in or sign-up) hasn’t been working

Why incident reports matter

Any software development team has probably faced serious incidents and successfully overcome them. This is completely normal, as nothing is perfect in this world. But team members might be wondering if there’s any point in writing lengthy documentation on an incident that they’ve already handled.

Yet incident report writing has nothing to do with producing documentation no one ever reads. When it comes to incidents in production, software development teams need to think big and consider the people impacted by an incident and the negative consequences it brings:

End users are the first to suffer from any incident, as they can’t use an application and their personal information might be at risk.
Business owners lose money and reputation if their application goes offline or malfunctions.
The software development team bears responsibility for technical incidents, so team members need to properly handle them in the shortest time possible.

It goes without saying that incidents are unpleasant and upsetting for everyone: end users, businesses, and software development teams. To ensure customer satisfaction, minimize losses, and build trust, software development teams need to not only handle incidents quickly and professionally but also carry out a comprehensive analysis of incidents and establish effective communication with business owners. Incident report writing helps development teams achieve all of these goals.

Benefits of incident report writing for software development teams

Now that you understand why incident reporting matters, let’s go over the advantages it provides to software development teams. At RubyGarage, we see four main benefits of incident documentation:

Provides a comprehensive description of an incident to a business owner
First and foremost, a software development incident report is a comprehensive explanation of an incident to a business owner. It helps software development teams provide full details, specifying the timeline of an incident, its cause, and the measures the team has taken. Moreover, an incident report allows development teams to prove their technical expertise and trustworthiness.

Incident documentation isn’t a mere explanation. It’s also an important source of information for a business owner who may use it to instruct their customer support team or to prepare a statement for customers.

Helps a software development team analyze an incident
Having documented an incident, a software development team can thoroughly analyze it later in order to find out what caused the incident, which of the team’s measures were effective and which weren’t, and so on. This analysis helps reveal flaws in the software system as well as in the team’s actions or skills so that they can make improvements and prevent similar incidents in future. Though any incident is unpleasant, it’s best for teams to treat incidents not as failures but as opportunities for improvement.

For example, an analysis may show that the team found out about an incident too late, which indicates that an application requires an error tracking tool (such as Rollbar). Or if production servers experienced downtime without your team’s knowing, it’s a clear sign that an application needs a proper availability monitoring tool (such as Pingdom).

Allows team members to share experience
Often, not all members of a software development team are present or participate in fixing an incident. Some of them may have had a day off or been on sick leave, and incident documentation helps teams keep all teammates fully informed about every incident. At RubyGarage, we encourage sharing incident reports among all our developers, not only those directly involved in the project.

Fosters continuous improvement
For software development companies, incident documentation helps not only with analyzing each incident separately but also with accumulating experiences and making continuous improvements to systems, processes, and workflows.

For example, if several development teams within a company face similar incidents, this means something is wrong with the company’s development flow or processes.

The structure of an incident report

There are lots of different incident report templates; some of them are really brief, while others are detailed and take a lot of time to produce. At RubyGarage, we use a simple and intuitive structure for incident reports. It includes the following five sections:

Summary

This is a brief and to-the-point description of the incident that contains general information about it, including reasons for the incident, the time when it happened and the time it was resolved, and its consequences.

Timeline

This section provides exact times of all events related to the incident, including the time the incident started, the times at which actions were taken to resolve it, and ultimately the time when it was resolved.

Root cause

This is a detailed description of what caused the incident. It’s important to provide as many details as possible.

Resolution and recovery

This is a detailed description of every action the software development team took to resolve the incident, along with the time each action was taken and its result. It covers all actions, including those that turned out to be incorrect or ineffective.

Corrective and preventive measures

In this section, software development teams need to list all measures that should be taken to prevent similar incidents in future. These measures may include any necessary improvements to the project, development procedures, DevOps flow, etc.
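
The five sections above can also be captured as a small data structure, which makes it easy to check that nothing has been left out before a report goes to a business owner. The following is only a minimal sketch: the class names, field names, and the `missing_sections` helper are hypothetical and are not part of any RubyGarage tooling described in this article.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEvent:
    """One timestamped entry in the Timeline section."""
    time: datetime
    description: str


@dataclass
class IncidentReport:
    """Hypothetical container mirroring the five sections described above."""
    title: str
    summary: str = ""
    timeline: List[TimelineEvent] = field(default_factory=list)
    root_cause: str = ""
    resolution_and_recovery: str = ""
    corrective_and_preventive_measures: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "Summary": self.summary,
            "Timeline": self.timeline,
            "Root cause": self.root_cause,
            "Resolution and recovery": self.resolution_and_recovery,
            "Corrective and preventive measures":
                self.corrective_and_preventive_measures,
        }
        return [name for name, value in sections.items() if not value]

    def render(self) -> str:
        """Render the report as plain text, one heading per section."""
        timeline = "\n".join(
            f"{event.time:%H:%M} - {event.description}"
            for event in self.timeline
        )
        measures = "\n".join(self.corrective_and_preventive_measures)
        parts = [
            self.title,
            "Summary", self.summary,
            "Timeline", timeline,
            "Root cause", self.root_cause,
            "Resolution and recovery", self.resolution_and_recovery,
            "Corrective and preventive measures", measures,
        ]
        return "\n\n".join(parts)


# A draft report that is not ready to send yet (placeholder date and text).
draft = IncidentReport(
    title="Database failure incident report",
    summary="A spike in errors led us to switch the website to maintenance mode.",
    timeline=[TimelineEvent(datetime(2024, 1, 1, 21, 24), "Spike in errors detected.")],
    resolution_and_recovery="We restored the production database.",
)
print(draft.missing_sections())
```

Running the check on the draft lists "Root cause" and "Corrective and preventive measures" as the sections still to be written.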

RubyGarage incident report writing tips

Knowing the structure of a typical incident report is like reading a recipe from a cookbook − you may know what ingredients to add, but this doesn’t mean your dish will be delicious. There’s always something more than just the theory. Here are some helpful tips on incident documentation that we follow at RubyGarage:

#1 Write an incident report after handling an incident

It’s best to write a report only after resolving the incident. Writing it in the course of the incident makes no sense, as development teams need to act quickly and figure out how to resolve it. At this stage, teams can simply brief a business owner about the incident and keep them updated about the progress.

#2 No finger pointing

Write incident reports on behalf of the whole software development team. There’s no need to specify who exactly is to blame for an incident. Teams write incident reports not to point out and punish culprits but to carefully analyze the problem, fix it, and prevent it from happening again. Moreover, incidents often happen due to poor systems or processes, and development teams should focus on finding flaws in them rather than blaming team members.

#3 Don’t exaggerate or conceal anything

Development teams shouldn’t exaggerate or conceal anything in their incident reports. Everyone makes mistakes, and that’s why reports exist − they help others avoid falling into the same hole. Moreover, not telling the truth in incident documentation undermines a business owner’s trust in a team.

RubyGarage incident report sample

We’ve examined incident reports from A to Z and there’s just one thing left: an example. To show you how we write incident reports at RubyGarage, we’ve composed a brief sample based on the structure we’ve described. This incident report sample was made up specifically for this guide and isn’t connected to any real project.

Database failure incident report

Saturday 16 June

Summary

At 9:24 PM Pacific Time (PT) the error tracking tool (Rollbar) that we use on the project reported an abrupt spike in errors on the live production server. The issue affected critical functionality of the website, so at 9:28 PM our team switched the website to maintenance mode and it wasn’t available to end users. The root cause of the incident turned out to be a database failure caused by an incorrectly written script. The issue was fully resolved by 10:23 PM and the website was fully functional again.

Timeline

9:23 PM PT − The update of the production database was finished.
9:24 PM PT − Our monitoring tool detected an abrupt spike in errors.
9:26 PM PT − Our team started checking logs to find out what had caused the incident.
9:28 PM PT − We moved the website into maintenance mode.
9:37 PM PT − We initiated the task that would restore the production database.
9:43 PM PT − We optimized Sidekiq workers to speed up database restoration.
9:55 PM PT − We switched off the maintenance mode; at the time, 80% of all products were listed and only 20% were still missing.
10:23 PM PT − The production database was fully restored and the incident was resolved.
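
Exact timestamps make it straightforward to work out how long each phase of the incident lasted. As a small worked example, the sketch below derives three durations from the timeline above. The helper `minutes_between` and the three metric names are assumptions made for illustration, and the calculation assumes all events happened on the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day times written like '9:24 PM'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Times taken from the sample timeline (all Pacific Time).
detected = "9:24 PM"
maintenance_on = "9:28 PM"
maintenance_off = "9:55 PM"
resolved = "10:23 PM"

# Hypothetical metric names, chosen for this example only.
time_to_maintenance = minutes_between(detected, maintenance_on)
full_downtime = minutes_between(maintenance_on, maintenance_off)
time_to_resolve = minutes_between(detected, resolved)

print(f"Detection to maintenance mode: {time_to_maintenance} min")
print(f"Website unavailable to end users: {full_downtime} min")
print(f"Detection to full resolution: {time_to_resolve} min")
```

For the sample timeline this gives 4 minutes from detection to maintenance mode, 27 minutes of full unavailability, and 59 minutes from detection to full resolution.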

Root cause

During a planned update to the production database, a newly added script that worked with its structure turned out to have been written incorrectly. As a result, the script removed a large number of products from the item_datafeed table that contains the URLs that lead to retailers’ online stores.

Resolution and recovery

At 9:24 PM PT our team received lots of alerts from our error tracking software, and we immediately started investigating the issue.

At 9:26 PM PT, we started to check logs and quickly found out that the issue was caused by a production database failure. To minimize customer dissatisfaction, at 9:28 PM PT we decided to move the website into maintenance mode. Afterward, our team initiated the task to restore the production database, and we also optimized Sidekiq workers to speed up database restoration.

When 80% of the products in the database had been restored (at 9:55 PM PT), our team switched off maintenance mode so that the website was available again. At 10:23 PM PT all products were listed on the website and available to customers.

Corrective and preventive measures

Having analyzed the incident in-depth, our team came up with the following measures for preventing similar incidents in future:

Although we rely heavily on automated tests in our development workflow, we will also run every database-related change on the staging server before rolling it out to the production server.
We will back up all databases used on the project on a regular basis so that we can roll them back easily in case of incidents. Apart from automated scheduled backups, our team will introduce database backups before all major changes.
We will increase test coverage of the application to ensure that all critical parts of functionality are 100% covered with automated tests.
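
To make the backup measure more concrete, here is one way a "back up first, then change" step could look. This is a minimal sketch under stated assumptions, not the setup used in the sample project: it uses Python's built-in `sqlite3` module with in-memory databases instead of a real production database, and the function `apply_change_with_guard`, the exception `SuspiciousRowLoss`, the 5% threshold, and the faulty script are all hypothetical.

```python
import sqlite3


class SuspiciousRowLoss(Exception):
    """Raised when a change removes more rows than the allowed share."""


def apply_change_with_guard(conn, backup_conn, table, change, max_loss_ratio=0.05):
    """Back up the database, run `change`, and roll back on heavy row loss.

    `table` must be a trusted identifier, since it is interpolated into SQL.
    """
    # Take a full copy of the database before touching it.
    conn.backup(backup_conn)

    before = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    try:
        change(conn)
        after = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        lost = before - after
        if before and lost / before > max_loss_ratio:
            raise SuspiciousRowLoss(f"{table}: {lost} of {before} rows removed")
        conn.commit()
    except Exception:
        # Undo the uncommitted change and let the caller see the error.
        conn.rollback()
        raise


# Demo database with 100 made-up products.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item_datafeed (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO item_datafeed (url) VALUES (?)",
    [(f"https://example.com/item/{i}",) for i in range(100)],
)
conn.commit()

backup_conn = sqlite3.connect(":memory:")


def faulty_script(db):
    # Hypothetical bug: the condition matches far more rows than intended.
    db.execute("DELETE FROM item_datafeed WHERE id > 20")


try:
    apply_change_with_guard(conn, backup_conn, "item_datafeed", faulty_script)
except SuspiciousRowLoss as error:
    print(f"Change rolled back: {error}")

remaining = conn.execute("SELECT COUNT(*) FROM item_datafeed").fetchone()[0]
backed_up = backup_conn.execute("SELECT COUNT(*) FROM item_datafeed").fetchone()[0]
print(remaining, backed_up)
```

In this sketch the row-count check catches the faulty deletion before it is committed, and the backup copy remains available as a second line of defense if a bad change slips through.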

Wrapping up

Incident reporting might seem unpleasant, but it’s an important part of the software development process. In this article, we’ve shown you how we handle incident documentation at RubyGarage. If you want to know more about the development practices we follow, subscribe to our blog.
