Incident Report Writing: What to Do If Things Go Off the Rails in Production

  • Sep 12, 2018
Gleb B.

Copywriter

Vlad V.

Chief Executive Officer

Let’s face it: any application can go off the rails.

The world’s largest technology companies such as Google and Amazon pour billions of dollars into their hardware and software, and even they aren’t immune to incidents. Even if a software development company employs passionate and experienced developers and DevOps engineers, there’s still a chance that their software products will run into trouble.

Needless to say, a software development team should fix everything as soon as possible. But there’s one more important thing to do: write an incident report. This step might seem pointless, but it plays a key role in incident management.

In this article, we tell you why incident reports matter and how we write them at RubyGarage.

What is an incident report in software development?

A software development incident report provides documentation of an event that has fully or partially disrupted the normal operation of a live production server. In other words, this report describes a serious incident that has made a significant negative impact on an application. It doesn’t matter if the incident has hit hardware or software, as in both cases the application can’t function normally. Needless to say, such situations require urgent action to bring the application back into operation.

If this definition sounds vague, take a look at several situations in which writing an incident report is a must:

  • Downtime − a production server (and an application) hasn’t been available for a certain period of time
  • Loss of data − some information in an application’s database has been fully or partially lost or damaged
  • Security breach − some confidential information (such as login credentials or credit card numbers) has been exposed
  • Bulk mailing − an application has sent bulk email messages by mistake
  • Interruption of normal operation − some critical functionality (such as log-in or sign-up) hasn’t been working

Why incident reports matter

Any software development team has probably faced serious incidents and successfully overcome them. This is completely normal, as nothing is perfect in this world. But team members might be wondering if there’s any point in writing lengthy documentation on an incident that they’ve already handled.

Yet incident report writing has nothing to do with producing documentation no one ever reads. When it comes to incidents in production, software development teams need to think big and consider the people impacted by an incident and the negative consequences it brings:

  • End users are the first who suffer from any incident, as they can’t use an application and their personal information might be at risk.
  • Business owners lose money and reputation if their application goes offline or malfunctions.
  • The software development team bears responsibility for technical incidents, so team members need to properly handle them in the shortest time possible.

It goes without saying that incidents are unpleasant and upsetting for everyone: end users, businesses, and software development teams. To ensure customer satisfaction, minimize losses, and build trust, software development teams need to not only handle incidents quickly and professionally but also carry out a comprehensive analysis of incidents and establish effective communication with business owners. Incident report writing helps development teams achieve all of these goals.

Benefits of incident report writing for software development teams

Now that you understand why incident reporting matters, let’s go over the advantages it provides to software development teams. At RubyGarage, we see four main benefits of incident documentation:

  • Provides a comprehensive description of an incident to a business owner

    First and foremost, a software development incident report is a comprehensive explanation of an incident to a business owner. It helps software development teams provide full details, specifying the timeline of an incident, its cause, and the measures the team has taken. Moreover, an incident report allows development teams to prove their technical expertise and trustworthiness.

    Incident documentation isn’t a mere explanation. It’s also an important source of information for a business owner who may use it to instruct their customer support team or to prepare a statement for customers.

  • Helps a software development team analyze an incident

    Having documented an incident, a software development team can thoroughly analyze it later in order to find out what caused the incident, which of the team’s measures were effective and which weren’t, and so on. This analysis helps reveal flaws in the software system as well as in the team’s actions or skills so that they can make improvements and prevent similar incidents in future. Though any incident is unpleasant, it’s best for teams to treat incidents not as failures but as opportunities for improvement.

    For example, an analysis may show that the team found out about an incident too late, which indicates that an application requires an error tracking tool (such as Rollbar). Or if production servers experienced downtime without your team’s knowing, it’s a clear sign that an application needs a proper availability monitoring tool (such as Pingdom).

  • Allows team members to share experience

    Often, not all members of a software development team are present or participate in fixing an incident. Some of them may have had a day off or been on sick leave, and incident documentation helps teams keep all teammates fully informed about every incident. At RubyGarage, we encourage sharing incident reports among all our developers, not only those directly involved in the project.

  • Fosters continuous improvement

    For software development companies, incident documentation helps not only with analyzing each incident separately but also with accumulating experiences and making continuous improvements to systems, processes, and workflows.

    For example, if several development teams within a company face similar incidents, this means something is wrong with the company’s development flow or processes.

The structure of an incident report

There are lots of different incident report templates; some of them are really brief, while others are detailed and take a lot of time to produce. At RubyGarage, we use a simple and intuitive structure for incident reports. It includes the following five sections:

Summary

This is a brief and to-the-point description of the incident that contains general information about it, including reasons for the incident, the time when it happened and the time it was resolved, and its consequences.

Timeline

This section provides exact times of all events related to the incident, including the time the incident started, the times at which actions were taken to resolve it, and ultimately the time when it was resolved.

Root cause

This is a detailed description of what caused the incident. It’s important to provide as many details as possible.

Resolution and recovery

This is a detailed description of all actions taken by the software development team to resolve the incident, along with times when each action was taken and its results. This section describes all actions taken by the development team, even those that turned out to be incorrect or ineffective.

Corrective and preventive measures

In this section, software development teams need to list all measures that should be taken to prevent similar incidents in future. These measures may include any necessary improvements to the project, development procedures, DevOps flow, etc.
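
To make the five-section structure concrete, here is a minimal Python sketch of a report as a data structure. The class and field names (`IncidentReport`, `TimelineEvent`, `missing_sections`) are hypothetical and chosen only for illustration; this is not a tool we describe elsewhere in this article. The sketch simply checks whether a draft still has empty sections before it goes to a business owner.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TimelineEvent:
    time: str          # for example "9:24 PM PT"
    description: str


@dataclass
class IncidentReport:
    """Hypothetical container mirroring the five sections described above."""
    title: str
    summary: str = ""
    timeline: List[TimelineEvent] = field(default_factory=list)
    root_cause: str = ""
    resolution_and_recovery: str = ""
    preventive_measures: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the headings of sections that are still empty."""
        sections = {
            "Summary": self.summary,
            "Timeline": self.timeline,
            "Root cause": self.root_cause,
            "Resolution and recovery": self.resolution_and_recovery,
            "Corrective and preventive measures": self.preventive_measures,
        }
        return [heading for heading, content in sections.items() if not content]


# A draft with only the first two sections filled in.
draft = IncidentReport(
    title="Database failure incident report",
    summary="A faulty script removed products from the production database.",
    timeline=[TimelineEvent("9:24 PM PT", "Error tracking tool reported a spike in errors.")],
)
print(draft.missing_sections())
```

For this draft, the check lists the root cause, resolution and recovery, and corrective and preventive measures sections as still missing.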

RubyGarage incident report writing tips

Knowing the structure of a typical incident report is like reading a recipe from a cookbook − you may know what ingredients to add, but this doesn’t mean your dish will be delicious. There’s always something more than just the theory. Here are some helpful tips on incident documentation that we follow at RubyGarage:

#1 Write an incident report after handling an incident

It’s best to write a report only after resolving the incident. Writing it in the course of the incident makes no sense, as development teams need to act quickly and figure out how to resolve it. At this stage, teams can simply brief a business owner about the incident and keep them updated about the progress.

#2 No finger pointing

Write incident reports on behalf of the whole software development team. There’s no need to specify who exactly is to blame for an incident. Teams write incident reports not to point out and punish culprits but to carefully analyze the problem, fix it, and prevent it from happening again. Moreover, incidents often happen due to poor systems or processes, and development teams should focus on finding flaws in them rather than blaming team members.

#3 Don’t exaggerate or conceal anything

Development teams shouldn’t exaggerate or conceal anything in their incident reports. Everyone makes mistakes, and that’s why reports exist − they help others avoid making the same ones. Moreover, not telling the truth in incident documentation undermines a business owner’s trust in a team.

RubyGarage incident report sample

We’ve examined incident reports from A to Z and there’s just one thing left: an example. To show you how we write incident reports at RubyGarage, we’ve composed a brief sample based on the structure we’ve described. This incident report sample was made up specifically for this guide and isn’t connected to any real project.

Database failure incident report

Saturday 16 June

Summary

At 9:24 PM Pacific Time (PT), the error tracking tool (Rollbar) that we use on the project reported an abrupt spike in errors on the live production server. The issue affected critical functionality of the website, so at 9:28 PM our team switched the website to maintenance mode, making it unavailable to end users. The root cause of the incident turned out to be a database failure caused by an incorrectly written script. We switched maintenance mode off at 9:55 PM, and by 10:23 PM the database was fully restored and the website was fully functional again.

Timeline

  • 9:23 PM PT − The update of the production database was finished.
  • 9:24 PM PT − Our error tracking tool (Rollbar) detected an abrupt spike in errors.
  • 9:26 PM PT − Our team started checking logs to find out what had caused the incident.
  • 9:28 PM PT − We moved the website into maintenance.
  • 9:37 PM PT − We initiated the task that would restore the production database.
  • 9:43 PM PT − We optimized Sidekiq workers to speed up database restoration.
  • 9:55 PM PT − We switched off the maintenance mode; at the time, 80% of all products were listed and only 20% were still missing.
  • 10:23 PM PT − The production database was fully restored and the incident was resolved.
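
Exact timestamps make it easy to derive the durations a business owner usually asks about. The following Python sketch computes them from the sample timeline. It assumes that all events fall on the same evening and in the same time zone, and the helper name `minutes_between` is made up for this illustration.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day clock times such as '9:24 PM'."""
    clock_format = "%I:%M %p"
    delta = datetime.strptime(end, clock_format) - datetime.strptime(start, clock_format)
    return int(delta.total_seconds() // 60)


# Times taken from the sample timeline above (all PT, same evening).
update_finished = "9:23 PM"
errors_detected = "9:24 PM"
maintenance_on = "9:28 PM"
maintenance_off = "9:55 PM"
fully_restored = "10:23 PM"

time_to_detect = minutes_between(update_finished, errors_detected)   # 1 minute
time_to_maintenance = minutes_between(errors_detected, maintenance_on)  # 4 minutes
site_unavailable = minutes_between(maintenance_on, maintenance_off)  # 27 minutes
time_to_resolve = minutes_between(errors_detected, fully_restored)   # 59 minutes

print(time_to_detect, time_to_maintenance, site_unavailable, time_to_resolve)
```

In this sample, the website was in maintenance mode for 27 minutes, and the incident took 59 minutes from detection to full recovery.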

Root cause

During a planned update to the production database, a newly added script that worked with its structure turned out to have been written incorrectly. As a result, the script removed a large number of products from the item_datafeed table that contains the URLs that lead to retailers’ online stores.

Resolution and recovery

At 9:24 PM PT our team received lots of alerts from our error tracking software, and we immediately started investigating the issue.

At 9:26 PM PT, we started to check logs and quickly found out that the issue was caused by a production database failure. To minimize customer dissatisfaction, at 9:28 PM PT we decided to move the website into maintenance mode. At 9:37 PM PT, our team initiated the task to restore the production database, and at 9:43 PM PT we optimized Sidekiq workers to speed up database restoration.

When 80% of products in the database had been restored (at 9:55 PM PT), our team switched off maintenance mode and made the website available again. At 10:23 PM PT, the database was fully restored, and all products were listed on the website and available to customers.

Corrective and preventive measures

Having analyzed the incident in-depth, our team came up with the following measures for preventing similar incidents in future:

  • Although we already rely heavily on automated tests in our development workflow, we will also run all database-related changes, such as update scripts, on the staging server before rolling them out to the production server.
  • We will back up all databases used on the project on a regular basis so that we can roll them back easily in case of incidents. Apart from automated scheduled backups, our team will introduce database backups before all major changes.
  • We will increase test coverage of the application to ensure that all critical parts of functionality are 100% covered with automated tests.
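
The "back up before major changes" measure can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module and an in-memory database instead of a real production setup, and the function names, the table columns, and the verification rule (the product count must not change) are all hypothetical.

```python
import sqlite3
from typing import Callable


def apply_with_backup(
    conn: sqlite3.Connection,
    change: Callable[[sqlite3.Connection], None],
    verify: Callable[[sqlite3.Connection], bool],
) -> bool:
    """Back up the database, apply `change`, and restore the copy if `verify` fails."""
    snapshot = sqlite3.connect(":memory:")
    try:
        conn.backup(snapshot)      # full copy taken before the change
        change(conn)
        conn.commit()
        if verify(conn):
            return True
        snapshot.backup(conn)      # overwrite the damaged data with the copy
        return False
    finally:
        snapshot.close()


def count_products(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM item_datafeed").fetchone()[0]


def faulty_script(conn: sqlite3.Connection) -> None:
    # Stand-in for an incorrectly written update script: it deletes most rows.
    conn.execute("DELETE FROM item_datafeed WHERE id > 10")


# Demo data with an assumed schema: 100 products with retailer URLs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item_datafeed (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO item_datafeed (url) VALUES (?)",
    [(f"https://example.com/products/{i}",) for i in range(100)],
)
conn.commit()

expected = count_products(conn)
applied = apply_with_backup(
    conn, faulty_script, lambda c: count_products(c) == expected
)
products_after = count_products(conn)
print(applied, products_after)
```

Here the faulty script fails verification, so the function reports that the change was not applied and the table is restored from the copy taken just before the change. If the change itself raises an error, this sketch lets the error propagate and leaves any rollback to the caller. A real project would use its database's own backup tooling rather than an in-memory copy.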

Wrapping up

Incident reporting might seem unpleasant, but it’s an important part of the software development process. In this article, we’ve shown you how we handle incident documentation at RubyGarage. If you want to know more about the development practices we follow, subscribe to our blog.
