Last Updated on 13 September 2023
MTTR is the average time that a system is down before it is brought back into use. It is essentially the average length of downtime.
It has many names all starting the same way, such as Mean Time To:
- Repair; this is the most common and the one I’ll use in this article
- Resolve
- Recovery
- Resolution
- Respond
- Remediate
The different names usually relate to the use of MTTR in different industries, but the definition and calculation are essentially the same. It is a key metric in many different types of operation, such as manufacturing, IT support and cybersecurity. Partly because it has so many different names, it’s usually just known by its acronym.
You should be aware that some organizations split the detection time, ‘Mean Time To Detect’ (MTTD), out of this figure, but most include it, so I have done so here too. The difference mostly occurs in the cybersecurity industry, where the ‘detect’ period, during which an attack is under way but unknown, can sometimes go on for a long time.
Including the detect time in the ‘respond time’ in that case would make the figure almost entirely dependent on the detection time. As the focus is normally on a fast recovery, splitting the figure into two separate metrics can add value there, because detection and recovery can then be tackled separately.
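To illustrate the split described above, here is a minimal Python sketch. The incident timestamps and variable names are made up for the example; it simply assumes you record when an attack started, when it was detected and when service was restored.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: (attack started, attack detected, service restored)
incidents = [
    (datetime(2023, 3, 1, 9, 0), datetime(2023, 3, 4, 9, 0), datetime(2023, 3, 4, 13, 0)),
    (datetime(2023, 5, 10, 8, 0), datetime(2023, 5, 11, 8, 0), datetime(2023, 5, 11, 10, 0)),
]


def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600


# Mean Time To Detect: from the issue starting to it being noticed
mttd = mean(hours(detected - started) for started, detected, _ in incidents)
# Respond time with detection split out: from detection to restoration
mean_respond = mean(hours(restored - detected) for _, detected, restored in incidents)
# Combined figure, with detection included
combined = mean(hours(restored - started) for started, _, restored in incidents)

print(f"MTTD: {mttd:.1f} h, respond: {mean_respond:.1f} h, combined: {combined:.1f} h")
```

With these invented numbers the combined figure is dominated by the days spent undetected, while the separate respond figure shows how quickly the team recovered once it knew about the attack.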
It is often used with MTBF (Mean Time Between Failures) to calculate the productivity lost to downtime.
What is the Mean Time To Repair?
MTTR is the average time from something breaking to it working again.
The time therefore includes diagnosing the error, fixing the error and getting the machinery back up and running.
When an operation goes down, product cannot be produced and there is a huge amount of waste from staff just standing around instead of performing value-added activities. Measuring and reducing MTTR combats this by training the organization to get the process back up quickly, minimizing this waste.
The metric tells you how efficient you are at responding to and fixing an issue.
Why is it important?
Downtime is incredibly expensive. A typical value quoted for network downtime, for example, is $5,600 per minute. Obviously this will vary based on your industry, organization size and the type of downtime. Either way, it’s going to be expensive.
When you’re faced with these kinds of costs, you’re going to want to improve your turnaround time. The often-quoted saying is:
If you can’t measure it, you can’t improve it
Peter Drucker
MTTR gives you a useful metric for measuring how quickly you get your business back on its feet when a problem interrupts production. If you measure your current efficiency, you can then work on improving it. This can save your organization from huge amounts of waste.
It is a worthy goal to aim to get failure rate to zero, but issues will occur eventually, and improving (which means reducing) MTTR will reduce the amount of damage that happens when they do.
Machinery will often become less reliable as it gets older. Monitoring its reliability through MTTR and MTBF can help you know when a machine has passed its useful life and needs to be replaced. Keeping it past this point can cost you huge amounts in repairs and lost revenue through downtime, so it is helpful to have this guide.
When is it used?
In a manufacturing environment, improving MTTR can be useful to maximize production, which can greatly improve profitability. It is a great way of increasing output, reducing waste and contributing to Total Productive Maintenance (see below).
You can use it effectively in a service desk environment, e.g. IT support. Staff elsewhere in the organization will need their equipment up and running as quickly as possible, so this will monitor those times. An IT outage can cost a huge amount of money very quickly, so especially in an IT-related field this can be a metric monitored at board level.
It’s also useful for high-profile companies where downtime can cause reputational damage. If you are, for example, a tech company whose customers notice when your service is working and when it isn’t (e.g. web hosting), time is very important. They may not notice a 30-second outage, but 2 hours may be enough for them to consider changing supplier.
Service agreements
Your contracts with customers may include terms that make you subject to fines if repairs take too long. In these situations, long response times can get very expensive. Measuring and improving Mean Time To Repair can help you keep within these limits. You should note, though, that you still need to ensure individual responses don’t get close to or exceed the maximum allowed time, even if the average is below it.
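The point about averages hiding individual breaches can be shown with a short Python sketch. The repair times, the 8-hour limit and the 20% ‘near miss’ margin are all assumptions for illustration, not figures from any real contract.

```python
from statistics import mean

# Hypothetical repair times in hours, and an assumed contractual limit
repair_hours = [1.5, 2.0, 0.5, 9.0, 7.0]
sla_limit_hours = 8.0

mttr = mean(repair_hours)
# Repairs that went over the maximum allowed time
breaches = [t for t in repair_hours if t > sla_limit_hours]
# Repairs within 20% of the limit (the margin is an arbitrary choice)
near_misses = [t for t in repair_hours if 0.8 * sla_limit_hours <= t <= sla_limit_hours]

print(f"MTTR: {mttr:.1f} h (limit {sla_limit_hours} h)")
print(f"Breaches: {breaches}, near misses: {near_misses}")
```

Here the average of 4 hours looks comfortably inside the limit, yet one repair breached it and another came close, which is why the individual times need checking as well as the mean.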
Mean Time To Repair formula
So how do you calculate Mean Time To Repair? Firstly, it’s important to note that the metric is calculated over time, so you need to choose a time period to calculate it over, e.g. a day, month, quarter or year. Your choice often comes down to how often you get downtime, as you’ll want to average over a large enough population of repairs, preferably at least 20. You won’t want a period that is too long, though, as you need to be able to measure MTTR repeatedly over time to check that your metric is improving.
You will also want to make sure you are covering the right period of time for each downtime. The downtime starts when the issue occurs, not when it is detected. It ends when the operator is able to use the equipment again, not just when the fix is in place and the engineer is testing it.
The calculation is then:
MTTR = Total downtime / Number of repairs
If you have a long list of downtime durations in Microsoft Excel, you can just take the average (=AVERAGE(highlight list)).
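If your downtime records live somewhere other than a spreadsheet, the same calculation is only a few lines of Python. This is a rough sketch: the log entries and the function name `mean_time_to_repair` are invented for the example, and a real period would ideally contain more repairs than the three shown.

```python
from datetime import datetime

# Hypothetical downtime log for one month:
# (time the issue occurred, time the operator could use the equipment again)
downtimes = [
    (datetime(2023, 9, 1, 8, 0), datetime(2023, 9, 1, 9, 30)),
    (datetime(2023, 9, 7, 14, 0), datetime(2023, 9, 7, 14, 45)),
    (datetime(2023, 9, 19, 22, 15), datetime(2023, 9, 20, 0, 0)),
]


def mean_time_to_repair(records):
    """MTTR in minutes = total downtime / number of repairs."""
    if not records:
        raise ValueError("no repairs in the chosen period")
    total_minutes = sum((end - start).total_seconds() / 60 for start, end in records)
    return total_minutes / len(records)


print(f"MTTR: {mean_time_to_repair(downtimes):.0f} minutes")
```

The three downtimes last 90, 45 and 105 minutes, so the total is 240 minutes over 3 repairs.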
Because MTTR is an average, a larger number of short, minor issues will pull the figure down even though the total downtime has gone up. It is therefore important that organizations decrease both MTTR and the number of issues.
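A quick worked example in Python shows how extra issues can flatter the metric. The downtime figures are made up purely to show the arithmetic.

```python
# Hypothetical downtimes in minutes for the same period
major = [120, 180, 90, 90]  # four serious breakdowns
minor = [10] * 6            # six additional minor stoppages


def mttr(durations):
    """Average downtime per issue."""
    return sum(durations) / len(durations)


before = mttr(major)
after = mttr(major + minor)

print(f"Before: {len(major)} issues, {sum(major)} min down, MTTR {before:.0f} min")
print(f"After: {len(major + minor)} issues, {sum(major + minor)} min down, MTTR {after:.0f} min")
```

In this example MTTR falls from 120 to 54 minutes, yet the operation was down for an hour longer in total, so the number of issues and the total downtime need watching alongside the average.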
What makes up the Mean Time To Repair?
The time taken in repairing the issue includes the time taken to:
- Detect the issue
- Diagnose the problem
- Source the materials needed for the repair
- Repair the machine / operation
- Get it back up and running
The clock starts when the operations go down (or in cybersecurity when the attack starts), and it stops when they go back up again.
How to improve your MTTR
Downtime can be incredibly expensive because often you’re not just running up extra costs, but losing revenue and potentially customers. An optimized system may have an MTTR of minutes or even seconds whereas a poor one may be hours or even days. MTTR reduction can therefore yield huge amounts of cost savings.
If you want to improve your performance rather than just gaming the metric, there are still many ways that you can improve your MTTR. It’s usually worth splitting them into the main stages of the repair:
Detection
- Improve your quality ‘ticket’ system, so that the correct people are chosen and notified immediately when there is a downtime issue. You want to make it as quick and easy as possible for your staff to request help or report an issue.
- Add a monitoring and notification system such as Andon, so that issues can be detected and reported automatically without human intervention.
Diagnose
- Clear escalation procedures can help, as time spent finding someone who can better diagnose the problem or help implement the solution can greatly increase MTTR
- Document common errors and their ‘symptoms’ to enable staff to diagnose more quickly, especially for problems they personally haven’t seen before. You may be able to create a diagnosis flowchart for them to quickly reach the solution.
- Make the documentation easy to find and search through, preferably in a well-organized central location on your network
Source equipment
- Add to your documentation what parts are needed for what issues, and where they are kept
- Organize your stock so that items are easy to find
- Keep your important items, such as the documentation and spare parts, in the same area and make them easily portable, so that the engineer doesn’t have to move too much to get the equipment and information they need
- Keep a good stock of spare parts, and replenish them when required
Repair
- Train your operators to do minor repairs so that they can fix the issues themselves.
- Create a library of FAQs and how-to guides so that staff can fix their own issues if the engineers are busy
- Empower your staff to make decisions, rather than having to wait for authorization for the repair
- Where authority is needed, make the reporting lines and contact details clear so this can be done quickly
- Make standard repair documentation for issues that occur regularly, so that more staff around the organization have the skills to do the repair, they can do it quicker and they will get the procedure right
- Make the repair instructions easy to find; preferably have where to find them written on / linked from the diagnosis documentation
- Create a proactive scheduled maintenance program that will make the major breakdowns (that take a long time to repair) a lot rarer.
Having your operators or staff do their own repairs can make these a lot more efficient, and free you up for more important tasks.
General improvements
- Many of these problems can be addressed by a wholesale approach to reducing downtime, such as Total Productive Maintenance
- Time your responses going through the different stages to see which part(s) are causing the issues. A value stream map may help you here.
- Make sure your ‘repairers’ are well rested, well trained, motivated and not stressed, so they can give their all to the repair
- Make sure there is not a blame culture. If staff are afraid to alert management to issues, some issues will be ‘swept under the carpet’, distorting the MTTR and preventing proactive action.
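Timing the different stages, as suggested above, could be analyzed with something like the following Python sketch. The stage labels follow the stages used in this article, but the timings and the data layout are invented for illustration.

```python
from statistics import mean

# Hypothetical minutes spent in each stage for three repairs
STAGES = ["detect", "diagnose", "source", "repair", "restart"]
repairs = [
    {"detect": 5, "diagnose": 40, "source": 20, "repair": 30, "restart": 10},
    {"detect": 10, "diagnose": 55, "source": 5, "repair": 25, "restart": 10},
    {"detect": 3, "diagnose": 25, "source": 35, "repair": 20, "restart": 7},
]

# Average time per stage; the stage averages add up to the overall MTTR
stage_means = {stage: mean(r[stage] for r in repairs) for stage in STAGES}
mttr = sum(stage_means.values())
slowest = max(stage_means, key=stage_means.get)

for stage in STAGES:
    share = stage_means[stage] / mttr
    print(f"{stage:<9}{stage_means[stage]:6.1f} min  {share:5.0%}")
print(f"MTTR: {mttr:.1f} min, slowest stage: {slowest}")
```

With these numbers diagnosis takes 40 of the 100 minutes on average, which would suggest starting with the diagnosis improvements rather than, say, spare parts stock.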
Stratification
If you have different categories of issue, it’s often worthwhile working out MTTR for each category. Categories can include different causes of issue and seriousness of impact on the organization. Splitting by issue type can help diagnose the root causes to improve your times.
- Splitting your figure by impact of the issue on the company can help make sure you’re focusing on the business-critical problems and avoid ‘noise’. If you work on a ‘ticket’ based system you may be able to let the user pick the priority level, helping prioritize both your work and the analysis.
- Calculating for different machines will show you which machine(s) you need to improve the reliability of
- A split by cause of the issue will help you focus on the issues having the greatest effect on your calculation
- Splitting by department or cost center can give you a view on the effect of management, and show you the department that first needs your focus
- By engineer will show you whether the time is affected by the experience, motivation, training etc of the person carrying out the repair so you can improve accordingly
- An analysis by age of machine may indicate whether the issue is being caused by installation errors or aging machinery, and help you decide when machines need replacing
You can often have serious issues hidden by the large amount of data from other categories, so this helps you tackle the most important issues first. You could use FMEA to help decide which categories are causing the most problems, and aim to improve MTTR for those first.
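As a sketch of how stratification might look in practice, the Python below calculates MTTR overall and then split by machine and by cause. The ticket data, machine names and the helper `mttr_by` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical tickets: (machine, cause, downtime in minutes)
tickets = [
    ("press-1", "electrical", 30),
    ("press-1", "mechanical", 240),
    ("press-2", "electrical", 20),
    ("press-2", "electrical", 40),
    ("lathe-1", "mechanical", 180),
    ("lathe-1", "operator error", 15),
]


def mttr_by(records, key_index):
    """Group downtimes by one column and return {category: MTTR in minutes}."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    return {key: sum(times) / len(times) for key, times in groups.items()}


overall = sum(t[2] for t in tickets) / len(tickets)
print(f"Overall MTTR: {overall:.1f} min")
print("By machine:", mttr_by(tickets, 0))
print("By cause:", mttr_by(tickets, 1))
```

In this made-up data the overall figure of 87.5 minutes hides the fact that mechanical failures average 210 minutes while electrical ones average 30, which is exactly the kind of detail a single number can bury.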
Criticism of MTTR
MTTR is a powerful tool to improve your response to downtime, but it can cause problems if not used carefully. If it is blindly followed as your only quality measure, it can lead to issues such as:
- It favors repeatedly fixing an item over properly and permanently solving the issue
- If too much pressure is applied to improve MTTR, staff may cut corners on the repair to ‘stop the clock’
- Quality staff can be pulled off more important tasks to do a minor repair or to stand ready for issues
- It can distract from planned maintenance that would save money in the long run
- A quick repair rather than a thorough repair can lead to increased occurrence of downtime and even safety issues
- A good score can be a sign that the staff are well practiced at fixing the same issues repeatedly, which could mean there’s an issue with the fragility of your systems
- It treats high impact and low impact issues the same, which can make staff prioritize the wrong issues.
What to do about it
The key danger is looking at only one metric. This can be mitigated by combining MTTR with other metrics such as Mean Time Between Failures (MTBF). This will encourage a quick turnaround of issues (limiting downtime) whilst still encouraging permanent fixes and thorough root cause analysis to prevent them recurring.
You should also make sure that you have enough staff time allocated to preventative measures, and that staff are not spending too much of their time reactively ‘firefighting’. Implementing an all-round solution such as Total Productive Maintenance can help make sure that your efforts have the correct split between reactive and proactive.
Availability and Overall Equipment Effectiveness calculation
The key metric of Total Productive Maintenance is Overall Equipment Effectiveness (OEE). OEE is made up of three parts: availability, performance and quality.
Availability is the probability of a machine being available for production at any given time. You can calculate it using a formula that involves MTTR and Mean Time Between Failures (MTBF):
Availability = MTBF / (MTBF + MTTR)
This is uptime / total time, or the percentage of time that the equipment is ‘available’.
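To see how much MTTR moves the availability figure, here is a small Python sketch of the formula above. The function name and the 98-hour MTBF are assumptions chosen to make the arithmetic easy to follow.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical machine that runs for 98 hours on average between failures
for mttr in (2.0, 1.0):
    print(f"MTTR {mttr} h -> availability {availability(98.0, mttr):.1%}")
```

Halving the repair time from 2 hours to 1 hour lifts availability from 98.0% to roughly 99.0% in this example, without any change to how often the machine fails.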
Decreasing MTTR can therefore contribute to improving OEE, often one of your organization’s main metrics, especially for quality departments.