Clifton Risk Management

Process and Events

Each heading in this chapter relates to a box in the modern model shown previously. We explain how a business continuity programme can be implemented and maintained in the most cost-effective way, reducing an organisation’s exposure in the shortest possible time. All examples quoted are ones we have experienced in the last few years of working in all business sectors around the world. There is nothing wrong with the currently accepted model for business continuity, but it does leave organisations exposed during implementation and leads personnel to look at the operation as a project, not as a component of business as usual.

In no area are we defining absolute rights and wrongs, just some new variants on a theme. In addition this is a short document and as such cannot cover every area in the depth that is sometimes needed. Each individual paragraph here could be a document in itself.

Determine Capability and Understanding

All organisations have an understanding of what would happen if the fire alarm went off during the working day. This is the start of understanding your perceived attitude towards risk and what a plan should contain. Not so clear is the reaction to widespread infection by a virus in the computer systems or a product recall of faulty goods. What can be established quickly, at the outset of implementing risk management, is the Board’s or senior management’s understanding of, and requirements for, implementing solutions.

Traditionally an overview of what the business continuity project will contain has been presented and agreement to support the project has been sought. Months later the same senior management are asked to sign off business priorities they instinctively knew at the outset and to authorise spending on solutions whose need they do not clearly understand.

We suggest that at the outset the Board members or senior managers are brought together and presented with a short, sharp, scenario-based walkthrough of a serious incident, and asked to react to it and work through the problem. By completing this exercise senior management will themselves realise the level of exposure they have and will want to drive the project. They will identify high-level business requirements and organise an initial reaction plan to a serious interruption. No detail will come from the meeting, but drive, support and requirements will be delivered.

Our experience has shown that most Board members or senior operational managers understand how their organisation works, where it makes and loses money, what is critical to the customer, what the future plans encompass and where an interruption would be intolerable. They may not know detailed amounts, timeframes or the technological reasons for a failure, but they can in a very short period of time deliver a thumbnail sketch of the priorities and requirements for recovery or protection. To many Board members the difference between an impact of £15 million and one of £50 million becomes irrelevant if the cost of protection is £200,000.

If we use the rule that states you get 80% of the success in the first 20% of effort, and that you use the remaining 80% of effort to achieve the last 20% of success, then we can see that this quick hit achieves one of our key aims. Whilst this approach will not deliver detailed and 100% accurate definitions it will provide immediate protection to the organisation.

Agree High Level Approach

The initial meeting is facilitated with the key objective of delivering three drivers to ensure the organisation is managing risk appropriately, namely impacts, strategies and plans.

The facilitation by someone who understands risk requirements will ensure that the detailed business knowledge of the attendees is used to produce these factors.

Impacts

The impacts that a business can feel can be both internal and external and need not always be measured in financial terms. Many government offices suffer little direct financial impact if they cease to operate, but there are operational impacts that must be considered. In some organisations senior management may decide that a critical impact is one that would not be identified by measuring business impact against financial loss. For example, many companies place customer service as their number one priority but do not know how to measure this in financial terms.

The attendees at the initial workshop, through years of experience and clear understanding of the business, know what the initial impacts to their company will be and can agree amongst themselves the priority of operations if there is a need to recover any. These impacts will not have clear, tangible numbers against them in either financial terms or operational risk measurement terms, but they will be agreed and accurate to the extent of the Board’s knowledge. The presence of the Financial Director or CFO will ensure that the key profit and turnover aspects of the company are considered and any particularly sensitive times of year are addressed.

Therefore you will find that in one short meeting you have circumvented what for many companies can be months of work in producing a business impact analysis report. Many of the agreed priorities and timescales may have to be verified in detail for auditors, insurers and regulators, but this can be done at leisure once the company is protected. Our experience has shown that in almost all cases when detailed business impact analysis reports are presented to the Board, very little information is shown that was not already known, and often the Board, knowing the future strategy, have a clearer vision of priorities than operational managers do.

Strategies

The meeting will also agree sweeping large-scale strategies which will initially define the way forward for planning and can later be confirmed through testing. For larger organisations this can include deciding whether the organisation can afford to recover or should instead invest in its computer systems to ensure that there is no down time, looking at options including disk space redundancy, mirroring or disk farming. Organisations running call centres, emergency lines, Web sites and financial trading floors should take a high-level review of whether any down time is now acceptable to the customer service side of their operation.

Strategies at this level may also involve estimates as to how many personnel would be required, and in what timescales. Risk reduction strategies can also be agreed where obvious areas of weakness are seen. Reducing risk should lead to the reduction of the need to utilise a recovery strategy and often this meeting will identify obvious areas of failure that need immediate redress and cannot wait for a formal project to address them.

Plans

The attendees will also form initial plans during the meeting. By using a realistic scenario they will take control of areas in which they obviously excel, or manage in normal daily operations. No detailed plans will be created but assumptions will be discussed and ownership of all key areas agreed. If facilitated correctly then on completion of the meeting each agreed owner will now understand the need for planning and drive for the completion of their plan as they have demonstrated to themselves how exposed they are. These plans will be pragmatic and realistic due to the way in which they were formed. Our experience has shown that many organisations produce detailed complex plans. The personnel expected to use these plans have little faith in them, do not believe in them and use their own reactions above the written documentation. By ensuring that plans are produced in conjunction with their users in a realistic scenario then the plans will be supported and practical.

The initial high-level approach will produce a number of streams which need to be continued throughout the business continuity operation. If this is a project to implement business continuity then initially these may be run as programmes, but eventually they should be incorporated into business as usual, using tools such as change management and project management.

Implement Immediate Protection

On completion of the initial meeting to establish high-level requirements, the immediate actions required will be self-evident. These may involve changing accepted working practices slightly or ensuring obvious weaknesses are reduced. A standard example from such an exercise is the backup regime for computer data: we find in almost every size of organisation that the timetabling and scope of the regular backups are not what the Board members had expected them to be.

Other, simple, actions which can be implemented immediately can be the creation of call trees or the agreement of who would own an incident and manage it. Whilst these are simple examples they are products of the initial meeting that may prove to be “show stoppers” in the event of an incident today. We are confident that they would be addressed in the full-scale business continuity programme but our aim is to reduce exposure immediately and therefore take “off-risk” high-risk failure points.

Off Risk Programme

The Off Risk Programme relates to immediate protection but is also an ongoing programme that continues beyond the life of the set-up project. Our experience has shown that in many projects risks are identified and then added to a risk matrix to be resolved or reduced at a later date. Some of these risks are so fundamental to the operation of the organisation that a project-managed approach to reviewing them in the future is unacceptable; something needs to be done immediately to take elements off risk.

If you cut your leg badly enough to need stitches you would do something immediately; you would not leave the wound to bleed until you reached medical help. You would improvise and reduce the problem until a perfect solution was found. This is not what most organisations do: they leave areas at risk whilst searching for a perfect answer. We would argue that a “quick and dirty” interim solution, which reduces exposure and impact, is the better way forward.

Once key risks have been identified, whether through the initial meeting of the Board, the risk management process, testing and maintenance, or business as usual, these risks must be addressed. Most will be managed in the normal risk management process detailed later, but some are too critical to wait. We have seen that taking issues off-risk by implementing a formal list of key risks, assigning them owners, reporting on them in public each month, week or day, and tracking all progress ensures that the risks are resolved or reduced to an acceptable level quickly and efficiently. This programme becomes part of business as usual, whether within normal change or project planning and reporting, or within normal management review meetings.
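
As an illustration only, the formal list described above (key risks, named owners, regular public reporting and tracked progress) could be held in something as simple as the following Python sketch. The class, field names, statuses, example risks and dates are all our own assumptions for the purpose of the example; they are not a prescribed format, and a spreadsheet would serve equally well.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class KeyRisk:
    """One entry on the off-risk list (all field names are illustrative)."""
    description: str
    owner: str
    raised: date
    status: str = "open"  # assumed statuses: "open", "reduced", "resolved"
    updates: list = field(default_factory=list)

    def log(self, when: date, note: str, status: Optional[str] = None) -> None:
        """Record a progress note and optionally move the risk to a new status."""
        self.updates.append((when, note))
        if status is not None:
            self.status = status


def report(risks, today):
    """Return one line per risk still on the list, oldest first."""
    outstanding = sorted(
        (r for r in risks if r.status != "resolved"),
        key=lambda r: r.raised,
    )
    return [
        f"{r.description} | owner: {r.owner} | {r.status} | "
        f"{(today - r.raised).days} days on list"
        for r in outstanding
    ]


# Hypothetical entries of the kind an initial Board meeting might raise.
risks = [
    KeyRisk("Backups do not cover all servers", "IT Director", date(2024, 1, 8)),
    KeyRisk("No agreed incident owner", "Operations Director", date(2024, 1, 8)),
]
risks[1].log(date(2024, 1, 15), "Owner and deputy agreed", status="resolved")

lines = report(risks, date(2024, 1, 22))
for line in lines:
    print(line)
```

The point of the sketch is the discipline rather than the tool: every risk has one owner, every update is dated, and resolved items drop off the published report so that attention stays on what is still exposed.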

The formality and openness of the programme and the agreed high priority of the risks means that the company can quickly see real benefit and risk will be reduced at the earliest opportunity.

We do not recommend that the off-risk process is used for all risks, as it can be seen as heavy-handed and can lose support if minor or long-term issues are included. This off-risk stream may be closed down once key risks are resolved and re-opened on discovery of new risks that threaten operations.

Initial Response Plan

Recent research has shown us that the name Crisis Management Plan is emotive and inflammatory, and in almost all cases, incorrect for the type of plan in place. Putting the words crisis and plan in the same title is almost an oxymoron as by definition a crisis must be an unplanned or unforeseen event and as such how can you have a plan for it?

What most organisations have in place is an Initial Response Plan. They have a plan identifying the initial key functionaries that must be involved in any decision making or first actions in the event of a serious interruption to operations. This team in general incorporates representatives from areas such as:

The plans that are written generally cater for the period from the first reaction to a serious incident through to the moment when initial business continuity can begin. The plans should address such issues as initial notification and escalation through to the definition of what is a disaster within each particular organisation. Key personnel with identified roles such as handling the emergency services, press and media, key customers and personnel issues should all be included. An agreed method to execute a damage assessment and key operational criteria should be included for decision making. Other specific business components need to be involved, and organisations with external dependencies, such as airlines, must in particular have dedicated plans. Many other elements go into this plan; this booklet does not intend to be prescriptive as to how an initial response plan should look, nor does it claim to be complete in all aspects of business continuity, but rather to challenge the norm.

This is the fundamental plan and has been discussed and written about many times, by many authors, but we would argue that it is not a Crisis Management Plan; what a Crisis Management Plan is will be discussed later. We would, however, go further and challenge the make-up and usefulness of many Crisis Management/Initial Response plans currently in place.

In some companies the point at which a disaster is defined will be very important, as will the recognition that there is a time when an incident becomes a disaster.

Our experience has shown that in tests and real activations, in almost all cases, the plans are not used. In general they have been written by someone who does not have to use them, and they are complex, wordy and difficult to navigate. Too much emphasis is placed on the detail of the content and it is usually presented in a formal report style. This is not what the document is for. It is an aide memoire to be used in a moment of extreme stress; it must be clear, concise, and simple to use and to understand. In fact the more people understand their roles through training, rehearsals and education, the less the need for a document. As this understanding grows, the aide memoire/checklist approach becomes more and more the norm. We would stress the need for clarity, understanding and education above the need for clever paper or computer-based plans, which in a real event are unlikely to be used.

Information Technology

As mentioned earlier in this document most disaster recovery planning and subsequent business continuity planning evolved from the Information Technology areas within organisations. Traditional business continuity models and approaches have seen the IT area as being an individual component of planning and not intrinsically linked to other parts of the process. This may have been acceptable once but it does not work in the modern implementation of solutions.

If we accept that most companies have some form of resilient strategy in place to protect against IT failures and interruptions, then we must either work within this framework of solutions or review it at the outset. As such, in the new model the IT component runs as a stream throughout the process. IT protection and removal of risk must be in place as soon as possible. Existing IT plans and strategies must be reviewed, and continually reviewed, as business plans and risks are agreed. The testing of IT plans to meet business requirements, and not simply to meet IT recovery goals, will in many cases lead to plans being amended and reviewed.

In many organisations business continuity plans for the IT department do not exist. Whilst disaster plans for recovering systems may be in place, plans to restore IT as an operational function, and not a recovery service, have not been written. This needs to be reviewed. The IT department is integral in restoring services to users, but services such as IT security, Help Desk and, most commonly, IT Development are often forgotten.

Development is very often considered to be non-critical and this can be a mistake. Many mainframe systems operate on what is called object code: system code that is readable by computers but not in a state that can be amended by IT staff. The code that can be changed by programmers is commonly known as source code, and this mostly resides on systems in the Development areas of IT departments. If this code is not available at the recovery site then, should the “live” code fail, it will be very difficult, if not impossible, to fix the problem. In addition, many man-years of effort can be spent in developing computer systems which are to support a product launch or a change in systems due to regulatory instruction (for example the Euro). If this work is not completed, or goes “live” late, then companies may be unable to operate, be in breach of regulations or fail to achieve a product launch. Just assuming IT Development is not critical is not acceptable.

Put simply, a large number of organisations have had IT solutions for production systems for some time and rely upon these as their business continuity plan. This should not be the case. Over time IT strategies have been misinterpreted by management and the business, and the assumptions made are not correct. For example, we have seen many cases where the following change of reality has occurred due to misinterpretation:

It can be argued that the previous diagram is simply a failure in communication and it probably is. However for most business areas business continuity is an overhead and they are willing to believe, or accept what they think is the simplest and often cheapest solution. When the following diagram is explained to the business they are horrified at the reality and many managers refuse to accept it!

Time axis of the diagram (24-hour clock): 0001, 0400, 0800, 1200, 1600, 2000, 2359

Monday: End Online Day; Start Batch
Tuesday: Backup Monday; Take Monday Off-Site; End Online Day; Start Batch
Wednesday: Incident; Invoke Contract
Thursday: Equipment Arrives
Friday: Complete Data Restore; Begin Reduced Business Operation (as at close of business Monday)

Key: Normal Operations; Time Without Computers; Operation In Recovered State

The diagram shows that a serious incident occurring overnight between Tuesday and Wednesday, which forces a recovery of computer systems, means that operations will not be functional again until Friday morning at the earliest. This new operation will be working on data almost 4 days out of date and will only be working in a reduced manner. When this sort of model is demonstrated to business managers we find that the majority of disaster recovery contracts are inadequate, inappropriate or misleading. There is more on the new approach to disaster recovery later in this document.
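
The arithmetic behind the diagram can be checked with a few lines of Python. This is a rough sketch only: the diagram gives days rather than exact hours, so the clock times below (close of business at 18:00, incident at 02:00, restart at 08:00) are our assumptions, and the calendar date used to anchor the week is arbitrary.

```python
from datetime import datetime, timedelta

# Arbitrary anchor: 1 January 2024 happens to be a Monday.
MONDAY = datetime(2024, 1, 1)


def at(day_offset, hour):
    """Return the moment `hour` o'clock on Monday + `day_offset` days."""
    return MONDAY + timedelta(days=day_offset, hours=hour)


def hours(delta):
    """Express a timedelta as a number of hours."""
    return delta.total_seconds() / 3600


# Assumed clock times; only the days come from the diagram.
last_good_data = at(0, 18)  # Monday close of business (the backed-up state)
incident = at(2, 2)         # overnight between Tuesday and Wednesday
resume = at(4, 8)           # Friday morning, reduced operation begins

downtime = resume - incident
data_age = resume - last_good_data

print(f"Time without computers: {hours(downtime):.0f} hours")
print(f"Age of data at restart: {hours(data_age) / 24:.1f} days")
```

Under these assumptions the business is without computers for 54 hours and restarts on data about 3.6 days old, which is the "almost 4 days" referred to above. Changing the three assumed times to match a real backup schedule and recovery contract gives a quick first test of whether that contract meets the business requirement.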

As was mentioned earlier in this document most business continuity methods and models are based on the assumption of organisations using mainframe computers and in truth the above demonstration of recovery times originates from that era too. Most modern organisations now have a cross section of equipment and hardware which reduces the exposure of a single point of failure but introduces the complexities of various skill sets and different points of consistency.

Points of consistency are a fundamental problem for all disaster recovery plans and are generally overlooked. Most systems run at different times, offering different services to various clients or parts of the business. Some of these run on-line systems where data is updated as it is entered, and others collate the information to be updated during a “batch” run, usually overnight. The introduction of client servers, complex networks and PC systems now means that there are more systems, many of which are not controlled by central IT functions. If we use the previous diagram and assume that the mainframe has been lost for a number of days, what does this mean to business areas using local systems? If local servers are still operational then the business may continue to enter data. This data is now out of synchronisation with the lost mainframe data. If the data lost in the overnight batch had not been downloaded back to local systems then, from the previous diagram, on Wednesday morning business areas may start operations with incorrect and out-of-date facts and figures. The impact of this can be horrendous, and the consequences sometimes illegal, as corrupt systems using the wrong data could lead to false actions or transactions.

Consider a financial organisation running a Trading floor. The loss of data may mean that opening positions show an incorrect level of holdings and assumptions of where positions are long or short may not be accurate. This could lead to losses and incorrect reporting to regulators. Alternatively a retail distribution centre that relies upon stock re-orders from automatic tills in stores could show data on local systems showing stock ordered but the systems recovered centrally do not have this information and fail to collate or re-order.
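
A simple way to picture the problem is to list, for each system, the moment up to which its data is complete, and then look for the latest moment that every system can agree on. The Python sketch below does exactly that. It is an illustration under our own assumptions: the system names and timestamps are invented, and a real reconciliation would work at the level of transactions rather than a single timestamp per system.

```python
from datetime import datetime


def consistency_gaps(last_update):
    """Return the common point of consistency and how far each system is ahead of it.

    `last_update` maps a system name to the time up to which its data is complete.
    """
    common_point = min(last_update.values())
    ahead = {
        name: stamp - common_point
        for name, stamp in last_update.items()
        if stamp > common_point
    }
    return common_point, ahead


# Hypothetical estate after the mainframe has been restored from Monday's backup
# while local systems carried on taking data.
systems = {
    "mainframe (restored)": datetime(2024, 1, 1, 18, 0),
    "branch server": datetime(2024, 1, 3, 9, 30),
    "till polling server": datetime(2024, 1, 2, 23, 0),
}

point, ahead = consistency_gaps(systems)
print(f"Common point of consistency: {point:%A %H:%M}")
for name, gap in sorted(ahead.items()):
    print(f"{name} holds {gap.total_seconds() / 3600:.1f} hours of data to reconcile")
```

Every system that appears in the second part of the output holds entries that the recovered central system knows nothing about. Those entries either have to be re-keyed centrally or backed out locally before the business can trust its figures again.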

When the original business continuity models were constructed and central computer systems were the norm, points of consistency were still a problem, but less so, as most computing was conducted on, or from, single points. In a disaster all processing would stop, and this would prevent systems becoming out of synchronisation. With distributed systems this is no longer the case. In addition, the mystique of computing to business users has dulled in the last 10 years, and now most users are also owners of computers at home and understand the equipment in front of them. The modern world needs different approaches to both planning and strategies, which consider the issues of distributed systems, points of consistency and the advanced knowledge and understanding of some users.

In most organisations IT should already have some disaster recovery solutions in place and as such is no longer a part of the planning but rather a stream to be improved, adjusted and updated. The diagram of the new model for BCP shows the IT stream running in parallel with testing, as the assumptions should be proven at all stages. It also runs into testing again at the bottom to ensure that cross-area testing is captured and business-led IT tests are executed.

Maintenance

Traditionally maintenance has always been the last box on any model, meaning that any approach to maintaining strategies, plans or procedures was not considered until the project was completed. As mentioned, we do not see business continuity as a project, and as such maintenance must begin as soon as something is in place.

In many organisations maintenance is not done particularly well and this leads to a secondary project based approach to repair the original strategies and plans and in many cases to start again. This approach does not lead to any buy in from owners of plans nor does it encourage an environment of risk aware managers.

Business Continuity and Risk Management are overheads and are often perceived as taking time and not giving value. The business continuity manager’s job is to ensure that the value is understood throughout the company and that the commitment is there to continuously improve solutions. In many organisations, as we will demonstrate, recovery is no longer an option and continuous processing must be in place. In these organisations maintenance and ownership are essential.

In the new model as soon as high level approaches to impacts, strategies and plans have been agreed then these need to be maintained. As these approaches are “firmed up” and fully implemented then the ongoing maintenance can pick up these amendments. As stated if the initial start up meeting is at a high enough level within the company and is facilitated well enough to ensure that commitment is there from day one, then this will ensure that owners for maintenance are found. The IT plans should be in place and ongoing maintenance will update these in line with findings from the risk reviews and Off-risk programme. By adopting this approach then maintenance is not something to think of at the end of a long project, where in fact in many organisations a number of plans and strategies are already out of date, but is part of the ongoing life cycle.

We are always asked about methods of maintenance. Is the best method to use central updating, or is distributed ownership more effective? How should managers be motivated to maintain plans? In most cases this comes down to a cultural issue within the organisation. Rather than inventing new methods and introducing new software, which needs to be distributed and installed and could involve training, we recommend that organisations stay within what is already in place wherever possible. Linking maintenance to appraisals, performance measurements or even bonuses can motivate management, but this is all negative motivation. If the business continuity manager can clearly demonstrate to business managers the necessity and benefits of the plans, in such a way that they can see that it is beneficial to have them in place and up to date, then maintenance stops being an issue. Once again we have found this is best done through testing. During tests management will take ownership of their plans, as they can see the potential risk of not recovering in a clear, structured manner.

Testing, Training, Rehearsing

This is another section that has changed in the new model from being a block at the end of the project to a stream running through the life cycle of business continuity. Too often we have seen very lengthy business continuity projects, lasting 18 months to 3 years, which culminate in planning for a test. At this stage it is found that fundamental assumptions and procedures will not work, do not intertwine with other areas’ requirements and are not acceptable to the business. Testing is the most effective way to discover and prove requirements and to ensure buy-in from the people present.

Semantics can be emotive at this time, and calling the method of proving plans and strategies a test, a training exercise or a rehearsal can have various impacts on your company’s personnel. Being cognisant of the culture within a company is essential.

For the rest of this document we will use the word test to represent all three options.

The very first box in the new model shows the senior managers determining their capability and that of the organisation. It was suggested that this was run as a scenario-based walkthrough. In essence we are suggesting that the whole process begins with a test and that testing continues to be the driving force.

We find in almost all tabletop tests that the attendees are unprepared and unbriefed, and are not clear on their roles and responsibilities. The tests turn into briefing and education seminars during which many attendees, seemingly for the first time, grasp what they are there for and what being in a recovery team means. In many cases with this realisation comes the feeling that they are not the right person to be there or that the plan in front of them is unrealistic, unworkable or not relevant. In many cases, after a test, the plans are re-written, and it is at this stage that the people they were written for then adopt them. This clearly demonstrates that until testing the concept of business continuity remains a theoretical one. Unfortunately this theoretical approach to planning is often exposed only when plans have to be used in a real situation and they fail or are ignored.

Testing demonstrates to the people who must use plans that they are the ones who must shape them. Many tests simplify plans or re-shape whole strategies. In fact we continually see that plans do not really exist until testing is executed.

As such we see testing as a continual event, from the very early stages of a project to implement BCP and ongoing throughout the business continuity lifecycle. Testing in the early part of the lifecycle demonstrates how exposed companies are and allows focus to be placed on areas of greatest risk. It also visibly demonstrates weaknesses to managers and shows them why they need to plan and implement strategies. As the lifecycle matures, tests develop and may even be driven by business units. Leaving testing until the end of a project-based approach means that weaknesses may be left unaddressed for an indefinite period of time and that assumed plans may prove unworkable.

In addition to this, testing can be fun. This in itself is a break from the norm in the midst of people’s business life and is more likely to get support than a paper-based theoretical exercise. Testing has thrown up many benefits to companies that go beyond the business continuity programme: because it allows people to challenge the normal methods of operation, some streamlining and better processes have been developed from it.

Testing can, in itself, be a risk. Some large-scale tests we have facilitated have involved hundreds of people, many business areas and a whole range of computer systems. The level of project planning required to make this work can be huge and can take a great deal of time. This in itself is hardly a spontaneous and instant reaction to a crisis, but it is important to ensure that the best value is achieved from the test. You must be clear why you are testing. What are you trying to achieve? How will you measure this level of achievement? What are your aims and objectives? Testing for testing’s sake will not give much benefit to the organisation, and if you cannot measure your levels of success from the test then lessons learnt may be missed.

So what should you test and how should you do it? Is the ultimate aim for all companies the “surprise pull the plug” type of test, where a senior manager arrives at work and declares a test there and then? In the late 1990s certain UK regulators were insisting that they wanted to see “full tests” of recovery capability and re-location. From our experience this is rarely, if ever, undertaken, for many reasons. This type of test costs a great deal of money, not only in organising it but in the man-hours lost by all business areas involved in the test. The test can also be difficult to organise, as systems that are in place at the operational business site may need to be artificially altered for the test to a state that they would not need to be in if a real incident occurred. This can be due to networking, routing or links to other key live systems which cannot handle duplicate links from the disaster recovery site and the live operational site at the same time.

As such agreeing the scope of a test before you start to plan for it is essential.

Testing has to go beyond IT systems and must involve as many personnel as possible. Simple aspects such as where to evacuate to, where to meet after evacuation and how to get to the recovery site are often forgotten. We have found that those involved in business continuity planning on a regular basis can become almost too close to the subject and forget that for 90% of staff or more BCP is something which may be discussed once a year. We found recently that only 2 people out of 12 on the Crisis Management Team of a company which had moved into a new building knew where the fire evacuation point was. Just because there is a poster on the wall informing people does not mean anyone has read it. Tests do not get much simpler than these, but they are often ignored in the rush for more “dramatic” tests.

If business areas tell you they can operate without computer systems and use manual workarounds, ask them to do it for a day. Determine the true business impact of this through testing as opposed to analysis. There is no limit to what you can test, and business continuity managers should strive to test all aspects of the plans, from communication to Public Relations and from the Initial Reaction to Information Technology. Tests can include:

Type of Test: Overview
Table Top Rehearsals: These tests are the linchpin for all tests across your company. They can be used to exercise the Crisis Management Team in its immediate reaction to an incident or to prove a specific Business unit understands how it will operate during relocation or without systems. They are usually scenario based and over time can develop in complexity as your company matures into the business continuity process.
Information Technology: Before any large scale IT test is undertaken, and as new components for recovery are introduced, each element of technology should be tested. This can range from switching communications lines to restoring data. Whatever the test is, it can improve procedures, identify dependencies and help to agree recovery timeframes.
Disaster Recovery: Once all IT components have been proven, large scale tests can be executed. This can involve invoking external contracts and using recovery sites or having equipment delivered. The full IT test will show how systems will interlink, prove resource dependencies and help determine points of consistency for data.
Business Continuity: When business areas understand their plans, and once IT have proven the technology can be restored, then full business continuity tests can occur. These are very high profile and involve personnel from across the company so must be tightly planned. Business users will relocate to their agreed recovery site and use the systems and data presented to them to operate. This will prove that the plans and strategies meet the minimum needs of the business to survive.
Call Out Cascade: These tests allow you to prove your communication systems and demonstrate that by making one single telephone call every member of staff can be contacted in a short period of time. We usually select a midweek evening and allow only three hours, during which we track the progress of all calls. This test takes business continuity out of the office and moves it on from management to include every person who works for your company.
Public Relations: Working with professional journalists you can test your reaction to the Media and learn how to give interviews under pressure. Using television and radio interview techniques you can test, against a scenario, how you would react to your company undergoing a serious incident live on television.
Telecommunications: Being able to communicate to staff, customers and each other is vitally important during any incident and the ability to do this is rarely tested. A test of your ability to transfer telephone calls to another site and then answer and handle these needs to be planned for and executed.
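
As a rough planning aid for the Call Out Cascade test, the sketch below estimates how many levels of calls a cascade needs and the worst-case time before the last person is reached. The fan-out (how many people each person calls), the minutes allowed per call and the function names are illustrative assumptions, not figures taken from our tests, and the model ignores unanswered calls.

```python
def cascade_levels(staff: int, fan_out: int) -> int:
    """Number of call levels needed so that `staff` people are reached.

    Level 0 is the single person who receives the first call; every person
    reached then calls `fan_out` further people (an assumed, simplified model).
    """
    if staff < 1 or fan_out < 1:
        raise ValueError("staff and fan_out must both be at least 1")
    reached = 1        # the person who takes the initial call
    newly_reached = 1  # people reached at the current level
    levels = 0
    while reached < staff:
        newly_reached *= fan_out
        reached += newly_reached
        levels += 1
    return levels


def worst_case_minutes(staff: int, fan_out: int, minutes_per_call: float) -> float:
    """Upper bound on elapsed time, assuming each caller makes calls one after another."""
    return cascade_levels(staff, fan_out) * fan_out * minutes_per_call


if __name__ == "__main__":
    # Hypothetical company: 1,000 staff, each person calls 5 others, 3 minutes per call.
    minutes = worst_case_minutes(1000, 5, 3)
    print(f"Levels: {cascade_levels(1000, 5)}, worst case: {minutes:.0f} minutes")
    print("Fits a three-hour window:", minutes <= 180)
```

Comparing the estimate with the times actually tracked during a test shows where the cascade stalls, for example on out-of-date numbers or people with too many calls to make.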

Define Scope

Traditionally, and somewhat obviously, the definition of the scope of the business continuity programme has always been the first step. However this ensures that business continuity is a project and not business as usual. Normal business is not limited by project scope and nor should business continuity be. Scope at the beginning of a project limits the extent of the work before any assessment is made; it excludes business areas, sites and sometimes countries. Whilst this is sensible in reducing costs and producing a project plan it should be based upon Board level knowledge and sign off, not upon a consultant’s or IT Manager’s decision.

The scope in the new model is defined to ensure that, having already been identified at Board level, all key risks, owners and requirements are agreed and signed off. At this stage the scope can be agreed to focus effort where it is required to reduce exposure in the quickest and most cost-effective manner whilst matching the concurrent streams already actioned.

Risk Analysis

Risk Analysis has been and remains the backbone of business continuity and all operational risk management. Without identifying, quantifying and agreeing risks then the process has little value and less meaning.

The question is though at what areas should risk analysis be aimed?

Risk functions in many companies have been in place for years, monitoring, assessing and controlling the financial risks and exposures a company is exposed to. Most commonly, and possibly incorrectly, the Insurance function is called risk management.

In many cases when a business continuity risk assessment is carried out it is only the physical risks which are assessed; you will find that reports contain comments about the number of distinct electrical supplies to the building or the proximity of a flight path. However risk analysis needs to go further than this.

In the United Kingdom a report on Corporate Governance, known as the Turnbull report, has set the responsibility for the identification and monitoring of risk at the Board level for companies listed on the UK stock exchange. The initial reaction to this by traditional risk management auditors clearly demonstrated the gap between operational risk and financial risk management. Many large independent auditors found the assessment of operational risk was a new idea and a new concept to learn. To the business continuity professional this was the essence of their work for the last 20 years.

This gap in the agreement as to what is included in operational risk, how it is measured and how it should be reported on seems to be becoming more pronounced rather than less. Insurance Risk Managers routinely dominate a meeting of Operational Risk professionals. Their approach is predominantly driven by the need to quantify the risk in pure financial loss and to agree mitigation measures to reduce the financial exposure. They tend to be led by models, theorems and complex equations to determine values. However there is often a gap here between the theoretical approach many of these risk managers adopt and the pragmatic approach many business continuity risk managers take to implementing physical solutions.

Risk analysis must cover all and every area of an organisation that could fail and have an impact upon the future operation or success of that organisation.

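The analysis needs to cover at least the physical environment, the logical (systems) environment, personnel and the nature of the operation.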

The physical review should include the environment in which the business resides, including the local and national situations and infrastructure. The review should look at the nature and fabric of the office or workspace and include the normal methods of operation. It should look at the security systems in place and the history of events that have impacted the organisation and organisations in similar business sectors.

As with all aspects of analysis the logical review can be at many levels, from a review of the Information Technology procedures in place to a full penetration test of the network. Asking if passwords are used is one thing; seeing how many are shared and never replaced is another. The review should be agreed beforehand and if need be expert help brought in from outside. In many cases a review has found hacking or illegal use of systems of which the internal staff have been unaware.

Personnel are rarely reviewed and yet in almost all cases of failure or impact it has been down to human beings being negligent, malicious or making a mistake. We always review where personnel work and what they work with but rarely review them. In this time of downsizing, mergers and uncertainty, staff may have many reasons to put a company at risk. Simple reviews of references, qualifications and credit checks are often not carried out by employers. More reliance on contractors brought in by third parties has increased the risks as company loyalty and length of service are no longer motivators. Ongoing assessment is also required. In most companies reviews of staff are carried out when they are recruited but not again. Look at your own life and see the changes in the last 15 years; do you not think it’s worth a review?

The nature of the operation is also rarely reviewed. What type of business does the company do? Is it seasonal, controlled by weather or trends? Can an illogical emotional reaction by customers affect you? Does an impact in a similar business have a knock-on effect on you? For example the share values of Coach and Bus companies in the UK were affected by the Kings Cross Rail disaster even though they were not involved. Similarly soft drinks share values, across the sector, were affected by the recall made by Coke in the USA. Public image and key staff loss should be looked into, as should dependencies on sole suppliers. These risks are often ignored and yet in many cases have a bigger, more costly and visible impact than a hacker, virus or localised fire.

Reviews are executed by physically checking items; however, using very simple and common models can help to initially prioritise them. The following graph can be used with senior management around a table top to gain an understanding of where they feel effort should be focussed and to identify what attitude to risk the company has:

Incidents, or risks can be grouped by simply identifying what concerns people and then asking them if they consider this is likely to happen on a simple high likelihood or low likelihood scale. The same incident is then “guess-timated” for its impact, high and low being used once again. This simple but effective method quickly shows where effort should be placed and can be used as a review tool, as it demonstrates if the effort being made will result in an impact or likelihood, associated to a risk, being reduced.

If the above method is used before a physical review then the review can be tempered to ensure that focus is placed on risks agreed upon whilst not missing other risks which may need to be entered into the diagram at a later date.
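
The high/low grouping described above can be captured in a few lines so that the output of a table top session is recorded consistently. The sketch below is illustrative only: the `Risk` class, the example risks and, in particular, the ordering of the four quadrants are our assumptions, and senior management may well rank the quadrants differently.

```python
from dataclasses import dataclass
from typing import List

LEVELS = ("low", "high")

# Assumed ordering of the quadrants, keyed by (likelihood, impact), most urgent first.
# Ranking low-likelihood/high-impact above high-likelihood/low-impact is a judgement call.
QUADRANT_PRIORITY = {
    ("high", "high"): 1,
    ("low", "high"): 2,
    ("high", "low"): 3,
    ("low", "low"): 4,
}


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: str  # "high" or "low", as guess-timated around the table
    impact: str      # "high" or "low"

    def __post_init__(self) -> None:
        if self.likelihood not in LEVELS or self.impact not in LEVELS:
            raise ValueError("likelihood and impact must be 'high' or 'low'")


def prioritise(risks: List[Risk]) -> List[Risk]:
    """Return the risks ordered by quadrant; ties keep their original order."""
    return sorted(risks, key=lambda r: QUADRANT_PRIORITY[(r.likelihood, r.impact)])


if __name__ == "__main__":
    # Hypothetical answers gathered from a management workshop.
    workshop = [
        Risk("Localised fire", "low", "high"),
        Risk("Key staff loss", "high", "low"),
        Risk("Sole supplier failure", "high", "high"),
        Risk("Car park flooding", "low", "low"),
    ]
    for risk in prioritise(workshop):
        print(f"{risk.name}: likelihood {risk.likelihood}, impact {risk.impact}")
```

Re-running the same grouping after risk reduction work shows whether a risk has actually moved to a lower quadrant.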

But even after all the most detailed analysis, in depth impact reviews and structured reporting there is still the possibility that the remotest risk may happen—and it could be that one that stops your business. It is because of this that, even though we accept that risk reporting and Insurance are important—business continuity plans must be in place.

It must be remembered that people will have a different interpretation as to what is a “risk” dependent on their position and role within the company. All risks should be challenged by “so what” at all stages. The fact that a company has a virus on its computer systems is actually irrelevant; what is relevant is the loss or compromise of the information on those systems. A company that has no computers but loses its paperwork has potentially the same impact. The actual risk is the loss of information; the mechanisms or media on which this is held are a function to be addressed, not a risk. However the people completing the review, if from the Information Technology departments, may have a very different view on this.

Sometimes a review can take time, not in its execution, but in gaining the trust of the people working with you to complete the review. Who likes to admit that things are wrong, not executed properly or could be improved? It’s not in many people’s nature to acknowledge this. In addition some countries’ cultures do not find admitting failures or weaknesses acceptable and an alternative method of reviewing risk needs to be considered.

The output from the risk assessment will, in most cases, be in an audit style report with a mechanism to track and close down risks identified. The following sub-chapters consider at a high level how this can be achieved.

Implement Risk Reduction Strategies

Risks that have been identified can be accepted, removed, reduced, contained or contingencies can be planned for if they occur. In general business continuity management plans for the contingencies and accepts the fact that no matter how far you go in protecting yourself something can still go wrong. If it can’t then you should also cancel all your insurance, as you will not need that either!

Putting contingencies in place, alternate working methods and disaster recovery solutions have long been the key part of the business continuity strategies. Many companies and advisors have forgotten that the reduction of risk and protection from risk is just as important.

In the next few sections we will argue that the traditional approach to disaster recovery may no longer be appropriate for some modern organisations. The argument is that for more and more organisations disaster recovery is dead and continual operations must be in place.

Immediate

The Off-risk programme, implemented as an immediate and ongoing stream to remove obvious and high impact risks, will reduce exposure and protect a company in the quickest possible way. However the solutions implemented during that phase may not be the best ones or permanent ones; you have to remember that taking things “Off-risk” is an immediate and interim solution only. As such detailed risk analysis will identify what needs to be protected, changed or planned for.

The immediate risk reduction strategies will in general complement the Off-risk programme and due to the level of analysis should go into new areas too. Immediate action is required for risks that are intolerable and for areas where the solution is so simple that it can be implemented without detailed project planning or application for funding.

Organisations should also have something in place that will immediately protect them from a serious traumatic incident. You should ask the question: if our building burns down tonight, what do we do tomorrow? It is not a question of what we do in 6 to 12 months after a long term project and detailed planning, but of what our plan is now. This short-term business continuity strategy will almost always be superseded but something needs to be there immediately.

Short Term

Short-term risk reduction strategies will need some planning and application for funding and may involve bringing together people from various functions to work together to alter areas at risk. There may be an impact on normal working practices and accepted methods of operation by implementing these changes and as such formal change management and acceptance sign off is required. The impact of a slight change, often in the strangest areas, should never be underestimated. The emotional reaction of staff is often disregarded but in two simple examples this was one of the biggest problems.

To reduce the risk of uncontrolled entry to a computer company we implemented a change that reduced the number of doors allowing entrance and egress on the ground floor from four to one. This then meant staff had to walk up to 100 metres further than they used to, to gain entry to the building. The outcry this caused almost led to senior management changing their minds!

In a second case we implemented a rule where personnel would wear security passes at all times so that they were visible and would show the holder’s photograph. Some senior managers refused to wear them and thus the culture of the company overrode the security needs.

The impact on operations, culture and the accepted norm should be considered before any risk is resolved.

Solutions should work in conjunction with culture and personnel needs as often as is possible. For example many banks share passwords on Trading desks and allow traders to never turn their computers off as it could impact their operation. Culturally this is accepted and worked around. Whether the risk is managed correctly here is not the issue; senior management sign off risk to a level they accept.

We are suggesting that risk solutions are not a single block as they have been in traditional models but rather come in waves: some immediate and obvious, some through normal change control methods and project planning, and others as long term plans to alter culture and standard operating procedures.

Long Term

Traditionally the long-term business continuity recovery strategies have been either to look internally for a solution or to turn to a disaster recovery provider. A company either accepted a contract with a disaster recovery provider and entered into an agreement where they held a partial timeshare on a piece of equipment, or they looked into becoming self sufficient and to building a strategy to use alternate sites within their organisation.

Once again most of this way of thinking came from the time when information was stored on, and distributed from mainframes and the time for systems recovery was measured in days not minutes. This approach also almost always assumed that the entire computer system would be lost and would need to be recovered. This way of thinking is still the basis for most strategies and in many large companies is unacceptable and unworkable.

It is time for a re-think.

Let us consider the problems with traditional disaster recovery suppliers and the challenges we hope they are now addressing to bring them into the 21st Century. Some key areas which have changed attitudes:

Internationalism

Many large-scale organisations have operations in Countries around the world. This means they have natural resilience to a problem but it also means that there is never a clean “downtime” in which recovery can be made.

For example if you consider a Bank which operates offices and trades equities in London, Hong Kong and New York, then this is in essence trading 24 hours a day. If a computer system fails, and that system might not even be in the country of operations, then the impact is immediate and highly visible.

A business continuity strategy for an office in New York to react to a fire at 0400hrs may be to call out the Facilities managers then have these form an Initial Response team to consider damage assessment and to invoke recovery procedures. This then could involve a detailed call out cascade being activated. Meanwhile in London the time the fire happened was 0900hrs and systems went down immediately, during the business day with real financial impacts to the Bank.

Whilst New York is invoking a controlled systematic approach to the incident chaos is reigning in London and the people affected are calling any telephone number they can find for their equivalents in the USA. This undermines the controls in New York and takes away the management process.

So who owns the incident? The physical incident has one set of needs whilst the business impact is actually many thousands of miles away. Companies need to plan for this and to have strategies in place, which consider the business as a global entity and are not restricted by physical buildings.
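
A simple way to make this visible in a plan is to translate the time of an incident into local time at every site. The sketch below does this for the New York example above. The site list and the fixed UTC offsets (standard time, ignoring daylight saving), the 08:00 to 18:00 business day and the function names are all simplifying assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

# Assumed fixed offsets from UTC in hours (standard time, no daylight saving).
SITE_UTC_OFFSETS = {"New York": -5, "London": 0, "Hong Kong": 8}


def local_times(origin: str, when: datetime) -> Dict[str, datetime]:
    """Convert a naive local time at `origin` into local time at every site."""
    origin_tz = timezone(timedelta(hours=SITE_UTC_OFFSETS[origin]))
    instant = when.replace(tzinfo=origin_tz)
    return {
        site: instant.astimezone(timezone(timedelta(hours=offset)))
        for site, offset in SITE_UTC_OFFSETS.items()
    }


def in_business_hours(local: datetime, start_hour: int = 8, end_hour: int = 18) -> bool:
    """True if the local time falls inside the assumed business day."""
    return start_hour <= local.hour < end_hour


if __name__ == "__main__":
    # A fire at 04:00 in New York on an arbitrary example date.
    for site, local in local_times("New York", datetime(2001, 1, 15, 4, 0)).items():
        status = "business day" if in_business_hours(local) else "out of hours"
        print(f"{site}: {local:%H:%M} ({status})")
```

Any site shown as being in its business day is a candidate owner of the business impact, even though the physical incident is elsewhere.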

Internationalism also introduces risks brought on by varying cultures and regulatory bodies within Countries. Even within unions such as the EEC there are varying laws on what information can and can’t be taken into other Countries. All of these risks need to be considered in the long-term risk reduction strategy to ensure your company is protected.

24 Hour Operations

The continued growth of 24-hour operations, be they linked to service centres or to an international operation, has meant that the traditional split of computer work between on-line window and batch operations is disappearing. Control of systems used to be clearly defined: the business had the online systems during the day and then these were handed over to Computer Operations for night time batch processing. During this batch processing all transactions were updated, tomorrow’s data prepared and the back-ups of systems taken. This gave recovery plans set points of consistency for data and thus made recovery much more straightforward. Points of consistency are explained earlier in this document within the Information Technology section.

Organisations are now faced with business operations that are available 24 hours a day. This means that data is continually being updated and that backups will rarely be in synchronisation with other systems. In the event of a business interruption it is likely that somewhere within the business there are live transactions being actioned. This brings two sets of problems. Firstly, to what point and with what data do IT recover, and secondly, what does the business do immediately when clients can see it is non-operable? An example of this is 24-hour banking where in many cases the telephone is the only point of contact that clients have with the bank. If the phone stops being answered confidence in the bank’s ability is immediately weakened. Once again traditional planning methods of contain and recover are not going to be acceptable to business managers that seriously want to operate in this market.
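
The first of these problems can be made concrete by comparing the time of the last good backup of each system. The sketch below reports how far apart those backups are and how much data the worst-placed system would lose. The system names, times and function names are hypothetical, and a real assessment would also have to consider transactions in flight between systems.

```python
from datetime import datetime, timedelta
from typing import Dict


def backup_skew(last_backups: Dict[str, datetime]) -> timedelta:
    """Gap between the oldest and the newest 'last good backup' across systems."""
    times = list(last_backups.values())
    return max(times) - min(times)


def worst_data_loss(last_backups: Dict[str, datetime], incident: datetime) -> timedelta:
    """Time between the oldest backup and the incident: the largest window of lost work."""
    return incident - min(last_backups.values())


if __name__ == "__main__":
    # Hypothetical systems and backup times for an incident at 09:00.
    backups = {
        "payments": datetime(2001, 1, 15, 1, 0),
        "customer_records": datetime(2001, 1, 15, 2, 30),
        "call_logging": datetime(2001, 1, 14, 23, 0),
    }
    incident = datetime(2001, 1, 15, 9, 0)
    print("Backups out of step by:", backup_skew(backups))
    print("Worst-case data loss:", worst_data_loss(backups, incident))
```

A non-zero skew means the restored systems will not agree with each other, so the business needs a procedure to reconcile them before it resumes work.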

Internet Business

E-Commerce, E-Markets, E-trade, E-business: if traditionally it’s a form of business then put an E in front of it and you are now cutting edge. The growth in Internet based business facilities is unprecedented and, if the media is to be believed, will be the future of commerce. At this stage in the early 21st Century people are still finding their feet in this world and as many businesses fail spectacularly as succeed. One thing however is clear: if Internet based businesses are to succeed and convince customers, both business and private, to use their services as opposed to traditional proven services then those services have to be functional.

Internet based services are available to almost everybody with access to a computer anywhere in the world. So immediately we are again into a market where at all times for someone it is the middle of the business day. Customers know that if they visit a shop or manufacturer, or order over a telephone, their goods will be delivered, so the Internet service has to be better than that. People who use the Internet are less forgiving and if a site does not respond, takes too long or cannot be found, then immediately a search engine will take them to another. As such E-BCM (like I said, put an E in front of it and invent a service) must ensure that potential downtime is minimised and that recovery is seamless and invisible.

To follow our theme, this cannot be done by restoration. Businesses who want to operate in this market must look at duplication of service, twin hosting and methods for immediate switching. Currently this is possible for major businesses but costly or confusing for the smaller ones. However this must change and a large part of the future of business continuity will be focussed on this type of business and the methods for keeping sites operational.
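
The principle behind twin hosting and immediate switching is that traffic is sent to whichever copy of the service is currently healthy, rather than waiting for a failed copy to be restored. The sketch below shows only that selection step. The host names are placeholders, and the health check is passed in as a function because how a site is actually probed is outside the scope of this document.

```python
from typing import Callable, Optional, Sequence


def choose_host(hosts: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy host in order of preference, or None if all are down."""
    for host in hosts:
        if is_healthy(host):
            return host
    return None


if __name__ == "__main__":
    # Placeholder host names; the primary is simulated as having failed.
    preferred_order = ["primary.example.com", "twin.example.com"]
    failed = {"primary.example.com"}

    active = choose_host(preferred_order, lambda host: host not in failed)
    print("Serving customers from:", active)
```

In practice the switch would be made by network or hosting infrastructure rather than by application code, but the test of the arrangement is the same: fail the primary deliberately and confirm that customers do not notice.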

Call Centres

Within the UK, Call Centres are one of the biggest growth industries. Three interesting factors affect these businesses.

Many industries that have gone down this path very often have no way for clients to contact the business other than through the Call Centre, and as such these operations are essential to the core business.

Call Centres currently tend to be built up around automatic call distribution (ACD) systems and more of them are moving towards Interactive Voice Response (IVR) systems. ACD systems allow calls to be managed as they arrive at the centre and then distribute them in a controlled manner. They can be tailored to allow specific calls to go to special areas or they can look for the next available operator either locally or nationally. All of them allow for statistics to be generated and for management to monitor calls and operators. IVR passes the control back to the client. You are asked to select keys which navigate you to an operator who should be ready to answer your specific query, e.g. “press * for train times”. Both of these tend to be very bespoke to each business and attempts at replicating them are difficult and time consuming.

It is possible to replicate one manufacturer’s ACD onto another, or at least attempt to deliver the level of service through another, but this brings in new problems. Businesses should always consider this:

“Is the moment of disaster the best time to introduce new technology and systems to the business?”

Operators and management will be under pressure to resume operations and looking at even slightly different terminals, headsets and screens will only add to the level of confusion. Add to this IT staff having to restore, and then be responsible for, this new equipment and you have the recipe for a potential second disaster. Traditional planning usually assumes that only a percentage of the normal operation will be restored, however experience has shown that after each major incident the number of incoming telephone calls rises dramatically in the short term. As such consideration should be given to recovering in excess of 100% of Call Centre functionality for the short term following an incident. This goes against normal strategies but can be proven in testing.

If Call Centres are the front face of your business and if you sign service level agreements with clients then how long can you really have your Call Centre non-operational for? Once again traditional methods of stopping operations and recovering later are probably not acceptable.

Client Server Technology

Most business continuity management models are based upon mainframe technology. During the 1980s when most of these models were created computing was centralised, delivered from limited numbers of platforms and these were generally on the same types of technology. Data was stored centrally and in the event of a serious incident most companies assumed that all would be lost.

Client server technology does not meet any of these criteria. Walk into most major organisations’ computer rooms and you will find a multitude of technology; the Y2K experience showed us that most companies did not even know what they had! Platforms will be from many suppliers, configured in various ways and will supply numerous functions. There may be many computer rooms in a building and there may be many buildings with computer rooms. Despite serious security breaches companies still allow servers to be located with the business in non-secure environments.

People have forgotten that in the UK in 1991 most large financial institutions were still using “dumb” terminals and that the growth of client server technology has been rapid, and in some cases seemingly uncontrolled. Whilst this new method of working brings in horrendous management and security issues it has been both good and bad news for the business continuity manager.

Companies are now unlikely to lose all of their systems or all of their data at any one time. Unless the incident is catastrophic and destroys the whole facility then it is likely that some operations will be sustainable. But is this good? If businesses continue to enter information into these systems are they becoming out of synchronisation with the systems lost, are they missing key checks from dependent systems and who has this information? Does the business continuity plan tell business users to stop using systems even if they are functional?

Does IT have the numbers of staff needed to restore what in many cases can be over 100 servers for a company and at the same time rebuild other types of technology? In many cases these skills can be so specific that a technician may be able to rebuild only one type of machine. Traditionally on mainframes we all used to help move tapes from storage silos and load them onto the hardware, now in many cases recovery has to be in a serial fashion not a parallel one.
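
The staffing question can be tested on paper before it is tested for real. The sketch below estimates the elapsed rebuild time when technicians can only work on the platform they know, so each platform is rebuilt in batches while different platforms proceed side by side. The platform names, server counts, staff numbers and hours per server are invented for illustration, and the model ignores shared bottlenecks such as tape drives or network links.

```python
import math
from typing import Dict, NamedTuple


class Platform(NamedTuple):
    servers: int             # machines of this type to rebuild
    technicians: int         # staff able to rebuild this type only
    hours_per_server: float  # assumed rebuild time for one machine


def platform_hours(platform: Platform) -> float:
    """Elapsed hours for one platform: technicians rebuild one server each per batch."""
    if platform.technicians < 1:
        raise ValueError("at least one technician is needed per platform")
    batches = math.ceil(platform.servers / platform.technicians)
    return batches * platform.hours_per_server


def elapsed_hours(platforms: Dict[str, Platform]) -> float:
    """Platforms are rebuilt concurrently, so the slowest one sets the total."""
    return max(platform_hours(p) for p in platforms.values())


if __name__ == "__main__":
    # Hypothetical estate of 100 servers.
    estate = {
        "unix": Platform(servers=60, technicians=3, hours_per_server=4),
        "windows": Platform(servers=40, technicians=2, hours_per_server=3),
    }
    for name, platform in estate.items():
        print(f"{name}: {platform_hours(platform):.0f} hours")
    print(f"Estimated elapsed time: {elapsed_hours(estate):.0f} hours")
```

If the estimate exceeds the recovery timeframe the business has agreed, the options are more cross-trained staff, fewer servers to rebuild or an alternative to rebuilding at all.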

Servers have simplified the methods for disc mirroring and for allowing synchronisation of systems across large distances. Alternate methods to complement standard recovery plans should be investigated and costs justified against loss; this can come from the business impact analysis review, which we will cover later.

Summary

Businesses must ensure that the data they will be given after an incident and the speed at which they will be given it are satisfactory and realistic to current business needs. Once again testing is the simplest method to prove this.

In this new millennium how many businesses can now afford to have an invocation time, a delivery time, a set up time, a rebuild time and then have data that is at least a day out of date? Even using fixed site recovery sites brings in many of these problems and for many businesses traditional disaster recovery services must become a thing of the past. The disaster recovery providers themselves need to drive the next generation of solutions and provide continual service to clients.

Planning

The business continuity plan is, after the implemented strategy, the fundamental output from the whole business continuity management process. This is the element that managers, staff and possibly, customers and suppliers, will use to work their way through an incident.

So what is the business continuity plan?

In most cases it is an A4 size folder of reams of paper hardly ever maintained, rarely looked at and very often not used in an incident. Most plans are there to satisfy auditors and through lack of testing have little buy in or support from the so-called “plan owners”.

External consultants, IT, facilities or a team that has little access to, or understanding of, the business often write plans. These plans are then in many cases given to the business areas and then forgotten about. This may all seem a little harsh but it is based upon experience.

Plans need to be functional, brief and useful. They do not need to be wordy, explanatory or patronising.

Some organisations have tried to move plans away from paper and onto systems. There is merit in this for maintenance and standardisation purposes but in most cases during a time of stress people resort to paper. They are comfortable with it, they can control it and they can scribble on it.

A plan is to be used in a time of turmoil and confusion and as such must be easy to use.

Consider larger fonts.

Think bigger spacing.

The plan does not need to have an introduction section, a how to use section, a history and all the other good documentation standards within it. These aspects of documentation need to be there but make them either an appendix or better still a separate folder. The plan needs page one, point one to be the first instruction. Try it in a test: people will be paging through their plans up to page 14 to find out what they should do and immediately they have lost faith in the plan.

The plan needs to be current, with each page dated, version controlled and numbered, and there needs to be sign off by the plan owner when updates are distributed. Will the plan owner print off a new copy if they receive an email saying the electronic copy has been updated? I doubt it.

We recently started to produce plans based on the old British Army orders book. These were A5 size, so they fit in your hand or your pocket; duplex, so twice as much information can be seen; and waterproof, so in the event a crisis has to be managed outside and not in a controlled office environment the plans continue to be readable. This is a simple idea, costs little and works.

Hand held computers and laptops also work, as long as there is power, understanding of the technology and management believe in scrolling through files as opposed to having things to hand. WAP phones? How many sheets of A5 can you see at a time? Do you then need a second phone to make calls with as you read plans from another? I am sure all of these things will come in the future—someday, but I do not see business management buying in to them until at least 2005. It is ironic that this document is about moving on from traditional things to new concepts but on this one I am afraid paper is king!

Plans can cover many areas and this document highlights only some of the standard plans. The Information Technology team plan was covered earlier in this document, as was the Immediate Response Plan. The following plans are usually activated once that team is in control of the situation.

Facilities & Premises

This is a fundamental team to all plans no matter how large or small the organisation. In most cases their representative will be the first on the scene for any major incident and will initially own the relationship with any emergency services. Most companies have a key holder or on duty maintenance representative who is called out if anything goes wrong. This person will make the first damage assessment and invoke any further call out. It is key that they understand this responsibility.

The following information shows that this team has a number of critical roles and in many cases may not have the resources to handle this. Plans for this team must bear this in mind and need to show a managed way forward. If a serious incident has occurred then this team will have a multitude of duties but will have four major roles:

Liaison with Emergency Services

In most cases, even when there are 24-hour operations on site, the facilities representatives will liaise with the emergency services. It is essential that this relationship is built up in the planning stages and that both parties are clear about who will liaise with whom, what expectations can be set and where control points may be. This is a key role and one that will help to establish the recovery timescales and capabilities at an early stage. The first person on site, having reacted to a call out, will be asked questions about personnel on site and about where key services are. All this information should be to hand and the person handling the liaison should be able to explain it to the emergency services.

Salvage and Security of Site Affected

Once the site is returned to the company and the emergency services allow limited access there is an immediate need to secure it. Experience has shown that in most major incidents there are attempts at looting. The site must be made physically secure to prevent further damage and then the assets within must be protected.

If the incident does not totally destroy a site then there must be attempts made to salvage key information and assets. This salvage can range from computer data to the company seals or from key documents to critical machinery. A plan must be in place as to what is important and as to how this should be removed from the damaged site.

Management of Recovery Site

For many major organisations a site will have been allocated for recovery purposes. This may be in house or under contract. If the incident is significant then there will be a need to have this site operational and the facilities team will be key in making this happen. Demands on their time could range from infrastructure and power through to catering and re-establishing the post room.

Restoration of Affected Site

It is always hoped that the organisation will be able to return to the original site, although in some of the most significant disasters of recent times this has taken a number of years. However in most cases disasters are not of that scale and a return will be possible. The Facilities team will manage the restoration of the original site and the rebuild. They will plan for clean up operations and for contractors to re-construct where necessary.

Public Relations

Almost all serious incidents in modern times have had a common theme:

“It is not how you recover from an incident that is critical, it is how you are perceived to recover from an incident which is critical.”

The power of the media to sway public opinion one way or another should never be underestimated (look at the monitoring of the USA presidential election result in 2000 for an example). As such, the importance of the public relations team in passing information both internally and externally should never be underestimated either. There are so many examples of varying ways of handling the media during an incident, including Exxon, British Midland, Swissair, Townsend Thoresen, Coca-Cola and Perrier, that this lesson should be known by all. However, in plan after plan that we review, public relations is given lip service only.

If public relations are not the first to arrive at an incident then the media will treat whoever is there as the company spokesman. This may be incorrect and against internal policy but it is a fact, so how is this managed? When do public relations arrive, what do they do, what is the company message, when do we use the Chief Executive and when don’t we? Do we need to say anything? All of this and more needs to be planned for and tested.

Internally how do we keep all staff aware of the situation and what message do we give to them? How do we control staff in a crisis situation and stop them or limit them from talking to the media?

The image of an organisation can be enhanced by a traumatic incident (look at Commercial Union in the UK following the bomb in London), or the incident can lead the company to closure.

Personnel

The personnel, or human resources, area within an organisation has a role in the planning and awareness of any business continuity programme in addition to roles during and after an incident.

Personnel can be used to help raise awareness of the need for risk management. They can do this by introducing key components of business continuity into the induction of new employees as well as putting an entry into the employee handbook. They are also responsible for maintaining key contact information on employees and can be used to help put together any call out cascades you wish to implement. If your company has unions then liaison with them can assist in raising the profile of business continuity as a means of protecting jobs and the work environment.

During an incident Personnel can help to co-ordinate any call outs and to manage staff expectations. They can be a central point for communication and can co-ordinate the relocation or reallocation of staff to new areas or tasks. They must have all records to hand and these records must be as accurate as possible. Our experience has shown that in almost all organisations the information on personnel held in the business continuity manager’s plans is more up to date and complete than that held on personnel systems. This tends to happen because the business continuity manager will be updating plans monthly and will have a mandatory rule that managers must supply information. Personnel tend to maintain information annually, with changes collected on a voluntary basis; as such their systems are out of date except for once a year. This is a generalisation, and it is a pity that it is true, but companies would benefit from linking both systems so that the Personnel department’s data is as accurate as the business continuity planner’s.
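
One way to picture the linking suggested above is a simple reconciliation that, for each employee, keeps whichever contact record was updated most recently. This is only a minimal sketch: the record layout, the field names (employee_id, phone, updated) and the function reconcile_contacts are assumptions made for illustration and do not describe any particular personnel or planning system.

```python
from datetime import date


def reconcile_contacts(personnel_records, continuity_records):
    """Return one record per employee, keeping the most recently updated.

    Both arguments are lists of dicts with the hypothetical keys
    'employee_id', 'phone' and 'updated' (a datetime.date).
    """
    merged = {}
    for record in list(personnel_records) + list(continuity_records):
        key = record["employee_id"]
        current = merged.get(key)
        # Keep the newer of the two records held for the same employee.
        if current is None or record["updated"] > current["updated"]:
            merged[key] = record
    return merged


# Illustrative data only: personnel updated annually, plans updated monthly.
personnel = [
    {"employee_id": "E1", "phone": "0100 000001", "updated": date(2000, 1, 10)},
    {"employee_id": "E2", "phone": "0100 000002", "updated": date(2000, 1, 10)},
]
continuity = [
    {"employee_id": "E1", "phone": "0100 999991", "updated": date(2000, 11, 1)},
]

contacts = reconcile_contacts(personnel, continuity)
for employee_id, record in sorted(contacts.items()):
    print(employee_id, record["phone"], record["updated"])
```

In this sketch the newer entry wins regardless of which system it came from, so an employee who appears in only one system is still carried through.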

Post incident, Personnel can assist in bringing staff and management back into the normal working routine. During an incident normal reporting lines and management criteria may get forgotten in a drive to get the job done. As such there is a need to re-establish normality as quickly as possible. Staff may have been working through the nights or staying away from home and this can all lead to new levels of expectation. Many companies plan to bring in counsellors to offer counselling to staff, but few of these have a plan as to why they are doing this or what they are aiming to achieve through it. Many companies believe counselling is important, but having a plan stating that you will bring counsellors into the company is not enough. By contrast, a recent survey has shown that in many cases counselling is counterproductive and serves to continually reinforce an image that the person was successfully getting over. This paper does not say which is right or wrong, merely that all such action should be properly considered when plans are being assembled.

Business Unit Plans

Each and every area of the business must have a plan in place. Even if the plan states that this operation will be suspended for 12 months, there must be a plan. What you do not want during a recovery period is business areas competing for limited resources and some business areas that have not been involved in the planning now making demands. If plans are in place for all areas, and signed off by the management of each area, then expectations have been set and agreed and the recovery can focus on them.

Some business areas will involve the development of complex alternate working methods that will need to be tested. In many cases management will state that an operation will return to manual procedures if computers are not available. However, in many companies the staff in place have never used manual procedures, or the complexity of the computer systems means that the calculations cannot be done manually.

Plans may just be call out cascades and lists of key customers and suppliers to let them know what has happened; in most cases communication is the most important part of a plan. If complex manufacturing machinery has been lost, can the operation continue, or do the plans have to acknowledge that there will be an extended period with no recovery?
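
A call out cascade can be pictured as a tree in which each person reached then calls their own short list of contacts. The following is a minimal sketch under that assumption; the roles, the dictionary layout and the function call_out_order are hypothetical and are there only to show how the order and number of calling rounds could be checked before a plan is signed off.

```python
from collections import deque


def call_out_order(cascade, first_caller):
    """Return (round, caller, contact) tuples for a call-out cascade.

    `cascade` maps each person to the list of people they must call.
    Round 1 is the first caller's own calls, round 2 is the calls made
    by the people reached in round 1, and so on.
    """
    calls = []
    reached = {first_caller}
    queue = deque([(first_caller, 1)])
    while queue:
        caller, round_number = queue.popleft()
        for contact in cascade.get(caller, []):
            if contact in reached:
                continue  # already informed; avoid duplicate calls and loops
            reached.add(contact)
            calls.append((round_number, caller, contact))
            queue.append((contact, round_number + 1))
    return calls


# Hypothetical roles, for illustration only.
cascade = {
    "Duty manager": ["Facilities lead", "Personnel lead"],
    "Facilities lead": ["Key holder"],
    "Personnel lead": ["Team leader A", "Team leader B"],
}

for round_number, caller, contact in call_out_order(cascade, "Duty manager"):
    print(f"Round {round_number}: {caller} calls {contact}")
```

Listing the calls by round makes it easy to see how many rounds are needed before everyone has been told, and which single person failing to answer would cut off the people below them.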

Business plans need to identify key personnel for each area and the critical operational times of the year, month and day, and to establish what really must be done in the immediate aftermath of a failure. The plan should include how the business is going to re-establish data it may have lost, how it will operate at reduced capacity and how it will structure its recovery back to full productivity over time.

Customers

Customers and suppliers failing or being impacted by failure are not areas that many companies have traditionally planned for. The millennium bug made companies acknowledge that failures in these areas could have significant impacts and put plans in place. The fuel protests in the UK in 2000 made companies develop plans for a single incident but it also made them look again at supplier failure.

Companies can work with both customers and suppliers to develop linked plans which identify and cover for key failures.

If a customer has a serious incident and no longer requires your product, where does that leave you? Contractually they may have to pay for the goods, but if production is then impacted due to lack of demand what is your plan?

Single dependency on a key customer has caused companies to fail all over the globe and yet, due to pressure from the customer for exclusivity, it is still common practice. Do you have a plan for that customer terminating the contract?

Alternatively, do you have a plan to inform your key customers that you have had a disaster and that you will be unable to honour contracts? This could be a bank failing to complete transactions or a manufacturer failing to produce goods. In any case a plan needs to be in place which will manage the customers’ expectations and ensure the relationship is still in place once you have recovered operations.

Suppliers

Once again, before the millennium bug, planning for supplier failure had been nominal. However, the fragility of supply networks and the dependency on them were recognised during that period and plans were put in place. Whether these are being maintained is another question.

Some key supplies such as electricity, gas, water and transport are so fundamental that their failure will in most cases cause some impact. However, this can be mitigated and strategies can be developed. The drive to “just in time” operations and the reduction in stockpiling mean that the failure of a supplier to deliver raw materials will halt production in a very short period of time. Service level agreements are fine to use for compensation, in the same way that insurance is, but they do not keep you operating or show you to be a well managed company.

The risk management here lies with the Board of a company, as the costs of implementing solutions, which can change a company’s culture, must be balanced against the risk of a supplier failing. Simplistically though, if one failure would lose you clients then any risk is not acceptable.

Crisis Management Plan

We believe that crisis management planning has been incorrectly used as the title for incident response plans. This may sound like an argument over semantics but it is more than that. The word crisis has emotive undertones and implies that something has occurred that was beyond your management capability, thus a crisis. But this is not the case. The whole point of business continuity management is to plan and put in place strategies for such incidents. As such, when an event occurs we have a planned, managed response to it and we do not find ourselves in a crisis. The image we portray is one of control and professionalism: we have thought this through and we are prepared for it. It is not a crisis.

So what is a crisis management plan for then?

“The incident beyond the plan.”

The crisis management plan is to cater for the scenario or event that we never thought would happen.

As an example:

If you work in a large company and one night the main office, including all of the IT, is destroyed by fire then this is not a crisis. You have plans in place to respond to this and you have a strategy which allows a contract to be invoked to supply space and equipment. You have further plans in place to manage staff, media and customer expectations, and strategies to allow the business to continue. If, however, you invoke your recovery supplier’s contract and are told that everywhere is full and you cannot have any space or systems, then this is a crisis. You no longer have a plan and you are now beyond the areas you have considered. This is when you need a crisis management plan.

Very few companies have one.

Software

Business continuity management is about people and people working through a potentially traumatic event to ensure operations can continue. Large amounts of data will need to be gathered to allow functional plans and solutions to be put in place and this information will need to be managed and manipulated. Each organisation will have to determine how this is best done for them.

“Is there a place for business continuity software in the 21st Century?”

There are on the market well in excess of 30 business continuity planning packages and this excludes packages claiming to do business impact or risk analysis for you.

Disaster recovery planning first started to grow rapidly from a niche area to the mainstream during the 1980s. At this time almost all large companies were using mainframe computers to store their data and very basic front-end processors to produce output. PCs were not commonplace and word processing or database packages were still very new. As such, accessing the data which allowed you to build plans for personnel, software, hardware, suppliers and customers was not easy. Manipulating this data and producing usable documentation was even harder. This created a need for business continuity planning tools.

The first major tools came from the USA, although some did originate from Holland. These tools were primarily relational databases allowing masses of disparate data to be linked to produce documentation for plans. Some were word processor based and others contained templates for completion to allow fast track production of output. As most information was stored and managed centrally on mainframes, the BCP products contained import and export functions to bring in ASCII based information to populate the tables within the databases. These tools were extremely useful, as in most organisations there was no other way to bring all of this information together. However, right from the outset it became obvious that the tools themselves were an overhead and brought an unforeseen new discipline to the business continuity manager. The tools were envisaged as panaceas for the problem of planning and it was imagined that by purchasing a software tool (in most cases an expensive one) the solutions would appear. However, the understanding of the information and the rationalising of the content of the plans still had to be carried out, and software could not bring intelligence to planning. Many software tools were not flexible and led to plans being written to match the software rather than to match the requirements of the users. In addition, in the early days some tools were scenario driven and required users to complete plans for all eventualities rather than focusing on worst case scenarios.

The early 1990s brought a time when almost all users throughout companies were given PCs and associated products. Suddenly data was being held on client server systems; it was accessible, manageable and, through simple databases, could be linked to many other areas. Users became used to quality documentation using colours, font changes and pictures. They became used to producing their own output, and even now in the new millennium many BCP software packages do not offer this flexibility to users.

Many major packages still don’t allow spell checking of output, variety in colour or font or control of the output. This is at a time where packages such as Microsoft Word are known and used by most personnel in companies. Databases such as Oracle and Access mean that people are used to manipulating data and producing reports from the output. Users now find that in many cases the input they supply to the business continuity manager for entry to the specialist software is of a higher quality than the output they receive back.

The software has now become so large that national newspapers advertise jobs for business continuity software specialists, introducing a whole new overhead to a company to maintain a product that uses information widely available elsewhere.

In the 1980s one of the claims made to justify software was that it removed the need to maintain information in many places and introduced a central repository. This was true. However, distributed client server systems now mean that this information is already there and accessible. All that new specialist business continuity software is doing is introducing another icon to PCs for users to be trained in and for network engineers to implement and maintain.

Business continuity software has a place in today’s market but it is competing with products already known to users and costing a fraction of its price. Software must change to deliver to users what they already have or better, and it must be simple and cost effective.

Business Continuity Management Model of Today