How To Power Down Your Data Center
If you are not in the cloud and remain in a legacy data center that was built when you moved into your building, you need to plan to shut it down every few years. This allows for the maintenance required on the aging power and cooling systems. An event of this magnitude may require 30 hours of work, so it should only be planned for a long weekend. Here are the activities you will need to execute.
The power-down date and time (hereinafter the “event”) will be driven by your heating, ventilation, and air-conditioning (HVAC) or electrical team. Once the date is selected, consider booking hotel rooms close to the office for the team members. Having a local hotel room will help expedite pulling people into the office when the work is completed.
If this is your first time executing a shutdown of your data center, allow at least 5 months for planning and organization.
The event should utilize the standard project methodology for project planning. This includes establishing a project charter, a steering committee, and dedicated team members. Establish a weekly meeting with the team members and the executives who supervise the data center environment. Three weeks before the event, increase this meeting to every other day for issue resolution. Data that needs to be exchanged during the program should be placed in SharePoint or another collaborative tool. Please note that a tool hosted in the data center that will be impacted by the event will not work; you need to store your files in a location that the event will not affect. See the previous post, How To Move A Data Center, which describes the project management activities.
Treat this event as a program: it will need multiple project managers for coordination.
One of the first files to set up in your collaboration tool is a sign-up sheet. Each team member that will be needed during the event should fill in the sheet for the role they have and when they will be working during the event. Make this a self-service portal as opposed to you filling in the spreadsheet for every team member. You need to select a tool that will record the detailed steps for each team to execute during the change. Consider using ServiceNow and setting up a change request with tasks for each area.
For the coordination of the event, reserve a large conference room for the duration of the program. Make sure the conference room is large enough to support all of the team members for the tabletop exercises that will occur. You will also need to book other conference rooms for the week prior to the event. These other conference rooms will be used for food distribution and stand-up meetings. Reserving the rooms a week in advance gives you time to organize them and test all of the equipment. The conference rooms should contain the following equipment: a quality conference room phone, a projector for display from a computer, chairs, network connectivity (through a broadband connection or cables) for everyone who can be in the room, and dual monitors for the people sitting in the room permanently.
During the event, one of the conference rooms will be the location where the leads for all of your teams should sit. In the room you will need the team leads representing compute, storage, network, firewall, security, operating system and middleware software to be present. This will be “Mission Control”. (Editor’s note: I love aviation and never got to be a part of it, but I often refer to coordinating events in the context of aviation.)
Resources
The following resources will be required for this effort:
1. Project manager
2. Dedicated database person that can just enter the data and maintain
the data for the servers in the database that you create.
3. Middleware software tools point of contact
4. Project Manager for Application Testing
a) The coordination of the application testing is a large effort.
b) You should assign a project manager that just focuses on
coordinating all of the activities for the application test.
5. Database Resources to include SQL, Oracle, DB2
a) It is recommended that you have a database person assigned to
each environment for coordination of work in each of the environments.
b) This person should be seated in the operations center with the
middleware team
6. Presentation Creator
a) You need a presentation person that can just take the data and
make various executive presentations that are due on a bi-weekly basis.
b) During this exercise, you will have a lot of detailed data, but none of it
will be presented in a summary format.
c) We spent hours just creating presentations that could explain the
plan from the detailed data. Factor milestone presentations into
future plans.
7. Food Committee chair
a) The coordination of food is a large effort and should be completed
early.
8. Unix Team Members
a) In an environment that maintains Unix Global Zones, you need to
have a large number of team members on hand who can help restart
these zones.
9. Desktop support should be present during the application testing
a) During this time they should test:
(1) Printers
(2) E-Copiers
Help Desk
A temporary help desk will need to be established to receive reports of application testing issues and successes during the event. A dedicated number and an automated call distributor queue need to be established for reporting issues. The help desk team should use ServiceNow or an equivalent help desk tool to track the issues. If you are unable to set up an automated call distributor queue, then build dedicated phone lines that each line of business or each component area can use to report issues. These phone lines should be available for the entire duration of the event. Typical hours could be 7 AM to 11 PM.
During the event, you will need to relocate your help desk out of the building that is having the event and into another building that has power and network connectivity. This connectivity needs to reach your secondary data center, since the primary data center will be offline. Testing at the secondary location should include the ability to access email and the internet. The email test should confirm that users at the temporary building can reach an email service that is itself hosted outside the primary data center.
Once the temporary building is established, the telephone lines need to be tested. You will need to redirect your 800# from the primary building to the secondary building. Time will need to be built into the project for the telecom team to build the phones and deploy them at the new location. On the day of the event, the task will need to be executed to move the 800# from the primary location to the secondary location.
Prior to relocating your teams to the secondary building, the elevator and parking garage need to be tested during the hours when the teams will be working. Do not have the team test the elevator and parking garage on a weekday if they will be working nights and weekends. They need to test during the times when they will actually use the building. This is very important for access to the building and to the areas in your office.
Food
For a weekend event consider the following schedule:
1. Friday of the event: Nothing, but you may want to consider lunch and dinner
2. Saturday: Breakfast
3. Sunday: Breakfast, Lunch, Dinner
C. The following items should be executed when you need to provide
food:
1. Start a food committee early in the process.
2. You need to know the numbers of people and what day they will work
earlier in the process. (If your resources are in a Listview on SharePoint
documenting their work schedule, it allows the food committee to review the
list daily for changes.)
a) The numbers can change quickly, and a change in headcount can drastically change the quantity of food required. Consider using an online registration form for food.
3. Count everyone who will be in the building during the event in the food order.
This should include people who have a normal shift even if they are not part of the project. For example, the regular Help Desk team that works on Saturday also needs to be included in the food count. Your cleaning team needs to be counted as well!
4. The food delivery needs to be scheduled 1 hour ahead of the time when
people are to eat.
a) For example, if you want to eat at 6 PM, then tell the caterer to
be there at 5 PM.
5. Plan for vegetarian and gluten-free options.
D. Notify Building maintenance early in the process for the cleaning
that will need to take place due to the food consumption over the
weekend. This notification should include the location of the food and
the Operation Center conference rooms.
E. When the food arrives, you need to check the order against the
delivery. Often your condiments are forgotten.
F. The project manager will not have time to deal with the food. You
need dedicated staff for the food committee.
G. Setup
1. Food should be positioned away from the Operations center on a
different floor.
2. On the floor where the food is located, signs should be created that
indicate how to get to the food.
3. You should consider presenting the food on the table in a “creative
format”.
a) Less than stellar quality is made up for when the presentation is
eye-pleasing.
b) For example, arrange the bottles of water in a pyramid on the table
Schedule
Building the schedule for the shutdown requires meeting with all of your areas, likely weekly; the groups involved are listed under “Groups that need coordination for the shutdown”. Consider assigning a project manager to each group for coordination. The project schedule needs to specifically indicate who is doing what task. A user should be able to take their tasks and read them verbatim to know what to do. Provide that data in a format other than Microsoft Project: users do not have Microsoft Project and often do not look at the activities. Nor can you simply post the data on SharePoint and expect them to look through it to determine what they should do.
Starting three weeks prior to the event, large group meetings were held every other day with the infrastructure teams to walk through the activities on the schedule that each area would perform. The meetings were conducted in large conference rooms. Please note, the conference rooms should be booked in the very first week of the project. This can be done because you know when the event will occur and can count back three weeks from the implementation date.
Develop 3 schedules early on
You need three schedules for your event: (1) Overall project schedule, (2) Shutdown schedule, and (3) Startup schedule. Immediately list the activities on paper; even if they are wrong, this will start the discussion of what needs to go where in the sequence. You need to show the list of servers by group and the time each group gets shut down. If you set up a database, you can use that database (for example, Microsoft Access) to build the schedule of when servers get shut down.
When you are launching the project, decide whether to focus the meetings on the technical items or to get a schedule down together and keep refining it as missing activities surface. If you focus on the technical items, you take time away from the hour-by-hour schedule you need. Focus on the detailed hour-by-hour schedule early in the process; if you develop it only two weeks prior to the event, it is almost too late.
Groups that need coordination for the shutdown
- Storage
- Windows
- Linux
- Networking
- Firewall
- Non-production, i.e. Lab Administrators
- Mainframe
- Security
- Team members that manage severity one, system down issues.
- Help Desk
- Business representatives for each application impacted.
- Database Administration Team
- Middleware team that supports Business Objects, InfoSphere, WebSphere, WebLogic
- The customers that use the applications
Four weeks prior to the event you will conduct a tabletop exercise for all team members to walk through the steps that will occur. This will fully utilize the conference room that you are using for this program.
Two weeks prior to the event, a meeting with the impacted stakeholders needs to be conducted in order to confirm that all parties are ready for the event.
As part of planning this event, analyze who will be in the building when it occurs. Often your data center location has a number of support team members who are performing work during the event. If they need to stay in the building, then they will need connectivity to another location, usually through a VPN connection. In order to support this, you need to order mobile broadband cards for your team members. These cards should be ordered months in advance and tested.
Six months prior to the actual event, you need to identify what services will be available during the outage. For many organizations, communication in the form of email is imperative. After email, organizations may want to prioritize connectivity to the Internet and then access to the corporate home page. The corporate home page is important so outside people can learn the status of the activity. Once you have identified the services you will need, a preliminary isolation event is required. This event will isolate the data center that will undergo the work and demonstrate that your identified services can work from another location. The isolation can be performed by severing the connectivity between your locations. During this isolation event, you should prove that all of the identified applications work from another location, especially communication resources. You may discover that some services still depend on the isolated data center. You will need to use the remaining time to fix these issues. Once the test is successful, a freeze must be put in place for any application changes that would impact the event.
The schedule for the shutdown should include shutting down the non-production and labs as early as possible. If your event is on a weekend, consider shutting down the non-production equipment the week before the event. This may include furloughing the developers or lab personnel for the rest of the week as they will have no equipment to use.
For the production equipment shutdown, commence the shutdown on Saturday morning and complete it by 12 PM (noon). Deliver the data center in the shutdown state to your HVAC/electrical team and give them 30 hours to perform their work. This will deliver the data center back to the technical team on Sunday at 6 PM. Commence the power-on of the production equipment on Sunday at 6 PM. Conduct the user testing starting on Monday, which will be a holiday. The problem with this schedule is the all-night work your teams will need to perform to bring the environment up starting at 6 PM. Turning on the equipment will require 12 hours of work, so it finishes around 6 AM on Monday. Therefore, when the users come in on Monday and report issues, you have nobody rested left to work on them. To mitigate this, split the team members between an overnight shift and a day shift so that some resources are still coherent. If this is not possible, then give your HVAC/electrical team only 20 hours to perform the work. This would require that the data center is back in your possession on Sunday at 8 AM.
The testing that will occur for the event needs to be decided in advance. Because you are not changing anything in the applications, you might consider a limited test of the applications once they are powered back on. The team that will do the testing needs to be put on notice that they will be working on a holiday weekend or other off-hours time to support the test. Key items to consider in establishing the test plan:
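The clock arithmetic in this schedule is easy to get wrong, so it can help to compute it rather than count on your fingers. The following is a minimal sketch in Python. The calendar date and the function name `handback_plan` are assumptions made up for the example; the noon handover, the 30-hour and 20-hour maintenance windows, and the 12-hour power-on come from the schedule described above.

```python
from datetime import datetime, timedelta

# Hypothetical date for illustration only: the Saturday of a long weekend.
SHUTDOWN_COMPLETE = datetime(2024, 8, 31, 12, 0)  # Saturday, 12 PM
POWER_ON_HOURS = 12  # time needed to bring the equipment back up


def handback_plan(maintenance_hours):
    """Return (handback, power_on_done) for a given maintenance window."""
    handback = SHUTDOWN_COMPLETE + timedelta(hours=maintenance_hours)
    power_on_done = handback + timedelta(hours=POWER_ON_HOURS)
    return handback, power_on_done


for hours in (30, 20):
    handback, done = handback_plan(hours)
    print(f"{hours}h window: handback {handback:%A %I:%M %p}, "
          f"power-on complete {done:%A %I:%M %p}")
```

With the 30-hour window the power-on runs through the night into Monday morning; with the 20-hour window it completes on Sunday evening, which leaves the team time to rest before users arrive.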
- Assign a project manager to oversee the testing component and ask the following questions:
- Will all applications be tested?
- Determine who the testers are. Use the organizational chart to coordinate with the product manager for each application to arrange testing. The tester may be referred to as the end-user.
- Build a list that everyone can see to document who will perform the testing. The list should use the email address of the person that will do the testing. Do not list a first name only on the chart.
- The testing needs to identify any dependent applications. For example, perhaps your payroll system depends on WorkDay or some other application. In that case you need to test WorkDay with the payroll system.
- Determine if you will test all of your environments. You may have a Development, Quality Assurance, SystemTest and Production environment. Each environment needs a plan for testing.
- Determine the communication method to tell your testers when an environment becomes available for them to test.
Inventory
Prior to the shutdown event, a complete inventory of the location needs to be executed. You need to set up an inventory database to record the following components about each system. Knowing what you have allows the team members to set up a plan to power off and power on every device.
- You need the physical information about every item in the data center at the physical and logical layer. You need to document if the device will be turned off during the event.
- Determine what “layer” this device supports. Is it part of the web, app, or DB layer?
- What does the device look like when it is turned off? Is it showing a power light, but in a different color?
- Who is responsible to power off the device? Who hits the button?
- What are the applications that run on the device?
- Document whether all components and peripherals on the device are working prior to shutdown. For example, if you have multiple network cards in a device, you need to confirm that all cards are working prior to the shutdown. Often you will find cards that were already not working before the shutdown.
- Take all of the devices and assign a shutdown time to each.
Once the inventory is complete, it determines the power-off and power-on durations. Put this in a schedule so you can see when each device will be shut down.
Closets
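Here is a minimal sketch of what such an inventory database could look like, using SQLite from Python's standard library as a stand-in for whatever database you choose. The table names, column names, and the sample row are assumptions derived from the inventory questions above; they are not a prescribed schema.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative, derived from
# the inventory questions listed above.
SCHEMA = """
CREATE TABLE device (
    device_id        INTEGER PRIMARY KEY,
    hostname         TEXT NOT NULL UNIQUE,
    location         TEXT,             -- rack, room, or closet
    layer            TEXT,             -- web, app, or db
    powered_off      INTEGER NOT NULL, -- 1 if turned off during the event
    off_appearance   TEXT,             -- what the device looks like when off
    power_off_owner  TEXT,             -- who hits the button
    known_faults     TEXT,             -- e.g. a network card already failing
    shutdown_time    TEXT              -- ISO 8601 timestamp
);
CREATE TABLE device_application (
    device_id    INTEGER NOT NULL REFERENCES device(device_id),
    application  TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO device (hostname, location, layer, powered_off, "
    "off_appearance, power_off_owner, known_faults, shutdown_time) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("db01", "Rack 4", "db", 1, "amber power light", "dba@example.com",
     "NIC 2 down before shutdown", "2024-08-31T08:00"),
)
conn.execute("INSERT INTO device_application VALUES (1, 'Payroll')")

# Shutdown schedule: every device that will be powered off, in time order.
schedule = conn.execute(
    "SELECT shutdown_time, hostname, power_off_owner FROM device "
    "WHERE powered_off = 1 ORDER BY shutdown_time"
).fetchall()
print(schedule)
```

Keeping applications in a separate table lets one device carry several applications, which is what you need later when you map physical devices to the applications they support.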
During this activity, you need to have access to all closets that maintain impacted computer and network equipment. Conduct a survey of the closets and make sure you have keys to each closet and can open them.
Shutdown Order
The order of shutdown traditionally follows this model:
- Database
- Application
- Web
- Infrastructure, i.e. Active Directory Servers
- Storage
- Networking
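If your inventory records the layer for each device, the order above can be applied mechanically. The sketch below is one way to do that in Python. The device names are hypothetical, and treating the startup order as the exact reverse of the shutdown order is an assumption, not something stated above.

```python
# Layer order taken from the list above; the startup order being the exact
# reverse is an assumption, not something the text states.
SHUTDOWN_ORDER = ["database", "application", "web",
                  "infrastructure", "storage", "networking"]
RANK = {layer: i for i, layer in enumerate(SHUTDOWN_ORDER)}


def shutdown_sequence(devices):
    """Sort (hostname, layer) pairs into shutdown order."""
    return sorted(devices, key=lambda d: RANK[d[1]])


def startup_sequence(devices):
    """Assumed startup order: the shutdown sequence reversed."""
    return list(reversed(shutdown_sequence(devices)))


# Hypothetical devices for illustration.
devices = [("core-sw1", "networking"), ("web01", "web"),
           ("ora01", "database"), ("ad01", "infrastructure"),
           ("san01", "storage"), ("app01", "application")]

for host, layer in shutdown_sequence(devices):
    print(f"{layer:15} {host}")
```

A device whose layer is missing from `SHUTDOWN_ORDER` raises a `KeyError`, which is deliberate here: an unclassified device is an inventory gap you want to find before the event.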
Physical to Logical Mapping
Building the physical to logical mapping creates the dependency diagram that you will need for this event. This allows you to know what server supports what application. This mapping also builds the schedule for the power off and power on for the devices.
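As a rough illustration, the mapping can be kept as a simple lookup in both directions. The sketch below assumes you start from a list of which servers each application runs on; the application and server names are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping of applications to the servers they run on.
app_to_servers = {
    "Payroll": ["app01", "ora01"],
    "Intranet": ["web01", "app01"],
}


def invert(mapping):
    """Build the server -> applications view from application -> servers."""
    result = defaultdict(list)
    for app, servers in mapping.items():
        for server in servers:
            result[server].append(app)
    return dict(result)


server_to_apps = invert(app_to_servers)
# Which applications are affected when app01 is powered off?
print(server_to_apps["app01"])
```

The application-to-server view tells testers what must be up before they can test; the server-to-application view tells the person powering off a device what they are about to take down.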
Power On/Power Off Activities
Utilizing the completed inventory, you will now build a schedule for the power on and power off of each device. You may want to group the devices in categories and assign a time to them for startup and shutdown.
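One simple way to assign times to categories is to run the groups back to back from a known start time. The sketch below assumes each group has an estimated duration; the group names, durations, and start time are hypothetical values for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical groups and durations (minutes); the start time is illustrative.
groups = [("database", 60), ("application", 45), ("web", 30)]


def assign_times(start, groups):
    """Give each group a start time, running the groups back to back."""
    plan, current = [], start
    for name, minutes in groups:
        plan.append((name, current))
        current += timedelta(minutes=minutes)
    return plan, current


plan, finished = assign_times(datetime(2024, 8, 31, 8, 0), groups)
for name, begins in plan:
    print(f"{begins:%H:%M} {name}")
print(f"{finished:%H:%M} complete")
```

If the computed finish time runs past your handover deadline, you know early that some groups need to be shut down in parallel or started sooner.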
Communication
During the event, you will have every member of your technical team interested in the status of the exercise. To avoid leaving people off your emails, build a distribution list that people can join by submitting a request form. You want to automate this process, as it can be time-consuming and you do not want your project managers consumed with this administrative activity. The closer you get to the event, the more people will come out of the woodwork wanting to be added to the list. Let the people who are just “waking up” use a process to get on the list. The initial list should be built from the resource list.
Any list of people you collect should identify each person by email address and mobile phone number. Do not collect first names alone; each entry must be uniquely identifiable. Collecting the email address allows you to quickly populate an email distribution list. Build consistency into the fields to minimize data corrections. For example, if you just have an Excel document with a column “Email Address”, add validation to that column so each entry must look like something@something.something. Do not allow people to enter their name in free form, as you will get a variety of names that you will have to look up in the Global Address List.
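If the sign-up data ends up in a file you can process, the same something@something.something rule can be checked in a few lines. This is a minimal sketch: the pattern is a loose sanity check rather than a full email validator, and the function name and sample entries are hypothetical.

```python
import re

# Loose pattern matching the "something@something.something" rule above;
# it is a sanity check, not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def build_distribution_list(entries):
    """Split raw sign-up entries into valid addresses and rejects."""
    valid, rejected = [], []
    for entry in entries:
        value = entry.strip().lower()
        if EMAIL_RE.match(value):
            if value not in valid:  # drop duplicates, keep first occurrence
                valid.append(value)
        else:
            rejected.append(entry)
    return valid, rejected


# Hypothetical sign-up entries.
valid, rejected = build_distribution_list(
    ["Pat.Lee@example.com", "pat.lee@example.com ", "Chris", "sam@example"])
print("; ".join(valid))
print(rejected)
```

The rejected entries are the ones somebody would otherwise have to chase down by hand, so it is worth running a check like this while people are still signing up.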
The external communication to your entire company needs to start early in the process. Once you know this event is happening, send a corporate-wide announcement with the dates and continue to market this activity in all of your meetings going forward.
The first announcement should include:
- Dates of the event
- The objectives of the event
The second announcement should include:
- The actual testing activities the user will need to do
- The impact that the regular users will need to prepare for
Writing the announcements may need the subject matter experts to assist in the process.
Email Communication During the Event
During the event, you will want to use email to record the status of the application from the testers. (Please note if you have another robust system then this section does not apply.)
For communication about the applications, try to group them and set up a mailbox for each group. For example, email, mobile phones, and desktop applications could be grouped into one mailbox. When those testers have completed their testing, they send an email to the mailbox. The dashboard operator can then retrieve the status of that group from the mailbox. The setup of the mailboxes should occur months in advance, and the people who need to see messages in a mailbox should test their access.
Corporate Web Site
During your data center shutdown, you may lose applications that your external customers use. This will require using your external web site to alert them to the change. This is similar to the message you may often see at Salesforce.com or your bank. The creation of this message needs to involve the team responsible for posting messages on the external web site. This will involve hours to create the message and get it approved.
Part of the process for using the external message on the web site should include confirmation that the external web site will be available during the event. If your primary web site is in the data center that will be shut down, then you need to find a location where the web site will operate from. Testing the web site should be part of your initial isolation test. The test script should include how to send the data to update the web page and confirmation that the web page is visible for outside parties. See the section titled: “Groups that need coordination for the shutdown”
The distribution of an external message could require the execution of several standard operating procedures. During the initial meeting with the external-facing team, identify how many standard operating procedures they will need to follow in order to create, approve, and distribute the message. Furthermore, the devices used by the external web-facing team need to be identified for the shutdown event. If the external team has not been part of such an event before, they may need to develop procedures to shut down their devices. Creating these procedures could take months. The procedures should cover how to suspend monitoring of the web site to reduce false-positive alerts.
The distribution and coordination of the web site update should be assigned to your communications manager. This is a lengthy project and requires dedicated time.
Schedule Updates During the Event
Once the event starts, updates will be required on a regular basis. Typically, 2 hours is an interval in which enough work happens to generate a meaningful discussion. Updates should be provided from the moment the event starts until it ends. This will require a rotation of 3 people to cover 24x7. Consider hiring somebody in Hawaii to perform the overnight shift, or at a minimum use a follow-the-sun model. If you do not have people who work in different time zones or overnight, then don’t promise overnight status updates.
The update should include content about all of the activities that have created this event. For example, if you shut down your data center in order to upgrade an air conditioner, then your update needs to include content about that activity. Once the work that prompted the shutdown is done, the update will shift to your technical components and their milestones.
The suggested process for providing the update includes the following components:
- Dedicate a team of three people who can cover 24x7 shifts during the entire event. This will be your communications team. You may want to use people in different time zones. However, if you need the people doing the updates to learn first hand about the status, they may need to be on site.
- Establish a due date for the people managing specific activities to provide their update. This should be at least 30 minutes prior to the official communication. For example, if you have established the official communication schedule for a 2 PM distribution, then the people managing their specific activities will need to send their update by 1:30 PM. This leaves time for the communications team to format and clarify any data in the report.
- Establish the communication method for the update. If you will use Email, confirm that Email will be available. If you will distribute via voice, confirm the bridge line that people will use to receive the update.
- Confirm the format for the note. Will it have particular sections? Who is on the distribution list?
- Determine the pertinent data required in the report. Do people want to know more about the development environment? Do they want to know more about the production environment? You need to determine what the majority is looking for in the report.
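The cadence described above (an update every 2 hours, with input due 30 minutes before each one) can be laid out ahead of time so every contributor knows their deadlines. The sketch below is a minimal illustration; the event window and the function name `update_schedule` are assumptions for the example.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(hours=2)     # cadence suggested above
INPUT_LEAD_TIME = timedelta(minutes=30)  # team input due before each update


def update_schedule(start, end):
    """Yield (input_due, publish_at) pairs from start through end."""
    publish_at = start
    while publish_at <= end:
        yield publish_at - INPUT_LEAD_TIME, publish_at
        publish_at += UPDATE_INTERVAL


# Hypothetical event window for illustration.
slots = list(update_schedule(datetime(2024, 8, 31, 12, 0),
                             datetime(2024, 8, 31, 18, 0)))
for due, publish in slots:
    print(f"input due {due:%H:%M}, update sent {publish:%H:%M}")
```

Publishing this list in advance also makes it obvious how many overnight slots you are committing to, which helps when deciding whether you can staff them.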
Dashboard status
If you have a team of programmers, consider building a real-time dashboard to show systems coming online. The characteristics of the Dashboard would include:
- The dashboard would show the list of applications that are impacted. The list on the dashboard should match the list that you have agreed to test. If you list every impacted application on your dashboard but do not test all of them, people will ask about the status of the untested applications, which causes confusion.
- The team member managing the dashboard would need (1) a workstation and (2) 27” monitors. The workstation should be tested two weeks ahead of time to make sure it meets performance standards and that the login works correctly.
- The dashboard components would include the following listed below. Keep in mind, you need a dedicated person to chase all of the status updates. This is not a job for the project manager.
- A list of all the applications was put in Excel and a column was added to reflect the testing status, for a pass or fail or other status
- The application team would call into a bridge line, use the chat window, or email a distribution list to report status to the dedicated dashboard operator. When the application team reported success, the Excel spreadsheet was updated to show the status. When you set up your reporting structure for the applications, make sure your dashboard operator does not need to look in multiple mailboxes for the status.
- Once the spreadsheet was updated, this would update a PowerPoint chart that would show the stats of the applications. This PowerPoint could be made available on the Intranet or a file share for download. If you are going to allow for a download, consider putting it there as a PDF file format so nobody changes it.
- Please note: The dashboard should only show the applications that are going to be tested. Your data center shutdown involves a lot of applications you may not test. Don’t display things that will not be tested. If you do display all applications, that will skew your completeness percentages: you might show only 33% of the total applications tested when, measured against the applications actually being tested, you are at 100%.
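The effect of the denominator on the completeness percentage is easy to show with numbers. The counts below are hypothetical and chosen only to reproduce the 33% versus 100% example.

```python
def completion_percent(passed, total):
    """Percentage of applications that have passed testing."""
    return round(100 * passed / total) if total else 0


# Hypothetical counts: 90 impacted applications, 30 selected for testing,
# and all 30 have passed.
impacted, in_test_plan, passed = 90, 30, 30

print(completion_percent(passed, impacted))      # measured against everything
print(completion_percent(passed, in_test_plan))  # measured against the plan
```

The same 30 passing applications read as roughly a third done or fully done depending on which list the dashboard uses, so pick the test plan as the denominator and state it on the dashboard.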
Hotel Rooms
Shutting down a data center in order to make improvements requires schedule flexibility for bringing it back online. The construction team could finish early, in which case you will want to start right away. If you sequester your team at a local hotel, you can bring them back to the office to start immediately, as opposed to calling everyone and waiting for them to drive in. Reserve the hotel rooms as one of the first tasks in the project. You will need to reserve a block of rooms, as people will sign themselves up and take themselves off the list.
For the Hotel Room Signup, use the SharePoint site for team members to indicate if they need a hotel room and what nights they need to stay in the hotel. This is a self-service reservation system. You do not want to call every team member and ask them their hotel intentions.
You should plan that every team member that is involved in the shutdown or startup of the applications will need a hotel room.
End-User Testing
Due to the volume of applications that you are bringing back online, you need a system for end-users to report issues and for routing those issues to the people who can fix them.
Establish application groups and assign each group a bridge number. Distribute the bridge number to the application lead. The application lead will collect the issues and dial into the bridge to report the issues. The call operator will log the issue, (ServiceNow is your friend in this situation), and then bridge the call into another group that can assess the issue technically. The risk with this plan would be one queue having a long hold time and other queues that are empty. You can solve this problem by establishing an Automated Call Distributor queue. You can give out one number to all application groups, they call one number and then you can have a team of people that can answer the call and align the required resources.
Due to the length of the event, you need a team of call operators who can work multiple shifts. The shifts should run from 7 AM to 11 PM. You will need to develop a turnover document for handing off from one shift to the next. If you are using ServiceNow, then you can run reports on the queue. Using ServiceNow also removes the need for dedicated mailboxes and for people assigned to monitor them.
Vendor support/Professional Services
Due to the amount of equipment that will be powered down and back on, you need to plan to have spare parts available for the equipment. The probability of a failed hard drive or power supply increases when you are shutting off all of your equipment. In order to mitigate this risk, hire professional services from your leading vendors. This should include your compute, storage, and network vendor. For many companies, this includes Hewlett Packard, DellEMC, Cisco and Juniper.
When an issue is discovered that is hardware-related, assign the task to the professional services representative. This representative will use their process for resolution. The manufacturer may open an internal service request with their team and they will track the issue.
The use of professional services will require a statement of work (SOW). This SOW should include 24x7 coverage. Developing the statement of work takes several revisions. Plan on a daily call with the vendors to confirm the content in the SOW. Find an online system where both parties can share a single document. Perhaps Google Docs is a solution for your organization. You need to minimize the version control insanity that comes from the document traveling back and forth between various parties.
Prior to setting up the professional services for your equipment, you need to provide a complete list of the equipment by Serial Number to the vendor. The vendor will take this data and confirm if the items are covered under an existing support contract. This will add duration to the creation of the professional services engagement and will add cost. Due to the length of the event, the SOW should specify that your vendors should plan to have three representatives on-site throughout the event to provide 24X7 coverage.
If you have devices that are not covered on a support contract, you will need to add them to the contract. If you don’t want them covered, then you need to have all interested parties confirm that they will pay for time and materials for the devices that are not on a support contract.
The vendors providing support may require specified hours for their time on site. This can be challenging if the repairs end early and you commence the activity to turn on all of your compute, storage, and network equipment ahead of schedule. You don’t want to pay the vendors for sitting on-site when devices are not coming online. However, you want them to be readily available when the power-up does happen. To mitigate this risk, plan on compensating them for the time that they are placed “on-call” as opposed to the time they will spend sitting on site.
If you do not hire professional services, then your team will be responsible for fixing the hardware issues. This could include replacing failed hard drives or power supplies. Professional Services are a key component in successfully powering down your data center.
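The coverage check described above amounts to comparing your inventory against the serial numbers the vendor confirms are on contract. Below is a minimal sketch of that comparison. The inventory records, serial numbers, and the `split_by_coverage` helper are all hypothetical and only illustrate the idea; your vendor will return the data in its own format.

```python
def split_by_coverage(inventory, covered_serials):
    """Partition devices into (covered, uncovered) by serial number."""
    # Normalize case and whitespace so "sn-1002 " matches "SN-1002".
    covered_set = {s.strip().upper() for s in covered_serials}
    covered, uncovered = [], []
    for device in inventory:
        serial = device["serial"].strip().upper()
        if serial in covered_set:
            covered.append(device)
        else:
            uncovered.append(device)
    return covered, uncovered


# Hypothetical inventory and vendor response, for illustration only.
inventory = [
    {"serial": "SN-1001", "type": "storage array"},
    {"serial": "sn-1002 ", "type": "network switch"},
    {"serial": "SN-1003", "type": "server"},
]
vendor_covered = ["SN-1001", "SN-1002"]

covered, uncovered = split_by_coverage(inventory, vendor_covered)
for device in uncovered:
    print(f"Not on contract: {device['serial']} ({device['type']})")
```

The uncovered list is what you would take to the interested parties: either add those devices to the contract or get agreement to pay time and materials.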
Testing Peripherals (Printers etc.)
The equipment you power down sits within a confined space, but devices elsewhere in your environment, like printers, depend on it. A listing of printers and devices in locked cabinets needs to be on the inventory list. Your test plan needs to include functionality testing of all of the features on each printer. This could include scanning, email functions and standard printing.
Devices are often in conference rooms and should be included in the test plan.
Unix Global Zone Challenges
For those environments that maintain “Zones” in their Unix environment, you need to plan out the shutdown and startup of the “Zones”. Zones can take hours to shut down and start up. Plan this sequence with the Unix team and make sure both sides understand it.
Backup / Replication
Prior to shutting down the data center, you need to confirm that all of your backup jobs and replication jobs have been completed. If the regular backup jobs will not finish in time, your teams will need to create special backup jobs to accommodate the shutdown schedule. Creating and testing new backup jobs will take months. Early in the project, obtain the list of backup and replication jobs and their finish times. The job listing should include the type of job (full, incremental or differential) and the duration. The list should note the resource group that is responsible for each backup job. You may have Mainframe, SQL, Oracle, Linux, Windows or Apple backup jobs that are controlled by different groups. The backups need to include the completion of the database backups.
Database backups are performed using the native backup tools. For example, SQL Server produces a dump file and Oracle produces RMAN backups. Once the file is created, your backup software will pick up the file and send it to the backup media. If you are going to adjust the backup jobs, you will need to coordinate who creates the new database file and who will adjust the program that writes the file to the backup media.
In summary, your backup tasks should include:
- Assign a dedicated team member that will work with the backup activities to establish the schedule to support the shutdown.
- Obtain the list of the backup jobs and replication jobs from the server and database team in order for them to review the jobs and completion time.
- If new backup jobs are required, coordinate the creation and testing of the new jobs to make sure all data is backed up prior to the shutdown.
- Add to your risk list an entry that identifies the risk that backup routines may not finish and leave you exposed for recovery. Obtain a signature page confirming that the stakeholders know about this risk.
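Once you have the job list with start times and durations, checking which jobs will not finish before the power-down is simple arithmetic. Here is a minimal sketch of that check. The job names, owners, dates, and the one-hour safety margin are assumptions made up for illustration, not values from any real backup product.

```python
from datetime import datetime, timedelta


def jobs_at_risk(jobs, shutdown_at, margin=timedelta(hours=1)):
    """Return jobs whose expected finish is later than shutdown minus margin."""
    deadline = shutdown_at - margin
    late = []
    for job in jobs:
        finish = job["start"] + job["duration"]
        if finish > deadline:
            late.append({"name": job["name"], "owner": job["owner"], "finish": finish})
    return late


# Hypothetical power-down time and job list, for illustration only.
shutdown_at = datetime(2030, 1, 4, 18, 0)
jobs = [
    {"name": "windows-incremental", "owner": "Windows",
     "start": datetime(2030, 1, 4, 12, 0), "duration": timedelta(hours=2)},
    {"name": "oracle-rman-full", "owner": "Oracle",
     "start": datetime(2030, 1, 4, 13, 0), "duration": timedelta(hours=6)},
    {"name": "sql-dump", "owner": "SQL",
     "start": datetime(2030, 1, 4, 15, 0), "duration": timedelta(hours=2, minutes=30)},
]

late = jobs_at_risk(jobs, shutdown_at)
for job in late:
    print(f"{job['name']} ({job['owner']}) finishes at {job['finish']:%Y-%m-%d %H:%M}")
```

Any job this flags is a candidate for a special backup job or for an entry on the risk list, routed to the owning group.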
Scripts/Jobs that run in your data center
Applications often depend on file feeds from one source location to another. The distribution of this data occurs using a centralized job engine. The scripts run from the job engine need to be reviewed and modified to support the data center shutdown. This is imperative if you have a Mainframe. Once the jobs that will not run have been identified, a decision is required on the method to run the skipped jobs. This may include setting up a special batch file to run the missed jobs.
Similar to the backup jobs, obtain a list of the scripts from the job engine early in the process and determine which jobs will not run when the data center is shut down. These jobs will need to be rescheduled.
The skipped jobs will be added to the risk list. Application teams will cite “application impact” if they do not get the data on the scheduled basis or if a cycle is skipped and they need to process twice as much data. Flagging this as a risk will give it the required visibility.
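For jobs that run on a fixed interval, you can work out which runs fall inside the outage window and therefore need to be rescheduled. The sketch below assumes a simple fixed-interval schedule and a 30-hour outage. The dates, the interval, and the `skipped_runs` helper are hypothetical, and a real job engine will have calendars and dependencies this does not model.

```python
from datetime import datetime, timedelta


def skipped_runs(first_run, interval, outage_start, outage_end):
    """Return every scheduled run time in [outage_start, outage_end)."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    runs = []
    run_time = first_run
    # Walk the schedule forward until we pass the end of the outage.
    while run_time < outage_end:
        if run_time >= outage_start:
            runs.append(run_time)
        run_time += interval
    return runs


# Hypothetical schedule: a file feed that runs every 6 hours.
first_run = datetime(2030, 1, 1, 0, 0)
outage_start = datetime(2030, 1, 4, 18, 0)
outage_end = outage_start + timedelta(hours=30)

missed = skipped_runs(first_run, timedelta(hours=6), outage_start, outage_end)
for run_time in missed:
    print(f"Skipped run: {run_time:%Y-%m-%d %H:%M}")
```

The count of missed runs per job gives the application teams a concrete number when they assess how much data they will need to catch up on.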
Disaster Recovery Test
Powering down your data center is an opportunity to test your disaster recovery plan. The execution of the disaster recovery test is an entirely separate project and needs to be led by the disaster recovery team. Do not let somebody convince you to lead both the power down of the data center and the turn up of the disaster recovery site. These are separate activities.
Your disaster recovery plan needs to already be developed and previously tested. If this is your first adventure into a disaster recovery test, this is not the time to do this.
If your disaster recovery plan has already been developed and the tasks are documented, you can use the data center shutdown to test your disaster recovery plan. The execution of the disaster recovery plan requires that you have a project manager or coordinator for each application that will fail over.
Prior to starting the disaster recovery test, make sure your disaster recovery team identifies all of the artifacts that they will want for their final report.
In order to improve communication on the disaster recovery test, consider building an automated call distributor queue in your phone system. You will use this to obtain resources to help with an infrastructure component. For example, during this activity you will need somebody to start a server. When that time occurs, the application team can call into the queue and get the next available team member. Similarly, if you need a helper to start a storage device, the team can call in and obtain the next storage resource waiting in the queue. If you don’t do it this way, then you need to find another way to distribute the resources that will support the applications. Usually the number of people that support the applications is far smaller than the number of applications.
The disaster recovery test occurs in a very short time frame. Usually disaster recovery plans assume that you will stay running at the disaster recovery facility for a long duration. During the test, you are failing over to the disaster recovery site and then failing back within a 48-hour period. To support this effort, consider using a “follow the sun” approach and place people in time zones that can support the activities all night long. Ideally you should have 3 shifts of people that can work 8-hour shifts. If you don’t do this, you will have some people who work 24 hours straight; they will collapse, and you will not have anyone to support your activities.
Hardware will fail during the disaster recovery test. The vendors that you have coordinated with need to know that they will support your disaster recovery site as well.
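The "next available team member" behavior can be pictured as one first-in, first-out line per skill. The sketch below is only a toy model of that idea in Python, not a phone system configuration. The `ResourcePool` class, the skill labels, and the names are all hypothetical.

```python
from collections import deque


class ResourcePool:
    """First-in, first-out pool of available people, grouped by skill."""

    def __init__(self):
        self._available = {}  # skill -> deque of names, longest-waiting first

    def check_in(self, name, skill):
        """Mark a person as available for requests needing this skill."""
        self._available.setdefault(skill, deque()).append(name)

    def request(self, skill):
        """Hand out the longest-waiting person with the skill, or None."""
        queue = self._available.get(skill)
        if not queue:
            return None
        return queue.popleft()


# Hypothetical team members, for illustration only.
pool = ResourcePool()
pool.check_in("server-admin-1", "server")
pool.check_in("storage-admin-1", "storage")
pool.check_in("server-admin-2", "server")

first = pool.request("server")   # longest-waiting server admin
second = pool.request("server")  # next server admin in line
third = pool.request("server")   # nobody left, so None
print(first, second, third)
```

The point of the shared pool is that a small infrastructure team serves many applications in arrival order, instead of each application holding on to a dedicated person.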
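To see what three rotating 8-hour shifts look like across a 48-hour test, you can generate the roster ahead of time. This is a minimal sketch under assumed values: the start time, the team names, and the `build_roster` helper are hypothetical.

```python
from datetime import datetime, timedelta


def build_roster(start, total_hours, teams, shift_hours=8):
    """Assign consecutive shifts to teams in rotation across the window."""
    end = start + timedelta(hours=total_hours)
    roster = []
    shift_start = start
    index = 0
    while shift_start < end:
        # The final shift is cut short if the window is not a whole multiple.
        shift_end = min(shift_start + timedelta(hours=shift_hours), end)
        roster.append((teams[index % len(teams)], shift_start, shift_end))
        shift_start = shift_end
        index += 1
    return roster


# Hypothetical start time and "follow the sun" teams.
start = datetime(2030, 1, 4, 18, 0)
teams = ["Americas", "EMEA", "APAC"]

roster = build_roster(start, total_hours=48, teams=teams)
for team, shift_start, shift_end in roster:
    print(f"{team}: {shift_start:%a %H:%M} to {shift_end:%a %H:%M}")
```

With three teams and 8-hour shifts, each team works two shifts in the 48 hours and always has 16 hours off between them.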
Combining a data center shutdown and a disaster recovery event is a complex activity. In summary, start early and know the hour-by-hour plan for each application and infrastructure component. Set up off-site conference rooms and help desk locations. Remember that your building will lose more than the power to the data center; it will lose power to the entire building. Assume you are in a disaster.