A Review of SAN Storage Failure
on September 15, 2005
- Summary of the SAN Failure Incident
- Recommendations for Future Improvement
1. Summary of the SAN Failure Incident
On September 15, 2005, at around 0:42 a.m., the SAN
(Storage Area Network) system that serves a number of central servers
failed. The failure was caused by a very rare event: two disk units in
the same RAID (Redundant Array of Independent Disks) group of the system
failed in succession within an interval of two minutes. According to HP,
the equipment vendor for this SAN, the incident hit the technological
limit of the RAID5-based storage that the Computer Centre had
implemented for its disk storage systems.
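The limit referred to above can be illustrated with a small sketch. RAID5 keeps one parity block per stripe, computed as the XOR of the data blocks, so any single lost block can be rebuilt from the survivors, but two lost blocks in the same stripe cannot. The code below is only an illustration under simplifying assumptions: the function names, the four-disk layout and the sample block contents are hypothetical and do not describe the actual HP equipment.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, failed):
    """Rebuild the failed block of one stripe (data blocks plus one parity block).

    stripe: list of byte blocks, one per disk (hypothetical layout).
    failed: set of indexes of the disks that have failed.
    """
    if len(failed) > 1:
        # One parity block gives one equation, so only one unknown can be solved.
        raise ValueError("single parity cannot rebuild more than one lost disk")
    survivors = [block for i, block in enumerate(stripe) if i not in failed]
    return {i: xor_blocks(survivors) for i in failed}


# Three data disks and one parity disk (sample contents are made up).
data = [b"mail", b"www1", b"grad"]
stripe = data + [xor_blocks(data)]
```

With this sketch, `rebuild(stripe, {1})` recovers the block of the second disk from the other three, while `rebuild(stripe, {0, 2})` raises an error, which corresponds to the case of two disk units failing before the first could be rebuilt.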
Several computer servers that used the storage on the SAN system
failed as a result. The affected servers included those supporting the
University's email and www services, namely hkusua.hku.hk, hkucc.hku.hk,
www.hku.hk, graduate.hku.hk and extranet.hku.hk. Services of these
systems were affected extensively, ranging from a partial outage of
13 hours for hkucc.hku.hk to a total outage of 34 hours for www.hku.hk.
While the equipment engineers took about 12 hours to repair the SAN
storage, it took the Computer Centre much longer to restore all the
data files and email files from backup tapes, due to the large volume of
disk storage involved and the lack of equipment for supporting fast data
recovery.
It took the Computer Centre over 5 days to complete the restoration
of the latest available data files for the affected user accounts,
including 1,700 out of the 7,000 accounts of hkucc.hku.hk, 5,500 out of the 32,000
accounts of hkusua.hku.hk and all of the 40,000 accounts of
graduate.hku.hk.
Faced with the widespread unavailability of disk storage, the Computer
Centre took immediate action to minimize the impact on users of the
affected systems:
- setting up a temporary web server to keep users informed of the system
recovery progress and to provide the necessary homepage for users to
gain access to the unaffected HKU Portal and Student Connect services,
- mobilizing staff resources for disseminating information and answering
enquiries, and
- escalating the requests for urgent technical support from the related
hardware and software vendors, namely HP and Veritas, to senior levels
of the companies.
Our systems staff acted speedily and professionally to recover the
affected services at the earliest possible time, and consequently stayed
in the computer room for over 40 hours before they could take a rest.
The recovery process was lengthy because a software bug found in the
operating system of the hkucc.hku.hk, www.hku.hk and graduate.hku.hk
systems slowed down the data restoration significantly. Even without the
software bug, however, a long time would have been needed to recover all
the corrupted data from the backup tapes, and this is an area that we
must consider for improvement. In addition, insufficient staffing in the
Centre's Systems team contributed to the fact that the www.hku.hk and
graduate.hku.hk systems could only be recovered after an outage of more
than 1 day, and that the latest version of data from backup tapes could
only be restored on September 20, 2005.
2. Recommendations for Future Improvement
As an interim measure, HP has been requested to monitor the behaviour of
the SAN storage and its RAID groups more closely, so as to expose and
fix any inherent defects of the equipment early, before similar problems
recur. As a near-term measure, the existing configuration of the SAN
storage systems should be reviewed and enhanced to provide the capacity
and capability needed for carrying out full-scale data recovery from
backup tapes regularly. As a medium-term measure, implementing a
real-time data replication function to enable much faster storage
recovery would be considered. As a long-term measure, setting up a
disaster recovery site, with installation of the necessary server,
storage and network equipment for quick recovery of the mission-critical
systems and services, would be considered.
The Computer Centre would like to apologize for all the inconvenience
caused to University members by this unfortunate incident.