Using Business Continuity Planning to Improve Service Availability
梁家駿
The Hong Kong Polytechnic University
April 2007

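Outline
- Background of The Hong Kong Polytechnic University
- Development of the Disaster Recovery Plan and the Business Continuity Plan at PolyU
- Current environment: administrative, e-services, academic
- Testing the Business Continuity Plan
- What service level can a high-availability system achieve?
- Planned and unplanned downtime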

The Hong Kong Polytechnic University 香港理工大學

Mission: to develop academic excellence built on a professional foundation
Motto (開物成務 勵學利民): To learn and to apply, for the benefit of mankind

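Staff and Student Numbers
- PolyU has about 1,031 full-time teaching staff at the rank of assistant lecturer, assistant professor or above.
- In 2005/06 PolyU offered 71 UGC-funded programmes, with 12,313 full-time and 2,284 part-time students, the largest student population of any publicly funded tertiary institution in Hong Kong.
- A further 12,043 students are enrolled in PolyU's self-financed programmes.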

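Academic Structure
PolyU has six faculties and one School of Hotel and Tourism Management:
- Faculty of Applied Science and Textiles
- Faculty of Business
- Faculty of Design and Languages (設計及語文學院)
- Faculty of Construction and Land Use
- Faculty of Engineering
- Faculty of Health and Social Sciences
- School of Hotel and Tourism Management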

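Unique Programmes
- Design
- Engineering Physics
- Fashion and Textiles
- Medical Laboratory Science
- International Shipping and Logistics Management
- Occupational Therapy
- Optometry
- Physiotherapy
- Radiography
- Surveying and Geo-Informatics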

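Services for Business and Industry
- Institute for Enterprise (企業發展院)
- College of Professional and Continuing Education (專業及持續教育學院)
- Hong Kong Community College (HKCC): offers thirteen associate degree programmes
- School of Professional Education and Executive Development (SPEED): offers adult continuing education programmes
- Hong Kong CyberU (HKCyberU): offers online distance learning programmes
- Offshore Centres Development Office (境外中心拓展處)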

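Development in Mainland China
Training centres set up in mainland China by the Offshore Centres Development Office (境外中心拓展處):
- Hangzhou: in cooperation with Zhejiang University
- Shenzhen: funded by the Hong Kong Young Industrialists Council and co-organised by the Research Institute of Tsinghua University in Shenzhen
- Zhuhai: in cooperation with the Harbin Institute of Technology Group (哈爾濱工大集團)
- Xi'an: in cooperation with Xi'an Jiaotong University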

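Development in Mainland China (continued)
PolyU currently offers more than 25 programmes in mainland China. Of these, 17 PolyU bachelor's and master's degree programmes are recognised by the Ministry of Education and the Office of the Academic Degrees Committee of the State Council, more than any other university in Hong Kong or overseas.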

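History of the Disaster Recovery Plan
PolyU uses its Disaster Recovery Plan to:
- Protect against severe environmental damage
- Improve backup capability
- Protect against operational or human error
- Protect against total data loss caused by program errors
- Protect against the threat of natural disasters
- Satisfy auditors' requirements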

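History of the Disaster Recovery Plan (continued)
- Research machines in the secondary computer room take over as critical business machines in a disaster.
- The plan is activated only when both of the following hold:
  - the main computer room has failed and is completely inoperable; and
  - operation cannot be restored within three days.
- The chance of activation is small.
- Regular tests have been carried out every year since 1998.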

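The Threat of SARS and Avian Flu
- When SARS broke out in 2003, PolyU faced a new threat that the Disaster Recovery Plan of the time had not anticipated.
- Under the new threats of SARS and avian flu, the need for a Business Continuity Plan arose.
- A contingency plan was set up at the Government's request in preparation for SARS and avian flu.
- If an infection occurs in the main computer room, the contingency plan is activated.
- The chance of activation is still low, but much higher than for a natural disaster, especially in 2003.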

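Objectives of the Business Continuity Plan
- Maintain high efficiency
- Improve service availability
- Reduce unplanned downtime and minimise its impact on users
- Reduce damage to the University's image or finances
- Meet government and audit requirements
- Allow more time for system upgrades and patches
- Make regular testing possible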

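Objectives of the Business Continuity Plan (continued)
- Activated in any of the following situations:
  - a single computer room is completely inoperable
  - a single computer room cannot be entered
  - a single system or service has stopped
- System start-up takes from under one second (unnoticed by users) to within 30 minutes.
- On failure, service moves from the production machines to the user acceptance or development machines.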
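The activation conditions and the fallback from production to user acceptance or development machines could be sketched roughly as follows. This is only an illustration: the names `Trigger`, `should_activate`, `pick_standby`, the role strings and the host names are hypothetical, and preferring a user acceptance machine over a development machine is an assumption, not something the slide states.

```python
from enum import Enum
from typing import Dict, Optional, Set


class Trigger(Enum):
    """The three activation conditions listed on the slide."""
    ROOM_INOPERABLE = "a single computer room is completely inoperable"
    ROOM_INACCESSIBLE = "a single computer room cannot be entered"
    SERVICE_STOPPED = "a single system or service has stopped"


# Assumed preference order for the standby machine (not stated in the source).
STANDBY_ROLES = ("user_acceptance", "development")


def should_activate(observed: Set[Trigger]) -> bool:
    """Any one of the listed conditions is enough to activate the plan."""
    return len(observed) > 0


def pick_standby(healthy_hosts: Dict[str, str]) -> Optional[str]:
    """Return the first healthy host whose role matches the preference order.

    healthy_hosts maps a host name to its role string.
    """
    for role in STANDBY_ROLES:
        for host, host_role in healthy_hosts.items():
            if host_role == role:
                return host
    return None


if __name__ == "__main__":
    observed = {Trigger.SERVICE_STOPPED}
    if should_activate(observed):
        print(pick_standby({"dev1": "development", "uat1": "user_acceptance"}))
```

By contrast, the older disaster recovery rule required two conditions to hold together (main room inoperable and no recovery within three days), which is one reason it was rarely triggered.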

Business Continuity Plan versus Disaster Recovery Plan
Item | Business Continuity (BC) | Disaster Recovery (DR)
Concept | New; includes processes, personnel, business oriented, etc. | Old; system oriented, business continues with additional procedures
Technology | Fault tolerance, HA, SAN, load balancer | Backup and restore
Scope of recovery | By individual system | By site
Time to execute | Instant auto-detect | After DR is declared (3 days)
Time to recover | Seconds / minutes | Hours / days
Required manpower | Low | High
Planned drill time | Selected day for an individual system, but not for site failure | Fixed day for whole-site failure
Use during maintenance | Yes | No

What does our BCP protect or not protect?
BCP can protect against:
- Power failure of a few hours in the local area
- An environment that makes one of the computer rooms unfit to access (air conditioning, power, fire, flooding, bomb attack, virus contamination)
- Multiple disk failures on one side of the mirrored image under the SAN
- Failure of part of the communication links
- Loss of the Internet connection in one of the computer rooms
BCP cannot protect against:
- Total data loss
- Prolonged power failure
- An environment unfit for both computer rooms
- Computer virus or security attack (protected by the firewall)
- Loss of desktop PCs
- Natural disaster (e.g. earthquake, tsunami, bush fire)
- War, riot, bomb, terrorism or malicious attack on both computer rooms
- Loss of most IT personnel in the BCP team

Current Environment
Multiple Hosts (MH)
- Network infrastructure
- University Portal, SSH Gateway, LDAP
- DNS, DHCP, SPAN, Radius, Dialup
- ias servers, OF middle tier
- Use of load balancer structure
High Availability (HA)
- University Portal DB
- Internet / Intranet Web Server
- WebCT, Academic NFS Server
- AS, RO, HKCC, SPEED, SAO HIMS
- FO, OF, CHRIS systems
Disaster Recovery (DR)
- Campus Email service
- Central Novell and Groupwise service
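The load balancer structure used for the Multiple Hosts services can be pictured with a minimal round-robin sketch that skips hosts marked as down. The class `RoundRobinBalancer` and the host names are hypothetical; the slide does not say which balancing algorithm or product is in use.

```python
class RoundRobinBalancer:
    """Hand out hosts in turn, skipping any host currently marked down."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.healthy = set(self.hosts)
        self._next = 0  # index of the next host to try

    def mark_down(self, host):
        self.healthy.discard(host)

    def mark_up(self, host):
        if host in self.hosts:
            self.healthy.add(host)

    def pick(self):
        # Try each host at most once per call, starting where we left off.
        for _ in range(len(self.hosts)):
            host = self.hosts[self._next]
            self._next = (self._next + 1) % len(self.hosts)
            if host in self.healthy:
                return host
        raise RuntimeError("no healthy host available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["portal1", "portal2", "portal3"])
    lb.mark_down("portal2")  # e.g. taken out for patching in office hours
    print([lb.pick() for _ in range(4)])
```

Because requests simply flow to the remaining hosts, one machine can be taken out for maintenance without a service outage, which is the property the plan relies on.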

Current Environment (continued)
Year(s) | No. of DR Systems | No. of HA/MH Systems | Total
1998-2003 | 10 | 0 | 10
2004 | 4 | 9 | 13
2005 | 2 | 18 | 20
2006 | 2 | 19 | 21

HA Cluster for Administrative Computing
Diagram labels (figure not preserved):
- P404, SF20K, SF6800, L002
- FE or GE heartbeat link
- Fiber Channel switches (Brocade 3800) joined by SAN inter-switch links
- Two SE6120 storage units
- Data is written to both storages using Veritas VM
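Two ideas in this diagram, the heartbeat link and writing data to both storages, can be illustrated with a small sketch. It is a toy model under stated assumptions: `HeartbeatMonitor`, `MirroredVolume`, the timeout value and the in-memory dictionaries are hypothetical and do not represent the Veritas Volume Manager interface or the cluster software actually deployed.

```python
class HeartbeatMonitor:
    """Declare the peer node dead if no heartbeat arrives within timeout_s."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = None  # time of the most recent heartbeat, if any

    def beat(self, now):
        self.last_seen = now

    def peer_alive(self, now):
        if self.last_seen is None:
            return False
        return (now - self.last_seen) <= self.timeout_s


class MirroredVolume:
    """Every write goes to both storage sides; a read can use either side."""

    def __init__(self):
        self.side_a = {}
        self.side_b = {}

    def write(self, block, data):
        self.side_a[block] = data
        self.side_b[block] = data

    def read(self, block, side_a_ok=True):
        source = self.side_a if side_a_ok else self.side_b
        return source[block]


if __name__ == "__main__":
    hb = HeartbeatMonitor(timeout_s=3.0)
    hb.beat(now=10.0)
    print(hb.peer_alive(now=12.0), hb.peer_alive(now=14.0))

    vol = MirroredVolume()
    vol.write(7, b"payroll")
    print(vol.read(7, side_a_ok=False))  # still readable if side A is lost
```

Times are passed in explicitly rather than read from the system clock so the behaviour is easy to check by hand.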

What service level can a high-availability system achieve?
Information Availability Index (IAI) (figure not preserved)
Availability axis (%): 100.0, 99.999, 99.99, 99.5, 99.0, 98.0, 96.5, 95.0
Solutions, from highest to lowest availability:
- Custom solutions for life or death applications
- Multi-node clustering
- Data replication
- SAN architecture
- Log shipping
- Enhanced disk and file management
- Basic systems
Availability levels:
- Fault Tolerance
- High Availability (HA): application and data availability
- Enhanced Availability: use RAID technology to reduce downtime caused by disk failure
- Basic Availability: regular backup
Source: Windows NT MSCS, Richard Lee, 2000
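To make the availability percentages on the axis concrete, the short sketch below converts each one into the downtime it allows per year. The function name `yearly_downtime_hours` is illustrative, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def yearly_downtime_hours(availability_percent):
    """Hours of downtime per year permitted by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    # Axis values taken from the slide.
    for level in (95.0, 96.5, 98.0, 99.0, 99.5, 99.99, 99.999, 100.0):
        hours = yearly_downtime_hours(level)
        print(f"{level:7.3f}%  {hours:8.3f} h/year  ({hours * 60:9.2f} min/year)")
```

Under this assumption, 99.0% still allows roughly 87.6 hours of downtime a year, while 99.999% allows only a little over five minutes.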

Planned Downtime is Painful
Gartner states that 70% of application and database downtime is caused by planned outages:
- Application upgrades
- OS upgrades
- Server maintenance
System and application staff perform upgrades or maintenance after office hours and on weekends.
"The biggest near-term customer pain-point I see is related to planned migrations and avoiding or reducing planned downtime." Donna Scott, 2005 Gartner / VERITAS Interview

Address Planned Downtime
- Make use of the Multiple Hosts (MH) and High Availability (HA) environment
- Client connections remain uninterrupted during migration
"83% of customers considered the need to keep applications running during server maintenance an absolute must do or important in their environment." Tier 1 Research UC and Virtualization Survey, 2005
Business Value:
- Perform server maintenance during normal business hours
- Reduce associated application server outages
- Increase server utilization by moving applications as resource requirements change

How would you rank in planned downtime?
Planned Downtime in Data Centre (Gartner Research, April 2005)
- Average: >250 downtime hours per year
- Very Good: <200 downtime hours per year
- Outstanding: <50 downtime hours per year
- Best-in-Class: <12 downtime hours per year
- 100 percent: zero planned downtime
Chart shares as extracted: 15%, 2%, 8%, 36%, 39% (the slide text does not preserve which share belongs to which category).
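A site can place itself in these bands with a simple lookup. The sketch below uses the thresholds printed on the slide; the function name `rank_planned_downtime` is illustrative, and because the slide leaves the range from 200 to 250 hours unlabelled, the code returns "unclassified" there rather than guessing.

```python
def rank_planned_downtime(hours_per_year):
    """Map yearly planned downtime (hours) to the band named on the slide."""
    if hours_per_year < 0:
        raise ValueError("downtime cannot be negative")
    if hours_per_year == 0:
        return "100 percent"
    if hours_per_year < 12:
        return "Best-in-Class"
    if hours_per_year < 50:
        return "Outstanding"
    if hours_per_year < 200:
        return "Very Good"
    if hours_per_year > 250:
        return "Average"
    # 200 to 250 hours: not covered by any band on the slide.
    return "unclassified"


if __name__ == "__main__":
    for hours in (0, 8, 40, 120, 225, 300):
        print(hours, rank_planned_downtime(hours))
```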

Unplanned Downtime is Unacceptable
Causes for Unplanned Downtime (Gartner Research, April 2005)
- Failure in hardware, operating system, environment, or disasters
- Operation failures (infrastructure change, configuration/problem management)
- Application failures (change management)
Chart shares as extracted: 40%, 20%, 40% (the slide text does not preserve which share belongs to which cause).

How would you rank in unplanned downtime?

Services Availability in 2004, 2005, 2006
Service | Availability 2004 | Availability 2005 | Availability 2006
Campus Network | 99.974% | 99.958% | 99.994%
Campus Email | 99.908% | 100.000% | 99.908%
Internet & Internet2 | 99.920% | 99.954% | 99.960%
University Portal | 99.760% | 99.945% | 100.000%
Internet Web-hosting | 100.000% | 100.000% | 99.999%
Intranet Web-hosting | 99.930% | 100.000% | 99.870%
SPAN & Wireless LAN | 100.000% | 100.000% | 100.000%
Academic Unix Cluster | 99.790% | 100.000% | 99.970%
myweb | 99.720% | 100.000% | 99.910%
mystore | 100.000% | 99.990% | 99.995%
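One way to read this table is to summarise each year by its unweighted mean availability and its weakest service. The sketch below does that with the figures copied from the slide; the names `AVAILABILITY` and `summarize` are illustrative, and treating all services as equally weighted is an assumption.

```python
# Availability (%) per service for 2004, 2005, 2006, copied from the slide.
AVAILABILITY = {
    "Campus Network": (99.974, 99.958, 99.994),
    "Campus Email": (99.908, 100.000, 99.908),
    "Internet & Internet2": (99.920, 99.954, 99.960),
    "University Portal": (99.760, 99.945, 100.000),
    "Internet Web-hosting": (100.000, 100.000, 99.999),
    "Intranet Web-hosting": (99.930, 100.000, 99.870),
    "SPAN & Wireless LAN": (100.000, 100.000, 100.000),
    "Academic Unix Cluster": (99.790, 100.000, 99.970),
    "myweb": (99.720, 100.000, 99.910),
    "mystore": (100.000, 99.990, 99.995),
}
YEARS = (2004, 2005, 2006)


def summarize(year_index):
    """Return (unweighted mean availability, lowest-availability service)."""
    values = {name: row[year_index] for name, row in AVAILABILITY.items()}
    mean = sum(values.values()) / len(values)
    lowest = min(values, key=values.get)
    return mean, lowest


if __name__ == "__main__":
    for index, year in enumerate(YEARS):
        mean, lowest = summarize(index)
        print(f"{year}: mean {mean:.4f}%, lowest: {lowest}")
```

The mean is only a rough indicator, since a short outage of the campus network is likely to matter more than the same outage of a smaller service.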

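Business Continuity Planning Is Itself a Continuous Process
- The Business Continuity Plan must be tested regularly to ensure that critical business systems keep running as planned after an incident.
- Use a highly reliable architecture: HA cluster and SAN structure.
- Use a multi-host, load-balanced architecture.
- Role of the service owner:
  - Consider adding important services to the Business Continuity Plan
  - Assess the return on investment
  - Manage and implement
  - Test regularly to confirm availability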
