Unisys SafeGuard Solutions
Troubleshooting Guide

Unisys SafeGuard Solutions Release 6.0

June 2008

6872 5688–002
NO WARRANTIES OF ANY NATURE ARE EXTENDED BY THIS DOCUMENT. Any product or related information described herein is only furnished pursuant and subject to the terms and conditions of a duly executed agreement to purchase or lease equipment or to license software. The only warranties made by Unisys, if any, with respect to the products described in this document are set forth in such agreement. Unisys cannot accept any financial or other responsibility that may be the result of your use of the information in this document or software material, including direct, special, or consequential damages.

You should be very careful to ensure that the use of this information and/or software material complies with the laws, rules, and regulations of the jurisdictions with respect to which it is used.

The information contained herein is subject to change without notice. Revisions may be issued to advise of such changes and/or additions.

Notice to U.S. Government End Users: This is commercial computer software or hardware documentation developed at private expense. Use, reproduction, or disclosure by the Government is subject to the terms of Unisys standard commercial license for the products, and where applicable, the restricted/limited rights provisions of the contract data rights clauses.

Unisys is a registered trademark of Unisys Corporation in the United States and other countries.

All other brands and products referenced in this document are acknowledged to be the trademarks or registered trademarks of their respective holders.
Contents

Section 1. About This Guide
Purpose and Audience .......................................................................... 1–1
Related Product Information ................................................................. 1–1
Documentation Updates ....................................................................... 1–1
What’s New in This Release ................................................................. 1–2
Using This Guide ................................................................................... 1–3

Section 2. Overview
Geographic Replication Environment .................................................... 2–1
Geographic Clustered Environment ...................................................... 2–2
Data Flow .............................................................................................. 2–3
Diagnostic Tools and Capabilities .......................................................... 2–7
Event Log ............................................................................. 2–7
System Status ..................................................................... 2–7
E-mail Notifications .............................................................. 2–8
Installation Diagnostics ........................................................ 2–9
Host Information Collector (HIC) ......................................... 2–9
Cluster Logs ......................................................................... 2–9
Unisys SafeGuard 30m Collector ......................................... 2–9
RA Diagnostics .................................................................... 2–9
Hardware Indicators ............................................................ 2–9
SNMP Support ................................................................... 2–10
kutils Utility ........................................................................ 2–10
Discovering Problems ......................................................................... 2–10
Events That Cause Journal Distribution ............................ 2–10
Troubleshooting Procedures ............................................................... 2–11
Identifying the Main Components and Connectivity of the Configuration ........................................................ 2–11
Understanding the Current State of the System ............... 2–12
Verifying the System Connectivity .................................... 2–12
Analyzing the Configuration Settings ................................ 2–13

Section 3. Recovering in a Geographic Replication Environment
Manual Failover of Volumes and Data Consistency Groups ................. 3–2
Accessing an Image ............................................................ 3–2
Testing the Selected Image at Remote Site ....................... 3–3
Manual Failover of Volumes and Data Consistency Groups for ClearPath MCP Hosts ....................................................................... 3–5
Accessing an Image ............................................................. 3–5
Testing the Selected Image at Remote Site ........................ 3–5

Section 4. Recovering in a Geographic Clustered Environment
Checking the Cluster Setup ................................................................... 4–1
MSCS Properties .................................................................. 4–1
Network Bindings ................................................................. 4–2
Group Initialization Effects on a Cluster Move-Group Operation .......................................................................................... 4–3
Full-Sweep Initialization ........................................................ 4–4
Long Resynchronization ....................................................... 4–4
Initialization from Marking Mode .......................................... 4–5
Behavior of SafeGuard 30m Control During a Move-Group Operation .......................................................................................... 4–5
Recovering by Manually Moving an Auto-Data (Shared Quorum) Consistency Group ............................................................................ 4–7
Taking a Cluster Data Group Offline ..................................... 4–7
Performing a Manual Failover of an Auto-Data (Shared Quorum) Consistency Group to a Selected Image ............... 4–8
Bringing a Cluster Data Group Online and Checking the Validity of the Image .......................................................... 4–9
Reversing the Replication Direction of the Consistency Group ............................................................................... 4–10
Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner) .......... 4–11
Recovery When All RAs Fail on Site 1 (Site 2 Quorum Owner) .......... 4–17
Recovery When All RAs and All Servers Fail on One Site ................... 4–19
Site 1 Failure (Site 1 Quorum Owner) ................................ 4–19
Site 1 Failure (Site 2 Quorum Owner) ................................ 4–25

Section 5. Solving Storage Problems
User or Replication Volume Not Accessible .......................................... 5–4
Repository Volume Not Accessible ....................................................... 5–6
Reformatting the Repository Volume ................................... 5–8
Journal Not Accessible ........................................................................ 5–11
Journal Volume Lost Scenarios ........................................................... 5–13
Total Storage Loss in a Geographic Replicated Environment ............. 5–13
Storage Failure on One Site in a Geographic Clustered Environment .................................................................................... 5–16
Storage Failure on One Site with Quorum Owner on Failed Site ...................................................................... 5–17
Storage Failure on One Site with Quorum Owner on Surviving Site ................................................................ 5–20

Section 6. Solving SAN Connectivity Problems
Volume Not Accessible to RAs .............................................................. 6–3
Volume Not Accessible to SafeGuard 30m Splitter ............................... 6–7
RAs Not Accessible to SafeGuard 30m Splitter .................................. 6–12
Total SAN Switch Failure on One Site in a Geographic Clustered Environment ................................................................... 6–17
Cluster Quorum Owner Located on Site with Failed SAN Switch ..................................................................... 6–18
Cluster Quorum Owner Not on Site with Failed SAN Switch ..................................................................... 6–22

Section 7. Solving Network Problems
Public NIC Failure on a Cluster Node in a Geographic Clustered Environment ..................................................................... 7–3
Public or Client WAN Failure in a Geographic Clustered Environment ..................................................................................... 7–6
Management Network Failure in a Geographic Clustered Environment ................................................................................... 7–11
Replication Network Failure in a Geographic Clustered Environment ................................................................................... 7–15
Temporary WAN Failures .................................................................... 7–21
Private Cluster Network Failure in a Geographic Clustered Environment ................................................................................... 7–22
Total Communication Failure in a Geographic Clustered Environment ................................................................................... 7–26
Port Information .................................................................................. 7–32

Section 8. Solving Replication Appliance (RA) Problems
Single RA Failures ................................................................................. 8–4
Single RA Failure with Switchover ...................................... 8–5
Reboot Regulation ............................................................. 8–12
Failure of All SAN Fibre Channel Host Bus Adapters (HBAs) ........................................................................ 8–14
Failure of Onboard WAN Adapter or Failure of Optional Gigabit Fibre Channel WAN Adapter ............... 8–19
Single RA Failures Without a Switchover ........................................... 8–21
Port Failure on a Single SAN Fibre Channel HBA on One RA .......................................................................... 8–21
Onboard Management Network Adapter Failure .............. 8–23
Single Hard Disk Failure ..................................................... 8–24
Failure of All RAs at One Site .............................................................. 8–25
All RAs Are Not Attached .................................................................... 8–27

Section 9. Solving Server Problems
Cluster Node Failure (Hardware or Software) in a Geographic Clustered Environment ..................................................................... 9–2
Possible Subset Scenarios .................................................. 9–3
Windows Server Reboot ..................................................... 9–3
Unexpected Server Shutdown Because of a Bug Check ............................................................................... 9–8
Server Crash or Restart ...................................................... 9–12
Server Unable to Connect with SAN .................................. 9–14
Server HBA Failure ............................................................. 9–17
Infrastructure (NTP) Server Failure ...................................................... 9–18
Server Failure (Hardware or Software) in a Geographic Replication Environment ................................................................. 9–20

Section 10. Solving Performance Problems
Slow Initialization ................................................................................. 10–2
General Description of High-Load Event ............................................. 10–3
High-Load (Disk Manager) Condition ................................................... 10–4
High-Load (Distributor) Condition ........................................................ 10–5
Failover Time Lengthens ..................................................................... 10–5

Appendix A. Collecting and Using Logs
Collecting RA Logs ............................................................................... A–1
Setting the Automatic Host Info Collection Option ............. A–2
Testing FTP Connectivity .................................................... A–2
Determining When the Failure Occurred ............................ A–2
Converting Local Time to GMT or UTC ............................... A–3
Collecting RA Logs .............................................................. A–3
Collecting Server (Host) Logs ............................................................... A–6
Using the MPS Report Utility .............................................. A–6
Using the Host Information Collector (HIC) Utility .............. A–7
Analyzing RA Log Collection Files ........................................................ A–8
RA Log Extraction Directory ................................................ A–9
tmp Directory .................................................................... A–14
Host Log Extraction Directory ........................................... A–15
Analyzing Server (Host) Logs .............................................................. A–16
Analyzing Intelligent Fabric Switch Logs ............................................ A–16

Appendix B. Running Replication Appliance (RA) Diagnostics
Clearing the System Event Log (SEL) ................................................... B–1
Running Hardware Diagnostics ............................................................ B–2
Custom Test ........................................................................ B–3
Express Test ........................................................................ B–4
LCD Status Messages .......................................................................... B–4

Appendix C. Running Installation Manager Diagnostics
Using the SSH Client ............................................................................ C–1
Running Diagnostics ............................................................................. C–1
IP Diagnostics ...................................................................... C–2
Fibre Channel Diagnostics ................................................... C–9
Synchronization Diagnostics ............................................. C–17
Collect System Info ........................................................... C–18
Appendix D. Replacing a Replication Appliance (RA)
Saving the Configuration Settings ........................................................ D–2
Recording Policy Properties and Saving Settings ................................. D–2
Modifying the Preferred RA Setting ..................................................... D–3
Removing Fibre Channel Adapter Cards ............................................... D–4
Installing and Configuring the Replacement RA ................................... D–4
Cable and Apply Power to the New RA .............................. D–4
Connecting and Accessing the RA ...................................... D–4
Checking Storage-to-RA Access .......................................... D–5
Enabling PCI-X Slot Functionality ......................................... D–5
Configuring the RA .............................................................. D–6
Verifying the RA Installation .................................................................. D–7
Restoring Group Properties .................................................................. D–8
Ensuring the Existing RA Can Switch Over to the New RA ................. D–8

Appendix E. Understanding Events
Event Log .............................................................................................. E–1
Event Topics ........................................................................ E–1
Event Levels ........................................................................ E–2
Event Scope ......................................................................... E–2
Displaying the Event Log ..................................................... E–3
Using the Event Log for Troubleshooting ............................ E–3
List of Events ........................................................................................ E–4
List of Normal Events .......................................................... E–5
List of Detailed Events ...................................................... E–22

Appendix F. Configuring and Using SNMP Traps
Software Monitoring ............................................................................. F–1
SNMP Monitoring and Trap Configuration ............................................ F–3
Installing MIB Files on an SNMP Browser ............................................ F–3
Resolving SNMP Issues ........................................................................ F–4
Appendix G. Using the Unisys SafeGuard 30m Collector
Installing the SafeGuard 30m Collector ................................................ G–1
Before You Begin the Configuration ..................................................... G–2
Handling the Security Breach Warning ................................ G–3
Using Collector Mode ........................................................................... G–4
Getting Started .................................................................... G–4
Understanding Operations in Collector Mode ..................... G–7
Using View Mode ............................................................................... G–15

Appendix H. Using kutils
Usage .................................................................................................... H–1
Path Designations ................................................................................. H–1
Command Summary ............................................................................. H–2
Appendix I. Analyzing Cluster Logs
Introduction to Cluster Logs ................................................................... I–1
Creating the Cluster Log ....................................................... I–2
Understanding the Cluster Log Layout ................................. I–3
Sample Cluster Log ................................................................................ I–5
Posting Information to the Cluster Log ................................. I–5
Diagnosing a Problem Using Cluster Logs ............................................. I–6
Gathering Materials ............................................................... I–7
Opening the Cluster Log ....................................................... I–7
Converting GMT/UTC to Local Time ..................................... I–8
Converting Cluster Log GUIDs to Text Resource Names .................................................................................. I–8
Understanding State Codes ................................................ I–10
Understanding Persistent State .......................................... I–14
Understanding Error and Status Codes ............................... I–15

Index ............................................................................................. 1
Figures

2–1. Basic Geographic Clustered Environment ......................................................... 2–2
2–2. Data Flow ........................................................................................................... 2–3
2–3. Data Flow with Fabric Splitter ............................................................................ 2–5
2–4. Data Flow in CDP ............................................................................................... 2–6
4–1. All RAs Fail on Site 1 (Site 1 Quorum Owner) ................................................. 4–11
4–2. All RAs Fail on Site 1 (Site 2 Quorum Owner) ................................................. 4–17
4–3. All RAs and Servers Fail on Site 1 (Site 1 Quorum Owner) ............................. 4–20
4–4. All RAs and Servers Fail on Site 1 (Site 2 Quorum Owner) ............................. 4–25
5–1. Volumes Tab Showing Volume Connection Errors ............................................ 5–4
5–2. Management Console Messages for the User Volume Not Accessible Problem ......................................................................................................... 5–5
5–3. Groups Tab Shows “Paused by System” .......................................................... 5–5
5–4. Management Console Display: Storage Error and RAs Tab Shows Volume Errors ................................................................................................ 5–7
5–5. Volumes Tab Shows Error for Repository Volume ............................................ 5–7
5–6. Groups Tab Shows All Groups Paused by System ............................................ 5–7
5–7. Management Console Messages for the Repository Volume Not Accessible Problem ....................................................................................... 5–8
5–8. Volumes Tab Shows Journal Volume Error ..................................................... 5–11
5–9. RAs Tab Shows Connection Errors .................................................................. 5–11
5–10. Groups Tab Shows Group Paused by System ................................................. 5–12
5–11. Management Console Messages for the Journal Not Accessible Problem ....................................................................................................... 5–12
5–12. Management Console Volumes Tab Shows Errors for All Volumes ............... 5–14
5–13. RAs Tab Shows Volumes That Are Not Accessible ......................................... 5–14
5–14. Multipathing Software Reports Failed Paths to Storage Device ..................... 5–15
5–15. Storage on Site 1 Fails ..................................................................................... 5–16
5–16. Cluster “Regroup” Process ............................................................................. 5–17
5–17. Cluster Administrator Displays ......................................................................... 5–19
5–18. Multipathing Software Shows Server Errors for Failed Storage Subsystem ................................................................................................... 5–19
6–1. Management Console Showing “Inaccessible Volume” Errors ........................ 6–3
6–2. Management Console Messages for Inaccessible Volumes ............................. 6–3
6–3. Management Console Error Display Screen ...................................................... 6–7
6–4. Management Console Messages for Volumes Inaccessible to Splitter ............ 6–8
6–5. EMC PowerPath Shows Disk Error .................................................................. 6–10
6–6. Management Console Display Shows a Splitter Down ................................... 6–12
6–7. Management Console Messages for Splitter Inaccessible to RA ................... 6–13
6–8. SAN Switch Failure on One Site ...................................................................... 6–17
6–9. Management Console Display with Errors for Failed SAN Switch .................. 6–18
6–10. Management Console Messages for Failed SAN Switch ................................ 6–19<br />
6–11. Management Console Messages for Failed SAN Switch with Quorum<br />
Owner on Surviving Site ............................................................................... 6–23<br />
7–1. <strong>Public</strong> NIC Failure of a Cluster Node .................................................................. 7–3<br />
7–2. <strong>Public</strong> NIC Error Shown in the Cluster Administrator ......................................... 7–5<br />
7–3. <strong>Public</strong> or Client WAN Failure............................................................................... 7–7<br />
7–4. Cluster Administrator Showing <strong>Public</strong> LAN Network Error ................................ 7–8<br />
7–5. Management Network Failure .......................................................................... 7–11<br />
7–6. Management Console Display: “Not Connected” ........................................... 7–13<br />
7–7. Management Console Message for Event 3023 .............................................. 7–13<br />
7–8. Replication Network Failure .............................................................................. 7–15<br />
7–9. Management Console Display: WAN Down .................................................... 7–17<br />
7–10. Management Console Log Messages: WAN Down ........................................ 7–17<br />
7–11. Management Console RAs Tab: All RAs Data Link Down ............................... 7–18<br />
7–12. Private Cluster Network Failure ........................................................................ 7–22<br />
7–13. Cluster Administrator Display with Failures ...................................................... 7–23<br />
7–14. Total Communication Failure ............................................................................ 7–26<br />
7–15. Management Console Display Showing WAN Error ........................................ 7–27<br />
7–16. RAs Tab for Total Communication Failure ........................................................ 7–28<br />
7–17. Management Console Messages for Total Communication Failure ................ 7–28<br />
7–18. Cluster Administrator Showing Private Network Down ................................... 7–31<br />
7–19. Cluster Administrator Showing <strong>Public</strong> Network Down .................................... 7–31<br />
8–1. Single RA Failure ................................................................................................. 8–5<br />
8–2. Sample BIOS Display .......................................................................................... 8–6<br />
8–3. Management Console Display Showing RA Error and RAs Tab......................... 8–7<br />
8–4. Management Console Messages for Single RA Failure with<br />
Switchover...................................................................................................... 8–8<br />
8–5. LCD Display on Front Panel of RA .................................................................... 8–10<br />
8–6. Rear Panel of RA Showing Indicators ............................................................... 8–11<br />
8–7. Location of Network LEDs................................................................................ 8–11<br />
8–8. Location of SAN Fibre Channel HBA LEDs ....................................................... 8–12<br />
8–9. Management Console Display: Host Connection with RA Is Down ................ 8–15<br />
8–10. Management Console Messages for Failed RA (All SAN HBAs Fail) ............... 8–16<br />
8–11. Management Console Showing WAN Data Link Failure .................................. 8–20<br />
8–12. Location of Hard Drive LEDs ............................................................................ 8–25<br />
8–13. Management Console Showing All RAs Down ................................................ 8–26<br />
9–1. Cluster Node Failure ........................................................................................... 9–2<br />
9–2. Management Console Display with Server Error ............................................... 9–4<br />
9–3. Management Console Messages for Server Down ........................................... 9–5<br />
9–4. Management Console Messages for Server Down for Bug Check ................... 9–9<br />
9–5. Management Console Display Showing LA Site Server Down ........................ 9–14<br />
9–6. Management Console Images Showing Messages for Server Unable<br />
to Connect to SAN ....................................................................................... 9–15<br />
9–7. PowerPath Administrator Console Showing Failures ....................................... 9–16<br />
9–8. PowerPath Administrator Console Showing Adapter Failure ........................... 9–17<br />
9–9. Event 1009 Display ........................................................................................... 9–19<br />
I–1. Layout of the Cluster Log .................................................................................... I–3<br />
I–2. Expanded Cluster Hive (in Windows 2000 Server) ............................................ I–10<br />
Tables<br />
2–1. User Types ......................................................................................................... 2–8<br />
2–2. Events That Cause Journal Distribution ........................................................... 2–11<br />
5–1. Possible Storage Problems with Symptoms ..................................................... 5–1<br />
5–2. Indicators and Management Console Errors to Distinguish Different<br />
Storage Volume Failures ................................................................................ 5–3<br />
6–1. Possible SAN Connectivity Problems ................................................................ 6–1<br />
7–1. Possible Networking Problems with Symptoms ............................................... 7–1<br />
7–2. Ports for Internet Communication ................................................................... 7–33<br />
7–3. Ports for Management LAN Communication and Notification ........................ 7–33<br />
7–4. Ports for RA-to-RA Internal Communication .................................................... 7–34<br />
8–1. Possible Problems for Single RA Failure with a Switchover .............................. 8–2<br />
8–2. Possible Problems for Single RA Failure Without a Switchover ........................ 8–3<br />
8–3. Possible Problems for Multiple RA Failures with Symptoms ............................ 8–3<br />
8–4. Management Console Messages Pertaining to Reboots ................................ 8–13<br />
9–1. Possible Server Problems with Symptoms ....................................................... 9–1<br />
10–1. Possible Performance Problems with Symptoms ........................................... 10–1<br />
B–1. LCD Status Messages ....................................................................................... B–5<br />
C–1. Messages from the Connectivity Testing Tool .................................................. C–8<br />
E–1. Normal Events .................................................................................................... E–5<br />
E–2. Detailed Events ................................................................................................ E–23<br />
F–1. Trap Variables and Values .................................................................................. F–2<br />
I–1. System Environment Variables Related to Clustering ........................................ I–2<br />
I–2. Modules of MSCS ............................................................................................... I–4<br />
I–3. Node State Codes ............................................................................................. I–12<br />
I–4. Group State Codes ............................................................................................ I–12<br />
I–5. Resource State Codes ...................................................................................... I–12<br />
I–6. Network Interface State Codes ........................................................................ I–13<br />
I–7. Network State Codes ........................................................................................ I–13<br />
Section 1<br />
About This <strong>Guide</strong><br />
Purpose and Audience<br />
This document presents procedures for problem analysis and troubleshooting of the<br />
Unisys <strong>SafeGuard</strong> 30m solution. It is intended for Unisys service representatives and<br />
other technical personnel who are responsible for maintaining the Unisys <strong>SafeGuard</strong><br />
30m solution installation.<br />
Related Product Information<br />
The methods described in this document are based on support and diagnostic tools that<br />
are provided as standard components of the Unisys <strong>SafeGuard</strong> 30m solution. You can<br />
find additional information about these tools in the following documents:<br />
• Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Planning and Installation <strong>Guide</strong><br />
• Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong><br />
• Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Introduction to Replication Appliance Command Line<br />
Interface (CLI)<br />
• Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Installation <strong>Guide</strong><br />
Note: Review the information in the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Planning and<br />
Installation <strong>Guide</strong> about making configuration changes before you begin troubleshooting<br />
a problem.<br />
Documentation Updates<br />
This document contains all the information that was available at the time of<br />
publication. Changes identified after release of this document are included in problem list<br />
entry (PLE) 18609274. To obtain a copy of the PLE, contact your Unisys service<br />
representative or access the current PLE from the Unisys Product <strong>Support</strong> Web site:<br />
http://www.support.unisys.com/all/ple/18609274<br />
Note: If you are not logged into the Product <strong>Support</strong> site, you will be asked to do so.<br />
What’s New in This Release<br />
Some of the important changes in the 6.0 release are summarized in the following table.<br />
Change: Unisys <strong>SafeGuard</strong> Continuous Data Protection (CDP)<br />
Notes: A Unisys <strong>SafeGuard</strong> Duplex solution that uses one Replication Appliance (RA) cluster to replicate data across the Storage Area Network (SAN).<br />
Change: <strong>Support</strong> for Concurrent Local and Remote (CLR)<br />
Notes: Concurrent Local (CDP) and Concurrent Remote Replication (CRR) of the same production volumes.<br />
Change: <strong>Support</strong> for CLARiiON splitter<br />
Notes: Unisys <strong>SafeGuard</strong> solutions work with the CLARiiON CX3 Series CLARiiON Splitter service to deliver a fully heterogeneous array-based data replication solution that is achieved without the need for host-based agents.<br />
Change: <strong>Support</strong> for Brocade intelligent fabric splitting (multi-VI mode only), using the Brocade 7500 SAN Router<br />
Notes: To support the heterogeneous environment at the switch level, the <strong>SafeGuard</strong> solution supports intelligent fabric splitting with Brocade switches.<br />
Change: <strong>Support</strong> for configurations using a mix of splitters within the same RA cluster and across RA clusters at different sites<br />
Notes: <strong>SafeGuard</strong> solutions can support mixed splitters in a given solution configuration.<br />
Change: Redesign of the Management Console GUI for greater ease-of-use<br />
Notes: The new RA GUI is easier to navigate and clearer to use.<br />
Change: SNMP trap viewer, log collection and analysis, and auto-discovery of <strong>SafeGuard</strong> components in the <strong>SafeGuard</strong> Command Center<br />
Notes: Command Center now provides log collection and automatic discovery of devices.<br />
Using This <strong>Guide</strong><br />
This guide offers general information in the first four sections. Read Section 2 to<br />
understand the overall approach to troubleshooting and to gain an understanding of the<br />
Unisys <strong>SafeGuard</strong> 30m solution architecture.<br />
Section 3 describes recovery in a geographic replication environment, and Section 4<br />
offers information and recovery procedures for geographic clustered environments.<br />
Sections 5 through 10 group potential problems into categories and describe the<br />
problems. You must recognize symptoms, identify the problem or failed component, and<br />
then decide what to do to correct the problem. Sections 5 through 10 include a table at<br />
the beginning of each section that lists symptoms and potential problems.<br />
Each problem is then presented in the following format:<br />
• Problem Description: Description of the problem<br />
• Symptoms: List of symptoms that are typical for this problem<br />
• Actions to Resolve the Problem: Steps recommended to solve the problem<br />
The appendixes provide information about using tools and offer reference information<br />
that you might find useful in different situations.<br />
Section 2<br />
Overview<br />
The Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> are flexible, integrated business continuance solutions<br />
especially suitable for protecting business-critical application environments. The Unisys<br />
<strong>SafeGuard</strong> 30m solution provides two distinct functions that act in concert: replication of<br />
data and automated application recovery through clustering over great distances.<br />
Typically, the Unisys <strong>SafeGuard</strong> 30m solution is implemented in one of these<br />
environments:<br />
• Geographic replication environment: In this replication environment, data from<br />
servers at one site are replicated to a remote site.<br />
• Geographic clustered environment: In this replication environment, Microsoft Cluster<br />
Service (MSCS) is installed on servers that span sites and that participate in one<br />
cluster. The use of a Unisys <strong>SafeGuard</strong> 30m Control resource allows automated<br />
failover and recovery by controlling the replication direction with a MSCS resource.<br />
The resource is used in this environment only.<br />
Geographic Replication Environment<br />
Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> supports replication of data over Fibre Channel to local SAN-attached<br />
storage and over WAN to remote sites. It also allows failover to a secondary site so that<br />
operations can continue in the event of a disaster at the primary site.<br />
Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> replicates data over any distance:<br />
• within the same site (CDP), or<br />
• to another site halfway around the globe (CRR), or<br />
• both (CLR).<br />
Geographic Clustered Environment<br />
In the geographic clustered environment, MSCS and cluster nodes are part of the<br />
environment. Figure 2–1 illustrates a basic geographic clustered environment that<br />
consists of two sites. In addition to server clusters, the typical configuration is made up<br />
of an RA cluster (RA 1 and RA 2) at each of the two sites. However, multiple RA cluster<br />
configurations are also possible.<br />
Note: The dashed lines in Figure 2–1 represent the server WAN connections. To<br />
simplify the view, redundant and physical connections are not shown.<br />
Figure 2–1. Basic Geographic Clustered Environment<br />
Data Flow<br />
Figure 2–2 shows the data flow in the basic system configuration for data written by the<br />
server. The system replicates the data in snapshot replication mode to a remote site.<br />
The data flow is divided into the following segments: write, transfer, and distribute.<br />
Figure 2–2. Data Flow<br />
Write<br />
The flow of data for a write transaction is as follows:<br />
1. The host writes data to the splitter (either on the host or the fabric) that immediately<br />
sends it to the RA and to the production site replication volume (storage system).<br />
2. After receiving the data, the RA returns an acknowledgement (ACK) to the splitter.<br />
The storage system returns an ACK after successfully writing the data to storage.<br />
3. The splitter sends an ACK to the host that the write operation has been completed<br />
successfully.<br />
In snapshot replication mode, this sequence of events (steps 1 to 3) can be repeated<br />
multiple times before the snapshot is closed.<br />
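The three-step write sequence above can be sketched as a small, single-threaded simulation. This is an illustration of the ACK ordering only, using invented names; it is not the actual splitter or RA implementation, which is distributed and asynchronous:<br />

```python
# Illustrative simulation of the split-write sequence (invented names).

class RA:
    """Receives the split copy and accumulates it until snapshot close."""
    def __init__(self):
        self.snapshot = []
    def receive(self, data):
        self.snapshot.append(data)
        return True  # step 2: the RA ACKs the splitter on receipt

class Storage:
    """Production-site replication volume."""
    def __init__(self):
        self.blocks = []
    def write(self, data):
        self.blocks.append(data)
        return True  # step 2: storage ACKs after a successful write

def host_write(data, ra, storage):
    # Step 1: the splitter forwards the write to both the RA and storage.
    ra_ack = ra.receive(data)
    storage_ack = storage.write(data)
    # Step 3: the splitter ACKs the host only after both ACKs arrive.
    return ra_ack and storage_ack

ra, storage = RA(), Storage()
for block in ("w1", "w2", "w3"):  # steps 1-3 repeat until the snapshot closes
    assert host_write(block, ra, storage)
print(ra.snapshot)  # ['w1', 'w2', 'w3']
```

In snapshot mode, the accumulated writes are then processed and transferred as a unit, as described under "Transfer."<br />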
Transfer<br />
The flow of data for transfer is as follows:<br />
1. After processing the snapshot data (that is, applying the various compression<br />
techniques), the RA sends the snapshot over the WAN to its peer RA at the remote<br />
site.<br />
2. The RA at the remote site writes the snapshot to the journal. At the same time, the<br />
remote RA returns an ACK to its peer at the production site.<br />
Note: Alternatively, you can set an advanced policy parameter so that lag is<br />
measured to the journal. In that case, the RA at the target site returns an ACK to its<br />
peer at the source site only after it receives an ACK from the journal (step 3).<br />
3. After the complete snapshot is written to the journal, the journal returns an ACK to<br />
the RA.<br />
Distribute<br />
When possible, and unless instructed otherwise, the Unisys <strong>SafeGuard</strong> 30m solution<br />
proceeds at first opportunity to “distribute” the image to the appropriate location on the<br />
storage system at the remote site. The logical flow of data for distribution is as follows:<br />
1. The remote RA reads the image from the journal.<br />
2. The RA reads existing information from the relevant remote replication volume.<br />
3. The RA writes “undo” information (that is, information that can support a rollback, if<br />
necessary) to the journal.<br />
Note: Steps 2 and 3 are skipped when the maximum journal lag policy parameter<br />
causes distribution to operate in fast-forward mode.<br />
(See the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong> for<br />
more information.)<br />
4. The RA writes the image to the appropriate remote replication volume.<br />
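The distribute sequence can be sketched the same way. The undo record written in step 3 is what makes a later rollback possible, and fast-forward mode simply skips steps 2 and 3. The data structures below (a journal as a list of images, a volume as a dict of block contents) are assumptions for illustration only:<br />

```python
# Illustrative distribution step (invented data layout).

def distribute(journal, volume, fast_forward=False):
    image = journal.pop(0)  # step 1: read the image from the journal
    if not fast_forward:
        # step 2: read the existing data that the image will overwrite
        undo = {block: volume.get(block) for block in image}
        # step 3: write the undo record, enabling a rollback if necessary
        journal.append(("undo", undo))
    volume.update(image)    # step 4: write the image to the volume

journal = [{"lun0": "new"}]
volume = {"lun0": "old"}
distribute(journal, volume)
print(volume)   # {'lun0': 'new'}
print(journal)  # [('undo', {'lun0': 'old'})]
```

With `fast_forward=True`, no undo record is kept, which is why images distributed in fast-forward mode cannot be rolled back.<br />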
Alternatives to the Basic System Architecture<br />
The following are derivatives of the basic system architecture:<br />
Fabric Splitter<br />
An intelligent fabric switch can perform the splitting function instead of a Unisys<br />
<strong>SafeGuard</strong> <strong>Solutions</strong> host-based Splitter installed on the host. In this case, the host<br />
sends a single write transaction to the switch on its way to storage. At the switch,<br />
however, the message is split, with a copy sent also to RA (as shown in Figure 2–3). The<br />
system behaves the same way as it does when using a Unisys <strong>SafeGuard</strong> <strong>Solutions</strong><br />
host-based splitter on the host to perform the splitting function.<br />
Figure 2–3. Data Flow with Fabric Splitter<br />
Local Replication by CDP<br />
You can use CDP to perform replication over short distances—that is, to replicate<br />
storage at the same site as CRR does over long distances. Operation of the system is<br />
similar to CRR including the ability to use the journal to recover from a corrupted data<br />
image, and the ability, if necessary, to fail over to the remote side or storage pool. In<br />
Figure 2–4, there is no WAN, the storage pools are part of the storage at the same site,<br />
and the same RA appears in each of the segments.<br />
Figure 2–4. Data flow in CDP<br />
Note: The repository volume must belong to the remote-side storage pool. Unisys<br />
<strong>SafeGuard</strong> <strong>Solutions</strong> supports a simultaneous mix of groups for remote and local<br />
replication. Individual volumes and groups, however, must be designated for either<br />
remote or local replication, but not for both. Certain policy parameters do not apply for<br />
local replication by CDP.<br />
Single RA<br />
Note: Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> does not support a single-RA configuration,<br />
whether at both sites or at a single site.<br />
Diagnostic Tools and Capabilities<br />
The Unisys <strong>SafeGuard</strong> 30m solution offers the following tools and capabilities to help<br />
you diagnose and solve problems.<br />
Event Log<br />
The replication capability of the Unisys <strong>SafeGuard</strong> 30m solution records log entries in<br />
response to a wide range of predefined events. The event log records all significant<br />
events that have recently occurred in the system. Appendix E lists and explains the<br />
events.<br />
Each event is classified by an event ID. The event ID can be used to help analyze or<br />
diagnose system behavior, including identifying the trigger for a rolling problem,<br />
understanding a sequence of events, and examining whether the system performed the<br />
correct set of actions in response to a component failure.<br />
You can monitor system behavior by viewing the event log through the management<br />
console, by issuing CLI commands, or by reading RA logs. The exact period of time<br />
covered by the log varies according to the operational state of the environment during<br />
that period or, in the case of RA logs, the time period that was specified. The capacity of<br />
the event log is 5000 events.<br />
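Because the log keeps only the most recent 5000 events, older entries are discarded as new ones arrive. A bounded ring buffer, as sketched below, models that behavior; this is an analogy for capacity planning, not the RA's actual storage format:<br />

```python
from collections import deque

# A deque with maxlen silently drops the oldest entry once full,
# mirroring the event log's fixed 5000-event capacity.
event_log = deque(maxlen=5000)

for event_id in range(6000):          # placeholder event IDs
    event_log.append({"id": event_id})

print(len(event_log))      # 5000
print(event_log[0]["id"])  # 1000 -- the oldest 1000 events were dropped
```

The practical consequence: on a busy system, events you need for diagnosis can age out, which is one reason to configure the daily e-mail summary described below.<br />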
For problems that are not readily apparent and for situations that you are monitoring for<br />
failure, you can configure an e-mail notification to send all logs to you in a daily summary.<br />
Once you resolve the problem, you can remove the event notifications. See “Configuring<br />
a Diagnostic E-mail Notification” in this section to configure a daily summary of events.<br />
System Status<br />
The management console displays an immediate indication of any problem that<br />
interferes with normal operation of the Unisys <strong>SafeGuard</strong> 30m environment. If a<br />
component fails, the indication is accompanied by an error message that provides<br />
detailed information about the failure.<br />
You must log in to the management console to monitor the environment and to view<br />
events. The RAs are preconfigured with the users defined in Table 2–1.<br />
Table 2–1. User Types<br />
User / Initial Password / Permissions<br />
boxmgmt / boxmgmt / Install<br />
admin / admin / All except install and webdownload<br />
monitor / monitor / Read only<br />
webdownload / webdownload / webdownload<br />
SE / Unisys(CSC) / All except install and webdownload<br />
Note: The password boxmgmt is not used to log in to the management console; it is<br />
only used for SSH sessions.<br />
The CLI provides all users with status commands for the complete set of Unisys<br />
<strong>SafeGuard</strong> 30m components. You can use the information and statistics provided by<br />
these commands to identify bottlenecks in the system.<br />
E-mail Notifications<br />
The e-mail notification mechanism sends specified event notifications (or alerts) to<br />
designated individuals. Also, you can set up an e-mail notification for once a day that<br />
contains a daily summary of events.<br />
Configuring a Diagnostic E-mail Notification<br />
1. From the management console, click Alert Settings on the System menu.<br />
2. Under Rules, click Add.<br />
3. Using the diagnostic rule, select the appropriate topic, level, and type options.<br />
Diagnostic Rule<br />
This rule sends all messages on a daily basis to personnel of your choice.<br />
Topics: All Topics<br />
Level: Information<br />
Scope: Detailed<br />
Type: Daily<br />
4. Under Addresses, click Add.<br />
5. In the New Address box, type the e-mail address to which you would like event<br />
notifications sent. You can specify more than one e-mail address.<br />
6. Click OK.<br />
7. Repeat steps 4 through 6 for each additional e-mail recipient.<br />
8. Click OK.<br />
9. Click OK.<br />
Installation Diagnostics<br />
The Diagnostics menu of the Installation Manager provides a suite of diagnostic tools for<br />
testing the functionality and connectivity of the installed RAs and Unisys <strong>SafeGuard</strong> 30m<br />
components. Appendix C explains how to use the Installation Manager diagnostics.<br />
Installation Manager is also used to collect RA logs and host splitter logs from one<br />
centralized location. See Appendix A for more information about collecting logs.<br />
Host Information Collector (HIC)<br />
Cluster Logs<br />
The HIC collects extensive information about the environment, operation, and<br />
performance of any server on which a splitter has been installed. You can use the<br />
Installation Manager to collect logs across the entire environment including RAs and all<br />
servers on which the HIC feature is enabled. The HIC can also be used at the server. See<br />
Appendix A for more information about collecting logs.<br />
In a geographic clustered environment, MSCS maintains logs of events for the clustered<br />
environment. Analyzing these logs is helpful in diagnosing certain problems. Appendix I<br />
explains how to analyze these logs.<br />
Unisys <strong>SafeGuard</strong> 30m Collector<br />
The Unisys <strong>SafeGuard</strong> 30m Collector utility enables you to easily collect various pieces of<br />
information about the environment that can help in solving problems. Appendix G<br />
describes this utility.<br />
RA Diagnostics<br />
Diagnostics specific to the RAs are available to aid in identifying problems. Appendix B<br />
explains how to use the RA diagnostics.<br />
Hardware Indicators<br />
Hardware problems—for example, RA disk failures or RA power problems—are<br />
identified by status LEDs located on the RAs themselves. Several indicators are<br />
explained in Section 8, “Solving Replication Appliance (RA) Problems.”<br />
SNMP <strong>Support</strong><br />
The RAs support monitoring and problem notification using standard SNMP, including<br />
support for SNMPv3. You can issue SNMP queries to the agent on the RA. Also, you can<br />
configure the environment such that events generate SNMP traps that are then sent to<br />
designated hosts. Appendix F explains how to configure and use SNMP traps.<br />
kutils Utility<br />
The kutils utility is a proprietary server-based program that enables you to manage server<br />
splitters across all platforms. The command-line utility is installed automatically when the<br />
Unisys <strong>SafeGuard</strong> 30m splitter is installed on the application server. If the splitting<br />
function is not on a host but rather is on an intelligent switch, the kutils utility is copied<br />
from the Splitter CD-ROM. (See the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Planning and Installation<br />
<strong>Guide</strong> for more information.)<br />
Appendix H explains some kutils commands that are helpful in troubleshooting<br />
problems. See the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator’s<br />
<strong>Guide</strong> for complete reference information on the kutils utility.<br />
Discovering Problems<br />
Symptoms of problems and notifications occur in various ways with the Unisys<br />
<strong>SafeGuard</strong> 30m solution. The tools and capabilities described previously provide<br />
notifications for some conditions and events. Other problems are recognized from<br />
failures. Problems might be noted in the following ways:<br />
• Problems with data because of a rolling disaster, which means that the site needs to<br />
use a previous snapshot to recover<br />
• Problems with applications failing<br />
• Inability to switch processing to the remote or secondary site<br />
• Problems with the MSCS cluster (such as a failover to another cluster or site)<br />
• Problems reported in an e-mail notification from an RA<br />
• Problem reported in an SNMP trap notification<br />
• Problems listed on the management console as reported in the overall system status<br />
or in group state or properties<br />
• Problems reported in the daily summary of events<br />
In this guide, symptoms and notifications are often listed with potential problems.<br />
However, the messages and notifications vary based on the problem, and multiple<br />
events and notifications are possible at any given time.<br />
Events That Cause Journal Distribution<br />
Certain conditions might occur that can prevent access to the expected journal image.<br />
For instance, images might be flushed or distributed so that they are not available. Table<br />
2–2 lists events that might cause the images to be unavailable. For tables listing all<br />
events, see Appendix E.<br />
Table 2–2. Events That Cause Journal Distribution<br />
Event 4042 (Info, Detailed): Group deactivated. (Group , RA )<br />
Trigger: A user action deactivated the group.<br />
Event 4062 (Info, Detailed): Access enabled to latest image. (Group , Failover site )<br />
Trigger: Access was enabled to the latest image during automatic failover.<br />
Event 4097 (Warning, Detailed): Maximum journal lag exceeded. Distribution in fast-forward—older images removed from journal. (Group )<br />
Trigger: Fast-forward action started and caused the snapshots taken before the fast-forward action to be lost and the maximum journal lag to be exceeded.<br />
Event 4099 (Info, Detailed): Initializing in long resynchronization mode. (Group )<br />
Trigger: The system started a long resynchronization.<br />
<strong>Troubleshooting</strong> Procedures<br />
For troubleshooting, you must differentiate between two kinds of problems: those that<br />
arise from environmental changes, such as network changes (cabling, routing, and port<br />
blocking), changes related to zoning, logical unit number (LUN) masking, other devices<br />
in the SAN, and storage failures; and those that arise from misconfiguration or internal<br />
errors in the environmental setup.<br />
Refer to the preceding diagrams as you consider the general troubleshooting procedures<br />
that follow. Use the following four general tasks to help you identify symptoms and<br />
causes whenever you encounter a problem.<br />
Identifying the Main Components and Connectivity of the<br />
Configuration<br />
Knowledge of the main system components and the connectivity between these<br />
components is a key to understanding how the entire environment operates. This<br />
knowledge helps you understand where the problem exists in the overall system context<br />
and can help you correctly identify which components are affected.<br />
Identify the following components:<br />
• Storage device, controller, and the configuration of connections to the Fibre Channel<br />
(FC) switch<br />
• Switch and port types, and their connectivity<br />
• Network configuration (WAN and LAN): IP addresses, routing schemes, subnet<br />
masks, and gateways<br />
• Participating servers: operating system, host bus adapters (HBAs), connectivity to<br />
the FC switch<br />
• Participating volumes: repository volumes, journal volumes, and replication volumes<br />
Understanding the Current State of the System<br />
Use the management console and the CLI get commands to understand the current<br />
state of the system:<br />
• Is any component shown to be in an error state? If so, what is the error? Is the<br />
component down or disconnected from other components?<br />
• What is the state of the groups, splitters, volumes, transfer, and distribution?<br />
• Is the current state stable or changing within intervals of time?<br />
Verifying the System Connectivity<br />
To verify the system connectivity, use physical and tool-based verification methods to<br />
answer the following questions:<br />
• Are all the components physically connected? Are the activity or link lights active?<br />
• Are the components connected to the correct switch or switches? Are they<br />
connected to the correct ports?<br />
• Is there connectivity over the WAN between all appliances? Is there connectivity<br />
between the appliances on the same site over the management network?<br />
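A quick way to probe the WAN and management-LAN reachability questions above is a short TCP connectivity check. The addresses and port below are hypothetical; substitute the actual RA addresses for your installation and the ports listed in Tables 7–2 through 7–4:<br />

```python
import socket

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unroutable
        return False

# Hypothetical RA management addresses and port; substitute your own.
for ra in ("10.0.0.11", "10.0.0.12"):
    print(ra, tcp_reachable(ra, 443, timeout=0.5))
```

This checks only TCP reachability; blocked ports, one-way routing, and MTU problems still require the connectivity testing tool described in Appendix C.<br />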
Analyzing the Configuration Settings<br />
Many problems occur because of improper configuration settings such as improper<br />
zoning. Analyze the configuration settings to ensure they are not the cause of the<br />
problem.<br />
• Are the zones properly configured?<br />
− Splitter-to-storage?<br />
− Splitter-to-RA?<br />
− RA-to-storage?<br />
− RA-to-RA?<br />
• Are the zones in the switch config?<br />
• Has the proper switch config been applied?<br />
• Are the LUNs properly masked?<br />
− Is the splitter masked to see only the relevant replication volume or volumes?<br />
− Are the RAs masked to see the relevant replication volume or volumes,<br />
repository volume, and journal volume or volumes?<br />
• Are the network settings (such as gateway) for the RAs correct?<br />
• Are there any possible IP conflicts on the network?<br />
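The LUN-masking questions above can be codified as a quick consistency check against your configuration records. The volume names and the layout of the `visible` map below are invented for illustration; the rules they encode are the ones in the checklist:<br />

```python
# Hypothetical volume names; replace with your configuration records.
replication = {"repl_vol1", "repl_vol2"}
repository = {"repo_vol"}
journal = {"journal_vol1"}

# Which LUNs each component can actually see after masking.
visible = {
    "splitter": {"repl_vol1", "repl_vol2"},
    "RA1": {"repl_vol1", "repl_vol2", "repo_vol", "journal_vol1"},
    "RA2": {"repl_vol1", "repl_vol2", "repo_vol", "journal_vol1"},
}

problems = []
# The splitter should be masked to see only the replication volumes.
extra = visible["splitter"] - replication
if extra:
    problems.append("splitter sees non-replication LUNs: %s" % sorted(extra))
# Each RA should see the replication, repository, and journal volumes.
for ra in ("RA1", "RA2"):
    missing = (replication | repository | journal) - visible[ra]
    if missing:
        problems.append("%s cannot see required LUNs: %s" % (ra, sorted(missing)))

print(problems or "masking looks consistent")
```

An empty `problems` list means the masking matches the checklist; any entry points at the component and LUNs to re-examine in the zoning or masking configuration.<br />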
Section 3<br />
Recovering in a Geographic Replication<br />
Environment<br />
This section provides recovery procedures so that user applications can be online as<br />
quickly as possible in a geographic replication environment.<br />
An older image might be required to recover from a rolling disaster, human error, a virus,<br />
or any other failure that corrupts the latest snapshot image. Ensure that the image is<br />
tested prior to reversing direction.<br />
Complete the procedures for each group that needs to be moved based on the type of<br />
hosts in the environment:<br />
• Manual Failover of Volumes and Data Consistency Groups<br />
• Manual Failover of Volumes and Data Consistency Groups for ClearPath MCP Hosts<br />
Refer to the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong> for<br />
more information on logged and virtual (with roll or without roll) access modes. For<br />
specific environments, refer to the best practices documents listed under <strong>SafeGuard</strong><br />
<strong>Solutions</strong> documentation on the Unisys Product <strong>Support</strong> Web site,<br />
www.support.unisys.com<br />
Manual Failover of Volumes and Data Consistency<br />
Groups<br />
When you need to perform a manual failover of volumes and data consistency groups,<br />
complete the following tasks:<br />
1. Accessing an image<br />
2. Testing the selected image<br />
Accessing an Image<br />
1. From the Management Console, select any one of the data consistency groups<br />
on the navigation pane.<br />
2. Select the Status tab (if it is not already open).<br />
3. Perform the following steps to allow access to the target image:<br />
a. Right-click the Consistency Group and select Pause Transfer. Click Yes<br />
when the system prompts that the group activity will be paused.<br />
b. Right-click the Consistency Group and scroll down.<br />
c. Select the Remote Copy name and click Enable Image Access.<br />
The Enable Image Access dialog box appears.<br />
d. Choose Select an image from the list and click Next.<br />
The Select Explicit Image dialog box appears and displays the available<br />
images.<br />
e. Select the desired image from the list and click Next.<br />
The Image Access Mode dialog box appears.<br />
f. Select the option Logged access (physical) and click Next.<br />
The Summary screen displays the Image name and the Image Access mode.<br />
g. Click Finish.<br />
Note: This process might take a long time to complete depending on the value<br />
of the journal lag setting in the group policy of the consistency group. The<br />
following message appears during the process:<br />
Enabling log access<br />
h. Verify the target image name displayed below the bitmap in the components<br />
pane under the Status tab.<br />
Transfer:Paused displays at the bottom in the Status tab under the<br />
components pane.<br />
Testing the Selected Image at Remote Site<br />
Perform the following steps to test the selected image at the remote site:<br />
1. Run the following batch file to mount a volume at the remote site. If necessary,<br />
modify the program files\kdriver path to fit your environment.<br />
@echo off<br />
cd "c:\program files\kdriver\kutils"<br />
"c:\program files\kdriver\kutils\kutils.exe" umount e:<br />
"c:\program files\kdriver\kutils\kutils.exe" mount e:<br />
2. Repeat step 1 for all volumes in the group.<br />
3. Ensure that the selected image is valid; that is, verify that<br />
• All applications start successfully using the selected image.<br />
• The data in the image is consistent and valid.<br />
For example, you might want to test whether you can start a database application on<br />
this image. You might also want to run proprietary test procedures to validate the<br />
data.<br />
4. Skip to “Unmounting the Volumes at Production Site and Reversing Replication<br />
Direction” if you have tested the validity of the image and the test is successful. If<br />
the test is unsuccessful, continue with step 5.<br />
5. To test a different image, perform the procedure “Unmounting the Volumes and<br />
Disabling the Image Access at Remote Site.”<br />
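The batch files in these procedures must be run once per volume in the group. As a convenience, the per-volume command lines can be generated in one pass. The following is a minimal sketch: the kutils.exe path matches the batch files above, but the helper functions and the drive letters passed to them are illustrative assumptions, not part of the product.<br />

```python
# Sketch: generate the per-volume kutils command lines used in the batch
# files above, instead of editing a batch file by hand for each volume.
# The kutils.exe path matches the batch files; the drive letters passed in
# are hypothetical examples.
KUTILS = r'"c:\program files\kdriver\kutils\kutils.exe"'

def mount_commands(drives):
    """Remount each volume at the remote site (umount, then mount)."""
    cmds = []
    for d in drives:
        cmds.append(f"{KUTILS} umount {d}:")
        cmds.append(f"{KUTILS} mount {d}:")
    return cmds

def unmount_commands(drives):
    """Flush the file system, then unmount, before choosing another image."""
    cmds = []
    for d in drives:
        cmds.append(f"{KUTILS} flushFS {d}:")
        cmds.append(f"{KUTILS} umount {d}:")
    return cmds

# Each generated line corresponds to one line of the batch files above.
for line in mount_commands(["e"]):
    print(line)
```

Writing the generated lines to a .bat file reproduces step 1 for every volume in the group in a single run.<br />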
Unmounting the Volumes and Disabling the Image Access at Remote<br />
Site<br />
1. Before choosing another image, unmount the volume using the following batch file.<br />
If necessary, modify the program files/kdriver path to fit your environment.<br />
@echo off<br />
cd "c:\program files\kdriver\kutils"<br />
"c:\program files\kdriver\kutils\kutils.exe" flushFS e:<br />
"c:\program files\kdriver\kutils\kutils.exe" umount e:<br />
2. Repeat step 1 for all volumes in the group.<br />
3. Select one of the Consistency Groups in the navigation pane on the<br />
Management Console.<br />
4. Right-click the Consistency Group and scroll down.<br />
5. Select the Remote Copy name and click Disable Image Access.<br />
6. Click Yes when the system prompts you to ensure that all group volumes are<br />
unmounted.<br />
7. Repeat the procedures “Accessing an Image” and “Testing the Selected Image at<br />
the Remote Site”.<br />
Unmounting the Volumes at Production Site and Reversing<br />
Replication Direction<br />
Perform these steps at the host:<br />
1. To unmount a volume at the production site, run the following batch file. If<br />
necessary, modify the program files\kdriver path to fit your environment.<br />
@echo off<br />
cd "c:\program files\kdriver\kutils"<br />
"c:\program files\kdriver\kutils\kutils.exe" flushFS e:<br />
"c:\program files\kdriver\kutils\kutils.exe" umount e:<br />
2. Repeat step 1 for all volumes in the group.<br />
Perform these steps on the Management Console:<br />
1. Select a Consistency Group from the navigation pane.<br />
2. Right-click the Group and select Pause Transfer. Click Yes when the system<br />
prompts that the group activity will be paused.<br />
3. Click the Status tab. The status of the transfer must display Paused.<br />
4. Right-click the Consistency group and select Failover to the remote site.<br />
5. Click Yes when the system prompts you to confirm failover.<br />
6. Ensure that the Start data transfer immediately check box is selected.<br />
The following warning message appears:<br />
Warning: Journal will be erased. Do you wish to continue?<br />
7. Click Yes to continue.<br />
Manual Failover of Volumes and Data Consistency<br />
Groups for ClearPath MCP Hosts<br />
When you need to perform a manual failover of volumes and data consistency groups,<br />
complete the following tasks:<br />
1. Accessing an image<br />
2. Testing the selected image<br />
Note: For ClearPath MCP hosts, close and free units at the remote site before<br />
completing the following procedures. This action prevents SCSI Reserved errors from<br />
being logged for units that are no longer accessible.<br />
Accessing an Image<br />
Quiesce any databases before accessing an image. Once the pack has failed over<br />
and has been acquired, resume the databases.<br />
If the volumes to be failed over are not in use by a database, issue the CLOSE PK<br />
command from the operator display terminal (ODT) to close the<br />
volumes.<br />
For more information on how to access an image, refer to the procedure<br />
“Accessing an Image” under “Manual Failover of Volumes and Data Consistency<br />
Groups.”<br />
Testing the Selected Image at Remote Site<br />
1. Mount a volume at the remote site by issuing the ACQUIRE PK <br />
command from the remote site ODT to acquire the unit. Also acquire any controls<br />
necessary to access the unit if these controls are not automatically acquired.<br />
Verify that the MCP can access the volume using commands such as SC– and P PK<br />
to display the status of the peripherals.<br />
2. Repeat step 1 for all volumes in the group.<br />
3. Ensure that the selected image is valid; that is, verify that<br />
• All applications start successfully using the selected image.<br />
• The data in the image is consistent and valid.<br />
For example, you might want to test whether you can start a database application on<br />
this image. You might also want to run proprietary test procedures to validate the<br />
data.<br />
4. If you tested the validity of the image and the test completed successfully, skip to<br />
“Unmounting the Volumes at Source Site and Reversing Replication Direction.” If the<br />
testing is not successful, continue with step 5.<br />
5. To test a different image, perform the procedure “Unmounting the Volumes and<br />
Disabling the Image Access at Remote Site.”<br />
Unmounting the Volumes and Disabling the Image Access at Remote<br />
Site<br />
1. Before choosing another image, unmount the volume by issuing the CLOSE PK<br />
command followed by the FREE PK command from<br />
the ODT. Verify that the units are closed and freed using peripheral status<br />
commands.<br />
2. Repeat step 1 for all volumes in the group.<br />
3. Click the Status tab. The status of the transfer must display Paused.<br />
4. Right-click the Consistency group and select Failover to the remote site.<br />
5. Click Yes when the system prompts you to confirm failover.<br />
6. Ensure that the Start data transfer immediately check box is selected.<br />
The following warning message appears:<br />
Warning: Journal will be erased. Do you wish to continue?<br />
7. Click Yes to continue.<br />
Unmounting the Volumes at Source Site and Reversing Replication<br />
Direction<br />
Perform these steps at the source site host:<br />
1. Unmount a volume at the source site by issuing the CLOSE PK <br />
command followed by the FREE PK command from the ODT to close<br />
and free the volume.<br />
If the site is down when the host is recovered, use the FREE PK <br />
command to free the original source units. In response to inquiry commands, the<br />
status of the original source units is “closed.” Free the units to prevent access by<br />
the original source site host.<br />
2. Repeat step 1 for all volumes in the group.<br />
3. Select a Consistency Group from the navigation pane.<br />
4. Right-click the Group and select Pause Transfer. Click Yes when the system<br />
prompts that the group activity will be paused.<br />
5. Click the Status tab. The status of the transfer must display Paused.<br />
6. Right-click the Consistency group and select Failover to the remote site.<br />
7. Click Yes when the system prompts you to confirm failover.<br />
8. Ensure that the Start data transfer immediately check box is selected.<br />
The following warning message appears:<br />
Warning: Journal will be erased. Do you wish to continue?<br />
9. Click Yes to continue.<br />
Section 4<br />
Recovering in a Geographic Clustered<br />
Environment<br />
This section provides information and procedures that relate to geographic clustered<br />
environments running Microsoft Cluster Service (MSCS).<br />
Checking the Cluster Setup<br />
To ensure that the cluster configuration is correct, check the MSCS properties and the<br />
network bindings. For more detailed information, refer to “<strong>Guide</strong> to Creating and<br />
Configuring a Server Cluster under Windows Server 2003”, which you can download at<br />
http://www.microsoft.com/downloads/details.aspx?familyid=96F76ED7-9634-4300-<br />
9159-89638F4B4EF7&displaylang=en<br />
MSCS Properties<br />
To check the MSCS properties, enter the following command from the command<br />
prompt:<br />
Cluster /prop<br />
Output similar to the following is displayed:<br />
T Cluster Name Value<br />
-- -------------------- ------------------------------ -----------------------<br />
M AdminExtensions {4EC90FB0-D0BB-11CF-B5EF-0A0C90AB505}<br />
D DefaultNetworkRole 2 (0x2)<br />
S Description<br />
B Security 01 00 14 80 ... (148 bytes)<br />
B Security Descriptor 01 00 14 80 ... (148 bytes)<br />
M Groups\AdminExtensions<br />
M Networks\AdminExtensions<br />
M NetworkInterfaces\AdminExtensions<br />
M Nodes\AdminExtensions<br />
M Resources\AdminExtensions<br />
M ResourceTypes\AdminExtensions<br />
D EnableEventLogReplication 0 (0x0)<br />
D QuorumArbitrationTimeMax 300 (0x12c)<br />
D QuorumArbitrationTimeMin 15 (0xf)<br />
D DisableGroupPreferredOwnerRandomization 0 (0x0)<br />
D EnableEventDeltaGeneration 1 (0x1)<br />
D EnableResourceDllDeadlockDetection 0 (0x0)<br />
D ResourceDllDeadlockTimeout 240 (0xf0)<br />
D ResourceDllDeadlockThreshold 3 (0x3)<br />
D ResourceDllDeadlockPeriod 1800 (0x708)<br />
D ClusSvcHeartbeatTimeout 60 (0x3c)<br />
D HangRecoveryAction 3 (0x3)<br />
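The listing above can be checked mechanically rather than by eye. The following is a minimal sketch in Python (pure string handling): it assumes the column layout shown above (a type flag, the property name, then a decimal value) and emits corrective commands for any shared-quorum setting that is absent or wrong, mirroring the command lists that follow.<br />

```python
# Sketch: parse the DWORD ("D") lines of `Cluster /prop` output and emit
# the corrective `Cluster /prop name=value` commands for any setting that
# does not match the expected value. Assumes the column layout shown
# above; the expected values mirror the shared-quorum list below.
EXPECTED_SHARED_QUORUM = {
    "QuorumArbitrationTimeMax": 300,  # not for majority node set
    "QuorumArbitrationTimeMin": 15,
    "HangRecoveryAction": 3,
    "EnableEventLogReplication": 0,
}

def parse_cluster_props(output):
    """Extract DWORD properties: 'D Name 300 (0x12c)' -> {'Name': 300}."""
    props = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "D":
            props[parts[1]] = int(parts[2])
    return props

def correction_commands(props, expected=EXPECTED_SHARED_QUORUM):
    """Commands to run for every expected setting that is absent or wrong."""
    return [f"Cluster /prop {name}={value}"
            for name, value in expected.items()
            if props.get(name) != value]

sample = """\
D EnableEventLogReplication 0 (0x0)
D QuorumArbitrationTimeMax 300 (0x12c)
D QuorumArbitrationTimeMin 15 (0xf)
D HangRecoveryAction 3 (0x3)
"""
```

When all settings already match, the function returns an empty list; otherwise it returns exactly the commands listed in the subsections that follow.<br />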
If the properties are not set correctly, use the following commands to correct the<br />
settings.<br />
Majority Node Set Quorum<br />
Cluster /prop HangRecoveryAction=3<br />
Cluster /prop EnableEventLogReplication=0<br />
Shared Quorum<br />
Cluster /prop QuorumArbitrationTimeMax=300 (not for majority node set)<br />
Cluster /prop QuorumArbitrationTimeMin=15<br />
Cluster /prop HangRecoveryAction=3<br />
Cluster /prop EnableEventLogReplication=0<br />
Network Bindings<br />
The following binding priority order and settings are suggested as best practices for<br />
clustered configurations. These procedures assume that you can identify the public and<br />
private networks by the connection names that are referenced in the steps.<br />
Host-Specific Network Bindings and Settings<br />
1. Open the Network Connections window.<br />
2. On the Advanced menu, click Advanced Settings.<br />
3. Select the Networks and Bindings tab.<br />
This tab shows the binding order in the upper pane and specific connection<br />
properties in the lower pane.<br />
4. Verify that the public network connection is above the private network in the binding<br />
list in the upper pane.<br />
If it is not, follow these steps to change the order:<br />
a. Select a network connection in the binding list in the upper pane.<br />
b. Use the arrows to the right to move the network connection up or down in the<br />
list as appropriate.<br />
5. Select the private network in the binding list. In the lower pane, verify that the File<br />
and Print Sharing for Microsoft Networks and the Client for Microsoft<br />
Networks check boxes are cleared for the private network.<br />
6. Click OK.<br />
7. Highlight the public connections, then right-click and click Properties.<br />
8. Select Internet Protocol (TCP/IP) in the list, and click Properties.<br />
9. Click Advanced.<br />
10. Select the WINS tab.<br />
11. Ensure that Enable LMHOSTS lookup is selected.<br />
12. Ensure that Disable NetBIOS over TCP/IP is selected.<br />
13. Repeat steps 7 through 12 for the private network connection.<br />
Cluster-Specific Network Bindings and Settings<br />
1. Open the Cluster Administrator.<br />
2. Right-click the cluster (the top node in the tree structure in the left pane) and click<br />
Properties.<br />
3. Select the Networks Priority tab.<br />
4. Ensure that the private network is at the top of the list and that the public network is<br />
below the private network.<br />
If it is not, follow these steps to change the order:<br />
a. Select the private network.<br />
b. Use the command button at the right to move the private network up in the<br />
list as appropriate.<br />
5. Select the private network, and click Properties.<br />
6. Verify that the Enable this network for cluster use check box is selected and<br />
that Internal cluster communications only (private network) is selected.<br />
7. Click OK.<br />
8. Select the public network, and click Properties.<br />
9. Verify that the Enable this network for cluster use check box is selected and<br />
that All communications (mixed network) is selected.<br />
10. Click OK.<br />
Group Initialization Effects on a Cluster<br />
Move-Group Operation<br />
The following conditions affect failover times for a cluster move-group operation. A<br />
cluster move-group operation cannot complete if a lengthy consistency group<br />
initialization, such as a full-sweep initialization, long resynchronization, or initialization<br />
from marking mode, is executing in the background. Review these conditions and plan<br />
accordingly.<br />
Full-Sweep Initialization<br />
A full-sweep initialization occurs when the disks on both sites are scanned or read in<br />
their entirety and a comparison is made, using checksums, to check for differences. Any<br />
differences are then replicated from the Production site disk to the remote site disk. A<br />
full-sweep initialization generates an entry in the management console log.<br />
A full-sweep initialization occurs in the following circumstances:<br />
• Disabling or enabling a group<br />
Disabling a group causes all disk replication in the group to stop. A full-sweep<br />
initialization is performed once the group is enabled. The full-sweep initialization<br />
guarantees that the disks are consistent between the sites.<br />
• Adding a new splitter server or host that has access to the disks in the group<br />
When adding a new splitter to the replication, there is a time before the splitter is<br />
added to the configuration when activity from this splitter to the disks is not being<br />
monitored or replicated. To guarantee that no write operations were performed by<br />
the new splitter before the splitter was configured in the replication, a full-sweep<br />
initialization is required for all groups that contain disks accessed by this splitter. This<br />
initialization is done automatically by the system.<br />
• Double failure of a main component<br />
When a double failure of a main component occurs, a full-sweep initialization is<br />
required to guarantee that consistency was maintained. The main components<br />
include the host, the replication appliance (RA), and the storage subsystem.<br />
Long Resynchronization<br />
A long resynchronization occurs when the data difference that needs to be replicated to<br />
the other site cannot fit on the journal volume. The data is split into multiple snapshots<br />
for distribution to the other site, and all the previous snapshots are lost. Long<br />
resynchronization can be caused by long WAN outages, a group being disabled for a long<br />
time period, and other instances when replication has not been functional for a long time<br />
period.<br />
Long resynchronization is not connected with full-sweep initialization and can also<br />
happen during initialization from marking (see “Initialization from Marking Mode”). It is<br />
dependent only on the journal volume size and the amount of data to be replicated.<br />
A long resynchronization is identified on the Status tab in the components pane, under<br />
the remote journal bitmap in the management console. The status Performing Long<br />
Resync is visible for the group that is currently performing a long resynchronization.<br />
Initialization from Marking Mode<br />
All other instances of initialization in the replication are caused by marking. The marking<br />
mode refers to a replication mode in which the location of “dirty,” or changed, data is<br />
marked in a bitmap on the repository volume. This bitmap is a standard size—no matter<br />
how much data changes or what size disks are being monitored—so the repository<br />
volume cannot fill up during marking.<br />
The replication moves to marking mode when replication cannot be performed normally,<br />
such as during WAN outages. This marking mode guarantees that all data changes are<br />
still being recorded until replication is functioning normally. When replication can perform<br />
normally again, the RAs read the dirty, or changed, data from the source disk based on<br />
data recorded in the bitmap and replicate it to the disk on the remote site. The length of<br />
time for this process to complete depends on the amount of dirty, or changed, data as<br />
well as the performance of other components in the configuration, such as bandwidth<br />
and the storage subsystem.<br />
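As a rough illustration of the fixed-size marking bitmap described above, the following sketch records dirty locations as set bits. The region size and disk size here are hypothetical illustration values, not the appliance's actual granularity.<br />

```python
# Sketch of the marking-mode idea described above: dirty locations are
# recorded as set bits in a fixed-size bitmap, so the space used never
# grows with the amount of changed data. Region and disk sizes are
# hypothetical, not the appliance's actual values.
class MarkingBitmap:
    def __init__(self, disk_size, region_size):
        self.region_size = region_size
        n_regions = (disk_size + region_size - 1) // region_size
        self.bits = bytearray((n_regions + 7) // 8)  # fixed size

    def mark(self, offset, length):
        """Mark every region touched by a write as dirty."""
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        for r in range(first, last + 1):
            self.bits[r // 8] |= 1 << (r % 8)

    def dirty_regions(self):
        """Regions to read back and replicate when marking mode ends."""
        n = len(self.bits) * 8
        return [r for r in range(n)
                if self.bits[r // 8] & (1 << (r % 8))]

# 1 GiB disk tracked in 1 MiB regions: the bitmap is 128 bytes no matter
# how much data changes while replication is down.
bm = MarkingBitmap(disk_size=1 << 30, region_size=1 << 20)
bm.mark(offset=5 * (1 << 20) + 100, length=3 * (1 << 20))
```

When replication resumes, only the regions returned by dirty_regions need to be read from the source disk and replicated, which is why the repository volume cannot fill up during marking.<br />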
A high-load state can also cause the replication to move to marking mode. A high-load<br />
state occurs when write activity to the source disks exceeds the limits that the<br />
replication, bandwidth, or remote disks can handle. Replication moves into marking<br />
mode at this time until the replication determines the activity has reached a level at<br />
which it can continue normal replication. The replication then exits the high-load state<br />
and an initialization from marking occurs.<br />
See Section 10, “Solving Performance Problems,” for more information on high-load<br />
conditions and problems.<br />
Behavior of <strong>SafeGuard</strong> 30m Control During a<br />
Move-Group Operation<br />
During a move-group operation, the Unisys <strong>SafeGuard</strong> 30m Control resource in a<br />
clustered environment behaves as follows. Be aware of this information when dealing<br />
with various failure scenarios.<br />
1. MSCS issues an offline request because of a failure with a group resource—for<br />
example, a physical disk—or an MSCS move group. The request is sent to the<br />
Unisys <strong>SafeGuard</strong> 30m Control resource on the node that owns the group.<br />
The MSCS resources that are dependent on the Unisys <strong>SafeGuard</strong> 30m Control<br />
resource, such as physical disk resources, are taken offline first. Taking the<br />
resources offline does not issue any commands to the RA.<br />
2. MSCS issues an online request to the Unisys <strong>SafeGuard</strong> 30m Control resource on<br />
the node to which a group was moved, or in the case of failure, to the next node in<br />
the preferred owners list.<br />
3. When the resource receives an online request from MSCS, the Unisys <strong>SafeGuard</strong><br />
30m Control resource issues two commands to control the access to disks:<br />
initiate_failover and verify_failover.<br />
Initiate_Failover Command<br />
This command changes the replication direction from one site to another.<br />
• If a same-site failover is requested, the command completes successfully with<br />
no action performed by the RA.<br />
• The resource issues the verify_failover command to see if the RA performed<br />
the operations successfully.<br />
• If a different-site failover is requested, the RA starts changing direction between<br />
sites and returns successfully. In certain circumstances, the RA returns a failure<br />
when the WAN is down or a long resynchronization occurs.<br />
• If the RA returns a failure to the Unisys <strong>SafeGuard</strong> 30m Control resource, the<br />
resource logs the failure in the Windows application event log and retries the<br />
command continuously until the cluster pending timeout is reached. When a<br />
move-group operation fails, check the application event log to view the events<br />
posted by the resource. The event source of the event entry is the 30m Control.<br />
Verify_Failover Command<br />
This command enables the Unisys <strong>SafeGuard</strong> 30m Control resource to determine<br />
the time at which the change of the replication direction completes.<br />
• If a same-site failover is requested, the command completes successfully with<br />
no action performed by the RA.<br />
• If a different-site failover is requested, the verify_failover command returns a<br />
pending status until the replication direction changes. The change of direction<br />
takes from 2 to 30 minutes.<br />
• When the verify_failover command completes, write access to the physical disk<br />
is enabled to the host from the RA and the splitter.<br />
• If the time to complete the verify_failover command is within the pending<br />
timeout, the Unisys <strong>SafeGuard</strong> 30m Control resource comes online followed by<br />
all the resources dependent on this resource.<br />
All dependent disks come online using the default physical disk timeout of an<br />
MSCS cluster. The physical disk is available to the physical disk resource<br />
immediately; there is no delay. Physical disk access is available when the Unisys<br />
<strong>SafeGuard</strong> 30m Control resource comes online. You do not need to change the<br />
default resource settings for the physical disk. However, the physical disk must<br />
be dependent on the Unisys <strong>SafeGuard</strong> 30m Control resource.<br />
• If the time to complete the verify_failover command is longer than the pending<br />
timeout of the Unisys <strong>SafeGuard</strong> 30m Control resource, MSCS fails this<br />
resource.<br />
The default pending timeout for a Unisys <strong>SafeGuard</strong> 30m Control resource is<br />
15 minutes or 900 seconds. This timeout occurs before the cluster disk timeout.<br />
If you use the default retry value of 1, this resource issues the following<br />
commands:<br />
• Initiate_failover<br />
• Verify_failover<br />
• Initiate_failover<br />
• Verify_failover<br />
Using the default pending timeout, the Unisys <strong>SafeGuard</strong> 30m Control resource<br />
waits a total of 30 minutes to come online; this timeout period equals the<br />
timeout plus one retry. If the resource does not come online, MSCS attempts to<br />
move the group to the next node in the preferred owners list and then repeats<br />
this process.<br />
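The online sequence and timeout arithmetic described above (a 900-second pending timeout, one retry, about a 30-minute worst case) can be sketched as a polling loop. The ra object below is a stand-in stub, not the real Unisys SafeGuard 30m Control or RA interface.<br />

```python
# Sketch of the online-request sequence described above: issue
# initiate_failover, then poll verify_failover until it completes or the
# pending timeout (900 s by default) expires; with retry=1 the sequence
# runs twice, giving the 30-minute worst case. The `ra` object is a
# stand-in stub, not the real 30m Control / RA interface.
PENDING_TIMEOUT = 900  # seconds, default for the 30m Control resource

def bring_online(ra, retries=1, timeout=PENDING_TIMEOUT, poll_interval=10):
    for _attempt in range(retries + 1):
        ra.initiate_failover()
        waited = 0
        while waited < timeout:
            if ra.verify_failover() == "complete":
                return True          # resource comes online
            waited += poll_interval  # simulated clock, no real sleep
    return False                     # MSCS fails the resource, moves group

class StubRA:
    """Completes the direction change after a fixed number of polls."""
    def __init__(self, polls_needed):
        self.polls_needed = polls_needed
    def initiate_failover(self):
        pass                         # same-site case: no RA action needed
    def verify_failover(self):
        self.polls_needed -= 1
        return "complete" if self.polls_needed <= 0 else "pending"
```

A direction change that completes within the pending timeout brings the resource (and its dependent physical disks) online; one that never completes exhausts the timeout plus one retry, after which MSCS moves the group to the next preferred owner.<br />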
Recovering by Manually Moving an Auto-Data<br />
(Shared Quorum) Consistency Group<br />
An older image might be required to recover from a rolling disaster, human error, a virus,<br />
or any other failure that corrupts the latest snapshot image. It is impossible to recover<br />
automatically to an older image using MSCS because automatic cluster failover is<br />
designed to minimize data loss. The Unisys <strong>SafeGuard</strong> 30m solution always attempts to<br />
fail over to the latest image.<br />
Note: Manual image recovery is only for data consistency groups, not for the quorum<br />
group.<br />
To recover a data consistency group using an older image, you must complete the<br />
following tasks:<br />
• Take the cluster data group offline.<br />
• Perform a manual failover of an auto-data (shared quorum) consistency group to a<br />
selected image.<br />
• Bring the cluster group online and check the validity of the image.<br />
• Reverse the replication direction of the consistency group.<br />
Taking a Cluster Data Group Offline<br />
To take a group offline in the cluster for which you are performing a manual recovery,<br />
complete the following steps:<br />
1. Open Cluster Administrator on one of the nodes in the MSCS cluster.<br />
2. Right-click the group that you want to recover and click Take Offline.<br />
3. Wait until all resources in the group show the status as Offline.<br />
Performing a Manual Failover of an Auto-Data (Shared Quorum)<br />
Consistency Group to a Selected Image<br />
1. Open the Management Console.<br />
2. Select a Consistency Group from the navigation pane.<br />
Note: Do not select the quorum group. The data consistency group you select<br />
should be the cluster data group that you took offline.<br />
3. Click the Policy tab on the selected Consistency Group.<br />
4. Scroll down and select Advanced in the Policy tab.<br />
5. Select Manual (shared quorum) in the Global cluster mode list.<br />
6. Click Apply.<br />
7. Perform the following steps to access the image:<br />
a. Right-click the Consistency Group and select Pause Transfer. Click Yes<br />
when the system prompts that the group activity will be paused.<br />
b. Right-click the Consistency Group and scroll down.<br />
c. Select the Remote Copy name and click Enable Image Access.<br />
The Enable Image Access dialog box appears.<br />
d. Choose Select an image from the list and click Next.<br />
The Select Explicit Image dialog box appears and displays the available<br />
images.<br />
e. Select the desired image from the list and click Next.<br />
The Image Access Mode dialog box appears.<br />
f. Select the option Logged access (physical) and click Next.<br />
The Summary screen displays the Image name and the Image Access mode.<br />
g. Click Finish.<br />
Note: This process might take a long time to complete depending on the value<br />
of the journal lag setting in the group policy of the consistency group. The<br />
following message appears during the process:<br />
Enabling log access<br />
h. Verify the target image name displayed below the bitmap in the components<br />
pane under the Status tab.<br />
Transfer:Paused status appears at the bottom in the Status tab under the<br />
components pane.<br />
Bringing a Cluster Data Group Online and Checking the Validity<br />
of the Image<br />
1. Open the Cluster Administrator window on the Management Console.<br />
2. Move the group to the node on the recovered site by right-clicking the group that<br />
you previously took offline and then clicking Move Group.<br />
• If the cluster has more than two nodes, a list of possible owner target nodes<br />
appears. Select the node to which you want to move the group.<br />
• If the cluster has only two nodes, the move starts immediately. Go to step 3.<br />
3. Bring the group online by right-clicking the group name and then clicking Bring<br />
Online.<br />
4. Ensure that the selected image is valid; that is, verify that<br />
• All applications start successfully using the selected image.<br />
• The data in the image is consistent and valid.<br />
For example, you might want to test whether you can start a database application on<br />
this image. You might also want to run proprietary test procedures to validate the<br />
data.<br />
5. If you tested the validity of the image and the test completed successfully, skip to<br />
“Reversing the Replication Direction of the Consistency Group.”<br />
6. If the image is not valid and you choose to test a different image, perform the<br />
following steps:<br />
a. To take the group offline, right-click the group name and then click Take<br />
Offline on the Cluster Administrator.<br />
b. Select one of the Consistency Groups in the navigation pane on the<br />
Management Console.<br />
c. Right-click the Consistency Group and scroll down.<br />
d. Select the Remote Copy name and click Disable Image Access.<br />
e. Click Yes when the system prompts you to ensure that all group volumes are<br />
unmounted.<br />
7. Perform the following steps if you want to choose a different image:<br />
a. Right-click the Consistency Group and select Pause Transfer. Click Yes<br />
when the system prompts that the group activity will be paused.<br />
b. Right-click the Consistency Group and scroll down.<br />
c. Select the Remote Copy name and click Enable Image Access.<br />
The Enable Image Access dialog box appears.<br />
d. Choose Select an image from the list and click Next.<br />
The Select Explicit Image dialog box appears and displays the available<br />
images.<br />
e. Select the desired image from the list and click Next.<br />
The Image Access Mode dialog box appears.<br />
f. Select the option Logged access (physical) and click Next.<br />
The Summary screen displays the Image name and the Image Access mode.<br />
g. Click Finish.<br />
Note: This process might take a long time to complete depending on the value<br />
of the journal lag setting in the group policy of the consistency group. The<br />
following message appears during the process:<br />
Enabling log access<br />
h. Verify the target image name displayed below the bitmap in the components<br />
pane under Status tab.<br />
Transfer:Paused status appears at the bottom in the Status tab under the<br />
components pane.<br />
8. To bring the cluster group online, using the Cluster Administrator, right-click the<br />
group name and then click Bring Online.<br />
9. Ensure that the selected image is valid. Verify that<br />
• All applications start successfully using the selected image.<br />
• The data in the image is consistent and valid.<br />
For example, you might want to test whether you can start a database application on<br />
this image. You might also want to run proprietary test procedures to validate the<br />
data.<br />
10. If you tested the validity of the image and the test completed successfully, skip to<br />
“Reversing the Replication Direction of the Consistency Group.”<br />
11. If the image is not valid, repeat steps 6 through 9 as necessary.<br />
Reversing the Replication Direction of the Consistency Group<br />
1. Select the Consistency Group from the navigation pane.<br />
2. Right-click the Group and select Pause Transfer. Click Yes when the system<br />
prompts that the group activity will be paused.<br />
3. Click the Status tab. The status of the transfer must display Paused.<br />
4. Click the Policy tab and expand the Advanced Settings (if they are not already expanded).<br />
5. Select Auto data (shared quorum) from the Global Cluster mode list.<br />
6. Right-click the Consistency Group and select Failover to the remote site.<br />
7. Click Yes when the system prompts you to confirm failover.<br />
6872 5688–002<br />
8. Ensure that thee<br />
Start data transfer immediately check box is s selected.<br />
The following wwarning<br />
message appears:<br />
Warning: JJournal<br />
will be erased. Do you wish to continue e?<br />
9. Click Yes to coontinue.<br />
Problem Description<br />
The following points describe the behavior of the components in this event:<br />
• When the quorum group is running on the site where the RAs failed (site 1), the<br />
cluster nodes on site 1 fail because of lost quorum reservations, and cluster nodes<br />
on site 2 attempt to arbitrate for the quorum resource.<br />
• To prevent a “split brain” scenario, the RAs assume that the other site is active<br />
when a WAN failure occurs. (A WAN failure occurs if the RAs cannot communicate<br />
with at least one RA at the other site.)<br />
• When the MSCS Reservation Manager on the surviving site (site 2) attempts the<br />
quorum arbitration request, the RA prevents access. Eventually, all cluster services<br />
stop and manual intervention is required to bring up the cluster service.<br />
Figure 4–1 illustrates this failure.<br />
Recovering in a Geographic Clustered Environment<br />
Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner)<br />
Figure 4–1. All RAs Fail on Site 1 (Site 1 Quorum Owner)<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows errors and messages similar to those for<br />
“Total Communication Failure in a Geographic Clustered Environment” in Section 7.<br />
• If you review the system event log, you find messages similar to the following<br />
examples:<br />
System Event Log for Usmv-East2 Host (Surviving Host)<br />
8/2/2008 1:35:48 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-WEST2 Reservation of<br />
cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.<br />
8/2/2008 1:35:48 PM Ftdisk Warning Disk 57 N/A USMV-WEST2 The system failed to flush data to<br />
the transaction log. Corruption may occur.<br />
System Event Log for Usmv-West2 (Failure Host)<br />
8/2/2008 1:35:48 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-WEST2 Reservation of<br />
cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.<br />
8/2/2008 1:35:48 PM Ftdisk Warning Disk 57 N/A USMV-WEST2 The system failed to flush data to<br />
the transaction log. Corruption may occur.<br />
• If you review the cluster log, you find messages similar to the following examples:<br />
Cluster Log for Usmv-East2 (Surviving Host)<br />
The cluster attempted arbitration five times before timing out. The following entries were recorded five times in the log:<br />
00000f90.00000620::2008/02/02-20:36:06.195 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170 (The requested resource is in use).<br />
00000f90.00000620::2008/02/02-20:36:06.195 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
00000f90.00000620::2008/02/02-20:36:08.210 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00000f90.00000620::2008/02/02-20:36:08.210 ERR Physical Disk : [DiskArb] Failed to write<br />
(sector 12), error 170.<br />
00000638.00000b10::2008/02/02-20:36:18.273 ERR [FM] Failed to arbitrate quorum resource c336021a-<br />
083e-4fa0-9d37-7077a590c206, error 170.<br />
00000638.00000b10::2008/02/02-20:36:18.273 ERR [RGP] Node 2: REGROUP ERROR: arbitration failed.<br />
00000638.00000b10::2008/02/02-20:36:18.273 ERR [CS] Halting this node to prevent an inconsistency<br />
within the cluster. Error status = 5892 (The membership engine requested shutdown of the cluster<br />
service on this node).<br />
00000684.000005a8::2008/02/02-20:37:53.473 ERR [JOIN] Unable to connect to any sponsor node.<br />
00000684.000005a8::2008/02/02-20:38:06.020 ERR [FM] FmGetQuorumResource failed, error 170.<br />
00000684.000005a8::2008/02/02-20:38:06.020 ERR [INIT] ClusterForm: Could not get quorum resource.<br />
No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service).<br />
00000684.000005a8::2008/02/02-20:38:06.020 ERR [INIT] Failed to form cluster, status 5086 (The<br />
quorum disk could not be located by the cluster service).<br />
Cluster Log for Usmv-West2 (Failure Host)<br />
00000d80.00000bbc::2008/02/02-20:31:21.257 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />
MM_INVALID_NODE, chose the default target<br />
00000da0.00000130::2008/02/02-20:35:48.395 ERR Physical Disk : [DiskArb]<br />
CompletionRoutine: reservation lost! Status 170 (The requested resource is in use)<br />
00000da0.00000130::2008/02/02-20:35:48.395 ERR [RM] LostQuorumResource, cluster service<br />
terminated...<br />
00000da0.00000b80::2008/02/02-20:35:49.145 ERR Network Name : Unable to open<br />
handle to cluster, status 1753 (There are no more endpoints available from the endpoint mapper).<br />
00000da0.00000c20::2008/02/02-20:35:49.145 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6 (The handle is invalid).<br />
00000a04.00000a14::2008/02/02-20:37:23.456 ERR [JOIN] Unable to connect to any sponsor node.<br />
The cluster attempted arbitration five times before timing out. The following entries were recorded five times in the log:<br />
000001e4.00000598::2008/02/02-20:37:23.799 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170 (The resource is in use).<br />
000001e4.00000598::2008/02/02-20:37:23.831 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
000001e4.00000598::2008/02/02-20:37:23.831 ERR Physical Disk : [DiskArb] BusReset<br />
completed, status 31 (A device attached to the system is not functioning).<br />
000001e4.00000598::2008/02/02-20:37:23.831 ERR Physical Disk : [DiskArb] Failed to break<br />
reservation, error 31.<br />
00000a04.00000a14::2008/02/02-20:37:25.830 ERR [FM] FmGetQuorumResource failed, error 31.<br />
00000a04.00000a14::2008/08/02-20:37:25.830 ERR [INIT] ClusterForm: Could not get quorum resource.<br />
No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service).<br />
00000a04.00000a14::2008/02/02-20:37:25.830 ERR [INIT] Failed to form cluster, status 5086 (The<br />
quorum disk could not be located by the cluster service).<br />
00000a04.00000a14::2008/02/02-20:37:25.830 ERR [CS] ClusterInitialize failed 5086<br />
00000a04.00000a14::2008/02/02-20:37:25.846 ERR [CS] Service Stopped. exit code = 5086<br />
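When you review an exported cluster log for this failure, a simple text filter can pull out the arbitration and quorum errors shown in the examples above. The following is only a sketch: the message patterns are taken from the sample entries in this guide, and the log path in the usage comment is a placeholder, not an official diagnostic tool.<br />

```shell
# Filter an exported cluster log for the quorum-arbitration failures
# illustrated above. The patterns match the sample entries in this guide.
filter_quorum_errors() {
  grep -E 'DiskArb|arbitrate quorum|FmGetQuorumResource|REGROUP ERROR' "$1"
}

# Example (path is a placeholder for your exported cluster log):
# filter_quorum_errors cluster.log
```

Entries that match these patterns correspond to the reservation-loss and arbitration failures described in this scenario.<br />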
Actions to Resolve the Problem<br />
If all RAs on site 1 fail and site 1 owns the quorum resource, perform the following tasks<br />
to recover:<br />
1. Disable MSCS on all nodes at the site with the failed RAs.<br />
2. Perform a manual failover of the quorum consistency group.<br />
3. Reverse replication direction.<br />
4. Start MSCS on a node on the surviving site.<br />
5. Complete the recovery process.<br />
Caution<br />
Manual recovery is required only if the quorum device is lost because of a<br />
failure of an RA cluster.<br />
Before you bring the remote site online and before you perform the manual<br />
recovery procedure, ensure that MSCS is stopped and disabled on the cluster<br />
nodes at the production site (site 1 in this case). You must verify the server<br />
status with a network test.<br />
Improper use of the manual recovery procedure can lead to an inconsistent<br />
quorum disk and unpredictable results that might require a long recovery<br />
process.<br />
Disabling MSCS<br />
Stop MSCS on each node at the site where the RAs failed by completing the following<br />
steps:<br />
1. In the Control Panel, point to Administrative Tools, and then click Services.<br />
2. Right-click Cluster Service and click Stop.<br />
3. Change the startup type to Disabled.<br />
4. Repeat steps 1 through 3 for each node on the site.<br />
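If you manage many nodes, the same stop-and-disable sequence can be scripted rather than clicked through in Services. The sketch below is a dry run that only prints the commands for review; the node names and the psexec remote-execution wrapper are assumptions for illustration, and ClusSvc is the Cluster Service's service name.<br />

```shell
# Dry run: print the commands that stop and disable the Cluster Service
# (ClusSvc) on each node at the failed-RA site. NODES and the psexec
# wrapper are placeholders for your environment; nothing is executed here.
NODES="usmv-west1 usmv-west2"
CMDS=""
for node in $NODES; do
  CMDS="$CMDS
psexec \\\\$node net stop clussvc
psexec \\\\$node sc config clussvc start= disabled"
done
printf '%s\n' "$CMDS"
```

Review the printed commands, then run them against each node at the site before continuing with the manual failover.<br />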
Performing a Manual Failover of the Quorum Consistency Group<br />
1. Connect to the Management Console by opening a browser to the management IP<br />
address of the surviving site. The management console can be accessed only by the<br />
site with a functional RA cluster because the WAN is down.<br />
2. Click the Quorum Consistency Group (that is, the consistency group that holds<br />
the quorum drive) in the navigation pane.<br />
3. Click the Policy tab.<br />
4. Under Advanced, select Manual (shared quorum) in the Global cluster<br />
mode list, and click Apply.<br />
5. Right-click the Quorum Consistency Group and then select Pause Transfer.<br />
Click Yes when the system prompts that the group activity will be stopped.<br />
6. Perform the following steps to allow access to the target image:<br />
a. Right-click the Consistency Group and scroll down.<br />
b. Select the Remote Copy name and click Enable Image Access.<br />
The Enable Image Access dialog box appears.<br />
c. Choose Select an image from the list and click Next.<br />
The Select Explicit Image dialog box displays the available images.<br />
d. Select the desired image from the list and then click Next.<br />
The Image Access Mode dialog box appears.<br />
e. Select Logged access (physical) and click Next.<br />
The Summary screen shows the Image name and the Image Access mode.<br />
f. Click Finish.<br />
Note: This process might take a long time to complete depending on the value<br />
of the journal lag setting in the group policy of the consistency group.<br />
g. Verify the target image name displayed below the bitmap in the components<br />
pane under the Status tab.<br />
The Transfer: Paused status appears under the bitmap in the Status tab of the<br />
components pane.<br />
Reversing Replication Direction<br />
1. Select the Quorum Consistency Group in the navigation pane.<br />
2. Right-click the Group and select Pause Transfer. Click Yes when the system<br />
prompts that the group activity will be paused.<br />
3. Click the Status tab. The status of the transfer must show Paused.<br />
4. Right-click the Consistency Group and select Failover to &lt;site name&gt;.<br />
5. Click Yes when the system prompts to confirm failover.<br />
6. Ensure that the Start data transfer immediately check box is selected.<br />
The following warning message appears:<br />
Warning: Journal will be erased. Do you wish to continue?<br />
7. Click Yes to continue.<br />
Starting MSCS<br />
MSCS should start within 1 minute on the surviving nodes when the MSCS recovery<br />
setting is enabled. You can manually start MSCS on each node of the surviving site by<br />
completing the following steps:<br />
1. In the Control Panel, point to Administrative Tools, and then click Services.<br />
2. Right-click Cluster Service, and click Start.<br />
MSCS starts the cluster group and automatically moves all groups to the first-started<br />
cluster node.<br />
3. Repeat steps 1 through 2 for each node on the site.<br />
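The start sequence above can also be expressed as a small dry-run helper that prints one start command per surviving-site node. The node names and the psexec wrapper are assumptions; ClusSvc is the MSCS service name.<br />

```shell
# Dry run: emit the command that starts the Cluster Service on each node.
# Nothing is executed; review the output, then run the commands yourself.
start_cmds() {
  for node in "$@"; do
    printf 'psexec \\\\%s net start clussvc\n' "$node"
  done
}

start_cmds usmv-east1 usmv-east2
```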
Completing the Recovery Process<br />
To complete the recovery process, you must restore the global cluster mode property<br />
and start MSCS.<br />
• Restoring the Global Cluster Mode Property for the Quorum Group<br />
Once the primary site is operational and you have verified that all nodes at both sites<br />
are online in the cluster, restore the failover settings by performing the following<br />
steps:<br />
1. Click the Quorum Consistency Group (that is, the consistency group that<br />
holds the quorum device) from the navigation pane.<br />
2. Click the Policy tab.<br />
3. Under Advanced, select Auto-quorum (shared quorum) in the Global<br />
cluster mode list.<br />
4. Click Apply.<br />
5. Click Yes when the system prompts that the group activity will be stopped.<br />
• Enabling MSCS<br />
Enable and start MSCS on each node at the site where the RAs failed by completing<br />
the following steps:<br />
1. In the Control Panel, point to Administrative Tools, and then click<br />
Services.<br />
2. Right-click Cluster Service and click Properties.<br />
3. Change the startup type to Automatic.<br />
4. Click Start.<br />
5. Repeat steps 1 through 4 for each node on the site.<br />
6. Open the Cluster Administrator and move the groups to the preferred node.<br />
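Steps 1 through 5 can likewise be scripted as a dry run that prints, per node, the commands to restore Automatic startup and start the service. The node names and the psexec wrapper are placeholder assumptions for illustration.<br />

```shell
# Dry run: emit the commands that re-enable (Automatic startup) and start
# the Cluster Service on each node at the recovered site. Nothing runs here.
enable_cmds() {
  for node in "$@"; do
    printf 'psexec \\\\%s sc config clussvc start= auto\n' "$node"
    printf 'psexec \\\\%s net start clussvc\n' "$node"
  done
}

enable_cmds usmv-west1 usmv-west2
```

After the services are running, moving groups to their preferred nodes is still done in Cluster Administrator, as in step 6.<br />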
Recovery When All RAs Fail on Site 1 (Site 2 Quorum Owner)<br />
Problem Description<br />
If the quorum group is running on site 2 and the RAs fail on site 1, all cluster nodes<br />
remain in a running state. All consistency groups remain at the respective sites because<br />
all disk accesses are successful. In this case, because data is stored on the replication<br />
volumes—but the corresponding marking information is not written to the repository<br />
volume—a full-sweep resynchronization is required following recovery.<br />
An exception is if the consistency group option “Allow application to run even when<br />
Unisys SafeGuard Solutions cannot mark data” was selected. The splitter prevents<br />
access to disks when the RAs are not available to write marking data to the repository<br />
volume, and I/Os fail.<br />
Figure 4–2 illustrates this failure.<br />
Figure 4–2. All RAs Fail on Site 1 (Site 2 Quorum Owner)<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows errors and messages similar to those for<br />
“Total Communication Failure in a Geographic Clustered Environment” in Section 7.<br />
• If you review the system event log, you find messages similar to the following<br />
examples:<br />
System Event Log for Usmv-East2 Host (Surviving Site—Site 2)<br />
8/2/2008 3:09:27 PM ClusSvc Information Failover Mgr 1204 N/A USMV-EAST2 "The Cluster<br />
Service brought the Resource Group ""Group 0"" offline."<br />
8/2/2008 3:09:47 PM ClusSvc Error Failover Mgr 1069 N/A USMV-WEST2 Cluster resource 'Data1' in<br />
Resource Group 'Group 0' failed.<br />
8/2/2008 3:10:29 PM ClusSvc Information Failover Mgr 1153 N/A USMV-WEST2 Cluster service is<br />
attempting to failover the Cluster Resource Group 'Group 0' from node USMV-WEST2 to node USMV-<br />
EAST2.<br />
8/2/2008 3:10:53 PM ClusSvc Information Failover Mgr 1201 N/A USMV-EAST2 "The Cluster<br />
Service brought the Resource Group ""Group 0"" online."<br />
System Event Log for Usmv-West2 Host (Failure Site—Site 1)<br />
8/2/2008 3:09:27 PM ClusSvc Information Failover Mgr 1204 N/A USMV-EAST2 "The Cluster<br />
Service brought the Resource Group ""Group 0"" offline."<br />
8/2/2008 3:09:47 PM ClusSvc Error Failover Mgr 1069 N/A USMV-WEST2 Cluster resource 'Data1' in<br />
Resource Group 'Group 0' failed.<br />
8/2/2008 3:10:29 PM ClusSvc Information Failover Mgr 1153 N/A USMV-WEST2 Cluster service is<br />
attempting to failover the Cluster Resource Group 'Group 0' from node USMV-WEST2 to node USMV-<br />
EAST2.<br />
8/2/2008 3:10:53 PM ClusSvc Information Failover Mgr 1201 N/A USMV-EAST2 "The Cluster<br />
Service brought the Resource Group ""Group 0"" online."<br />
• If you review the cluster log, you find messages similar to the following examples:<br />
Cluster Log for Surviving Site (Site 2)<br />
000005a0.00000fdc::2008/02/02-21:57:33.543 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />
MM_INVALID_NODE, chose the default target<br />
00000ec8.000008b4::2008/02/02-22:09:03.139 ERR Unisys SafeGuard 30m Control : KfLogit:<br />
Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the<br />
Management Console that the WAN connection is operational.<br />
00000ec8.00000f48::2008/02/02-22:10:39.715 ERR Unisys SafeGuard 30m Control : KfLogit:<br />
Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the<br />
Management Console that the WAN connection is operational.<br />
Cluster Log for Failure Site (Site 1)<br />
0000033c.000008e4::2008/02/02-22:09:47.159 ERR Unisys SafeGuard 30m Control :<br />
KfGetKboxData: get_system_settings command failed. Error: (2685470674).<br />
0000033c.000008e4::2008/02/02-22:09:47.159 ERR Unisys SafeGuard 30m Control :<br />
UrcfKConGroupOnlineThread: Error 1117 bringing resource online. (Error 1117: the request could not be<br />
performed because of an I/O device error).<br />
0000033c.00000b8c::2008/02/02-22:10:08.168 ERR Unisys SafeGuard 30m Control :<br />
KfGetKboxData: get_version command failed. Error: (2685470674).<br />
0000033c.00000b8c::2008/02/02-22:10:29.146 ERR Unisys SafeGuard 30m Control :<br />
KfGetKboxData: get_system_settings command failed. Error: (2685470674).<br />
0000033c.00000b8c::2008/02/02-22:10:29.146 ERR Unisys SafeGuard 30m Control :<br />
UrcfKConGroupOnlineThread: Error 1117 bringing resource online. (Error 1117: the request could not be<br />
performed because of an I/O device error).<br />
Actions to Resolve the Problem<br />
If all RAs on site 1 fail and site 2 owns the quorum resource, you do not need to perform<br />
manual recovery. Because the surviving site owns the quorum consistency group, MSCS<br />
automatically restarts, and the data consistency group fails over on the surviving site.<br />
Recovery When All RAs and All Servers Fail on One Site<br />
The following two cases describe an event in which a complete site fails (for example,<br />
site 1) and all data I/O, cluster node communication, disk reservations, and so forth, stop<br />
responding. MSCS nodes on site 2 detect a network heartbeat loss and loss of disk<br />
reservations, and try to take over the cluster groups that had been running on the nodes<br />
that failed.<br />
There are two cases for recovering from this failure based on which site owns the<br />
quorum group:<br />
• The RAs and servers fail on site 1 and that site owns the quorum group.<br />
• The RAs and servers fail on site 1 and site 2 owns the quorum group.<br />
Manual recovery of MSCS is required as described in the following topic, “Site 1 Failure<br />
(Site 1 Quorum Owner).”<br />
If the site can recover in an acceptable amount of time and the quorum owner does not<br />
reside on the failed site, manual recovery should not be performed.<br />
The two cases that follow respond differently and are solved differently based on where<br />
the quorum owner resides.<br />
Site 1 Failure (Site 1 Quorum Owner)<br />
Problem Description<br />
In the first failure case, all nodes at site 1 fail as well as the RAs. Thus, the RAs must fail<br />
quorum arbitration attempts initiated by nodes on the surviving site. Because the RAs on<br />
the surviving site (site 2) are not able to communicate over the communication<br />
networks, the RAs assume that it is a WAN network failure and do not allow automatic<br />
failover of cluster resources.<br />
MSCS attempts to fail over to a node at site 2. Because the quorum resource was<br />
owned by site 1, site 2 must be brought up using the manual quorum recovery<br />
procedure.<br />
Figure 4–3 illustrates this case.<br />
Figure 4–3. All RAs and Servers Fail on Site 1 (Site 1 Quorum Owner)<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows errors and messages similar to those for<br />
“Total Communication Failure in a Geographic Clustered Environment” in Section 7.<br />
• If you review the system event log, you find messages similar to the following<br />
examples:<br />
System Event Log for Usmv-East2 Host (Failure Site)<br />
8/3/2008 10:46:01 AM ClusSvc Error Startup/Shutdown 1073 N/A USMV-EAST2 Cluster service<br />
was halted to prevent an inconsistency within the server cluster. The error code was 5892 (The<br />
membership engine requested shutdown of the cluster service on this node).<br />
8/3/2008 10:46:00 AM ClusSvc Error Membership Mgr 1177 N/A USMV-EAST2 Cluster service is<br />
shutting down because the membership engine failed to arbitrate for the quorum device. This could be<br />
due to the loss of network connectivity with the current quorum owner. Check your physical network<br />
infrastructure to ensure that communication between this node and all other nodes in the server cluster is<br />
intact.<br />
8/3/2008 10:47:40 AM ClusSvc Error Startup/Shutdown 1009 N/A USMV-EAST2 Cluster service<br />
could not join an existing server cluster and could not form a new server cluster. Cluster service has<br />
terminated.<br />
8/3/2008 10:50:16 AM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting<br />
a bus reset for device \Device\ClusDisk0.<br />
• If you review the cluster log, you find messages similar to the following examples:<br />
Cluster Log for Surviving Site (Site 2)<br />
00000c54.000008f4::2008/02/02-17:13:31.901 ERR [NMJOIN] Unable to begin join, status 1717 (the NIC<br />
interface is unknown).<br />
00000c54.000008f4::2008/02/02-17:13:31.901 ERR [CS] ClusterInitialize failed 1717<br />
00000c54.000008f4::2008/02/02-17:13:31.917 ERR [CS] Service Stopped. exit code = 1717<br />
00000be0.000008e0::2008/02/02-17:14:53.686 ERR [JOIN] Unable to connect to any sponsor node.<br />
00000be0.000008e0::2008/02/02-17:14:56.374 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />
MM_INVALID_NODE, chose the default target<br />
000001e0.00000bac::2008/02/02-17:16:37.563 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6.<br />
00000e8c.00000ea8::2008/02/02-17:30:20.275 ERR Physical Disk : [DiskArb] Signature of disk<br />
has changed or failed to find disk with id, old signature 0xe1e7208e new signature 0xe1e7208e, status 2<br />
(the system cannot find the file specified).<br />
00000e8c.00000ea8::2008/02/02-17:30:20.289 ERR Physical Disk : SCSI: Attach, error<br />
attaching to signature e1e7208e, error 2.<br />
000008e8.000008fc::2008/02/02-17:30:20.289 ERR [FM] FmGetQuorumResource failed, error 2.<br />
000008e8.000008fc::2008/02/02-17:30:20.289 ERR [INIT] ClusterForm: Could not get quorum resource.<br />
No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service).<br />
000008e8.000008fc::2008/02/0-17:30:20.289 ERR [INIT] Failed to form cluster, status 5086.<br />
000008e8.000008fc::2008/02/02-17:30:20.289 ERR [CS] ClusterInitialize failed 5086<br />
000008e8.000008fc::2008/02/02-17:30:20.360 ERR [CS] Service Stopped. exit code = 5086<br />
00000710.00000e80::2008/02/02-17:55:02.092 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />
MM_INVALID_NODE, chose the default target<br />
000009cc.00000884::2008/02/02-17:55:12.413 ERR Unisys SafeGuard 30m Control : KfLogit:<br />
Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the<br />
Management Console that the WAN connection is operational.<br />
Cluster Log for Failure Site (Site 1)<br />
00000dc8.00000c48::2008/02/02-17:12:53.942 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 2.<br />
00000dc8.00000c48::2008/02/02-17:12:53.942 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 2.<br />
00000dc8.00000c48::2008/02/02-17:12:55.942 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 2.<br />
00000dc8.00000c48::2008/02/02-17:12:55.942 ERR Physical Disk : [DiskArb] Failed to write<br />
(sector 12), error 2.<br />
00000fe4.00000810::2008/02/02-17:13:20.030 ERR [FM] Failed to arbitrate quorum resource c336021a-<br />
083e-4fa0-9d37-7077a590c206, error 2.<br />
00000fe4.00000810::2008/02/02-17:13:20.030 ERR [RGP] Node 1: REGROUP ERROR: arbitration failed.<br />
00000fe4.00000810::2008/02/02-17:13:20.030 ERR [NM] Halting this node due to membership or<br />
communications error. Halt code = 1000<br />
00000fe4.00000810::2008/02/02-17:13:20.030 ERR [CS] Halting this node to prevent an inconsistency<br />
within the cluster. Error status = 5892 (The membership engine requested shutdown of the cluster<br />
service on this node).<br />
00000dc8.00000f34::2008/02/02-17:13:20.670 ERR Unisys SafeGuard 30m Control : KfLogit:<br />
Online resource failed. Pending processing terminated by resource monitor.<br />
00000dc8.00000f34::2008/02/02-17:13:20.670 ERR Unisys SafeGuard 30m Control :<br />
UrcfKConGroupOnlineThread: Error 1117 bringing resource online.<br />
000009e4::2008/02/02-17:29:20.587 ERR [FM] FmGetQuorumResource failed, error 2.<br />
000008e4.000009e4::2008/02/02-17:29:20.587 ERR [INIT] ClusterForm: Could not get quorum resource.<br />
No fixup attempted. Status = 5086<br />
000008e4.000009e4::2008/02/02-17:29:20.587 ERR [INIT] Failed to form cluster, status 5086.<br />
000008e4.000009e4::2008/02/02-17:29:20.587 ERR [CS] ClusterInitialize failed 5086<br />
000008e4.000009e4::2008/02/02-17:29:20.602 ERR [CS] Service Stopped. exit code = 5086<br />
000005b4.000008cc::2008/02/02-17:31:11.075 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />
MM_INVALID_NODE, chose the default target<br />
00000ff4.000008d8::2008/02/02-17:31:19.901 ERR Unisys SafeGuard 30m Control : KfLogit:<br />
Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the<br />
Management Console that the WAN connection is operational.<br />
Actions to Resolve the Problem<br />
If all RAs and servers on site 1 fail and site 1 owns the quorum resource, perform the<br />
following tasks to recover:<br />
1. Perform a manual failover of the quorum consistency group.<br />
2. Reverse replication direction.<br />
3. Start MSCS.<br />
4. Power on the site if a power failure occurred.<br />
5. Restore the failover settings.<br />
Note: Do not bring up any nodes until the manual recovery process is complete.<br />
Caution<br />
Manual recovery is required only if the quorum device is lost because of a<br />
failure of an RA cluster.<br />
If the cluster nodes at the production site are operational, you must disable<br />
MSCS. You must verify the server status with a network test or attempt to<br />
log in to the server. Use the procedure in “Recovery When All RAs Fail on<br />
Site 1 (Site 1 Quorum Owner).”<br />
Improper use of the manual recovery procedure can lead to an inconsistent<br />
quorum disk and unpredictable results that might require a long recovery<br />
process.<br />
Performing a Manual Failover of the Quorum Consistency Group<br />
To perform a manual failover of the quorum consistency group, follow the procedure<br />
given in the “Actions to Resolve the Problem” for “Recovery When All RAs Fail on Site 1<br />
(Site 1 Quorum Owner)” earlier in this section.<br />
Reversing Replication Direction<br />
1. Select the Consistency Group from the navigation pane.<br />
2. Right-click the Group and select Pause Transfer. Click Yes when the system<br />
prompts that the group activity will be paused.<br />
3. Click the Status tab. The status of the transfer must display Paused.<br />
4. Right-click the Consistency Group and select Failover to &lt;site name&gt;.<br />
5. Click Yes when the system prompts to confirm failover.<br />
6. Ensure that the Start data transfer immediately check box is selected.<br />
The following warning message appears:<br />
Warning: Journal will be erased. Do you wish to continue?<br />
7. Click Yes to continue.<br />
Starting MSCS<br />
MSCS should start within 1 minute on the surviving nodes when the MSCS recovery<br />
setting is enabled. You can manually start MSCS on each node of the surviving site by<br />
completing the following steps:<br />
1. In the Control Panel, point to Administrative Tools, and then click<br />
Services.<br />
2. Right-click Cluster Service, and click Start.<br />
MSCS starts the cluster group and automatically moves all groups to the<br />
first-started cluster node.<br />
3. Repeat steps 1 through 2 for each node on the site.<br />
Powering-on a Site<br />
If a site experienced a power failure, power on the site in the following order:<br />
• Switches<br />
• Storage<br />
Note: Wait until all switches and storage units are initialized before continuing to<br />
power on the site.<br />
• RAs<br />
Note: Wait 10 minutes after you power on the RAs before you power on the hosts.<br />
• Hosts<br />
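The ordering above can be captured as a small checklist script. This is only a sketch: power_on and wait_ready are stubs you would replace with your site's power-control and health-check tooling, and the 10-minute RA wait from the note is encoded as a constant.<br />

```shell
# Power-on checklist enforcing the order: switches, storage, RAs, hosts.
# power_on and wait_ready are stubs; replace them with real commands.
RA_WAIT_SECS=600   # the note above calls for 10 minutes after RA power-on

power_on()   { echo "power on: $1"; }
wait_ready() { echo "waiting for: $1"; }

power_on switches
power_on storage
wait_ready "switch and storage initialization"
power_on RAs
wait_ready "${RA_WAIT_SECS} seconds after RA power-on"
power_on hosts
```

Encoding the sequence this way makes it harder to power on hosts before the RAs have finished initializing.<br />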
Restoring the Global Cluster Mode Property for the Quorum Group<br />
Once the primary site is again operational and you have verified that all nodes at both<br />
sites are online in the cluster, restore the failover settings by completing the following<br />
steps:<br />
1. Click the Quorum Consistency Group (that is, the consistency group that holds<br />
the quorum drive) from the navigation pane.<br />
2. Click the Policy tab.<br />
3. Under Advanced, select Auto-quorum (shared quorum) in the Global<br />
cluster mode list.<br />
4. Ensure that the Allow Regulation check box is selected.<br />
5. Click Apply.<br />
Site 1 Failure (Site 2 Quorum Owner)<br />
Problem Description<br />
If the quorum group is running on site 2 and a complete site failure occurs on site 1, a<br />
quorum failover is not required. Only data groups on the failed site will require failover.<br />
All data that is not mirrored and was in the failed RA cache is lost; the latest image on<br />
the remote site is used to recover. Cluster services will be up on all nodes on site 2, and<br />
cluster nodes will fail on site 1. You cannot move a group to nodes on a site where the<br />
RAs are down (site 1).<br />
MSCS attempts to fail over to a node at site 2. An e-mail alert is sent stating that a site<br />
or RA cluster has failed.<br />
Figure 4–4 illustrates this case.<br />
Figure 4–4. All RAs and Servers Fail on Site 1 (Site 2 Quorum Owner)<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows errors and messages similar to those for<br />
“Total Communication Failure in a Geographic Clustered Environment” in Section 7.<br />
• If you review the system event log, you find messages similar to the following<br />
examples:<br />
System Event Log for Usmv-West2 (Failure Site)<br />
8/3/2006 1:49:26 PM ClusSvc Information Failover Mgr 1205 N/A USMV-WEST2 "The Cluster<br />
Service failed to bring the Resource Group ""Cluster Group"" completely online or offline."<br />
8/3/2008 1:49:26 PM ClusSvc Information Failover Mgr 1203 N/A USMV-WEST2 "The Cluster<br />
Service is attempting to offline the Resource Group ""Cluster Group""."<br />
8/3/2008 1:50:46 PM ClusDisk Error None 1209 N/A USMV-WEST2 Cluster service is requesting a<br />
bus reset for device \Device\ClusDisk0.<br />
• If you review the cluster log, you find messages similar to the following examples:<br />
Cluster Log for Failure Site (Site 1)<br />
00000e50.00000c10::2008/02/02-20:50:46.165 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170 (the requested resource is in use).<br />
00000e50.00000c10::2008/02/02-20:50:46.165 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
00000e50.00000fb4::2008/02/02-20:52:05.133 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6 (the handle is invalid).<br />
Cluster Log for Surviving Site (Site 2)<br />
00000178.00000dd8::2008/02/02-20:49:30.976 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00000178.00000dd8::2008/02/02-20:49:30.992 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
00000d80.00000e68::2008/02/02-20:49:57.679 ERR [GUM] GumSendUpdate: GumQueueLocking update<br />
to node 1 failed with 1818 (The remote procedure call was cancelled).<br />
00000d80.00000e68::2008/02/02-20:49:57.679 ERR [GUM] GumpCommFailure 1818 communicating<br />
with node 1<br />
00000178.00000810::2008/02/02-20:50:45.492 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6 (The handle is invalid).<br />
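When several nodes are involved, scanning exported logs for the event IDs above is faster than reading them by hand. The following is a minimal sketch, assuming logs exported as plain text; the event IDs (1205, 1209) and descriptions are taken from the sample entries above and may differ on other releases.

```python
import re

# Event signatures drawn from the sample system event log entries above.
SIGNATURES = {
    1205: "Cluster Service failed to bring a Resource Group online or offline",
    1209: "Cluster service is requesting a bus reset",
}

def scan_event_log(lines):
    """Return (event_id, line) pairs for entries that mention a known event ID."""
    hits = []
    for line in lines:
        for event_id in SIGNATURES:
            # Match the numeric event ID as a standalone token.
            if re.search(r"\b%d\b" % event_id, line):
                hits.append((event_id, line.strip()))
    return hits

sample = [
    '8/3/2008 1:49:26 PM ClusSvc Information Failover Mgr 1205 N/A USMV-WEST2',
    '8/3/2008 1:50:46 PM ClusDisk Error None 1209 N/A USMV-WEST2',
]
for event_id, line in scan_event_log(sample):
    print(event_id, "-", SIGNATURES[event_id])
```

Extend `SIGNATURES` with any additional event IDs you care about; the word-boundary match avoids false hits inside timestamps or node names.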
Actions to Resolve the Problem<br />
If all RAs and all servers on site 1 fail and site 2 owns the quorum resource, you do not<br />
need to perform manual recovery. Because the surviving site owns the quorum<br />
resource, MSCS restarts automatically, and the data consistency groups fail over to the<br />
surviving site.<br />
Section 5<br />
Solving Storage Problems<br />
This section lists symptoms that usually indicate problems with storage. Table 5–1 lists<br />
symptoms and possible problems indicated by the symptom. The problems and their<br />
solutions are described in this section. The graphics, behaviors, and examples in this<br />
section are similar to what you observe with your system but might differ in some<br />
details.<br />
In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />
for possible problems. Also, messages similar to e-mail notifications might be displayed<br />
on the management console. If you do not see the messages, they might have already<br />
dropped off the display. Review the management console logs for messages that have<br />
dropped off the display.<br />
Table 5–1. Possible Storage Problems with Symptoms<br />
Possible problem: User or replication volume not accessible<br />
Symptoms:<br />
• The system pauses the transfer for the relevant consistency group.<br />
• The server cannot access this volume; writes to this volume fail; the file system<br />
cannot be mounted; and so forth.<br />
• The management console shows an error for all connections to this volume—that is,<br />
all RAs on the relevant site and all splitters attached to this volume.<br />
Possible problem: Repository volume not accessible<br />
Symptoms:<br />
• The system pauses the transfer for all consistency groups.<br />
• The management console shows an error for all connections to this volume—that is,<br />
all RAs on the relevant site and all splitters attached to this volume.<br />
• The event log reports that the repository volume is inaccessible.<br />
• The event log indicates that the repository volume is corrupted.<br />
Possible problem: Journal not accessible<br />
Symptoms:<br />
• The management console shows an error for the connections between this volume<br />
and all RAs on the relevant site.<br />
• The system pauses the transfer for the relevant consistency group.<br />
• The event log indicates that the journal was lost or corrupted.<br />
Possible problem: Total storage loss in a geographic replicated environment<br />
Symptoms:<br />
• No volumes from the relevant target and worldwide name (WWN) are accessible to<br />
any initiator on the SAN.<br />
Possible problem: Storage failure on one site with quorum owner on failed site in a<br />
geographic clustered environment<br />
Symptoms:<br />
• The cluster regroup process begins and the quorum device fails over to a site<br />
without failed storage.<br />
• The management console shows a storage error and replication has stopped.<br />
• Servers report multipath software errors.<br />
Possible problem: Storage failure on one site with quorum owner on surviving site in a<br />
geographic clustered environment<br />
Symptoms:<br />
• Applications that depend on physical disk resources go offline and fail when<br />
attempting to come online.<br />
• Once resource retry threshold parameters are reached, site 1 fails over to site 2.<br />
With the default settings, this timing is about 30 minutes.<br />
Table 5–2 lists specific storage volume failures and the types of errors and indicators on<br />
the management console that distinguish each failure.<br />
Table 5–2. Indicators and Management Console Errors to<br />
Distinguish Different Storage Volume Failures<br />
Failure: Data volume lost or failed<br />
• Groups Paused status: relevant data group<br />
• System status: storage error<br />
• Volumes tab: replication volume with error status<br />
• Logs tab: Error 3012<br />
Failure: Journal volume lost, failed, or corrupt<br />
• Groups Paused status: relevant data group<br />
• System status: storage error<br />
• Volumes tab: journal volume with error status<br />
• Logs tab: Error 3012<br />
Failure: Repository volume lost, failed, or corrupt<br />
• Groups Paused status: all<br />
• System status: storage and RA error/failure<br />
• Volumes tab: repository volume with error status<br />
• Logs tab: Error 3014<br />
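Table 5–2 is effectively a decision table, so its cells can be written down as a small lookup for cross-checking observed indicators. This is only a sketch; the keys and strings below are shorthand for the table's cells, not product identifiers.

```python
# Indicator sets per failure, transcribed from Table 5-2.
FAILURES = {
    "data volume lost or failed": {
        "groups_paused": "relevant data group", "log_event": 3012},
    "journal volume lost, failed, or corrupt": {
        "groups_paused": "relevant data group", "log_event": 3012},
    "repository volume lost, failed, or corrupt": {
        "groups_paused": "all", "log_event": 3014},
}

def candidate_failures(log_event, groups_paused):
    """Narrow down which volume failure matches the observed indicators."""
    return [
        name for name, ind in FAILURES.items()
        if ind["log_event"] == log_event and ind["groups_paused"] == groups_paused
    ]

print(candidate_failures(3014, "all"))  # repository volume failure
```

Note that event 3012 with only the relevant data group paused still leaves two candidates (data versus journal volume); the Volumes tab error status is what separates them.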
User or Replication Volume Not Accessible<br />
Problem Description<br />
The replication volume is not accessible to any host or splitter.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console shows an error for storage and the Volumes tab (status<br />
column) shows additional errors (See Figure 5–1).<br />
Figure 5–1. Volumes Tab Showing Volume Connection Errors<br />
• Warnings and informational messages similar to those shown in Figure 5–2 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 5–2. Management Console Messages for the User Volume Not Accessible<br />
Problem<br />
The following table explains the numbered messages in Figure 5–2.<br />
Reference No. 1, Event ID 4003: Group capabilities problem, with the details<br />
showing that the RA is unable to access the volume. E-mail: immediate.<br />
Reference No. 2, Event ID 3012: The RA is unable to access the volume.<br />
E-mail: daily summary.<br />
• The Groups tab on the management console shows that the system paused the<br />
transfer for the relevant consistency group. (See Figure 5–3.)<br />
Figure 5–3. Groups Tab Shows “Paused by System”<br />
• The server cannot access this volume; writes to this volume fail; the file system<br />
cannot be mounted; and so forth.<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Determine whether other volumes from the same storage device are accessible to<br />
the same RAs to rule out a total storage loss. If no volumes are seen by an RA, refer<br />
to “Total Storage Loss in a Geographic Replicated Environment.”<br />
• Verify that this LUN still exists and has not failed or been removed from the storage<br />
device.<br />
• Verify that the LUN is masked to the proper splitter or splitters and RAs.<br />
• Verify that other servers in the SAN do not use this volume. For example, if an<br />
MSCS cluster in the SAN acquired ownership of this volume, it might reserve the<br />
volume and block other initiators from seeing the volume.<br />
• Verify that the volume has read and write permissions on the storage system.<br />
• Verify that the volume, as configured in the management console, has the expected<br />
WWN and LUN.<br />
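The last check, comparing the WWN and LUN configured in the management console against what the storage device actually presents, can be sketched as a simple diff. The volume name, WWN, and LUN values below are illustrative only, not taken from a real configuration.

```python
def mismatches(configured, discovered):
    """Report volumes whose configured (WWN, LUN) differs from what storage presents."""
    bad = {}
    for name, expected in configured.items():
        actual = discovered.get(name)  # None if storage no longer presents it
        if actual != expected:
            bad[name] = {"expected": expected, "actual": actual}
    return bad

# Hypothetical console view versus the view discovered from the storage device.
configured = {"data1": ("50:06:01:60:88:a0:3f:41", 12)}
discovered = {"data1": ("50:06:01:60:88:a0:3f:41", 12)}
print(mismatches(configured, discovered))  # {} means the console matches storage
```

An empty result means the configured view matches storage; a `None` actual value flags a LUN that has disappeared from the device entirely.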
Repository Volume Not Accessible<br />
Problem Description<br />
The repository volume is not accessible to any SAN-attached initiator, including the<br />
splitter and RAs.<br />
Or, the repository volume is corrupted, either by another initiator because of storage<br />
changes or as a result of storage failure. You must reformat the repository volume<br />
before replication can proceed normally.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console shows an error for all connections to this volume—that is,<br />
all RAs on the relevant site and all splitters attached to this volume. The RAs tab on<br />
the management console shows errors for the volume. (See Figure 5–4.)<br />
The following error messages appear for the RAs error condition when you click<br />
Details:<br />
Error: RA 1 in Sydney can't access repository volume<br />
Error: RA 2 in Sydney can't access repository volume<br />
The following error message appears for the storage error condition, when you click<br />
Details:<br />
Error: Repository volume can't be accessed by any RAs<br />
Figure 5–4. Management Console Display: Storage Error and RAs Tab Shows<br />
Volume Errors<br />
• The Volumes tab on the management console shows an error for the repository<br />
volume, as shown in Figure 5–5.<br />
Figure 5–5. Volumes Tab Shows Error for Repository Volume<br />
• The Groups tab on the management console shows that the system paused the<br />
transfer for all consistency groups, as shown in Figure 5–6.<br />
Figure 5–6. Groups Tab Shows All Groups Paused by System<br />
• The Logs tab on the management console lists a message for event ID 3014. This<br />
message indicates that the RA is unable to access the repository volume or the<br />
repository volume is corrupted. (See Figure 5–7.)<br />
Figure 5–7. Management Console Messages for the Repository Volume Not<br />
Accessible Problem<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Determine whether other volumes from the same storage device are accessible to<br />
the same RAs to rule out a total storage loss. If no volumes are seen by an RA, refer<br />
to “Total Storage Loss in a Geographic Replicated Environment.”<br />
• Verify that this LUN still exists and has not failed or been removed from the storage<br />
device.<br />
• Verify that the LUN is masked to the proper splitter or splitters and RAs.<br />
• Verify that other servers in the SAN do not use this volume. For example, if an<br />
MSCS cluster in the SAN acquired ownership of this volume, it might reserve the<br />
volume and block other initiators from seeing the volume.<br />
• Verify that the volume has read and write permissions on the storage system.<br />
• Verify that the volume, as configured in the management console, has the expected<br />
WWN and LUN.<br />
• If the volume is corrupted or you determine that it must be reformatted, perform the<br />
steps in “Reformatting the Repository Volume.”<br />
Reformatting the Repository Volume<br />
Before you begin the reformatting process in a geographic clustered environment, be<br />
sure that all groups are located at the site for which the repository volume is not to be<br />
formatted.<br />
On RA 1 at the site for which the repository volume is to be formatted, determine from<br />
the Site Planning Guide which LUN is used for the repository volume. If the LUN is not<br />
recorded for the repository volume, a list is presented during the volume formatting<br />
process that shows LUNs and the previously used repository volume is identified.<br />
Perform the following steps to reformat a repository volume for a particular site:<br />
1. Click the Data Group in the Management Console, and perform the following<br />
steps:<br />
a. Click Policy in the right pane and change the Global Cluster mode<br />
selection to Manual.<br />
b. Click Apply.<br />
c. Right-click the Data Group and select Disable Group.<br />
d. Click Yes when the system prompts that the copy activities will be stopped.<br />
2. Skip to step 6 for geographic replication environments.<br />
3. Perform the following steps for geographic clustered environments:<br />
a. Open the Group Policy window for the quorum group.<br />
b. Change the Global Cluster mode selection to Manual.<br />
c. Click Apply.<br />
4. Right-click the Consistency Group and select Disable Group.<br />
5. Click Yes when the system prompts that the copy activities will be stopped.<br />
6. Select the Splitters tab.<br />
a. Open the Splitter Properties window for the splitter.<br />
b. Select all the attached volumes.<br />
c. Click Detach and then click Apply.<br />
d. Click OK to close the window.<br />
e. Delete the splitter at the site for which the repository volume is to be<br />
reformatted.<br />
7. Open the PuTTY session on RA1 for the site.<br />
a. Log on with boxmgmt as the User ID and boxmgmt as the password.<br />
The Main menu is displayed.<br />
b. At the prompt, type 2 (Setup) and press Enter.<br />
c. On the Setup menu, type 2 (Configure repository volume) and press Enter.<br />
d. Type 1 (Format repository volume) and press Enter.<br />
e. Enter the appropriate number from the list to select the LUN. Ensure that<br />
the WWN and LUN are for the volume that you want to format. The LUN<br />
and identifier are displayed.<br />
f. Confirm the volume to format.<br />
All data is removed from the volume.<br />
g. Verify that the operation succeeds and press Enter.<br />
h. On the Main Menu, type Q (quit) and press Enter.<br />
8. Open a PuTTY session on each additional RA at the site for which the repository<br />
volume is to be formatted.<br />
9. Log on with boxmgmt as the user ID and boxmgmt as the password.<br />
The Main menu is displayed.<br />
a. At the prompt, type 2 (Setup) and press Enter.<br />
b. On the Setup menu, type 2 (Configure repository volume) and press Enter.<br />
c. Type 2 (Select a previously formatted repository volume) and press Enter.<br />
d. Enter the appropriate number from the list to select the LUN. Ensure that<br />
the WWN and LUN are for the previously formatted volume that you want<br />
to select. The LUN and identifier are displayed.<br />
e. Confirm the volume selection.<br />
f. Verify that the operation succeeds and press Enter.<br />
g. On the Main menu, type Q (quit) and press Enter.<br />
Note: Complete step 9 for each additional RA at the site.<br />
10. On the Management Console, select the Splitters tab.<br />
a. Click the Add New Splitter icon to open the Add splitter window.<br />
b. Click Rescan and select the splitter.<br />
11. Open the Group Properties window, click the Policy tab, and perform the<br />
following steps for each data group:<br />
a. Change the Global cluster mode selection to auto-data (shared<br />
quorum).<br />
b. Right-click the Data Group and click Enable Group.<br />
12. Skip to step 16 for geographic replication environments.<br />
13. Perform the following steps for geographic clustered environments.<br />
a. Right-click the Quorum Group and click Enable Group.<br />
b. Click the Quorum Group and select Policy in the right pane.<br />
c. Change the Global Cluster mode selection to Auto-quorum (shared<br />
quorum).<br />
14. Verify that initialization completes for all the groups.<br />
15. Review the Management Console event log.<br />
16. Ensure that no storage error or other component error appears.<br />
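The boxmgmt menu navigation in steps 7 and 9 follows a fixed choice sequence, which can be captured as data if you script the session (for example over SSH). This sketch only encodes the menu choices from the steps above; the interactive LUN confirmation still belongs to the operator.

```python
def repository_menu_sequence(action):
    """Return the boxmgmt menu choices to send, in order.

    "2" (Setup), then "2" (Configure repository volume), then "1" to format
    a repository volume or "2" to select a previously formatted one, and
    "Q" to quit back to the Main menu afterward.
    """
    choices = {"format": ["2", "2", "1"], "select": ["2", "2", "2"]}
    if action not in choices:
        raise ValueError("action must be 'format' or 'select'")
    return choices[action] + ["Q"]

print(repository_menu_sequence("format"))  # on RA 1 at the affected site
print(repository_menu_sequence("select"))  # on each additional RA at the site
```

Keeping the sequences as data makes it obvious that the only difference between RA 1 and the remaining RAs is the third choice: format once, then select everywhere else.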
Journal Not Accessible<br />
Problem Description<br />
The journal is not accessible to either RA.<br />
Or, a journal for one of the consistency groups is corrupted, either by another initiator<br />
because of storage changes or as a result of storage failure. Because the snapshot<br />
history is corrupted, replication for the relevant consistency group cannot proceed.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The Volumes tab on the management console shows an error for the journal volume.<br />
(See Figure 5–8.)<br />
Figure 5–8. Volumes Tab Shows Journal Volume Error<br />
• The RAs tab on the management console shows errors for connections between<br />
this volume and the RAs. (See Figure 5–9.)<br />
Figure 5–9. RAs Tab Shows Connection Errors<br />
• The Groups tab on the management console shows that the system paused the<br />
transfer for the relevant consistency group, as shown in Figure 5–10.<br />
Figure 5–10. Groups Tab Shows Group Paused by System<br />
• The Logs tab on the management console lists a message for event ID 3012. This<br />
message indicates that the RA is unable to access the volume. (See Figure 5–11.)<br />
Figure 5–11. Management Console Messages for the Journal Not Accessible<br />
Problem<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Determine whether other volumes from the same storage device are accessible to<br />
the same RAs to rule out a total storage loss. If no volumes are seen by an RA, refer<br />
to “Total Storage Loss in a Geographic Replicated Environment.”<br />
• Verify that this LUN still exists on the storage device and that it is only masked to<br />
the RAs.<br />
• Verify that the volume has read and write permissions on the storage system.<br />
• Verify that the volume, as configured in the management console, has the expected<br />
WWN and LUN.<br />
• For a corrupted journal, check that the system recovers automatically by re-creating<br />
the data structures for the corrupted journal and that the system then initiates a<br />
full-sweep resynchronization. No manual intervention is needed.<br />
Journal Volume Lost Scenarios<br />
Problem Description<br />
The journal volume is lost and becomes unavailable in the scenarios described below.<br />
Scenarios<br />
• Writing data to the journal volume faster than the journal data can be distributed to<br />
the replication volume results in journal data loss. In this case, the journal volume<br />
can fill, and an attempt to perform a write operation on it creates a problem.<br />
• The user performs the following operations:<br />
− Failover<br />
− Recover production<br />
Actions to Resolve<br />
You can minimize the occurrence of this problem in the first scenario by carefully<br />
configuring the journal lag. The problem is unavoidable in the second scenario.<br />
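The first scenario is a rate problem: the journal consumes free space at the difference between the incoming write rate and the distribution rate. A rough back-of-the-envelope check, with illustrative numbers rather than measured values:

```python
def time_until_journal_full(capacity_gb, write_rate_mbps, distribute_rate_mbps):
    """Seconds until a journal of capacity_gb fills, or None if it never fills.

    Rates are in MB/s; a sustained write rate above the distribution rate
    consumes free journal space at the difference of the two.
    """
    surplus = write_rate_mbps - distribute_rate_mbps
    if surplus <= 0:
        return None  # distribution keeps up, so the journal drains
    return capacity_gb * 1024 / surplus

# Illustrative: a 100 GB journal, 80 MB/s writes, 60 MB/s distribution.
print(time_until_journal_full(100, 80, 60) / 60, "minutes until full")
```

With those example numbers, the journal fills in roughly 85 minutes; sizing the journal (or limiting the sustained write burst) so the surplus stays near zero is what "carefully configuring the journal lag" amounts to.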
Total Storage Loss in a Geographic Replicated<br />
Environment<br />
Problem Description<br />
All volumes belonging to a certain storage target and WWN (or controller, device) have<br />
been lost.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The symptoms can be the same as those from any of the volume failure problems<br />
listed previously (or a subset of those symptoms), if the symptoms are relevant to<br />
the volumes that were used on this target. All volumes common to a particular<br />
storage array have failed.<br />
The Volumes tab on the management console shows errors for all volumes. (See<br />
Figure 5–12.)<br />
Figure 5–12. Management Console Volumes Tab Shows Errors for All Volumes<br />
• No volumes from the relevant target and WWN are accessible to any initiator on the<br />
SAN, as shown on the RAs tab on the management console. (See Figure 5–13.)<br />
Figure 5–13. RAs Tab Shows Volumes That Are Not Accessible<br />
• Multipathing software (such as EMC PowerPath Administrator) reports failed paths<br />
to the storage device, as shown in Figure 5–14.<br />
Figure 5–14. Multipathing Software Reports Failed Paths to Storage Device<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Verify that the storage device has not experienced a power outage and that the<br />
device is functioning normally according to all external indicators.<br />
• Verify that the Fibre Channel switch and the storage device indicate an operating<br />
Fibre Channel connection (that is, the relevant LEDs show OK). If the indicators are<br />
not OK, the problem might be a faulty Fibre Channel port (storage, switch, or patch<br />
panel) or a faulty Fibre Channel cable.<br />
• Verify that the initiator can be seen from the switch name server. If not, the problem<br />
could be a Fibre Channel port or cable problem (as in the preceding item). Otherwise,<br />
the problem could be a misconfiguration of the port on the switch (for example, type<br />
or speed could be wrong).<br />
• Verify that the target WWN is included in the relevant zones (that is, hosts and RA).<br />
Verify also that the current zoning configuration is the active configuration. If you use<br />
the default zone, verify that it is set to permit by default.<br />
• Verify that the relevant LUNs still exist on the storage device and are masked to the<br />
proper splitters and RAs.<br />
• Verify that volumes have read and write permissions on the storage system.<br />
• Verify that these volumes are exposed and managed by the proper hosts and that<br />
there are no other hosts on the SAN that use this volume.<br />
Storage Failure on One Site in a Geographic<br />
Clustered Environment<br />
In a geographic clustered environment where MSCS is running, if the storage subsystem<br />
on one site fails, the symptoms and resulting actions depend on whether the quorum<br />
owner resided on the failed storage subsystem.<br />
To understand the two scenarios and to follow the actions for both possibilities, review<br />
Figure 5–15.<br />
Figure 5–15. Storage on Site 1 Fails<br />
Storage Failure on One Site with Quorum Owner on Failed Site<br />
Problem Description<br />
In this case, the cluster quorum owner as well as the quorum resource resides on the<br />
failed storage subsystem.<br />
The quorum and resource automatically fail over to the node that gains control through<br />
MSCS arbitration. This node resides on the site without the storage failure.<br />
The RAs use the last available image. This action results in a loss of data that has yet to<br />
be replicated. The resources cannot fail back to the failed site until the storage<br />
subsystem is restored.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• A node on which the cluster was running might report a delayed write failure or<br />
similar error.<br />
• The quorum reservation is lost, and MSCS stops on the cluster node that owned the<br />
quorum resource. This action triggers a cluster “regroup” process, which allows<br />
other cluster nodes to arbitrate for the quorum device. Figure 5–16 shows typical<br />
listings for the cluster regroup process.<br />
Figure 5–16. Cluster “Regroup” Process<br />
• Cluster nodes located on the failed storage subsystem fail quorum arbitration<br />
because the service cannot provide a reservation on the quorum volume. The<br />
resources fail over to the site without a storage failure. The first cluster node on the<br />
site without the storage failure that successfully completes arbitration of the quorum<br />
device assumes ownership of the cluster.<br />
The following messages illustrate this process.<br />
Cluster Log Entries<br />
INFO Physical Disk : [DiskArb]------- DisksArbitrate -------.<br />
INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Attaching to disk with<br />
signature f6fb216<br />
INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Disk unique id present<br />
trying new attach<br />
INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Retrieving disk number<br />
from ClusDisk registry key<br />
INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Retrieving handle to<br />
PhysicalDrive9<br />
INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Returns success.<br />
INFO Physical Disk : [DiskArb] Arbitration Parameters: ArbAttempts 5,<br />
SleepBeforeRetry 500 ms.<br />
INFO Physical Disk : [DiskArb] Read the partition info to insure the disk is<br />
accessible.<br />
INFO Physical Disk : [DiskArb] Issuing GetPartInfo on signature f6fb216.<br />
INFO Physical Disk : [DiskArb] GetPartInfo completed, status 0.<br />
INFO Physical Disk : [DiskArb] Arbitrate for ownership of the disk by<br />
reading/writing various disk sectors.<br />
INFO Physical Disk : [DiskArb] Successful read (sector 12) [:0]<br />
(0,00000000:00000000).<br />
INFO Physical Disk : [DiskArb] Successful write (sector 11) [USMV-DL580:0]<br />
(0,6ddd5cac:01c6d778).<br />
INFO Physical Disk : [DiskArb] Successful read (sector 12) [:0]<br />
(0,00000000:00000000).<br />
INFO Physical Disk : [DiskArb] Successful write (sector 12) [USMV-DL580:0]<br />
(0,6ddd5cac:01c6d778).<br />
INFO Physical Disk : [DiskArb] Successful read (sector 11) [USMV-DL580:0]<br />
(0,6ddd5cac:01c6d778).<br />
INFO Physical Disk : [DiskArb] Issuing Reserve on signature f6fb216.<br />
INFO Physical Disk : [DiskArb] Reserve completed, status 0.<br />
INFO Physical Disk : [DiskArb] CompletionRoutine starts.<br />
INFO Physical Disk : [DiskArb] Posting request to check reserve progress.<br />
INFO Physical Disk : [DiskArb] ********* IO_PENDING ********** - Request to insure<br />
reserves working is now posted.<br />
WARN Physical Disk : [DiskArb] Assume ownership of the device.<br />
INFO Physical Disk : [DiskArb] Arbitrate returned status 0.<br />
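The arbitration transcript above ends with "Reserve completed, status 0", which marks success. A small helper can classify such transcripts when you are triaging logs from several nodes; it matches only the strings shown in the examples in this section and may need adjusting for other log variants.

```python
def arbitration_succeeded(log_lines):
    """True if a DiskArb transcript shows the reservation completing with status 0."""
    for line in log_lines:
        if "Reserve completed, status 0" in line:
            return True  # the node won arbitration and holds the reservation
        if "arbitration failed" in line.lower():
            return False  # the node could not reserve the quorum device
    return False  # no verdict found in the transcript

transcript = [
    "INFO Physical Disk : [DiskArb] Issuing Reserve on signature f6fb216.",
    "INFO Physical Disk : [DiskArb] Reserve completed, status 0.",
]
print(arbitration_succeeded(transcript))
```

Running this over the per-node cluster logs quickly shows which node assumed ownership and which nodes failed arbitration because they could not reach the quorum volume.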
• In Cluster Administrator, the groups that were online on one node change to the<br />
node that wins arbitration, as shown in Figure 5–17.<br />
Figure 5–17. Cluster Administrator Displays<br />
• Multipathing software, if present, reports errors on the host servers of the site for<br />
which the storage subsystem failed. Figure 5–18 shows errors for failed storage<br />
devices.<br />
Figure 5–18. Multipathing Software Shows Server Errors for Failed Storage<br />
Subsystem<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Verify that all cluster resources failed over to a node on the site for which the<br />
storage subsystem did not fail and that these resources are online. If the cluster is<br />
running and no additional errors are reported, the problem has probably been isolated<br />
to a total site storage failure.<br />
• Log in to the storage subsystem, and verify that all LUNs are present and configured<br />
properly.<br />
• If the storage subsystem appears to be operating, the problem is most likely<br />
because of a failed SAN switch. See “Total SAN Switch Failure on One Site in a<br />
Geographic Clustered Environment” in Section 6.<br />
• Resolve the failure of the storage subsystem before attempting failback. Once the<br />
storage subsystem is working and the RAs and host can access it, a full initialization<br />
is initiated.<br />
Storage Failure on One Site with Quorum Owner on Surviving<br />
Site<br />
Problem Description<br />
In this case, the cluster quorum owner does not reside on the failed storage subsystem,<br />
but other resources do reside on the failed storage subsystem.<br />
The cluster resources fail over to a site without a failed storage subsystem. The RAs use<br />
the last available image. This action results in a loss of data that has yet to be replicated<br />
(if replication is not synchronous). The resources cannot fail back to the failed site until<br />
the storage subsystem is restored.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The cluster marks the data groups containing the physical disk resources as failed.<br />
• Applications dependent on the physical disk resource go offline. Failed resources<br />
attempt to come online on the failed site, but fail. Then the resources fail over to the<br />
site with a valid storage subsystem.<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Verify that multipathing software, if present, reports errors on the host servers at the<br />
site with the suspected failed storage subsystem. (See Figure 5–19.)<br />
• Verify that all cluster resources failed over to site 2 in Cluster Administrator. Entries<br />
similar to the following occur in the cluster log for a host at the site with a failed<br />
storage subsystem (thread ID and timestamp removed).<br />
Cluster Log<br />
Disk reservation lost ..<br />
ERR Physical Disk : [DiskArb] CompletionRoutine: reservation lost! Status 2<br />
Arbitrate for disk ....<br />
INFO Physical Disk : [DiskArb] Arbitration Parameters: ArbAttempts 5,<br />
SleepBeforeRetry 500 ms.<br />
INFO Physical Disk : [DiskArb] Read the partition info to insure the disk is<br />
accessible.<br />
INFO Physical Disk : [DiskArb] Issuing GetPartInfo on signature f6fb211.<br />
ERR Physical Disk : [DiskArb] GetPartInfo completed, status 2.<br />
INFO Physical Disk : [DiskArb] Arbitrate for ownership of the disk by<br />
reading/writing various disk sectors.<br />
ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 2.<br />
INFO Physical Disk : [DiskArb] We are about to break reserve.<br />
INFO Physical Disk : [DiskArb] Issuing BusReset on signature f6fb211.<br />
Give up after 5 re-tries ...<br />
INFO Physical Disk : [DiskArb] We are about to break reserve.<br />
INFO Physical Disk : [DiskArb] Issuing BusReset on signature f6fb211.<br />
INFO Physical Disk : [DiskArb] BusReset completed, status 0.<br />
INFO Physical Disk : [DiskArb] Read the partition info from the disk to insure<br />
disk is accessible.<br />
INFO Physical Disk : [DiskArb] Issuing GetPartInfo on signature f6fb211.<br />
ERR Physical Disk : [DiskArb] GetPartInfo completed, status 2.<br />
ERR Physical Disk : [DiskArb] Failed to write (sector 12), error 2.<br />
ERR Physical Disk : Online, arbitration failed. Error: 2.<br />
INFO Physical Disk : Online, setting ResourceState 4 .<br />
Control goes offline at failed site...<br />
INFO [FM] FmpDoMoveGroup: Entry<br />
INFO [FM] FmpMoveGroup: Entry<br />
INFO [FM] FmpMoveGroup: Moving group 97ac3c3b-6985-44dd-bacd-a26e14966572 to node 4 (4)<br />
INFO [FM] FmpOfflineResource: Disk R: depends on Data1. Shut down first.<br />
INFO Unisys SafeGuard 30m Control : KfResourceOffline: Resource 'Data1' going<br />
offline.<br />
After trying other nodes at site move to remote site ...<br />
INFO [FM] FmpMoveGroup: Take group 97ac3c3b-6985-44dd-bacd-a26e14966572 request to remote<br />
node 4<br />
Move succeeds ...<br />
INFO [FM] FmpMoveGroup: Exit group , status = 0<br />
INFO [FM] FmpDoMoveGroup: Exit, status = 0<br />
INFO [FM] FmpDoMoveGroupOnFailure: FmpDoMoveGroup returns 0<br />
INFO [FM] FmpDoMoveGroupOnFailure Exit.<br />
INFO [GUM] s_GumUpdateNode: dispatching seq 5720 type 0 context 9<br />
INFO [FM] GUM update group 97ac3c3b-6985-44dd-bacd-a26e14966572, state 0<br />
INFO [FM] New owner of Group 97ac3c3b-6985-44dd-bacd-a26e14966572 is 2, state 0, curstate<br />
0.<br />
• Log in to the failed storage subsystem and determine whether the storage reports<br />
failed or missing disks. If the storage subsystem appears to be fine, the problem is<br />
most likely because of a SAN switch failure. See “Total SAN Switch Failure on One<br />
Site in a Geographic Clustered Environment” in Section 6.<br />
• Once the storage for the site that failed is back online, a full sweep is initiated.<br />
Check that the messages “Starting volume sweep” and “Starting full sweep” are<br />
displayed as an Events notice.<br />
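Rather than watching the console, you can check collected notices for the two sweep-start messages programmatically. A minimal sketch (the notice text is taken from the bullet above; the `sweep_started` helper is illustrative, not part of the product):<br />

```python
def sweep_started(notice_lines):
    """Check collected console notices for both sweep-start messages."""
    text = "\n".join(notice_lines)
    return ("Starting volume sweep" in text) and ("Starting full sweep" in text)

notices = [
    "Starting volume sweep",   # per-volume sweep begins
    "Starting full sweep",     # full sweep begins
]
print(sweep_started(notices))  # prints True
```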
6872 5688–002 5–21
Section 6<br />
Solving SAN Connectivity Problems<br />
This section lists symptoms that usually indicate problems with connections to the<br />
storage subsystem. Table 6–1 lists symptoms and possible problems indicated by the<br />
symptom. The problems and their solutions are described in this section. The graphics,<br />
behaviors, and examples in this section are similar to what you observe with your<br />
system but might differ in some details.<br />
In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />
for possible problems. Messages similar to the e-mail notifications might also be<br />
displayed on the management console. If you do not see these messages, they might<br />
have already dropped off the display; review the management console logs to find<br />
messages that have dropped off the display.<br />
Table 6–1. Possible SAN Connectivity Problems<br />
Symptoms: The system pauses the transfer. If the volume is accessible to another RA, a<br />
switchover occurs, and the relevant groups start running on the new RA. The relevant<br />
message appears in the event log. The link to the volume from the disconnected RA or<br />
RAs shows an error. The volume is accessible to the splitters that are attached to it.<br />
Possible problem: Volume not accessible to RAs<br />
Symptoms: The system pauses the transfer for the relevant groups. If the repository<br />
volume is not accessible, the management console shows an error for the splitter. If a<br />
replication volume is not accessible, the splitter connection to that volume shows an<br />
error.<br />
Possible problem: Volume not accessible to <strong>SafeGuard</strong> 30m splitter<br />
Symptoms: The system pauses the transfer for the relevant group or groups. If the<br />
connection with only one of the RAs is lost, the group or groups can restart the transfer<br />
by means of another RA, beginning with a short initialization. The splitter connection to<br />
the relevant RAs shows an error. The relevant message describes the lost connection in<br />
the event log.<br />
Possible problem: RAs not accessible to <strong>SafeGuard</strong> 30m splitter<br />
Symptoms: The management console shows a server down. Messages on the<br />
management console show that the splitter is down and that the node fails over.<br />
Multipathing software (such as EMC PowerPath Administrator) messages report an<br />
error.<br />
Possible problem: Server unable to connect with SAN (see “Server Unable to Connect<br />
with SAN” in Section 9; this problem is not described in this section)<br />
Symptoms: Cluster nodes fail and the cluster regroup process begins. Applications fail<br />
and attempt to restart. Messages regarding failed physical disks are displayed on the<br />
management console. The cluster resources fail over to the remote site.<br />
Possible problem: Total SAN switch failure on one site in a geographic clustered<br />
environment<br />
Volume Not Accessible to RAs<br />
Problem Description<br />
A volume (repository volume, replication volume, or journal) is not accessible to one or<br />
more RAs, but it is accessible to all other relevant initiators (that is, the splitters).<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The system pauses the transfer. If the volume is accessible to another RA, a<br />
switchover occurs, and the relevant group or groups start running on the new RA.<br />
• The management console displays failures similar to those in Figure 6–1.<br />
Figure 6–1. Management Console Showing “Inaccessible Volume” Errors<br />
• Warnings and informational messages similar to those shown in Figure 6–2 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 6–2. Management Console Messages for Inaccessible Volumes<br />
The following table explains the numbered messages shown in Figure 6–2.<br />
1 (Event ID 3012): The RA is unable to access the volume (RA 2, quorum).<br />
2 (Event ID 5049): Splitter write to RA failed.<br />
3 (Event ID 4003): For each consistency group, the surviving site reports a group<br />
consistency problem. The details show a WAN problem.<br />
4 (Event ID 4044): The group is deactivated indefinitely by the system.<br />
5 (Event ID 4003): For each consistency group, a minor problem is reported. The details<br />
show that sides are not linked and also cannot transfer data.<br />
6 (Event ID 4001): For each consistency group, a minor problem is reported. The details<br />
show that sides are not linked and also cannot transfer data.<br />
7 (Event ID 5032): The splitter is splitting to replication volumes.<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
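When scanning collected logs by hand, the event IDs used in this section can be kept in a small lookup. The summaries below are transcribed from the table above; Appendix E remains the authoritative reference, and the `describe` helper is illustrative, not part of the product:<br />

```python
# Event IDs and summaries transcribed from the table above; Appendix E of
# this guide is the authoritative reference for all event IDs.
EVENT_SUMMARIES = {
    3012: "RA is unable to access the volume",
    4001: "Minor group problem: sides not linked and cannot transfer data",
    4003: "Group consistency problem reported",
    4044: "Group deactivated indefinitely by the system",
    5032: "Splitter is splitting to replication volumes",
    5049: "Splitter write to RA failed",
}

def describe(event_id):
    """Summarize an event ID found in a collected management console log."""
    return EVENT_SUMMARIES.get(event_id, "unknown event; see Appendix E")

print(describe(5049))  # prints: Splitter write to RA failed
```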
• If you review the Windows system event log, you can find messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
previous management console images:<br />
System Event Log for USMV-SYDNEY Host (Host on Failure Site)<br />
5/28/2008 9:31:53 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-SYDNEY<br />
Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.<br />
5/28/2008 9:31:53 PM Srv Warning None 2012 N/A USMV-SYDNEY While transmitting or receiving<br />
data, the server encountered a network error. Occasional errors are expected, but large amounts of these<br />
indicate a possible error in your network configuration. The error status code is contained within the<br />
returned data (formatted as Words) and may point you towards the problem.<br />
5/28/2008 9:31:54 PM Ftdisk Warning Disk 57 N/A USMV-CAS100P2 The system failed to<br />
flush data to the transaction log. Corruption may occur.<br />
5/28/2008 9:32:54 PM Service Control Manager Information None 7035 CLUSTERNET\clusadmin USMV-<br />
SYDNEY The Windows Internet Name Service (WINS) service was successfully sent a stop control.<br />
System Event Log for Usmv-x455 Host (Host on Surviving Site)<br />
5/28/2008 9:32:56 PM ClusSvc Warning Node Mgr 1123 N/A USMV-X455<br />
The node lost communication with cluster node 'USMV-SYDNEY' on network '<strong>Public</strong>'.<br />
5/28/2008 9:32:56 PM ClusSvc Warning Node Mgr 1123 N/A USMV-X455<br />
The node lost communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
5/28/2008 9:33:10 PM ClusDisk Error None 1209 N/A USMV-X455<br />
Cluster service is requesting a bus reset for device \Device\ClusDisk0.<br />
5/28/2008 9:33:30 PM ClusSvc Warning Node Mgr 1135 N/A USMV-X455 Cluster node USMV-<br />
SYDNEY was removed from the active server cluster membership. Cluster service may have been<br />
stopped on the node, the node may have failed, or the node may have lost communication with the other<br />
active server cluster nodes.<br />
5/28/2008 9:33:30 PM ClusSvc Information Failover Mgr 1200 N/A USMV-X455<br />
"The Cluster Service is attempting to bring online the Resource Group ""Cluster Group""."<br />
5/28/2008 9:33:34 PM ClusSvc Information Failover Mgr 1201 N/A USMV-X455<br />
"The Cluster Service brought the Resource Group ""Cluster Group"" online."<br />
5/28/2008 9:34:08 PM Service Control Manager Information None 7036 N/A USMV-X455<br />
The Windows Internet Name Service (WINS) service entered the running state.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images:<br />
Cluster Log for USMV-SYDNEY Host (Host on Failure Site)<br />
00000e44.00000380::2008/05/28-21:31:53.841 ERR Physical Disk : [DiskArb]<br />
CompletionRoutine: reservation lost! Status 170 (Error 170: the requested resource is in use)<br />
00000e44.00000380::2008/05/28-21:31:53.841 ERR [RM] LostQuorumResource, cluster service<br />
terminated...<br />
00000e44.00000f0c::2008/05/28-21:31:55.011 ERR Network Name : Unable to<br />
open handle to cluster, status 1753. (Error 1753: there are no more endpoints available from the endpoint<br />
mapper)<br />
00000e44.00000f08::2008/05/28-21:31:55.341 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6. (Error 6: the handle is invalid)<br />
00000e44.00000f0c::2008/05/28-1:35:56.125 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
00000e44.00000f0c::2008/05/28-1:35:56.125 ERR Physical Disk : [DiskArb] Error cleaning<br />
arbitration sector, error 170.<br />
Cluster Log for Usmv-x455 Host (Host on Surviving Site)<br />
0000015c.00000234::2008/05/28-21:31:56.299 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 2<br />
0000015c.00000234::2008/05/28-21:31:56.299 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 1<br />
00000688.00000e10::2008/05/28-1:35:10.712 ERR Physical Disk : [DiskArb] Signature of disk<br />
has changed or failed to find disk with id, old signature 0x98f3f0b new signature 0x98f3f0b, status 2.<br />
(Error 2: The system cannot find the file specified)<br />
0000015c.000007c8::2008/05/28-1:35:31.136 WARN [NM] Interface f409cf69-9c30-48f0-8519-<br />
ad5dd14c3300 is unavailable (node: USMV-SYDNEY, network: Private LAN).<br />
0000015c.000004fc::2008/05/28-1:35:31.136 WARN [NM] Interface 5019923b-d7a1-4886-825f-<br />
207b5938d11e is unavailable (node: USMV-SYDNEY, network: <strong>Public</strong>).<br />
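The cluster log excerpts above share a fixed line prefix: process and thread IDs, a timestamp, and a severity. When triaging a large cluster.log, a short filter for ERR entries can help locate the failure window quickly. This is a sketch against the line format shown above; the `errors` helper is ours, not part of any product:<br />

```python
import re

# Cluster log line format, as in the excerpts above:
#   <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.mmm> <LEVEL> <component and message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]+)\.(?P<tid>[0-9a-f]+)::"
    r"(?P<when>\d{4}/\d{2}/\d{2}-\d{1,2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>ERR|WARN|INFO)\s+(?P<text>.*)$"
)

def errors(lines):
    """Yield (timestamp, text) for every ERR entry in a cluster log."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("level") == "ERR":
            yield m.group("when"), m.group("text")

sample = [
    "00000e44.00000380::2008/05/28-21:31:53.841 ERR Physical Disk : "
    "[DiskArb] CompletionRoutine: reservation lost! Status 170",
    "0000015c.00000234::2008/05/28-21:31:56.299 INFO [ClMsg] Received "
    "interface unreachable event for node 1 network 2",
]
for when, text in errors(sample):
    print(when, text)
```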
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Verify that the physical connection between the inaccessible RAs and the Fibre<br />
Channel switch is healthy.<br />
• Verify that any disconnected RA appears in the name server of the Fibre Channel<br />
switch. If not, the problem could be because of a bad port on the switch, a bad host<br />
bus adaptor (HBA), or a bad cable.<br />
• Verify that any disconnected RA is present in the proper zone and that the current<br />
zoning configuration is enabled.<br />
• Verify that the correct volume is configured (WWN and LUN). To double-check, enter<br />
the Create Volume command in the management console, and verify that the same<br />
volume does not appear on the list of volumes that are available to be “created.”<br />
• If the volume is not accessible to the RAs but is accessible to a splitter, and the<br />
server on which that splitter is installed is clustered using MSCS, Oracle RAC, or any<br />
other software that uses a reservation method, the problem probably occurs<br />
because the server has reserved the volume.<br />
For more information about the clustered environment installation process, see the<br />
Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Planning and Installation <strong>Guide</strong> and the Unisys<br />
<strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator's <strong>Guide</strong>.<br />
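The name-server and zoning checks above reduce to a set-membership question: the disconnected RA and the storage port must share at least one zone in the currently enabled configuration. A sketch of that check, where the zone names and WWPNs are hypothetical:<br />

```python
def shared_enabled_zones(zones, enabled, wwpn_a, wwpn_b):
    """Return the enabled zones that contain both WWPNs.

    zones:   mapping of zone name to the set of member WWPNs
    enabled: set of zone names in the currently enabled configuration
    """
    return [name for name, members in zones.items()
            if name in enabled and wwpn_a in members and wwpn_b in members]

# Hypothetical fabric: one zone for the RA, one for the hosts.
zones = {
    "RA1_storage":  {"10:00:00:00:c9:00:00:01", "50:06:01:60:00:00:00:01"},
    "host_storage": {"10:00:00:00:c9:00:00:99", "50:06:01:60:00:00:00:01"},
}
enabled = {"RA1_storage"}
print(shared_enabled_zones(zones, enabled,
                           "10:00:00:00:c9:00:00:01",
                           "50:06:01:60:00:00:00:01"))
```

An empty result for an RA/storage pair points at the zoning, even when both ports are individually healthy.<br />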
Volume Not Accessible to <strong>SafeGuard</strong> 30m Splitter<br />
Problem Description<br />
A volume (repository volume, replication volume, or journal) is not accessible to one or<br />
more splitters but is accessible to all other relevant initiators (for example, the RAs).<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The system pauses the transfer for the relevant groups.<br />
• If the repository volume is not accessible, the management console shows an error<br />
for the splitter. If a replication volume is not accessible, the splitter connection to<br />
that volume shows an error.<br />
• The management console System Status screen and the Splitter Settings screen<br />
show error indications similar to those in Figure 6–3.<br />
Figure 6–3. Management Console Error Display Screen<br />
• Warnings and informational messages similar to those shown in Figure 6–4 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 6–4. Management Console Messages for Volumes Inaccessible to Splitter<br />
The following table explains the numbered messages shown in Figure 6–4.<br />
1 (Event ID 4008): For each consistency group at the failed site, the transfer is paused<br />
to allow a failover to the surviving site.<br />
2 (Event ID 5030): The splitter write operation failed.<br />
3 (Event ID 4001): For each consistency group, a minor problem is reported. The details<br />
show sides are not linked and cannot transfer data.<br />
4 (Event ID 4005): Negotiating the transfer protocol.<br />
5 (Event ID 4016): Transferring the latest snapshot before pausing the transfer (no data<br />
is lost).<br />
6 (Event ID 4007): Pausing the data transfer.<br />
7 (Event ID 4087): For each consistency group at the failed site, initialization completes.<br />
8 (Event ID 5032): The splitter is splitting to replication volumes at the surviving site.<br />
9 (Event ID 5049): Splitter write to RA failed.<br />
10 (Event ID 4086): For each consistency group at the failed site, the data transfer starts<br />
and then the initialization starts.<br />
11 (Event ID 4104): The group started accepting writes.<br />
12 (Event ID 5015): The splitter is up.<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
• The multipathing software (such as EMC PowerPath) on the server at the failed site<br />
reports disk errors as shown in Figure 6–5.<br />
Figure 6–5. EMC PowerPath Shows Disk Error<br />
• If you review the Windows system event log, you can find messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
previous management console images:<br />
System Event Log for USMV-SYDNEY Host (Host on Failure Site)<br />
5/29/2008 1:35:20 AM EmcpBase Error None 108 N/A USMV-SYDNEY Volume<br />
6006016011321100158233EDE0B23DB11 is unbound.<br />
5/29/2008 1:35:20 AM EmcpBase Error None 100 N/A USMV-SYDNEY Path Bus 5 Tgt 3 Lun 2<br />
to APM00042302162 is dead.<br />
5/29/2008 1:35:20 AM EmcpBase Error None 100 N/A USMV-SYDNEY Path Bus 5 Tgt 0 Lun 2<br />
to APM00042302162 is dead.<br />
5/29/2008 1:35:20 AM EmcpBase Error None 100 N/A USMV-SYDNEY Path Bus 3 Tgt 3 Lun 2<br />
to APM00042302162 is dead.<br />
5/29/2008 1:35:20 AM EmcpBase Error None 100 N/A USMV-SYDNEY Path Bus 3 Tgt 0 Lun 2<br />
to APM00042302162 is dead.<br />
5/29/2008 1:35:20 AM EmcpBase Error None 104 N/A USMV-SYDNEY All paths to<br />
6006016011321100158233EDE0B23DB11 are dead.<br />
5/29/2008 1:35:20 AM Srv Error None 2000 N/A USMV-SYDNEY The server's call to a system<br />
service failed unexpectedly.<br />
5/29/2008 1:36:18 AM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to flush<br />
data to the transaction log. Corruption may occur.<br />
5/29/2008 1:36:18 AM Srv Error None 2000 N/A USMV-SYDNEY The server's call to a system<br />
service failed unexpectedly.<br />
5/29/2008 1:36:18 AM Ntfs Warning None 50 N/A USMV-SYDNEY {Delayed Write Failed} Windows<br />
was unable to save all the data for the file. The data has been lost. This error may be caused by a failure of<br />
your computer hardware or network connection. Please try to save this file elsewhere.<br />
5/29/2008 1:36:18 AM Application Popup Information None 26 N/A USMV-SYDNEY<br />
Application popup: Windows - Delayed Write Failed : Windows was unable to save all the data for the file<br />
S:\$BitMap. The data has been lost. This error may be caused by a failure of your computer hardware or<br />
network connection. Please try to save this file elsewhere.<br />
5/29/2008 1:36:19 AM Service Control Manager Information None 7035 CLUSTERNET\clusadmin<br />
USMV-SYDNEY The Windows Internet Name Service (WINS) service was successfully sent a stop<br />
control.<br />
System Event Log for Usmv-x455 Host (Host on Surviving Site)<br />
5/29/2008 1:35:26 AM ClusSvc Warning Node Mgr 1123 N/A USMV-X455 The node lost<br />
communication with cluster node 'USMV-SYDNEY' on network '<strong>Public</strong>'.<br />
5/29/2008 1:35:26 AM ClusSvc Warning Node Mgr 1123 N/A USMV-X455 The node lost<br />
communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
5/29/2008 1:35:40 AM ClusDisk Error None 1209 N/A USMV-X455 Cluster service is requesting<br />
a bus reset for device \Device\ClusDisk0.<br />
5/29/2008 1:36:06 AM ClusSvc Warning Node Mgr 1135 N/A USMV-X455 Cluster node USMV-<br />
SYDNEY was removed from the active server cluster membership. Cluster service may have been<br />
stopped on the node, the node may have failed, or the node may have lost communication with the other<br />
active server cluster nodes.<br />
5/29/2008 1:36:06 AM ClusSvc Information Failover Mgr 1200 N/A USMV-X455 "The Cluster<br />
Service is attempting to bring online the Resource Group ""Cluster Group""."<br />
5/29/2008 1:36:10 AM ClusSvc Information Failover Mgr 1201 N/A USMV-X455 "The Cluster<br />
Service brought the Resource Group ""Cluster Group"" online."<br />
5/29/2008 1:36:36 AM Service Control Manager Information None 7035<br />
CLUSTERNET\clusadmin USMV-X455 The Windows Internet Name Service (WINS) service was<br />
successfully sent a start control.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images:<br />
Cluster Log for USMV-SYDNEY Host (Host on Failure Site)<br />
00000d68.00000284::2008/05/29-1:35:21.703 ERR Physical Disk : [DiskArb]<br />
CompletionRoutine: reservation lost! Status 21 (Error 21: the device is not ready)<br />
00000d68.00000284::2008/05/29-1:35:22.713 ERR Physical Disk : [DiskArb]<br />
CompletionRoutine: reservation lost! Status 2 (Error 2: the system cannot find the file specified)<br />
00000d68.00000284::2008/05/29-1:35:22.713 ERR [RM] LostQuorumResource, cluster service<br />
terminated...<br />
00000d68.00000e68::2008/05/29-1:35:23.133 ERR Physical Disk : LooksAlive, error checking<br />
device, error 2.<br />
00000d68.00000e68::2008/05/29-1:35:23.133 ERR Physical Disk : IsAlive, error checking<br />
device, error 2.<br />
00000d68.00000e68::2008/05/29-1:35:23.143 ERR Network Name : Name query request<br />
failed, status 3221225860.<br />
00000d68.00000e68::2008/05/29-1:35:23.143 INFO Network Name : Name SYDNEY-<br />
AUCKLAND failed IsAlive/LooksAlive check, error 22. (Error 22: the device does not recognize the<br />
command)<br />
00000d68.00000cd0::2008/05/29-1:35:23.303 ERR Network Name : Unable to<br />
open handle to cluster, status 1753.<br />
00000d68.00000cd0::2008/05/29-21:33:19.245 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 1117. (Error 1117: the request could not be performed because of an I/O device error)<br />
00000d68.00000cd0::2008/05/29-21:33:19.245 ERR Physical Disk : [DiskArb] Error cleaning<br />
arbitration sector, error 1117.<br />
Cluster Log for Usmv-x455 Host (Host on Surviving Site)<br />
0000015c.00000234::2008/05/29-1:35:26.121 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 2<br />
0000015c.00000234::2008/05/29-1:35:26.121 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 1<br />
00000688.00000d08::2008/05/29-1:35:40.523 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00000688.00000d08::2008/05/29-1:35:40.653 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Verify that the physical connection between the disconnected splitter or splitters and<br />
the Fibre Channel switch is healthy.<br />
• Verify that any host on which a disconnected splitter resides appears in the name<br />
server of the Fibre Channel switch. If not, the problem could be because of a bad<br />
port on the switch, a bad HBA, or a bad cable.<br />
• Verify that any host on which a disconnected splitter resides is present in the proper<br />
zone and that the current zoning configuration is enabled.<br />
• If a replication volume is not accessible to the splitter at the source site, but appears<br />
as OK in the management console for that splitter, verify that the splitter is not<br />
functioning at the target site (TSP not enabled). During normal replication, the<br />
system prevents target-site splitters from accessing the replication volumes.<br />
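The first two checks above can be partially automated once the switch's name-server listing has been captured to a file: every expected initiator WWPN should appear in it. A rough sketch, where the WWPNs, the captured listing, and the helper are all illustrative:<br />

```python
def missing_from_name_server(ns_listing, expected_wwpns):
    """Return the expected WWPNs that do not appear in a captured
    name-server listing (one string of switch CLI output)."""
    text = ns_listing.lower()
    return [w for w in expected_wwpns if w.lower() not in text]

# Hypothetical captured listing containing a single logged-in port.
listing = """
N  011000;  3; 10:00:00:00:c9:00:00:99; 20:00:00:00:c9:00:00:99
"""
print(missing_from_name_server(listing,
      ["10:00:00:00:c9:00:00:99", "10:00:00:00:c9:00:00:01"]))
```

Any WWPN reported missing points at a bad port, HBA, or cable rather than at zoning.<br />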
RAs Not Accessible to <strong>SafeGuard</strong> 30m Splitter<br />
Problem Description<br />
One or more RAs on a site are not accessible to the splitter through the Fibre Channel.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The system pauses the transfer for the relevant groups. If the connection with only<br />
one of the RAs is lost, the groups can restart the transfer by means of another RA,<br />
beginning with a short initialization.<br />
• The splitter connection to the relevant RAs shows an error.<br />
• The management console displays error indicators similar to those in Figure 6–6.<br />
Figure 6–6. Management Console Display Shows a Splitter Down<br />
• Warnings and informational messages similar to those shown in Figure 6–7 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 6–7. Management Console Messages for Splitter Inaccessible to RA<br />
The following table explains the numbered messages shown in Figure 6–7.<br />
1 (Event ID 4005): The surviving site negotiates the transfer protocol.<br />
2 (Event ID 4008): For each consistency group at the failed site, the transfer is paused<br />
to allow a failover to the surviving site.<br />
3 (Event ID 5002): The splitter for server USMV-SYDNEY is unable to access the RA.<br />
4 (Event ID 4105): The failed site stops accepting writes to the consistency group.<br />
5 (Event ID 4008): For each consistency group at the failed site, the transfer is paused<br />
to allow a failover to the surviving site.<br />
6 (Event ID 5013): Splitter down problem.<br />
7 (Event ID 4087): The synchronization completed message appears after the splitter is<br />
restored and replication completes.<br />
8 (Event ID 5032): The splitter starts splitting to the replication volumes.<br />
9 (Event ID 4001): A group capabilities problem is reported.<br />
10 (Event ID 5032): The splitter is splitting to replication volumes.<br />
13 (Event ID 5049): The splitter is unable to write to the RAs.<br />
14 (Event ID 4086): The original site starts the synchronization.<br />
15 (Event ID 4104): The consistency group starts replicating.<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
• If you review the Windows system event log, you can find messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
previous management console images:<br />
System Event Log for USMV-SYDNEY Host (Host on Failure Site)<br />
5/29/2008 2:25:20 AM ClusSvc Error Physical Disk Resource 1038 N/A USMV-SYDNEY Reservation<br />
of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.<br />
5/29/2008 2:25:20 AM Service Control Manager Error None 7034 N/A USMV-SYDNEY The Cluster<br />
service terminated unexpectedly.<br />
5/29/2008 2:25:50 AM Srv Warning None 2012 N/A USMV-SYDNEY While transmitting or<br />
receiving data, the server encountered a network error. Occasional errors are expected, but large amounts<br />
of these indicate a possible error in your network configuration. The error status code is contained within<br />
the returned data (formatted as Words) and may point you towards the problem.<br />
5/29/2008 2:25:20 AM Ftdisk Warning Disk 57 N/A USMV-SYDNEY<br />
The system failed to flush data to the transaction log. Corruption may occur.<br />
5/29/2008 2:25:21 AM Ntfs Warning None 50 N/A USMV-SYDNEY {Delayed Write Failed}<br />
Windows was unable to save all the data for the file. The data has been lost. This error may be caused by<br />
a failure of your computer hardware or network connection. Please try to save this file elsewhere.<br />
5/29/2008 2:25:32 AM Ftdisk Warning Disk 57 N/A USMV-SYDNEY<br />
The system failed to flush data to the transaction log. Corruption may occur.<br />
5/29/2008 2:25:32 AM Srv Error None 2000 N/A USMV-SYDNEY<br />
The server's call to a system service failed unexpectedly.<br />
5/29/2008 2:25:32 AM ClusSvc Error IP Address Resource 1077 N/A USMV-SYDNEY<br />
The TCP/IP interface for Cluster IP Address '' has failed.<br />
5/29/2008 2:25:32 AM ClusSvc Error Physical Disk Resource 1036 N/A USMV-SYDNEY<br />
Cluster disk resource '' did not respond to a SCSI maintenance command.<br />
5/29/2008 2:25:32 AM ClusSvc Error Network Name Resource 1215 N/A USMV-SYDNEY Cluster<br />
Network Name SYDNEY-AUCKLAND is no longer registered with its hosting system. The associated<br />
resource name is ''.<br />
System Event Log for Usmv-x455 Host (Host on Surviving Site)<br />
5/29/2008 2:25:23 AM ClusSvc Warning Node Mgr 1123 N/A USMV-X455<br />
The node lost communication with cluster node 'USMV-SYDNEY' on network '<strong>Public</strong>'.<br />
5/29/2008 2:25:23 AM ClusSvc Warning Node Mgr 1123 N/A USMV-X455<br />
The node lost communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
5/29/2008 2:25:37 AM ClusDisk Error None 1209 N/A USMV-X455<br />
Cluster service is requesting a bus reset for device \Device\ClusDisk0.<br />
5/29/2008 2:25:53 AM ClusSvc Warning Node Mgr 1135 N/A USMV-X455 Cluster node USMV-<br />
SYDNEY was removed from the active server cluster membership. Cluster service may have been<br />
stopped on the node, the node may have failed, or the node may have lost communication with the other<br />
active server cluster nodes.<br />
5/29/2008 2:25:53 AM ClusSvc Information Failover Mgr 1200 N/A USMV-X455<br />
"The Cluster Service is attempting to bring online the Resource Group ""Cluster Group""."<br />
5/29/2008 2:25:58 AM ClusSvc Information Failover Mgr 1201 N/A USMV-X455<br />
"The Cluster Service brought the Resource Group ""Cluster Group"" online."<br />
5/28/2008 2:25:35 AM Service Control Manager Information None 7035<br />
CLUSTERNET\clusadmin USMV-X455<br />
The Windows Internet Name Service (WINS) service was successfully sent a start control.<br />
5/29/2008 2:25:37 AM Service Control Manager Information None 7035 NT<br />
AUTHORITY\SYSTEM USMV-X455<br />
The Windows Internet Name Service (WINS) service was successfully sent a continue control.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images:<br />
Cluster Log for USMV-SYDNEY Host (Host on Failure Site)<br />
00000f70.00000d10::2008/05/29-2:25:20.426 ERR Physical Disk : [DiskArb]<br />
CompletionRoutine: reservation lost! Status 31. (Error 31: a device attached to the system is not<br />
functioning)<br />
00000f70.00000d10::2008/05/29-2:25:20.426 ERR [RM] LostQuorumResource, cluster service<br />
terminated...<br />
00000f70.00000e78::2008/05/29-2:25:32.778 ERR Physical Disk : IsAlive, error checking device,<br />
error 995. (Error 995: The I/O operation has been aborted because of either a thread exit or an application<br />
request)<br />
00000f70.00000e78::2008/05/29-2:25:32.778 ERR Physical Disk : LooksAlive, error checking<br />
device, error 31.<br />
00000f70.00000e78::2008/05/29-2:25:32.778 ERR Physical Disk : IsAlive, error checking<br />
device, error 31.<br />
00000f70.00000e78::2008/05/29-2:25:32.778 ERR Network Name : Name query request<br />
failed, status 3221225860.<br />
00000f70.00000b54::2008/05/29-2:25:32.868 ERR Network Name : Unable to open<br />
handle to cluster, status 1753. (Error 1753: there are no more endpoints available from the endpoint<br />
mapper)<br />
00000f70.00000b54::2008/05/29-2:25:33.258 ERR Physical Disk : Terminate, error opening<br />
\Device\Harddisk10\Partition1, error C0000022.<br />
00000f70.00000b54::2008/05/29-2:25:33.528 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170. (Error 170: the requested resource is in use)<br />
00000f70.00000b54::2008/05/29-2:25:33.528 ERR Physical Disk : [DiskArb] Error cleaning<br />
arbitration sector, error 170.<br />
Cluster Log for Usmv-x455 Host (Host on Surviving Site)<br />
0000015c.00000234::2008/05/29-2:25:23.496 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 2<br />
0000015c.00000234::2008/05/29-2:25:23.496 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 1<br />
00000688.00000d08::2008/05/29-2:25:37.898 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00000688.00000d08::2008/05/29-2:25:37.898 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Identify which of the components is the problematic one. A problematic component<br />
is likely to have additional errors or problems:<br />
− A problematic RA might not be accessible to other splitters or might not<br />
recognize certain volumes.<br />
− A problematic splitter might not recognize any RAs or the storage subsystem.<br />
• Connect to the storage switch to verify the status of each connection. Ensure that<br />
each connection is configured correctly.<br />
• If you cannot find any additional problems, the problem is most likely with the<br />
zoning; that is, the splitters are not exposed to the RAs.<br />
• Verify the physical connectivity of the RAs and the servers (those on which the<br />
potentially problematic splitters reside) to the Fibre Channel switch. For each<br />
connection, verify that it is healthy and appears correctly in the name server, zoning,<br />
and so forth.<br />
• Verify that this is not a temporary situation; for instance, if the RAs were rebooting<br />
or recovering from another failure, the splitter might not yet identify them.<br />
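When working through the actions above, it can help to scan the collected cluster logs for the error signatures shown earlier in this section. The following is a rough illustrative sketch (not part of the SafeGuard product); the signature strings are taken from the log excerpts above, and the diagnoses are paraphrases of their documented meanings.

```python
# Hypothetical helper: scan a Windows cluster log for the SAN-failure
# signatures quoted in this section. Error 170 is "the requested resource
# is in use", error 31 is "a device attached to the system is not
# functioning", and status 1753 is "no more endpoints available from the
# endpoint mapper", per the log excerpts above.
SIGNATURES = {
    "reservation lost": "disk or quorum reservation lost (possible SAN failure)",
    "LostQuorumResource": "cluster service terminated after losing the quorum",
    "error 170": "disk arbitration failed; the requested resource is in use",
    "error 31": "a device attached to the system is not functioning",
    "status 1753": "no more endpoints available from the endpoint mapper",
}

def scan_cluster_log(text):
    """Return (line_number, diagnosis) pairs for lines matching a signature."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for marker, diagnosis in SIGNATURES.items():
            if marker in line:
                hits.append((lineno, diagnosis))
    return hits
```

Feeding the scanner the text of a collected cluster log quickly shows which hosts observed reservation or arbitration failures and at what point in the log.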
6–16 6872 5688–002
Total SAN Switch Failure on One Site in a<br />
Geographic Clustered Environment<br />
A total SAN switch failure implies that cluster nodes and RAs have lost access to the<br />
storage device that was connected to the SAN on one site. This failure causes the<br />
cluster nodes to lose their reservation of the physical disks and triggers an MSCS<br />
failover to the remote site. In a geographic clustered environment where MSCS is<br />
running, if the connection to a storage device on one site fails, the symptoms and<br />
resulting actions depend on whether or not the quorum owner resided on the failed<br />
storage device.<br />
To understand the two scenarios and to follow the actions for both possibilities, review<br />
Figure 6–8.<br />
Figure 6–8. SAN Switch Failure on One Site<br />
Solving SAN Connectivity Problems<br />
Cluster Quorum Owner Located on Site with Failed SAN Switch<br />
Problem Description<br />
Symptoms<br />
The following point explains the expected behavior of the MSCS Reservation Manager<br />
when an event of this nature occurs:<br />
• If the cluster quorum owner is located on the site with the failed SAN, the quorum<br />
reservation is lost. This loss causes the cluster nodes to fail and triggers a cluster<br />
“regroup” process. This regroup process allows other cluster nodes participating in<br />
the cluster to arbitrate for the quorum device.<br />
Cluster nodes located on the failed SAN fail quorum arbitration because the failed<br />
SAN is not able to provide a reservation on the quorum volume. The cluster nodes in<br />
the remote location attempt to reserve the quorum device and succeed in arbitrating for<br />
the quorum. The node that owns the quorum device assumes ownership of the<br />
cluster. The cluster owner brings online the data groups that were owned by the<br />
failed site.<br />
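The regroup behavior described above can be modeled in a few lines. This is an illustrative sketch only, not product or MSCS code; the node names and the `san_ok` predicate are hypothetical stand-ins for real quorum-disk reservation attempts.

```python
# Toy model of the MSCS regroup described above: each node in turn tries to
# reserve the quorum volume; nodes on the failed SAN cannot, so a node on
# the surviving site wins arbitration and assumes ownership of the cluster.

def arbitrate(preferred_order, san_ok):
    """preferred_order: node names in arbitration order.
    san_ok(node) -> True if that node can still reach the shared storage.
    Returns the winning node, or None if no node can reserve the quorum."""
    for node in preferred_order:
        if san_ok(node):  # reservation of the quorum volume succeeds
            return node
    return None  # total loss: the cluster service terminates everywhere
```

For example, with the SAN failed at the site of the first node, arbitration falls through to the surviving site's node, which then brings the failed site's data groups online.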
The following symptoms might help you identify this failure:<br />
• All resources fail over to the surviving site (site 2 in this case) and come online<br />
successfully. Cluster nodes fail at the source site. If the consistency groups are<br />
configured asynchronously, this failover results in loss of data. The failover is fully<br />
automated and does not require additional downtime. The RAs cannot replicate<br />
data until the SAN is operational.<br />
• Failures are reported on the server and the management console. Replication is<br />
stopped on all consistency groups.<br />
• The management console displays error indications similar to those in Figure 6–9.<br />
Figure 6–9. Management Console Display with Errors for Failed SAN Switch<br />
• Warnings and informational messages similar to those shown in Figure 6–10 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 6–10. Management Console Messages for Failed SAN Switch<br />
The following table explains the numbered messages shown in Figure 6–10.<br />
Reference No. | Event ID | Description | E-mail<br />
1 | 3012 | The RA is unable to access the volume. | Immediate<br />
2 | 5002 | The RA is unable to access the splitter. | Immediate<br />
3 | 4001 | The surviving site reports a Group Capabilities problem. | Immediate<br />
4 | 4008 | The surviving site pauses the data transfer. | Immediate<br />
5 | 5013 | The original site reports the splitter down status. | Immediate<br />
6 | 4003 | For each consistency group, the surviving site reports a group consistency problem. The details show a WAN problem. | Immediate<br />
7 | 3014 | The RA is unable to access the repository volume. | Immediate<br />
8 | 4044 | The group is deactivated indefinitely by the system. | Immediate<br />
9 | 4007 | The system is pausing data transfer on the surviving site (Quorum - South). | Immediate<br />
10 | 4086 | Synchronization started. | Daily Summary<br />
11 | 4000 | Group capabilities OK. | Daily Summary<br />
12 | 5032 | The splitter starts splitting. | Daily Summary<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
• If you review the Windows system event log, you can find messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
previous management console images:<br />
System Event Log for USMV-SYDNEY Host (Host on Failure Site)<br />
5/29/2008 05:15:31 PM Application Popup Information None 26 N/A USMV-SYDNEY<br />
Application popup: Windows - Delayed Write Failed : Windows was unable to save all the data for the file<br />
Q:\. The data has been lost. This error may be caused by a failure of your computer hardware or network<br />
connection. Please try to save this file elsewhere.<br />
5/29/2008 05:15:31 PM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to<br />
flush data to the transaction log. Corruption may occur.<br />
5/29/2008 05:15:31 PM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to<br />
flush data to the transaction log. Corruption may occur.<br />
System Event Log for USMV-AUCKLAND Host (Host on Surviving Site)<br />
5/29/2008 05:13:33 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-SYDNEY<br />
Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.<br />
5/29/2008 05:13:33 PM Service Control Manager Error None 7031 N/A USMV-SYDNEY<br />
The Cluster Service terminated unexpectedly. It has done this 2 time(s). The following corrective action<br />
will be taken in 120000 milliseconds: Restart the service.<br />
5/29/2008 05:15:31 PM Application Popup Information None 26 N/A USMV-SYDNEY<br />
Application popup: Windows - Delayed Write Failed : Windows was unable to save all the data for the file<br />
Q:\$Mft. The data has been lost. This error may be caused by a failure of your computer hardware or<br />
network connection. Please try to save this file elsewhere.<br />
5/29/2008 05:15:31 PM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to<br />
flush data to the transaction log. Corruption may occur.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images:<br />
Cluster Log for USMV-SYDNEY Host (Host on Failure Site)<br />
00001130.00001354::2008/5/29-17:14:33.712 ERR Physical Disk : [DiskArb]<br />
CompletionRoutine: reservation lost! Status 170 (Error 170: the requested resource is in use)<br />
00001130.00001354::2008/5/29-17:14:33.712 ERR [RM] LostQuorumResource, cluster service<br />
terminated...<br />
00001130.00001744::2008/5/29-17:15:31.733 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
00001130.00001744::2008/5/29-17:15:31.733 ERR Physical Disk : [DiskArb] Error cleaning<br />
arbitration sector, error 170.<br />
00001130.00001744::2008/5/29-17:15:31.733 ERR Network Name : Unable to open<br />
handle to cluster, status 1753. (Error 1753: there are no more endpoints available from the endpoint<br />
mapper)<br />
00001130.00000d3c::2008/5/29-17:15:31.733 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6. (Error 6: the handle is invalid)<br />
6872 5688–002 6–21
Solving SAN Connectivity Problems<br />
Cluster Log for USMV-AUCKLAND Host (Host on Surviving Site)<br />
00000668.00000d90::2008/5/29-17:14:35.120 INFO [ClMsg] Received interface unreachable event for<br />
node 2 network 1<br />
00000668.00000d90::2008/5/29-17:14:35.120 INFO [ClMsg] Received interface unreachable event for<br />
node 2 network 2<br />
00000bb8.00000d0c::2008/5/29-17:14:49.706 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00000bb8.00000d0c::2008/5/29-17:14:49.706 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
Actions to Resolve the Problem<br />
To resolve this situation, diagnose the SAN switch failure.<br />
Cluster Quorum Owner Not on Site with Failed SAN Switch<br />
Problem Description<br />
Symptoms<br />
The following points explain the expected behavior of the MSCS Reservation Manager<br />
when an event of this nature occurs:<br />
• If a SAN failure occurs and the cluster nodes do not own the quorum resource, the<br />
state of the cluster services on these nodes is not affected.<br />
• The cluster nodes remain as active cluster members; however, the data groups<br />
containing the <strong>SafeGuard</strong> 30m Control instance and the physical disk resources on<br />
these nodes are marked as failed, and any applications dependent on them are taken<br />
offline. These resources first try to restart, and then eventually fail over to the<br />
surviving site.<br />
The following symptoms might help you identify this failure:<br />
• Applications fail and attempt to restart.<br />
• The data groups containing the <strong>SafeGuard</strong> 30m Control instance and the physical<br />
disk resources on these nodes are marked as failed, and any applications dependent<br />
on them are taken offline. These resources first try to restart, and then eventually fail<br />
over to the surviving site. The cluster nodes remain as active cluster members.<br />
• The management console displays error indications similar to those in Figure 6–9.<br />
• Warnings and informational messages similar to those shown in Figure 6–11 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 6–11. Management Console Messages for Failed SAN Switch with Quorum<br />
Owner on Surviving Site<br />
The following table explains the numbered messages shown in Figure 6–11.<br />
Reference No. | Event ID | Description | E-mail<br />
1 | 5002 | The RA is unable to access the splitter. | Immediate<br />
2 | 3012 | The RA is unable to access the volume (RA 2, Quorum). | Immediate<br />
3 | 4003 | For each consistency group, the surviving site reports a group consistency problem. The details show a WAN problem. | Immediate<br />
4 | 3014 | The RA is unable to access the repository volume (RA 2). | Immediate<br />
5 | 4009 | The system is pausing data transfer on the failure site. | Immediate<br />
6 | 4044 | The group is deactivated indefinitely by the system. | Immediate<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
• If you review the Windows system event log, you can find messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
previous management console images:<br />
System Event Log for USMV-SYDNEY Host (Host on Failure Site)<br />
5/29/2008 5:14:24 PM ClusDisk Error None 1209 N/A USMV-AUCKLAND Cluster service is requesting<br />
a bus reset for device \Device\ClusDisk0.<br />
5/29/2008 5:14:38 PM ClusSvc Warning Node Mgr 1123 N/A USMV-AUCKLAND The node lost<br />
communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
5/29/2008 5:14:39 PM ClusSvc Information Node Mgr 1122 N/A USMV-AUCKLAND The node<br />
(re)established communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
System Event Log for USMV-AUCKLAND Host (Host on Surviving Site)<br />
5/29/2008 5:14:38 PM ClusSvc Warning Node Mgr 1123 N/A USMV-AUCKLAND The node lost<br />
communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
5/29/2008 5:14:39 PM ClusSvc Information Node Mgr 1122 N/A USMV-AUCKLAND The node<br />
(re)established communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images:<br />
Cluster Log for USMV-SYDNEY Host (Host on Failure Site)<br />
00001524.00000360::2008/5/29-17:14:56.750 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00001524.00000360::2008/5/29-17:14:56.750 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
00001524.000017e4::2008/5/29-17:15:22.899 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6.<br />
Cluster Log for USMV-AUCKLAND Host (Host on Surviving Site)<br />
00000bb8.00000c5c::2008/5/29-17:14:14.596 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6.<br />
00000bb8.0000026c::2008/5/29-17:14:24.188 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00000bb8.0000026c::2008/5/29-17:14:24.188 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
Actions to Resolve the Problem<br />
To resolve this situation, diagnose the SAN switch failure.<br />
Section 7<br />
Solving Network Problems<br />
This section lists symptoms that usually indicate networking problems. Table 7–1 lists<br />
symptoms and possible problems indicated by the symptom. The problems and their<br />
solutions are described in this section. The graphics, behaviors, and examples in this<br />
section are similar to what you observe with your system but might differ in some<br />
details.<br />
In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />
for possible problems. Messages similar to the e-mail messages are also displayed on<br />
the management console. If you do not see the messages, they might have already<br />
dropped off the display; review the management console logs for messages that are no<br />
longer displayed.<br />
Table 7–1. Possible Networking Problems with Symptoms<br />
Symptom: The cluster groups with the failed network connection fail over to the next<br />
preferred node. If only one node is configured at the site with the failure, replication<br />
direction changes and applications run on the backup site. If the NIC is teamed, no<br />
failover occurs and no symptoms are obvious. The networks on the Cluster<br />
Administrator screen show an error. Host system and application event log messages<br />
contain error or warning messages.<br />
Possible problem: <strong>Public</strong> NIC failure on a cluster node in a geographic clustered<br />
environment<br />
Symptom: Clients on site 2 are not able to access resources associated with the IP<br />
resource located on site 1. <strong>Public</strong> communication between the two sites fails, only<br />
allowing local cluster public communication between cluster nodes and local clients.<br />
The networks on the Cluster Administrator screen show an error.<br />
Possible problem: <strong>Public</strong> or client WAN failure in a geographic clustered environment<br />
Table 7–1. Possible Networking Problems with Symptoms (cont.)<br />
Symptom: You cannot access the management console or initiate an SSH session<br />
through PuTTY using the management IP address of the remote site.<br />
Possible problem: Management network failure in a geographic clustered environment<br />
Symptom: The management console log indicates that the WAN data links to the RAs<br />
are down. All consistency groups show the transfer status as “Paused by system.”<br />
Possible problem: Replication network failure in a geographic clustered environment<br />
Symptom: On the management console, all consistency groups show the transfer<br />
status switching between “Paused by system” and “initializing/active.” All groups<br />
appear unstable over the WAN connection.<br />
Possible problem: Temporary WAN failures<br />
Symptom: The networks on the Cluster Administrator screen show an error.<br />
Possible problem: Private cluster network failure in a geographic clustered environment<br />
Symptom: You cannot access the management console using the management IP<br />
address of the remote site. The cluster is no longer accessible from nodes except from<br />
one surviving node. You are unable to reach the DNS, NTP, or mail server. The<br />
management console shows errors for the WAN or for RA data links. The management<br />
console logs show RA communication errors.<br />
Possible problem: Total communication failure in a geographic clustered environment;<br />
see also the port information<br />
<strong>Public</strong> NIC Failure on a Cluster Node in a<br />
Geographic Clustered Environment<br />
Problem Description<br />
If a public network interface card (NIC) of a cluster node failed, the cluster node of the<br />
failed public NIC cannot access clients. The cluster node of the failed NIC can participate<br />
in the cluster as a member because it can communicate over the private cluster<br />
network. Other cluster nodes are not affected by this error.<br />
The MSCS software detects a failed network, and the cluster resources fail over to the<br />
next preferred node. All cluster groups used for replication that contain a virtual IP<br />
address for the failed network connection succeed in failing over to the next preferred<br />
node. However, the Unisys <strong>SafeGuard</strong> 30m Control resources cannot fail back to the<br />
node with a failed public network because they cannot communicate with the site<br />
management IP address of the RAs.<br />
Note: A teamed public network interface does not experience this problem and<br />
therefore is the recommended configuration.<br />
Figure 7–1 illustrates this failure.<br />
Figure 7–1. <strong>Public</strong> NIC Failure of a Cluster Node<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• All cluster groups used for replication that contain a virtual IP address for the failed<br />
network connection fail over to the next preferred node.<br />
• If no other node exists at the same site, replication direction changes and the<br />
applications run at the backup site.<br />
• If you review the host system event log, you can find messages similar to the<br />
following examples:<br />
Windows System Event Log Messages on Host Server<br />
Type: error<br />
Source: ClusSvc<br />
EventID: 1077, 1069<br />
Description: The TCP/IP interface for Cluster IP Address “xxx” has failed.<br />
Type: error<br />
Source: ClusSvc<br />
EventID: 1069<br />
Description: Cluster resource ‘xxx’ in Resource Group ‘xxx’ failed.<br />
Type: error<br />
Source: ClusSvc<br />
EventID: 1127<br />
Description: The interface for cluster node ‘xxx’ on network ‘xxx’ failed. If the condition persists, check<br />
the cabling connecting the node to the network. Next, check for hardware or software errors in the<br />
node's network adapter.<br />
• If you attempt to move a cluster group to the node with the failing public NIC, the<br />
event 2002 message is displayed in the host application event log.<br />
Application Event Log Message on Host Server<br />
Type: warning<br />
Source: 30mControl<br />
Event Category: None<br />
EventID: 2002<br />
Date : 05/30/2008<br />
Time: 11:12:02 AM<br />
User : N/A<br />
Computer: USMV-DL580<br />
Description: Online resource failed. RA CLI command failed because of a network communication error or<br />
invalid IP address.<br />
Action: Verify the network connection between the system and the site management IP Address<br />
specified for the resource. Ping each site management IP Address specified for the specified resource.<br />
Note: The preceding information can also be viewed in the cluster log.<br />
• The management console display and management console logs do not show any<br />
errors.<br />
• When the public NIC fails on a node that does not use teaming, the Cluster<br />
Administrator displays an error indicator similar to Figure 7–2. If the public NIC<br />
interface is teamed, you do not see error messages in the Cluster Administrator.<br />
Figure 7–2. <strong>Public</strong> NIC Error Shown in the Cluster Administrator<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. In the Cluster Administrator, verify that the public interface for all nodes is in an<br />
“Up” state. If multiple nodes at a site show public connections failed in the Cluster<br />
Administrator, physically check the network switch for connection errors.<br />
If the private network also shows errors, physically check the network switch for<br />
connection errors.<br />
2. Inspect the NIC link indicators on the host and, from a client, use the Ping command<br />
to verify the physical IP address of the adapter (not the virtual IP address).<br />
3. Isolate a NIC or cabling issue by moving cables at the network switch and at the NIC.<br />
4. Replace the NIC in the host if necessary. No configuration of the replaced NIC is<br />
necessary.<br />
5. Move the cluster resources back to the original node after the resolution of the<br />
failure.<br />
<strong>Public</strong> or Client WAN Failure in a Geographic<br />
Clustered Environment<br />
Problem Description<br />
When the public or client WAN fails, some clients cannot access virtual IP networks that<br />
are associated with the cluster. The WAN components involved in this failure might<br />
be two switches, possibly on different subnets connected through gateways. This failure<br />
results from connectivity issues. If the failure had resulted from an adapter failure or a<br />
media failure at the adapter, the MSCS cluster would detect it and fail the associated<br />
node. Instead, cluster groups do not fail, and the public LAN shows as unreachable for<br />
this failure.<br />
<strong>Public</strong> communication between the two sites failed, only allowing local cluster public<br />
communication between cluster nodes and local clients. The cluster node state does not<br />
change on either site because all cluster nodes are able to communicate with the private<br />
cluster network.<br />
All resources remain online and no cluster group errors are reported in the Cluster<br />
Administrator. Clients on the remote site cannot access resources associated with the IP<br />
resource located on the local site until the public or client network is again operational.<br />
Depending on the cause of the failure and the network configuration, the <strong>SafeGuard</strong> 30m<br />
Control might fail to move a cluster group because the management network might be<br />
the same physical network as the public network. Whether this failure to move the<br />
group occurs or not depends on how the RAs are physically wired to the network.<br />
Figure 7–3 illustrates this scenario.<br />
Figure 7–3. <strong>Public</strong> or Client WAN Failure<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• Clients on site 2 are not able to access resources associated with the IP resource<br />
located on site 1.<br />
• <strong>Public</strong> communication between the two sites displays as “unreachable,” allowing only<br />
local cluster public communication between cluster nodes and local clients.<br />
• When the public cluster network fails, the Cluster Administrator displays an error<br />
indicator similar to Figure 7–4.<br />
All private network connections show as “unreachable” when the problem is a WAN<br />
issue.<br />
If only two of the connections show as failed (and the nodes are physically located at<br />
the same site), the issue is probably local to the site.<br />
If only one connection failed, the issue is probably a host network adapter.<br />
Figure 7–4. Cluster Administrator Showing <strong>Public</strong> LAN Network Error<br />
• If you review the system event log, messages similar to the following examples are<br />
displayed:<br />
Event Type: Warning<br />
Event Source: ClusSvc<br />
Event Category: Node Mgr<br />
Event ID: 1123<br />
Date : 05/30/2008<br />
Time: 9:49:34 AM<br />
User : N/A<br />
Computer: USMV-WEST2<br />
Description:<br />
The node lost communication with cluster node 'USMV-EAST2' on network '<strong>Public</strong> LAN'.<br />
Event Type: Warning<br />
Event Source: ClusSvc<br />
Event Category: Node Mgr<br />
Event ID: 1126<br />
Date : 05/30/2008<br />
Time: 9:49:36 AM<br />
User : N/A<br />
Computer: USMV-WEST2<br />
Description:<br />
The interface for cluster node 'USMV-WEST2' on network '<strong>Public</strong> LAN' is unreachable by at least one<br />
other cluster node attached to the network. The server cluster was not able to determine the location of<br />
the failure. Look for additional entries in the system event log indicating which other nodes have lost<br />
communication with node USMV-WEST2. If the condition persists, check the cable connecting the node<br />
to the network. Next, check for hardware or software errors in the node's network adapter. Finally,<br />
check for failures in any other network components to which the node is connected such as hubs,<br />
switches, or bridges.<br />
Event Type: Warning<br />
Event Source: ClusSvc<br />
Event Category: Node Mgr<br />
Event ID: 1130<br />
Date : 05/30/2008<br />
Time: 9:49:36 AM<br />
User : N/A<br />
Computer: USMV-WEST2<br />
Description:<br />
Cluster network '<strong>Public</strong> LAN' is down. None of the available nodes can communicate using this<br />
network. If the condition persists, check for failures in any network components to which the nodes are<br />
connected such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the<br />
network. Finally, check for hardware or software errors in the adapters that attach the nodes to the<br />
network.<br />
• A cluster group containing a <strong>SafeGuard</strong> 30m Control resource might fail to move to<br />
another node when the management network has network components common to<br />
the public network. (Refer to “Management Network Failure in a Geographic<br />
Clustered Environment.”)<br />
• Symptoms might include those in “Management Network Failure in a Geographic<br />
Clustered Environment” when these networks are physically the same network.<br />
Refer to this topic if the clients at one site are not able to access the IP resources at<br />
another site.<br />
• The management console logs might display the messages in the following table<br />
when this connection fails and is then restored.<br />
Event ID | Description | E-mail<br />
3023 | For each RA at the site, this console log message is displayed: Error in LAN link to RA. (RA) | Immediate<br />
3022 | When the LAN link is restored, a management console log message is displayed: LAN link to RA restored. (RA) | Daily Summary<br />
Actions to Resolve the Problem<br />
Note: Typically, a network administrator for the site is required to diagnose which<br />
network switch, gateway, or connection is the cause of this failure.<br />
Perform the following actions to isolate and resolve the problem:<br />
1. In the Cluster Administrator, view the network properties of the public and private<br />
network.<br />
The private network should be operational with no failure indications.<br />
The public network should display errors. Refer to the previous symptoms to identify<br />
that this is a WAN issue. If the error is limited to one host, the problem might be a<br />
host network adapter. See “<strong>Public</strong> NIC Failure on a Cluster Node in a Geographic<br />
Clustered Environment.”<br />
2. Check for network problems using a method such as isolating the failure to the<br />
network switch or gateway by pinging from the cluster node to the gateway at each<br />
site.<br />
3. Use the Installation Manager site connectivity IP diagnostic from the RAs to the<br />
gateway at each site by performing the following steps. (For more information, see<br />
Appendix C.)<br />
a. Log on to an RA with user ID as boxmgmt and password as boxmgmt.<br />
b. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.<br />
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.<br />
e. When asked to select a target for the tests, type 5 (Other host) and press<br />
Enter.<br />
f. Enter the IP address for the gateway that you want to test.<br />
g. Repeat steps a through f for each RA.<br />
4. Isolate the site by determining which gateway or network switch failed. Use<br />
standard network methods such as pinging to make the determination.<br />
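Steps 2 and 4 above reduce to pinging the gateway at each site and noting which side fails to answer. The sketch below is a hypothetical illustration of that isolation logic; the site names and gateway addresses are placeholders, not values from the product.

```python
import subprocess

def ping_once(ip):
    """One ICMP echo; Linux-style flags (use `ping -n 1 -w 2000` on Windows)."""
    return subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

def isolate_wan_failure(gateways, probe=ping_once):
    """gateways: {site_name: gateway_ip}. Returns the sites whose gateway
    did not respond, pointing at the failed switch or gateway."""
    return [site for site, ip in gateways.items() if not probe(ip)]
```

Running this from a cluster node at each site (and from the RAs via the Installation Manager diagnostics described above) narrows the failure to one site's gateway or network switch.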
Management Neetwork<br />
Failure in a Geograp phic<br />
Clustered Enviroonment<br />
Problem Description<br />
When the management network fails in a geographic clustered environment, you cannot access the management console for the affected site. The replication environment is not affected. If you try to move a cluster group to the site with the failed management network, the move fails.<br />
Figure 7–5 illustrates this scenario.<br />
Figure 7–5. Management Network Failure<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The indicators for the onboard management network adapter of the RA are not illuminated.<br />
• Network switch port lights show that no link exists with the host adapter.<br />
• You cannot access the management console or initiate an SSH session through<br />
PuTTY using the management IP address of the failed site from the remote site. You can<br />
access the management console from a client local to the site. If you cannot access<br />
the management IP address from either site, see Section 8, “Solving Replication<br />
Appliance (RA) Problems.”<br />
• A cluster move operation to the site with the failed management network might fail.<br />
The event ID 2002 message is displayed in the host application event log.<br />
Application Event Log Message on Host Server<br />
Type : warning<br />
Source : 30mControl<br />
Event Category: None<br />
EventID : 2002<br />
Date : 05/30/2008<br />
Time : 2:46:29 PM<br />
User : N/A<br />
Computer : USMV-SYDNEY<br />
Description : Online resource failed. RA CLI command failed because of a network communication<br />
error or invalid IP address.<br />
Action : Verify the network connection between the system and the site management IP Address<br />
specified for the resource. Ping each site management IP Address mentioned for the specified resource.<br />
Note: The preceding information can also be viewed in the cluster log.<br />
• If the management console was open with the IP address of the failed site, the<br />
message “Connection with RA was lost, please check RA and network settings” is<br />
displayed. The management console display shows “not connected,” and the<br />
components have a question mark “Unknown” status as illustrated in Figure 7–6.<br />
Figure 7–6. Management Console Display: “Not Connected”<br />
• The management console log displays a message for event 3023 as shown in<br />
Figure 7–7.<br />
Figure 7–7. Management Console Message for Event 3023<br />
• The management console log messages might appear as in the following table.<br />
Event ID   Description                                            E-mail Immediate   E-mail Daily Summary<br />
3023       For each RA at the site, this console log message      X<br />
           is displayed: Error in LAN link to RA. (RA )<br />
3022       When the LAN link is restored, a management console    X<br />
           log displays: LAN link to RA restored. (RA )<br />
Actions to Resolve the Problem<br />
Note: Typically, a network administrator for the site is required to diagnose which<br />
network switch, gateway, or connection is the cause of this failure.<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Ping from the cluster node to the RA box management IP address at the same site.<br />
Repeat this action for the other site. If the local connections are working at both<br />
sites, the problem is with the WAN connection such as a network switch or gateway<br />
connection.<br />
2. If one site from step 1 fails, ping from the cluster node to the gateway of that site. If<br />
the ping completes, then proceed to step 3.<br />
3. Use the Installation Manager site connectivity IP diagnostic from the RAs to the<br />
gateway at each site by performing the following steps. (For more information, see<br />
Appendix C.)<br />
a. Log in to an RA as user boxmgmt with the password boxmgmt.<br />
b. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.<br />
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.<br />
e. When asked to select a target for the tests, type 5 (Other host) and press<br />
Enter.<br />
f. Enter the IP address for the gateway that you want to test.<br />
g. Repeat steps a through f for each RA.<br />
4. Isolate the site by determining which gateway failed. Use standard network methods<br />
such as pinging to make the determination.<br />
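The decision logic in steps 1 and 2 can be sketched as a small function. The inputs are assumed to be the boolean results of the pings described above (cluster node to RA box management IP, and cluster node to site gateway); the diagnosis strings are illustrative labels, not product terminology.

```python
# Decision logic from steps 1 and 2, as a sketch. Inputs are dicts mapping
# site name to True (ping succeeded) or False (ping failed).
def classify_mgmt_failure(ra_ping_ok_by_site, gw_ping_ok_by_site):
    """Return a rough diagnosis following the procedure above."""
    failed = [s for s, ok in ra_ping_ok_by_site.items() if not ok]
    if not failed:
        # Local connections work at both sites: suspect the WAN link,
        # such as a network switch or gateway connection (step 1).
        return "WAN connection"
    diagnoses = {}
    for site in failed:
        if gw_ping_ok_by_site.get(site):
            # Gateway answers but the RA management IP does not: the problem
            # lies between the gateway and the management network (step 2).
            diagnoses[site] = "management network at site"
        else:
            diagnoses[site] = "site gateway or switch"
    return diagnoses
```

For example, if the RA management IP at one site fails to answer while that site's gateway still responds, the function points at the management network at that site, which is where step 3's RA-side diagnostics pick up.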
Replication Network Failure in a Geographic Clustered Environment<br />
Problem Description<br />
This type of event occurs when the RA cannot replicate data to the remote site because of a replication network (WAN) failure. Because this error is transparent to MSCS and the cluster nodes, cluster resources and nodes are not affected. Each cluster node continues to run, and data transactions sent to their local cluster disk are completed.<br />
Figure 7–8 illustrates this failure.<br />
Figure 7–8. Replication Network Failure<br />
The RA cannot replicate data while the WAN is down. During this failure, the RA keeps a record of data written to local storage. Once the WAN is restored, the RA updates the replication volumes on the remote site.<br />
During the replication network failure, the RAs prevent the quorum and data resources from failing over to the remote site. This behavior differs from a total communication failure or a total site failure, in which the data groups are allowed to fail over. The quorum group is never allowed to fail over automatically when the RAs cannot communicate over the WAN.<br />
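The journaling behavior described above can be modeled in a few lines. This is an illustration of the concept only, not the RA's actual replication mechanism.

```python
# Simplified model of the behavior described above: while the WAN is down the
# RA records local writes, and on restore it replays them to the remote copy.
# Illustrative only; the RA's real mechanism is not shown in this guide.
class ReplicationLink:
    def __init__(self):
        self.wan_up = True
        self.journal = []          # writes recorded while the WAN is down
        self.remote_volume = {}    # the replica at the remote site

    def write(self, block, data):
        if self.wan_up:
            self.remote_volume[block] = data      # replicated immediately
        else:
            self.journal.append((block, data))    # kept for later resync

    def restore_wan(self):
        self.wan_up = True
        for block, data in self.journal:          # update the remote volumes
            self.remote_volume[block] = data
        self.journal.clear()
```

The longer the outage, the larger the journal, which is why the groups might initiate a long resynchronization once the WAN is restored.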
Symptoms<br />
Notes:<br />
• If the management network has also failed, see “Total Communication Failure in a<br />
Geographic Clustered Environment” later in this section.<br />
• If all RAs at a site have failed, see “Failure of All RAs at One Site” in Section 8.<br />
If the administrator issues a move-group operation from the Cluster Administrator for a<br />
data or quorum group, the cluster accepts failover only to another node within the same<br />
site. Group failover to the remote site is not allowed, and the resource group fails back<br />
to a node on the source site.<br />
Although automatic failover is not allowed, the administrator can perform a manual<br />
failover to the remote site. Performing a manual failover results in a loss of data. The<br />
administrator chooses an available image for the failover.<br />
Important considerations for this type of failure are as follows:<br />
• This type of failure does not have an immediate effect on the cluster service or the<br />
cluster nodes. The quorum group cannot fail over to the remote site and goes back<br />
online at the source site.<br />
• Only local failovers are permitted. Remote failovers require that the administrator<br />
perform the manual failover process.<br />
• The <strong>SafeGuard</strong> 30m Control resource and the data consistency groups cannot fail<br />
over to the remote site while the WAN is down; they go back online at the source<br />
site.<br />
• Only one site has up-to-date data. Replication does not occur until the WAN is<br />
restored.<br />
• If the administrator manually chooses to use remote data instead of the source data,<br />
data loss occurs.<br />
• Once the WAN is restored, normal operation continues; however, the groups might<br />
initiate a long resynchronization.<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows errors similar to the image in Figure 7–9.<br />
This image shows the dialog box displayed after clicking the red Errors in the right<br />
column. The More Info message box is displayed with messages similar to those in<br />
the figure but appropriate for your site. If only one RA is down, see Section 8 for<br />
resolution actions. Notice in the figure that all RA data links at the site are down.<br />
Figure 7–9. Management Console Display: WAN Down<br />
This figure also shows the Groups tab and the messages that the data consistency<br />
groups and the quorum group are “Paused by system.” If the groups are not paused<br />
by the system, a switchover might have occurred. See Section 8 for more<br />
information. If all groups are not paused, see Section 5, “Solving Storage Problems.”<br />
• Warnings and informational messages similar to those shown in Figure 7–10 appear<br />
on the management console when the WAN is down. See the table after the figure<br />
for an explanation of the numbered console messages.<br />
Figure 7–10. Management Console Log Messages: WAN Down<br />
The following table explains the numbers in Figure 7–10. You might also see the<br />
events in the table denoted by an asterisk (*) in the management console log.<br />
Reference No./Legend   Event ID   Description                                      E-mail Immediate   E-mail Daily Summary<br />
*                      3001       The RA is currently experiencing a problem       X<br />
                                  communicating with its cluster. The details<br />
                                  explain that a subsequent event 3000 means<br />
                                  that RA functionality has been restored.<br />
*                      3000       The RA is successfully communicating with its    X<br />
                                  cluster. In this case, the RA communicates by<br />
                                  means of the management link.<br />
1                      4001       For each consistency group on the Auckland       X<br />
                                  and the Sydney sites, the transfer is paused.<br />
2                      4008       For each quorum group on the Auckland and        X<br />
                                  the Sydney sites, the transfer is paused.<br />
*                      4043       For each group on the Auckland and Sydney        X<br />
                                  sites, the “group site is deactivated” message<br />
                                  might appear with the detail showing the<br />
                                  reason for the switchover. The RA attempts to<br />
                                  switch over to resolve the problem.<br />
3                      4001       The event is repeated after the switchover       X<br />
                                  attempt.<br />
• If you review the management console RAs tab, the data link column lists errors for<br />
all RAs, as shown in Figure 7–11. The data link is the replication link between peer<br />
RAs. Notice that the WAN link shows OK because the RAs can still communicate<br />
over the management link. There is no column for the management link.<br />
Figure 7–11. Management Console RAs Tab: All RAs Data Link Down<br />
• If you review the host application event log, no messages appear for this failure<br />
unless a data resource move-group operation is attempted. If this move-group<br />
operation is attempted, then messages similar to the following are listed:<br />
Application event log<br />
Event Type : Warning<br />
Event Source : 30mControl<br />
Event Category: None<br />
Event ID : 1119<br />
Date : 5/30/2008<br />
Time : 3:27:49 PM<br />
User : N/A<br />
Computer : USMV-SYDNEY<br />
Description : Online resource failed.<br />
Cannot complete transfer for auto failover (7).<br />
The following could cause this error:<br />
1. Wan is down.<br />
2. Long resynchronization might be in progress.<br />
The resource might have to be brought online manually.<br />
RA Version: 3.0(g.60)<br />
Resource name: Data1<br />
RA CLI command: UCLI -ssh -l plugin -pw **** -superbatch 172.16.25.50 initiate_failover group=Data1<br />
active_site=Sydney cluster_owner=USMV-SYDNEY<br />
• If you review the system event log, a message similar to the following example is<br />
displayed:<br />
System Event Log<br />
Event Type : Error<br />
Event Source : ClusSvc<br />
Event Category: Failover Mgr<br />
Event ID : 1069<br />
Date : 5/30/2008<br />
Time : 3:27:50 PM<br />
User : N/A<br />
Computer : USMV-SYDNEY<br />
Description : Cluster resource 'Data1' in Resource Group 'Group 0' failed.<br />
Note: Data1 would change to the Quorum drive if the quorum was moved.<br />
• If you review the cluster log, you can see an error if a data or a quorum move-group<br />
operation is attempted. Messages similar to the following are listed:<br />
Cluster Log for the Node to which the Move Was Attempted<br />
Key messages<br />
00000d4c.00000910::2008/05/30-15:27:22.077 INFO Physical Disk : [DiskArb]-------<br />
DisksArbitrate -------.<br />
………………..<br />
00000d4c.00000910::2008/05/30-15:27:35.608 ERR Physical Disk : [DiskArb] Failed to write<br />
(sector 12), error 170.<br />
00000d4c.00000910::2008/05/30-15:27:35.608 INFO Physical Disk : [DiskArb] Arbitrate returned<br />
status 170.<br />
Cluster Log for the Node to which the Data Group Move Was Attempted<br />
00000e60.00000940::2008/05/30-15:53:38.470 INFO Unisys <strong>SafeGuard</strong> 30m Control :<br />
KfResourceTerminate: Resource 'Data1' terminated. AbortOnline=1 CancelConnect=0<br />
terminateProcess=0.<br />
0000099c.00000dd4::2008/05/30-15:53:38.470 INFO [CP] CppResourceNotify for resource Data1<br />
0000099c.00000dd4::2008/05/30-15:53:38.470 INFO [FM] RmTerminateResource: a16fc059-e4d3-4bc8a15a-6440e9b2f976<br />
is now offline<br />
0000099c.00000dd4::2008/05/30-15:53:38.470 WARN [FM] Group failure for group . Create thread to take offline and move<br />
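Spotting this failure in exported logs means scanning for the event IDs shown above (1119 from 30mControl and 1069 from ClusSvc). The sketch below parses the "Field : value" layout of these excerpts; the field spellings are taken from the excerpts and may differ in other export formats.

```python
# A sketch for scanning exported event-log text like the excerpts above for
# the event IDs tied to this failure (1119 from 30mControl, 1069 from ClusSvc).
import re

WAN_FAILURE_EVENTS = {("30mControl", "1119"), ("ClusSvc", "1069")}

def parse_entries(log_text):
    """Yield (source, event_id) pairs from 'Field : value' formatted log text."""
    source = None
    for line in log_text.splitlines():
        m = re.match(r"\s*Event Source\s*:\s*(\S+)", line)
        if m:
            source = m.group(1)
            continue
        m = re.match(r"\s*Event ID\s*:\s*(\d+)", line)
        if m and source:
            yield (source, m.group(1))

def wan_failure_suspected(log_text):
    """True if any entry matches the event IDs associated with this failure."""
    return any(e in WAN_FAILURE_EVENTS for e in parse_entries(log_text))
```

Run against the application and system log exports together, this flags the 1119/1069 pair that appears only when a move-group operation was attempted during the WAN outage.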
Actions to Resolve the Problem<br />
Note: Typically, a network administrator for the site is required to diagnose which<br />
network switch, gateway, or connection is the cause of this failure.<br />
Perform the following actions to isolate and resolve the problem:<br />
1. On the management console, observe that a WAN error occurred for all RAs and that<br />
the data link is in error for all RAs. If that is not the case, see Section 8 for resolution<br />
actions.<br />
2. Use the Installation Manager site connectivity IP diagnostic from the RAs to the<br />
gateway at each site by performing the following steps. (For more information, see<br />
Appendix C.)<br />
a. Log in to an RA as user boxmgmt with the password boxmgmt.<br />
b. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.<br />
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.<br />
e. When asked to select a target for the tests, type 5 (Other host) and press<br />
Enter.<br />
f. Enter the IP address for the gateway that you want to test.<br />
g. Repeat steps a through f for each RA.<br />
3. Isolate the site by determining which network switch or gateway failed. Use<br />
standard network methods such as pinging to make the determination.<br />
4. In some cases, the WAN connection might appear to be down because a firewall is<br />
blocking ports. See “Port Information” later in this section.<br />
5. If all RAs at both sites can connect to the gateway, the problem is related to the link.<br />
In this case, check the connectivity between subnets by pinging between machines<br />
on the same subnet (not RAs) and between a non-RA machine at one site and an RA<br />
at the other site.<br />
6. Verify that no routing problems exist between the sites.<br />
7. Optionally, follow the recovery actions to manually move cluster and data resource<br />
groups to the other site if necessary. This action results in a loss of data. Do not<br />
attempt this manual recovery unless the WAN failure has affected applications.<br />
If you choose to manually move groups, refer to Section 4 for the procedures.<br />
Once you observe on the management console that the WAN error is gone, verify<br />
that the consistency groups are resynchronizing.<br />
If a move-group operation is issued to the other site while the group is<br />
resynchronizing, the command fails with a return code 7 (long resync in progress),<br />
and the group moves back to the original node.<br />
Temporary WAN Failures<br />
Problem Description<br />
All applications are unaffected. The target image is not up-to-date.<br />
Symptoms<br />
On the management console, messages show the transfer between sites switching<br />
between “paused by system” and “initializing/active.” All groups appear<br />
unstable over the WAN connection.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve this problem:<br />
1. If the connection problem is temporary but recurs, check for network problems<br />
such as a high percentage of packet loss caused by bad network connections,<br />
insufficient bandwidth that is causing an overloaded network, and so on.<br />
2. Verify that the bandwidth allocated to this link is reasonable and that no<br />
unreasonable external or internal (consistency group bandwidth policy) limits are<br />
causing an overloaded network.<br />
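For step 1, the loss percentage can be pulled from the summary line of ping's statistics output. The pattern below is an assumption that covers the common Windows and Linux wordings, and the 5 percent threshold is an arbitrary example, not a documented limit.

```python
# Step 1 above suggests checking for a high percentage of packet loss. This
# sketch extracts the loss percentage from a ping statistics summary line;
# the exact wording varies by platform, so the pattern is an assumption.
import re

def packet_loss_percent(ping_output):
    """Return the packet-loss percentage found in ping output, or None."""
    m = re.search(r"(\d+(?:\.\d+)?)%\s*(?:packet\s+)?loss", ping_output, re.IGNORECASE)
    return float(m.group(1)) if m else None

def link_is_problematic(ping_output, threshold=5.0):
    """Flag a lossy link; the default threshold is an illustrative choice."""
    loss = packet_loss_percent(ping_output)
    return loss is not None and loss > threshold
```

This matches both the Linux form ("0% packet loss") and the Windows form ("(0% loss)"), so the same check can run from cluster nodes at either site.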
Private Cluster Network Failure in a Geographic Clustered Environment<br />
Problem Description<br />
When the private cluster network fails, the cluster nodes are able to communicate over the public cluster network if the public cluster network is set for all communication. No cluster resources fail over, and current processing on the cluster nodes continues. Clients do not experience any impact from this failure.<br />
Figure 7–12 illustrates this scenario.<br />
Figure 7–12. Private Cluster Network Failure<br />
Unisys recommends that the public cluster network be set for “All communications” and the private cluster LAN be set for “internal cluster communications only.” You can verify these settings in the “Networks” properties section within Cluster Administrator. See “Checking the Cluster Setup” in Section 4.<br />
If the public cluster network was not set for “All communications” but instead was set for “Client access only,” the following symptoms occur:<br />
• All nodes except the node that owned the quorum stop MSCS. This action is completed to prevent a “split brain” situation.<br />
• All resources move to the surviving node.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• When the private cluster network fails, the Cluster Administrator displays an error<br />
indicator similar to Figure 7–13.<br />
All private network connections show a status of “Unknown” when the problem is a<br />
WAN issue.<br />
If only two of the connections failed (and the nodes are physically located at the<br />
same site), the issue is probably local to the site.<br />
If only one connection failed, the issue is probably a host network adapter.<br />
Figure 7–13. Cluster Administrator Display with Failures<br />
• On the cluster nodes at both sites, the system event log contains entries from the<br />
cluster service similar to the following:<br />
Event Type : Warning<br />
Event Source : ClusSvc<br />
Event Category: Node Mgr<br />
Event ID : 1123<br />
Date : 5/30/2008<br />
Time : 4:03:10 PM<br />
User : N/A<br />
Computer : USMV-SYDNEY<br />
Description:<br />
The node lost communication with cluster node 'USMV-AUCKLAND' on network 'Private'.<br />
Event Type : Warning<br />
Event Source : ClusSvc<br />
Event Category: Node Mgr<br />
Event ID : 1126<br />
Date : 5/30/2008<br />
Time : 4:03:12 PM<br />
User : N/A<br />
Computer : USMV-SYDNEY<br />
Description:<br />
The interface for cluster node 'USMV-AUCKLAND' on network 'Private' is unreachable by at least one<br />
other cluster node attached to the network. The server cluster was not able to determine the location of<br />
the failure. Look for additional entries in the system event log indicating which other nodes have lost<br />
communication with node USMV-AUCKLAND. If the condition persists, check the cable connecting the<br />
node to the network. Then, check for hardware or software errors in the node's network adapter. Finally,<br />
check for failures in any other network components to which the node is connected such as hubs,<br />
switches, or bridges.<br />
Event Type : Warning<br />
Event Source : ClusSvc<br />
Event Category: Node Mgr<br />
Event ID : 1130<br />
Date : 5/30/2008<br />
Time : 4:03:12 PM<br />
User : N/A<br />
Computer : USMV-SYDNEY<br />
Description:<br />
Cluster network 'Private' is down. None of the available nodes can communicate using this network. If<br />
the condition persists, check for failures in any network components to which the nodes are connected<br />
such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the network. Finally,<br />
check for hardware or software errors in the adapters that attach the nodes to the network.<br />
Actions to Resolve the Problem<br />
Note: Typically, a network administrator for the site is required to diagnose which<br />
network switch, gateway, or connection is the cause of this failure.<br />
Perform the following actions to isolate and resolve the problem:<br />
1. In the Cluster Administrator, view the network properties of the public and private<br />
network.<br />
The public network should be operational with no failure indications.<br />
The private network should display errors. Refer to the previous symptoms to<br />
identify that this is a WAN issue. If the error is limited to one host, the problem<br />
might be a host network adapter. See “<strong>Public</strong> NIC Failure on a Cluster Node in a<br />
Geographic Clustered Environment” for action to resolve a host network problem.<br />
2. Check for network problems using methods such as isolating the failure to the<br />
network switch or gateway with the problem.<br />
Total Communication Failure in a Geographic Clustered Environment<br />
Problem Description<br />
A total communication failure implies that the cluster nodes and RAs are no longer able to communicate with each other over the public and private network interfaces.<br />
Figure 7–14 illustrates this failure.<br />
Figure 7–14. Total Communication Failure<br />
When this failure occurs, the cluster nodes on both sites detect that the cluster heartbeat has been broken. After six missed heartbeats, the cluster nodes go into a “regroup” process to determine which node takes ownership of all cluster resources. This process consists of checking network interface states and then arbitrating for the quorum device.<br />
During the network interface detection phase, all nodes perform a network interface check to determine that the node is communicating through at least one network interface dedicated for client access, assuming the network interface is set for “All communications” or “Client access only.” If this process determines that the node is not communicating through any viable network, the cluster node voluntarily stops cluster service and drops out of the quorum arbitration process. The remaining nodes then attempt to arbitrate for the quorum device.<br />
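The regroup sequence just described (interface check, voluntary shutdown, then quorum arbitration) can be sketched as follows. The node dictionary shape and site names are illustrative assumptions, and quorum arbitration is reduced to site ownership as the guide describes.

```python
# Simplified sketch of the regroup decision described above. Each node first
# checks for a viable client-facing interface; nodes with none stop cluster
# service, and among the rest, quorum arbitration succeeds only on the site
# that originally owned the quorum consistency group.
def regroup(nodes, quorum_owner_site):
    """nodes: dict of node name -> {"site": str, "viable_interface": bool}."""
    survivors = []
    for name, info in nodes.items():
        if not info["viable_interface"]:
            continue                      # node voluntarily stops cluster service
        if info["site"] == quorum_owner_site:
            survivors.append(name)        # quorum arbitration succeeds here
    return survivors
```

With one node per site and the quorum owned by the West site, only the West node survives, matching the behavior shown in the symptoms that follow.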
Quorum arbitration succeeds on the site that originally owned the quorum consistency group and fails on the nodes that did not own the quorum consistency group. Cluster service then shuts itself down on the nodes where quorum arbitration fails.<br />
In Microsoft Windows 2000 environments, MSCS does not check for network interface availability during the regroup process and starts the quorum arbitration process immediately after a regroup process is initiated, that is, after six missed heartbeats.<br />
Once the cluster has determined which nodes are allowed to remain active in the cluster, the cluster node attempts to bring online all data groups previously owned by the other cluster nodes. The <strong>SafeGuard</strong> 30m Control resource and its associated dependent resources will come online.<br />
During this total communication failure, replication is “Paused by system.” An extended outage requires a full volume sweep. Refer to Section 4 for more information.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console shows a WAN error; all groups are paused. The other site<br />
shows a status of “Unknown.” Figure 7–15 illustrates one site.<br />
Figure 7–15. Management Console Display Showing WAN Error<br />
• The RAs tab on the management console lists errors as shown in Figure 7–16.<br />
Figure 7–16. RAs Tab for Total Communication Failure<br />
• Warnings and informational messages similar to those shown in Figure 7–17 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 7–17. Management Console Messages for Total Communication Failure<br />
The following table explains the numbered messages in Figure 7–17.<br />
Reference No.   Event ID   Description                                       E-mail Immediate   E-mail Daily Summary<br />
1               4001       For each consistency group, a group               X<br />
                           capabilities minor problem is reported. The<br />
                           details indicate that a WAN problem is<br />
                           suspected on both RAs.<br />
2               4008       For each consistency group on the West and        X<br />
                           the East sites, the transfer is paused. The<br />
                           details indicate that a WAN problem is<br />
                           suspected.<br />
3               3021       For each RA at each site, the following error     X<br />
                           message is reported:<br />
                           Error in WAN link to RA at other site (RA x)<br />
4               1008       The message “User action succeeded” is            X<br />
                           displayed. The details indicate that a failover<br />
                           was initiated. This message appears when the<br />
                           groups are moved by the <strong>SafeGuard</strong> Control<br />
                           resource to the surviving cluster node.<br />
• All cluster resources appear online after successfully failing over to the surviving<br />
node.<br />
• The cluster service stops on all nodes except the surviving node.<br />
• From the surviving node, the host system event log has entries similar to the<br />
following:<br />
Event Type : Warning<br />
Event Source : ClusSvc<br />
Event Category: Node Mgr<br />
Event ID : 1123<br />
Date : 6/1/2008<br />
Time : 12:58:55 PM<br />
User : N/A<br />
Computer : USMV-WEST2<br />
Description:<br />
The node lost communication with cluster node 'USMV-EAST2' on <strong>Public</strong> network.<br />
Event Type : Warning<br />
Event Source : ClusSvc<br />
Event Category: Node Mgr<br />
Event ID : 1123<br />
Date : 6/1/2008<br />
Time : 12:58:55 PM<br />
User : N/A<br />
Computer : USMV-WEST2<br />
Description:<br />
The node lost communication with cluster node 'USMV-EAST2' on Private network.<br />
Event Type : Warning<br />
Event Source : ClusSvc<br />
Event Category: Node Mgr<br />
Event ID : 1135<br />
Date : 6/1/2008<br />
Time : 12:58:16 PM<br />
User : N/A<br />
Computer : USMV-WEST2<br />
Description:<br />
Cluster node USMV-EAST2 was removed from the active server cluster membership. Cluster service may<br />
have been stopped on the node, the node may have failed, or the node may have lost communication<br />
with the other active server cluster nodes.<br />
Event Type : Information<br />
Event Source : ClusSvc<br />
Event Category: Failover Mgr<br />
Event ID : 1200<br />
Date : 6/1/2008<br />
Time : 12:58:21 PM<br />
User : N/A<br />
Computer : USMV-WEST2<br />
Description:<br />
The Cluster Service is attempting to bring online the Resource Group "Group 1".<br />
Event Type : Information<br />
Event Source : ClusSvc<br />
Event Category: Failover Mgr<br />
Event ID : 1201<br />
Date : 6/1/2008<br />
Time : 1:02:54 PM<br />
User : N/A<br />
Computer : USMV-WEST2<br />
Description:<br />
The Cluster Service brought the Resource Group "Group 1" online.<br />
• From the surviving node, the private and public network connections show an<br />
exclamation mark “Unknown” status as shown in Figures 7–18 and 7–19.<br />
Figure 7–18. Cluster Administrator Showing Private Network Down<br />
Figure 7–19. Cluster Administrator Showing <strong>Public</strong> Network Down<br />
Actions to Resolve the Problem<br />
Note: Typically, a network administrator for the site is required to diagnose which<br />
network switch, gateway, or connection is the cause of this failure.<br />
Perform the following actions to isolate and resolve the problem:<br />
1. When you observe on the management console that a WAN error occurred on site 1<br />
and on site 2, call the other site to verify that each management console is available<br />
and shows a WAN down because of the failure. If only one site can access the<br />
management console, the problem is probably not a total WAN failure but rather a<br />
management network failure. In that case, see “Management Network Failure in a<br />
Geographic Clustered Environment.”<br />
2. In the Cluster Administrator, verify that only one node is active in the cluster.<br />
3. View the network properties of the public and private network.<br />
The display should show an “Unknown” status for the private and public network.<br />
4. Check for network problems using methods such as isolating the failure to the<br />
network switch or gateway by pinging from the cluster node to the gateway at each<br />
site.<br />
Port Information<br />
Problem Description<br />
Communications problems might occur because of firewall settings that prevent all<br />
necessary communication.<br />
Symptoms<br />
The following symptoms might help you identify this problem:<br />
• Unable to reach the DNS server.<br />
• Unable to communicate to the NTP server.<br />
• Unable to reach the mail server.<br />
• The RAs tab shows RA data link errors.<br />
• The management console shows errors for the WAN.<br />
• The management console logs show RA communications errors.<br />
Actions to Resolve<br />
Perform the port diagnostics from each of the RAs by following the steps given in<br />
Appendix C.<br />
The following tables provide port information that you can use in troubleshooting the<br />
status of connections.<br />
Table 7–2. Ports for Internet Communication<br />
Port Numbers   Protocol or Protocols               Unisys Product <strong>Support</strong> IP Address<br />
21             FTP                                 192.61.61.78<br />
443            Used for remote maintenance (TCP)   129.225.216.130<br />
The following tables list ports used for communication other than Internet<br />
communication.<br />
Table 7–3. Ports for Management LAN<br />
Communication and Notification<br />
Port Numbers Protocol or Protocols<br />
21 Default FTP port (needed for collecting system<br />
information)<br />
22 Default SSH and communications between RAs<br />
25 Default outgoing mail (SMTP) port, used if e-mail alerts from<br />
the RA are configured<br />
80 Web server for management (TCP)<br />
123 Default NTP port<br />
161 Default SNMP port<br />
443 Secure Web server for management (TCP)<br />
514 Syslog (UDP)<br />
1097 RMI (TCP)<br />
1099 RMI (TCP)<br />
4401 RMI (TCP)<br />
4405 Host-to-RA kutils communications (SQL<br />
commands) and KVSS (TCP)<br />
7777 Automatic host information collection<br />
Solving Network Problems<br />
The ports listed in Table 7–4 are used for both the management LAN and WAN.<br />
Table 7–4. Ports for RA-to-RA Internal Communication<br />
Port Numbers | Protocol or Protocols<br />
23 | telnet<br />
123 | NTP (UDP)<br />
1097 | RMI (TCP)<br />
1099 | RMI (TCP)<br />
4444 | TCP<br />
5001 | TCP (default iperf port for performance measuring between RAs)<br />
5010 | Management server (UDP, TCP)<br />
5020 | Control (UDP, TCP)<br />
5030 | RMI (TCP)<br />
5040 | Replication (UDP, TCP)<br />
5060 | Mpi_perf (TCP)<br />
5080 | Connectivity diagnostics tool<br />
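To check whether a firewall is blocking the listed ports, you can attempt a TCP connection to each one from a host on the management LAN. This is an illustrative sketch, not a supported Unisys tool: the RA address is a placeholder, and UDP-only ports (such as 514 for syslog or the default NTP and SNMP ports) cannot be verified this way.<br />

```python
# Sketch: probe the TCP ports from Tables 7-2 through 7-4 to see
# whether a firewall is blocking them. The RA address is a placeholder.
import socket

RA_ADDRESS = "192.0.2.10"  # placeholder: a static RA management IP

# TCP ports that can be probed with a plain connection attempt.
TCP_PORTS = {
    21: "FTP (system information collection)",
    22: "SSH and RA-to-RA communication",
    25: "Outgoing mail (SMTP)",
    80: "Web server for management",
    443: "Secure Web server for management",
    1097: "RMI", 1099: "RMI", 4401: "RMI",
    4405: "Host-to-RA kutils and KVSS",
    5010: "Management server", 5020: "Control", 5040: "Replication",
}

def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port, desc in sorted(TCP_PORTS.items()):
    state = "open" if port_open(RA_ADDRESS, port) else "blocked or closed"
    print(f"{port:5d}  {desc}: {state}")
```

A port reported as "blocked or closed" from one side but open from the other usually points at an intermediate firewall rule rather than the RA itself.<br />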
Section 8<br />
Solving Replication Appliance (RA)<br />
Problems<br />
This section lists symptoms that usually indicate problems with one or more Unisys<br />
<strong>SafeGuard</strong> 30m replication appliances (RAs). The problems include hardware failures.<br />
The graphics, behaviors, and examples in this section are similar to what you observe<br />
with your system but might differ in some details.<br />
For problems relating to RAs, gather the RA logs and ask the following questions:<br />
• Are any errors displayed on the management console?<br />
• Is the issue constant? Is the issue a one-time occurrence? Does the issue occur at<br />
intervals?<br />
• What are the states of the consistency groups?<br />
• What is the timeframe in which the problem occurred?<br />
• When was the first occurrence of the problem?<br />
• What actions were taken as a result of the problem or issue?<br />
• Were any recent changes made in the replication environment? If so, what?<br />
Table 8–1 lists symptoms and possible causes for the failure of a single RA on one site<br />
with a switchover as a symptom. Table 8–2 lists symptoms and possible causes for the<br />
failure of a single RA on one site without switchover symptoms. Table 8–3 lists<br />
symptoms and other possible problems regarding multiple RA failures. Each problem and<br />
the actions to resolve it are described in this section.<br />
In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />
for possible problems. Also, messages similar to e-mail notifications might be displayed<br />
on the management console. If you do not see the messages, they might have already<br />
dropped off the display. Review the management console logs for messages that have<br />
dropped off the display.<br />
Table 8–1. Possible Problems for Single RA Failure with a Switchover<br />
Symptoms | Possible Problem<br />
The management console shows RA failure. | Single RA failure<br />
Possible Contributing Causes to Single RA Failure with a Switchover<br />
The system frequently pauses transfer for all consistency groups. If you log in to the failed RA as the boxmgmt user, a message is displayed explaining that the reboot regulation limit has been exceeded. The management console shows repeated events that report an RA is up followed by an RA is down. | Reboot regulation failover<br />
The link indicator lights on all host bus adapters (HBAs) are not illuminated. The port indicator lights on the Fibre Channel switch no longer show a link to the RA. Port errors occur or there is no target when running the SAN diagnostics. The management console shows RA failure with details pointing to a problem with the repository volume. | Failure of all SAN Fibre Channel HBAs on one RA<br />
The link indicator lights on the HBA or HBAs are not illuminated. The port indicator lights on the network switch or hub no longer show a link to the RA. | Onboard WAN network adapter failure (or failure of the optional gigabit Fibre Channel WAN network adapter)<br />
Table 8–2. Possible Problems for Single RA Failure Without a Switchover<br />
Symptoms | Possible Problem<br />
The link indicator lights on the onboard management network adapter are not illuminated. | Onboard management network adapter failure<br />
The failure light for the hard disk indicates a failure. An error message that appears during a boot operation indicates failure of one of the internal disks. | Single hard-disk failure<br />
The link indicator lights on the HBA are not illuminated. The port indicator lights on the Fibre Channel switch no longer show a link to the RA. For one of the ports on the relevant RA, errors appear when running the SAN diagnostics. | Port failure of a single SAN Fibre Channel HBA on one RA<br />
Table 8–3. Possible Problems for Multiple RA Failures with Symptoms<br />
Symptoms | Possible Problem<br />
Replication has stopped on all groups. MSCS fails over groups to the other site, or MSCS fails on all nodes. The management console displays a WAN error to the other site. | Failure of all RAs on one site<br />
Replication has stopped on all groups. MSCS fails over groups to the other site, or MSCS fails on all nodes. The management console displays a WAN error to the other site. | All RAs on one site are not attached<br />
Single RA Failures<br />
Problem Description<br />
When an RA fails, a switchover might occur. In some cases, a switchover does not<br />
occur. See “Single RA Failures With Switchover” and “Single RA Failures Without<br />
Switchover.”<br />
Understanding Management Console Access<br />
If the RA that failed had been running site control—that is, the RA owned the virtual<br />
site management IP address—and a switchover occurs, the virtual IP address moves to<br />
the new RA.<br />
If you attempt to connect to the management console using one of the static<br />
management IP addresses of the RAs, a connection error occurs if the RA does not have<br />
site control. Thus, you should use the site management IP address to connect to the<br />
management console.<br />
At least one RA (either RA 1 or RA 2) must be attached to the RA cluster for the<br />
management console to function.<br />
If the RA that failed was running site control and a switchover does not occur (such as<br />
with an onboard management network connection failure), the management console<br />
might not be accessible, and attempts to log in to that RA using PuTTY with the<br />
boxmgmt log-in account fail. When an RA does not have site control, you can always<br />
log in to it using PuTTY and the boxmgmt log-in account.<br />
You cannot determine which RA owns site control unless the management console is<br />
accessible. The RA that owns site control is identified at the bottom of the<br />
management console display.<br />
Another situation in which you cannot log in to the management console is when the<br />
user account has been locked. In this case, follow these steps:<br />
1. Log in interactively using PuTTY with another unlocked user account.<br />
2. Enter unlock_user.<br />
3. Determine whether any users are listed, and follow the messages to unlock the<br />
locked user accounts.<br />
Figure 8–1 illustrates a single RA failure.<br />
Figure 8–1. Single RA Failure<br />
Single RA Failure with Switchover<br />
In this case, a single RA fails, and there is an automatic switchover to a surviving RA on<br />
the same site. Any groups that had been running on the failed RA run on a surviving RA<br />
at the same site.<br />
Each RA handles the replicating activities of the consistency groups for which it is<br />
designated as the preferred RA. The consistency groups that are affected are those that<br />
were configured with the failed RA as the preferred RA. Thus, whenever an RA becomes<br />
inoperable, the handling of the consistency groups for that RA switches over<br />
automatically to the functioning RAs in the same RA cluster.<br />
During the RA switchover process, the server applications do not experience any I/O<br />
failures. In a geographic clustered environment, MSCS is not aware of the RA failure,<br />
and all application and replication operations continue to function normally. However,<br />
performance might be affected because the I/O load on the surviving RAs is now<br />
increased.<br />
Symptoms<br />
Failures of an RA that cause a switchover are as follows:<br />
• RA hardware issues (such as memory, motherboard, and so forth)<br />
• Reboot regulation failover<br />
• Failure of all SAN Fibre Channel HBAs on one RA<br />
• Onboard WAN network adapter failure (or failure of the optional gigabit Fibre Channel<br />
WAN network adapter)<br />
The following symptoms might help you identify this failure:<br />
• The RA does not boot.<br />
From a power-on reset, the BIOS display shows the BIOS information, RAID adapter<br />
utility prompt, logical drives found, and so forth. The display is similar to the<br />
information shown in Figure 8–2.<br />
Figure 8–2. Sample BIOS Display<br />
Once the RA initializes, the log-in screen is displayed.<br />
Note: Because status messages normally scroll on the screen, you might need to<br />
press Enter to see the log-in screen.<br />
• The management console system status shows an RA failure. (See Figure 8–3.)<br />
To display more information about the error, click the red error in the right column.<br />
The More Info dialog box is displayed with a message similar to the following:<br />
RA 1 in West is down<br />
Figure 8–3. Management Console Display Showing RA Error and RAs Tab<br />
• The RAs tab on the management console shows information similar to that in<br />
Figure 8–3, specifically<br />
− The RA status for RA 1 on the West site shows an error.<br />
− The peer RA on the East site (RA 1) shows a data link error.<br />
− Each RA on the East site shows a WAN connection failure.<br />
− The surviving RA at the failed site (West) does not show any errors.<br />
• Warnings and informational messages similar to those shown in Figure 8–4 appear<br />
on the management console when an RA fails and a switchover occurs. See the<br />
table after the figure for an explanation of the numbered console messages. In your<br />
environment, the messages pertain only to the groups configured to use the failed<br />
RA as the preferred RA.<br />
Figure 8–4. Management Console Messages for Single RA Failure with Switchover<br />
The following table explains the numbered messages shown in Figure 8–4.<br />
Reference No. | Event ID | Description | E-mail Immediate | E-mail Daily Summary<br />
1 | 3023 | At the same site, the other RA reports a problem getting to the LAN of the failed RA. | X |<br />
2 | 3008 | The site with the failed RA reports that the RA is probably down. | X |<br />
3 | 2000 | The management console is now running on RA 2. | X |<br />
4 | 4001 | For each consistency group, a minor problem is reported. The details show that the RA is down or not a cluster member. | X |<br />
5 | 4008 | For each consistency group, the transfer is paused at the surviving site to allow a switchover. The details show the reason for the pause as switchover. | X |<br />
Reference No. | Event ID | Description | E-mail Immediate | E-mail Daily Summary<br />
6 | 4041 | For each consistency group at the same site, the groups are activated at the surviving RA. This probably means that a switchover to RA 2 at the failed site was successful. | X |<br />
7 | 5032 | For each consistency group at the failed site, the splitter is again splitting. | X |<br />
8 | 3021 | A WAN link error is reported from each RA at the surviving site regarding the failed RA at the other site. | X |<br />
9 | 4010 | For each consistency group at the failed site, the transfer is started. | X |<br />
10 | 4086 | For each consistency group at the failed site, an initialization is performed. | X |<br />
11 | 4087 | For each consistency group at the failed site, the initialization completes. | X |<br />
12 | 3007 | The failed RA (RA 1) is now restored. | X |<br />
To see the details of the messages listed on the management console display, you must<br />
collect the logs and then review the messages for the time of the failure. Appendix A<br />
explains how to collect the management console logs, and Appendix E lists the event<br />
IDs with explanations.<br />
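When reviewing the collected logs for the time of the failure, a short filter can narrow the entries to the event IDs of interest. This sketch is illustrative only: it assumes a timestamp-prefixed log line format, which may differ from the format of your collected files, so adjust the parsing accordingly.<br />

```python
# Sketch: filter collected management console log entries to those near
# the failure time that carry one of the event IDs of interest.
# The log line format assumed here (ISO timestamp, then an event ID)
# is an assumption; adapt the regex to the actual collected logs.
import re
from datetime import datetime, timedelta

EVENT_IDS = {"3008", "3023", "2000", "4001", "4008"}  # IDs from the table above

def entries_near(path, failure_time, window_minutes=30):
    """Yield log lines whose timestamp falls within the window around
    `failure_time` and whose first event ID is one of EVENT_IDS."""
    window = timedelta(minutes=window_minutes)
    pattern = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?\b(\d{4})\b")
    with open(path) as f:
        for line in f:
            m = pattern.match(line)
            if not m:
                continue  # skip lines that do not look like log entries
            stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            if abs(stamp - failure_time) <= window and m.group(2) in EVENT_IDS:
                yield line.rstrip()
```
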
Actions to Resolve the Problem<br />
The following list summarizes the actions you need to perform to isolate and resolve the<br />
problem:<br />
• Check the LCD display on the front panel of the RA. See “LCD Status Messages” in<br />
Appendix B for more information.<br />
If the LCD display shows an error, run the RA diagnostics. See Appendix B for more<br />
information.<br />
• Check all indicator lights on the rear panel of the RA.<br />
• Review the symptoms and actions in the following topics:<br />
− Reboot Regulation<br />
− Onboard WAN Network Adapter Failure<br />
• If you determine that the failed RA must be replaced, contact the Unisys service<br />
representative for a replacement RA.<br />
After you receive the replacement RA, follow the steps in Appendix D to install and<br />
configure it.<br />
The following procedure provides a detailed description of the actions to perform:<br />
1. Remove the front bezel of the RA and look at the LCD display. During normal<br />
operation, the illuminated message should identify the system.<br />
If the LCD display flashes amber, the system needs attention because of a problem<br />
with power supplies, fans, system temperature, or hard drives.<br />
Figure 8–5 shows the location of the LCD display.<br />
Figure 8–5. LCD Display on Front Panel of RA<br />
If an error message is displayed, check Table B–1. For example, the message E0D76<br />
indicates a drive failure. (Refer to “Single Hard Disk Failure” in this section.)<br />
If the message code is not listed in Table B–1, run the RA diagnostics (see<br />
Appendix B).<br />
2. Check the indicators at the rear of the RA as described in the following steps and<br />
visually verify that all are working correctly.<br />
Figure 8–6 illustrates the rear panel of the RA.<br />
Note: The network connections on the rear panel labeled 1 and 2 in the following<br />
illustration might appear different on your RA. The connection labeled 1 is always the RA<br />
replication network, and the connection labeled 2 is always the RA management<br />
network. Pay special attention to the labeling when checking the network connections.<br />
Figure 8–6. Rear Panel of RA Showing Indicators<br />
• Ping each network connection (management network and replication network), and<br />
visually verify that the LEDs on either side of the cable on the back panel are<br />
illuminated. Figure 8–7 shows the location of these LEDs.<br />
If the LEDs are off, the network is not connected. The green LED is lit if the network<br />
is connected to a valid link partner on the network. The amber LED blinks when<br />
network data is being sent or received.<br />
If the management network LEDs indicate a problem, refer to “Onboard<br />
Management Network Adapter Failure” in this section.<br />
If the replication network LEDs indicate a problem, refer to “Onboard WAN Network<br />
Adapter Failure” in this section.<br />
Figure 8–7. Location of Network LEDs<br />
• Check that the green LEDs for the SAN Fibre Channel HBAs are illuminated as<br />
shown in Figure 8–8.<br />
Figure 8–8. Location of SAN Fibre Channel HBA LEDs<br />
The following table explains the LED patterns and their meanings. If the LEDs<br />
indicate a problem, refer to the two topics for SAN Fibre Channel HBA failures in this<br />
section.<br />
Green LED | Amber LED | Activity<br />
On | On | Power<br />
On | Off | Online<br />
Off | On | Signal acquired<br />
Off | Flashing | Loss of synchronization<br />
Flashing | Flashing | Firmware error<br />
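If it helps to translate field notes programmatically, the LED pattern table above can be expressed as a simple lookup. This sketch is illustrative only; the state strings mirror the table, and the function name is hypothetical.<br />

```python
# Sketch: the SAN Fibre Channel HBA LED pattern table as a lookup, so an
# observed (green, amber) reading can be translated to the adapter state.
HBA_LED_STATES = {
    ("on", "on"): "Power",
    ("on", "off"): "Online",
    ("off", "on"): "Signal acquired",
    ("off", "flashing"): "Loss of synchronization",
    ("flashing", "flashing"): "Firmware error",
}

def hba_state(green: str, amber: str) -> str:
    """Map observed (green LED, amber LED) readings to the adapter state."""
    return HBA_LED_STATES.get((green.lower(), amber.lower()), "Unknown pattern")
```

For example, a reading of green off and amber flashing maps to "Loss of synchronization", which points at a cabling or link problem rather than a dead adapter.<br />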
Reboot Regulation<br />
Problem Description<br />
After frequent, unexplained reboots or restarts of the replication process, the RA<br />
automatically detaches from the RA cluster.<br />
When installing the RAs, you can enable or disable this reboot regulation feature. The<br />
factory default is for the feature to be enabled so that reboot regulation is triggered<br />
whenever a specified number of reboots or failures occur within the specified time<br />
interval.<br />
The two parameters available for the reboot regulation feature are the number of reboots<br />
(including internal failures) and the time interval. The default value for the number of<br />
reboots is 10, and the default value for the time interval is 2 hours.<br />
Only Unisys personnel should change these values. Use the Installation Manager to<br />
change the parameter values or disable the feature. See the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong><br />
Replication Appliance Installation <strong>Guide</strong> for information about using the Installation<br />
Manager tools to make these changes.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• Frequent transfer pauses for all consistency groups that have the same preferred<br />
RA.<br />
• If you log in to the RA as the boxmgmt user, the following message is displayed:<br />
Reboot regulation limit has been exceeded<br />
• Several messages might be displayed on the Logs tab of the management console<br />
as an RA reboots to try to correct a problem. These messages are listed in<br />
Table 8–4.<br />
Table 8–4. Management Console Messages Pertaining to Reboots<br />
Reference No./Legend | Event ID | Description | E-mail Immediate | E-mail Daily Summary<br />
* | 3008 | The RA appears to be down. The RA might attempt to perform a reboot to correct the problem. | X |<br />
* | 3023 | Error in LAN link (as RA reboots). | X |<br />
* | 3021 | Error in WAN link (as RA reboots). | X |<br />
* | 3007 | The RA is up (the reboot completes). | X |<br />
* | 3022 | The LAN link is restored (the reboot has completed). | X |<br />
* | 3020 | The WAN link at the other site is restored (the reboot has completed). | X |<br />
When any of these messages appear multiple times in a short time period, they<br />
might indicate an RA that has continuously rebooted and might have reached the<br />
reboot regulation limit.<br />
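The regulation rule described above (a reboot count within a time interval) can be sketched as a sliding-window counter. This is an illustrative model only, not the RA's actual implementation; the class name is hypothetical, and the defaults follow the text (10 reboots within 2 hours).<br />

```python
# Sketch of the reboot-regulation idea: once the number of reboots
# recorded within the interval reaches the limit, the RA detaches from
# the cluster. Illustrative model only; defaults follow the text above.
from collections import deque
from datetime import datetime, timedelta

class RebootRegulator:
    def __init__(self, max_reboots=10, interval=timedelta(hours=2)):
        self.max_reboots = max_reboots
        self.interval = interval
        self._reboots = deque()

    def record_reboot(self, when: datetime) -> bool:
        """Record a reboot; return True once the number of reboots within
        the interval reaches the limit (i.e., the RA should detach)."""
        self._reboots.append(when)
        # discard reboots that fell out of the sliding window
        while self._reboots and when - self._reboots[0] > self.interval:
            self._reboots.popleft()
        return len(self._reboots) >= self.max_reboots
```
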
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Collect the RA logs before you attempt to resolve the problem. See Appendix A for<br />
information about collecting logs.<br />
2. To determine whether the hardware is faulty, run the RA diagnostics described in<br />
Appendix B.<br />
3. If the problem remains, submit the RA logs to Unisys for analysis.<br />
4. Once the problem is corrected, the RA automatically attaches to the RA cluster after<br />
a power-on reset. If necessary, reattach the RA to the RA cluster manually by<br />
following these steps:<br />
a. Log in as boxmgmt to the RA through an SSH session using PuTTY.<br />
b. At the prompt, type 4 (Cluster operations) and press Enter.<br />
c. At the prompt, type 1 (Attach to cluster) to attach the RA to the cluster.<br />
d. At the prompt, type Q (Quit).<br />
Failure of All SAN Fibre Channel Host Bus Adapters (HBAs)<br />
Problem Description<br />
Symptoms<br />
All SAN Fibre Channel HBAs or adapter ports on the RA fail. This scenario is unlikely<br />
because the RA has redundant ports that are located on different physical adapters. A<br />
SAN connectivity problem is more likely.<br />
Note: The failure of a single redundant path does not show errors on the management<br />
console display. See “Port Failure on a Single SAN Fibre Channel HBA on One RA.”<br />
The following symptoms might help you identify this failure:<br />
• The link indicator lights on all SAN Fibre Channel HBAs are not illuminated. (Refer to<br />
Figure 8–8 for the location of these LEDs.)<br />
• The port indicator lights on the Fibre Channel switch no longer show a link to the RA.<br />
• Port errors occur or no target appears when running the Installation Manager SAN<br />
diagnostics.<br />
• Information on the Volumes tab of the management console is inconsistent or<br />
periodically changing.<br />
• The management console shows failures for RAs, storage, and hosts. (See<br />
Figure 8–9.)<br />
Figure 8–9. Management Console Display: Host Connection with RA Is<br />
Down<br />
If you click the red error indication for RAs in the right column, the message is<br />
RA 2 in East can’t access repository volume<br />
If you click the red error indication for storage in the right column, the following<br />
messages are displayed:<br />
If you click the red error indication in the right column for splitters, the message is<br />
ERROR: USMV-EAST2's connection with RA2 is down<br />
• Warnings and informational messages similar to those shown in Figure 8–10 appear<br />
on the management console when an RA fails with this type of problem. See the<br />
table after the figure for an explanation of the numbered console messages.<br />
Also, refer to Figure 8–4 and the table that explains the messages for information<br />
about an RA failure with a generic switchover.<br />
Refer to Table 8–4 for other messages that might occur whenever an RA reboots to<br />
try to correct the problem.<br />
Figure 8–10. Management Console Messages for Failed RA (All SAN HBAs Fail)<br />
The following table explains the numbered messages shown in Figure 8–10. You<br />
might also see the messages denoted with an asterisk (*) in Table 8–4.<br />
Reference No. | Event ID | Description | E-mail Immediate | E-mail Daily Summary<br />
1 | 3014 | The RA is unable to access the repository volume (RA 2). | X |<br />
2 | 4003 | For each consistency group that had the failed RA as the preferred RA, a group consistency problem is reported. The details show a repository volume problem. | X |<br />
3 | 3012 | The RA is unable to access volumes (all volumes for repository, journal, and data are listed). | X |<br />
4 | 4086 | Initialization started (RA 1, Quorum - West). | X |<br />
5 | 4087 | Initialization complete (RA 1, Quorum - West). The group has completed the switchover. | X |<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Refer to Section 6, “Solving SAN Connectivity Problems,” to determine whether the<br />
problem is described there.<br />
2. If you determine that the SAN Fibre Channel HBA failed and must be replaced,<br />
contact a Unisys service representative for a replacement adapter.<br />
3. Once the replacement adapter is received, perform the following steps to replace the<br />
failed HBA:<br />
a. Open a PuTTY session using the IP address of the RA and log in as<br />
boxmgmt/boxmgmt.<br />
Appendix C provides additional information about the Installation Manager<br />
diagnostics.<br />
b. On the Main menu, type 3 (Diagnostics) and press Enter.<br />
c. On the Diagnostics menu, type 2 (Fibre Channel diagnostics) and press<br />
Enter.<br />
d. On the Fibre Channel Diagnostics menu, type 4 (View Fibre Channel<br />
details) and press Enter.<br />
Information similar to the following is displayed:<br />
>>Site1 Box 1>>3<br />
Port 0<br />
wwn = 50012482001c6fb0<br />
node_wwn = 50012482001c6fb1<br />
Port id = 0x20100<br />
operating mode = point to point<br />
speed = 2 GB<br />
Port 1<br />
----------------------------------<br />
wwn = 50012482001ce3c4<br />
node_wwn = 50012482001ce3c5<br />
Port id = 0x10100<br />
operating mode = point to point<br />
speed = 2 GB<br />
e. Write down the port information.<br />
f. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.<br />
g. On the Diagnostics menu, type B (Back) and press Enter.<br />
h. On the Main Menu, type 4 (Cluster operations) and press Enter.<br />
i. On the Cluster Operations menu, type 2 (Detach from cluster) and press<br />
Enter.<br />
j. Shut down the RA.<br />
k. Replace the failed adapter with the replacement, and then boot the RA.<br />
Note: The replacement adapter does not require any settings to be changed.<br />
l. Repeat steps a through d, and again view the Fibre Channel details to see the<br />
new WWN for the replaced HBA.<br />
m. Using the SAN switch management interface, modify the zoning as needed to<br />
replace the failed WWN with the new WWN.<br />
n. Use the new WWN to configure the storage.<br />
o. On the Fibre Channel Diagnostics menu, type 1 (SAN diagnostics) and<br />
press Enter. (Refer to steps a through c to access the Fibre Channel<br />
Diagnostics menu.)<br />
When you select the SAN diagnostics option, the system conducts automatic<br />
tests that are designed to identify the most common problems encountered in<br />
the configuration of SAN environments.<br />
Once the tests complete, a message is displayed confirming the successful<br />
completion of SAN diagnostics, or a report is displayed that details any critical<br />
configuration problems.<br />
p. Once no problems are reported from the SAN diagnostics, type B (Back) and<br />
press Enter.<br />
q. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.<br />
r. On the Diagnostics menu, type B (Back) and press Enter.<br />
s. On the Main Menu, type 4 (Cluster operations) and press Enter.<br />
t. On the Cluster Operations menu, type 1 (Attach to cluster) and press Enter.<br />
This action reattaches the RA, which automatically reboots and restarts<br />
replication.<br />
Note: The replacement Fibre Channel HBA does not need any configuration changes.<br />
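When comparing WWNs before and after the replacement (steps e and l), a small parser over the saved "View Fibre Channel details" output can help. This is an illustrative sketch; the listing format it assumes matches the sample shown in this procedure, and the function name is hypothetical.<br />

```python
# Sketch: extract (wwn, node_wwn) pairs from the "View Fibre Channel
# details" listing so that pre- and post-replacement values can be
# compared. Assumes the listing format shown earlier in this procedure.
import re

def parse_wwns(listing: str):
    """Return a list of (wwn, node_wwn) pairs, one per port."""
    # "^\s*wwn" will not match "node_wwn", since that token starts with "n"
    wwns = re.findall(r"^\s*wwn\s*=\s*([0-9a-f]+)", listing, re.MULTILINE)
    nodes = re.findall(r"node_wwn\s*=\s*([0-9a-f]+)", listing)
    return list(zip(wwns, nodes))
```

Comparing the parsed pairs from before and after the swap makes it obvious which WWN must be updated in the switch zoning and storage configuration.<br />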
Failure of Onboard WAN Adapter or Failure of Optional Gigabit<br />
Fibre Channel WAN Adapter<br />
Problem Description<br />
Symptoms<br />
The onboard WAN adapter failed. This adapter serves the replication network.<br />
Notes:<br />
• The gigabit Fibre Channel WAN adapter is an optional component found in some<br />
environments. When this board fails, the symptoms are the same as those observed<br />
when the onboard WAN adapter fails. In that case, the indicator lights pertain to the<br />
gigabit Fibre Channel WAN board instead of the onboard capability.<br />
• The actions to resolve the problem are similar once you isolate the board as the<br />
problem. That is, contact a Unisys service representative for a replacement part.<br />
The following symptoms might help you identify this failure:<br />
• Transfer between sites pauses temporarily for all consistency groups for which this<br />
is the preferred RA while an RA switchover occurs.<br />
• Applications continue to run. High loads might occur because of reduced total<br />
throughput capacity.<br />
• The link indicators on the onboard WAN adapter might not be illuminated. (See<br />
Figure 8–6 for the location of the connector for the replication network WAN.<br />
Figure 8–7 illustrates the LEDs.)<br />
• The port lights on the network switch might indicate that there is no link to the<br />
onboard WAN adapter.<br />
• The management console shows a WAN data link failure for RA 1. The More Info<br />
dialog box for this error provides the message: “RA-x WAN data link is down.” (See<br />
Figure 8–11.)<br />
Figure 8–11. Management Console Showing WAN Data Link Failure<br />
• The RAs tab on the management console (Figure 8–11) shows an error for the same<br />
RA at each site, indicating that the connectivity between them has been lost.<br />
• Warnings and informational messages similar to those shown in Figure 8–4 for an<br />
RA failure are displayed for this failure. Refer to the table after Figure 8–4 for<br />
descriptions of the messages. For this failure, the details of event ID 4001 show a<br />
WAN data path problem.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Isolate the problem to the onboard WAN adapter by performing the actions in<br />
“Replication Network Failure in a Geographic Clustered Environment” in<br />
Section 7.<br />
• If you determine that the motherboard must be replaced, contact a Unisys service<br />
representative for a replacement part.<br />
• Contact the Unisys <strong>Support</strong> Center for the appropriate BIOS for the replacement<br />
part.<br />
Note: The replacement motherboard might not have the disk controller set for<br />
RAID1 (mirroring). Check the setting and change it if necessary.<br />
• In rare cases, you might need to obtain a replacement RA from a Unisys service<br />
representative. After you receive the replacement RA, follow the steps in Appendix<br />
D to install and configure it.<br />
Single RA Failures Without a Switchover<br />
Problem Description<br />
Some failures that might occur on an RA do not cause a switchover. These failures are<br />
• Port failure on a single SAN Fibre Channel HBA on one RA<br />
• Onboard management network adapter failure<br />
• Single hard disk failure<br />
Port Failure on a Single SAN Fibre Channel HBA on One RA<br />
Problem Description<br />
Symptoms<br />
One SAN Fibre Channel HBA port on the RA failed.<br />
The following symptoms might help you identify this failure:<br />
• The Logs tab on the management console displays a message for event ID 3030—<br />
Warning RA switched path to storage. (RA , Volumes )—only if the<br />
connection failed during an I/O operation.<br />
• The link indicator lights on the SAN Fibre Channel HBA are not illuminated. (Refer to<br />
Figure 8–8 for the location of these LEDs.)<br />
• The port indicator lights on the Fibre Channel switch no longer show a link to the RA.<br />
• For one port on the relevant RA, errors occur when running the Installation Manager<br />
SAN diagnostics. See Appendix C for information about these diagnostics.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. If you determine that the SAN Fibre Channel HBA failed and must be replaced,<br />
contact a Unisys service representative for a replacement part.<br />
2. Once the replacement adapter is received, perform the following steps to replace<br />
the failed HBA:<br />
a. Open a PuTTY session using the IP address of the RA, and log in as<br />
boxmgmt/boxmgmt.<br />
Appendix C provides additional information about the Installation Manager<br />
diagnostics.<br />
b. On the Main menu, type 3 (Diagnostics) and press Enter.<br />
c. On the Diagnostics menu, type 2 (Fibre Channel diagnostics) and press<br />
Enter.<br />
d. On the Fibre Channel Diagnostics menu, type 4 (View Fibre Channel<br />
details) and press Enter.<br />
Information similar to the following is displayed:<br />
>>Site1 Box 1>>3<br />
Port 0<br />
---------------------------------<br />
wwn = 50012482001c6fb0<br />
node_wwn = 50012482001c6fb1<br />
Port id = 0x20100<br />
operating mode = point to point<br />
speed = 2 GB<br />
Port 1<br />
---------------------------------<br />
wwn = 50012482001ce3c4<br />
node_wwn = 50012482001ce3c5<br />
Port id = 0x10100<br />
operating mode = point to point<br />
speed = 2 GB<br />
e. Write down the port information.<br />
f. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.<br />
g. On the Diagnostics menu, type B (Back) and press Enter.<br />
h. On the Main Menu, type 4 (Cluster operations) and press Enter.<br />
i. On the Cluster Operations menu, type 2 (Detach from cluster) and press<br />
Enter.<br />
j. Shut down the RA.<br />
k. Replace the failed adapter with the replacement, and then boot the RA.<br />
Note: The replacement adapter does not require any settings to be changed.<br />
l. Repeat steps a through d and again view the Fibre Channel details to see the<br />
new WWN for the replaced HBA.<br />
m. Using the management interface of the SAN switch, modify the zoning as<br />
needed to replace the failed WWN with the new WWN.<br />
n. Use the new WWN to configure the storage.<br />
o. On the Fibre Channel Diagnostics menu, type 1 (SAN diagnostics) and<br />
press Enter. (Refer to steps a through c to access the Fibre Channel<br />
Diagnostics menu.)<br />
When you select the SAN diagnostics option, the system conducts automatic<br />
tests that are designed to identify the most common problems encountered in<br />
the configuration of SAN environments.<br />
Once the tests complete, a message is displayed confirming the successful<br />
completion of SAN diagnostics, or a report is displayed that details any critical<br />
configuration problems.<br />
p. Once no problems are reported from the SAN diagnostics, type B (Back) and<br />
press Enter.<br />
q. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.<br />
r. On the Diagnostics menu, type B (Back) and press Enter.<br />
s. On the Main Menu, type 4 (Cluster operations) and press Enter.<br />
t. On the Cluster Operations menu, type 1 (Attach to cluster) and press Enter.<br />
This action reattaches the RA, which automatically reboots and restarts<br />
replication.<br />
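When several ports must be recorded in steps e and l, the "View Fibre Channel details" output shown in step d can also be parsed programmatically. The following Python sketch is not part of the product; the helper name and parsing rules are assumptions based only on the sample output above:<br />

```python
import re

def parse_fc_details(text):
    """Parse 'View Fibre Channel details' output into a dict per port.

    Returns {port_number: {"wwn": ..., "node_wwn": ...}}.
    (Hypothetical helper; format assumed from the sample output above.)
    """
    ports = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"\s*Port (\d+)", line)
        if m:
            current = int(m.group(1))
            ports[current] = {}
            continue
        m = re.match(r"\s*(wwn|node_wwn)\s*=\s*([0-9a-fA-F]+)", line)
        if m and current is not None:
            ports[current][m.group(1)] = m.group(2).lower()
    return ports

# Example using output similar to step d:
sample = """\
Port 0
wwn = 50012482001c6fb0
node_wwn = 50012482001c6fb1
Port id = 0x20100
operating mode = point to point
speed = 2 GB
Port 1
wwn = 50012482001ce3c4
node_wwn = 50012482001ce3c5
Port id = 0x10100
operating mode = point to point
speed = 2 GB
"""
print(parse_fc_details(sample)[0]["wwn"])  # 50012482001c6fb0
```

Comparing the parsed values before and after the HBA replacement identifies the new WWN needed for the zoning and storage changes in steps m and n.<br />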
Note: The replacement Fibre Channel HBA does not need any configuration changes.<br />
Onboard Management Network Adapter Failure<br />
Problem Description<br />
Symptoms<br />
The onboard management network adapter failed.<br />
The following symptoms might help you identify this failure:<br />
• On the management console, the system status and RA status do not display any<br />
error indications.<br />
• The link indicators on the onboard management network adapter are not illuminated.<br />
(See Figure 8–6 for the location of the connector for the onboard management<br />
network adapter. Figure 8–7 illustrates the LEDs.)<br />
• If RA site control was running on the failed RA, you cannot access the management<br />
console. If the management console was already open, a banner is displayed showing<br />
“not connected.”<br />
• If RA site control was not running on the failed RA, you can access the management<br />
console.<br />
• You cannot determine which RA owns site control unless the management console<br />
is accessible. The RA that owns site control is designated at the bottom of the<br />
display.<br />
• See “Management Network Failure in a Geographic Clustered Environment” in<br />
Section 7 for additional symptoms.<br />
• The Logs tab on the management console might display a message for event ID<br />
3023—Error in LAN link to RA (RA1)—for this failure.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Isolate the problem to the onboard management network adapter by performing the<br />
actions in “Management Network Failure in a Geographic Clustered Environment” in<br />
Section 7.<br />
• If you determine the motherboard must be replaced, contact a Unisys service<br />
representative for a replacement part.<br />
• Contact the Unisys <strong>Support</strong> Center for the appropriate BIOS for the replacement<br />
part.<br />
Note: The replacement motherboard might not have the disk controller set for<br />
RAID1 (mirroring). Check the setting and change it if necessary.<br />
• In rare cases, you might need to obtain a replacement RA from a Unisys service<br />
representative. After you receive the replacement RA, follow the steps in Appendix<br />
D to install and configure it.<br />
Single Hard Disk Failure<br />
Problem Description<br />
Symptoms<br />
One of the mirrored internal hard disks for the RA failed.<br />
The following symptoms might help you identify this failure:<br />
• The failure light for a hard disk indicates a failure. Figure 8–12 illustrates the location<br />
of the LEDs for hard disks in the RA.<br />
Figure 8–12. Location of Hard Drive LEDs<br />
• An error message that appears during boot indicates failure of one of the internal<br />
disks.<br />
• The LCD display on the front panel of the RA indicates a drive failure. This error code<br />
is E0D76 as shown in Figure 8–5.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• If the drive failed, you must replace the hard drive. Contact a Unisys service<br />
representative for a replacement part.<br />
• Install the new drive; resynchronization occurs automatically.<br />
Do not power off or reboot the RA while resynchronization is taking place.<br />
Failure of All RAs at One Site<br />
Problem Description<br />
If all RAs fail on one site, replication stops and the data that are currently changing on the<br />
remote site are marked for synchronization. Once the RAs are restored, synchronization<br />
occurs through a full-sweep operation.<br />
This type of failure is unlikely unless the power source fails.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• Transfer is paused for all consistency groups.<br />
• Depending on the environment and group settings, applications that were running on<br />
the failed site might stop.<br />
• If the quorum resource belonged to a node at the failed site, MSCS might fail.<br />
• The symptoms for this failure are similar to a total site failure and a network failure<br />
on both the management network and WAN. Because the WAN link is functioning,<br />
the difference is that the following are true:<br />
− Neither site can access the management console using the site management IP<br />
address of the site with the failed RAs.<br />
− Both sites can access the management console using the site management IP<br />
address of the site with the functioning RAs.<br />
Communicate with the administrator at the other site to determine whether that site<br />
can access the management console. Both sites should see a display similar to<br />
Figure 8–13.<br />
Figure 8–13. Management Console Showing All RAs Down<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Restore power to the failed RAs.<br />
2. If recovery of applications is needed prior to restoring the RAs, see the recovery<br />
topics in Section 3 for geographic replication environments and in Section 4 for<br />
geographic clustered environments.<br />
All RAs Are Not Attached<br />
Problem Description<br />
If all RAs at a site are not attached, connection to the management console is not<br />
available. Also, you cannot access the RA using a PuTTY session and the site<br />
management IP address. You cannot log into the RA using the RA management IP<br />
address and the admin user account. The RA that runs site control is assigned a virtual IP<br />
address that is the site management IP address. Either RA 1 or RA 2 must be attached<br />
to the cluster to have an RA cluster with site control running.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• You cannot log in to the management console using the site management IP<br />
addresses of the failed sites.<br />
• You cannot initiate an SSH session through PuTTY using the admin account to either<br />
RA management IP address or the site management IP address.<br />
• From the management console of the other site, the WAN appears to be down. (See<br />
Figure 8–11.)<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Ping the RA using the management IP address. If the ping is not successful, refer to<br />
“Management Network Failure in a Geographic Clustered Environment” in<br />
Section 7. If the ping completes successfully, continue with steps 2 through 5.<br />
2. Log in as boxmgmt to each RA management IP address through an SSH session<br />
using PuTTY. (See “Using the SSH Client” in Appendix C for more information.) If<br />
this is not successful, the RA is probably not attached.<br />
3. To verify that the RA is not attached, follow these steps:<br />
a. Log in as boxmgmt to the RA.<br />
b. At the prompt, type 4 (Cluster operations) and press Enter.<br />
Note: If the “reboot regulation limit has been exceeded” message is displayed<br />
when you log in as boxmgmt, see “Reboot Regulation” in this<br />
section.<br />
c. At the prompt, type 2 (Detach from cluster) and press Enter.<br />
Do not type y to confirm the detach. If the RA was not attached, a message is displayed<br />
stating that it is not attached.<br />
Note: Either RA 1 or RA 2 must be attached to have a cluster. RAs 3 through 8<br />
cannot become cluster masters.<br />
4. If the RA is not attached, then type B (Back) and press Enter.<br />
5. At the prompt, type 1 (Attach to cluster) to attach the RA to the cluster.<br />
6. At the prompt, type Q (Quit).<br />
7. Once the RA is attached, log in as admin to the management console and also<br />
initiate an SSH session to the management IP address to ensure that both are<br />
operational.<br />
8. At the management console, click the RAs tab and check that all connections are<br />
working.<br />
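The decision flow in steps 1 through 5 can be summarized as follows. This Python sketch is purely illustrative (the function and its inputs are assumptions; the actual checks are performed manually with ping and PuTTY as described above):<br />

```python
def next_action(ping_ok, boxmgmt_login_ok, attached):
    """Map the outcome of each check in steps 1-3 to the next action.

    ping_ok          - ping of the RA management IP address succeeded
    boxmgmt_login_ok - SSH login as boxmgmt succeeded
    attached         - the Cluster Operations menu reports the RA as attached
    """
    if not ping_ok:
        return "see Management Network Failure in Section 7"
    if not boxmgmt_login_ok:
        return "RA is probably not attached"
    if not attached:
        return "attach the RA to the cluster"
    return "verify management console and SSH access"

print(next_action(True, True, False))  # attach the RA to the cluster
```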
Section 9<br />
Solving Server Problems<br />
This section lists symptoms that usually indicate problems with one or more servers.<br />
The problems listed in this section include hardware failure problems. Table 9–1 lists<br />
symptoms and possible problems indicated by the symptom. The problems and their<br />
solutions are described in this section. The graphics, behaviors, and examples in this<br />
section are similar to what you observe with your system but might differ in some<br />
details.<br />
In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />
for any of the possible problems or causes. Also, messages similar to e-mail notifications<br />
are displayed on the management console. If you do not see the messages, they might<br />
have already dropped off the display. Review the management console logs for<br />
messages that have dropped off the display.<br />
Table 9–1. Possible Server Problems with Symptoms<br />
Symptom: The management console shows a server down. Messages on the<br />
management console show the splitter is down and that the node fails over.<br />
Multipathing software (such as EMC PowerPath Administrator) messages report<br />
errors. (This symptom might occur if the server is unable to connect with the SAN<br />
or if the server HBA fails.)<br />
Possible Problem: Cluster node failure (hardware or software) in a geographic<br />
clustered environment, possibly resulting from<br />
• Windows server reboot<br />
• Unexpected server shutdown because of a bug check<br />
• Server crash or restart<br />
• Server unable to connect with SAN<br />
• Server HBA failure<br />
Symptom: Host logs and RA log timestamps are not synchronized.<br />
Possible Problem: Infrastructure (NTP) server failure<br />
Symptom: Applications are down.<br />
Possible Problem: Server failure (hardware or software) in a geographic replication<br />
environment, possibly resulting from<br />
• Windows server reboot<br />
• Unexpected server shutdown because of a bug check<br />
• Server crash or restart<br />
• Server unable to connect with SAN<br />
• Server HBA failure<br />
Cluster Node Failure (Hardware or Software) in a<br />
Geographic Clustered Environment<br />
Problem Description<br />
MSCS uses several heartbeat mechanisms to detect whether a node is still actively<br />
responding to cluster activities. MSCS assumes a cluster node has failed when the<br />
cluster node no longer responds to heartbeats that are broadcast over the public\private<br />
cluster networks and when a SCSI reservation is lost on the quorum volume.<br />
Figure 9–1 illustrates this failure.<br />
Figure 9–1. Cluster Node Failure<br />
If the server that crashed was the MSCS leader (quorum owner), another cluster node<br />
(the challenger) tries to become leader and arbitrate for the quorum device. Because the<br />
failed server is no longer the quorum device owner in the reservation manager, the<br />
arbitration by the challenger instantly succeeds.<br />
If the challenger node is from the same site as the failed server, arbitration instantly<br />
succeeds, and no failover of the quorum device to the remote site is required.<br />
If the challenger node is from the remote site, the RA reverses the replication direction<br />
of the quorum consistency group. Once failover completes, the challenger arbitration is<br />
completed.<br />
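The arbitration outcome described above depends only on whether the challenger node shares a site with the failed leader. The following Python sketch models that behavior for illustration (the function is an assumption, not part of MSCS or the RA software):<br />

```python
def quorum_arbitration(challenger_site, failed_leader_site):
    """Model the quorum arbitration behavior described above.

    Same-site challenger: arbitration succeeds instantly, no failover.
    Remote-site challenger: the RA must first reverse the replication
    direction of the quorum consistency group (a failover).
    """
    if challenger_site == failed_leader_site:
        return "instant arbitration; no quorum failover"
    return "quorum consistency group fails over before arbitration completes"

print(quorum_arbitration("Site1", "Site2"))
```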
When a nonleader MSCS node fails, the data groups move to the remaining MSCS local<br />
or remote nodes, depending on preferred ownership settings. From the perspective of<br />
the RA, this situation is equivalent to a user-initiated move of the data groups. That is,<br />
the <strong>SafeGuard</strong> 30m Control resource on the node that tries to bring the group online<br />
sends a command to fail over the group to its site. If the group fails over to a cluster<br />
node on the same site, failover occurs instantly. Otherwise, a consistency group failover<br />
is initiated to the remote site. The <strong>SafeGuard</strong> 30m Control resource does not come<br />
online until the consistency group has completed failover.<br />
Possible Subset Scenarios<br />
The symptoms of a server failure vary based on the reasons that the server went down.<br />
Five different scenarios are described as subsets of this type of failure:<br />
• Windows Server Reboot<br />
• Unexpected Server Shutdown Because of a Bug Check<br />
• Server Crash or Restart<br />
• Server Unable to Connect with SAN<br />
• Server HBA Failure<br />
One of the first things to determine in troubleshooting a server failure is whether the<br />
failure was an unexpected event (a “crash”) or an orderly event such as an operator<br />
reboot. When the server crashes, you usually see a “blue screen” and do not have<br />
access to messages. Once the server comes up again, then you can view messages<br />
regarding the reason it crashed. These messages help diagnose the reason for the initial<br />
shutdown or failure.<br />
In an orderly event, the Windows event log is stopped, and you can view events that<br />
point to the reason for the reboot or restart.<br />
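The crash-versus-orderly distinction can be read directly from the Windows System event IDs shown in the examples later in this section (6006 means the Event log service stopped in an orderly way; 6008 reports an unexpected shutdown; 1001 reports a reboot from a bugcheck). The following Python sketch is an illustration only; the helper is an assumption, not a Unisys or Microsoft tool:<br />

```python
def classify_shutdown(event_ids):
    """Classify a server restart from the Windows System event IDs
    logged around it, using the IDs cited in this section:
      6006 - Event log service was stopped (orderly shutdown)
      6008 - previous system shutdown was unexpected
      1001 - Save Dump: rebooted from a bugcheck
    """
    if 6008 in event_ids or 1001 in event_ids:
        return "crash"
    if 6006 in event_ids:
        return "orderly"
    return "undetermined"

print(classify_shutdown([6006, 6009, 6005]))  # orderly
```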
Windows Server Reboot<br />
Problem Description<br />
The consistency groups fail over to another local node or to the other site because a<br />
server fails or goes down. In this scenario, the shutdown is an orderly event and thus<br />
causes the Windows event log service to stop.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows a server failure similar to that shown in<br />
Figure 9–2.<br />
Figure 9–2. Management Console Display with Server Error<br />
• Warning and informational messages similar to those shown in Figure 9–3 appear on<br />
the management console when a server fails. See the table after the figure for an<br />
explanation of the numbered console messages.<br />
Figure 9–3. Management Console Messages for Server Down<br />
The following table explains the numbered messages shown in Figure 9–3.<br />
Reference No. 1, Event ID 5008: The source site reports that server<br />
USMV-CAS100P2 performed an orderly shutdown.<br />
Reference No. 2, Event ID 4062: The surviving site accesses the latest image of the<br />
consistency group during the failover.<br />
Reference No. 3, Event ID 5032: For each consistency group that moves to a<br />
surviving node, the splitter is again splitting.<br />
Reference No. 4, Event ID 4008: For each consistency group that moves to a<br />
surviving node, the transfer is paused. In the details of this message, the reason for<br />
the pause is given.<br />
Reference No. 5, Event ID 1008: The Unisys <strong>SafeGuard</strong> 30m Control<br />
resource successfully issued an initiate_failover command.<br />
Reference No. 6, Event ID 4086: For each consistency group that moves to a<br />
surviving node, data transfer starts and then a quick initialization starts.<br />
Reference No. 7, Event ID 4087: For each consistency group that moves to a<br />
surviving node, initialization completes.<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
• If you review the system event logs, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images.<br />
System Event Log for Usmv-Cas100p2 Host (Failure Host on Site 1)<br />
6/01/2008 16:19:13 PM EventLog Information None 6006 N/A USMV-WEST2 The Event log<br />
service was stopped.<br />
6/01/2008 16:19:48 PM EventLog Information None 6009 N/A USMV-WEST2 Microsoft (R)<br />
Windows (R) 5.02. 3790 Service Pack 1 Multiprocessor Free.<br />
6/01/2008 16:19:48 PM EventLog Information None 6005 N/A USMV-WEST2 The Event<br />
log service was started.<br />
System Event Log for Usmv-x455 Host (Surviving Host on Site 2)<br />
6/01/2008 16:19:15 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />
communication with cluster node 'USMV-WEST2' on network '<strong>Public</strong>'.<br />
6/01/2008 16:19:15 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />
communication with cluster node 'USMV-WEST2' on network 'Private'.<br />
6/01/2008 16:19:56 PM ClusSvc Warning Node Mgr 1135 N/A USMV-EAST2 Cluster node<br />
USMV-WEST2 was removed from the active server cluster membership. Cluster service may have been<br />
stopped on the node, the node may have failed, or the node may have lost communication with the other<br />
active server cluster nodes.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the failed node owning the quorum used to generate the<br />
previous management console images:<br />
Cluster Log for Usmv-West2 Host (Failure Host on Site 1)<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [GUM] GumUpdateRemoteNode: Failed to get<br />
completion status for async RPC call, status 1115. (Error 1115: A system shutdown is in progress)<br />
0000089c.00000a54::2008/05/25-10:31:42.107 ERR [GUM] GumSendUpdate: Update on node 2 failed<br />
with 1115 when it must succeed<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [GUM] GumpCommFailure 1115 communicating<br />
with node 2<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [NM] Banishing node 1 from active<br />
cluster membership.<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [RGP] Node 1: REGROUP WARNING: reload failed.<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [NM] Halting this node due to membership or<br />
communications error. Halt code = 1.<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [CS] Halting this node to prevent an inconsistency<br />
within the cluster. Error status = 5890. (Error 5890: An operation was attempted that is incompatible with<br />
the current membership state of the node)<br />
0000091c.00000fe4:: 2008/05/25-10:31:42.107 ERR [RM] LostQuorumResource, cluster service<br />
terminated...<br />
Cluster Log for Usmv-East2 Host (Surviving Host on Site 2)<br />
00000268.00000c38::2008/05/25-10:31:42.107 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 2<br />
00000268.00000c38::2008/05/25-10:31:42.107 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 1<br />
00000268.00000b70::2008/05/25-10:31:42.107 WARN [NM] Communication was lost with interface<br />
374359a2-5782-4b1d-a863-07f84f8c97d9 (node: USMV-WEST2, network: private)<br />
00000268.00000b70::2008/05/25-10:31:42.107 INFO [NM] Updating local connectivity info for network<br />
afe1f350-f66a-460a-a526-6f58987b911d.<br />
00000268.00000b70::2008/05/25-10:31:42.107 INFO [NM] Started state recalculation timer (2400ms) for<br />
network afe1f350-f66a-460a-a526-6f58987b911d (private)<br />
00000268.00000b70::2008/05/25-10:31:42.107 WARN [NM] Communication was lost with interface<br />
15b9fbe1-c05f-4e90-b937-17fdc27c133e (node: USMV-WEST2, network: public)<br />
00000268.00000b70::2008/05/25-10:31:42.107 INFO [NM] Updating local connectivity info for network<br />
9d905035-8105-4c87-a5bc-ce82e49e764a.<br />
00000268.00000b70::2008/05/25-10:31:42.107 INFO [NM] Started state recalculation timer (2400ms) for<br />
network 9d905035-8105-4c87-a5bc-ce82e49e764a (public)<br />
00000268.000005d0:: 2008/05/25-10:31:39.733 INFO [NM] We own the quorum resource.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Check for event 5008 in the management console logs. If this event is replaced by<br />
event 5013, the host probably crashed. See “Unexpected Server Shutdown Because<br />
of a Bug Check” and “Server Crash or Restart.”<br />
• Review the cluster log and check for the system shutdown message as shown in<br />
the preceding examples. Determine whether the quorum resource moved by<br />
checking the surviving nodes for the message “We own the quorum resource.”<br />
• Review the Windows system event log messages and determine whether or not the<br />
server failure was a crash or an orderly event.<br />
In this case, based on the example messages, the Windows system event log<br />
shows that the system started the reboot or shutdown in an orderly manner at<br />
6:19:13 p.m. (message 6006). Because the event log service was shut down, the<br />
events that follow show that the event log service restarted.<br />
For an orderly event, often an operator shuts down the system for some planned<br />
reason.<br />
• If the event log messages do not point to an orderly event, then review<br />
“Unexpected Server Shutdown Because of a Bug Check” and “Server Crash or<br />
Restart” as possible scenarios that fit the circumstances.<br />
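The first check above keys on two management console events: 5008 reports an orderly server shutdown, while 5013 (splitter down unexpectedly) suggests a crash. As an illustration only (the helper is an assumption, not part of the management console), that check can be sketched as:<br />

```python
def console_shutdown_type(console_event_ids):
    """Interpret management console event IDs per the first check above:
      5008 - server performed an orderly shutdown
      5013 - splitter is down unexpectedly (probable crash)
    """
    if 5013 in console_event_ids:
        return "probable crash"
    if 5008 in console_event_ids:
        return "orderly shutdown"
    return "review collected logs further"

print(console_shutdown_type([5008, 4062, 5032]))  # orderly shutdown
```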
Unexpected Server Shutdown Because of a Bug Check<br />
Problem Description<br />
Symptoms<br />
The consistency groups fail over to another local node or to the other site because a<br />
server fails or shuts down unexpectedly and then reboots after the “blue screen” event.<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows a server failure similar to that shown in<br />
Figure 9–2.<br />
• Warning and informational messages similar to those shown in Figure 9–4 appear on<br />
the management console when a server fails. See the table after the figure for an<br />
explanation of the numbered console messages.<br />
Figure 9–4. Management Console Messages for Server Down for Bug Check<br />
The following table explains the numbered messages shown in Figure 9–4.<br />
Reference No. 1, Event ID 5013: The splitter for the server USMV-WEST2 is down<br />
unexpectedly.<br />
Reference No. 2, Event ID 4008: For each consistency group, the transfer is paused<br />
at the source (down) site. In the details of this message, the reason for the pause is<br />
given.<br />
Reference No. 3, Event ID 5002: The splitter for server USMV-WEST2 is unable to<br />
access the RA unexpectedly.<br />
Reference No. 4, Event ID 4008: For each consistency group, the transfer is paused<br />
at the surviving site to allow a switchover. In the details of this message, the reason<br />
for the pause is given.<br />
Reference No. 5, Event ID 4062: The surviving site accesses the latest image of the<br />
consistency group during the failover.<br />
Reference No. 6, Event ID 5032: For each consistency group at the surviving site,<br />
the splitter is splitting to the replication volumes.<br />
Reference No. 7, Event ID 5002: The RA at the source (down) site cannot access<br />
the splitter for server USMV-WEST2.<br />
Reference No. 8, Event ID 4010: For each consistency group at the source site, the<br />
transfer is started.<br />
Reference No. 9, Event ID 4086: For each consistency group at the source site,<br />
data transfer starts and then initialization starts.<br />
Reference No. 10, Event ID 4087: For each consistency group at the source site,<br />
initialization completes.<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
• If you review the Windows system event logs after the system reboots, you can find<br />
messages similar to the following examples that are based on the testing cases<br />
used to generate the previous management console images.<br />
System Log for Usmv-West2 Host (Failure Host on Site 1)<br />
6/01/2008 18:12:42 PM EventLog Error None 6008 N/A USMV-WEST2 The previous system<br />
shutdown at 18:02:42 PM on 6/01/2008 was unexpected.<br />
6/01/2008 18:12:42 PM EventLog Information None 6009 N/A USMV-WEST2 Microsoft (R)<br />
Windows (R) 5.02. 3790 Service Pack 1 Multiprocessor Free.<br />
6/01/2008 18:12:42 PM EventLog Information None 6005 N/A USMV-WEST2 The Event log<br />
service was started.<br />
6/01/2008 18:12:42 PM Save Dump Information None 1001 N/A USMV-WEST2 The<br />
computer has rebooted from a bugcheck. The bugcheck was: 0x0000007e (0xffffffffc0000005,<br />
0xe000015f97c8a664, 0xe000015f9e52be68, 0xe000015f9e52afb0). A dump was saved in:<br />
C:\WINDOWS\MEMORY.DMP.<br />
System Log for Usmv-East2 Host (Surviving Host on Site 2)<br />
6/01/2008 18:02:42 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />
communication with cluster node 'USMV-WEST2' on network '<strong>Public</strong>'.<br />
6/01/2008 18:02:42 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />
communication with cluster node 'USMV-WEST2' on network 'Private'.<br />
6/01/2008 18:02:42 PM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting<br />
a bus reset for device\Device\ClusDisk0.<br />
6/01/2008 18:02:42 PM ClusSvc Warning Node Mgr 1135 N/A USMV-EAST2 Cluster node<br />
USMV-WEST2 was removed from the active server cluster membership. Cluster service may have been<br />
stopped on the node, the node may have failed, or the node may have lost communication with the other<br />
active server cluster nodes.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images:<br />
Cluster Log for Usmv-West2 Host (Failure Host on Site 1)<br />
For this error situation, no entries appear in the cluster log.<br />
Cluster Log for Usmv-East2 Host (Surviving Host on Site 2)<br />
000007e0.00000138::2008/06/01-18:02:42.104 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 2<br />
000007e0.00000138:: 2008/06/01-18:02:42.104 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 1<br />
000007e0.00000124:: 2008/06/01-18:02:42.104 WARN [NM] Communication was lost with interface<br />
5019923b-d7a1-4886-825f-207b5938d11e (node: USMV-WEST2, network: <strong>Public</strong>)<br />
000007e0.00000124:: 2008/06/01-18:02:42.104 WARN [NM] Communication was lost with interface<br />
f409cf69-9c30-48f0-8519-ad5dd14c3300 (node: B)<br />
000001c0.00000664:: 2008/06/01-18:02:42.507 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170. (Error 170: the request resource is in use)<br />
000001c0.00000664:: 2008/06/01-18:02:42.507 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
000001c0.00000664:: 2008/06/01-18:02:42.507 INFO Physical Disk : [DiskArb] We are about to<br />
break reserve.<br />
000007e0.00000a0c:: 2008/06/01-18:02:42.881 INFO [NM] We own the quorum resource.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Review the Windows system event log messages to determine the cause of the<br />
unexpected event.<br />
In this case, based on the four example messages, the Windows system event<br />
log first shows event 6008, in which the system unexpectedly shut down; it was not a<br />
reboot.<br />
Then event 6009 is typically displayed as a reboot message. This event occurs<br />
regardless of the reason for the reboot. The same is true for event 6005.<br />
The Save Dump event 1001 shows that a memory dump was saved. Based on this<br />
message, consult the Microsoft Knowledge Base (http://support.microsoft.com/)<br />
regarding bug checks. Search for bug check 0x0000007e or stop error<br />
0x0000007e, replacing the stop number with the one displayed in the message.<br />
2. Once you have the appropriate Knowledge Base article from the Microsoft site,<br />
follow the recommendations in the article to resolve the issue.<br />
3. If the information from the Knowledge Base article does not resolve the<br />
problem, collect and save the memory dump file and then submit it to the Unisys<br />
Support Center.<br />
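When scripting the search described in step 1, the stop code can be pulled out of the Save Dump message text. The sketch below assumes a typical event 1001 message wording; substitute the actual text from your own Windows system event log.<br />

```python
import re

# Example Save Dump (event 1001) message text. The exact wording here is an
# assumption -- use the message from your own system event log instead.
message = ("The computer has rebooted from a bugcheck. "
           "The bugcheck was: 0x0000007e (0xc0000005, 0x00000000, "
           "0x00000000, 0x00000000). "
           "A dump was saved in: C:\\WINDOWS\\MEMORY.DMP.")

# Stop codes are written as 0x followed by eight hexadecimal digits.
match = re.search(r"0x[0-9a-fA-F]{8}", message)
stop_code = match.group(0) if match else None
print(stop_code)  # -> 0x0000007e
```

The extracted value is the string to search for in the Microsoft Knowledge Base.<br />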
Server Crash or Restart<br />
Problem Description<br />
Symptoms<br />
When the server goes down for whatever reason and then restarts in a geographic<br />
clustered environment, the consistency groups fail over to the other site and then fail<br />
over to the original site once the server is restarted.<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows a server failure similar to that shown in<br />
Figure 9–2.<br />
• Warnings and informational messages similar to those shown in Figure 9–4 appear<br />
on the management console when the server fails. See the table after that figure for<br />
an explanation of the numbered console messages.<br />
• If you review the Windows system event log, you can find messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
management console images for Figures 9–2 and 9–4:<br />
System Log for Usmv-West2 Host (Failure Host on Site 1)<br />
6/01/2008 18:42:39 PM EventLog Error None 6008 N/A USMV-WEST2 The previous system<br />
shutdown at 18:05:55 PM on 6/01/2008 was unexpected.<br />
6/01/2008 18:42:39 PM EventLog Information None 6009 N/A USMV-WEST2 Microsoft (R)<br />
Windows (R) 5.02. 3790 Service Pack 2 Multiprocessor Free.<br />
6/01/2008 18:42:39 PM EventLog Information None 6005 N/A USMV-WEST2 The Event log<br />
service was started.<br />
System Log for Usmv-East2 Host (Surviving Host on Site 2)<br />
6/01/2008 18:05:55 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />
communication with cluster node 'USMV-WEST2' on network 'Public'.<br />
6/01/2008 18:05:55 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />
communication with cluster node 'USMV-WEST2' on network 'Private'.<br />
6/01/2008 18:05:55 PM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting<br />
a bus reset for device \Device\ClusDisk0.<br />
6/01/2008 18:05:55 PM ClusSvc Warning Node Mgr 1135 N/A USMV-EAST2 Cluster node<br />
USMV-WEST2 was removed from the active server cluster membership. Cluster service may have been<br />
stopped on the node, the node may have failed, or the node may have lost communication with the other<br />
active server cluster nodes.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the management<br />
console images for Figures 9–2 and 9–4:<br />
Cluster Log for Usmv-West2 Host (Failure Host on Site 1)<br />
For this error situation, no entries appear in the cluster log.<br />
Cluster Log for Usmv-East2 Host (Surviving Host on Site 2)<br />
000007e0.00000138::2008/06/01-18:05:55.102 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 2<br />
000007e0.00000138:: 2008/06/01-18:05:55.102 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 1<br />
000007e0.00000124:: 2008/06/01-18:05:55.102 WARN [NM] Communication was lost with interface<br />
5019923b-d7a1-4886-825f-207b5938d11e (node: USMV-WEST2, network: Public)<br />
000007e0.00000124:: 2008/06/01-18:05:55.102 WARN [NM] Communication was lost with interface<br />
f409cf69-9c30-48f0-8519-ad5dd14c3300 (node: USMV-WEST2, network: Private LAN)<br />
000001c0.00000168:: 2008/06/01-18:05:55.504 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170. (Error 170: the requested resource is in use)<br />
000001c0.00000168:: 2008/06/01-18:05:55.504 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
000001c0.00000168:: 2008/06/01-18:05:55.504 INFO Physical Disk : [DiskArb] We are about to<br />
break reserve.<br />
000007e0.00000764:: 2008/06/01-18:05:55.079 INFO [NM] We own the quorum resource.<br />
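When scanning large cluster logs for entries like those above, a small parser helps isolate the WARN and ERR lines. This is a sketch; the line layout is inferred from the examples in this guide.<br />

```python
import re

# Cluster log lines have the form:
#   PID.TID:: YYYY/MM/DD-HH:MM:SS.mmm LEVEL [Component] message
LINE = re.compile(
    r"(?P<pid>[0-9a-f]+)\.(?P<tid>[0-9a-f]+)::\s*"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s+(?P<rest>.*)")

def failures(log_text):
    """Return (timestamp, level, message) for WARN and ERR entries only."""
    out = []
    for line in log_text.splitlines():
        m = LINE.match(line.strip())
        if m and m.group("level") in ("WARN", "ERR"):
            out.append((m.group("ts"), m.group("level"), m.group("rest")))
    return out

# Single-line versions of entries shown in this section.
sample = """000007e0.00000138:: 2008/06/01-18:05:55.102 INFO [ClMsg] Received interface unreachable event for node 1 network 2
000007e0.00000124:: 2008/06/01-18:05:55.102 WARN [NM] Communication was lost with interface 5019923b-d7a1-4886-825f-207b5938d11e (node: USMV-WEST2, network: Public)
000001c0.00000168:: 2008/06/01-18:05:55.504 ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 170."""

for entry in failures(sample):
    print(entry)  # prints only the WARN and ERR entries
```
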
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Run the Microsoft Product Support MPS Report Utility to gather system information.<br />
(See “Using the MPS Report Utility” in Appendix A.)<br />
2. Submit the MPS report to the Unisys Support Center.<br />
Server Unable to Connect with SAN<br />
Problem Description<br />
Symptoms<br />
The server is unable to connect to the SAN.<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows a server failure similar to that shown in<br />
Figure 9–5.<br />
Figure 9–5. Management Console Display Showing LA Site Server Down<br />
To display more information about the error, click More in the right column. A<br />
message similar to the following is displayed:<br />
ERROR: Splitter USMV-WEST2 is down<br />
• Warnings and informational messages similar to those shown in Figure 9–6 appear<br />
on the management console when the server fails. See the table after the figure for<br />
an explanation of the numbered console messages.<br />
Figure 9–6. Management Console Images Showing Messages for Server Unable to<br />
Connect to SAN<br />
The following table explains the numbered messages in Figure 9–6.<br />
Reference No. 1, Event ID 5013: The splitter for the server USMV-WEST2 is down.<br />
Reference No. 2, Event ID 4008: For each consistency group at the failed site, the<br />
transfer is paused to allow a failover to the surviving site.<br />
Reference No. 3, Event ID 4008: For each consistency group, the transfer is paused at<br />
the surviving site to allow a failover. In the details of this message, the reason for the<br />
pause is given.<br />
Reference No. 4, Event ID 5002: The splitter for the server USMV-WEST2 is unable to<br />
access the RA.<br />
Reference No. 5, Event ID 4010: The consistency groups on the original failed site start<br />
data transfer.<br />
Reference No. 6, Event ID 4086: For each consistency group at the failed site, data<br />
transfer starts and then initialization starts.<br />
Reference No. 7, Event ID 4087: For each consistency group at the failed site, data<br />
transfer completes.<br />
• The multipathing software (EMC PowerPath Administrator) flashes a red X on the<br />
right side of the toolbar.<br />
• The PowerPath Administrator Console reports failures similar to those shown in<br />
Figure 9–7.<br />
Figure 9–7. PowerPath Administrator Console Showing Failures<br />
• If you review the server system event log, you can find error messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
previous management console images.<br />
Type : warning<br />
Source : Ftdisk<br />
EventID : 57<br />
Description : The system failed to flush data to the transaction log. Corruption may occur.<br />
Type : error<br />
Source : Emcpbase<br />
EventID : 100<br />
Description : Path Bus x Tgt y LUN z to APMxxxx is dead<br />
Event 100 appears numerous times, once for each bus, target, and LUN.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. At the server, run a tool such as the PowerPath Administrator that might aid in<br />
diagnosing the problem.<br />
2. Log in to the storage software and determine whether problems are reported. If so,<br />
use the information for that software to correct the problems.<br />
Something might have happened to the volume, or the zoning configuration on the<br />
switch might have been changed. Also, a connection issue could exist such as a<br />
fabric switch or storage cable failure.<br />
3. If the problem is not limited to one server, run the Installation Manager Fibre<br />
Channel diagnostics. Appendix C explains how to run the Installation Manager<br />
diagnostics and provides information about the various diagnostic capabilities.<br />
4. If the problem still appears at the host, an adapter with multiple ports might have<br />
failed. Replace the Fibre Channel adapter in the host if the storage, zoning, and<br />
cabling appear correct. Ensure that the storage and zoning are updated to use the<br />
new WWN as necessary. (See “Server HBA Failure” for resolution actions.)<br />
Server HBA Failure<br />
Problem Description<br />
Symptoms<br />
One HBA in the server failed on a host that has multiple paths to storage.<br />
The following symptoms might help you identify this failure:<br />
• The multipathing software (such as EMC PowerPath Administrator) flashes a red X<br />
on the right side of the toolbar.<br />
• The PowerPath Administrator console reports failures similar to those shown in<br />
Figure 9–8.<br />
Figure 9–8. PowerPath Administrator Console Showing Adapter Failure<br />
• If you review the server system event log, you can find error messages similar to the<br />
following example:<br />
Type : error<br />
Source : Emcpbase<br />
EventID : 100<br />
Description:<br />
Path Bus x Tgt y LUN z to APMxxxx is dead<br />
Event 100 appears numerous times, once for each target and LUN.<br />
Actions to Resolve the Problem<br />
To replace an HBA in the server, perform the following steps:<br />
1. Run Emulex HBAnywhere and record the WWNs in use by the server.<br />
2. Shut down the server.<br />
3. Replace the failed HBA and then boot the server.<br />
4. Run Emulex HBAnywhere and record the new WWN.<br />
5. Using the SAN switch management software, modify the zoning as needed to<br />
replace the failed WWN with the new WWN.<br />
6. If manual discovery was used for the storage, update the configuration to use the<br />
new WWN.<br />
Infrastructure (NTP) Server Failure<br />
Problem Description<br />
Symptoms<br />
The replication environment is not affected by an NTP server failure; however, the<br />
timestamps of log entries are affected.<br />
The following symptoms might help you identify the failure:<br />
• When comparing log entries of a failover, the host application log and the<br />
management console entries are not synchronized.<br />
• You are unable to run the synchronization diagnostics as described in the Unisys<br />
SafeGuard Solutions Replication Appliance Installation Guide.<br />
Actions to Resolve the Problem<br />
To resolve an NTP server failure, perform the following steps:<br />
1. Temporarily change the cluster mode for a data consistency group to MSCS<br />
manual (for a group replicating from the source site to the target site).<br />
2. Perform a move-group operation on a cluster group that contains a Unisys SafeGuard<br />
Control resource to a node at the target site.<br />
3. View the management console log for event 1009 as shown in Figure 9–9.<br />
Figure 9–9. Event 1009 Display<br />
4. View the host application event log for event 1115, as follows:<br />
Event Type : Warning<br />
Event Source : 30mControl<br />
Event Category : None<br />
Event ID : 1115<br />
Date : 9/10/2006<br />
Time : 12:09:04 PM<br />
User : N/A<br />
Computer : USMV-EAST2<br />
Description:<br />
Resource name: Data1<br />
Online resource failed.<br />
Group is not a MSCS auto-data group (5).<br />
Action: Verify through the Management Console that the Global cluster mode is set to MSCS auto-data.<br />
Or if doing manual recovery, ensure an image has been selected.<br />
RA CLI command: UCLI -ssh -l plugin -pw **** -superbatch 172.16.7.60 initiate_failover group=Data1<br />
active_site=East cluster_owner=USMV-EAST2<br />
5. Compare the timestamps.<br />
If the time between the timestamps is not within a couple of minutes, the host and<br />
RAs are not synchronized.<br />
6. Use the Installation Manager site connectivity IP diagnostic by performing the<br />
following steps. (For more information, see Appendix C.)<br />
a. Log in to an RA as user boxmgmt with the password boxmgmt.<br />
b. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.<br />
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.<br />
e. When asked to select a target for the tests, type 5 (Other host) and press<br />
Enter.<br />
f. Enter the IP address for the NTP server that you want to test.<br />
Note: In step e, you must specify 5 (Other host) rather than 4 (NTP Server),<br />
because site 2 does not specify an NTP server in the configuration, and the test<br />
will fail if you use 4 (NTP Server).<br />
7. If the NTP server fails, check that the NTP service on the NTP server is functioning<br />
correctly.<br />
8. Use the Installation Manager port diagnostics IP diagnostic to ensure that no ports<br />
are blocked. (For more information about running port diagnostics, see Appendix C.)<br />
9. Check that the NTP server specified for the host is the same NTP server specified<br />
for the RAs at site 1. (If you want to view the RA configuration settings, use the<br />
Installation Manager Setup View capability. For information about that capability,<br />
refer to the Unisys SafeGuard Solutions Replication Appliance Installation Guide.)<br />
10. Repeat steps 1 through 5, choosing a group that moves from the target site to<br />
the source site.<br />
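The timestamp comparison in step 5 can be sketched in a few lines. This is a sketch; the two-minute tolerance follows the "within a couple of minutes" guidance above, and the example times are illustrative.<br />

```python
from datetime import datetime

def synchronized(host_ts, console_ts, tolerance_minutes=2):
    """Return True if two event timestamps agree to within the tolerance."""
    delta = abs((host_ts - console_ts).total_seconds())
    return delta <= tolerance_minutes * 60

# Host application event log time vs. management console log time
# (illustrative values based on the event 1115 example above).
host_time = datetime(2006, 9, 10, 12, 9, 4)       # 12:09:04 PM
console_time = datetime(2006, 9, 10, 12, 10, 30)  # 12:10:30 PM

print(synchronized(host_time, console_time))  # -> True: within two minutes
```

If the result is False, the host and RAs are not synchronized, and the NTP diagnostics in step 6 apply.<br />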
Server Failure (Hardware or Software) in a<br />
Geographic Replication Environment<br />
Problem Description<br />
When a server goes down in a geographic replication environment, the circumstances<br />
and Windows event log messages are similar to those for the server failure in a<br />
geographic clustered environment. That is, the five subset scenarios previously<br />
presented apply as far as the event log messages and actions to resolve are concerned.<br />
The primary difference is that the main symptom of the server failure in this environment<br />
is that the user applications fail.<br />
Refer to the previous five subset scenarios for more details.<br />
Section 10<br />
Solving Performance Problems<br />
This section lists symptoms that usually indicate performance problems. Table 10–1 lists<br />
symptoms and possible problems indicated by the symptom. The problems and their<br />
solutions are described in this section. This section also includes a general discussion of<br />
high-load events. The graphics, behaviors, and examples in this section are similar to what<br />
you observe with your system but might differ in some details.<br />
The management console provides graphs that you can use to evaluate performance.<br />
For more information, see the Unisys SafeGuard Solutions Replication Appliance<br />
Administrator’s Guide.<br />
In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />
for the possible problems. Also, messages similar to e-mail notifications are displayed on<br />
the management console. If you do not see the messages, they might have already<br />
dropped off the display. Review the management console logs for messages that have<br />
dropped off the display.<br />
Table 10–1. Possible Performance Problems with Symptoms<br />
Symptom: The initialization progression indicator (%) in the management interface<br />
progresses significantly slower than expected, or initialization completes after a<br />
significantly longer period of time than expected.<br />
Possible problem: Slow initialization<br />
Symptom: The event log indicates that the disk manager has reported high load<br />
conditions for a specific consistency group or groups. A consistency group or groups<br />
start to initialize; this initialization can occur once or multiple times, depending on the<br />
circumstances.<br />
Possible problem: High load (disk manager)<br />
Symptom: The event log indicates that the distributor has reported high load conditions<br />
for a specific consistency group or groups. A consistency group or groups start to<br />
initialize; this initialization can occur once or multiple times, depending on the<br />
circumstances.<br />
Possible problem: High load (distributor)<br />
Symptom: Applications are offline for a lengthy period during changes in the replication<br />
direction.<br />
Possible problem: Failover time lengthens<br />
Slow Initialization<br />
Problem Description<br />
Symptoms<br />
Initialization of a consistency group or groups takes longer than expected.<br />
Progression of initialization is reported through the management console in percentages.<br />
You might notice that the percentage for a group has not progressed in a long time or<br />
progresses at a slow rate. This progression might or might not be normal depending on<br />
several factors.<br />
For some groups, it might be natural to take a long time to advance to the next<br />
percentage. One percent of 10 TB is much larger than one percent of 100 GB; therefore,<br />
larger groups would take longer to advance in initialization.<br />
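The effect of group size on progress can be checked with simple arithmetic. This is a sketch; the 50 MB/s throughput figure is an illustrative assumption, not a measured or documented rate.<br />

```python
def seconds_per_percent(group_size_gb, throughput_mb_per_s):
    """Time, in seconds, to advance one percentage point of initialization."""
    one_percent_mb = group_size_gb * 1024 / 100  # 1% of the group, in MB
    return one_percent_mb / throughput_mb_per_s

# At an assumed 50 MB/s effective initialization rate:
small = seconds_per_percent(100, 50)         # 100 GB group
large = seconds_per_percent(10 * 1024, 50)   # 10 TB group

print(round(small), "s vs", round(large / 60), "min per percentage point")
```

A 10 TB group takes roughly 100 times longer per percentage point than a 100 GB group, so a slowly moving indicator on a large group is not by itself a problem.<br />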
The following symptoms might help you identify this failure:<br />
• The initialization progression indicator (%) in the management interface progresses<br />
significantly slower than expected.<br />
• Initialization completes after a significantly longer period of time than expected.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Verify the bandwidth of the connection between sites using the Installation Manager<br />
network diagnostic tools to test the WAN speed while there is no traffic over the<br />
WAN. Appendix C explains how to run these diagnostics.<br />
• Use the Installation Manager Fibre Channel diagnostic tools or customer<br />
storage/SAN diagnostic tools to test the performance of the source and target<br />
storage LUNs to ensure that all storage LUNs are capable of handling the observed<br />
load. Appendix C explains how to run the Installation Manager diagnostics.<br />
If storage performance on either site is poor, the replication system could be limited<br />
in its ability to read from the replication volumes on the source site or to write to the<br />
journal volume on the remote site. Poor storage performance reduces the maximum<br />
speed at which the RAs can initialize.<br />
• Verify that no bandwidth limitation exists on the relevant group or groups properties.<br />
• Use the event log to verify that no other events occurred during initialization—for<br />
example, high load conditions, WAN disconnections, or storage disconnections—that<br />
could have caused the initialization to restart.<br />
• Diagnosis of these types of problems is usually specific to the environment. Collect<br />
RA logs and submit a service request to Unisys support if the cause of slow<br />
initialization cannot be determined through the actions given above. See Appendix A<br />
for information about collecting logs.<br />
General Description of High-Load Event<br />
A high-load event reports that, at the time of the event, a bottleneck existed in the<br />
replication process. To keep track of the changes being made during the bottleneck, the<br />
replication goes into “marking mode” and records the location of all changed data on the<br />
source replication volume until the activity causing the bottleneck has subsided.<br />
The three possible points at which a bottleneck might occur are<br />
• Between the host and RA—Disk Manager<br />
Of the three points, this one is the least likely to cause a bottleneck. This type of<br />
bottleneck occurs when the host is writing to the storage device faster than the RA<br />
can handle.<br />
• The WAN<br />
This type of bottleneck occurs when the host is writing to the storage device faster<br />
than the RAs can replicate over the available bandwidth. For example, a host is<br />
writing to the storage device during peak hours at a rate of 60 Mbps. The RAs<br />
compress this data down to 15 Mbps. The available bandwidth is 10 Mbps. Clearly,<br />
during peak hours, the bandwidth is not sufficient to support the write rate;<br />
therefore, during peak hours, a number of high load events occur.<br />
• The remote storage—Distributor<br />
This type of bottleneck occurs when the storage device containing the journal<br />
volume on the remote site cannot keep up with the speed that the data is being<br />
replicated to the remote site. To avoid this situation, configure the journal volume on<br />
the fastest possible LUNs using the fastest RAID and the most disk spindles. Also,<br />
use multiple journal volumes located on different physical disks in the storage array<br />
or use separate disk subsystems in the same consistency group so that the<br />
replication can perform an additional layer of striping. The replication stripes the<br />
images across these multiple journal volumes.<br />
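The WAN example above (60 Mbps of writes compressed to 15 Mbps against 10 Mbps of bandwidth) can be expressed as a quick sufficiency check. This is a sketch using the figures from this section; the 4:1 compression ratio is inferred from the 60-to-15 Mbps example.<br />

```python
def wan_sufficient(write_rate_mbps, compression_ratio, available_mbps):
    """Return (bandwidth required after compression, whether the link keeps up)."""
    required = write_rate_mbps / compression_ratio
    return required, required <= available_mbps

# Peak-hours example from this section: 60 Mbps of writes, compressed 4:1.
required, ok = wan_sufficient(60, 4, 10)
print(required, ok)  # -> 15.0 False: expect high-load events during the peak
```

Whenever the required rate exceeds the available bandwidth, the replication falls into marking mode and high-load events are logged until the peak subsides.<br />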
High-Load (Disk Manager) Condition<br />
Problem Description<br />
Symptoms<br />
The disk manager reports high-load conditions.<br />
The following symptoms might help you identify this failure:<br />
• The event log indicates that the disk manager reported high load conditions for a<br />
specific consistency group or groups (event ID 4019).<br />
• A consistency group or groups start to initialize. This initialization can occur once or<br />
multiple times, depending on the circumstances.<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Use the Installation Manager network diagnostic tools to test the WAN speed while<br />
there is no traffic over the WAN. Appendix C explains how to run these diagnostics.<br />
• Analyze the performance data for the consistency groups on the RA to ensure that<br />
the incoming write rate is not outside the limits of the available bandwidth or the<br />
capabilities of the RA.<br />
• High loads can occur naturally during traffic peaks or during periods of high external<br />
activity on the WAN. If the high load events occur infrequently or can be associated<br />
with a temporal peak, consider this behavior as normal.<br />
• Diagnosis of these types of problems is usually specific to the environment. Collect<br />
RA logs and submit a service request to the Unisys Support Center if the high load<br />
events occur frequently and you cannot resolve the problem through the actions<br />
previously listed. See Appendix A for information about collecting logs.<br />
High-Load (Distributor) Condition<br />
Problem Description<br />
Symptoms<br />
The distributor reports high-load conditions.<br />
The following symptoms might help you identify this failure:<br />
• The event log indicates that the distributor reported high load conditions for a<br />
specific consistency group or groups.<br />
• A consistency group or groups start to initialize. This initialization can occur once or<br />
multiple times, depending on the circumstances.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Use the Installation Manager Fibre Channel diagnostic tools or customer storage or<br />
SAN diagnostic tools to test the performance of the target-site storage LUNs.<br />
Appendix C explains how to run the Installation Manager diagnostics.<br />
• Analyze the WAN performance of the consistency group or groups, and ensure that<br />
loads are not too high for handling by the target-site storage devices.<br />
• High loads can occur naturally during traffic peaks. If the high-load events occur<br />
infrequently or can be associated with a temporal peak, consider this behavior as<br />
normal.<br />
• Diagnosis of these types of problems is usually specific to the environment. Collect<br />
RA logs and submit a service request to the Unisys Support Center if the high-load<br />
events occur frequently and you cannot resolve the problem through the actions<br />
previously listed. See Appendix A for information about collecting logs.<br />
Failover Time Lengthens<br />
Problem Description<br />
Symptoms<br />
Prior to changing the replication direction, the images must be distributed to the<br />
target-site volumes. The applications are not available during this process.<br />
Applications are offline for a lengthy period during changes to the replication direction.<br />
Actions to Resolve the Problem<br />
Refer to the Unisys SafeGuard Solutions Planning and Installation Guide for more<br />
information on pending timeouts.<br />
Appendix A<br />
Collecting and Using Logs<br />
Whenever a failure occurs, you might need to collect and analyze log information to<br />
assist in diagnosing the problem. This appendix presents information on the following<br />
tasks:<br />
• Collecting RA logs<br />
• Collecting server (host) logs<br />
• Analyzing RA log collection files<br />
• Analyzing server (host) logs<br />
• Analyzing intelligent fabric switch logs<br />
Collecting RA Logs<br />
When you collect logs from one RA, you automatically collect logs from all other RAs and<br />
from the servers. Occasionally, you might need to collect logs from the servers (hosts)<br />
manually. Refer to “Collecting Server (Host) Logs” later in this appendix for more<br />
information.<br />
Each time you complete a log collection, the files are saved for a maximum of 7 days.<br />
The length of time the files remain available depends on the size and number of log<br />
collections performed. To ensure that you have the log files that you need, download and<br />
store the files locally. Log files with dates older than 7 days from the current date are<br />
automatically removed.<br />
To collect the RA logs, perform the following procedures:<br />
1. Set the Automatic Host Info Collection option<br />
2. Test FTP connectivity<br />
3. Determine when the failure occurred<br />
4. Convert local time to GMT or UTC<br />
5. Collect logs from the RA<br />
Setting the Automatic Host Info Collection Option<br />
Perform the following steps to set the Automatic Host Info Collection Option:<br />
1. In the Management Console, on the System menu, select System Settings.<br />
The System Settings page appears.<br />
2. Choose the Automatic Host Info Collection option from Miscellaneous<br />
Settings.<br />
For more information, refer to the Unisys SafeGuard Solutions Planning and Installation<br />
Guide.<br />
Testing FTP Connectivity<br />
To test FTP connectivity, perform the following steps on the management PC. The<br />
information you provide depends on whether logs are being collected locally on an FTP<br />
server or sent to an FTP server at the Unisys Product Support site.<br />
1. To initiate an FTP session, type FTP at a command prompt. Press Enter.<br />
2. Type Open. Press Enter.<br />
3. At the To prompt, enter one of the following and then press Enter:<br />
• ftp.ess.unisys.com (the Unisys FTP address)<br />
• Your local FTP server IP address<br />
4. At the User prompt, enter one of the following and then press Enter:<br />
• FTP, if you specified the Unisys FTP address<br />
• Your local FTP user account<br />
5. At the Password prompt, enter one of the following and then press Enter:<br />
• Your Internet e-mail address if you specified the Unisys FTP address<br />
• Your local FTP account password<br />
6. Type bye and press Enter to log out.<br />
Determining When the Failure Occurred<br />
Perform the following steps to determine when the failure occurred:<br />
Note: If you cannot determine the failure time from the RA logs, use the Windows<br />
event logs on each server (host) to determine the failure time.<br />
1. Select the Logs tab from the navigation pane in the Management Console.<br />
A list of events is displayed. Each event entry includes a Level column that indicates<br />
the severity of the event.<br />
If necessary, click View and select Detailed.<br />
2. Scan the Description column to find the event for which you want to gather logs.<br />
3. Select the event and click the Filter Log option.<br />
The Filter Log dialog box appears.<br />
4. Select any option from the scope list (normal, detailed, or advanced) and the<br />
level list (info, warning, or error).<br />
5. Write down the timestamp that is displayed for the event, and note your local<br />
time zone. You must convert the displayed time to GMT, also called Coordinated<br />
Universal Time (UTC). This timestamp is used to calculate the start date and end<br />
time for log collection.<br />
6. Click OK.<br />
Converting Local Time to GMT or UTC<br />
Perform the following steps to convert the time in which the failure occurred to GMT or<br />
UTC. You need the time zone you wrote down in the preceding procedure.<br />
1. In Windows Control Panel, click Date and Time.<br />
2. Select the Time Zone tab.<br />
3. Look in the list for the GMT or UTC offset value corresponding to the time zone you<br />
wrote down in the procedure “Determining When the Failure Occurred.” The offset<br />
value represents the number of hours that the time zone is ahead or behind GMT or<br />
UTC.<br />
4. Add or subtract the GMT or UTC offset value from the local time.<br />
Example<br />
If the time zone is Pacific Standard Time, the GMT or UTC offset value is –8:00. If the<br />
time in which the failure occurred is 13:30, then GMT or UTC is 21:30.<br />
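The manual offset arithmetic above can be reproduced with Python's standard zoneinfo module. This is a sketch: America/Los_Angeles is the IANA name for the Pacific zone, and a January date is used so that the standard-time offset of –8:00 from the worked example applies.<br />

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Failure observed at 13:30 local time, Pacific Standard Time (UTC-8).
local = datetime(2008, 1, 15, 13, 30,
                 tzinfo=ZoneInfo("America/Los_Angeles"))
utc = local.astimezone(timezone.utc)

print(utc.strftime("%H:%M"))  # -> 21:30, matching the worked example
```

Letting the time-zone database do the conversion also handles daylight saving time automatically, which the fixed-offset lookup in Control Panel does not.<br />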
Collecting RA Logs<br />
Use the Installation Manager, which is a centralized collection tool, to collect logs from<br />
all accessible RAs, servers (hosts), and intelligent fabric switches.<br />
Before you begin log collection, determine the failure date and time. If you have SANTap<br />
switches and want to collect information from the switches, know the user name and<br />
password to access the switches.<br />
To collect RA logs, perform the following steps:<br />
1. Start the SSH client by performing the steps in “Using the SSH Client” in<br />
Appendix C. Use the site management IP address; log in with boxmgmt as the login<br />
user name and boxmgmt as the password.<br />
2. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />
3. On the Diagnostics menu, type 4 (Collect system info) and press Enter.<br />
6872 5688–002 A–3
4. When prompted, provide the following information. Press Enter after each item.<br />
(The program displays date and time in GMT/UTC format.)<br />
a. Start date: This date specifies how far back the log collection is to start. Use<br />
the MM/DD/YYYY format. Do not accept the default date; the date should be at<br />
least 2 days earlier than the current date. The collection range must include the date<br />
and time at which the failure occurred.<br />
b. Start time: This time specifies the GMT/UTC in which log collection is to start.<br />
Use the HH:MM:SS format.<br />
c. End date: This date specifies when log collection is to end. Accept the default<br />
date, which is the current date.<br />
d. End time: This time specifies when log collection is to end. Accept the default<br />
time, which is the current time.<br />
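The prompts in step 4 expect GMT/UTC values in MM/DD/YYYY and HH:MM:SS formats, with a start at least 2 days back that still covers the failure time. A rough Python sketch of that window (the function name and the exact policy encoding are assumptions, not product behavior):

```python
from datetime import datetime, timedelta, timezone

def collection_window(failure_utc, days_back=2):
    """Return (start_date, start_time, end_date, end_time) strings in the
    formats the Collect system info prompts expect. The start is the
    earlier of the failure time and `days_back` days before now, so it
    both covers the failure and is at least 2 days back; the end is the
    current GMT/UTC date and time."""
    now = datetime.now(timezone.utc)
    start = min(failure_utc, now - timedelta(days=days_back))
    return (start.strftime("%m/%d/%Y"), start.strftime("%H:%M:%S"),
            now.strftime("%m/%d/%Y"), now.strftime("%H:%M:%S"))
```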
5. Type y to collect information from the other site.<br />
6. Type y or n, and press Enter when asked about sending the results to an FTP<br />
server.<br />
If you choose not to send the results to an FTP server, skip to step 8. The results are<br />
stored at the URL http://&lt;site management IP address&gt;/info/. You can access the<br />
collected results by logging in with webdownload as the login name and<br />
webdownload as the password. (If your system is set for secure Web<br />
transactions, then the URL begins with https://.)<br />
If you choose to send the results to an FTP server and the procedure has been<br />
performed previously, all of the information is filled in. If not, provide the following<br />
information for the management PC:<br />
a. When prompted for the FTP server, type one of the following and then press<br />
Enter.<br />
• The IP address of the Unisys Product Support FTP server, 192.61.61.78, or<br />
ftp.ess.unisys.com<br />
• The IP address of your local FTP server<br />
b. Press Enter to accept the default FTP port number, or type a different port<br />
number if you are using a management PC with a nonstandard port number.<br />
c. Type the local user account when prompted for the FTP user name. Press<br />
Enter.<br />
d. If you are using the Unisys FTP server, type incoming as the folder name of<br />
the FTP location in which to store the collected information. Press Enter.<br />
If you are using a local FTP server, press Enter for none.<br />
e. Type a name for the file on the FTP server in the following format:<br />
&lt;case number&gt;_&lt;company name&gt;.tar<br />
Example: 19557111_Company1.tar<br />
Note: If no name is specified, the name will be similar to the following:<br />
sysInfo-&lt;RA identifiers&gt;-hosts-from-&lt;RA identifiers&gt;-&lt;timestamp&gt;.tar<br />
Example: sysInfo-l1-l2-r1-r2-hosts-from-l1-r1-2006.08.17.16.28.31.tar<br />
f. Type the appropriate password. Press Enter.<br />
7. On the Collection mode menu, type 3 (RAs and hosts) and press Enter.<br />
Note: The “hosts” part of this menu selection (RAs and hosts) collects intelligent<br />
fabric switch information.<br />
8. Type y or n, and press Enter when asked if you have SANTap switches from which<br />
you want to collect information.<br />
If you do not have SANTap switches, go to step 10.<br />
If you want to collect information from SANTap switches, enter the user name and<br />
password to access the switch when prompted.<br />
9. Type n if prompted on whether to perform a full collection, unless otherwise<br />
instructed by a Unisys service representative.<br />
10. Type n when prompted to limit collection time.<br />
The collection program checks connectivity to all RAs and then displays a list of the<br />
available hosts and SANTap switches from which to collect information.<br />
11. Type All and press Enter.<br />
The Installation Manager shows the collection progress and reports that it<br />
successfully collected data. This collection might take several minutes. Once the<br />
data collection completes, a message indicates that the collected information is<br />
available at the FTP server you specified or at the URL (http://&lt;site management IP address&gt;/info/ or https://&lt;site management IP address&gt;/info/).<br />
12. Press Enter.<br />
13. On the Diagnostics dialog box, type Q and press Enter to exit the program.<br />
14. Type Y when prompted to quit and press Enter.<br />
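Step 6e names the upload file after a number and a company, and the collector otherwise builds a default sysInfo name from the RA lists and a timestamp. A hedged sketch of both naming patterns (function names are illustrative; the placeholder fields are inferred from the examples shown in step 6e):

```python
from datetime import datetime

def ftp_file_name(case_number, company):
    """Upload name in the style of the example 19557111_Company1.tar."""
    return f"{case_number}_{company}.tar"

def default_sysinfo_name(ras, from_ras, when):
    """Default name the collector builds when none is given, e.g.
    sysInfo-l1-l2-r1-r2-hosts-from-l1-r1-2006.08.17.16.28.31.tar"""
    stamp = when.strftime("%Y.%m.%d.%H.%M.%S")
    return "sysInfo-{}-hosts-from-{}-{}.tar".format(
        "-".join(ras), "-".join(from_ras), stamp)

print(ftp_file_name("19557111", "Company1"))  # 19557111_Company1.tar
```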
Verifying the Results<br />
• Ensure that “Failed for hosts” has no entries. The success or failure entries might be<br />
listed multiple times.<br />
For the collection to be successful for hosts and intelligent fabric switches, all entries<br />
must indicate “Succeeded for hosts.”<br />
For the collection to be successful for RAs, all entries must indicate “Collected data<br />
from &lt;RA name&gt;.”<br />
• There is a 20-minute timeout on the collection process for RAs. There is a 15-minute<br />
timeout on the collection process for each host.<br />
• If the collection from the remote site failed because of a WAN failure, run the<br />
process locally at the remote site.<br />
• If the connection with an RA is lost while the collection is in process, no<br />
information is collected. Run the process again.<br />
• If you transferred the data by FTP to a management PC, you can transfer the<br />
collected data to the Unisys Product Support Web site at your convenience.<br />
Otherwise, if you are connected to the Unisys Product Support Web site, the<br />
collected data is transferred automatically to this Web site.<br />
• If you use the Web interface, you must download the collected data to the<br />
management PC and then transfer the collected data to the Unisys Product Support<br />
Web site at your convenience.<br />
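The success checks above can be applied mechanically to captured collector output; this is a minimal sketch, not a supported tool:

```python
def collection_succeeded(output_lines):
    """True only if no 'Failed for hosts' entries appear and at least one
    success entry ('Succeeded for hosts' for hosts and intelligent fabric
    switches, 'Collected data from' for RAs) is present."""
    failed = [l for l in output_lines if "Failed for hosts" in l]
    ok = [l for l in output_lines
          if "Succeeded for hosts" in l or "Collected data from" in l]
    return not failed and bool(ok)
```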
Collecting Server (Host) Logs<br />
Use the following utilities to collect log information:<br />
• MPS Report Utility<br />
• Host information collector (HIC) utility<br />
Using the MPS Report Utility<br />
Use the Microsoft MPS Report Utility to collect detailed information about the current<br />
host configuration. You must have administrative rights to run this utility.<br />
Unisys uses the cluster (MSCS) version of this utility if that version is available from<br />
Microsoft. This version of the utility enables you to gather cluster information as well as<br />
the standard Microsoft information. If the server is not clustered, the utility still runs, but<br />
the cluster files in the output are blank.<br />
The average time for the utility to complete is between 5 and 20 minutes. It might take<br />
longer if you run the utility during peak production time.<br />
You can download the MPS Report Utility from the Unisys FTP server at the following<br />
location: (You are not prompted for a username or password.)<br />
ftp://ftp.ntsupport.unisys.com/outbound/MPS-REPORTS/<br />
Select one of the following directories, depending on your operating system<br />
environment:<br />
• 32-BIT<br />
• 64-BIT-IA64<br />
• 64-BIT-X64 (not a clustered version)<br />
Output Files<br />
Individual output files are created by using the following directory structure. Depending<br />
on the MPS Report version, the file name and directory name might vary.<br />
Directory: %systemroot%\MPSReports, typically C:\windows\MPSReports<br />
File name: %COMPUTERNAME%_MPSReports_xxx.CAB<br />
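Under the stated assumption about the directory layout (which, as noted, varies by MPS Report version), the expected output location can be composed as follows; `mps_report_path` is a hypothetical helper:

```python
import ntpath  # Windows-style path handling, usable on any platform

def mps_report_path(systemroot, computername):
    """Expected MPS Report output location, built from %systemroot% and
    %COMPUTERNAME%; the exact file and directory names vary by version."""
    return ntpath.join(systemroot, "MPSReports",
                       f"{computername}_MPSReports_xxx.CAB")
```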
Using the Host Information Collector (HIC) Utility<br />
Note: You can skip this procedure unless directed to complete it by the Unisys support<br />
personnel. Host log collection occurs automatically if the Automatic Host Info Collection<br />
option on the System menu of the management console is selected.<br />
Perform the following steps to collect log information from the hosts:<br />
1. At the command prompt on the host, change to the appropriate directory depending<br />
on your system:<br />
• For 32-bit and Intel Itanium 2-based systems, enter<br />
cd C:\Program Files\KDriver\hic<br />
• For x64 systems, enter<br />
cd C:\Program Files (x86)\KDriver\hic<br />
2. Type one of the following commands:<br />
• host_info_collector –n (noninteractive mode)<br />
• host_info_collector (interactive mode)<br />
If you choose the interactive mode command, provide the following site information:<br />
• Account ID: Click System Settings on the System menu of the<br />
Management Console, and click on Account Settings in the System<br />
Settings dialog box to access this information.<br />
• Account name: The name of the customer who purchased the Unisys SafeGuard<br />
30m solution.<br />
• Contact name: The name of the person responsible for collecting logs.<br />
• Contact mail: The mail account of the person responsible for collecting logs.<br />
Note: Ignore messages about utilities that are not installed.<br />
Verifying the Results<br />
• The process generates a single tar file of the host logs in the gzip format.<br />
• On 32-bit and Intel Itanium 2-based systems, the host logs are located in the<br />
following directory:<br />
C:\Program Files\KDriver\hic<br />
• On 64-bit systems, the host logs are located in the following directory:<br />
C:\Program Files (x86)\KDriver\hic<br />
Analyzing RA Log Collection Files<br />
If you use the Installation Manager RA log collection process, logs are collected from all<br />
accessible RAs and servers (hosts). When the tar file is extracted using this process, the<br />
information is gathered in a file on the FTP server that is, by default, named with the<br />
following format:<br />
sysInfo-&lt;RA identifiers&gt;-hosts-from-&lt;RA identifiers&gt;-&lt;timestamp&gt;.tar<br />
The &lt;timestamp&gt; is in the format yyyy.mm.dd.hh.mm.ss.<br />
An example of such a file name is<br />
sysInfo-lr-l2-r1-r2-hosts-from-l1-r1-2007.09.07.17.37.39.tar<br />
For each RA on which logs were collected, directories are created with the following<br />
formats:<br />
extracted.&lt;RA identifier&gt;.&lt;timestamp&gt;<br />
HLR-&lt;RA identifier&gt;-&lt;timestamp&gt;<br />
The &lt;timestamp&gt; is in the format yyyy.mm.dd.hh.mm.ss.<br />
An example of the name of an extracted directory for the RA is<br />
extracted.l1.2007.06.05.19.25.03 (from left RA 1 on June 5, 2007 at 19:25:03)<br />
In the RA identifier information, the l1 through l8 and r1 through r8 designations refer to RAs at the<br />
left and right sites. That is, site 1 RAs 1 through 8 are designated with l, and site 2 RAs 1<br />
through 8 are designated with r.<br />
If the RA collected a host log, the host information is collected in a directory beginning<br />
with HLR. For example, HLR-r1-2007.06.05.19.25.03 is the directory from right (site 2)<br />
RA1 on June 5, 2007 at 19:25:03.<br />
This directory is described in “Host Log Extraction Directory” later in this appendix.<br />
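Both directory-name formats can be parsed back into site, RA number, and collection time. A sketch, assuming the yyyy.mm.dd.hh.mm.ss timestamp format described above:

```python
import re
from datetime import datetime

DIR_RE = re.compile(
    r"^(?:extracted\.(?P<ra1>[lr]\d)\.|HLR-(?P<ra2>[lr]\d)-)"
    r"(?P<ts>\d{4}(?:\.\d{2}){5})$")

def parse_collection_dir(name):
    """Split an extracted.<RA>.<timestamp> or HLR-<RA>-<timestamp>
    directory name into (site, RA number, collection time)."""
    m = DIR_RE.match(name)
    if not m:
        return None
    ra = m.group("ra1") or m.group("ra2")
    site = 1 if ra[0] == "l" else 2   # l = left (site 1), r = right (site 2)
    when = datetime.strptime(m.group("ts"), "%Y.%m.%d.%H.%M.%S")
    return site, int(ra[1:]), when
```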
RA Log Extraction Directory<br />
Several files and directories are placed inside the extracted directory for the RA:<br />
• parameters: file containing the time frame for the collection<br />
• CLI: file containing the output collected by running CLI commands<br />
• aiw: file containing the internal log of the system, which is used by third-level<br />
support<br />
• aiq: file containing the internal log of the system, which is used by third-level support<br />
• cm_cli: internal file used by third-level support<br />
• init_hl: internal file used by third-level support<br />
• kbox_status: file used by third-level support<br />
• unfinished_init_hl: file used by third-level support<br />
• log: file containing the log of the collection process itself (used only by third-level<br />
support)<br />
• summary: file containing a summary of the main events from the internal logs of the<br />
system, which is used by third-level support<br />
• files: directory containing the original directories from the appliance<br />
• processes: directory containing some internal information from the system such as<br />
network configuration, processes state, and so forth<br />
• tmp: temporary directory<br />
Of the preceding items, you should understand the time frame of the collection from the<br />
parameters file and focus on the CLI file information. To determine whether the logs<br />
were correctly collected, check that the time frame of the collection correlates with the<br />
time of the issue, and verify that logs were collected from all nodes.<br />
Root-Level Files<br />
Several files are saved at the root level of the extracted directory: parameters file, CLI<br />
file, aiw file, aiq file, cm_cli file, init_hl file, kbox_status file, unfinished_init_hl file, log file,<br />
and summary file.<br />
Parameters File<br />
The parameters file contains the parameters given to the log gathering tool. Those<br />
parameters set the time frame for the log collection and are reflected in the parameters<br />
file. The format for the date is yyyy/mm/dd.<br />
The following example illustrates the contents of a parameters file:<br />
only_connectivity=”0”<br />
min=”2007/08/03 16:25:02”<br />
max=”2007/08/04 19:25:02”<br />
withCores=”1”<br />
The value ”0” for only_connectivity in the parameters file is a standard value for logs.<br />
The value “1” for withCores means that core logs (long) were collected for the time<br />
displayed.<br />
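The key="value" layout of the parameters file is simple to read programmatically; this sketch tolerates the curly quotes that sometimes appear in captured copies:

```python
def parse_parameters(text):
    """Parse the parameters file's key="value" lines into a dict,
    stripping straight or curly quotes around the values."""
    params = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            params[key.strip()] = value.strip().strip('"\u201c\u201d')
    return params
```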
CLI File<br />
The CLI file contains the output from executing various CLI commands. The commands<br />
issued to produce the information are saved to the CLI file in the tmp directory. Usually<br />
executing CLI commands in the process of collecting logs produces volumes of output.<br />
The types of information that are contained in the CLI file are as follows:<br />
• Account settings and license<br />
• Alert settings<br />
• Box states<br />
• Consistency groups, settings, and state<br />
• Consistency group statistics<br />
• Site name<br />
• Splitters<br />
• Management console logs for the period collected<br />
• Global accumulators (used by third-level support)<br />
• Various settings and system statistics<br />
• Save_settings command output<br />
• Splitters settings and state<br />
• Volumes settings and state<br />
• Available images<br />
The commands used to collect the output are listed in the runCLI file, described later in<br />
this appendix.<br />
Log File<br />
This file contains a report of the log collection that executed. It shows the start and stop<br />
time for the log.<br />
If there is a problem running CLI commands, information appears at the end of the file<br />
similar to the following:<br />
2007/06/05 19:25:40: info: running CLI commands<br />
2007/06/05 19:25:40: info: retrieving site name<br />
2007/06/05 19:25:40: info: site name is "Tunguska"<br />
2007/06/05 19:25:40: info: retrieving groups<br />
2007/06/05 19:25:40: error: while running CLI commands: when running CLI<br />
get_groups, RC=2<br />
2007/06/05 19:25:40: error: while running CLI commands: errors retrieving<br />
groups. skipping CLI commands.<br />
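Error lines in the collection log follow the timestamped `: error:` pattern shown above, so they can be filtered out for review; a minimal sketch:

```python
def log_errors(log_text):
    """Return the 'error:' lines from a collection log, so CLI failures
    such as a nonzero get_groups return code stand out."""
    return [line for line in log_text.splitlines()
            if ": error:" in line]
```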
Summary File<br />
The summary file is at the root of the extracted directory and contains a summary of the<br />
main events from the internal logs of the system. The format of this file is used by<br />
third-level support. However, you might find a summary of the errors helpful in some cases.<br />
Files Directory<br />
The files directory contains several subdirectories and files in those directories. The<br />
directories are etc, home, collector, rreasons, proc, and var.<br />
etc Directory<br />
This directory contains the rc.local file, which is used by third-level support.<br />
home Directory<br />
The home directory contains the kos directory containing several files and these<br />
subdirectories: cli, connectivity_tool, control, customer_monitor, hlr, install_logs, kbox,<br />
management, monitor, mpi__perf, old_config, replication, rmi, snmp, and utils.<br />
The home directory also contains the collector and rreasons directories.<br />
collector Directory<br />
This directory contains the connectivity_tool subdirectory, which lists results from<br />
connectivity tests to configured IP addresses on the local host loopback and the specific<br />
ports on the IP addresses that require testing for various protocols.<br />
rreasons Directory<br />
This directory contains the rreasons.log file, which lists the reasons for any reboots in<br />
the specified time frame.<br />
This file is used by third-level support but can be helpful in reviewing the reboot reasons,<br />
as shown in the following sample file:<br />
***************************************************************************<br />
***************************************************************************<br />
***************************************************************************<br />
***************************************************************************<br />
***************************************************************************<br />
=== LogLT STARTED HERE - 2007/07/05 22:40:40 ===<br />
***************************************************************************<br />
***************************************************************************<br />
***************************************************************************<br />
***************************************************************************<br />
***************************************************************************<br />
Couldn't open 'logger.ini' file, so assuming default 'all' with level<br />
DEBUG2007/07/05 22:40:40.834 - #2 - 1421 - RebootReasons:<br />
getRebootReasons2007/07/05 22:40:40.834 - #2 - 1421 - rreasons: Reboot Log:<br />
[Mon Apr 16 20:33:00 2007] : kernel watchdog 0 expired (time=66714<br />
lease=1390 last_tick=65233) 0=(1390,65233) 1=(30000,63214) 2=(1400,65233)<br />
Note: In the example, the “kernel watchdog 0 expired” message indicates a typical<br />
reboot that was not a result of an error.<br />
Other Directories<br />
The proc and var directories are also contained within the files directory and are used by<br />
third-level support.<br />
processes Directory<br />
The processes directory contains the InfoCollect, sbin, usr, home, and bin directories and<br />
several subdirectories.<br />
InfoCollect Directory<br />
Under the InfoCollect directory, the SanDiag.sh file contains the SAN diagnostic logs.<br />
The ConnectivityTest.sh file contains connection information. Connection errors in this<br />
log do not indicate an error in the configuration or function.<br />
sbin Directory<br />
This directory contains files with information pertaining to networking.<br />
• Ifconfig file: Lists configuration information as shown in the following example:<br />
eth0 Link encap:Ethernet HWaddr 00:14:22:11:DD:1B<br />
inet addr:10.10.21.51 Bcast:10.255.255.255 Mask:255.255.255.0<br />
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />
RX packets:286265797 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:228318046 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:100<br />
RX bytes:1377792659 (1.2 GiB) TX bytes:2189256742 (2.0 GiB)<br />
Base address:0xecc0 Memory:fe6e0000-fe700000<br />
eth1 Link encap:Ethernet HWaddr 00:14:22:11:DD:1C<br />
inet addr:172.16.21.51 Bcast:172.16.255.255 Mask:255.255.0.0<br />
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />
RX packets:13341097 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:12365085 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:5000<br />
RX bytes:4156827090 (3.8 GiB) TX bytes:4192345752 (3.9 GiB)<br />
Base address:0xdcc0 Memory:fe4e0000-fe500000<br />
lo Link encap:Local Loopback<br />
inet addr:127.0.0.1 Mask:255.0.0.0<br />
UP LOOPBACK RUNNING MTU:16436 Metric:1<br />
RX packets:11289452 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:11289452 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:0<br />
RX bytes:3269809825 (3.0 GiB) TX bytes:3269809825 (3.0 GiB)<br />
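When reviewing the Ifconfig file, the errors and dropped counters are the quickest health signal; a hedged sketch that tallies them per interface (an illustrative parser, not a supported tool):

```python
import re

def interface_error_counts(ifconfig_text):
    """Sum the errors/dropped counters per interface from classic
    ifconfig output; nonzero values are worth a closer look."""
    counts, iface = {}, None
    for line in ifconfig_text.splitlines():
        if line and not line[0].isspace():
            iface = line.split()[0]      # interface name starts the block
        for kind, val in re.findall(r"(errors|dropped):(\d+)", line):
            counts.setdefault(iface, {}).setdefault(kind, 0)
            counts[iface][kind] += int(val)
    return counts
```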
• route file: Lists other pieces of routing information, as shown in the following<br />
example:<br />
Kernel IP routing table<br />
Destination Gateway Genmask Flags Metric Ref Use Iface<br />
10.10.21.0 * 255.255.255.0 U 0 0 0 eth0<br />
172.16.0.0 * 255.255.0.0 U 0 0 0 eth1<br />
usr Directory<br />
The usr directory contains two subdirectories: bin and sbin.<br />
The bin subdirectory contains the kps.pl file.<br />
The following is an example of the kps.pl file for an attached RA:<br />
Uptime: 7:25pm up 18:24, 1 user, load average: 5.99, 4.38, 2.05<br />
Processes:<br />
control_process - UP<br />
control_loop.tcsh - UP<br />
replication - UP<br />
mgmt_loop.tcsh - UP<br />
management_server - UP<br />
cli - down<br />
rmi_loop.tcsh - UP<br />
rmi - UP<br />
monitor_loop.tcsh - UP<br />
load_monitor.pl - UP<br />
runall - down<br />
hlr_kbox - UP<br />
rcm_run_loop.tcsh - UP<br />
customer_monitor.pl - UP<br />
Modules:<br />
st - UP<br />
sll - UP<br />
var_link - UP<br />
kaio_mod-2.4.32-k22 - UP<br />
The following is an example of the kps.pl file for a detached RA:<br />
Uptime: 7:25pm up 18:24, 1 user, load average: 5.99, 4.38, 2.05<br />
Processes:<br />
control_process - down<br />
control_loop.tcsh - down<br />
replication - down<br />
mgmt_loop.tcsh - down<br />
management_server - down<br />
cli - down<br />
rmi_loop.tcsh - down<br />
rmi - down<br />
monitor_loop.tcsh - down<br />
load_monitor.pl - down<br />
runall - down<br />
hlr_kbox - UP<br />
rcm_run_loop.tcsh - down<br />
customer_monitor.pl - down<br />
Modules:<br />
st - UP<br />
sll - UP<br />
var_link - UP<br />
kaio_mod-2.4.32-k22 - UP<br />
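Comparing the two kps.pl samples, cli and runall are down in both, while the remaining processes and modules are UP only on the attached RA. A heuristic sketch built on that observation (an assumption drawn from the samples, not a supported check):

```python
def ra_is_attached(kps_text):
    """Classify an RA as attached when every listed process and module
    is UP, excluding cli and runall, which are down even in the
    attached-RA sample."""
    ignore = {"cli", "runall"}
    state = {}
    for line in kps_text.splitlines():
        if " - " in line:
            name, _, status = line.rpartition(" - ")
            state[name.strip()] = status.strip().upper()
    return all(v == "UP" for k, v in state.items() if k not in ignore)
```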
The sbin subdirectory contains the biosdecode and dmidecode files. The biosdecode file<br />
provides hardware-specific RA BIOS information and the pointers to locations where this<br />
information is stored. The dmidecode file provides handle and other information for<br />
components capable of passing this information to a Desktop Management Interface<br />
(DMI) agent.<br />
home Directory<br />
The home directory contains the kos subdirectory, which contains other subdirectories<br />
that yield the get_users_lock_state.tcsh file. This file contains all the users on the RA.<br />
bin Directory<br />
The bin directory contains the df-h and lspci files. The df-h file contains directory size and<br />
disk size usage statistics for the RA hard disk drive. The lspci file contains PCI bridge bus<br />
numbers, revisions, and OEM identification strings for inbuilt devices in the RA.<br />
tmp Directory<br />
The tmp directory contains the runCLI file listing the commands that generated the CLI<br />
file. It also contains the getGroups file, which is a temporary file to gather the list of<br />
consistency groups.<br />
runCLI File<br />
The following is an example of the runCLI file saved in the tmp directory that shows the<br />
CLI commands executed:<br />
• get_logs from=&lt;start time&gt; to=&lt;end time&gt; –n<br />
The time and date are specified as day, month, year as follows:<br />
get_logs from="22:03 03/08/2007" to="17:03 04/08/2007" –n<br />
• config_io_throttling –n<br />
• config_multipath_monitoring –n<br />
• get_account_settings –n<br />
• get_alert_settings –n<br />
• get_box_states –n<br />
• get_global_policy –n<br />
• get_groups –n<br />
• get_groups_sets –n<br />
• get_group_settings –n<br />
• get_group_state –n<br />
• get_group_statistics –n<br />
• get_id_names –n<br />
• get_initiator_bindings –n<br />
• get_pairs –n<br />
• get_raw_stats –n<br />
• get_snmp_settings –n<br />
• get_syslog_settings –n<br />
• get_system_status –n<br />
• get_system_settings –n<br />
• get_system_statistics –n<br />
• get_tweak_params –n<br />
• get_version –n<br />
• get_virtual_targets –n<br />
• save_settings –n<br />
• get_splitter_settings site="&lt;site name&gt;"<br />
• get_splitter_states site="&lt;site name&gt;"<br />
• get_san_splitter_view site="&lt;site name&gt;"<br />
• get_san_volumes site="&lt;site name&gt;"<br />
• get_santap_view site="&lt;site name&gt;"<br />
• get_volume_settings site="&lt;site name&gt;"<br />
• get_volume_state site="&lt;site name&gt;"<br />
• get_images group="&lt;group name&gt;" (This command is repeated for each group.)<br />
getGroups File<br />
This internal file is used to generate the runCLI file.<br />
Host Log Extraction Directory<br />
When the RA collects a host log, the host information is collected in a directory named<br />
with the HLR-&lt;RA identifier&gt;-&lt;timestamp&gt; format.<br />
Such a directory contains a tar.gz file for servers with a name similar in format to the<br />
following:<br />
HLR-r1_USMVEAST2_1157647546524147.tar.gz<br />
When you extract a tar.gz file, you can choose to decompress the ZIP file<br />
(to_transfer.tar) to a temp folder and open it, or you can choose to extract the files to a<br />
directory.<br />
When the file is for intelligent fabric switches, the file name does not have the .gz<br />
extension.<br />
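The presence or absence of the .gz extension distinguishes server host logs from intelligent fabric switch files; a trivial sketch:

```python
def collected_from(file_name):
    """Server host logs arrive as .tar.gz; intelligent fabric switch
    files have the same naming but lack the .gz extension."""
    if file_name.endswith(".tar.gz"):
        return "server"
    if file_name.endswith(".tar"):
        return "switch"
    return "unknown"
```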
Analyzing Server (Host) Logs<br />
The output file from host collection is named<br />
Unisys_host_info_&lt;host name&gt;_&lt;timestamp&gt;.tar.gz<br />
This file contains a folder named “collected_items,” which contains the following files<br />
and directories:<br />
• Cluster_log: a folder containing the cluster.log file generated by MSCS<br />
• Hic_logs: a folder containing logs used by third-level support<br />
• Host_logs: a folder containing logs used by third-level support<br />
• Msinfo32: information from the Msinfo32.exe file<br />
• Registry.dump: the registry dump for this server<br />
• Tweak: the internal RA parameters on this server<br />
• Watchdog log: log created by the KDriverWatchDog service<br />
• Commands: a file containing output from commands executed on this server,<br />
including<br />
− A view of the LUNs recognized by this server<br />
− Some internal RA structures<br />
− Output from the dumpcfg.exe file<br />
− Windows event logs for system, security, and applications<br />
Analyzing Intelligent Fabric Switch Logs<br />
The output file from collecting information from intelligent fabric switches is named with<br />
the following format:<br />
HLR-&lt;RA identifier&gt;_&lt;switch type&gt;_&lt;identifier&gt;.tar<br />
The following name is an example of this format:<br />
HLR-l1_CISCO_232c000dec1a7a02.tar<br />
Once you extract the .tar file, some files are listed with formats similar to the following:<br />
CVT_.tar_AT__M3_tech<br />
CVT_.tar_AT__M3_isapi_tech<br />
CVT_.tar_AT__M3_santap_tech<br />
Appendix B<br />
Running Replication Appliance (RA)<br />
Diagnostics<br />
This appendix<br />
• Explains how to clear the system event log (SEL).<br />
• Describes how to run hardware diagnostics for the RA.<br />
• Lists the LCD status messages shown on the RA.<br />
Clearing the System Event Log (SEL)<br />
Before you run the RA diagnostics, you need to clear the SEL to prevent errors from<br />
being generated during the diagnostics run.<br />
1. Insert the bootable Replication Appliance (RA) Diagnostic CD-ROM in the CD/DVD<br />
drive.<br />
2. Press Ctrl+Alt+Delete to reboot the RA.<br />
The RA boots from the diagnostic CD and displays the event log menu.<br />
3. Select Show all system event log records using the arrow keys, then press<br />
Enter.<br />
This action results in an SEL summary and indicates whether the SEL contains<br />
errors. If there are errors, an error description is given.<br />
Note: You cannot scroll up or down in this screen.<br />
A clear SEL without errors has “IPMI SEL contains 1 records” displayed in the<br />
summary. Anything greater than one record indicates that errors are present.<br />
Note: The preceding step did not clear the SEL; ignore the statement “Log area<br />
Reset/Cleared.”<br />
4. Press any key to return to the main boot menu.<br />
5. Select Clear System Event Log using the arrow keys, and press Enter to ensure<br />
that the SEL is cleared of all error entries.<br />
Note: Depending on whether there are error entries, this clearing action could take<br />
up to 1 minute to complete.<br />
6. Press any key again to return to the main boot menu.<br />
7. Select Show all system event log records using the arrow keys and press<br />
Enter. Confirm that “IPMI SEL contains 1 records” is shown.<br />
8. Press any key to return to the main boot menu.<br />
Note: If you accidentally press Escape and leave the main boot menu, a Diag<br />
prompt is displayed. Type menu to return to the main boot menu.<br />
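The SEL summary from steps 3 and 7 can be checked mechanically: exactly one record means the log is clear, and anything greater indicates errors. A sketch (the wording matched here is taken from the summary quoted above):

```python
import re

def sel_is_clear(summary_line):
    """A clear SEL reports exactly one record ('IPMI SEL contains 1
    records'); a higher count indicates error entries are present."""
    m = re.search(r"IPMI SEL contains (\d+) records", summary_line)
    return m is not None and int(m.group(1)) == 1
```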
Running Hardware Diagnostics<br />
Running the hardware diagnostics for the RA includes completing the Custom Test and<br />
Express Test diagnostics.<br />
Follow these steps to run the hardware diagnostics for the RA:<br />
1. At the main boot menu, use the arrow keys to select Run Diags …; then press<br />
Enter.<br />
2. On the Customer Diagnostic Menu, press 2 to select Run ddgui graphics-based<br />
diagnostic.<br />
The system diagnostic files begin loading and a message is displayed giving<br />
information about the software and showing “initializing…”<br />
Once the diagnostics are loaded and ready to be executed, the Main Menu is<br />
displayed.<br />
Custom Test<br />
1. On the Main Menu, select Custom Test using the arrow keys; then press Enter.<br />
The Custom Test dialog box is displayed.<br />
2. Expand the PCI Devices folder to view the PCI devices installed in the system<br />
including those devices that are “on-board.”<br />
3. Select the PCI Devices folder; then press Enter.<br />
This action causes each PCI device to be interrogated in turn and a message is<br />
displayed for each one. Verify that the correct number of QLogic adapters is shown.<br />
4. Press OK after each message is displayed until all PCI devices have been recognized<br />
and passed. The message “All tests passed.” is displayed.<br />
Note: If any devices fail this test, investigate and rectify the problem; then clear the<br />
SEL as explained in “Clearing the System Event Log (SEL).”<br />
5. Close the Custom Test dialog box and return to the Main Menu.<br />
Express Test<br />
1. On the Main Menu, select Express Test using the arrow keys; then press Enter.<br />
A warning is displayed advising that media must be installed on all drives or else<br />
some tests might fail.<br />
2. If a diskette drive is installed in the system, insert a blank, formatted diskette and<br />
then click OK to start the test. If no diskette drive is installed, just click OK.<br />
During testing, a status screen is displayed.<br />
If the diagnostic test run is successful, the message “All tests passed.” appears.<br />
Notes:<br />
• During the video portion of the testing, the screen typically flickers and goes<br />
blank.<br />
• If any errors occur, investigate and resolve the problem, and then rerun the<br />
diagnostic tests. Before you rerun the tests, be sure to clear the SEL as<br />
explained in “Clearing the System Event Log (SEL).”<br />
3. Click OK to exit the diagnostic tests.<br />
The Main Menu is then displayed.<br />
4. Select Exit using the arrow keys; then press Enter.<br />
The following message is displayed:<br />
Displaying the end of test result.log ddgui.txt. Strike a Key when ready.<br />
5. Press any key to display the diagnostic test summary screen.<br />
6. Verify that no errors are listed. Scroll up and down to see the different portions of<br />
the output.<br />
Note: If any errors are listed, investigate and resolve the problem; then rerun the<br />
diagnostic tests. Before you rerun the tests, be sure to clear the SEL as explained in<br />
“Clearing the System Event Log (SEL).”<br />
7. Press Escape to return to the original Customer Diagnostic Menu.<br />
8. Press 4 to quit and return to the main boot menu.<br />
9. Select Exit; then press Enter.<br />
10. Remove all media from the diskette and CD/DVD drives.<br />
LCD Status Messages<br />
The LCD on the RA displays status messages. Table B–1 lists the LCD status messages<br />
that can occur and the probable cause for each message. The LCD messages refer to<br />
events recorded in the SEL.<br />
Note: For information about corrective actions for the messages listed in Table B–1,<br />
refer to the documentation supplied with the system.<br />
Table B–1. LCD Status Messages<br />
Line 1 Message Line 2 Message Cause<br />
SYSTEM ID SYSTEM NAME The system ID is a unique name, 5 characters or less,<br />
defined by the user.<br />
The system name is a unique name, 16 characters or<br />
less, defined by the user.<br />
The system ID and name display under the following<br />
conditions:<br />
• The system is powered on.<br />
• The power is off and active POST errors are<br />
displayed.<br />
E000 OVRFLW CHECK LOG LCD overflow message. A maximum of three error<br />
messages can display sequentially on the LCD. The<br />
fourth message is displayed as the standard overflow<br />
message.<br />
E0119 TEMP AMBIENT Ambient system temperature is out of the acceptable<br />
range.<br />
E0119 TEMP BP The backplane board is out of the acceptable temperature<br />
range.<br />
E0119 TEMP CPU n The specified microprocessor is out of the acceptable<br />
temperature range.<br />
E0119 TEMP SYSTEM The system board is out of the acceptable temperature<br />
range.<br />
E0212 VOLT 3.3 The system power supply is out of the acceptable voltage<br />
range; the power supply is faulty or improperly installed.<br />
E0212 VOLT 5 The system power supply is out of the acceptable voltage<br />
range; the power supply is faulty or improperly installed.<br />
E0212 VOLT 12 The system power supply is out of the acceptable voltage<br />
range; the power supply is faulty or improperly installed.<br />
E0212 VOLT BATT Faulty battery; faulty system board.<br />
E0212 VOLT BP 12 The backplane board is out of the acceptable voltage<br />
range.<br />
E0212 VOLT BP 3.3 The backplane board is out of the acceptable voltage<br />
range.<br />
E0212 VOLT BP 5 The backplane board is out of the acceptable voltage<br />
range.<br />
E0212 VOLT CPU VRM The microprocessor voltage regulator module (VRM)<br />
voltage is out of the acceptable range. The<br />
microprocessor VRM is faulty or improperly installed. The<br />
system board is faulty.<br />
E0212 VOLT NIC 1.8V Integrated NIC voltage is out of the acceptable range; the<br />
power supply is faulty or improperly installed. The system<br />
board is faulty.<br />
E0212 VOLT NIC 2.5V Integrated NIC voltage is out of the acceptable range. The<br />
power supply is faulty or improperly installed. The system<br />
board is faulty.<br />
E0212 VOLT PLANAR REG The system board is out of the acceptable voltage range.<br />
The system board is faulty.<br />
E0276 CPU VRM n The specified microprocessor VRM is faulty,<br />
unsupported, improperly installed, or missing.<br />
E0276 MISMATCH VRM n The specified microprocessor VRM is faulty,<br />
unsupported, improperly installed, or missing.<br />
E0280 MISSING VRM n The specified microprocessor VRM is faulty,<br />
unsupported, improperly installed, or missing.<br />
E0319 PCI OVER CURRENT The expansion card is faulty or improperly installed.<br />
E0412 RPM FAN n The specified cooling fan is faulty, improperly installed, or<br />
missing.<br />
E0780 MISSING CPU 1 Microprocessor is not installed in socket PROC_1.<br />
E07F0 CPU IERR The microprocessor is faulty or improperly installed.<br />
E07F1 TEMP CPU n HOT The specified microprocessor is out of the acceptable<br />
temperature range and has halted operation.<br />
E07F4 POST CACHE The microprocessor is faulty or improperly installed.<br />
E07F4 POST CPU REG The microprocessor is faulty or improperly installed.<br />
E07FA TEMP CPU n THERM The specified microprocessor is out of the acceptable<br />
temperature range and is operating at a reduced speed or<br />
frequency.<br />
E0876 POWER PS n No power is available from the specified power supply.<br />
The specified power supply is improperly installed or<br />
faulty.<br />
E0880 INSUFFICIENT PS Insufficient power is being supplied to the system. The<br />
power supplies are improperly installed, faulty, or<br />
missing.<br />
E0CB2 MEM SPARE ROW The correctable errors threshold was met in a memory<br />
bank; the errors were remapped to the spare row.<br />
E0CF1 MBE DIMM Bank n The memory modules installed in the specified bank are<br />
not the same type and size. The memory module or<br />
modules are faulty.<br />
E0CF1 POST MEM 64K A parity failure occurred in the first 64 KB of main<br />
memory.<br />
E0CF1 POST NO MEMORY The main-memory refresh verification failed.<br />
E0CF5 LOG DISABLE SBE Multiple single-bit errors occurred on a single memory<br />
module.<br />
E0D76 DRIVE FAIL A hard drive or RAID controller is faulty or improperly<br />
installed.<br />
E0F04 POST DMA INIT Direct memory access (DMA) initialization failed. DMA<br />
page register write/read operation failed.<br />
E0F04 POST MEM RFSH The main-memory refresh verification failed.<br />
E0F04 POST SHADOW BIOS-shadowing failed.<br />
E0F04 POST SHD TEST The shutdown test failed.<br />
E0F0B POST ROM CHKSUM The expansion card is faulty or improperly installed.<br />
E0F0C VID MATCH CPU n The specified microprocessor is faulty, unsupported,<br />
improperly installed, or missing.<br />
E10F3 LOG DISABLE BIOS The BIOS disabled logging errors.<br />
E13F2 IO CHANNEL CHECK The expansion card is faulty or improperly installed. The<br />
system board is faulty.<br />
E13F4 PCI PARITY<br />
E13F5 PCI SYSTEM<br />
E13F8 CPU BUS INIT The microprocessor or system board is faulty or<br />
improperly installed.<br />
E13F8 CPU MCKERR Machine check error. The microprocessor or system<br />
board is faulty or improperly installed.<br />
E13F8 HOST TO PCI BUS<br />
E13F8 MEM CONTROLLER A memory module or the system board is faulty or<br />
improperly installed.<br />
E20F1 OS HANG The operating system watchdog timer has timed out.<br />
EFFF1 POST ERROR A BIOS error occurred.<br />
EFFF2 BP ERROR The backplane board is faulty or improperly installed.<br />
Appendix C<br />
Running Installation Manager<br />
Diagnostics<br />
To determine the causes of various problems and to perform many procedures, you<br />
must access the Installation Manager functions and diagnostics capabilities.<br />
Using the SSH Client<br />
Throughout the procedures in this guide you might need to use the secure shell (SSH)<br />
client. Perform the following steps whenever you are asked to use the SSH client or to<br />
open a PuTTY session:<br />
1. From Windows Explorer, double-click the PuTTY.exe file.<br />
2. When prompted, enter the applicable IP address.<br />
3. Select SSH for the protocol and keep the default port settings (port 22).<br />
4. Click Open.<br />
5. If prompted by a PuTTY security dialog box, click Yes.<br />
6. When prompted to log in, type the identified user name and then press Enter.<br />
7. When prompted for a password, type the identified password and then press Enter.<br />
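Before launching PuTTY, it can save time to confirm that the RA is actually listening on TCP port 22. The following Python sketch is illustrative only (it is not part of the SafeGuard tooling), and the RA address shown in the usage comment is a placeholder:

```python
# Illustrative pre-check (not part of the SafeGuard tooling): verify that the
# RA is reachable on TCP port 22 before opening the PuTTY/SSH session.
import socket

def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# Usage (10.10.17.61 is a placeholder RA address):
#   ssh_port_open("10.10.17.61")  ->  True when SSH is reachable
```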
Running Diagnostics<br />
When you open the PuTTY session and log in as boxmgmt/boxmgmt, the Main Menu of<br />
Installation Manager is displayed. This menu offers the following six choices: Install,<br />
Setup, Diagnostics, Cluster Operations, Reboot/Shutdown, and Quit.<br />
For more information about these capabilities, see the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong><br />
Replication Appliance Installation <strong>Guide</strong>.<br />
To access the various diagnostic capabilities of Installation Manager, perform the<br />
following steps:<br />
1. Open a PuTTY session using the IP address of the RA, and log in as<br />
boxmgmt/boxmgmt.<br />
The Main Menu is displayed, as follows:<br />
** Main Menu **<br />
[1] Install<br />
[2] Setup<br />
[3] Diagnostics<br />
[4] Cluster Operations<br />
[5] Reboot / Shutdown<br />
[Q] Quit<br />
2. Type 3 (Diagnostics) and press Enter.<br />
The Diagnostics menu is displayed as follows:<br />
** Diagnostics **<br />
[1] IP diagnostics<br />
[2] Fibre Channel diagnostics<br />
[3] Synchronization diagnostics<br />
[4] Collect system info<br />
[B] Back<br />
[Q] Quit<br />
The four diagnostics capabilities are explained in the following topics.<br />
IP Diagnostics<br />
Use the IP diagnostics when you need to check port connectivity, view IP addresses,<br />
test throughput, and review other related information.<br />
On the Diagnostics menu, type 1 (IP diagnostics) and press Enter to access the IP<br />
Diagnostics menu as shown:<br />
** IP Diagnostics **<br />
[1] Site connectivity tests<br />
[2] View IP details<br />
[3] View routing table<br />
[4] Test throughput<br />
[5] Port diagnostics<br />
[6] System connectivity<br />
[B] Back<br />
[Q] Quit<br />
Site Connectivity Tests<br />
On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter to<br />
access the Site connectivity tests menu.<br />
Note: You must apply settings to the RA before you can test options 1 through 4 in the<br />
following list.<br />
The options to test are as follows:<br />
** Select the target to which to test connectivity: **<br />
[1] Gateway<br />
[2] Primary DNS server<br />
[3] Secondary DNS server<br />
[4] NTP Server<br />
[5] Other host<br />
[B] Back<br />
[Q] Quit<br />
Tests for options 1 through 4 return a result of success or failure.<br />
For option 5, you must specify the target IP address that you want to test. The test<br />
returns the relative success of 0 through 100 percent over both the management and<br />
WAN interfaces.<br />
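The 0 through 100 percent figure can be thought of as the fraction of probes that succeed on a given interface. A minimal sketch of that computation follows; it is illustrative only, since the actual probe mechanism is internal to Installation Manager:

```python
# Illustrative sketch: derive a 0-100% "relative success" figure from a
# series of individual connectivity probes on one interface.
def success_percent(results):
    """results: list of booleans, one per probe attempt."""
    if not results:
        return 0
    return round(100 * sum(results) / len(results))

# Example: 8 of 10 probes answered on the WAN interface.
print(success_percent([True] * 8 + [False] * 2))  # prints 80
```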
View IP Details<br />
On the IP Diagnostics menu, type 2 (View IP details) and press Enter to run an<br />
ifconfig process. The displayed results of the process are similar to the following:<br />
eth0 Link encap:Ethernet HWaddr 00:0F:1F:6A:03:E7<br />
inet addr:10.10.17.61 Bcast:10.10.17.255 Mask:255.255.255.0<br />
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />
RX packets:12751337 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:13628048 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:1000<br />
RX bytes:1084700432 (1034.4 Mb) TX bytes:2661155798 (2537.8 Mb)<br />
Base address:0xecc0 Memory:fe6e0000-fe700000<br />
eth1 Link encap:Ethernet HWaddr 00:0F:1F:6A:03:E8<br />
inet addr:172.16.17.61 Bcast:172.16.255.255 Mask:255.255.0.0<br />
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />
RX packets:10519453 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:10244866 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:5000<br />
RX bytes:2846677622 (2714.8 Mb) TX bytes:2702094827 (2576.9 Mb)<br />
Base address:0xdcc0 Memory:fe4e0000-fe500000<br />
eth1:1 Link encap:Ethernet HWaddr 00:0F:1F:6A:03:E8<br />
inet addr:172.16.17.60 Bcast:172.16.255.255 Mask:255.255.0.0<br />
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />
Base address:0xdcc0 Memory:fe4e0000-fe500000<br />
lo Link encap:Local Loopback<br />
inet addr:127.0.0.1 Mask:255.0.0.0<br />
UP LOOPBACK RUNNING MTU:16436 Metric:1<br />
RX packets:3853904 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:3853904 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:0<br />
RX bytes:3312865098 (3159.3 Mb) TX bytes:3312865098 (3159.3 Mb)<br />
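When reviewing output like the listing above, the items most worth checking are the inet addr of each interface and the RX/TX error counters. The following hypothetical helper (its name and structure are illustrative, not part of the product) pulls those fields out of ifconfig-style text:

```python
import re

# Hypothetical helper: extract each interface's IP address and total RX/TX
# error count from ifconfig-style output.
def parse_ifconfig(text):
    interfaces = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"^(\S+)\s+Link encap", line)
        if m:  # a new interface stanza starts in column 1
            current = m.group(1)
            interfaces[current] = {"addr": None, "errors": 0}
            continue
        if current is None:
            continue
        m = re.search(r"inet addr:(\S+)", line)
        if m:
            interfaces[current]["addr"] = m.group(1)
        for m in re.finditer(r"errors:(\d+)", line):
            interfaces[current]["errors"] += int(m.group(1))
    return interfaces
```

A healthy interface shows errors totalling 0, as in the sample output above.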
View Routing Table<br />
On the IP Diagnostics menu, type 3 (View routing table) and press Enter to display<br />
the routing table.<br />
Test Throughput<br />
On the IP Diagnostics menu, type 4 (Test throughput) and press Enter to use iperf to<br />
test throughput to another RA.<br />
Once you select this option, Installation Manager guides you through the following<br />
dialog. The bold text shows sample entries.<br />
Note: The Fibre Channel interface option appears only if the Installation Manager<br />
diagnostic capability was preconfigured to run on Fibre Channel. In that case, the<br />
option appears as [2] in the menu list.<br />
Enter the IP address to which to test throughput:<br />
>>192.168.1.86<br />
Select the interface from which to test throughput:<br />
** Interface **<br />
[1] Management interface<br />
[2] Fibre Channel Interface<br />
[3] WAN interface<br />
>>3<br />
Enter the desired number of concurrent streams:<br />
>>2<br />
Enter the test duration (seconds):<br />
>>10<br />
If the test is successful, the system responds with a standard iperf output that<br />
resembles the following:<br />
Checking connectivity to 10.10.17.51<br />
Connection to 10.10.17.51 established.<br />
Client Connecting to 10.10.17.51, TCP port 5001<br />
Binding to local address 10.10.17.61<br />
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)<br />
[ 6] local 10.10.17.61 port 35222 connected with 10.10.17.51 port 5001<br />
[ 5] local 10.10.17.61 port 35222 connected with 10.10.17.51 port 5001<br />
[ ID] Interval Transfer Bandwidth<br />
[ 5] 0.0-10.6 sec 59.1 Mbytes 46.9 Mbits/sec<br />
[ 6] 0.0-10.6 sec 59.1 Mbytes 46.9 Mbits/sec<br />
[SUM] 0.0-10.6 sec 118 Mbytes 93.9 Mbits/sec<br />
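The figure of interest in this output is usually the aggregate bandwidth on the [SUM] line. A hypothetical helper (illustrative only) that extracts it from iperf output of the form shown above:

```python
import re

# Hypothetical helper: extract the aggregate bandwidth (in Mbits/sec) from
# the [SUM] line of iperf output.
def sum_bandwidth_mbits(iperf_output):
    m = re.search(r"\[SUM\].*?([\d.]+)\s*Mbits/sec", iperf_output)
    return float(m.group(1)) if m else None

line = "[SUM]  0.0-10.6 sec  118 Mbytes  93.9 Mbits/sec"
print(sum_bandwidth_mbits(line))  # prints 93.9
```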
Port Diagnostics<br />
On the IP Diagnostics menu, type 5 (Port diagnostics) and press Enter to check that<br />
none of the ports used by the RAs are blocked (for example, by a firewall). You must test<br />
each RA individually—that is, designate each RA, in turn, to be the server.<br />
Once you select the option, Installation Manager guides you through one of the following<br />
dialogs, depending on whether you designate the RA to be the server or the client. In the<br />
dialogs, sample entries are bold.<br />
For the server, the dialog is as follows:<br />
In which mode do you want to run ports diagnostics?<br />
** **<br />
[1] Server<br />
[2] Client<br />
>>1<br />
Note: Before you select the server designation for the RA, detach the RA that you<br />
intend to specify as the server.<br />
After you specify the RA that you want to test as the server, move to the RA from which<br />
you wish to run the port diagnostics tests. Designate that RA as a client, as noted in the<br />
following dialog:<br />
** **<br />
[1] Server<br />
[2] Client<br />
>>2<br />
Did you already designate another RA to be the server (y/n)<br />
>>y<br />
Enter the IP address to test:<br />
>>10.10.17.51<br />
If the test is successful, the system responds with output that resembles the following:<br />
Port No. TCP Connection<br />
5030 OK<br />
5040 OK<br />
4401 OK<br />
1099 OK<br />
5060 Blocked<br />
4405 OK<br />
5001 OK<br />
5010 OK<br />
5020 OK<br />
Correct the problem on any port that returns a Blocked response.<br />
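When many ports are tested, it can help to reduce the report to just the failures. The following hypothetical helper (illustrative, not part of the product) scans port-diagnostics output like the table above and lists the ports that came back Blocked:

```python
# Hypothetical helper: return the port numbers that report "Blocked" in
# port-diagnostics output of the form "PORT  STATUS" per line.
def blocked_ports(report):
    blocked = []
    for line in report.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1] == "Blocked":
            blocked.append(int(parts[0]))
    return blocked

report = """Port No. TCP Connection
5030 OK
5060 Blocked
4405 OK"""
print(blocked_ports(report))  # prints [5060]
```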
System Connectivity<br />
Use the system connectivity options to test connections and generate reports on<br />
connections between RAs anywhere in the system. You can perform the tests during<br />
installation and during normal operation. The tests performed to verify connections are<br />
as follows:<br />
• Ping<br />
• TCP (to ports and IP addresses, to the specific processes of the RA, and using SSH)<br />
• UDP (general and to RA processes)<br />
• RA internal protocols<br />
On the IP Diagnostics menu, type 6 (System connectivity) and press Enter to access<br />
the System Connectivity menu as follows:<br />
** System Connectivity **<br />
[1] System connectivity test<br />
[2] Advanced connectivity test<br />
[3] Show all results from last connectivity check<br />
[B] Back<br />
[Q] Quit<br />
When you select System connectivity test and Full mesh network check, the<br />
test reports errors in communications from any RA to any other RA in the system.<br />
When you select System connectivity test and Check from local RA to all<br />
other boxes, the test reports errors from the local RA to any other RA in the system.<br />
When you select Advanced connectivity test, the test reports on the connection<br />
from an IP address that you specified on the local appliance to an IP address and port<br />
that you specified on an RA anywhere in the system. Use this option to diagnose a<br />
problem specific to a local IP address or port.<br />
When you select Show all results from last connectivity check, the test reports<br />
all results from the previous tests—not only the errors, but also the tests that completed<br />
successfully.<br />
You might receive one of the messages shown in Table C–1 from the connectivity test<br />
tool.<br />
Table C–1. Messages from the Connectivity Testing Tool<br />
Message Meaning<br />
Machine is down. There is no communication with the RA.<br />
Perform the following steps to determine<br />
the problem:<br />
• Verify that the firewall permits pinging<br />
the RA, that is, using an ICMP echo.<br />
• Check that the RA is connected and<br />
operating.<br />
• Check that the required ports are<br />
open. (Refer to Section 7, “Solving<br />
Networking Problems,” for tables with<br />
the port information.)<br />
is down. The host connection exists but the RA is<br />
not responding.<br />
Perform the following steps to determine<br />
the problem:<br />
• Check that the required ports are<br />
open. (Refer to Section 7, “Solving<br />
Networking Problems” for tables with<br />
the port information.)<br />
• Verify that the RA is attached to an RA<br />
cluster.<br />
Connection to link: protocol: FAILED. No connection is available to the host through the protocol.<br />
Link () FAILED. The connection that was checked has failed.<br />
All OK. The connection is working.<br />
To discover which port is involved in the error or failure, run the test again and select<br />
Show all results from last connectivity check. The port on which each failure<br />
occurred is shown.<br />
Fibre Channel Diagnostics<br />
Use the Fibre Channel diagnostics when you need to check SAN connections, review<br />
port settings, see details of the Fibre Channel, determine Fibre Channel targets and<br />
LUNs, and perform I/O operations to a LUN.<br />
On the Diagnostics menu, type 2 (Fibre Channel diagnostics) and press Enter to<br />
access the Fibre Channel Diagnostics menu as follows:<br />
** Fibre Channel Diagnostics **<br />
[1] Run SAN diagnostics<br />
[2] View Fibre Channel details<br />
[3] Detect Fibre Channel targets<br />
[4] Detect Fibre Channel LUNs<br />
[5] Detect Fibre Channel SCSI-3 reserved LUNs<br />
[6] Perform I/O to a LUN<br />
[B] Back<br />
[Q] Quit<br />
Run SAN Diagnostics<br />
On the Fibre Channel Diagnostics menu, type 1 (Run SAN diagnostics) and press<br />
Enter to run the SAN diagnostics.<br />
When you select this option, the system conducts a series of automatic tests to identify<br />
the most common problems encountered in the configuration of SAN environments,<br />
such as the following:<br />
• Storage inaccessible within a site<br />
• Delays with writes or reads to disk<br />
• Disk not accessible in the network<br />
• Configuration issues<br />
Once the tests complete, a message is displayed confirming the successful completion<br />
of SAN diagnostics, or a report is displayed that provides additional details.<br />
Results similar to the following are displayed for a successful diagnostics run of port 0:<br />
0 errors:<br />
0 warnings:<br />
Total=0<br />
Sample results follow for a diagnostics run that returns errors:<br />
ConfigB_Site2 Box2>>1<br />
>>Running SAN diagnostics. This may take a few moments...<br />
results of SAN diagnostics are<br />
3 errors:<br />
1. Found device with no guid : wwn=5006016b1060090d lun=0 port=0 vendor=DGC<br />
product=LUNZ<br />
2. Found device with no guid : wwn=500601631060090d lun=0 port=0 vendor=DGC<br />
product=LUNZ<br />
3. Found device with no guid : wwn=5006016b1060090d lun=0 port=1 vendor=DGC<br />
product=LUNZ<br />
9 warnings:<br />
1. device wwn=500601631060090d lun=8<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,125,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
2. device wwn=500601631060090d lun=7<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,127,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
3. device wwn=500601631060090d lun=6<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,129,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
4. device wwn=500601631060090d lun=5<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,131,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
5. device wwn=500601631060090d lun=4<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,133,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
6. device wwn=500601631060090d lun=3<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,135,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
7. device wwn=500601631060090d lun=2<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,137,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
8. device wwn=500601631060090d lun=1<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,139,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
9. device wwn=500601631060090d lun=0<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,141,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
Total=12<br />
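The report always ends with a Total line that should equal the error count plus the warning count (3 + 9 = 12 above). A hypothetical helper (illustrative only) that reads those counts from a report and checks that they agree:

```python
import re

# Hypothetical helper: read the error/warning counts and the Total line from
# a SAN diagnostics report and confirm they are consistent.
def san_summary(report):
    errors = warnings = total = 0
    m = re.search(r"(\d+)\s+errors:", report)
    if m:
        errors = int(m.group(1))
    m = re.search(r"(\d+)\s+warnings:", report)
    if m:
        warnings = int(m.group(1))
    m = re.search(r"Total=(\d+)", report)
    if m:
        total = int(m.group(1))
    return {"errors": errors, "warnings": warnings,
            "consistent": errors + warnings == total}

print(san_summary("3 errors:\n...\n9 warnings:\n...\nTotal=12"))
# prints {'errors': 3, 'warnings': 9, 'consistent': True}
```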
View the Fibre Channel Details<br />
On the Fibre Channel Diagnostics menu, type 2 (View Fibre Channel details) and<br />
press Enter to show the current Fibre Channel details.<br />
The operation mode is identified automatically according to the SAN switch<br />
configuration. Usually the RA is configured for point-to-point mode unless the SAN<br />
switch port is hard-set to loop (L_Port) mode.<br />
Note: You can use the View Fibre Channel details capability to obtain information about<br />
WWNs that is needed for zoning.<br />
You can check the status for the following on the Fibre Channel Diagnostics menu:<br />
• Speed<br />
• Operating mode<br />
• Node WWN<br />
• Changes made<br />
• Connection issues<br />
• Additions of new HBAs<br />
Sample results showing Fibre Channel details for port 0 and port 1 follow:<br />
ConfigB_Site2 Box2>>2<br />
>> Port 0<br />
------------------------------------<br />
wwn = 5001248200875c81<br />
node_wwn = 5001248200875c80<br />
port id = 0x20100<br />
operating mode = point to point<br />
speed = 2 GB<br />
Port 1<br />
------------------------------------<br />
wwn = 5001248201a75c81<br />
node_wwn = 5001248201a75c80<br />
port id = 0x20500<br />
operating mode = point to point<br />
speed = 2 GB<br />
If all cables are disconnected, the operating mode results for all ports are disconnected.<br />
If only one cable is disconnected, then the operating mode for the affected port is<br />
disconnected, as shown in the following sample results:<br />
ConfigB_Site2 Box2>>2<br />
>> Port 0<br />
------------------------------------<br />
wwn = 5001248200875c81<br />
node_wwn = 5001248200875c80<br />
port id = 0x20100<br />
operating mode = point to point<br />
speed = 2 GB<br />
Port 1<br />
------------------------------------<br />
wwn = 5001248201a75c81<br />
node_wwn = 5001248201a75c80<br />
port id = 0x0<br />
operating mode = disconnected<br />
speed = 2 GB<br />
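A quick way to spot a pulled or failed cable in this output is to look for ports whose operating mode is disconnected. The following hypothetical helper (illustrative, not part of the product) does that scan over "View Fibre Channel details" output like the listing above:

```python
import re

# Hypothetical helper: list port numbers whose operating mode reads
# "disconnected" in "View Fibre Channel details" output.
def disconnected_ports(details):
    down, port = [], None
    for line in details.splitlines():
        m = re.search(r"Port (\d+)", line)   # "Port 0", ">> Port 1", ...
        if m:
            port = int(m.group(1))
        elif "operating mode" in line and "disconnected" in line:
            down.append(port)
    return down

sample = """>> Port 0
operating mode = point to point
Port 1
port id = 0x0
operating mode = disconnected"""
print(disconnected_ports(sample))  # prints [1]
```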
Detect Fibre Channel Targets<br />
On the Fibre Channel Diagnostics menu, type 3 (Detect Fibre Channel targets) and<br />
press Enter to see a list of the targets that are accessible to the RA through ports A<br />
and B.<br />
Some of the reasons to use this capability are as follows:<br />
• Zoning issues<br />
• Failure to detect a host<br />
• SAN connection issues<br />
• Need for WWN or storage details of each RA<br />
The following sample results provide port WWN, node WWN, and port information:<br />
ConfigB_Site2 Box2>>3<br />
>><br />
Port 0<br />
Port WWN Node WWN Port ID<br />
----------------------------------------------------<br />
1) 0x500601631060090d 0x500601609060090d 0x20000<br />
2) 0x5006016b1060090d 0x500601609060090d 0x20400<br />
Port 1<br />
Port WWN Node WWN Port ID<br />
----------------------------------------------------<br />
1) 0x500601631060090d 0x500601609060090d 0x20000<br />
2) 0x5006016b1060090d 0x500601609060090d 0x20400<br />
Detect Fibre Channel LUNs<br />
On the Fibre Channel Diagnostics menu, type 4 (Detect Fibre Channel LUNs) and<br />
press Enter to see a list of all volumes on the SAN that are visible to the RA.<br />
Use this capability to detect<br />
• Issues with volume access<br />
• LUN repository details<br />
• Additions of volumes<br />
The following sample results show the types of information returned; long lines wrap<br />
around in the display:<br />
ConfigB_Site2 Box2>>4<br />
>>This operation may take a few minutes...<br />
Size Vendor Product Serial Number Vendor Specific UID<br />
Port WWN LUN CGs Site ID<br />
================================================================================<br />
1. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 127<br />
CLARION: 60,06,01,60,9b,c3,0e,00,8d,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 0 2<br />
2. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 125<br />
CLARION: 60,06,01,60,9b,c3,0e,00,8b,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 1 2<br />
3. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 123<br />
CLARION: 60,06,01,60,9b,c3,0e,00,89,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 2 2<br />
4. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 121<br />
CLARION: 60,06,01,60,9b,c3,0e,00,87,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 3 2<br />
5. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 119<br />
CLARION: 60,06,01,60,9b,c3,0e,00,85,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 4 2<br />
6. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 117<br />
CLARION: 60,06,01,60,9b,c3,0e,00,83,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 5 2<br />
7. 1.00GB DGC RAID 5 APM00031800182 LUN ID: 115<br />
CLARION: 60,06,01,60,9b,c3,0e,00,81,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 6 0<br />
8. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 113<br />
CLARION: 60,06,01,60,9b,c3,0e,00,7f,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 7 2<br />
9. 62.00GB DGC RAID 5 APM00031800182 LUN ID: 111<br />
CLARION: 60,06,01,60,9b,c3,0e,00,7d,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 8 40<br />
10. N/A DGC LUNZ APM00031800182 -<br />
N/A<br />
0 500601631060090d 0 N/A<br />
11. N/A DGC LUNZ APM00031800182 -<br />
N/A<br />
0 5006016b1060090d 0 N/A<br />
12. N/A DGC LUNZ APM00031800182 -<br />
N/A<br />
1 5006016b1060090d 0 N/A<br />
Detect Fibre Channel SCSI-3 Reserved LUNs<br />
On the Fibre Channel Diagnostics menu, type 5 (Detect Fibre Channel SCSI-3<br />
reserved LUNs) and press Enter to list all LUNs that have SCSI-3 reservations. The<br />
information returned includes the WWN, LUN number, port number, and reservation<br />
type.<br />
Perform I/O to a LUN<br />
On the Fibre Channel Diagnostics menu, type 6 (Perform I/O to a LUN) and press<br />
Enter to initiate a dialog that guides you through performing an I/O operation to a LUN.<br />
Note: The write operation destroys any existing data on the LUN. Use the write<br />
operation only when you are installing at the site.<br />
The following example for a read operation shows sample responses in bold type.<br />
SYDNEY Box1>>6<br />
>>This operation may take a few minutes...<br />
Size Vendor Product Serial Number Vendor Specific UID<br />
Port WWN Ctrl LUN<br />
============================================================================<br />
1. 9.00GB DGC RAID 5 APM00024400378 LUN 29 Sydney<br />
JouCLARION: 60,06,01,60,db,e3,0e,00,d1,3a,e0,54,cd,b6,db,11:0<br />
0 500601601060009a SP-A 0<br />
0 500601681060009a SP-B 0<br />
1 500601601060009a SP-A 0<br />
1 500601681060009a SP-B 0<br />
.<br />
.<br />
.<br />
10. 10.00GB DGC RAID 5 APM00024400378 LUN 36 Sydney<br />
JouCLARION: 60,06,01,60,db,e3,0e,00,da,3a,e0,54,cd,b6,db,11:0<br />
0 500601601060009a SP-A 10<br />
0 500601681060009a SP-B 10<br />
1 500601601060009a SP-A 10<br />
1 500601681060009a SP-B 10<br />
Select: 6<br />
Select operation to perform:<br />
** Operation To Perform **<br />
[1] Read<br />
[2] Write<br />
SYDNEY Box1>>1<br />
>><br />
Enter the desired transaction size:<br />
SYDNEY Box1>>10485760<br />
Do you want to read the whole LUN? (y/n)<br />
>>y<br />
1 buffers in<br />
1 buffers out<br />
total time : 0.395567 seconds<br />
2.65082e+07 bytes/sec<br />
25.2802 MB/sec<br />
2.52802 IO/sec<br />
CRC = 4126172682534249172<br />
I/O succeeded.<br />
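The throughput figures in this output follow directly from the transaction size and elapsed time; note that MB here means 2^20 bytes. A short check of the arithmetic, using the values from the read example above:

```python
# Reproducing the read example's throughput figures: one 10,485,760-byte
# transaction completed in 0.395567 seconds.
size_bytes = 10485760   # transaction size entered in the dialog
buffers = 1             # "1 buffers in / 1 buffers out"
seconds = 0.395567      # "total time"

bytes_per_sec = size_bytes * buffers / seconds   # ~2.65082e+07 bytes/sec
mb_per_sec = bytes_per_sec / 1048576             # ~25.2802 MB/sec (MB = 2**20 bytes)
io_per_sec = buffers / seconds                   # ~2.52802 IO/sec
```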
The following example for a write operation shows sample responses in bold type.<br />
SYDNEY Box1>>6<br />
>>This operation may take a few minutes...<br />
Size Vendor Product Serial Number Vendor Specific UID<br />
Port WWN Ctrl LUN<br />
============================================================================<br />
1. 9.00GB DGC RAID 5 APM00024400378 LUN 29 Sydney<br />
JouCLARION: 60,06,01,60,db,e3,0e,00,d1,3a,e0,54,cd,b6,db,11:0<br />
0 500601601060009a SP-A 0<br />
0 500601681060009a SP-B 0<br />
1 500601601060009a SP-A 0<br />
1 500601681060009a SP-B 0<br />
.<br />
.<br />
.<br />
10. 10.00GB DGC RAID 5 APM00024400378 LUN 36 Sydney<br />
JouCLARION: 60,06,01,60,db,e3,0e,00,da,3a,e0,54,cd,b6,db,11:0<br />
0 500601601060009a SP-A 10<br />
0 500601681060009a SP-B 10<br />
1 500601601060009a SP-A 10<br />
1 500601681060009a SP-B 10<br />
============================================================================<br />
Select: 10<br />
Select operation to perform:<br />
** Operation To Perform **<br />
[1] Read<br />
[2] Write<br />
SYDNEY Box1>>2<br />
>><br />
Enter the desired transaction size:<br />
SYDNEY Box1>>10485760<br />
Enter the number of transactions to perform:<br />
SYDNEY Box1>>100<br />
Enter the number of blocks to skip:<br />
SYDNEY Box1>>16<br />
100 buffers in<br />
100 buffers out<br />
total time : 40.7502 seconds<br />
2.57318e+07 bytes/sec<br />
24.5398 MB/sec<br />
2.45398 IO/sec<br />
CRC = 3829111553924479115<br />
I/O succeeded.<br />
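The summary figures in these examples follow directly from the transaction size, transaction count, and elapsed time. The following minimal Python sketch (illustrative only, not part of the product) reproduces the arithmetic behind the write example's summary lines:<br />

```python
# Reproduce the diagnostic's summary lines for the write example:
# 100 transactions of 10,485,760 bytes in 40.7502 seconds.
size = 10_485_760        # transaction size in bytes, as entered above
buffers = 100            # number of transactions
elapsed = 40.7502        # total time in seconds, from the sample output

bytes_per_sec = size * buffers / elapsed
print(f"{bytes_per_sec:.5e} bytes/sec")           # 2.57318e+07 bytes/sec
print(f"{bytes_per_sec / 1_048_576:.4f} MB/sec")  # 24.5398 MB/sec
print(f"{buffers / elapsed:.5f} IO/sec")          # 2.45398 IO/sec
```

Note that MB/sec here uses binary megabytes (1 MB = 1,048,576 bytes), which matches the sample output.<br />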
Synchronization Diagnostics<br />
On the Diagnostics menu, type 3 (Synchronization diagnostics) and press Enter to<br />
verify that an RA is synchronized.<br />
Note: The RA must be attached to run the synchronization diagnostics. Reattaching the<br />
RA causes the RA to reboot.<br />
The results displayed are similar to the following example:<br />
remote refid st t when poll reach delay offset jitter<br />
=============================================================================<br />
*10.10.0.1 192.116.202.203 3 u 438 1024 377 0.337 12.971 6.241<br />
+11 10.10.0.1 2 u 484 1024 376 0.090 -4.530 0.023<br />
LOCAL(0) LOCAL(0) 13 1 2 64 377 0.000 0.000 0.004<br />
The columns in the previous output are defined as follows:<br />
• remote—host names or addresses of the servers and peers used for synchronization<br />
• refid—current source of synchronization<br />
• st—stratum<br />
• t—type (u=unicast, m=multicast, l=local, – =do not know)<br />
• when—time since the peer was last heard, in seconds<br />
• poll—poll interval, in seconds<br />
• reach—status of the reachability register in octal format<br />
• delay—latest delay in milliseconds<br />
• offset—latest offset in milliseconds<br />
• jitter—latest jitter in milliseconds<br />
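The reach value is easiest to read as a bit pattern. The following illustrative Python snippet (not part of the product) decodes the octal register from the sample output; each of its 8 bits records whether one of the last eight polls received a reply:<br />

```python
# Decode the octal "reach" register shown in the sample output above.
reach = int("377", 8)          # 377 octal = 255 decimal
print(f"{reach:08b}")          # 11111111 -> all of the last 8 polls succeeded

partial = int("376", 8)        # a register value with one missed poll
print(f"{partial:08b}")        # 11111110
```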
The symbol at the left margin indicates the synchronization status of each peer. The<br />
currently selected peer is marked with an asterisk (*); additional peers designated as<br />
acceptable for synchronization are marked with a plus sign (+). Peers marked with * and<br />
+ are included in the weighted average computation to set the local clock. Data<br />
produced by peers marked with other symbols is discarded. The LOCAL(0) entry<br />
represents the values obtained from the internal clock on the local machine.<br />
Collect System Info<br />
On the Diagnostics menu, type 4 (Collect system info) and press Enter to collect<br />
system information for later processing and analysis. You specify where to place the<br />
information collected. In some cases, you might need to transfer it to a vendor for<br />
technical support. You are prompted to provide the following information:<br />
• The time frame for log collection<br />
• Whether to collect information from the remote site<br />
• FTP details if you choose to send the results to an FTP server<br />
• Which logs to collect<br />
• Whether you have SANTap switches from which you want to collect information<br />
Note: The dialog asks whether you want full collection. If you choose full collection,<br />
additional technical information is supplied, but the time required for the collection<br />
process is lengthened. Unless specifically instructed by a Unisys service representative,<br />
do not choose full collection.<br />
The following dialog provides sample responses in bold type for collecting system<br />
information:<br />
>>GMT right now is 11/24/2005 14:45:43<br />
Enter the start date:<br />
>>11/22/2005<br />
Enter the start time:<br />
>>12:00:00<br />
Enter the end date:<br />
>>11/24/2005<br />
Enter the end time:<br />
>>14:45:43<br />
Note: The start and end times are used only for collection of the system<br />
logs. Logs from hosts are collected in their entirety.<br />
Do you want to collect system information from the other site also? (y/n)<br />
>>y<br />
Do you want to send results to an ftp server? (y/n)<br />
>>y<br />
Enter the name of the ftp server to which you want to transfer the<br />
collected system information:<br />
>>ftp.ess.unisys.com<br />
Enter the port number to which to connect on the FTP server:<br />
>>21<br />
Enter the FTP user name:<br />
>>MY_USERNAME<br />
Enter the location on the FTP server in which you want to put the collected<br />
system information:<br />
>>incoming<br />
Enter the file on the FTP server in which you want to put the collected<br />
system information:<br />
>>19557111_company.tar<br />
Enter the FTP password:<br />
>>*******<br />
Select the logs you want to collect:<br />
** Collection mode **<br />
[1] Collect logs from RAs only<br />
[2] Collect logs from hosts only<br />
[3] Collect logs from RAs and hosts<br />
>>3<br />
Do you have SANTap switches from which you want to collect information?<br />
>>n<br />
Do you want to perform full collection? (y/n)<br />
>>n<br />
Do you want to limit collection time? (y/n)<br />
>>n<br />
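The start and end values entered above define the collection window for the system logs. As an illustration (assuming the MM/DD/YYYY format shown in the dialog), the window length can be checked with the standard library:<br />

```python
# Compute the length of the log-collection window entered in the dialog.
from datetime import datetime

fmt = "%m/%d/%Y %H:%M:%S"
start = datetime.strptime("11/22/2005 12:00:00", fmt)
end = datetime.strptime("11/24/2005 14:45:43", fmt)
print(end - start)   # 2 days, 2:45:43
```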
Once you complete the information-entry dialog, Installation Manager checks<br />
connectivity and displays a list of accessible hosts for which the feature is enabled. (See<br />
the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong> for more<br />
information.) You must indicate the hosts for which you want to collect logs. You can<br />
select one or more individual hosts or enter NONE or ALL.<br />
Once you specify the hosts, Installation Manager returns system information and logs for<br />
all accessible RAs, including the remote RAs, if so instructed. This software also returns<br />
a success or failure status report for each RA from which it has been instructed to collect<br />
information.<br />
Installation Manager also collects logs for the selected hosts and reports on the success<br />
or failure of each collection. The timeout on the collection process is 20 minutes.<br />
If you requested that the collected information be stored on an FTP server, the<br />
system reports that it is transferring the collected information to the specified FTP<br />
location. When the transfer completes, you are prompted to press Enter to continue.<br />
You can also open or download the stored files using your browser. Log in as<br />
webdownload/webdownload, and access the files at one of these URLs:<br />
• For nonsecured servers: http:///info/<br />
• For secured servers: https:///info/<br />
The following error conditions apply:<br />
• If the connection with an RA is lost while information collection is in progress, no<br />
information is collected.<br />
You can run the process again. If the collection from the remote site failed because<br />
of a WAN failure, run the process locally at the remote site.<br />
• If simultaneous information collection is occurring from the same RA, only the<br />
collector that established the first connection can succeed.<br />
• FTP failure results in failure of the entire process.<br />
If this process fails to collect the desired host information, you can alternatively generate<br />
host information collection directly for individual hosts. Use the Host Information<br />
Collector (HIC) utility as described in Appendix A. Also, the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong><br />
Administrator’s <strong>Guide</strong> provides additional information about the HIC utility.<br />
Appendix D<br />
Replacing a Replication Appliance<br />
(RA)<br />
To replace an RA at a site, you must perform the following tasks as described in this<br />
appendix:<br />
• Save configuration settings.<br />
• Record the group properties and save the Global cluster mode settings.<br />
• Modify the Preferred RA setting.<br />
• Detach the failed RA.<br />
• Remove the Fibre Channel adapter cards.<br />
• Install and configure the replacement RA.<br />
• Verify the RA installation.<br />
• Restore group properties.<br />
• Ensure the existing RA can switch over to the new RA.<br />
Note: During this process, be sure that the direction of all consistency groups is from<br />
the site without the failed RA to the site with the failed RA. You might need to move<br />
groups.<br />
Saving the Configuration Settings<br />
Before you replace an RA, Unisys recommends that you save the current environment<br />
settings to a file. The saved file is a script that contains CLI commands for all groups,<br />
volumes, and replication pairs needed to re-create the environment. The file is used for<br />
backup purposes only.<br />
1. From a command prompt on the management PC, enter the following command to<br />
change to the directory where the plink.exe file is located:<br />
cd putty<br />
2. Update the following command with your site management IP address and<br />
administrator (admin) password, and then enter the command:<br />
plink -ssh site management IP address -l admin -pw admin password<br />
save_settings > sitexandsitey.txt<br />
Note: If a message is displayed asking whether you want to add a cached registry<br />
key, type y and press Enter. The file is automatically saved to the management PC<br />
in the same directory from which the command was issued.<br />
If you need to restore the settings saved in the previous procedure, update the following<br />
command with your site management IP address and administrator (admin) password,<br />
and then enter the command:<br />
plink -ssh site management IP address -l admin -pw admin password -m<br />
sitexandsitey.txt<br />
Recording Policy Properties and Saving Settings<br />
Before you begin the RA replacement procedure, be sure to record the policy properties<br />
and save the Global cluster mode settings.<br />
Perform the following steps for each consistency group to record policy properties and<br />
save settings:<br />
1. Select the Policy tab.<br />
2. Write down and save the current preferred RA settings and Global cluster mode<br />
parameter for each consistency group. Use this record to restore these values after<br />
you replace the RA.<br />
3. Click OK.<br />
4. Repeat steps 1 through 3 for all the other groups.<br />
Modifying the Preferred RA Setting<br />
For each consistency group, record the Preferred RA and Global cluster mode settings<br />
so that they can be restored at the end of this procedure.<br />
Perform the following steps to change all consistency groups that were running on the<br />
failed RA to a surviving RA:<br />
1. Select the Policy tab.<br />
2. Change the Preferred RA setting to a surviving RA number for all consistency<br />
groups that had the Preferred RA value set to the failed RA. Perform steps 2a<br />
through 2e for each group.<br />
a. If the Global cluster mode parameter is set to one of the following options,<br />
skip this step and continue with step 2d:<br />
• None<br />
• Manual (shared quorum)<br />
• Manual<br />
b. Change the Global cluster mode parameter to<br />
• Manual (shared quorum) (if using MSCS with a shared quorum)<br />
• Manual (if using MSCS with a majority node set)<br />
c. Click Apply.<br />
d. Change the Preferred RA setting, and then click Apply.<br />
e. Change the Global cluster mode parameter to the original setting.<br />
f. Click Apply.<br />
3. Select the Consistency Group and click the Status tab to verify that all groups<br />
are running on the new RA number.<br />
Review the current status of the preferred RA under the components pane.<br />
4. Detach the failed RA. If you can log on to the RA, detach it by performing the<br />
following steps. Otherwise, continue with “Removing Fibre Channel Adapter Cards.”<br />
a. Use the PuTTY utility to connect to the box management IP address for the RA that<br />
is being replaced.<br />
b. Type boxmgmt when prompted to log in, and then type the appropriate<br />
password if it has changed from the default password boxmgmt.<br />
The Main Menu is displayed.<br />
c. Type 4 (Cluster operations) and press Enter.<br />
d. Type 2 (Detach from cluster) to detach the RA from the cluster, and then press<br />
Enter.<br />
e. Type y when prompted to detach and press Enter.<br />
f. Type B (Back) and press Enter to return to the Main Menu.<br />
g. Type quit and close the PuTTY window.<br />
Removing Fibre Channel Adapter Cards<br />
Perform the following to remove the RA and Fibre Channel host bus adapters (HBAs):<br />
1. Power off the failed RA.<br />
2. Physically disconnect and remove the failed RA from the rack.<br />
3. Physically remove the Fibre Channel HBAs from the failed RA and insert them into<br />
the replacement RA.<br />
Note: If you cannot use the cards from the existing RA, refer to “Failure of All SAN<br />
Fibre Channel Host Bus Adapters (HBAs)” in Section 8 for information about<br />
replacing a failed HBA.<br />
Installing and Configuring the Replacement RA<br />
To install and configure the replacement RA, you must complete several tasks, as follows:<br />
• Complete the procedure in “Cable and Apply Power to New RA.”<br />
• Complete the procedure in “Connecting and Accessing the RA.”<br />
• Complete the procedure in “Configuring the RA.”<br />
• Complete the procedures in “Verifying the RA Installation.”<br />
Cable and Apply Power to the New RA<br />
1. Insert the new RA into the rack and apply power.<br />
2. Insert the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> RA Setup Disk CD-ROM into the CD/DVD<br />
drive of the RA. Ensure that this disk is the same version that is running in the other<br />
RAs.<br />
3. Power off and then power on the RA.<br />
4. As the RA boots, check the BIOS level as displayed in the Unisys banner and note<br />
the level displayed. At the end of the replacement procedure, you can compare the<br />
existing RA BIOS level with the new RA BIOS level. The RA BIOS might need to be<br />
updated.<br />
Connecting and Accessing the RA<br />
1. Power on the appropriate RA.<br />
2. Connect an Ethernet cable between the management PC used for installation and<br />
the WAN Ethernet segment to which the RA is connected.<br />
If you connect the management PC directly to the RA, use a crossover cable.<br />
3. Assign the following IP address and subnet mask to the management PC:<br />
10.77.77.50 (IP address)<br />
255.255.255.0 (subnet mask)<br />
4. Access the RA by using the SSH client. (See Appendix C.) Use the 10.77.77.77 IP<br />
address, which has a subnet mask of 255.255.255.0.<br />
5. Log in with the boxmgmt user name and the boxmgmt password.<br />
6. Provide the following information for the layout of the RA installation:<br />
a. When prompted about the number of sites in the environment<br />
• Type 2 to install in a geographic replication environment or a geographic<br />
clustered environment.<br />
• Type 1 to install in a continuous data protection environment.<br />
b. Type the number of RAs at the site, and press Enter.<br />
The Main Menu appears.<br />
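The addressing in steps 3 and 4 works because both addresses fall in the same /24 network. The following quick check is purely illustrative (standard library only; the addresses are the ones listed above):<br />

```python
# Confirm the management PC and the RA's default address share the
# 255.255.255.0 (/24) subnet used during installation.
import ipaddress

net = ipaddress.ip_network("10.77.77.0/24")
pc = ipaddress.ip_address("10.77.77.50")   # management PC, from step 3
ra = ipaddress.ip_address("10.77.77.77")   # RA default address, from step 4
print(pc in net and ra in net)             # True
```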
Checking Storage-to-RA Access<br />
Verify that all LUNs are accessible by using the Main Menu of the Installation<br />
Manager and performing the following steps. If the LUNs are not accessible, check<br />
your switch configuration and zoning:<br />
1. Type 3 (Diagnostics).<br />
2. Type 2 (Fibre Channel diagnostics).<br />
3. Type 4 (Detect Fibre Channel LUNs).<br />
After a few minutes, a list of detected LUNs appears.<br />
4. Press the spacebar until all expected LUNs appear.<br />
5. Type B (Back).<br />
6. Type B again.<br />
The Main Menu appears.<br />
7. If you do not see all Fibre Channel LUNs in step 4, correct the environment and<br />
repeat steps 1 through 6.<br />
Enabling PCI-X Slot Functionality<br />
If your system is configured with a gigabit (Gb) WAN, which is used for the optical WAN<br />
connection, perform the following steps on the Main Menu of the replacement RA:<br />
1. Type 2 (Setup).<br />
2. Type 8 (Advanced option).<br />
3. Type 12 (Enable/disable additional remote interface).<br />
4. Type yes when prompted whether to enable the additional remote interface.<br />
5. Type B twice to return to the Main Menu.<br />
Configuring the RA<br />
1. On the Main Menu, type 1 (Installation).<br />
2. Type 2 (Get Setup information from an installed RA). Press Enter.<br />
The Get Settings Wizard menu appears with Get Settings from Installed<br />
RA selected.<br />
3. Press Enter.<br />
4. Type 1 (Management interface) to view the settings from the installed RA.<br />
5. Type y when prompted to configure a temporary IP address.<br />
6. Type the IP address.<br />
7. Type the IP subnet mask and then press Enter.<br />
8. Type y or n, depending on your environment, when prompted to configure a<br />
gateway.<br />
9. Type the box management IP address of Site 1 RA 1 to import the settings from that<br />
RA.<br />
10. Type y to import the settings.<br />
11. Press Enter to continue when a message states that the configuration was<br />
successfully imported.<br />
The Get Settings Wizard menu appears with Apply selected.<br />
12. Perform the following steps to apply the configuration to the RA:<br />
a. Press Enter to continue.<br />
The complete list of settings is displayed. These settings are the same as the<br />
ones for Site 1 RA 1.<br />
b. Type y to apply these settings.<br />
c. Type 1 or 2 when prompted for a site number, depending on the site on which<br />
the RA is located.<br />
d. Type the RA number when prompted.<br />
A confirmation message appears when the settings are applied successfully.<br />
e. Press Enter.<br />
The Get Settings Wizard menu appears with Proceed to the Complete<br />
Installation Wizard selected.<br />
f. Press Enter to continue.<br />
The Complete Installation Wizard menu appears with Configure<br />
repository volume selected.<br />
13. Configure the repository volume by completing the following steps:<br />
a. Press Enter.<br />
b. Type 2 (Select a previously formatted repository volume).<br />
c. Select the number of the repository volume corresponding to the group of<br />
displayed volumes, and press Enter.<br />
d. Press Enter again.<br />
The Complete Installation Wizard menu appears with Attach to cluster<br />
selected.<br />
14. Attach the RA to the RA cluster by completing the following steps:<br />
a. Press Enter.<br />
b. Type y at the prompt to attach to the cluster.<br />
The RA reboots.<br />
c. Close the PuTTY session if necessary.<br />
Verifying the RA Installation<br />
To verify that the RA is correctly installed, you must<br />
• Verify the WAN bandwidth<br />
• Verify the clock synchronization<br />
Verifying WAN Bandwidth<br />
Use the following procedure to verify the actual versus the expected WAN bandwidth.<br />
Note: Correct any problems and rerun the verification.<br />
1. Open an SSH session to the box management IP address for the replacement RA.<br />
2. Type boxmgmt when prompted to log in, and then type the appropriate password<br />
if it has changed from the default password boxmgmt.<br />
The Main Menu is displayed.<br />
3. Type 3 (Diagnostics) and press Enter.<br />
The Diagnostics menu appears.<br />
4. Type 1 (IP diagnostics) and press Enter.<br />
The IP Diagnostics menu appears.<br />
5. Type 4 (Test throughput) and press Enter.<br />
6. Type the WAN IP address of the peer RA; for example, site 2 RA 1 is the peer for<br />
site 1 RA 1.<br />
7. Type 2 (WAN interface).<br />
8. At the prompt, type 20 to change the default value for the desired number of<br />
concurrent streams.<br />
9. At the prompt for the test duration, type 60 to change the default value.<br />
A message is displayed that the connection was established.<br />
10. After 60 seconds, make sure that the following information is displayed on the<br />
screen. Ignore any TCP Window Size warnings.<br />
• IP connection for every stream<br />
• Interval, Transfer, and Bandwidth for every stream<br />
• Expected bandwidth in the [SUM] display at the bottom of the screen<br />
11. On the IP Diagnostics menu, type Q (Quit), and then type y.<br />
Verifying Clock Synchronization<br />
The timing of all Unisys <strong>SafeGuard</strong> 30m activities across all RAs in an installation must<br />
be synchronized against a single clock (for example, on the network time protocol [NTP]<br />
server). Consequently, you need to synchronize the replacement RA.<br />
For the procedure to verify RA synchronization, see the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong><br />
Replication Appliance Installation <strong>Guide</strong>.<br />
Restoring Group Properties<br />
Perform the following steps on the Management Console for each group that needs<br />
to have the Preferred RA setting restored to an RA other than RA 1. At this point, all<br />
Preferred RA settings are set to RA 1.<br />
1. Select the Policy tab for the consistency group.<br />
2. On the General Settings section, change the Preferred RA setting to the<br />
original setting, and then click Apply.<br />
3. Change the Global cluster mode under Advanced to the original setting if it<br />
was changed earlier.<br />
4. Click Apply.<br />
Ensuring the Existing RA Can Switch Over to the<br />
New RA<br />
Once the new RA is part of the configuration, verify that the management console does<br />
not display any errors. Shut down any other RA at the site to ensure that the newly<br />
replaced RA can successfully complete the switchover. As the existing RA reboots,<br />
check the BIOS level as displayed in the Unisys banner and note it.<br />
Compare the BIOS level noted for the existing (rebooting) RA with the BIOS level you<br />
noted for the replacement RA. If the BIOS levels do not match, contact the Unisys<br />
<strong>Support</strong> Center to obtain the correct BIOS.<br />
Appendix E<br />
Understanding Events<br />
Event Log<br />
Event Topics<br />
Various events generate entries to the Unisys <strong>SafeGuard</strong> 30m solution system log.<br />
These events are predefined in the system according to topic, level of severity, and<br />
scope. The Unisys <strong>SafeGuard</strong> 30m solution supports proactive notification of an event—<br />
either by sending e-mail messages or by generating system log events that are logged<br />
by a management application.<br />
The system records log entries in response to a wide range of predefined events. Each<br />
event carries an event ID. For manageability, the system divides the events into general<br />
and advanced types. In most cases, you can monitor system behavior effectively by<br />
viewing the general events only. For troubleshooting a problem, technical support<br />
personnel might want to review the advanced log events.<br />
Event topics correspond to the components where the events occur, including<br />
• Management (management console and CLI)<br />
• Site<br />
• RA<br />
• Consistency group<br />
• Splitter<br />
A single event can generate multiple log entries.<br />
Event Levels<br />
Event Scope<br />
The levels of severity for events are defined as follows (in ascending order):<br />
• Info<br />
These messages are informative in nature, usually referring to changes in the<br />
configuration or normal system state.<br />
• Warning<br />
These messages indicate a warning, usually referring to a transient state or to an<br />
abnormal condition that does not degrade system performance.<br />
• Error<br />
These messages indicate an important event that is likely to disrupt normal system<br />
behavior, performance, or both.<br />
A single change in the system—for example, an error over a communications line—can<br />
affect a wide range of system components and cause the system to generate a large<br />
number of log events. Many of these events contain highly technical information that is<br />
intended for use by Unisys service representatives. When all of the events are displayed,<br />
you might find it difficult to identify the particular events in which you are interested.<br />
You can use the scope to manage the type and quantity of events that are displayed in<br />
the log. An event belongs to one of the following scopes:<br />
• Normal<br />
Events with a Normal scope result when the system analyzes a wide range of<br />
system data to generate a single event that explains the root cause for an entire set<br />
of Detailed and Advanced events. Usually, these events are sufficient for effective<br />
monitoring of system behavior.<br />
• Detailed<br />
Events with a Detailed scope include all events for all components that are<br />
generated for users and that are not included among the events that have a Normal<br />
scope. The display of Detailed events includes Normal events also.<br />
• Advanced<br />
Events with an Advanced scope contain technical information. In some cases, such<br />
as troubleshooting a problem, a Unisys service representative might need to retrieve<br />
information from the Advanced log events.<br />
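The three scopes nest: a Detailed display includes Normal events, and an Advanced display includes both. The following minimal Python model (illustrative only, not the product's API) captures that filtering rule:<br />

```python
# Model scope filtering: each display scope shows events at that scope
# or any narrower one (Normal < Detailed < Advanced).
ORDER = {"Normal": 0, "Detailed": 1, "Advanced": 2}

def displayed(events, scope):
    return [e["id"] for e in events if ORDER[e["scope"]] <= ORDER[scope]]

events = [
    {"id": 1000, "scope": "Normal"},    # a root-cause event
    {"id": 1009, "scope": "Detailed"},
    {"id": 9999, "scope": "Advanced"},  # hypothetical technical event
]
print(displayed(events, "Normal"))    # [1000]
print(displayed(events, "Detailed"))  # [1000, 1009]
print(displayed(events, "Advanced"))  # [1000, 1009, 9999]
```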
Displaying the Event Log<br />
You can display the event log either from the Management Console or by using the CLI.<br />
To display event logs, select Logs in the navigation pane; the most recent events in the<br />
event log are displayed. For more information about a particular event log, double-click<br />
the event log. The Log Event Properties dialog box displays details of the individual<br />
event.<br />
You can sort the log events according to any of the columns (that is, level, scope, time,<br />
site, ID, and topic) in ascending or descending order.<br />
Perform the following steps to display advanced logs:<br />
1. Click the Filter log tool bar option in the event pane.<br />
The Filter Log dialog box appears.<br />
2. Change the scope to Advanced.<br />
3. Click OK.<br />
For more information about using the management console, see the Unisys <strong>SafeGuard</strong><br />
<strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong>.<br />
To display the event log from the CLI, run the get_logs command and specify values for<br />
each of the parameters. Specify the parameters carefully to avoid displaying unnecessary<br />
log information. You can use the terse display parameter to show more or less<br />
information for the displayed events as desired.<br />
For information about the CLI, see the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance<br />
Command Line Interface (CLI) Reference <strong>Guide</strong>.<br />
Using the Event Log for <strong>Troubleshooting</strong><br />
The event log provides information that can be useful in determining the cause or nature<br />
of problems that might arise during operation.<br />
The “group capabilities” events provide an important tool for understanding the behavior<br />
of a consistency group. Each group capabilities event—such as group capabilities OK,<br />
group capabilities minor problem, or group capabilities<br />
problem—provides a high-level description of a current group situation with regard to<br />
each of the RAs and identifies the RA that is currently handling the group.<br />
The information reported for each RA includes the following:<br />
• RA status: Indicates whether an RA is currently a member of the RA cluster (that is,<br />
alive) or not a member (that is, dead).<br />
• Marking status: yes or no.<br />
• Transfer status: yes, no, no data loss (that is, flushing), or yes unstable (that is, the<br />
RA cannot be initialized if closed or detached).<br />
• Journal capability: yes (that is, distributing, logged access, and so forth), no, or static<br />
(that is, access to an image is enabled but access to a different image is not enabled,<br />
cannot distribute, and cannot support image access)<br />
• Preferred: yes or no.<br />
In addition, the event log reports the RA on which the group is actually running and the<br />
status of the link between the sites.<br />
A group capabilities event is generated whenever there is a change in the capabilities of<br />
a group on any RA. The message reports on any limitations to the capabilities of the<br />
group and provides reasons for these limitations.<br />
Tracking logged events can explain changes in a group state (for example, the reason<br />
replication was paused, the reason the group switched to another RA, and so forth).<br />
The group capabilities events might offer reasons that particular actions are not<br />
performed. For example, if you want to know the reason the group transfer was paused,<br />
you can check the event log for the “pause replication” action. If, however, you want to<br />
know the reason a group transfer did not start, you might check the most recent group<br />
capabilities event.<br />
The level of a group capabilities event can be INFO, WARNING, or ERROR, depending<br />
on the severity of the reported situation. These levels correspond to the OK, minor<br />
problem, and problem bookmarks that follow group capabilities in the message<br />
descriptions.<br />
List of Events<br />
The list of events is presented in tabular format with the following given for each event:<br />
• Event ID<br />
• Topic (for example, Management, Site, RA, Splitter, Group)<br />
• Level (for example, Info, Warning, Error)<br />
• Description<br />
• Scope<br />
• Time<br />
• Site<br />
List of Normal Events<br />
Event<br />
ID<br />
Understanding Events<br />
Normal events include both root-cause events (a single description for an event that can<br />
generate multiple events) and other selected basic events. Some Normal events do not<br />
have a topic or trigger. Table E–1 lists Normal events with their descriptions.<br />
Topic<br />
Table E–1. Normal Events<br />
Level<br />
Description<br />
1000 Management Info User logged in. (User<br />
)<br />
1001 Management Warning Log in failed. (User<br />
)<br />
1003 Management Warning Failed to generate SNMP<br />
trap. (Trap contents)<br />
1004 Management Warning Failed to send e-mail alert<br />
to specified address.<br />
(Address , Event summary<br />
)<br />
1005 Management Warning Failed to update file. (File<br />
<br />
1006 Management Info Settings changed. (User<br />
, Settings<br />
)<br />
1007 Management Warning Settings change failed.<br />
(User , Settings<br />
, Reason<br />
)<br />
1008 Management Info User action succeeded.<br />
(User , Action<br />
)<br />
Trigger<br />
User log-in action<br />
User failed to log in<br />
The system failed to<br />
send SNMP trap.<br />
The system failed to<br />
send an e-mail alert.<br />
The system failed to<br />
update the local<br />
configuration file<br />
(passwords, SSH<br />
keys, system log<br />
configuration, and<br />
SNMP configuration).<br />
The user changed<br />
settings.<br />
The system failed to<br />
change settings.<br />
The user performed<br />
one of these actions:<br />
bookmark_image,<br />
clear_markers,<br />
set_markers,<br />
undo_logged_<br />
writes, set_num_<br />
of_streams.<br />
6872 5688–002 E–5
Understanding Events<br />
1009 | Management | Warning | User action failed. (User , Action , Reason ) | One of these actions failed: bookmark_image, clear_markers, set_markers, undo_logged_writes, set_num_of_streams.
1011 | Management | Error | Grace period expired. You must install an activation code to activate your license. | Grace period expired.
1014 | Management | Info | User bookmarked an image. (Group , Snapshot ) | The user bookmarked an image.
1015 | Management | Warning | RA-to-storage multipathing problem. (RA , Volume ) | One or more paths between the RA and the volume are not available.
1016 | Management | Warning Off | RA-to-storage multipathing problem fixed. (RA , Volume ) | All paths between the RA and the volume are available.
1017 | Management | Warning | RA-to-splitter multipathing problem. (RA , Splitter ) | One or more paths between the RA and the splitter are not available.
1018 | Management | Warning Off | RA-to-splitter multipathing problem fixed. (RA , Splitter ) | All paths between the RA and the splitter are available.
1019 | Management | Warning | User action succeeded. (Markers cleared. Group ) (Replication set attached as clean. Group ) | The user cleared markers or attached a replication set as clean.
3001 | RA | Warning | RA is no longer a cluster member. (RA ) | An RA is disconnected from site control.
3005 | RA | Error | Settings conflict between sites. (Reason ) | A settings conflict between the sites was discovered.
3006 | RA | Error Off | Settings conflict between sites resolved by user. (Using Site settings) | A settings conflict between the sites was resolved by the user.
3030 | RA | Warning | RA switched path to storage. (RA , Volume ) | A storage path change was initiated by the RA.
4056 | Group | Warning | No image was found in the journal to match the query. (Group ) | No image was found in the journal to match the query.
4090 | Group | Warning | Target-side log is 90 percent full. When the log is full, writing by hosts at the target side is disabled. (Group ) | The target-side log is 90 percent full.
4106 | Group | Warning | Capacity reached; cannot write additional markers for this group to . Starting full sweep. (Group ) | The disk space for the markers was filled for the group.
4117 | Group | Warning | Virtual access buffer is 90 percent full. When the buffer is full, writing by hosts at the target side is disabled. (Group ) | The usage of the virtual access buffer has reached 90 percent.
5008 | Splitter | Warning | Host shut down. (Host , Splitter ) | The host was shut down or restarted.
5010 | Splitter | Warning | Splitter stopped; depending on policy, writing by host might be disabled for some groups, and a full sweep might be required for other groups. (Splitter ) | The user stopped the splitter after removing volumes; volumes are disconnected.
5011 | Splitter | Warning | Splitter stopped; full sweep is required. (Splitter ) | The user stopped the splitter after removing volumes; volumes are disconnected.
5012 | Splitter | Warning | The splitter stopped; write operations to replication volumes are disabled. (Splitter ) | The splitter stopped; host access to all volumes is disabled.
10000 | — | Info | Changes are occurring in the system. Analysis in progress. | —
10001 | — | Info | System changes have occurred. The system is now stable. | —
10002 | — | Info | The system activity has not stabilized; issuing an intermediate report. | —
10101 | — | Error | The cause of the system activity is unclear. To obtain more information, filter the events log using the Detailed scope. | —
10102 | — | Info | Site control recorded internal changes that do not affect system operation. | —
10202 | — | Info | Settings have changed. | —
10203 | — | Info | The RA cluster is down. | —
10204 | — | Error | One or more RAs are disconnected from the RA cluster. | —
10205 | — | Error | A communications problem occurred in an internal process. | —
10206 | — | Info | An internal process was restarted. | —
10207 | — | Error | An internal process was restarted. | —
10210 | — | Error | Initialization is experiencing high-load conditions. | —
10211 | — | Error | A temporary problem occurred in the Fibre Channel link between the splitters and the RAs. | —
10212 | — | Error Off | The temporary problem that occurred in the Fibre Channel link between the splitters and the RAs is resolved. | —
10501 | — | Info | Synchronization completed. | —
10502 | — | Info | Access to the target-side image is enabled. | —
10503 | — | Error | The system is transferring the latest snapshot before pausing transfer (no data loss). | —
10504 | — | Info | The journal was cleared. | —
10505 | — | Info | The system completed undoing writes to the target-side log. | —
10506 | — | Info | The roll to the physical images is complete. Logged access to the physical image is now available. | —
10507 | — | Info | Because of system changes, the journal was temporarily out of service. The journal is now available. | —
10508 | — | Info | All data were flushed from the local-side RA; automatic failover proceeds. | —
10509 | — | Info | The initial long resynchronization has completed. | —
10510 | — | Info | Following a paused transfer, the system is now cleared to restart transfer. | —
10511 | — | Info | The system finished recovering the replication backlog. | —
12001 | — | Error | The splitter is down. | —
12002 | — | Error | An error occurred in all WAN links to the other site. The other site is possibly down. | —
12003 | — | Error | An error occurred in the WAN link to the RA at the other site. | —
12004 | — | Error | An error occurred in the data link over the WAN. All RAs are unable to transfer replicated data to the other site. | —
12005 | — | Error | An error occurred in the data link over the WAN. The RA is unable to transfer replicated data to the other site. | —
12006 | — | Error | The RA is disconnected from the RA cluster. | —
12007 | — | Error | All RAs are disconnected from the RA cluster. | —
12008 | — | Error | The RA is down. | —
12009 | — | Error | The group entered high load. | —
12010 | — | Error | A journal error occurred. Full sweep is to be performed after the error is corrected. | —
12011 | — | Error | The target-side log or virtual buffer is full. Writing by hosts at the target side is disabled. | —
12012 | — | Error | The system cannot enable virtual access to the image. | —
12013 | — | Error | The system cannot enable access to a specified image. | —
12014 | — | Error | The Fibre Channel link between all RAs and all splitters and storage is down. | —
12016 | — | Error | The Fibre Channel link between all RAs and all storage is down. | —
12022 | — | Error | The Fibre Channel link between the RA and splitters or storage volumes (or both) is down. | —
12023 | — | Error | The Fibre Channel link between the RA and all splitters and storage is down. | —
12024 | — | Error | The Fibre Channel link between the RA and all splitters is down. | —
12025 | — | Error | The Fibre Channel link between the RA and all storage is down. | —
12026 | — | Error | An error occurred in the WAN link to the RA at the other site. | —
12027 | — | Error | All replication volumes attached to the consistency group (or groups) are not accessible. | —
12029 | — | Error | The Fibre Channel link between all RAs and one or more volumes is down. | —
12033 | — | Error | The repository volume is not accessible; data might be lost. | —
12034 | — | Error | Writes to storage occurred without corresponding writes to the RA. | —
12035 | — | Error | An error occurred in the WAN link to the RA cluster at the other site. | —
12036 | — | Error | A renegotiation of the transfer protocol is requested. | —
12037 | — | Error | All volumes attached to the consistency group (or groups) are not accessible. | —
12038 | — | Error | All journal volumes attached to the consistency group (or groups) are not accessible. | —
12039 | — | Error | A long resynchronization started. | —
12040 | — | Error | The system detected bad sectors in a volume. | —
12041 | — | Error | The splitter is up. | —
12042 | — | Error | All WAN links to the other site are restored. | —
12043 | — | Error | The WAN link to the RA at the other site is restored. | —
12044 | — | Error | Problem with an IP link between RAs (in at least one direction). | —
12045 | — | Error | Problem with all IP links between RAs. | —
12046 | — | Error | Problem with IP links between RAs. | —
12047 | — | Error | RA network interface card (NIC) problem. | —
14001 | — | Error Off | The splitter is up. | —
14002 | — | Error Off | All WAN links to the other site are restored. | —
14003 | — | Error Off | The WAN link to the RA at the other site is restored. | —
14004 | — | Error Off | The data link over the WAN is restored. All RAs can transfer replicated data to the other site. | —
14005 | — | Error Off | The data link over the WAN is restored. The RA can transfer replicated data to the other site. | —
14006 | — | Error Off | The connection of the RA to the RA cluster is restored. | —
14007 | — | Error Off | The connection of all RAs to the RA cluster is restored. | —
14008 | — | Error Off | The RA is up. | —
14009 | — | Error Off | The group exited high load. The initialization completed. | —
14010 | — | Error Off | The journal error was corrected. A full sweep operation is required. | —
14011 | — | Error Off | The target-side log or virtual buffer is no longer full. | —
14012 | — | Error Off | Virtual access to an image is enabled. | —
14013 | — | Error Off | The system is no longer trying to access a diluted image. | —
14014 | — | Error Off | The Fibre Channel link between all RAs and all splitters and storage is restored. | —
14016 | — | Error Off | The Fibre Channel link between all RAs and all storage is restored. | —
14022 | — | Error Off | The Fibre Channel link that was down between the RA and splitters or storage volumes (or both) is restored. | —
14023 | — | Error Off | The Fibre Channel link between the RA and all splitters and storage is restored. | —
14024 | — | Error Off | The Fibre Channel link between the RA and all splitters is restored. | —
14025 | — | Error Off | The Fibre Channel link between the RA and all storage is restored. | —
14026 | — | Error Off | The WAN link to the RA at the other site is restored. | —
14027 | — | Error Off | Access to all volumes attached to the consistency group (or groups) is restored. | —
14029 | — | Error Off | The Fibre Channel link between all RAs and one or more volumes is restored. | —
14033 | — | Error Off | Access to the repository volume is restored. | —
14034 | — | Error Off | Replication consistency in writes to storage is restored. | —
14035 | — | Error Off | The WAN link to the RA at the other site is restored. | —
14036 | — | Error Off | The renegotiation of the transfer protocol is complete. | —
14037 | — | Error Off | Access to all replication volumes attached to the consistency group (or groups) is restored. | —
14038 | — | Error Off | Access to all journal volumes attached to the consistency group (or groups) is restored. | —
14039 | — | Info | The long resynchronization has completed. | —
14040 | — | Error Off | The system detected a correction of bad sectors in the volume. | —
14041 | — | Error Off | The system detected that the volume is no longer read-only. | —
14042 | — | Error Off | A synchronization is in progress to restore any failed writes in the group. | —
14043 | — | Error Off | A synchronization is in progress to restore any failed writes. | —
14044 | — | Error Off | Problem with an IP link between RAs (in at least one direction) corrected. | —
14045 | — | Error Off | All IP links between RAs restored. | —
14046 | — | Error Off | IP link between RAs restored. | —
14047 | — | Error Off | RA network interface card (NIC) problem corrected. | —
16000 | — | Error | Transient root cause. | —
16001 | — | Error | The splitter was down. The problem is corrected. | —
16002 | — | Error | An error occurred in all WAN links to the other site. The problem is corrected. | —
16003 | — | Error | An error occurred in the WAN link to the RA at the other site. The problem is corrected. | —
16004 | — | Error | An error occurred in the data link over the WAN. All RAs were unable to transfer replicated data to the other site. The problem is corrected. | —
16005 | — | Error | An error occurred in the data link over the WAN. The RA was unable to transfer replicated data to the other site. The problem is corrected. | —
16006 | — | Error | The RA was disconnected from the RA cluster. The connection is restored. | —
16007 | — | Error | All RAs were disconnected from the RA cluster. The problem is corrected. | —
16008 | — | Error | The RA was down. The problem is corrected. | —
16009 | — | Error | The group entered high load. The problem is corrected. | —
16010 | — | Error | A journal error occurred. The problem is corrected. A full sweep is required. | —
16011 | — | Error | The target-side log or virtual buffer was full. Writing by the hosts at the target side was disabled. The problem is corrected. | —
16012 | — | Error | The system could not enable virtual access to the image. The problem is corrected. | —
16013 | — | Error | The system could not enable access to the specified image. The problem is corrected. | —
16014 | — | Error | The Fibre Channel link between all RAs and all splitters and storage was down. The problem is corrected. | —
16016 | — | Error | The Fibre Channel link between all RAs and all storage was down. The problem is corrected. | —
16022 | — | Error | The Fibre Channel link between the RA and splitters or storage volumes (or both) was down. The problem is corrected. | —
16023 | — | Error | The Fibre Channel link between the RA and all splitters and storage was down. The problem is corrected. | —
16024 | — | Error | The Fibre Channel link between the RA and all splitters was down. The problem is corrected. | —
16025 | — | Error | The Fibre Channel link between the RA and all storage was down. The problem is corrected. | —
16026 | — | Error | An error occurred in the WAN link to the RA at the other site. The problem is corrected. | —
16027 | — | Error | All volumes attached to the consistency group (or groups) were not accessible. The problem is corrected. | —
16029 | — | Error | The Fibre Channel link between all RAs and one or more volumes was down. The problem is corrected. | —
16033 | — | Error | The repository volume was not accessible. The problem is corrected. | —
16034 | — | Error Off | Writes to storage occurred without corresponding writes to the RA. The problem is corrected. | —
16035 | — | Error | An error occurred in the WAN link to the RA at the other site. The problem is corrected. | —
16036 | — | Error | The renegotiation of the transfer protocol was requested and has been completed. | —
16037 | — | Error | All replication volumes attached to the consistency group (or groups) were not accessible. The problem is corrected. | —
16038 | — | Error | All journal volumes attached to the consistency group (or groups) were not accessible. The problem is corrected. | —
16039 | — | Info | The system ran a long resynchronization. | —
16040 | — | Error | The system detected bad sectors in the volume. The problem is corrected. | —
16041 | — | Error | The system detected that the volume was read-only. The problem is corrected. | —
16042 | — | Error | The splitter write operation might have failed while the group was transferring data. | —
16043 | — | Error | The splitter write operations might have failed. | —
16044 | — | Error | There was a problem with an IP link between RAs (in at least one direction). | —
16045 | — | Error | There was a problem with all IP links between RAs. The problem has been corrected. | —
16046 | — | Error | There was a problem with an IP link between RAs. The problem has been corrected. | —
16047 | — | Error | There was an RA network interface card (NIC) problem. The problem has been corrected. | —
18001 | — | Error Off | The splitter was temporarily up but is down again. | —
18002 | — | Error Off | All WAN links to the other site were temporarily restored, but the problem has returned. | —
18003 | — | Error Off | The WAN link to the RA at the other site was temporarily restored, but the problem has returned. | —
18004 | — | Error Off | The data link over the WAN was temporarily restored, but the problem has returned. All RAs are unable to transfer replicated data to the other site. | —
18005 | — | Error Off | The data link over the WAN was temporarily restored, but the problem has returned. The RA is currently unable to transfer replicated data to the other site. | —
18006 | — | Error Off | The connection of the RA to the RA cluster was temporarily restored, but the problem has returned. | —
18007 | — | Error Off | All RAs were temporarily restored to the RA cluster, but the problem has returned. | —
18008 | — | Error Off | The RA was temporarily up, but is down again. | —
18009 | — | Error Off | The group temporarily exited high load, but the problem has returned. | —
18010 | — | Error Off | The journal error was temporarily corrected, but the problem has returned. | —
18011 | — | Error Off | The target-side log or virtual buffer was temporarily no longer full, and write operations by the hosts at the target side were re-enabled. However, the problem has returned. | —
18012 | — | Error Off | Virtual access to the image was temporarily enabled, but the problem has returned. | —
18013 | — | Error Off | Access to an image was temporarily enabled, but the problem has returned. | —
18014 | — | Error Off | The Fibre Channel link between all RAs and all splitters and storage was temporarily restored, but the problem has returned. | —
18016 | — | Error Off | The Fibre Channel link between all splitters and all storage was temporarily restored, but the problem has returned. | —
18022 | — | Error Off | The Fibre Channel link that was down between the RA and splitters or storage volumes (or both) was temporarily restored, but the problem has returned. | —
18023 | — | Error Off | The Fibre Channel link between the RA and all storage was temporarily restored, but the problem has returned. | —
18024 | — | Error Off | The Fibre Channel link between the RA and all splitters was temporarily restored, but the problem has returned. | —
18025 | — | Error Off | The Fibre Channel link between the RA and all storage was temporarily restored, but the problem has returned. | —
18026 | — | Error | The WAN link to the RA at the other site was temporarily restored, but the problem has returned. | —
18027 | — | Error Off | Access to all journal volumes attached to the consistency group (or groups) was temporarily restored, but the problem has returned. | —
18029 | — | Error Off | The Fibre Channel link between all RAs and one or more volumes was temporarily restored, but the problem has returned. | —
18033 | — | Error Off | Access to the repository volume was temporarily restored, but the problem has returned. | —
18034 | — | Error Off | Replication consistency in write operations to storage and to RAs was temporarily restored, but the problem has returned. | —
18035 | — | Error Off | The WAN link to the RA at the other site was temporarily restored, but the problem has returned. | —
18036 | — | Error Off | The negotiation of the transfer protocol was completed but is again requested. | —
18037 | — | Error Off | Access to all volumes attached to the consistency group (or groups) was temporarily restored, but the problem has returned. | —
18038 | — | Error Off | Access to all replication volumes attached to the consistency group (or groups) was temporarily restored, but the problem has returned. | —
18039 | — | Info | The long resynchronization completed but has now restarted. | —
18040 | — | Error Off | The user marked the volume as OK, but the bad-sectors problem persists. | —
18041 | — | Error Off | The user marked the volume as OK, but the read-only problem persists. | —
18042 | — | Error Off | The synchronization restored any failed write operations in the group, but the problem has returned. | —
18043 | — | Error Off | An internal problem has occurred. | —
18044 | — | Error Off | Problem with an IP link between RAs (in at least one direction) was corrected, but the problem has returned. | —
18045 | — | Error Off | Problem with all IP links between RAs (in at least one direction) was corrected, but the problem has returned. | —
18046 | — | Error Off | Problem with an IP link between RAs was corrected, but the problem has returned. | —
18047 | — | Error Off | RA network interface card (NIC) problem was corrected, but the problem has returned. | —

List of Detailed Events

Detailed events are component-level events that are generated for users but do not appear in the Normal scope. Table E–2 lists these events and their descriptions.
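The Normal/Detailed scope distinction can be sketched in code. This is a hypothetical illustration, not product behavior: it assumes, per event 10101's advice to "filter the events log using the Detailed scope" for more information, that the Detailed scope shows Detailed events in addition to Normal ones. The `EVENT_SCOPES` mapping and `visible_in` function are invented names; the IDs come from Tables E–1 and E–2.

```python
# Hypothetical sketch: tag each event ID with the narrowest scope it
# appears in, then filter a log view by scope.
NORMAL, DETAILED = "normal", "detailed"

# (event_id -> scope) pairs drawn from Tables E-1 and E-2.
EVENT_SCOPES = {
    1000: NORMAL,    # User logged in.
    1002: DETAILED,  # User logged out.
    3005: NORMAL,    # Settings conflict between sites.
    3008: DETAILED,  # RA appears to be down.
}

def visible_in(scope):
    """Event IDs shown when the log is filtered to the given scope.

    Assumes the Detailed scope is a superset that also shows Normal
    events, matching the guide's advice to widen the filter for more
    information."""
    if scope == DETAILED:
        return sorted(EVENT_SCOPES)
    return sorted(i for i, s in EVENT_SCOPES.items() if s == NORMAL)
```

Under this assumption, the Normal view shows only IDs 1000 and 3005 from the sample, while the Detailed view shows all four.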
Table E–2. Detailed Events

Event ID | Topic | Level | Description | Trigger
1002 | Management | Info | User logged out. (User ) | The user logged out of the system.
1010 | Management | Warning | Grace period expires in 1 day. You must install an activation code to activate your Unisys SafeGuard solution license. | The grace period expires in 1 day.
1012 | Management | Warning | License expires in 1 day. You must obtain a new Unisys SafeGuard 30m solution license. | The Unisys SafeGuard 30m solution license expires in 1 day.
1013 | Management | Error | License expired. You must obtain a new Unisys SafeGuard 30m solution license. | The Unisys SafeGuard 30m solution license expired.
2000 | Site | Info | Site management running on . | Site control is open; the RA has become the cluster leader.
3000 | RA | Info | RA has become a cluster member. (RA ) | The RA is connected to site control.
3002 | RA | Warning | Site management switched over to this RA. (RA , Reason ) | Leadership is transferred from one RA to another RA.
3007 | RA | Warning Off | RA is up. (RA ) | The RA that was previously down came up.
3008 | RA | Warning | RA appears to be down. (RA ) | An RA suspects that the other RA is down.
3011 | RA | Info | RA access to a volume or volumes restored. (RA , Volume , Volume Type ) | Volumes that were inaccessible became accessible.
3012 | RA | Warning | RA unable to access a volume or volumes. (RA , Volume , Volume Type ) | Volumes ceased to be accessible to the RA.
3013 | RA | Warning Off | RA access to restored. (RA , Volume ) | The repository volume that was inaccessible became accessible.
3014 | RA | Warning | RA unable to access . (RA , Volume ) | The repository volume became inaccessible to a single RA.
3020 | RA | Warning Off | WAN connection to an RA at other site is restored. (RA at other site: ) | The RA regained the WAN connection to an RA at the other site.
3021 | RA | Warning | Error in WAN connection to an RA at other site. (RA at other site: ) | The RA lost the WAN connection to an RA at the other site.
3022 | RA | Warning Off | LAN connection to RA restored. (RA ) | The RA regained the LAN connection to an RA at the local site.
3023 | RA | Warning | Error in LAN connection to an RA. (RA ) | The RA lost the LAN connection to an RA at the local site, without losing the connection through the repository volume.
4000 | Group | Info | Group capabilities OK. (Group ) | Capabilities are full and previous capabilities are unknown.
4001 | Group | Warning | Group capabilities minor problem. (Group ) | Capabilities are either temporarily not full on the RA on which the group is currently running, or indefinitely not full on the RA on which the group is not running.
4003 | Group | Error | Group capabilities problem. (Group ) | Capabilities are indefinitely not full on the RA on which the group is running.
4007 | Group | Info | Pausing data transfer. (Group , Reason: ) | The user stopped the transfer.
4008 | Group | Warning | Pausing data transfer. (Group , Reason: ) | The system temporarily stopped the transfer.
4009 | Group | Error | Pausing data transfer. (Group , Reason: ) | The system stopped the transfer indefinitely.
4010 | Group | Info | Starting data transfer. (Group ) | The user requested a start transfer.
4015 | Group | Info | Transferring latest snapshot before pausing transfer (no data loss). (Group ) | In a total storage disaster, the system flushed the buffer before stopping replication.
4016 | Group | Warning | Transferring latest snapshot before pausing transfer (no data loss). (Group ) | In a total storage disaster, the system flushed the buffer before stopping replication.
4017 | Group | Error | Transferring latest snapshot before pausing transfer (no data loss). (Group ) | In a total storage disaster, the system flushed the buffer before stopping replication.
4018 | Group | Warning | Transfer of latest snapshot from source is complete (no data loss). (Group ) | In a total storage disaster, the last snapshot from the source site is available at the target site.
4019 | Group | Warning | Group in high load; transfer is to be paused temporarily. (Group ) | The disk manager has a high load.<br />
4020 | Group | Warning Off | Group is no longer in high load. (Group ) | The disk manager no longer has a high load.<br />
4021 | Group | Error | Journal full—initialization paused. To complete initialization, enlarge the journal or allow long resynchronization. (Group ) | In initialization, the journal is full and a long resynchronization is not allowed.<br />
4022 | Group | Error Off | Initialization resumed. (Group ) | End of an initialization situation in which the journal is full and a long resynchronization was not allowed.<br />
4023 | Group | Error | Journal full—transfer paused. To restart the transfer, first disable access to image. (Group ) | Access to the image is enabled and the journal is full.<br />
4024 | Group | Error Off | Transfer restarted. (Group ) | End of a situation in which access to the image is enabled and the journal is full.<br />
4025 | Group | Warning | Group in high load—initialization to be restarted. (Group ) | The group has a high load; initialization is to be restarted.<br />
4026 | Group | Warning Off | Group no longer in high load. (Group ) | The group no longer has a high load.<br />
4027 | Group | Error | Group in high load—the journal is full. The roll to physical image is paused, and transfer is paused. (Group ) | No space remains to which to write during roll.<br />
4028 | Group | Error Off | Group no longer in high load. (Group ) | Journal capacity was added, or image access was disabled.<br />
4040 | Group | Error | Journal error—full sweep to be performed. (Group ) | A journal volume error occurred.<br />
4041 | Group | Info | Group activated. (Group , RA ) | The group is replication-ready; that is, replication could take place if other factors are acceptable, such as RAs, network, and storage access.<br />
4042 | Group | Info | Group deactivated. (Group , RA ) | A user action deactivated the group.<br />
4043 | Group | Warning | Group deactivated. (Group , RA ) | The system temporarily deactivated the group.<br />
4044 | Group | Error | Group deactivated. (Group , RA ) | The system deactivated the group indefinitely.<br />
4051 | Group | Info | Disabling access to image—resuming distribution. (Group ) | The user disabled access to an image (that is, distribution is resumed).<br />
4054 | Group | Error | Enabling access to image. (Group ) | The system enabled access to an image indefinitely.<br />
4057 | Group | Warning | Specified image was removed from the journal. Try a later image. (Group ) | The specified image was removed from the journal (that is, FIFO).<br />
4062 | Group | Info | Access enabled to latest image. (Group , Failover site ) | Access was enabled to the latest image during automatic failover.<br />
4063 | Group | Warning | Access enabled to latest image. (Group , Failover site ) | Access was enabled to the latest image during automatic failover.<br />
4064 | Group | Error | Access enabled to latest image. (Group , Failover site ) | Access was enabled to the latest image during automatic failover.<br />
4080 | Group | Warning | Current lag exceeds maximum lag. (Group , Lag , Maximum lag ) | The group lag exceeds the maximum lag (when not regulating an application).<br />
4081 | Group | Warning Off | Current lag within policy. (Group , Lag , Maximum lag ) | The group lag drops from above the maximum lag to below 90 percent of the maximum.<br />
4082 | Group | Warning | Starting full sweep. (Group ) | Group markers were set.<br />
4083 | Group | Warning | Starting volume sweep. (Group , Pair ) | Volume markers were set.<br />
4084 | Group | Info | Markers cleared. (Group ) | Group markers were cleared.<br />
4085 | Group | Warning | Unable to clear markers. (Group ) | An attempt to clear the group markers failed.<br />
4086 | Group | Info | Initialization started. (Group ) | Initialization started.<br />
4087 | Group | Info | Initialization completed. (Group ) | Initialization completed.<br />
4091 | Group | Error | Target-side log is full; write operations by the hosts at the target side are disabled. (Group , Site ) | The target-side log is full.<br />
4095 | Group | Info | Writing target-side log to storage; writes to log cannot be undone. (Group ) | Started marking to retain write operations in the target-side log.<br />
4097 | Group | Warning | Maximum journal lag exceeded. Distribution in fast-forward—older images removed from journal. (Group ) | Fast-forward action started (causing a loss of snapshots taken before, as the maximum journal lag was exceeded).<br />
4098 | Group | Warning Off | Maximum journal lag within limit. Distribution normal—rollback information retained. (Group ) | Five minutes have passed since the fast-forward action stopped.<br />
4099 | Group | Info | Initializing in long resynchronization mode. (Group ) | The system started a long resynchronization.<br />
4110 | Group | Info | Enabling virtual access to image. (Group ) | The user initiated enabling virtual access to an image.<br />
4111 | Group | Info | Virtual access to image enabled. (Group ) | The user enabled virtual access to an image.<br />
4112 | Group | Info | Rolling to physical image. (Group ) | Rolling to the image (in the background) while virtual access to the image is enabled.<br />
4113 | Group | Info | Roll to physical image stopped. (Group ) | Rolling to the image (in the background, while virtual access to the image is enabled) is stopped.<br />
4114 | Group | Info | Roll to physical image complete—logged access to physical image is now enabled. (Group ) | The system completed the roll to the physical image.<br />
4115 | Group | Error | Unable to enable access to virtual image because of partition table error. (The partition table on at least one of the volumes in group has been modified since logged access was last enabled to a physical image. To enable access to a virtual image, first enable logged access to a physical image.) | An attempt to pause on a virtual image is unsuccessful because of a change in the partition table of a volume or volumes in the group.<br />
4116 | Group | Error | Virtual access buffer is full—writing by hosts at the target side is disabled. (Group ) | An attempt to write to the virtual image is unsuccessful because the virtual access buffer usage is 100 percent.<br />
4118 | Group | Error | Cannot enable virtual access to an image. (Group ) | An attempt to enable virtual access to the image is unsuccessful because of insufficient memory.<br />
4119 | Group | Error | Initiator issued an out-of-bounds I/O operation. Contact technical support. (Initiator , Group , Volume ) | A configuration problem exists.<br />
4120 | Group | Warning | Journal usage (with logged access enabled) now exceeds this threshold. (Group , ) | Journal usage (with logged access enabled) has passed a specified threshold.<br />
4121 | Group | Error | Unable to gain permissions to write to replica. | RAs are unable to write to replication or journal volumes because they do not have proper permissions.<br />
4122 | Group | — | Trying to regain permissions to write to replica. | The user has indicated that the permissions problem has been corrected.<br />
4123 | Group | Error | Unable to access volumes—bad sectors encountered. | RAs are unable to write to replication or journal volumes due to bad sectors on the storage.<br />
4124 | Group | Error Off | Trying to access volumes that previously had bad sectors. | The user has indicated that the bad sectors problem has been corrected.<br />
5000 | Splitter | Info | Splitter or splitters are attached to a volume. (Splitter , Volume ) | The user attached a splitter to a volume.<br />
5001 | Splitter | Info | Splitter or splitters are detached from a volume. (Splitter , Volume ) | The user detached a splitter from a volume.<br />
5002 | Splitter | Error | RA is unable to access splitter. (Splitter , RA ) | The RA is unable to access a splitter.<br />
5003 | Splitter | Error Off | RA access to splitter is restored. (Splitter , RA ) | The RA can access a splitter that was previously inaccessible.<br />
5004 | Splitter | Error | Splitter is unable to access a replication volume or volumes. (Splitter , Volume ) | The splitter cannot access a volume.<br />
5005 | Splitter | Error Off | Splitter access to replication volume or volumes is restored. (Splitter , Volume ) | The splitter can access a volume that was previously inaccessible.<br />
5006 | OBSOLETE<br />
5007 | OBSOLETE<br />
5013 | Splitter | Error | Splitter is down. (Splitter ) | Connection to the splitter was lost with no warning; the splitter crashed or the connection is down.<br />
5015 | Splitter | Error Off | Splitter is up. (Splitter ) | Connection to the splitter was regained after a splitter crash.<br />
5016 | Splitter | Warning | Splitter has restarted. (Splitter ) | The boot timestamp of the splitter has changed.<br />
5030 | Splitter | Error | Splitter write failed. (Splitter , Group ) | The splitter write operation to the RA was successful; the write operation to the storage device was not successful.<br />
5031 | Splitter | Warning | Splitter is not splitting to replication volumes; volume sweeps are required. (Host , Volumes , Groups ) | The splitter is not splitting to the replication volumes.<br />
5032 | Splitter | Info | Splitter is splitting to replication volumes. (Host , Volumes , Groups ) | The splitter started splitting to the replication volumes.<br />
5035 | Splitter | Info | Writes to replication volumes are disabled. (Splitter , Volumes , Groups ) | Write operations to the replication volumes are disabled.<br />
5036 | Splitter | Warning | Writes to replication volumes are disabled. (Host , Volumes , Groups ) | Write operations to the replication volumes are disabled.<br />
5037 | Splitter | Error | Writes to replication volumes are disabled. (Splitter , Volumes , Groups ) | Write operations to the replication volumes are disabled.<br />
5038 | Splitter | Info | Splitter delaying writes. (Splitter , Volumes , Groups ) | —<br />
5039 | Splitter | Warning | Splitter delaying writes. (Splitter , Volumes , Groups ) | —<br />
5040 | Splitter | Error | Splitter delaying writes. (Splitter , Volumes , Groups ) | —<br />
5041 | Splitter | Info | Splitter is not splitting to replication volumes. (Splitter , Volumes , Groups ) | The splitter is not splitting to the replication volumes because of a user decision.<br />
5042 | Splitter | Warning | Splitter is not splitting to replication volumes. (Splitter , Volumes , Groups ) | The splitter is not splitting to the replication volumes.<br />
5043 | Splitter | Error | Splitter not splitting to replication volumes. (Splitter , Volumes , Groups ) | The splitter is not splitting to the replication volumes because of a system action.<br />
5045 | Splitter | Warning | Simultaneous problems reported in splitter and RA. Full-sweep resynchronization is required after restarting data transfer. | The marking backlog on the splitter was lost as a result of concurrent disasters to the splitter and the RA.<br />
5046 | Splitter | Warning | Transient error—reissuing splitter write. | —<br />
Appendix F<br />
Configuring and Using SNMP Traps<br />
The RA in the Unisys <strong>SafeGuard</strong> 30m solution is SNMP capable—that is, the solution<br />
supports monitoring and problem notification using the standard Simple Network<br />
Management Protocol (SNMP), including support for SNMPv3. The solution supports<br />
various SNMP queries to the agent and can be configured so that events generate<br />
SNMP traps, which are sent to designated servers.<br />
Software Monitoring<br />
To configure SNMP traps for monitoring, see the Unisys <strong>SafeGuard</strong> 30m Solution<br />
Planning and Installation <strong>Guide</strong>.<br />
You cannot query the RA software management information base (MIB); however, the RA SNMP agent includes MIB-II support, so you can query the MIB-II. Also see “Hardware<br />
Monitoring.” For more information on MIB-II, see the document at<br />
http://www.faqs.org/rfcs/rfc1213.html<br />
All of the management console log events listed in Appendix E generate SNMP traps,<br />
depending on the minimum severity set in the trap configuration.<br />
The Unisys MIB OID is 1.3.6.1.4.1.21658.<br />
The trap identifiers for Unisys traps are as follows:<br />
1: Info<br />
2: Warning<br />
3: Error<br />
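As a rough sketch (the OID suffix layout here is an assumption for illustration, not taken from the Unisys MIB), the enterprise OID and the three trap identifiers above can be combined like this:<br />

```python
# Sketch: building full trap OIDs from the Unisys enterprise OID and
# the trap identifiers listed above (1: Info, 2: Warning, 3: Error).
# The ".<trap-id>" suffix layout is an illustrative assumption.

UNISYS_ENTERPRISE_OID = "1.3.6.1.4.1.21658"

TRAP_IDS = {"Info": 1, "Warning": 2, "Error": 3}

def trap_oid(severity: str) -> str:
    """Return the full OID for a trap of the given severity."""
    if severity not in TRAP_IDS:
        raise ValueError(f"unknown severity: {severity!r}")
    return f"{UNISYS_ENTERPRISE_OID}.{TRAP_IDS[severity]}"

print(trap_oid("Error"))  # 1.3.6.1.4.1.21658.3
```

An SNMP browser would show these OIDs as the trap identity under the Unisys enterprise subtree.<br />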
The Unisys trap variables and their possible values are defined in Table F–1.<br />
Table F–1. Trap Variables and Values<br />
Variable | OID | Description | Value<br />
dateAndTime | 3.1.1.1 | Date and time that the trap was sent | —<br />
eventID | 3.1.1.2 | Unique event identifier (See values in “List of Events” in Appendix E.) | —<br />
siteName | 3.1.1.3 | Name of site where event occurred | —<br />
eventLevel | 3.1.1.4 | See values | 1: info; 2: warning; 3: warning off; 4: error; 5: error off<br />
eventTopic | 3.1.1.5 | See values | 1: site; 2: K-Box; 3: group; 4: splitter; 5: management<br />
hostName | 3.1.1.6 | Name of host | —<br />
kboxName | 3.1.1.7 | Name of RA | —<br />
volumeName | 3.1.1.8 | Name of volume | —<br />
groupName | 3.1.1.9 | Name of group | —<br />
eventSummary | 3.1.1.10 | Short description of event | —<br />
eventDescription | 3.1.1.11 | More detailed description of event | —<br />
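A trap receiver can map the variable OIDs from Table F–1 back to named fields. The sketch below assumes the varbinds arrive as (OID string, value) pairs carrying the enterprise prefix; that input shape is an assumption for illustration, not a documented interface:<br />

```python
# Sketch: decoding trap varbinds into the named fields of Table F-1.
# The (oid, value) pair input format is an illustrative assumption.

ENTERPRISE = "1.3.6.1.4.1.21658"

# Variable OID suffixes as listed in Table F-1.
VARIABLES = {
    "3.1.1.1": "dateAndTime",
    "3.1.1.2": "eventID",
    "3.1.1.3": "siteName",
    "3.1.1.4": "eventLevel",
    "3.1.1.5": "eventTopic",
    "3.1.1.6": "hostName",
    "3.1.1.7": "kboxName",
    "3.1.1.8": "volumeName",
    "3.1.1.9": "groupName",
    "3.1.1.10": "eventSummary",
    "3.1.1.11": "eventDescription",
}

def decode_varbinds(varbinds):
    """Map (oid, value) pairs onto the variable names in Table F-1."""
    decoded = {}
    prefix = ENTERPRISE + "."
    for oid, value in varbinds:
        if oid.startswith(prefix):
            name = VARIABLES.get(oid[len(prefix):])
            if name:
                decoded[name] = value
    return decoded

sample = [
    ("1.3.6.1.4.1.21658.3.1.1.2", "4003"),
    ("1.3.6.1.4.1.21658.3.1.1.4", "4"),  # 4: error
]
print(decode_varbinds(sample))
```

OIDs outside the Unisys enterprise subtree are simply ignored, which keeps the decoder safe to run against mixed trap traffic.<br />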
Configuring and Using SNMP Traps<br />
SNMP Monitoring and Trap Configuration<br />
To configure SNMP traps, see the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Planning and Installation<br />
<strong>Guide</strong>.<br />
On the management console, use the SNMP Settings menu (in the System menu) to<br />
manage the SNMP capabilities. Through that menu, you can enable and disable the<br />
agent or the SNMP traps feature, modify the configuration for SNMP traps, and add or<br />
remove SNMP users.<br />
In addition, the RA provides several CLI commands for SNMP, as follows:<br />
• The enable_snmp command to enable the SNMP agent<br />
• The disable_snmp command to disable the SNMP agent<br />
• The set_snmp_community command to define a community of users (for SNMPv1)<br />
• The add_snmp_user command to add SNMP users (for SNMPv3)<br />
• The remove_snmp_user command to remove SNMP users (for SNMPv3)<br />
• The get_snmp_settings command to display whether the agent is currently set to be<br />
enabled, the current configuration for SNMP traps, and the list of registered SNMP<br />
users<br />
• The config_snmp_traps command to configure the SNMP traps feature so that<br />
events generate traps. Before you enable the feature, you must designate the IP<br />
address or DNS name for a host at one or more sites to receive the SNMP traps.<br />
Note: You can designate a DNS name for a host only in installations for which a<br />
DNS has been configured.<br />
• The test_snmp_trap command to send a test SNMP trap<br />
When the SNMP agent is enabled, SNMP users can submit queries to retrieve various<br />
types of information about the RA.<br />
You can also designate the minimum severity for which an event should generate an<br />
SNMP trap (that is, info, warning, or error in order from less severe to more severe with<br />
error as the initial default). Once the SNMP traps feature is enabled, the system sends<br />
an SNMP trap to the designated host whenever an event of sufficient severity occurs.<br />
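The minimum-severity rule described above can be expressed as a simple ordered comparison; this sketch only illustrates the filtering logic, not the RA's actual implementation:<br />

```python
# Sketch: minimum-severity filtering for SNMP traps, as described
# above (info < warning < error; error is the initial default).

SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2}

def should_send_trap(event_level: str, minimum: str = "error") -> bool:
    """Return True if an event of event_level should generate a trap."""
    return SEVERITY_ORDER[event_level] >= SEVERITY_ORDER[minimum]

print(should_send_trap("warning"))          # False with the default "error"
print(should_send_trap("warning", "info"))  # True
```

With the default threshold of error, only error-level events generate traps; lowering the threshold to info makes every event generate one.<br />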
Installing MIB Files on an SNMP Browser<br />
Install the RA MIB file (\MIBS\mib.txt on the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Splitter Install<br />
Disk CD-ROM) on an SNMP browser. Follow the instructions for your browser to load<br />
the MIB file.<br />
Resolving SNMP Issues<br />
For SNMP issues, first determine whether the issue is an SNMP trap or an SNMP<br />
monitoring issue by performing the procedure for verifying SNMP traps in the Unisys<br />
<strong>SafeGuard</strong> <strong>Solutions</strong> Planning and Installation <strong>Guide</strong>.<br />
If you do not receive traps, perform the steps in “Monitoring Issues” and then in “Trap<br />
Issues.”<br />
Monitoring Issues<br />
1. Ping the RA management IP address from the management server that has the<br />
SNMP browser.<br />
2. Ensure that the community name used in the RA configuration matches the one<br />
configured on the management server running the SNMP browser (SNMP versions 1<br />
and 2). Use public as the community name.<br />
3. Ensure that the user and password used in the RA configuration match those on the<br />
management server running the SNMP browser (SNMP version 3).<br />
Trap Issues<br />
1. Ensure that the trap destination is on the same network as the management<br />
network and that a firewall has not blocked SNMP traffic.<br />
2. Ensure that the same version of SNMP is configured in the management software<br />
that receives traps.<br />
Appendix G<br />
Using the Unisys <strong>SafeGuard</strong> 30m<br />
Collector<br />
The Unisys <strong>SafeGuard</strong> 30m Collector utility enables you to easily collect information<br />
about the environment so that you can solve problems. An enterprise solution requires<br />
many logs, and gathering the log information can be time-intensive. Often the person<br />
who collects the information is not familiar with all the interfaces to the hardware. The<br />
Collector solves these problems. An experienced installer configures log collection one<br />
time, and then other personnel can use a “one-button” approach to log collection.<br />
You can use this utility to create custom scripts to complete tasks tailored to your<br />
environment. You choose which CLI commands to include in the custom scripts so that<br />
you build the capabilities you need. Refer to the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Introduction<br />
to Replication Appliance Command Line Interface (CLI) for more information about CLI<br />
commands.<br />
The Collector gathers configuration information from RAs, storage subsystems, and<br />
switches. No information is collected from the servers in the environment.<br />
Installing the <strong>SafeGuard</strong> 30m Collector<br />
This utility offers two modes: Collector and View. You determine the available modes<br />
when you install the program. If you install the Collector and specify Collector mode,<br />
both modes are enabled. If you install the Collector and specify View mode, the Collector<br />
mode functions are disabled. The View mode is primarily used by support personnel at<br />
the Unisys <strong>Support</strong> Center.<br />
If you are installing the Collector at a customer installation, be sure to install the utility on<br />
PCs at both sites.<br />
The utility requires .NET Framework 2.0 and J# redistributable, which are on the Unisys<br />
<strong>SafeGuard</strong> 30m Solution Control Install Disk CD-ROM in the Redistributable folder.<br />
The directories under this folder are dotNet Framework 2.0 and JSharp.<br />
Notes:<br />
• The readme file on that CD-ROM contains the same information as this appendix.<br />
• If you installed a previous version of the Collector, uninstall this utility and remove<br />
the folder and all of the files in the folder before you begin this installation.<br />
Perform the following steps to install the Collector:<br />
1. Insert the CD-ROM in the CD/DVD drive, and start the file Unisys <strong>SafeGuard</strong> 30m<br />
Collector.msi.<br />
2. On the Installation Wizard welcome screen, click Next.<br />
3. On the Customer Information screen, type the user name and organization, and<br />
click Next.<br />
4. On the Destination Folder screen, select a destination folder and click Next.<br />
Note: If you are using the Windows Vista operating system, install the Collector<br />
into a separate directory named C:\Unisys\30m\Collector.<br />
5. On the Select Options: screen, select Collector mode –install at site or<br />
select View mode –install at support center, and then click Next.<br />
6. On the Ready to Install the Program screen, click Install.<br />
The Installation wizard begins installing the files, and the Installing Unisys<br />
<strong>SafeGuard</strong> 30m Collector screen is displayed to indicate the status of the<br />
installation.<br />
After the files are installed, the Installation Wizard Completed screen is<br />
displayed.<br />
7. Click Finish.<br />
Before You Begin the Configuration<br />
Before you begin configuring the Collector, be sure you have the following information:<br />
• IP addresses<br />
− SAN switches<br />
− Network switches<br />
− RA site management<br />
• Log-in names<br />
− SAN switches<br />
− Network switches<br />
− RA (for custom scripts only)<br />
• Passwords<br />
− SAN switches<br />
− Network switches<br />
− RA (for custom scripts only)<br />
• EMC Navisphere CLI<br />
− Storage<br />
• Autologon configuration<br />
− SAN switches (Consult your SAN switch documentation for the autologon<br />
configuration.)<br />
If you are using a Cisco SAN switch, enable the SSH server before you begin the<br />
configuration. See “Configuring RA, Storage, and SAN Switch Component Types Using<br />
Built-Ins” in this appendix.<br />
Handling the Security Breach Warning<br />
If you previously installed the Collector and have uninstalled the utility and all the files,<br />
when you begin configuring RAs or adding RAs, you might get this message:<br />
WARNING – POTENTIAL SECURITY BREACH!<br />
If you receive this message, complete these steps:<br />
1. Delete the IP address for the RA.<br />
2. Use the following plink command:<br />
C:\>plink -l admin -pw admin get_version<br />
Messages about the host key and a new key are displayed.<br />
3. Type Y in response to the message “Update cached key?”<br />
Once you have updated the cached key, complete the steps in “Configuring RAs” to<br />
discover the IP addresses for the RAs.<br />
Using Collector Mode<br />
Installing the utility in Collector mode enables all the capabilities to gather log information<br />
using scripts and also enables View mode.<br />
Getting Started<br />
To access the Collector, follow these steps:<br />
1. On the Start menu, point to Programs, then click Unisys, then click <strong>SafeGuard</strong><br />
30m Collector; and click <strong>SafeGuard</strong> 30m Collector.<br />
2. Select the Components.ssc file on the Open Unisys <strong>SafeGuard</strong> 30m Collector<br />
File dialog box.<br />
The Unisys <strong>SafeGuard</strong> 30m Collector program window is displayed with two panes<br />
open.<br />
Configuring RAs<br />
To collect data, specify the site management IP address of either of the RA clusters for a<br />
site. The “built-in” scripts are a preconfigured set of CLI commands that facilitate easy<br />
data collection.<br />
The other site management IP address is automatically discovered when you specify<br />
either of the RA site management addresses.<br />
To configure the RA, perform these steps:<br />
1. Start the Collector.<br />
2. If needed, expand the Components tree in the left pane.<br />
3. Select BI Built-In (under RA), right-click, and click Copy Built-In (Discover RA).<br />
4. On the Script dialog box, type the RA site management IP address in the IP<br />
Address field and click Save.<br />
If you have multiple <strong>SafeGuard</strong> solutions, repeat steps 3 and 4 for each set of RA<br />
clusters.<br />
After you enter the IP address, the Collector window is updated with the folder of each<br />
site management IP address appearing below the RA folder. Each IP folder contains the<br />
built-in scripts that are enabled.<br />
The following sample window shows the IP address folders listed in the left pane. In this<br />
figure, two <strong>SafeGuard</strong> solutions are configured—the set of IP addresses (172.16.17.50<br />
and 172.16.17.60) for the two RA clusters in solution 1 and the IP address 172.16.7.50<br />
for the continuous data protection (CDP) solution, which always has only one RA cluster.<br />
Adding Customer Information<br />
Add information about the Unisys service representative, customer, and architect so that<br />
the Unisys <strong>Support</strong> Center can contact the site easily. To add the information, perform<br />
the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
1. On the File menu, click Properties.<br />
2. On the Properties dialog box, select the appropriate tab: Customer, Architect,<br />
or CIR.<br />
3. Type in the information for each field on each tab. (For instance, type text in the<br />
Name, Office, Mobile, E-mail, and Additional Info fields for the CIR tab.)<br />
The Architect tab provides an Installed Date field. Use the Additional Info field for any<br />
other information that the Unisys <strong>Support</strong> Center might need, such as a support<br />
request number.<br />
4. Click OK.<br />
Running All Scripts<br />
To collect data from all enabled scripts in a <strong>SafeGuard</strong> <strong>Solutions</strong> Components (SSC) file,<br />
perform these steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
1. Select Components.<br />
2. Right-click, and click Run, or click the Run button.<br />
Note: The status bar shows the progress of script executions and the amount of data<br />
collected.<br />
Compressing an SSC File to Send to the <strong>Support</strong> Center<br />
Once you run the utility to collect information, you can compress the SSC file to send to<br />
the Unisys <strong>Support</strong> Center.<br />
Note: A Collector components file has the .ssc suffix. Once an SSC file is compressed,<br />
the corresponding <strong>SafeGuard</strong> <strong>Solutions</strong> Data (SSD) file has the .ssd suffix.<br />
On the Unisys <strong>SafeGuard</strong> 30m Collector program window, follow these steps to<br />
compress an SSC file:<br />
1. Click Compress SSC on the File menu.<br />
Once the file is compressed, the file name and path are displayed at the top in the<br />
right pane of the window. The data is exported to the file named Components.ssd in<br />
the directory C:\Program Files\Unisys\30m\Collector\Data.<br />
Note: For the Windows Vista operating system, the SSD file resides in the<br />
directory where the Collector is installed. A typical location for this file is<br />
C:\Unisys\30m\Collector\Components.ssd.<br />
2. Send the SSD file to the Unisys <strong>Support</strong> Center at<br />
Safeguard30msupport@unisys.com.<br />
Duplicating the Installation on Another PC<br />
To duplicate the installation of the Collector at a different PC (for example, on the second<br />
site), perform these steps:<br />
1. Copy the SSD file from the PC with the installed Collector to the second PC, placing<br />
it in the C:\Program Files\Unisys\30m\Collector\Data directory.<br />
2. Start the Collector.<br />
3. Click Cancel on the Open Unisys <strong>SafeGuard</strong> 30m Collector File dialog box.<br />
The Unisys <strong>SafeGuard</strong> 30m Collector program window is displayed.<br />
Note: Once an SSD file is extracted, you can select the .ssc file.<br />
4. On the File menu, select Uncompress SSD.<br />
5. On the Open <strong>SafeGuard</strong> 30m Data File dialog box, select from the list of<br />
available files the SSD file that you wish to uncompress.<br />
If a message appears asking about overwriting the SSC file, click Yes.<br />
6. Ensure that all scripts run from this PC by selecting each component type and<br />
running the scripts for each component.<br />
Understanding Operations in Collector Mode<br />
The Components.ssc file contains the configuration information. If you make changes to<br />
the Components.ssc file—such as adding, deleting, editing, enabling, and disabling<br />
scripts—these changes are automatically saved. You can also make these changes to a<br />
saved SSC file except that you cannot delete scripts from a saved SSC file. You must<br />
open the Components.ssc file to delete scripts.<br />
Understanding and Saving SSC Files<br />
Because you can enable and disable scripts in any SSC file, you can create saved SSC<br />
files for specific uses. If you want to run a subset of the available scripts, save the<br />
Components.ssc file as a new SSC file with a unique name. You can then enable or<br />
disable scripts in the saved SSC file. The saved SSC file is always updated from the<br />
Components.ssc file for information such as the available scripts and the details within<br />
each script. In addition, all changes that are made to any SSC file are updated in the<br />
Components.ssc file. Only scripts that were enabled in the saved SSC file are enabled<br />
when updated from a Components.ssc file.<br />
For example, you could save an SSC file with all RAs except one disabled. You might<br />
name it “radisabled.ssc”. If you have the radisabled.ssc file open and add a new script to<br />
it, the script is automatically added to the Components.ssc file.<br />
Whenever the Components.ssc file is updated with a new script, that script is<br />
automatically added to any saved SSC files.<br />
If you add a new RA to the configuration, the Components.ssc file and any existing<br />
saved SSC files are updated with the component, and its scripts are disabled.<br />
If you make deletions to the Components.ssc file, the deletions are automatically<br />
removed from any saved SSC files.<br />
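The propagation rules above can be summarized in a small model. The sketch below is illustrative only: the real Components.ssc file format is not documented here, and all class and method names are invented for the example.

```python
# Illustrative model of the SSC propagation rules (hypothetical names; the
# actual Components.ssc file format is not shown in this guide).

class SscFile:
    def __init__(self, name):
        self.name = name
        self.enabled = {}  # script name -> enabled flag

class SscCatalog:
    """Keeps Components.ssc and all saved SSC files in sync."""

    def __init__(self):
        self.components = SscFile("Components.ssc")
        self.saved = []

    def save_as(self, name):
        # Saving creates a copy whose enabled flags can then diverge.
        copy = SscFile(name)
        copy.enabled = dict(self.components.enabled)
        self.saved.append(copy)
        return copy

    def add_script(self, script):
        # A script added to any SSC file also appears in Components.ssc and
        # in every saved file; in saved files it starts out disabled.
        self.components.enabled[script] = True
        for f in self.saved:
            f.enabled.setdefault(script, False)

    def delete_script(self, script):
        # Deletions to Components.ssc are removed from all saved files.
        self.components.enabled.pop(script, None)
        for f in self.saved:
            f.enabled.pop(script, None)
```

For example, after saving radisabled.ssc and then adding a script to any open SSC file, the new script is present but disabled in radisabled.ssc, matching the behavior described above.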
Sample Scenario<br />
If you want to collect data at one site only or if you want to view the data from one site,<br />
you can create a new saved SSC file for each site. Follow these steps to create the<br />
saved SSC files.<br />
1. Add any desired scripts to the Components.ssc file.<br />
2. Open an SSC file.<br />
3. Click Save As on the File menu, and enter a unique name for the file.<br />
4. Enable and disable scripts as desired.<br />
For example, you might disable one site. To do so, follow these steps:<br />
a. Select the IP address of a component (for example, the Site 1 RA cluster<br />
management IP).<br />
b. Right-click and click Disable.<br />
Repeat steps 2 through 4 to create additional customized files.<br />
Opening an SSC File<br />
On the Unisys <strong>SafeGuard</strong> 30m Collector program window, perform the following steps<br />
to open an SSC file:<br />
1. Click Open on the File menu.<br />
2. Select an SSC file and click Open.<br />
Configuring RA, Storage, and SAN Switch Component Types Using<br />
Built-In Scripts<br />
The built-in scripts are preconfigured; they contain CLI commands for RAs, navicli<br />
commands for Clariion storage, and CLI commands for switches that facilitate easy data<br />
collection. It takes about 4 minutes for the built-in scripts for one RA to run and about 2<br />
minutes for the built-in scripts for a SAN switch to run.<br />
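Those per-component figures make it easy to estimate a full collection run. A quick sketch; the four- and two-minute values are the approximations quoted above, and actual times vary:

```python
# Rough run-time estimate for the built-in scripts, using the approximate
# per-component figures quoted above (about 4 minutes per RA, about
# 2 minutes per SAN switch).

RA_MINUTES = 4
SAN_SWITCH_MINUTES = 2

def estimated_collection_minutes(num_ras, num_san_switches):
    return num_ras * RA_MINUTES + num_san_switches * SAN_SWITCH_MINUTES
```

For example, four RAs plus two SAN switches would need roughly estimated_collection_minutes(4, 2), or about 20 minutes.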
After you configure built-in scripts, the left pane is updated with the IP addresses below<br />
the component type. Each IP folder contains the built-in scripts that are enabled.<br />
See the previous sample window with the IP address folders listed in the left pane. In<br />
that figure, two <strong>SafeGuard</strong> solutions are configured—the set of IP addresses<br />
(172.16.17.50 and 172.16.17.60) for the two RA clusters and the IP address 172.16.7.50<br />
for the continuous data protection (CDP) setup, which always has only one RA cluster.<br />
On the Unisys <strong>SafeGuard</strong> 30m Collector program window, follow these steps to use<br />
built-in scripts to configure RA, Storage, and SAN Switch component types:<br />
1. Expand a component type—RA, Storage, or SAN Switch—and select BI Built-In.<br />
2. Right-click and click Copy Built-In.<br />
3. On the Script dialog box, complete the available fields and click Save.<br />
Note: You can select one script instead of all scripts by selecting a script name instead<br />
of selecting BI Built-In.<br />
For the RA Component Type<br />
To collect data, specify the site management IP address of either of the RA clusters for a<br />
site. The other site management IP address is automatically discovered when you<br />
specify either of the RA site management addresses.<br />
If you have multiple <strong>SafeGuard</strong> solutions, repeat the three previous steps for each set of<br />
RA clusters.<br />
For the Storage Component Type<br />
Clariion is the only storage component with built-in scripts available.<br />
For the SAN Switch Component Type<br />
Before configuring a Cisco SAN switch, enter config mode on the switch and type ssh<br />
server enable. To determine the state of the SSH server, type show ssh server<br />
when not in config mode. Refer to the Cisco MDS 9020 Fabric Switch Configuration<br />
<strong>Guide</strong> and Command Reference for more information about switch commands.<br />
If you run the tech-support command under SAN Switch from the Collector, the data<br />
capture might take a long time. You can follow the progress in the status bar of the<br />
window.<br />
If you run commands for a Brocade switch and receive the following message, the<br />
Brocade switch is downlevel and does not support the SSH protocol:<br />
rbash: switchShow: command not found<br />
Upgrade the switch software to a later version that supports the SSH protocol.<br />
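One quick way to confirm whether a switch accepts SSH at all, before running Collector scripts against it, is to read the server's identification string: per RFC 4253, an SSH server announces itself with a line beginning "SSH-". The probe below is a generic sketch, not a Collector feature; the host name and port are whatever applies to your switch.

```python
import socket

def read_ssh_banner(host, port=22, timeout=5.0):
    """Return the server's identification string, or None if unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return sock.recv(256).decode("ascii", errors="replace").strip()
    except OSError:
        return None

def supports_ssh(banner):
    # Per RFC 4253, an SSH server identifies itself as "SSH-<version>-...".
    return banner is not None and banner.startswith("SSH-")
```

A downlevel switch that does not speak SSH returns no such banner, so supports_ssh reports False.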
Enabling Scripts<br />
You can interactively enable all the scripts in any SSC file, the scripts for one component<br />
in the SSC file, or a single script. To enable a disabled script, you must open the SSC file<br />
containing the script. Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m<br />
Collector program window.<br />
Enable All Scripts<br />
1. Select Components.<br />
2. Right-click and click Enable.<br />
Enabled scripts are shown in green.<br />
Enable Scripts for One Component<br />
1. Select the IP address of the component.<br />
2. Right-click and click Enable.<br />
Enabled scripts are shown in green.<br />
Enable a Single Script<br />
1. Select the script name.<br />
2. Right-click and click Enable.<br />
The enabled script is shown in green.<br />
Disabling Scripts<br />
You can interactively disable all the scripts in any SSC file, the scripts for one component<br />
in the SSC file, or a single script. Perform the following steps on the Unisys<br />
<strong>SafeGuard</strong> 30m Collector program window.<br />
Disable All Scripts<br />
1. Select Components.<br />
2. Right-click and click Disable.<br />
Disabled scripts are shown in red.<br />
Disable Scripts for One Component<br />
1. Select the IP address of the component.<br />
2. Right-click and click Disable.<br />
Disabled scripts are shown in red.<br />
Disable a Single Script<br />
1. Select the script name.<br />
2. Right-click and click Disable.<br />
The disabled script is shown in red.<br />
Running Scripts<br />
You can interactively run all the scripts in any SSC file; the scripts for one component<br />
type such as RA, Storage, SAN Switch, or Other; the scripts for one component in the<br />
SSC file; or a single script.<br />
Note: You can use the Run button on the Collector toolbar or the Run command in the<br />
following procedures.<br />
Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
Run All Scripts<br />
1. Select Components.<br />
2. Right-click and click Run.<br />
Run Scripts for One Component Type<br />
1. Select a component type—RA, Storage, SAN Switch, or Other.<br />
2. Right-click and click Run.<br />
The status of the executing scripts is displayed in the right pane. The status bar<br />
shows the component type that is running, the IP address, the script name, and<br />
instructions for halting script execution. A progress bar indicates that the Collector is<br />
running the script and shows the amount of data being captured by the script. Once<br />
script execution completes, the status bar shows the last script run.<br />
Run Scripts for One Component<br />
1. Select either the IP address or custom-named component.<br />
2. Right-click and click Run.<br />
The status of the executing scripts is displayed in the right pane. The status bar<br />
shows the component type that is running, the IP address, the script name, and<br />
instructions for halting script execution. A progress bar indicates that the Collector is<br />
running the script and shows the amount of data being captured by the script. Once<br />
script execution completes, the status bar shows the last script run.<br />
Run a Single Script<br />
1. Select a script name.<br />
2. Right-click and click Run.<br />
The status of the executing scripts is displayed in the right pane. The status bar<br />
shows the component type that is running, the IP address, the script name, and<br />
instructions for halting script execution. A progress bar indicates that the Collector is<br />
running the script and shows the amount of data being captured by the script. Once<br />
script execution completes, the status bar shows the last script run.<br />
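Behind each Run command, the Collector executes the enabled script commands and captures their output. The following is a minimal stand-in for that behavior using the standard subprocess module; the function name and timeout value are invented for the sketch and are not part of the Collector.

```python
import subprocess

def run_collection_command(command, timeout_seconds=600):
    """Run one script command and capture its output, roughly as the
    Collector does when you click Run (illustrative, not the real code)."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True,
            timeout=timeout_seconds,
        )
        return result.returncode, result.stdout
    except subprocess.TimeoutExpired:
        # A hung command is abandoned, much like stopping a script
        # from the Collector toolbar.
        return None, ""
```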
Stopping Script Execution<br />
To stop a script while it is executing, click Stop on the Collector toolbar. All scripts that<br />
have been stopped are marked with a green X. The status of the stopped script is<br />
displayed in the right pane.<br />
Deleting Scripts<br />
You can interactively delete scripts only in the Components.ssc file. Perform the<br />
following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
Delete Scripts for One Component<br />
1. Select the IP address or custom-named component.<br />
2. Right-click and click Delete.<br />
Delete a Single Script<br />
1. Expand an IP address or a custom-named component; then select a script name.<br />
2. Right-click and click Delete.<br />
Adding Scripts for RA, Storage, and SAN Switch Component Types<br />
You can interactively add custom scripts to any SSC file by copying an existing script or<br />
by specifying a new script. Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m<br />
Collector program window.<br />
Add New Script for a Component Type<br />
1. Select a component type—RA, Storage, or SAN Switch.<br />
2. Right-click and click New.<br />
3. Complete the script form.<br />
4. Click Save.<br />
Add a New Script Based on an Existing Custom Script<br />
1. Select a script name.<br />
2. Right-click and click New.<br />
3. Complete the form. Change the script name and the command.<br />
4. Click Save.<br />
Adding Scripts for the Other Component Type<br />
Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
1. Select the component type Other.<br />
2. Right-click and click New.<br />
3. On the Select Program dialog box, navigate to the appropriate directory and<br />
choose the file to run. Then click Open.<br />
4. On the Script dialog box, type a component name in the Component field.<br />
5. Type a unique name for the script in the Script Name field.<br />
6. Review the selected file name that is displayed in the Command field. Modify the<br />
file name as necessary.<br />
The following example illustrates using a custom component (adding a new script as<br />
shown in the previous procedure) to mount and unmount drives.<br />
Note: In this example, the Collector must be installed on a server that has either the<br />
kutils utility (installed with the splitter) or the stand-alone kutils utility installed.<br />
C:\batch_File\mount_r.bat<br />
REM This batch file, when run, remounts the specified drive<br />
@echo on<br />
cd "c:\program files\kdriver\kutils"<br />
kutils.exe umount r:<br />
kutils.exe mount r:<br />
echo Finished<br />
C:\batch_File\unmount_r.bat<br />
REM This batch file, when run, flushes the file system and unmounts the specified drive<br />
cd "c:\program files\kdriver\kutils"<br />
kutils.exe flushFS r:<br />
kutils.exe umount r:<br />
Scheduling an SSC File<br />
Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
1. Click Schedule on the menu bar.<br />
2. On the Schedule Unisys <strong>SafeGuard</strong> 30m Collector File dialog box, enter the<br />
information required for each field as follows:<br />
a. Type the password.<br />
b. Type the date and start time.<br />
c. Select a Perform task option, which determines how often the schedule runs.<br />
d. Enter the end date if shown. (You do not need an end date for a Perform task of<br />
Once.)<br />
3. Click Select.<br />
4. On the Select Unisys <strong>SafeGuard</strong> 30m Collector dialog box, select the<br />
appropriate SSC file for which you wish to run the schedule, and then click Open.<br />
The Schedule Unisys <strong>SafeGuard</strong> 30m Collector File dialog box is again<br />
displayed. The Collector opens the selected SSC file as the current SSC file.<br />
5. Click Add.<br />
6. Click Exit.<br />
Note: You can create one schedule for an SSC file. To create additional schedules,<br />
create additional SSC files with the desired scripts enabled. The resultant scheduled data<br />
is appended to any current data (if available). For example, if you run the Collector using<br />
Windows Scheduler three times, three outputs are displayed in the right pane one after<br />
another with the timestamps for each.<br />
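The append behavior described in the note can be pictured as follows. The file layout in this sketch is an illustration only, not the actual SSC/SSD data format.

```python
from datetime import datetime
from pathlib import Path

def append_run_output(data_file, script_name, output):
    """Append one run's output under a timestamp header, so repeated
    scheduled runs accumulate one after another (illustrative format)."""
    stamp = datetime.now().strftime("%Y/%m/%d-%H:%M:%S")
    with Path(data_file).open("a", encoding="utf-8") as f:
        f.write(f"=== {script_name} {stamp} ===\n{output}\n")
```

Running the same scheduled collection three times would therefore leave three timestamped outputs in the file, one after another, as the note describes.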
Querying a Scheduled SSC File<br />
Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
1. Click Schedule from the menu bar.<br />
2. On the Schedule Unisys <strong>SafeGuard</strong> 30m Collector File dialog box, click<br />
Query.<br />
3. On the Tasks window, select the task name that is the same as the scheduled SSC<br />
file.<br />
4. Right-click and click Properties.<br />
5. View the details of the scheduled task in the window; then click OK to close the<br />
task Properties window.<br />
6. Close the Tasks window and then select the Schedule Unisys <strong>SafeGuard</strong> 30m<br />
Collector window.<br />
7. Click Exit.<br />
Note: For the Microsoft Vista operating system, if you want to see the scheduled task<br />
after scheduling a task, click Query on the Schedule Unisys <strong>SafeGuard</strong> 30m<br />
Collector File dialog box. The Microsoft Management Console (MMC) window is<br />
displayed. Press F5 to refresh the view and see the scheduled task.<br />
Deleting a Scheduled SSC File<br />
Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
1. Select Schedule from the menu bar.<br />
2. On the Schedule Unisys <strong>SafeGuard</strong> 30m Collector File dialog box, click<br />
Query.<br />
3. On the Tasks window, select the task name that is the same as the scheduled SSC<br />
file.<br />
4. Right-click and click Delete.<br />
5. Close the Tasks window and then select the Schedule Unisys <strong>SafeGuard</strong> 30m<br />
Collector window.<br />
6. Click Exit.<br />
Using View Mode<br />
If you installed the Collector in View mode, the support personnel at the Unisys <strong>Support</strong><br />
Center can use View mode to view the information. To access the Collector, follow<br />
these steps:<br />
1. Start the Collector.<br />
2. On the Open Unisys <strong>SafeGuard</strong> 30m Collector File dialog box, click Cancel.<br />
The Unisys <strong>SafeGuard</strong> 30m Collector program window is displayed.<br />
Note: Once an SSD file is extracted, you can select the .ssc file.<br />
3. On the File menu, click Uncompress SSD.<br />
4. On the Open <strong>SafeGuard</strong> 30m Data File dialog box, select from the list of<br />
available files the SSD file that you wish to uncompress.<br />
5. In View mode, expand the components tree and then expand a component type:<br />
RA, Storage, SAN Switch, or Other.<br />
6. Click a script name from those displayed to view the data collected from that script.<br />
The data is displayed in the right pane.<br />
The following figure displays a sample of View mode with data displayed in the right<br />
pane.<br />
7. On the File menu, click Exit.<br />
Appendix H<br />
Using kutils<br />
Usage<br />
The server-based kutils utility enables you to manage host splitters across all platforms.<br />
This utility is installed automatically when you install the Unisys <strong>SafeGuard</strong> 30m splitter<br />
on a host machine. When the splitting function is performed by an intelligent fabric<br />
switch, you can install a stand-alone version of the kutils utility separately on host<br />
machines.<br />
For details on the syntax and use of the kutils commands, see the Unisys <strong>SafeGuard</strong><br />
<strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong>.<br />
A kutils command is always introduced with the kutils string. If you enter the string<br />
independently—that is, without any parameters—the kutils utility returns usage notes,<br />
as follows:<br />
C:\program files\kdriver\kutils>kutils<br />
Usage: kutils <br />
Path Designations<br />
You can designate the path to a device in the following ways:<br />
• Device path example<br />
“SCSI\DISK&VEN_KASHYA&PROD_MASTER_RIGHT&REV_0001\5&133EF78A&0&000”<br />
• Storage path example<br />
“SCSI#DISK&VEN_KASHYA&PROD_MASTER_RIGHT&REV_0001#5&133EF78A&0&000#{53<br />
f56307-b6bf-11d0-94f2-00a0c91efb8b}”<br />
• Volume path example<br />
“\\?\Volume{33b4a391-26af-11d9-b57b-505054503030}”<br />
Each command notes the particular designation to use. In addition, some commands,<br />
such as showDevices and showFS, return the symbolic link for a device. The symbolic<br />
link generally provides additional information about the characteristics of the specific<br />
devices.<br />
The following are examples of symbolic links:<br />
“\Device\0000005c”<br />
“\Device\EmcPower\Power2”<br />
“\Device\Scsi\q123001Port2Path0Target0Lun2”<br />
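The three designations and the symbolic links each have a recognizable shape, so they can be told apart mechanically. The classifier below is a heuristic built only from the examples above; it is not an official kutils parser.

```python
import re

def classify_designation(path):
    """Heuristically classify a kutils path designation by its shape
    (based only on the examples shown above)."""
    if path.startswith("\\\\?\\Volume{"):
        return "volume path"
    if path.startswith("\\Device\\"):
        return "symbolic link"
    # Storage paths use '#' separators and end with a GUID in braces.
    if path.startswith("SCSI#") and re.search(r"#\{[0-9a-fA-F-]+\}$", path):
        return "storage path"
    if path.startswith("SCSI\\"):
        return "device path"
    return "unknown"
```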
Command Summary<br />
The kutils utility offers the following commands:<br />
• disable: Removes host access to the specified device or volume (Windows only).<br />
• enable: Restores host access to a specified device or volume (Windows only).<br />
• flushFS: Initiates an operating system flush of the file system (Windows only).<br />
• manage_auto_host_info_collection: Indicates whether the automatic host<br />
information collection is enabled or disabled, or enables or disables automatic host<br />
information collection.<br />
• mount: Mounts a file system (Windows only).<br />
• rescan: Scans storage for all existing disks (Windows only).<br />
• showDevices: Presents a list of physical devices to which the host has access,<br />
providing (as available) the device path, storage path, and symbolic link for each<br />
device (Windows only).<br />
• showFS: Presents the drive designation and, as available, the device path, storage<br />
path, and symbolic link for each mounted physical device (Windows only).<br />
• show_vol_info: Presents information on the specified volume, including the Unisys<br />
<strong>SafeGuard</strong> 30m solution name (if “created” in Unisys <strong>SafeGuard</strong> <strong>Solutions</strong>), size, and<br />
storage path.<br />
• show_vols: Presents information on all volumes to which the host has access,<br />
including the Unisys <strong>SafeGuard</strong> 30m solution name (if “created” in Unisys<br />
<strong>SafeGuard</strong> <strong>Solutions</strong>), size, and storage path.<br />
• sqlRestore: Restores an image previously created by the sqlSnap command<br />
(Windows only).<br />
• sqlSnap: Creates a VDI-based SQL Server image (Windows only).<br />
• start: Resumes the splitting of write operations.<br />
• stop: Discontinues the splitting of write operations to an RA (that is, places the host<br />
splitter in pass-through mode in which data is written to storage only).<br />
• umount: Unmounts the file system (Windows only).<br />
Appendix I<br />
Analyzing Cluster Logs<br />
Samples of cluster log messages for problems and situations are listed throughout this<br />
guide. You can search on text strings from cluster log messages to find specific<br />
references.<br />
The information gathered in cluster logs is critical in determining the cause of a given<br />
cluster problem. Without the diagnostic information from the cluster logs, you might find<br />
it difficult to determine the root cause of a cluster problem.<br />
This appendix provides information to help you use the cluster log as a diagnostic tool.<br />
Introduction to Cluster Logs<br />
The cluster log is a text log file updated by the Microsoft Cluster Service (MSCS) and its<br />
associated cluster resource. The cluster log contains diagnostic messages about cluster<br />
events that occur on an individual cluster member or node. This file provides more<br />
detailed information than the cluster events written in the system event log.<br />
A cluster log reports activity for one node. All member nodes in a cluster perform as a<br />
single unit. Therefore, when a problem occurs, it is important to gather log information<br />
from all member nodes in the cluster. This information gathering is typically done using<br />
the Microsoft MPS Report Utility. Gather the information immediately after a problem<br />
occurs to ensure cluster log data is not overwritten.<br />
By default, the cluster log name and location are as follows:<br />
• C:\Winnt\Cluster\cluster.log<br />
Note: For Windows 2003, the cluster.log file is located in the following path:<br />
C:\WINDOWS\Cluster<br />
• Captured with MPS Report Utility: _Cluster.log<br />
Creating the Cluster Log<br />
In Windows 2000 Advanced Server and Windows 2000 Datacenter Server, by default,<br />
cluster logging is enabled on all nodes. You can define the characteristics and behavior of<br />
the cluster log with system environment variables.<br />
To access the system environment variables, perform the following actions:<br />
1. In Control Panel, double-click System.<br />
2. Select the Advanced tab.<br />
3. Click Environment Variables.<br />
You can find additional information regarding the system environment variables in<br />
Microsoft Knowledge Base article 168801, “How to Turn On Cluster Logging in Microsoft<br />
Cluster Server” at this URL:<br />
http://support.microsoft.com/default.aspx?scid=kb;en-us;168801<br />
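A small sketch of the "default still in effect" rule: read each variable from the environment, falling back to the defaults from Table I–1 when it is not set. Values are kept as strings, as environment variables are; the function name is invented for the example.

```python
import os

# Defaults from Table I–1; a variable that is not present in the
# environment still has its default value in effect.
CLUSTER_LOG_DEFAULTS = {
    "ClusterLog": r"%SystemRoot%\Cluster\Cluster.log",
    "ClusterLogSize": "8",       # MB; 64 MB is the recommended setting
    "ClusterLogLevel": "2",      # errors and warnings
    "ClusterLogOverwrite": "0",  # disabled
}

def effective_cluster_settings(env=None):
    """Return the cluster log settings actually in effect."""
    env = os.environ if env is None else env
    return {name: env.get(name, default)
            for name, default in CLUSTER_LOG_DEFAULTS.items()}
```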
The default cluster settings are listed in Table I–1. Some parameters might not be listed<br />
when viewing the system environment variables. If a variable is not listed, its default<br />
value is still in effect.<br />
Table I–1. System Environment Variables Related to Clustering<br />
Variable Name Default Setting Comment<br />
ClusterLog %SystemRoot%<br />
\Cluster\Cluster.log<br />
Determines the location and name<br />
of cluster log file.<br />
ClusterLogSize 8 MB Determines the size of the cluster<br />
log. The default size is usually not<br />
large enough to retain history on<br />
enterprise systems. The<br />
recommended setting is 64 MB.<br />
ClusterLogLevel 2 Sets the level of detail for log<br />
entries, as follows:<br />
0 = No logging<br />
1 = Errors only<br />
2 = Errors and Warnings<br />
3 = Everything that occurs<br />
Used only with the /debug<br />
parameter on MSCS startup.<br />
Review Microsoft Knowledge Base<br />
article 258078 for more information<br />
about using the /debug parameter.<br />
ClusterLogOverwrite 0 Determines whether a new cluster<br />
log is to be created when MSCS<br />
starts, as follows:<br />
0 = Disabled<br />
1 = Enabled<br />
Note: By default, the<br />
ClusterLogOverwrite setting is<br />
disabled. Unisys recommends that<br />
this setting remain disabled. When<br />
this setting is enabled, all cluster<br />
log history is lost if MSCS is<br />
restarted twice in succession.<br />
Understanding the Cluster Log Layout<br />
Figure I–1 illustrates the layout of the cluster log: each entry begins with a process ID,<br />
a thread ID, the date, and the GMT time. The paragraphs following the figure explain<br />
the various parts of the layout.<br />
Figure I–1. Layout of the Cluster Log<br />
The process ID is the process number assigned by the operating system to a service or<br />
application.<br />
The thread ID is a thread of a particular process. A process typically has multiple threads<br />
listed. Within a large cluster log, it is particularly useful to search by thread ID to find the<br />
messages related to the same thread.<br />
The date listed is the date of the entry. You can use this date to match the date of the<br />
problem in the system event log.<br />
The time entered in the Windows 2000 cluster log is always in Greenwich Mean Time<br />
(GMT). The format of the entry is HH:MM:SS.SSS. The SS.SSS entry represents<br />
seconds carried out to thousandths of a second. There can be multiple .SSS entries<br />
for the same thousandth of a second; therefore, more than 999 cluster log entries can<br />
exist for any given second.<br />
Cluster Module<br />
Table I–2 lists the various modules of MSCS. These module names are logged within<br />
square brackets in the cluster log.<br />
Table I–2. Modules of MSCS<br />
API API <strong>Support</strong><br />
ClMsg Cluster messaging<br />
ClNet Cluster network engine<br />
CP Checkpoint Manager<br />
CS Cluster service<br />
DM Database Manager<br />
EP Event Processor<br />
FM Failover Manager<br />
GUM Global Update Manager<br />
INIT Initialization<br />
JOIN Join<br />
LM Log Manager<br />
MM Membership Manager<br />
NM Node Manager<br />
OM Object Manager<br />
RGP Regroup<br />
RM Resource Monitor<br />
For additional descriptions of the cluster components, refer to the Windows 2000 Server<br />
Resource Kit at this URL:<br />
http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/default.mspx?mfr=true<br />
In the left navigation pane, click Windows 2000 Server Resource Kit, click<br />
Distributed Systems <strong>Guide</strong>, then Enterprise Technologies, and then<br />
Interpreting the Cluster Log.<br />
Click the following link for Windows 2003 to refer to the Windows 2003 Server Resource<br />
Kit:<br />
http://www.microsoft.com/windowsserver2003/techinfo/reskit/tools/default.mspx<br />
Click the following link to interpret the cluster logs:<br />
http://technet2.microsoft.com/windowsserver/en/library/16eb134d-584e-46d9-9bf4-6836698cd26a1033.mspx?mfr=true<br />
Sample Cluster Log<br />
The sample cluster log that follows illustrates the component names in brackets.<br />
00000848.00000ba0::2008/05/05-16:11:31.000 [RGP] Node 1: REGROUP INFO:<br />
regroup engine requested immediate shutdown.<br />
00000848.00000ba0::2008/05/05-16:11:31.000 [NM] Prompt shutdown is requested<br />
by a membership engine<br />
00000adc.00000acc::2008/05/05-16:11:31.234 [RM] Going away, Status = 1,<br />
Shutdown = 0.<br />
Cluster Operation<br />
The cluster operation is the task currently being performed by the cluster. Each cluster<br />
module (listed in Table I–2) can perform hundreds of operations, such as forming a<br />
cluster, joining a cluster, checkpointing, moving a group manually, and moving a group<br />
because of a failure.<br />
Posting Information to the Cluster Log<br />
The cluster log file is organized by date and time. Process threads of MSCS and<br />
resources post entries in an intermixed fashion. As the threads are performing various<br />
cluster functions, they constantly post entries to the cluster log in an interspersed<br />
manner.<br />
The following sample cluster log shows various disks in the process of coming online.<br />
The entries are not logically grouped by disk; rather, the entries are logged as each<br />
thread posts its unique information. In each entry, the second hexadecimal field (after<br />
the period) is the thread ID.<br />
Sample Cluster Log<br />
00000444.00000600::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb] Issuing GetSectorSize on signature 9a042144.<br />
00000444.000005e0::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb]Successful read (sector 12) [:0] (0,00000000:00000000).<br />
00000444.00000608::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb]DisksOpenResourceFileHandle: CreateFile successful.<br />
00000444.00000600::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb] GetSectorSize completed, status 0.<br />
00000444.00000608::2008/11/18-18:23:48.307 Physical Disk :<br />
DiskArbitration must be called before DisksOnline.<br />
00000444.00000600::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb] ArbitrationInfo.SectorSize is 512<br />
00000444.00000608::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb] Arbitration Parameters (1 9999).<br />
00000444.00000600::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb] Issuing GetPartInfo on signature 9a042144.<br />
Because the cluster performs many operations simultaneously, the log entries pertaining<br />
to a particular thread are interwoven with those of the other cluster<br />
operations. Depending on the number of cluster groups and resources, reading a cluster<br />
log can become difficult.<br />
Tip: To follow a particular operation, search by the thread ID. For instance, to follow<br />
online events for Physical Disk V, perform these steps using the preceding sample<br />
cluster log:<br />
1. Anchor the cursor in the desired area.<br />
2. Search up or down for thread 00000600.<br />
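That manual search can also be scripted. The sketch below splits each entry into its fields, using a pattern inferred from the samples in this appendix, and keeps only the lines for one thread (wrapped continuation lines are not re-joined in this sketch).

```python
import re

# Entry shape inferred from the samples above:
# <process>.<thread>::<yyyy/mm/dd>-<hh:mm:ss.sss> <message>
ENTRY = re.compile(
    r"^(?P<process>[0-9a-f]{8})\.(?P<thread>[0-9a-f]{8})"
    r"::(?P<date>\d{4}/\d{2}/\d{2})-(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s*"
    r"(?P<message>.*)$"
)

def entries_for_thread(log_text, thread_id):
    """Return the parsed entries posted by one thread."""
    hits = []
    for line in log_text.splitlines():
        m = ENTRY.match(line)
        if m and m.group("thread") == thread_id:
            hits.append(m.groupdict())
    return hits
```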
Diagnosing a Problem Using Cluster Logs<br />
The following topics provide you with useful information for diagnosing problems using<br />
cluster logs:<br />
• Gathering Materials<br />
• Opening the Cluster Log<br />
• Converting GMT/UTC to Local Time<br />
• Converting Cluster Log GUIDs to Text Resource Names<br />
• Understanding State Codes<br />
• Understanding Persistent State<br />
• Understanding Error and Status Codes<br />
Gathering Materials<br />
You need to gather the following pieces of information, tools, and files to use with the<br />
cluster logs to diagnose problems:<br />
• Information<br />
− Date and time of problem occurrence<br />
− Server time zone<br />
• Tools<br />
− Notepad or Wordpad text viewer<br />
− Net Helpmsg command-line tool<br />
This command-line tool is embedded in Windows. The command syntax is Net Helpmsg nnn, where nnn is the error or status code.<br />
• Output from the MPS Report Utility from all cluster nodes<br />
• Files from the MPS Report Utility run<br />
− Cluster log (Mandatory)<br />
The file name is servername_Cluster.log, where servername is the name of the cluster node.<br />
− System event log (Mandatory)<br />
The file name is servername_Event_Log_System.txt.<br />
− .nfo system information file for installed adapters and driver versions (Reference)<br />
The file name is servername_Msinfo.nfo.<br />
− Cluster registry hive for cross-referencing information used in the cluster log<br />
(Reference)<br />
The file name is servername_Cluster_Registry.hiv.<br />
− Cluster configuration file for a basic listing of cluster nodes, groups, resources,<br />
and dependencies (available in MPS Report Utility version 7.2 or later)<br />
The file name is servername_Cluster_mps_Information.txt.<br />
Opening the Cluster Log<br />
Use a text editor to view the cluster log file in the MPS Report Utility. Notepad or<br />
Wordpad works well. Notepad allows text searches up or down the document. Wordpad<br />
allows text searches only down the document.<br />
Note: Do not open the cluster.log file on a production cluster. Logging stops while the<br />
file is open. Instead, copy the cluster.log file first and then open the copy to read the file.<br />
The cluster log is on the local system in the directory Winnt/Cluster/Cluster.log.<br />
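The copy-then-read rule in the note above can be scripted. This Python sketch works on illustrative temporary paths; on a cluster node the source would be the Cluster.log file in the location named above:<br />

```python
import os
import shutil
import tempfile

def snapshot_cluster_log(src, dest_dir):
    """Copy the live cluster log aside so only the copy is ever opened
    (opening the original on a production cluster stops logging)."""
    dest = os.path.join(dest_dir, "cluster.log.copy")
    shutil.copyfile(src, dest)
    return dest

# Illustrative stand-in for the real Winnt/Cluster/Cluster.log path.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "cluster.log")
with open(src, "w") as f:
    f.write("00000444.00000600::sample entry\n")

copy = snapshot_cluster_log(src, workdir)
with open(copy) as f:
    print(f.read(), end="")  # 00000444.00000600::sample entry
```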
Converting GMT/UTC to Local Time<br />
The time posted in the cluster log is given as GMT/UTC. You must convert GMT/UTC to<br />
local time to cross-reference cluster log entries with system and application event<br />
log entries.<br />
You can find the local time zone in the .nfo file in MPS Reports under the system summary.<br />
You can also use the Web site www.worldtimeserver.com to find the accurate local time for a<br />
given city, the GMT/UTC time, and the difference between the two in hours.<br />
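For a handful of entries, the conversion can also be done programmatically. This Python sketch parses the cluster log timestamp format and applies the server's zone offset; the -5 hour offset below is an illustrative value, not something taken from this guide:<br />

```python
from datetime import datetime, timedelta

def log_time_to_local(stamp, utc_offset_hours):
    """Convert a cluster log timestamp (posted in GMT/UTC) to local time.

    stamp uses the cluster log format, e.g. "2008/11/18-18:23:48.307".
    utc_offset_hours is the server's zone offset; -5 (U.S. Eastern
    Standard Time) is used here only as an example.
    """
    utc = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f")
    return utc + timedelta(hours=utc_offset_hours)

local = log_time_to_local("2008/11/18-18:23:48.307", -5)
print(local.strftime("%Y/%m/%d-%H:%M:%S.%f")[:-3])  # 2008/11/18-13:23:48.307
```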
Converting Cluster Log GUIDs to Text Resource Names<br />
A globally unique identifier (GUID) is a 32-character hexadecimal string used to identify a<br />
unique entity in the cluster. A unique entity can be a node name, group name, resource<br />
name, or cluster name.<br />
The GUID format is nnnnnnnn-nnnn-nnnn-nnnn-nnnnnnnnnnnn.<br />
The following are examples of GUIDs in the cluster log:<br />
000007d0.00000808::2008/04/23-21:48:23.105 [FM] FmpHandleResourceTransition: resource<br />
Name = ae775058-af20-4ba2-a911-af138b1f65bd old state=130 new state=3<br />
000007d0.00000808::2008/04/23-21:48:23.448 [FM] FmpRmOfflineResource: RMOffline() for<br />
6060dc33-5737-4277-b2f2-9cc45629ef0 returned error 997<br />
000007d0.00001970::2008/05/02-21:41:58.846 [FM] OnlineResource: e65bc275-66d1-41ff-<br />
8a4e-89ad6643838b depends on 758bb9bb-7d1f-4148-a994-684dd4f8c969. Bring online<br />
first.<br />
000007d0.0000081::2008/05/04-17:21:06.888 [FM] New owner of Group b072608c-b7f3-48b0-<br />
83f8-7c922c14e709 is 2, state 0, curstate 1.<br />
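A regular expression for the GUID format above can pull every GUID out of a log entry. This Python sketch is illustrative and uses one of the sample entries above; it assumes lowercase hexadecimal, as in the examples:<br />

```python
import re

# The GUID format described above: 8-4-4-4-12 hexadecimal digits.
GUID = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b")

entry = ("000007d0.00001970::2008/05/02-21:41:58.846 [FM] OnlineResource: "
         "e65bc275-66d1-41ff-8a4e-89ad6643838b depends on "
         "758bb9bb-7d1f-4148-a994-684dd4f8c969. Bring online first.")

print(GUID.findall(entry))
# ['e65bc275-66d1-41ff-8a4e-89ad6643838b', '758bb9bb-7d1f-4148-a994-684dd4f8c969']
```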
Mapping a Text Name to a GUID<br />
The two methods for mapping a text name to a GUID are<br />
• Automatic mapping<br />
• Reviewing the cluster registry hive<br />
Automatic Mapping<br />
The simplest method of mapping a text name to a GUID is the automatic mapping<br />
performed by some versions of the MPS Report tool. However, most versions of the<br />
MPS Report tool do not perform this automatic function.<br />
For those versions with the automatic mapping feature, you can find the information in<br />
the cluster configuration file (servername_Cluster_Mps_Information.txt). The<br />
following listing shows this mapping:<br />
f9f0b528-b674-40fb-9770-c65e17a2a387 = SQL Network Name<br />
f0dd1852-acc8-4921-b33a-a77dd5cdcfee = SQL Server Fulltext (SQL1)<br />
f0aca2c4-049f-4255-9332-92a69cc07326 = MSDTC<br />
eff360f3-d987-4a020-8f3c-4118056a50b2 = MSDTC IP Address<br />
e74769f8-67e1-43b2-9bec-93171c31d182 = SQL IP Address 1<br />
e09f61cf-8ebf-4cd1-9ae3-58ed4d2b0fbc = Disk K:<br />
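A listing in this form is easy to turn into a lookup table. The following Python sketch parses three of the lines shown above into a GUID-to-name dictionary:<br />

```python
# Three lines from the configuration-file listing shown above.
listing = """\
f9f0b528-b674-40fb-9770-c65e17a2a387 = SQL Network Name
f0aca2c4-049f-4255-9332-92a69cc07326 = MSDTC
e09f61cf-8ebf-4cd1-9ae3-58ed4d2b0fbc = Disk K:"""

# Build a GUID -> resource-name lookup table.
names = {}
for row in listing.splitlines():
    guid, _, name = row.partition(" = ")
    names[guid.strip()] = name.strip()

print(names["f0aca2c4-049f-4255-9332-92a69cc07326"])  # MSDTC
```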
Reviewing the Cluster Registry Hive<br />
The second method of mapping a text name to a GUID is more complex and involves<br />
opening the cluster registry hive from the MPS Report tool and then reviewing the<br />
contents.<br />
Follow these steps to open and review the cluster registry hive:<br />
1. Start the Registry Editor (Regedt32.exe).<br />
2. Click the HKEY_LOCAL_MACHINE hive.<br />
3. Click the HKEY_LOCAL_MACHINE root folder.<br />
4. Click Load Hive on the Registry menu.<br />
5. Select the servername_Cluster_Registry.hiv file; then press Ctrl-C.<br />
6. Select Open.<br />
7. Press Ctrl-V to obtain the key name.<br />
8. Expand the cluster hive and review the GUIDS, which are located in the subkeys<br />
Groups, Resources, Networks, and NetworkInterfaces, as shown in Figure I–2.<br />
Figure I–2. Expanded Cluster Hive (in Windows 2000 Server)<br />
Scroll through the GUIDs until you find the one that matches the GUID from the<br />
cluster log. You can also open each key until you find the matching GUID.<br />
Tip: Under each GUID is a TYPE field. This field identifies a resource type such as<br />
physical disk, IP address, network name, generic application, generic service, and so<br />
forth. You can use this field to find a specific resource type and then map it to the GUID.<br />
Understanding State Codes<br />
MSCS uses state codes to determine the status of a cluster component. The state varies<br />
depending on the type of cluster component: nodes, groups, resources,<br />
networks, and network interfaces. Some state codes are posted in the cluster<br />
log using the numeric code and others using the actual value for the code.<br />
Examples of State Codes in the Cluster Log<br />
The following example entries show state codes for the resource, group, network<br />
interface, node, and network types of cluster component:<br />
• Resource<br />
In this example, the resource is changing states from online pending (129) to online<br />
(2).<br />
00000850.00000888::2008/05/05-17:37:29.125 [FM] FmpHandleResource<br />
Transition: Resource Name = 87e55402-87cb-4354-95e7-6dd864b79039 old state =<br />
129 new state=2<br />
• Group<br />
In this example, the group state is set to offline (1).<br />
00000898.000008a0::2008/05/05-06:25:55:062 [FM] Setting group 1951e272-6271-<br />
4ea3-b0f9-cd767537f245 owner to node 2, state 1<br />
• Network interface<br />
This example provides the actual value of the state code, not the numeric code.<br />
00000898.00000598:2008/05/05-06:28:40;921 [ClMsg] Received interface<br />
unreachable event for node 2 network 1<br />
• Node<br />
This example provides the actual value of the state code, not the numeric code.<br />
00000898.0000060c::2008/05/05-06:28:45:953 [EP] Node down event received<br />
00000898.000008a8:2008/05/05-06:28:45:953 [Gum] Nodes down: 0002. Locker=1,<br />
Locking=1<br />
• Network<br />
This example provides the actual value of the state code, not the numeric code.<br />
00000898.000008a4::2008/05/05-06:25:53:703 [NM] Processing local interface<br />
up event for network 0433c4e2-a577-4325-9ebd-a9d3d2b9b81f.<br />
State Codes<br />
Table I–3 lists the state codes from the Windows 2000 Resource Kit for nodes.<br />
Table I–3. Node State Codes<br />
State Code State<br />
–1 ClusterNodeStateUnknown<br />
0 ClusterNodeUp<br />
1 ClusterNodeDown<br />
2 ClusterNodePaused<br />
3 ClusterNodeJoining<br />
Table I–4 lists the state codes from the Windows 2000 Resource Kit for groups.<br />
Table I–4. Group State Codes<br />
State Code State<br />
–1 ClusterGroupStateUnknown<br />
0 ClusterGroupOnline<br />
1 ClusterGroupOffline<br />
2 ClusterGroupFailed<br />
3 ClusterGroupPartialOnline<br />
Table I–5 lists the state codes from the Windows 2000 Resource Kit for resources.<br />
Table I–5. Resource State Codes<br />
State Code State<br />
–1 ClusterResourceStateUnknown<br />
0 ClusterResourceInherited<br />
1 ClusterResourceInitializing<br />
2 ClusterResourceOnline<br />
3 ClusterResourceOffline<br />
4 ClusterResourceFailed<br />
128 ClusterResourcePending<br />
129 ClusterResourceOnlinePending<br />
130 ClusterResourceOfflinePending<br />
Table I–6 lists the state codes from the Windows 2000 Resource Kit for network<br />
interfaces.<br />
Table I–6. Network Interface State Codes<br />
State Code State<br />
–1 ClusterNetInterfaceStateUnknown<br />
0 ClusterNetInterfaceUnavailable<br />
1 ClusterNetInterfaceFailed<br />
2 ClusterNetInterfaceUnreachable<br />
3 ClusterNetInterfaceUp<br />
Table I–7 lists the state codes from the Windows 2000 Resource Kit for networks.<br />
Table I–7. Network State Codes<br />
State Code State<br />
–1 ClusterNetworkStateUnknown<br />
0 ClusterNetworkUnavailable<br />
1 ClusterNetworkDown<br />
2 ClusterNetworkPartitioned<br />
3 ClusterNetworkUp<br />
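The tables above can be encoded as a lookup table to decode transitions mechanically. This Python sketch uses the resource state codes from Table I–5 and the sample [FM] transition entry shown earlier; the helper name is illustrative:<br />

```python
import re

# Resource state codes from Table I-5.
RESOURCE_STATES = {
    -1: "ClusterResourceStateUnknown",
    0: "ClusterResourceInherited",
    1: "ClusterResourceInitializing",
    2: "ClusterResourceOnline",
    3: "ClusterResourceOffline",
    4: "ClusterResourceFailed",
    128: "ClusterResourcePending",
    129: "ClusterResourceOnlinePending",
    130: "ClusterResourceOfflinePending",
}

def decode_transition(entry):
    """Decode the 'old state = 129 new state=2' fields of an [FM] entry."""
    m = re.search(r"old state\s*=\s*(\d+)\s+new state\s*=\s*(\d+)", entry)
    old, new = (int(x) for x in m.groups())
    return RESOURCE_STATES[old], RESOURCE_STATES[new]

entry = ("[FM] FmpHandleResourceTransition: Resource Name = "
         "87e55402-87cb-4354-95e7-6dd864b79039 old state = 129 new state=2")
print(decode_transition(entry))
# ('ClusterResourceOnlinePending', 'ClusterResourceOnline')
```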
Understanding Persistent State<br />
Persistent state is not a state code, but rather a key in the cluster registry hive for groups<br />
and resources. The persistent state key reflects the current state of a resource or group.<br />
This key is not a permanent value; it changes value when a group or resource changes<br />
states.<br />
You can change the value of the persistent state key, which can be useful for<br />
troubleshooting or managing the cluster. For example, you can change the value before a<br />
manual failover or shutdown to prevent a particular group or resource from starting<br />
automatically.<br />
The value for the persistent state can be 0 (disabled or offline) or 1 (enabled or online).<br />
The default value is 1.<br />
If the value for persistent state is 0, the group or resource remains in an offline state<br />
until it is manually brought online.<br />
The following is an example cluster log reference to persistent state:<br />
000008bc.00000908::2008/05/12-23:45:36.687 [FM] FmpPropagateGroupState:<br />
Group 1951e272-6271-4ea3-b0f9-cd767537f245 state = 3, persistent state = 1<br />
For more information about persistent state, view Microsoft Knowledge Base article<br />
259243, “How to Set the Startup Value for a Resource on a Clustered Server” at this<br />
URL:<br />
http://support.microsoft.com/default.aspx?scid=kb;en-us;259243<br />
Understanding Error and Status Codes<br />
You can easily interpret error and status codes that occur in cluster log entries by issuing<br />
the following command from the command line, where nnn is the error or status code:<br />
Net Helpmsg nnn<br />
This command returns a line of explanatory text that corresponds to the number.<br />
Examples<br />
• For the error code value of 5 as shown in the following example, the Net Helpmsg<br />
command returns “Access is denied.”<br />
00000898.000008f0:2008/30-16:03:31.979 [DM] DmpCheckpointTimerCb -Failed to<br />
reset log, error=5<br />
• For the status code value of 997 as shown in the following example, the Net<br />
Helpmsg command returns “Overlapped I/O operation is in progress.” This status<br />
code is also known as “I/O pending.”<br />
00000898.00000a8c::2008/05/05-06:38:14.187 [FM] FmpOnlineResource: Returning<br />
Resource 87e55402-87cb-4354-95e7-6dd864b79039, state 129, status 997<br />
• For the status code value of 170 as shown in the following example, the Net<br />
Helpmsg command returns “The requested resource is in use.”<br />
000009a4.000009c4::2008/05/15-07:28:42.303 Physical Disk :[DiskArb]<br />
CompletionRoutine, status 170<br />
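When the log is being read away from a Windows machine, the messages that Net Helpmsg returns for the codes in these examples can be kept in a small table. This Python sketch is illustrative and covers only the three codes discussed above:<br />

```python
import re

# Messages that Net Helpmsg returns for the three codes discussed above.
HELPMSG = {
    5: "Access is denied.",
    170: "The requested resource is in use.",
    997: "Overlapped I/O operation is in progress.",
}

def explain(entry):
    """Look up the trailing error/status number of a cluster log entry."""
    m = re.search(r"(?:error|status)\s*=?\s*(\d+)\s*$", entry)
    return HELPMSG.get(int(m.group(1)), "unknown code") if m else "no code found"

print(explain("[DM] DmpCheckpointTimerCb -Failed to reset log, error=5"))
# Access is denied.
```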
Index<br />
A<br />
accessing an image, 3-2<br />
analyzing<br />
intelligent fabric switch logs, A-16<br />
RA log collection files, A-8<br />
server (host) logs, A-16<br />
B<br />
bandwidth, verifying, D-7<br />
bin directory, A-14<br />
C<br />
changes for this release, 1-2<br />
clearing the system event log (SEL), B-1<br />
ClearPath MCP<br />
bringing data consistency group online, 3-5<br />
manual failover, 3-5<br />
recovery tasks, 3-5<br />
CLI file, A-10<br />
clock synchronization, verifying, D-8<br />
cluster failure, recovering, 4-19<br />
cluster log<br />
cluster registry hive, I-9<br />
definition, I-1<br />
error and status codes, I-15<br />
GUID format, I-8<br />
GUIDs, I-8<br />
layout, I-3<br />
mapping GUID to text name, I-8<br />
name and location, I-1<br />
opening, I-7<br />
overview, 2-9<br />
persistent state, I-14<br />
state codes, I-10, I-12<br />
cluster registry hive, I-9<br />
cluster service modules, I-4<br />
cluster settings<br />
system environment variables, I-2<br />
cluster setup, checking, 4-1<br />
collecting host logs<br />
using host information collector (HIC)<br />
utility, A-7<br />
using MPS utility, A-6<br />
collecting RA logs, A-1, A-3<br />
Collector (See Unisys <strong>SafeGuard</strong> 30m<br />
Collector)<br />
collector directory, A-11<br />
configuration settings, saving, D-2<br />
configuring additional RAs, D-4<br />
configuring the replacement RA, D-6<br />
connecting, accessing the replacement<br />
RA, D-4<br />
connectivity testing tool messages, C-8<br />
converting local time to GMT or UTC, A-3<br />
D<br />
data consistency group<br />
bringing online, 3-3, 4-9<br />
bringing online for ClearPath MCP, 3-5<br />
manual failover, 3-2, 4-8<br />
manual failover for ClearPath MCP, 3-5<br />
recovery tasks, 3-2, 3-5, 4-7<br />
recovery tasks for ClearPath MCP, 3-5<br />
taking offline, 4-7, 5-9<br />
data flow, overview, 2-3<br />
detaching the failed RA, D-3<br />
determining when the failure occurred, A-2<br />
diagnostics<br />
Installation Manager, C-1<br />
RA hardware, B-2<br />
directory<br />
bin, A-14<br />
collector, A-11<br />
etc, A-11<br />
files, A-11<br />
home, A-11, A-14<br />
host log extraction, A-15<br />
InfoCollect, A-12<br />
processes, A-12<br />
rreasons, A-11
sbin, A-12<br />
tmp, A-14<br />
usr, A-13<br />
E<br />
e-mail notifications<br />
configuring a diagnostic e-mail<br />
notification, 2-8<br />
overview, 2-8<br />
enabling PCI-X slot functionality, D-5<br />
environment settings, restoring, D-2<br />
etc directory, A-11<br />
event log, E-1<br />
displaying, E-3<br />
event levels, E-2<br />
event scope, E-2<br />
event topics, E-1<br />
list of Detailed events, E-22<br />
list of Normal events, E-5<br />
overview, 2-7<br />
using for troubleshooting, E-3<br />
events<br />
event log, E-1<br />
understanding, E-1<br />
events that cause journal distribution, 2-10<br />
F<br />
Fabric Splitter, 2-4<br />
Fibre Channel diagnostics<br />
detecting Fibre Channel LUNs, C-13<br />
detecting Fibre Channel Scsi3 Reserved<br />
LUNs, C-15<br />
detecting Fibre Channel targets, C-12<br />
performing I/O to LUN, C-15<br />
running SAN diagnostics, C-9<br />
viewing Fibre Channel details, C-11<br />
Fibre Channel HBA LEDs<br />
location, 8-12<br />
files directory, A-11<br />
full-sweep initialization, 4-4<br />
G<br />
geographic clustered environment<br />
basic configuration diagram, 2-2<br />
definition, 2-1<br />
overview, 2-2<br />
recovery from total failure of one site, 4-19<br />
geographic replication environment, 2-1<br />
definition, 2-1<br />
server failure, 9-20<br />
total storage loss, 5-13<br />
GMT<br />
converting local time to, A-3<br />
example of local time conversion, A-3<br />
group initialization effects on move-group<br />
operation, 4-3<br />
H<br />
HIC (See host information collector (HIC)<br />
utility)<br />
high load<br />
disk manager reports, 10-4<br />
general description, 10-3<br />
home directory, A-11, A-14<br />
host information collector (HIC) utility<br />
overview, 2-9<br />
using, A-7<br />
host logs collection<br />
using host information collector (HIC)<br />
utility, A-7<br />
using MPS utility, A-6<br />
I<br />
InfoCollect directory, A-12<br />
initialization<br />
from marking mode, 4-5<br />
full sweep, 4-4<br />
long resynchronization, 4-4<br />
initiate_failover command, 4-6<br />
Installation Manager<br />
diagnostics, 2-9<br />
Diagnostics menu, 8-17, 8-21, C-2<br />
steps to run, C-2<br />
Installation Manager diagnostics<br />
collect system info, C-18<br />
Fibre Channel diagnostics, C-9<br />
IP diagnostics, C-2<br />
synchronization diagnostics, C-17<br />
installing and configuring the replacement<br />
RA, D-4<br />
IP diagnostics<br />
port diagnostics, C-5<br />
site connectivity tests, C-3<br />
system connectivity, C-6, C-7
test throughput, C-4<br />
view IP details, C-3<br />
view routing table, C-4<br />
K<br />
kutils<br />
command summary, H-2<br />
overview, 2-10<br />
path designations, H-1<br />
string, H-1<br />
using, H-1<br />
L<br />
Local Replication by CDP, 2-5<br />
log extraction directory<br />
host, A-15<br />
RA, A-9<br />
log file, A-10<br />
long resynchronization, 4-4<br />
M<br />
management console<br />
locked user, 8-4<br />
RA attached to cluster, 8-4<br />
understanding access, 8-4<br />
manual failover<br />
data consistency group, 3-2, 4-8<br />
performing, 4-7<br />
performing with data consistency group<br />
(older image), 4-8<br />
quorum consistency groups, 4-14, 4-23<br />
manual failover for ClearPath MCP<br />
data consistency group, 3-5<br />
manual failover of volumes and data<br />
consistency groups<br />
accessing an image, 3-2<br />
marking mode, initializing from, 4-5<br />
MIB<br />
OID Unisys, F-1<br />
RA file, F-3<br />
MIB II, F-1<br />
Microsoft Cluster Service, 2-1<br />
modifying the Preferred RA setting, D-3<br />
move group operation, initialization<br />
effects, 4-3<br />
MPS utility, A-6<br />
MSCS (See Microsoft Cluster Service)<br />
MSCS properties, checking, 4-1<br />
N<br />
network bindings<br />
checking, 4-2<br />
cluster specific, 4-3<br />
host network specific, 4-2<br />
network LEDs<br />
location, 8-11<br />
networking problem<br />
cluster node public NIC failure (geographic<br />
clustered environment), 7-3<br />
management network failure (geographic<br />
clustered environment), 7-11<br />
port information, 7-32<br />
private cluster network failure (geographic<br />
clustered environment), 7-22<br />
public or client WAN failure (geographic<br />
clustered environment), 7-6<br />
replication network failure (geographic<br />
clustered environment), 7-15<br />
temporary WAN failures, 7-21<br />
total communication failure (geographic<br />
clustered environment), 7-26<br />
new for this release, 1-2<br />
P<br />
parameters file, A-9<br />
performance problem<br />
failover time lengthens, 10-5<br />
high load<br />
disk manager, 10-4<br />
distributer, 10-5<br />
slow initialization, 10-2<br />
persistent state key, I-14<br />
port information, 7-32<br />
processes directory, A-12<br />
Q<br />
quorum consistency group<br />
manual failover, 4-14, 4-23
R<br />
RA problem<br />
all RAs at one site fail, 8-25<br />
all RAs not attached, 8-27<br />
all SAN Fibre Channel HBAs fail, 8-14<br />
onboard management network adapter<br />
fails, 8-23<br />
onboard WAN network adapter fails, 8-19<br />
optional Gigabit Fibre Channel WAN<br />
network adapter fails, 8-19<br />
reboot regulation failover, 8-12<br />
single hard disk fails, 8-24<br />
single RA failure, 8-4<br />
single RA failures with switchover, 8-5<br />
single RA failures without switchover, 8-21<br />
single SAN Fibre Channel HBA on one RA<br />
fails, 8-21<br />
rear panel indicators, 8-11<br />
recording group properties and saving<br />
settings, D-2<br />
recovery<br />
all RAs fail on site, 4-11<br />
from site failure, 4-19<br />
from total failure of one site in geographic<br />
clustered environment, 4-19<br />
site 1 failure with quorum owner located<br />
on site 2, 4-25<br />
site 1 failure with quorum resource owned<br />
by site 1, 4-19<br />
using older image, 4-7<br />
recovery tasks<br />
data consistency group, 3-2, 4-7<br />
data consistency group for ClearPath<br />
MCP, 3-5<br />
reformatting the repository volume, 5-8<br />
removing Fibre Channel host bus<br />
adapters, D-4<br />
replacing an RA, D-1<br />
replication appliance (RA)<br />
analyzing logs from, A-8<br />
collecting logs from, A-1<br />
connecting, accessing, D-4<br />
diagnostics, B-2<br />
LCD status messages, B-4<br />
replacing, D-1<br />
replication, reversing direction, 4-10, 4-15<br />
repository volume<br />
not accessible, 5-6<br />
reformatting, 5-8<br />
restoring environment settings, D-2<br />
restoring failover settings, 4-24<br />
restoring group properties, D-8<br />
resynchronization, long, 4-4<br />
rreasons directory, A-11<br />
runCLI file, A-14<br />
S<br />
<strong>SafeGuard</strong> 30m Control<br />
behavior during move group, 4-5<br />
SAN connectivity problem<br />
RAs not accessible to splitter, 6-12<br />
total SAN switch failure (geographic<br />
clustered environment), 6-17<br />
volume not accessible to RAs, 6-3<br />
volume not accessible to splitter, 6-7<br />
saving configuration settings, D-2<br />
sbin directory, A-12<br />
server problem<br />
cluster node failure (geographic clustered<br />
environment), 9-2<br />
infrastructure (NTP) server fails, 9-18<br />
server crash or restart, 9-12<br />
server failure (geographic replication<br />
environment), 9-20<br />
server HBA fails, 9-17<br />
server unable to connect with SAN, 9-14<br />
unexpected server shutdown because of a<br />
bug check, 9-8<br />
Windows server reboot, 9-3<br />
SNMP traps<br />
configuring and using, F-1<br />
MIB, F-1<br />
resolving issues, F-4<br />
variables and values, F-2<br />
SSH client, using, C-1<br />
state codes, I-10, I-12<br />
storage problem<br />
journal volume not accessible, 5-11<br />
repository volume not accessible, 5-6<br />
storage failure on one site (geographic<br />
clustered environment), 5-16<br />
total storage loss (geographic replicated<br />
environment), 5-13<br />
user or replication volume not<br />
accessible, 5-4<br />
storage-to-RA access, checking, D-5<br />
summary file, A-11<br />
system event log (SEL), clearing, B-1<br />
system status<br />
using CLI commands, 2-8
using the management console, 2-7<br />
T<br />
tar file, A-15<br />
testing FTP connectivity, A-2<br />
tmp directory, A-14<br />
troubleshooting<br />
general procedures, 2-11<br />
recovering from site failure, 4-19<br />
U<br />
Unisys <strong>SafeGuard</strong> 30m Collector, G-1<br />
Collector mode, G-4<br />
adding customer information, G-5<br />
adding scripts, G-12<br />
automatic discovery of RAs, G-4<br />
compressing an SSC file, G-6<br />
configuring component types using<br />
built-ins scripts, G-8<br />
configuring RAs, G-4<br />
configuring SAN switches, G-9<br />
deleting a scheduled SSC file, G-14<br />
deleting scripts, G-12<br />
disabling scripts, G-10<br />
duplicating installation on another<br />
PC, G-6<br />
enabling scripts, G-10<br />
opening an SSC file, G-8<br />
querying a scheduled SSC file, G-14<br />
running all scripts, G-6<br />
running scripts, G-11<br />
scheduling an SSC file, G-13<br />
stopping script execution, G-11<br />
installing, G-1<br />
prior to configuring, G-2<br />
security breach warning, G-3<br />
View mode, G-15<br />
Unisys <strong>SafeGuard</strong> 30m solution<br />
definition, 2-1<br />
unmounting volumes<br />
at production site, 3-4<br />
at remote site, 3-3<br />
unmounting volumes at source site, 3-4<br />
user types, preconfigured for RAs, 2-8<br />
using the SSH client, C-1<br />
using this guide, 1-3<br />
usr directory, A-13<br />
UTC<br />
converting local time to, A-3<br />
example of local time conversion, A-3<br />
V<br />
verify_failover command, 4-6<br />
verifying clock synchronization, D-8<br />
verifying the replacement RA installation, D-7<br />
volumes<br />
unmounting at source site, 3-4<br />
W<br />
WAN bandwidth, verifying, D-7<br />
webdownload/webdownload, 2-8, C-20
© 2008 Unisys Corporation.<br />
All rights reserved.<br />