
Unisys SafeGuard Solutions

Troubleshooting Guide

Unisys SafeGuard Solutions Release 6.0

June 2008

6872 5688–002


NO WARRANTIES OF ANY NATURE ARE EXTENDED BY THIS DOCUMENT. Any product or related information described herein is only furnished pursuant and subject to the terms and conditions of a duly executed agreement to purchase or lease equipment or to license software. The only warranties made by Unisys, if any, with respect to the products described in this document are set forth in such agreement. Unisys cannot accept any financial or other responsibility that may be the result of your use of the information in this document or software material, including direct, special, or consequential damages.

You should be very careful to ensure that the use of this information and/or software material complies with the laws, rules, and regulations of the jurisdictions with respect to which it is used.

The information contained herein is subject to change without notice. Revisions may be issued to advise of such changes and/or additions.

Notice to U.S. Government End Users: This is commercial computer software or hardware documentation developed at private expense. Use, reproduction, or disclosure by the Government is subject to the terms of Unisys standard commercial license for the products, and where applicable, the restricted/limited rights provisions of the contract data rights clauses.

Unisys is a registered trademark of Unisys Corporation in the United States and other countries.

All other brands and products referenced in this document are acknowledged to be the trademarks or registered trademarks of their respective holders.




Contents

Section 1. About This Guide

    Purpose and Audience .......................................................................... 1–1
    Related Product Information ................................................................. 1–1
    Documentation Updates ....................................................................... 1–1
    What’s New in This Release ................................................................. 1–2
    Using This Guide ................................................................................... 1–3

Section 2. Overview

    Geographic Replication Environment .................................................... 2–1
    Geographic Clustered Environment ...................................................... 2–2
    Data Flow .............................................................................................. 2–3
    Diagnostic Tools and Capabilities .......................................................... 2–7
        Event Log ............................................................................. 2–7
        System Status ...................................................................... 2–7
        E-mail Notifications .............................................................. 2–8
        Installation Diagnostics ........................................................ 2–9
        Host Information Collector (HIC) ......................................... 2–9
        Cluster Logs ......................................................................... 2–9
        Unisys SafeGuard 30m Collector ......................................... 2–9
        RA Diagnostics .................................................................... 2–9
        Hardware Indicators ............................................................ 2–9
        SNMP Support ................................................................... 2–10
        kutils Utility ........................................................................ 2–10
    Discovering Problems ......................................................................... 2–10
        Events That Cause Journal Distribution ............................ 2–10
    Troubleshooting Procedures ............................................................... 2–11
        Identifying the Main Components and Connectivity of the Configuration ....................................... 2–11
        Understanding the Current State of the System ............... 2–12
        Verifying the System Connectivity .................................... 2–12
        Analyzing the Configuration Settings ................................ 2–13

Section 3. Recovering in a Geographic Replication Environment

    Manual Failover of Volumes and Data Consistency Groups ................. 3–2
        Accessing an Image ............................................................ 3–2
        Testing the Selected Image at Remote Site ....................... 3–3
    Manual Failover of Volumes and Data Consistency Groups for ClearPath MCP Hosts ....................................................................... 3–5
        Accessing an Image ............................................................. 3–5
        Testing the Selected Image at Remote Site ........................ 3–5

Section 4. Recovering in a Geographic Clustered Environment

    Checking the Cluster Setup ................................................................... 4–1
        MSCS Properties .................................................................. 4–1
        Network Bindings ................................................................. 4–2
    Group Initialization Effects on a Cluster Move-Group Operation ........... 4–3
        Full-Sweep Initialization ........................................................ 4–4
        Long Resynchronization ....................................................... 4–4
        Initialization from Marking Mode .......................................... 4–5
    Behavior of SafeGuard 30m Control During a Move-Group Operation ... 4–5
    Recovering by Manually Moving an Auto-Data (Shared Quorum) Consistency Group ............................................................................. 4–7
        Taking a Cluster Data Group Offline ..................................... 4–7
        Performing a Manual Failover of an Auto-Data (Shared Quorum) Consistency Group to a Selected Image ............... 4–8
        Bringing a Cluster Data Group Online and Checking the Validity of the Image ........................................................... 4–9
        Reversing the Replication Direction of the Consistency Group ................................................................................ 4–10
    Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner) .......... 4–11
    Recovery When All RAs Fail on Site 1 (Site 2 Quorum Owner) .......... 4–17
    Recovery When All RAs and All Servers Fail on One Site ................... 4–19
        Site 1 Failure (Site 1 Quorum Owner) ................................ 4–19
        Site 1 Failure (Site 2 Quorum Owner) ................................ 4–25

Section 5. Solving Storage Problems

    User or Replication Volume Not Accessible .......................................... 5–4
    Repository Volume Not Accessible ....................................................... 5–6
        Reformatting the Repository Volume ................................... 5–8
    Journal Not Accessible ........................................................................ 5–11
    Journal Volume Lost Scenarios ........................................................... 5–13
    Total Storage Loss in a Geographic Replicated Environment ............. 5–13
    Storage Failure on One Site in a Geographic Clustered Environment .................................................................................... 5–16
        Storage Failure on One Site with Quorum Owner on Failed Site ......................................................................... 5–17
        Storage Failure on One Site with Quorum Owner on Surviving Site .................................................................... 5–20

Section 6. Solving SAN Connectivity Problems

    Volume Not Accessible to RAs .............................................................. 6–3
    Volume Not Accessible to SafeGuard 30m Splitter ............................... 6–7
    RAs Not Accessible to SafeGuard 30m Splitter .................................. 6–12
    Total SAN Switch Failure on One Site in a Geographic Clustered Environment ................................................................... 6–17
        Cluster Quorum Owner Located on Site with Failed SAN Switch ........................................................................ 6–18
        Cluster Quorum Owner Not on Site with Failed SAN Switch ................................................................................ 6–22

Section 7. Solving Network Problems

    Public NIC Failure on a Cluster Node in a Geographic Clustered Environment ..................................................................... 7–3
    Public or Client WAN Failure in a Geographic Clustered Environment ..................................................................................... 7–6
    Management Network Failure in a Geographic Clustered Environment ................................................................................... 7–11
    Replication Network Failure in a Geographic Clustered Environment ................................................................................... 7–15
    Temporary WAN Failures .................................................................... 7–21
    Private Cluster Network Failure in a Geographic Clustered Environment ................................................................................... 7–22
    Total Communication Failure in a Geographic Clustered Environment ................................................................................... 7–26
    Port Information .................................................................................. 7–32

Section 8. Solving Replication Appliance (RA) Problems

    Single RA Failures ................................................................................. 8–4
        Single RA Failure with Switchover ...................................... 8–5
        Reboot Regulation ............................................................. 8–12
        Failure of All SAN Fibre Channel Host Bus Adapters (HBAs) ............................................................................ 8–14
        Failure of Onboard WAN Adapter or Failure of Optional Gigabit Fibre Channel WAN Adapter .............................. 8–19
    Single RA Failures Without a Switchover ........................................... 8–21
        Port Failure on a Single SAN Fibre Channel HBA on One RA .......................................................................... 8–21
        Onboard Management Network Adapter Failure .............. 8–23
        Single Hard Disk Failure ..................................................... 8–24
    Failure of All RAs at One Site .............................................................. 8–25
    All RAs Are Not Attached .................................................................... 8–27

Section 9. Solving Server Problems

    Cluster Node Failure (Hardware or Software) in a Geographic Clustered Environment ..................................................................... 9–2
        Possible Subset Scenarios .................................................. 9–3
        Windows Server Reboot ..................................................... 9–3
        Unexpected Server Shutdown Because of a Bug Check ..... 9–8
        Server Crash or Restart ...................................................... 9–12
        Server Unable to Connect with SAN .................................. 9–14
        Server HBA Failure ............................................................. 9–17
    Infrastructure (NTP) Server Failure ...................................................... 9–18
    Server Failure (Hardware or Software) in a Geographic Replication Environment ................................................................. 9–20

Section 10. Solving Performance Problems

    Slow Initialization ................................................................................. 10–2
    General Description of High-Load Event ............................................. 10–3
    High-Load (Disk Manager) Condition ................................................... 10–4
    High-Load (Distributor) Condition ........................................................ 10–5
    Failover Time Lengthens ..................................................................... 10–5

Appendix A. Collecting and Using Logs

    Collecting RA Logs ............................................................................... A–1
        Setting the Automatic Host Info Collection Option ............. A–2
        Testing FTP Connectivity .................................................... A–2
        Determining When the Failure Occurred ............................ A–2
        Converting Local Time to GMT or UTC ............................... A–3
        Collecting RA Logs .............................................................. A–3
    Collecting Server (Host) Logs ............................................................... A–6
        Using the MPS Report Utility .............................................. A–6
        Using the Host Information Collector (HIC) Utility .............. A–7
    Analyzing RA Log Collection Files ........................................................ A–8
        RA Log Extraction Directory ................................................ A–9
        tmp Directory .................................................................... A–14
        Host Log Extraction Directory ........................................... A–15
    Analyzing Server (Host) Logs .............................................................. A–16
    Analyzing Intelligent Fabric Switch Logs ............................................ A–16

Appendix B. Running Replication Appliance (RA) Diagnostics

    Clearing the System Event Log (SEL) ................................................... B–1
    Running Hardware Diagnostics ............................................................ B–2
        Custom Test ........................................................................ B–3
        Express Test ........................................................................ B–4
    LCD Status Messages .......................................................................... B–4

Appendix C. Running Installation Manager Diagnostics

    Using the SSH Client ............................................................................ C–1
    Running Diagnostics ............................................................................. C–1
        IP Diagnostics ...................................................................... C–2
        Fibre Channel Diagnostics ................................................... C–9
        Synchronization Diagnostics ............................................. C–17
        Collect System Info ........................................................... C–18

Appendix D. Replacing a Replication Appliance (RA)

    Saving the Configuration Settings ........................................................ D–2
    Recording Policy Properties and Saving Settings ................................. D–2
    Modifying the Preferred RA Setting ..................................................... D–3
    Removing Fibre Channel Adapter Cards ............................................... D–4
    Installing and Configuring the Replacement RA ................................... D–4
        Cable and Apply Power to the New RA .............................. D–4
        Connecting and Accessing the RA ...................................... D–4
        Checking Storage-to-RA Access .......................................... D–5
        Enabling PCI-X Slot Functionality ......................................... D–5
        Configuring the RA .............................................................. D–6
    Verifying the RA Installation .................................................................. D–7
    Restoring Group Properties .................................................................. D–8
    Ensuring the Existing RA Can Switch Over to the New RA ................. D–8

Appendix E. Understanding Events

    Event Log .............................................................................................. E–1
        Event Topics ........................................................................ E–1
        Event Levels ........................................................................ E–2
        Event Scope ......................................................................... E–2
        Displaying the Event Log ..................................................... E–3
        Using the Event Log for Troubleshooting ............................ E–3
    List of Events ........................................................................................ E–4
        List of Normal Events .......................................................... E–5
        List of Detailed Events ...................................................... E–22

Appendix F. Configuring and Using SNMP Traps

    Software Monitoring ............................................................................. F–1
    SNMP Monitoring and Trap Configuration ............................................ F–3
    Installing MIB Files on an SNMP Browser ............................................ F–3
    Resolving SNMP Issues ........................................................................ F–4

Appendix G. Using the Unisys SafeGuard 30m Collector

    Installing the SafeGuard 30m Collector ................................................ G–1
    Before You Begin the Configuration ..................................................... G–2
        Handling the Security Breach Warning ................................ G–3
    Using Collector Mode ........................................................................... G–4
        Getting Started .................................................................... G–4
        Understanding Operations in Collector Mode ..................... G–7
    Using View Mode ............................................................................... G–15

Appendix H. Using kutils

    Usage .................................................................................................... H–1
    Path Designations ................................................................................. H–1
    Command Summary ............................................................................. H–2

Appendix I. Analyzing Cluster Logs

    Introduction to Cluster Logs ................................................................... I–1
        Creating the Cluster Log ....................................................... I–2
        Understanding the Cluster Log Layout ................................. I–3
    Sample Cluster Log ................................................................................ I–5
        Posting Information to the Cluster Log ................................. I–5
    Diagnosing a Problem Using Cluster Logs ............................................. I–6
        Gathering Materials ............................................................... I–7
        Opening the Cluster Log ....................................................... I–7
        Converting GMT/UCT to Local Time ..................................... I–8
        Converting Cluster Log GUIDs to Text Resource Names ..... I–8
        Understanding State Codes ................................................ I–10
        Understanding Persistent State .......................................... I–14
        Understanding Error and Status Codes ............................... I–15

Index ............................................................................................. 1


Figures

    2–1. Basic Geographic Clustered Environment ......................................................... 2–2
    2–2. Data Flow ........................................................................................................... 2–3
    2–3. Data Flow with Fabric Splitter ............................................................................ 2–5
    2–4. Data Flow in CDP ............................................................................................... 2–6
    4–1. All RAs Fail on Site 1 (Site 1 Quorum Owner) ................................................. 4–11
    4–2. All RAs Fail on Site 1 (Site 2 Quorum Owner) ................................................. 4–17
    4–3. All RAs and Servers Fail on Site 1 (Site 1 Quorum Owner) ............................. 4–20
    4–4. All RAs and Servers Fail on Site 1 (Site 2 Quorum Owner) ............................. 4–25
    5–1. Volumes Tab Showing Volume Connection Errors ............................................ 5–4
    5–2. Management Console Messages for the User Volume Not Accessible Problem ................................................................................................................ 5–5
    5–3. Groups Tab Shows “Paused by System” .......................................................... 5–5
    5–4. Management Console Display: Storage Error and RAs Tab Shows Volume Errors ...................................................................................................... 5–7
    5–5. Volumes Tab Shows Error for Repository Volume ............................................ 5–7
    5–6. Groups Tab Shows All Groups Paused by System ............................................ 5–7
    5–7. Management Console Messages for the Repository Volume Not Accessible Problem ............................................................................................. 5–8
    5–8. Volumes Tab Shows Journal Volume Error ..................................................... 5–11
    5–9. RAs Tab Shows Connection Errors .................................................................. 5–11
    5–10. Groups Tab Shows Group Paused by System ................................................. 5–12
    5–11. Management Console Messages for the Journal Not Accessible Problem ............................................................................................................. 5–12
    5–12. Management Console Volumes Tab Shows Errors for All Volumes ............... 5–14
    5–13. RAs Tab Shows Volumes That Are Not Accessible ......................................... 5–14
    5–14. Multipathing Software Reports Failed Paths to Storage Device ..................... 5–15
    5–15. Storage on Site 1 Fails ..................................................................................... 5–16
    5–16. Cluster “Regroup” Process ............................................................................. 5–17
    5–17. Cluster Administrator Displays ......................................................................... 5–19
    5–18. Multipathing Software Shows Server Errors for Failed Storage Subsystem ......................................................................................................... 5–19
    6–1. Management Console Showing “Inaccessible Volume” Errors ........................ 6–3
    6–2. Management Console Messages for Inaccessible Volumes ............................. 6–3
    6–3. Management Console Error Display Screen ...................................................... 6–7
    6–4. Management Console Messages for Volumes Inaccessible to Splitter ............ 6–8
    6–5. EMC PowerPath Shows Disk Error .................................................................. 6–10
    6–6. Management Console Display Shows a Splitter Down ................................... 6–12
    6–7. Management Console Messages for Splitter Inaccessible to RA ................... 6–13
    6–8. SAN Switch Failure on One Site ...................................................................... 6–17
    6–9. Management Console Display with Errors for Failed SAN Switch .................. 6–18
    6–10. Management Console Messages for Failed SAN Switch ................................ 6–19
    6–11. Management Console Messages for Failed SAN Switch with Quorum Owner on Surviving Site ................................................................................... 6–23
    7–1. Public NIC Failure of a Cluster Node .................................................................. 7–3
    7–2. Public NIC Error Shown in the Cluster Administrator ......................................... 7–5
    7–3. Public or Client WAN Failure .............................................................................. 7–7
    7–4. Cluster Administrator Showing Public LAN Network Error ................................ 7–8
    7–5. Management Network Failure .......................................................................... 7–11
    7–6. Management Console Display: “Not Connected” ........................................... 7–13
    7–7. Management Console Message for Event 3023 .............................................. 7–13
    7–8. Replication Network Failure .............................................................................. 7–15
    7–9. Management Console Display: WAN Down .................................................... 7–17
    7–10. Management Console Log Messages: WAN Down ........................................ 7–17
    7–11. Management Console RAs Tab: All RAs Data Link Down ............................... 7–18
    7–12. Private Cluster Network Failure ........................................................................ 7–22
    7–13. Cluster Administrator Display with Failures ...................................................... 7–23
    7–14. Total Communication Failure ............................................................................ 7–26
    7–15. Management Console Display Showing WAN Error ........................................ 7–27
    7–16. RAs Tab for Total Communication Failure ........................................................ 7–28
    7–17. Management Console Messages for Total Communication Failure ................ 7–28
    7–18. Cluster Administrator Showing Private Network Down ................................... 7–31
    7–19. Cluster Administrator Showing Public Network Down .................................... 7–31
    8–1. Single RA Failure ................................................................................................. 8–5
    8–2. Sample BIOS Display .......................................................................................... 8–6
    8–3. Management Console Display Showing RA Error and RAs Tab ......................... 8–7
    8–4. Management Console Messages for Single RA Failure with Switchover .......... 8–8
    8–5. LCD Display on Front Panel of RA .................................................................... 8–10
    8–6. Rear Panel of RA Showing Indicators ............................................................... 8–11
    8–7. Location of Network LEDs ................................................................................ 8–11
    8–8. Location of SAN Fibre Channel HBA LEDs ....................................................... 8–12
    8–9. Management Console Display: Host Connection with RA Is Down ................ 8–15
    8–10. Management Console Messages for Failed RA (All SAN HBAs Fail) ............... 8–16
    8–11. Management Console Showing WAN Data Link Failure .................................. 8–20
    8–12. Location of Hard Drive LEDs ............................................................................ 8–25
    8–13. Management Console Showing All RAs Down ................................................ 8–26
    9–1. Cluster Node Failure ........................................................................................... 9–2
    9–2. Management Console Display with Server Error ............................................... 9–4
    9–3. Management Console Messages for Server Down ........................................... 9–5
    9–4. Management Console Messages for Server Down for Bug Check ................... 9–9
    9–5. Management Console Display Showing LA Site Server Down ........................ 9–14
    9–6. Management Console Images Showing Messages for Server Unable to Connect to SAN .............................................................................................. 9–15
    9–7. PowerPath Administrator Console Showing Failures ....................................... 9–16
    9–8. PowerPath Administrator Console Showing Adapter Failure ........................... 9–17
    9–9. Event 1009 Display ........................................................................................... 9–19
    I–1. Layout of the Cluster Log .................................................................................... I–3
    I–2. Expanded Cluster Hive (in Windows 2000 Server) ............................................ I–10


Tables

    2–1. User Types ......................................................................................................... 2–8
    2–2. Events That Cause Journal Distribution ........................................................... 2–11
    5–1. Possible Storage Problems with Symptoms ..................................................... 5–1
    5–2. Indicators and Management Console Errors to Distinguish Different Storage Volume Failures .................................................................................... 5–3
    6–1. Possible SAN Connectivity Problems ................................................................ 6–1
    7–1. Possible Networking Problems with Symptoms ............................................... 7–1
    7–2. Ports for Internet Communication ................................................................... 7–33
    7–3. Ports for Management LAN Communication and Notification ........................ 7–33
    7–4. Ports for RA-to-RA Internal Communication .................................................... 7–34
    8–1. Possible Problems for Single RA Failure with a Switchover .............................. 8–2
    8–2. Possible Problems for Single RA Failure Without a Switchover ........................ 8–3
    8–3. Possible Problems for Multiple RA Failures with Symptoms ............................ 8–3
    8–4. Management Console Messages Pertaining to Reboots ................................ 8–13
    9–1. Possible Server Problems with Symptoms ....................................................... 9–1
    10–1. Possible Performance Problems with Symptoms ........................................... 10–1
    B–1. LCD Status Messages ....................................................................................... B–5
    C–1. Messages from the Connectivity Testing Tool .................................................. C–8
    E–1. Normal Events .................................................................................................... E–5
    E–2. Detailed Events ................................................................................................ E–23
    F–1. Trap Variables and Values .................................................................................. F–2
    I–1. System Environment Variables Related to Clustering ........................................ I–2
    I–2. Modules of MSCS ............................................................................................... I–4
    I–3. Node State Codes ............................................................................................. I–12
    I–4. Group State Codes ............................................................................................ I–12
    I–5. Resource State Codes ...................................................................................... I–12
    I–6. Network Interface State Codes ........................................................................ I–13
    I–7. Network State Codes ........................................................................................ I–13


Section 1
About This Guide

Purpose and Audience

This document presents procedures for problem analysis and troubleshooting of the Unisys SafeGuard 30m solution. It is intended for Unisys service representatives and other technical personnel who are responsible for maintaining the Unisys SafeGuard 30m solution installation.

Related Product Information

The methods described in this document are based on support and diagnostic tools that are provided as standard components of the Unisys SafeGuard 30m solution. You can find additional information about these tools in the following documents:

• Unisys SafeGuard Solutions Planning and Installation Guide

• Unisys SafeGuard Solutions Replication Appliance Administrator’s Guide

• Unisys SafeGuard Solutions Introduction to Replication Appliance Command Line Interface (CLI)

• Unisys SafeGuard Solutions Replication Appliance Installation Guide

Note: Review the information in the Unisys SafeGuard Solutions Planning and Installation Guide about making configuration changes before you begin troubleshooting a problem.

Documentation Updates

This document contains all the information that was available at the time of publication. Changes identified after the release of this document are included in problem list entry (PLE) 18609274. To obtain a copy of the PLE, contact your Unisys service representative or access the current PLE from the Unisys Product Support Web site:

http://www.support.unisys.com/all/ple/18609274

Note: If you are not logged in to the Product Support site, you will be asked to do so.

What’s New in This Release

Some of the important changes in the 6.0 release are summarized in the following list.

• Unisys SafeGuard Continuous Data Protection (CDP): A Unisys SafeGuard Duplex solution that uses one Replication Appliance (RA) cluster to replicate data across the Storage Area Network (SAN).

• Support for Concurrent Local and Remote (CLR): Concurrent Local (CDP) and Concurrent Remote Replication (CRR) of the same production volumes.

• Support for the CLARiiON splitter: Unisys SafeGuard solutions work with the CLARiiON CX3 Series CLARiiON Splitter service to deliver a fully heterogeneous array-based data replication solution without the need for host-based agents.

• Support for Brocade intelligent fabric splitting (multi-VI mode only), using the Brocade 7500 SAN Router: To support a heterogeneous environment at the switch level, the SafeGuard solution supports intelligent fabric splitting with a Brocade switch.

• Support for configurations using a mix of splitters within the same RA cluster and across RA clusters at different sites: SafeGuard solutions can support mixed splitters in a given solution configuration.

• Redesign of the Management Console GUI for greater ease of use: The new RA GUI is easier to navigate and clearer to use.

• SNMP trap viewer, log collection and analysis, and auto-discovery of SafeGuard components in the SafeGuard Command Center: The Command Center now provides log collection and automatic discovery of devices.


Using This Guide

This guide offers general information in the first four sections. Read Section 2 to understand the overall approach to troubleshooting and to gain an understanding of the Unisys SafeGuard 30m solution architecture.

Section 3 describes recovery in a geographic replication environment, and Section 4 offers information and recovery procedures for geographic clustered environments.

Sections 5 through 10 group potential problems into categories and describe the problems. You must recognize symptoms, identify the problem or failed component, and then decide what to do to correct the problem. Sections 5 through 10 include a table at the beginning of each section that lists symptoms and potential problems.

Each problem is then presented in the following format:

• Problem Description: Description of the problem

• Symptoms: List of symptoms that are typical for this problem

• Actions to Resolve the Problem: Steps recommended to solve the problem

The appendixes provide information about using tools and offer reference information that you might find useful in different situations.


Section 2
Overview

The Unisys SafeGuard Solutions are flexible, integrated business continuance solutions especially suitable for protecting business-critical application environments. The Unisys SafeGuard 30m solution provides two distinct functions that act in concert: replication of data and automated application recovery through clustering over great distances.

Typically, the Unisys SafeGuard 30m solution is implemented in one of these environments:

• Geographic replication environment: In this replication environment, data from servers at one site is replicated to a remote site.

• Geographic clustered environment: In this replication environment, Microsoft Cluster Service (MSCS) is installed on servers that span sites and that participate in one cluster. The use of a Unisys SafeGuard 30m Control resource allows automated failover and recovery by controlling the replication direction with an MSCS resource. The resource is used in this environment only.

Geographic Replication Environment

Unisys SafeGuard Solutions supports replication of data over Fibre Channel to local SAN-attached storage and over a WAN to remote sites. It also allows failover to a secondary site and continued operations in the event of a disaster at the primary site.

Unisys SafeGuard Solutions replicates data over any distance:

• within the same site (CDP),

• to another site halfway around the globe (CRR), or

• both (CLR).


Geographic Clustered Environment

In the geographic clustered environment, MSCS and cluster nodes are part of the environment. Figure 2–1 illustrates a basic geographic clustered environment that consists of two sites. In addition to server clusters, the typical configuration is made up of an RA cluster (RA 1 and RA 2) at each of the two sites. However, multiple RA cluster configurations are also possible.

Note: The dashed lines in Figure 2–1 represent the server WAN connections. To simplify the view, redundant and physical connections are not shown.

Figure 2–1. Basic Geographic Clustered Environment


Data Flow

Figure 2–2 shows the data flow in the basic system configuration for data written by the server. The system replicates the data in snapshot replication mode to a remote site. The data flow is divided into the following segments: write, transfer, and distribute.

Figure 2–2. Data Flow

Write

The flow of data for a write transaction is as follows:

1. The host writes data to the splitter (either on the host or the fabric) that immediately sends it to the RA and to the production site replication volume (storage system).

2. After receiving the data, the RA returns an acknowledgement (ACK) to the splitter. The storage system returns an ACK after successfully writing the data to storage.

3. The splitter sends an ACK to the host that the write operation has been completed successfully.

In snapshot replication mode, this sequence of events (steps 1 to 3) can be repeated multiple times before the snapshot is closed.
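The acknowledgement ordering in the write segment is the part that matters most when you are reasoning about where a write can stall. The following Python sketch is only an illustrative model of the sequence described above; the names RA, Storage, and host_write are hypothetical stand-ins, not product interfaces.

```python
# Illustrative model of the write segment only (not product code). It shows
# that the splitter acknowledges the host write only after both the RA and
# the production replication volume have acknowledged their copies.

class RA:
    """Hypothetical stand-in for a Replication Appliance."""
    def receive(self, data):
        return True          # ACK once the data has been received

class Storage:
    """Hypothetical stand-in for the production replication volume."""
    def write(self, data):
        return True          # ACK once the data has been written to storage

def host_write(data, ra, storage):
    """Model one host write passing through the splitter."""
    ra_ack = ra.receive(data)          # copy sent to the RA
    storage_ack = storage.write(data)  # copy sent to production storage
    # Step 3: the host sees the write complete only when both ACKs are in.
    return ra_ack and storage_ack

if __name__ == "__main__":
    ra, storage = RA(), Storage()
    # In snapshot replication mode this can repeat many times before the
    # snapshot is closed and handed to the transfer segment.
    for block in (b"write-1", b"write-2", b"write-3"):
        assert host_write(block, ra, storage)
```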

Transfer

The flow of data for transfer is as follows:

1. After processing the snapshot data (that is, applying the various compression techniques), the RA sends the snapshot over the WAN to its peer RA at the remote site.

2. The RA at the remote site writes the snapshot to the journal. At the same time, the remote RA returns an ACK to its peer at the production site.

Note: Alternatively, you can set an advanced policy parameter so that lag is measured to the journal. In that case, the RA at the target site returns an ACK to its peer at the source site only after it receives an ACK from the journal (step 3).

3. After the complete snapshot is written to the journal, the journal returns an ACK to the RA.

Distribute

When possible, and unless instructed otherwise, the Unisys SafeGuard 30m solution proceeds at first opportunity to “distribute” the image to the appropriate location on the storage system at the remote site. The logical flow of data for distribution is as follows:

1. The remote RA reads the image from the journal.

2. The RA reads existing information from the relevant remote replication volume.

3. The RA writes “undo” information (that is, information that can support a rollback, if necessary) to the journal.

Note: Steps 2 and 3 are skipped when the maximum journal lag policy parameter causes distribution to operate in fast-forward mode. (See the Unisys SafeGuard Solutions Replication Appliance Administrator’s Guide for more information.)

4. The RA writes the image to the appropriate remote replication volume.
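When you later need to explain why an expected rollback point is missing, the interaction between undo information and fast-forward mode is the key. The following Python sketch is a hypothetical model of the distribute steps above, not product code; the journal and volume structures are invented for illustration.

```python
# Illustrative model of the distribute segment (not product code; the journal
# and volume structures are invented). It shows why rollback is possible in
# normal mode (undo data is journaled before the volume is overwritten) and
# why images distributed in fast-forward mode cannot be rolled back.

def distribute(journal, volume, image, fast_forward=False):
    """Apply one journaled image to the remote replication volume."""
    if not fast_forward:
        # Steps 2 and 3: capture the blocks about to be overwritten and save
        # them in the journal as undo information (this is what supports a
        # rollback to the point before this image).
        undo = {block: volume.get(block) for block in image}
        journal.append(("undo", undo))
    # In fast-forward mode (maximum journal lag exceeded) the undo capture is
    # skipped, so snapshots older than this image are no longer recoverable.

    # Step 4: write the image to the remote replication volume.
    volume.update(image)

if __name__ == "__main__":
    journal = []                                   # journaled undo records
    volume = {"blk0": "old-A", "blk1": "old-B"}    # remote replication volume
    distribute(journal, volume, {"blk0": "new-A"})               # normal mode
    distribute(journal, volume, {"blk1": "new-B"}, fast_forward=True)
    print(volume)   # both blocks now hold the new data
    print(journal)  # undo information exists only for the first image
```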

Alternatives to the Basic System Architecture

The following are derivatives of the basic system architecture:

Fabric Splitter

An intelligent fabric switch can perform the splitting function instead of a Unisys SafeGuard Solutions host-based splitter installed on the host. In this case, the host sends a single write transaction to the switch on its way to storage. At the switch, however, the message is split, and a copy is also sent to the RA (as shown in Figure 2–3). The system behaves the same way as it does when a Unisys SafeGuard Solutions host-based splitter on the host performs the splitting function.


Figure 2–3. Data Flow with Fabric Splitter<br />

Local Replication by CDP<br />

Overview<br />

You can use CDP to perform replication over short distances—that is, to replicate<br />

storage at the same site as CRR does over long distances. Operation of the system is<br />

similar to CRR including the ability to use the journal to recover from a corrupted data<br />

image, and the ability, if necessary, to fail over to the remote side or storage pool. In<br />

Figure 2–4, there is no WAN, the storage pools are part of the storage at the same site,<br />

and the same RA appears in each of the segments.<br />

6872 5688–002 2–5


Figure 2–4. Data Flow in CDP

Note: The repository volume must belong to the remote-side storage pool. Unisys SafeGuard Solutions support a simultaneous mix of groups for remote and local replication. Individual volumes and groups, however, must be designated for either remote or local replication, but not for both. Certain policy parameters do not apply to local replication by CDP.

Single RA

Note: Unisys SafeGuard Solutions does not support a single-RA configuration (at both sites or at a single site).


Diagnostic Tools and Capabilities

The Unisys SafeGuard 30m solution offers the following tools and capabilities to help you diagnose and solve problems.

Event Log

The replication capability of the Unisys SafeGuard 30m solution records log entries in response to a wide range of predefined events. The event log records all significant events that have recently occurred in the system. Appendix E lists and explains the events.

Each event is classified by an event ID. The event ID can be used to help analyze or diagnose system behavior, including identifying the trigger for a rolling problem, understanding a sequence of events, and examining whether the system performed the correct set of actions in response to a component failure.

You can monitor system behavior by viewing the event log through the management console, by issuing CLI commands, or by reading RA logs. The exact period of time covered by the log varies according to the operational state of the environment during that period or, in the case of RA logs, the time period that was specified. The capacity of the event log is 5000 events.

For problems that are not readily apparent and for situations that you are monitoring for failure, you can configure an e-mail notification to send all logs to you in a daily summary. Once you resolve the problem, you can remove the event notifications. See “Configuring a Diagnostic E-mail Notification” in this section to configure a daily summary of events.

System Status

The management console displays an immediate indication of any problem that interferes with normal operation of the Unisys SafeGuard 30m environment. If a component fails, the indication is accompanied by an error message that provides detailed information about the failure.

You must log in to the management console to monitor the environment and to view events. The RAs are preconfigured with the users defined in Table 2–1.

Table 2–1. User Types

    User            Initial Password    Permissions
    boxmgmt         boxmgmt             Install
    admin           admin               All except install and webdownload
    monitor         monitor             Read only
    webdownload     webdownload         webdownload
    SE              Unisys(CSC)         All except install and webdownload

Note: The password boxmgmt is not used to log in to the management console; it is used only for SSH sessions.

The CLI provides all users with status commands for the complete set of Unisys SafeGuard 30m components. You can use the information and statistics provided by these commands to identify bottlenecks in the system.

E-mail Notifications

The e-mail notification mechanism sends specified event notifications (or alerts) to designated individuals. You can also set up an e-mail notification, sent once a day, that contains a daily summary of events.

Configuring a Diagnostic E-mail Notification

1. From the management console, click Alert Settings on the System menu.

2. Under Rules, click Add.

3. Using the diagnostic rule that follows, select the appropriate topic, level, scope, and type options.

   Diagnostic Rule

   This rule sends all messages on a daily basis to personnel of your choice.

   Topics: All Topics
   Level: Information
   Scope: Detailed
   Type: Daily

4. Under Addresses, click Add.

5. In the New Address box, type the e-mail address to which you would like event notifications sent. You can specify more than one e-mail address.

6. Click OK.

7. Repeat steps 4 through 6 for each additional e-mail recipient.

8. Click OK.

9. Click OK.

Installation Diagnostics

The Diagnostics menu of the Installation Manager provides a suite of diagnostic tools for testing the functionality and connectivity of the installed RAs and Unisys SafeGuard 30m components. Appendix C explains how to use the Installation Manager diagnostics.

The Installation Manager is also used to collect RA logs and host splitter logs from one centralized location. See Appendix A for more information about collecting logs.

Host Information Collector (HIC)

The HIC collects extensive information about the environment, operation, and performance of any server on which a splitter has been installed. You can use the Installation Manager to collect logs across the entire environment, including RAs and all servers on which the HIC feature is enabled. The HIC can also be used at the server. See Appendix A for more information about collecting logs.

Cluster Logs

In a geographic clustered environment, MSCS maintains logs of events for the clustered environment. Analyzing these logs is helpful in diagnosing certain problems. Appendix I explains how to analyze these logs.

Unisys SafeGuard 30m Collector

The Unisys SafeGuard 30m Collector utility enables you to easily collect various pieces of information about the environment that can help in solving problems. Appendix G describes this utility.

RA Diagnostics

Diagnostics specific to the RAs are available to aid in identifying problems. Appendix B explains how to use the RA diagnostics.

Hardware Indicators

Hardware problems—for example, RA disk failures or RA power problems—are identified by status LEDs located on the RAs themselves. Several indicators are explained in Section 8, “Solving Replication Appliance (RA) Problems.”


SNMP Support

The RAs support monitoring and problem notification using standard SNMP, including support for SNMPv3. You can issue SNMP queries to the agent on the RA. Also, you can configure the environment so that events generate SNMP traps that are then sent to designated hosts. Appendix F explains how to configure and use SNMP traps.
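If traps never seem to reach the designated host, a useful first step is simply to confirm that datagrams arrive on the trap port at all. The following Python sketch (standard library only) does just that; the port number is the conventional SNMP trap port and the bind address is an assumption, and it does not decode SNMP, so it is a connectivity check rather than a trap viewer.

```python
# Minimal connectivity check for SNMP trap delivery (not a trap viewer).
# Run it on the designated trap destination host; it reports the source and
# size of any UDP datagram that arrives on the trap port. Binding to port 162
# usually requires administrator privileges, and the values below are
# assumptions to adjust for your environment.

import socket

TRAP_PORT = 162                 # conventional SNMP trap port
BIND_ADDR = "0.0.0.0"           # listen on all interfaces

def listen_for_traps(bind_addr=BIND_ADDR, port=TRAP_PORT):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    print(f"Waiting for trap datagrams on {bind_addr}:{port} ...")
    while True:
        data, (src_ip, src_port) = sock.recvfrom(65535)
        # An SNMP browser would decode the PDU; for troubleshooting delivery
        # it is enough to see that something arrived and where it came from.
        print(f"{len(data)} bytes received from {src_ip}:{src_port}")

if __name__ == "__main__":
    listen_for_traps()
```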

kutils Utility

The kutils utility is a proprietary server-based program that enables you to manage server splitters across all platforms. The command-line utility is installed automatically when the Unisys SafeGuard 30m splitter is installed on the application server. If the splitting function is not on a host but rather on an intelligent switch, the kutils utility is copied from the Splitter CD-ROM. (See the Unisys SafeGuard Solutions Planning and Installation Guide for more information.)

Appendix H explains some kutils commands that are helpful in troubleshooting problems. See the Unisys SafeGuard Solutions Replication Appliance Administrator’s Guide for complete reference information on the kutils utility.

Discovering Problems<br />

Symptoms of problems and notifications occur in various ways with the Unisys<br />

<strong>SafeGuard</strong> 30m solution. The tools and capabilities described previously provide<br />

notifications for some conditions and events. Other problems are recognized from<br />

failures. Problems might be noted in the following ways:<br />

• Problems with data because of a rolling disaster, which means that the site needs to<br />

use a previous snapshot to recover<br />

• Problems with applications failing<br />

• Inability to switch processing to the remote or secondary site<br />

• Problems with the MSCS cluster (such as a failover to another cluster or site)<br />

• Problems reported in an e-mail notification from an RA<br />

• Problem reported in an SNMP trap notification<br />

• Problems listed on the management console as reported in the overall system status<br />

or in group state or properties<br />

• Problems reported in the daily summary of events<br />

In this guide, symptoms and notifications are often listed with potential problems.<br />

However, the messages and notifications vary based on the problem, and multiple<br />

events and notifications are possible at any given time.<br />

Events That Cause Journal Distribution<br />

Certain conditions might occur that can prevent access to the expected journal image.<br />

For instance, images might be flushed or distributed so that they are not available. Table<br />

2–2 lists events that might cause the images to be unavailable. For tables listing all<br />

events, see Appendix E.<br />

Table 2–2. Events That Cause Journal Distribution

Event ID: 4042
Level: Info
Scope: Detailed
Description: Group deactivated. (Group , RA )
Trigger: A user action deactivated the group.

Event ID: 4062
Level: Info
Scope: Detailed
Description: Access enabled to latest image. (Group , Failover site )
Trigger: Access was enabled to the latest image during automatic failover.

Event ID: 4097
Level: Warning
Scope: Detailed
Description: Maximum journal lag exceeded. Distribution in fast-forward; older images removed from journal. (Group )
Trigger: Fast-forward action started and caused the snapshots taken before the fast-forward action to be lost and the maximum journal lag to be exceeded.

Event ID: 4099
Level: Info
Scope: Detailed
Description: Initializing in long resynchronization mode. (Group )
Trigger: The system started a long resynchronization.

Troubleshooting Procedures

For troubleshooting, you must differentiate between problems that arise from changes in the environment (network changes such as cabling, routing, and port blocking; changes related to zoning, logical unit number (LUN) masking, or other devices in the SAN; and storage failures) and problems that arise from misconfiguration or internal errors in the environmental setup.

Refer to the preceding diagrams as you consider the general troubleshooting procedures<br />

that follow. Use the following four general tasks to help you identify symptoms and<br />

causes whenever you encounter a problem.<br />

Identifying the Main Components and Connectivity of the<br />

Configuration<br />

Knowledge of the main system components and the connectivity between these<br />

components is a key to understanding how the entire environment operates. This<br />

knowledge helps you understand where the problem exists in the overall system context<br />

and can help you correctly identify which components are affected.<br />

Identify the following components:<br />

• Storage device, controller, and the configuration of connections to the Fibre Channel<br />

(FC) switch<br />

• Switch and port types, and their connectivity<br />

• Network configuration (WAN and LAN): IP addresses, routing schemes, subnet<br />

masks, and gateways<br />

• Participating servers: operating system, host bus adapters (HBAs), connectivity to<br />

the FC switch<br />

• Participating volumes: repository volumes, journal volumes, and replication volumes<br />

Understanding the Current State of the System<br />

Use the management console and the CLI get commands to understand the current<br />

state of the system:<br />

• Is any component shown to be in an error state? If so, what is the error? Is the component down or disconnected from other components?

• What is the state of the groups, splitters, volumes, transfer, and distribution?<br />

• Is the current state stable or changing within intervals of time?<br />

Verifying the System Connectivity<br />

To verify the system connectivity, use physical and tool-based verification methods to<br />

answer the following questions:<br />

• Are all the components physically connected? Are the activity or link lights active?<br />

• Are the components connected to the correct switch or switches? Are they<br />

connected to the correct ports?<br />

• Is there connectivity over the WAN between all appliances? Is there connectivity<br />

between the appliances on the same site over the management network?<br />
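The WAN and management-network questions in the preceding list can be spot-checked from a Windows command prompt; the addresses below are documentation placeholders, not addresses defined by this guide.

rem Check connectivity to the local RA over the management LAN and to a
rem remote RA over the WAN; replace the addresses with your management IPs.
ping -n 4 192.0.2.10
ping -n 4 198.51.100.10
rem If the WAN ping fails, trace the route to see where it stops.
tracert 198.51.100.10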

Analyzing the Configuration Settings<br />

Many problems occur because of improper configuration settings such as improper<br />

zoning. Analyze the configuration settings to ensure they are not the cause of the<br />

problem.<br />

• Are the zones properly configured?<br />

− Splitter-to-storage?<br />

− Splitter-to-RA?<br />

− RA-to-storage?<br />

− RA-to-RA?<br />

• Are the zones included in the switch configuration?

• Has the proper switch configuration been applied?

• Are the LUNs properly masked?<br />

− Is the splitter masked to see only the relevant replication volume or volumes?<br />

− Are the RAs masked to see the relevant replication volume or volumes,<br />

repository volume, and journal volume or volumes?<br />

• Are the network settings (such as gateway) for the RAs correct?<br />

• Are there any possible IP conflicts on the network?<br />


Section 3<br />

Recovering in a Geographic Replication<br />

Environment<br />

This section provides recovery procedures so that user applications can be online as<br />

quickly as possible in a geographic replication environment.<br />

An older image might be required to recover from a rolling disaster, human error, a virus,<br />

or any other failure that corrupts the latest snapshot image. Ensure that the image is<br />

tested prior to reversing direction.<br />

Complete the procedures for each group that needs to be moved based on the type of<br />

hosts in the environment:<br />

• Manual Failover of Volumes and Data Consistency Groups<br />

• Manual Failover of Volumes and Data Consistency Groups for ClearPath MCP Hosts<br />

Refer to the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong> for<br />

more information on logged and virtual (with roll or without roll) access modes. For<br />

specific environments, refer to the best practices documents listed under <strong>SafeGuard</strong><br />

<strong>Solutions</strong> documentation on the Unisys Product <strong>Support</strong> Web site,<br />

www.support.unisys.com.

Manual Failover of Volumes and Data Consistency<br />

Groups<br />

When you need to perform a manual failover of volumes and data consistency groups,<br />

complete the following tasks:<br />

1. Accessing an image<br />

2. Testing the selected image<br />

Accessing an Image<br />

1. From the Management Console, select any one of the data consistency groups in the navigation pane.

2. Select the Status tab (if it is not already open).

3. Perform the following steps to allow access to the target image:<br />

a. Right-click the Consistency Group and select Pause Transfer. Click Yes<br />

when the system prompts that the group activity will be paused.<br />

b. Right-click the Consistency Group and scroll down.<br />

c. Select the Remote Copy name and click Enable Image Access.<br />

The Enable Image Access dialog box appears.<br />

d. Choose Select an image from the list and click Next.<br />

The Select Explicit Image dialog box appears and displays the available<br />

images.<br />

e. Select the desired image from the list and click Next.<br />

The Image Access Mode dialog box appears.<br />

f. Select the option Logged access (physical) and click Next.<br />

The Summary screen displays the Image name and the Image Access mode.<br />

g. Click Finish.<br />

Note: This process might take a long time to complete depending on the value<br />

of the journal lag setting in the group policy of the consistency group. The<br />

following message appears during the process:<br />

Enabling log access<br />

h. Verify the target image name displayed below the bitmap in the components pane on the Status tab.

Transfer: Paused is displayed at the bottom of the components pane on the Status tab.

Testing the Selected Image at Remote Site<br />

Perform the following steps to test the selected image at the remote site:<br />

1. Run the following batch file to mount a volume at the remote site. If necessary,<br />

modify the program files\kdriver path to fit your environment.<br />

@echo off
rem Unmount the drive letter if it is already mounted, then mount the replica volume as E:.
cd "c:\program files\kdriver\kutils"
"c:\program files\kdriver\kutils\kutils.exe" umount e:
"c:\program files\kdriver\kutils\kutils.exe" mount e:

2. Repeat step 1 for all volumes in the group.<br />

3. Ensure that the selected image is valid:<br />

• all applications start successfully using the selected image<br />

• the data in the image is consistent and valid<br />

For example, you might want to test whether you can start a database application on<br />

this image. You might also want to run proprietary test procedures to validate the<br />

data.<br />

4. If you have tested the validity of the image and the test is successful, skip to “Unmounting the Volumes at Production Site and Reversing Replication Direction.” If the test is unsuccessful, continue with step 5.

5. To test a different image, perform the procedure “Unmounting the Volumes and Disabling the Image Access at Remote Site.”

Unmounting the Volumes and Disabling the Image Access at Remote<br />

Site<br />

1. Before choosing another image, unmount the volume using the following batch file. If necessary, modify the program files\kdriver path to fit your environment.

@echo off
rem Flush the file system buffers for drive E:, then unmount the volume.
cd "c:\program files\kdriver\kutils"
"c:\program files\kdriver\kutils\kutils.exe" flushFS e:
"c:\program files\kdriver\kutils\kutils.exe" umount e:

2. Repeat step 1 for all volumes in the group.<br />

3. Select one of the Consistency Groups in the navigation pane on the<br />

Management Console.<br />

4. Right-click the Consistency Group and scroll down.<br />

5. Select the Remote Copy name and click Disable Image Access.<br />

6. Click Yes when the system prompts you to ensure that all group volumes are<br />

unmounted.<br />

7. Repeat the procedures “Accessing an Image” and “Testing the Selected Image at<br />

the Remote Site”.<br />

Unmounting the Volumes at Production Site and Reversing<br />

Replication Direction<br />

Perform these steps at the host:<br />

1. To unmount a volume at the production site, run the following batch file. If<br />

necessary, modify the program files\kdriver path to fit your environment.<br />

@echo off<br />

cd "c:\program files\kdriver\kutils"<br />

"c:\program files\kdriver\kutils\kutils.exe" flushFS e:<br />

"c:\program files\kdriver\kutils\kutils.exe" umount e:<br />

2. Repeat step 1 for all volumes in the group.<br />

Perform these steps on the Management Console:<br />

1. Select a Consistency Group from the navigation pane.<br />

2. Right-click the Group and select Pause Transfer. Click Yes when the system<br />

prompts that the group activity will be paused.<br />

3. Click the Status tab. The status of the transfer must display Paused.<br />

4. Right-click the Consistency group and select Failover to .<br />

5. Click Yes when the system prompts you to confirm failover.<br />

6. Ensure that the Start data transfer immediately check box is selected.<br />

The following warning message appears:<br />

Warning: Journal will be erased. Do you wish to continue?<br />

7. Click Yes to continue.<br />

Manual Failover of Volumes and Data Consistency<br />

Groups for ClearPath MCP Hosts<br />

When you need to perform a manual failover of volumes and data consistency groups,<br />

complete the following tasks:<br />

1. Accessing an image<br />

2. Testing the selected image<br />

Note: For ClearPath MCP hosts, close and free units at the remote site before completing the following procedures. This action prevents SCSI Reserved errors from being logged against units that are no longer accessible.

Accessing an Image<br />

Quiesce any databases before accessing an image. Once the pack has failed over and has been acquired, resume the databases.

If the volumes to be failed over are not in use by a database, issue the CLOSE PK<br />

command from the operator display terminal (ODT) to close the<br />

volumes.<br />

For more information on how to access an image, refer to the procedure “Accessing an Image” under “Manual Failover of Volumes and Data Consistency Groups.”

Testing the Selected Image at Remote Site<br />

1. Mount a volume at the remote site by issuing the ACQUIRE PK <br />

command from the remote site ODT to acquire the unit. Also acquire any controls<br />

necessary to access the unit if these controls are not automatically acquired.<br />

Verify that the MCP can access the volume using commands such as SC– and P PK<br />

to display the status of the peripherals.<br />

2. Repeat step 1 for all volumes in the group.<br />

3. Ensure that the selected image is valid; that is, verify that<br />

• All applications start successfully using the selected image.<br />

• The data in the image is consistent and valid.<br />

For example, you might want to test whether you can start a database application on<br />

this image. You might also want to run proprietary test procedures to validate the<br />

data.<br />

4. If you tested the validity of the image and the test completed successfully, skip to “Unmounting the Volumes at Source Site and Reversing Replication Direction.” If the testing is not successful, continue with step 5.

5. To test a different image, perform the procedure “Unmounting the Volumes and<br />

Disabling the Image Access at Remote Site.”<br />

Unmounting the Volumes and Disabling the Image Access at Remote<br />

Site<br />

1. Before choosing another image, unmount the volume by issuing the CLOSE PK<br />

command followed by the FREE PK command from<br />

the ODT. Verify that the units are closed and freed using peripheral status<br />

commands.<br />

2. Repeat step 1 for all volumes in the group.<br />

3. Click the Status tab. The status of the transfer must display Paused.<br />

4. Right-click the Consistency group and select Failover to .<br />

5. Click Yes when the system prompts you to confirm failover.<br />

6. Ensure that the Start data transfer immediately check box is selected.<br />

The following warning message appears:<br />

Warning: Journal will be erased. Do you wish to continue?<br />

7. Click Yes to continue.<br />

Unmounting the Volumes at Source Site and Reversing Replication<br />

Direction<br />

Perform these steps at the source site host:<br />

1. Unmount a volume at the source site by issuing the CLOSE PK <br />

command followed by the FREE PK command from the ODT to close<br />

and free the volume.<br />

If the site is down when the host is recovered, use the FREE PK <br />

command to free the original source units. In response to inquiry commands, the<br />

status of the original source units is “closed.” Free the units to prevent access by<br />

the original source site host.<br />

2. Repeat step 1 for all volumes in the group.<br />

3. Select a Consistency Group from the navigation pane.<br />

4. Right-click the Group and select Pause Transfer. Click Yes when the system<br />

prompts that the group activity will be paused.<br />

5. Click the Status tab. The status of the transfer must display Paused.<br />

6. Right-click the Consistency group and select Failover to .<br />

7. Click Yes when the system prompts you to confirm failover.<br />

8. Ensure that the Start data transfer immediately check box is selected.<br />

The following warning message appears:<br />

Warning: Journal will be erased. Do you wish to continue?<br />

9. Click Yes to continue.<br />

Section 4<br />

Recovering in a Geographic Clustered<br />

Environment<br />

This section provides information and procedures that relate to geographic clustered<br />

environments running Microsoft Cluster Service (MSCS).<br />

Checking the Cluster Setup<br />

To ensure that the cluster configuration is correct, check the MSCS properties and the<br />

network bindings. For more detailed information, refer to “Guide to Creating and Configuring a Server Cluster under Windows Server 2003”, which you can download at http://www.microsoft.com/downloads/details.aspx?familyid=96F76ED7-9634-4300-9159-89638F4B4EF7&displaylang=en

MSCS Properties

To check the MSCS properties, enter the following command from the command<br />

prompt:<br />

Cluster /prop<br />

Output similar to the following is displayed:<br />

T Cluster Name Value<br />

-- -------------------- ------------------------------ -----------------------<br />

M AdminExtensions {4EC90FB0-D0BB-11CF-B5EF-0A0C90AB505}<br />

D DefaultNetworkRole 2 (0x2)<br />

S Description<br />

B Security 01 00 14 80 ... (148 bytes)<br />

B Security Descriptor 01 00 14 80 ... (148 bytes)<br />

M Groups\AdminExtensions<br />

M Networks\AdminExtensions<br />

M NetworkInterfaces\AdminExtensions<br />

M Nodes\AdminExtensions<br />

M Resources\AdminExtensions<br />

M ResourceTypes\AdminExtensions<br />

D EnableEventLogReplication 0 (0x0)<br />

D QuorumArbitrationTimeMax 300 (0x12c)<br />

D QuorumArbitrationTimeMin 15 (0xf)<br />

D DisableGroupPreferredOwnerRandomization 0 (0x0)<br />

D EnableEventDeltaGeneration 1 (0x1)<br />

D EnableResourceDllDeadlockDetection 0 (0x0)<br />

D ResourceDllDeadlockTimeout 240 (0xf0)<br />

D ResourceDllDeadlockThreshold 3 (0x3)<br />

D ResourceDllDeadlockPeriod 1800 (0x708)<br />

D ClusSvcHeartbeatTimeout 60 (0x3c)<br />

D HangRecoveryAction 3 (0x3)<br />

If the properties are not set correctly, use the following commands to correct the settings.

Majority Node Set Quorum

Cluster /prop HangRecoveryAction=3
Cluster /prop EnableEventLogReplication=0

Shared Quorum

Cluster /prop QuorumArbitrationTimeMax=300 (not for majority node set)
Cluster /prop QuorumArbitrationTimeMin=15
Cluster /prop HangRecoveryAction=3
Cluster /prop EnableEventLogReplication=0

Network Bindings

The following binding priority order and settings are suggested as best practices for<br />

clustered configurations. These procedures assume that you can identify the public and<br />

private networks by the connection names that are referenced in the steps.<br />

Host-Specific Network Bindings and Settings<br />

1. Open the Network Connections window.<br />

2. On the Advanced menu, click Advanced Settings.<br />

3. Select the Networks and Bindings tab.<br />

This tab shows the binding order in the upper pane and specific connection<br />

properties in the lower pane.<br />

4. Verify that the public network connection is above the private network in the binding<br />

list in the upper pane.<br />

If it is not, follow these steps to change the order:<br />

a. Select a network connection in the binding list in the upper pane.<br />

b. Use the arrows to the right to move the network connection up or down in the<br />

list as appropriate.<br />

5. Select the private network in the binding list. In the lower pane, verify that the File<br />

and Print Sharing for Microsoft Networks and the Client for Microsoft<br />

Networks check boxes are cleared for the private network.<br />

6. Click OK.<br />

7. Highlight the public connections, then right-click and click Properties.<br />

8. Select Internet Protocol (TCP/IP) in the list, and click Properties.

9. Click Advanced.<br />

10. Select the WINS tab.<br />

11. Ensure that Enable LMHOSTS lookup is selected.

12. Ensure that Disable NetBIOS over TCP/IP is selected.<br />

13. Repeat steps 7 through 12 for the private network connection.<br />

Cluster-Specific Network Bindings and Settings<br />

1. Open the Cluster Administrator.<br />

2. Right-click the cluster (the top node in the tree structure in the left pane) and click Properties.

3. Select the Networks Priority tab.<br />

4. Ensure that the private network is at the top of the list and that the public network is<br />

below the private network.<br />

If it is not, follow these steps to change the order:<br />

a. Select the private network.<br />

b. Use the command button at the right to move the private network up in the list as appropriate.

5. Select the private network, and click Properties.<br />

6. Verify that the Enable this network for cluster use check box is selected and<br />

that Internal cluster communications only (private network) is selected.<br />

7. Click OK.<br />

8. Select the public network, and click Properties.<br />

9. Verify that the Enable this network for cluster use check box is selected and<br />

that All communications (mixed network) is selected.<br />

10. Click OK.<br />
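The network roles that you just verified in Cluster Administrator can also be inspected from a command prompt with the cluster.exe utility. This is a quick sketch only; the network name "Private" is an example, and the output format varies by service pack.

rem List the cluster networks and their current state.
cluster network
rem Show the properties of a specific network; a Role of 1 means internal
rem cluster communications only, and 3 means all communications (mixed).
cluster network "Private" /prop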

Group Initialization Effects on a Cluster<br />

Move-Group Operation<br />

The following conditions affect failover times for a cluster move-group operation. A<br />

cluster move-group operation cannot complete if a lengthy consistency group<br />

initialization, such as a full-sweep initialization, long resynchronization, or initialization<br />

from marking mode, is executing in the background. Review these conditions and plan<br />

accordingly.<br />

Full-Sweep Initialization<br />

A full-sweep initialization occurs when the disks on both sites are scanned or read in<br />

their entirety and a comparison is made, using checksums, to check for differences. Any<br />

differences are then replicated from the Production site disk to the remote site disk. A<br />

full-sweep initialization generates an entry in the management console log.<br />

A full-sweep initialization occurs in the following circumstances:<br />

• Disabling or enabling a group<br />

Disabling a group causes all disk replication in the group to stop. A full-sweep<br />

initialization is performed once the group is enabled. The full-sweep initialization<br />

guarantees that the disks are consistent between the sites.<br />

• Adding a new splitter server or host that has access to the disks in the group<br />

When adding a new splitter to the replication, there is a time before the splitter is<br />

added to the configuration when activity from this splitter to the disks is not being<br />

monitored or replicated. To guarantee that no write operations were performed by<br />

the new splitter before the splitter was configured in the replication, a full-sweep<br />

initialization is required for all groups that contain disks accessed by this splitter. This<br />

initialization is done automatically by the system.<br />

• Double failure of a main component<br />

When a double failure of a main component occurs, a full-sweep initialization is<br />

required to guarantee that consistency was maintained. The main components<br />

include the host, the replication appliance (RA), and the storage subsystem.<br />

Long Resynchronization<br />

A long resynchronization occurs when the data difference that needs to be replicated to<br />

the other site cannot fit on the journal volume. The data is split into multiple snapshots<br />

for distribution to the other site, and all the previous snapshots are lost. Long<br />

resynchronization can be caused by long WAN outages, a group being disabled for a long<br />

time period, and other instances when replication has not been functional for a long time<br />

period.<br />

Long resynchronization is not connected with full-sweep initialization and can also happen during initialization from marking (see “Initialization from Marking Mode”). It is dependent only on the journal volume size and the amount of data to be replicated.

A long resynchronization is identified on the Status tab, in the components pane, under the remote journal bitmap in the management console. The status Performing Long Resync is visible for the group that is currently performing a long resynchronization.

Initialization from Marking Mode<br />

All other instances of initialization in the replication are caused by marking. The marking<br />

mode refers to a replication mode in which the location of “dirty,” or changed, data is<br />

marked in a bitmap on the repository volume. This bitmap is a standard size—no matter<br />

how much data changes or what size disks are being monitored—so the repository<br />

volume cannot fill up during marking.<br />

The replication moves to marking mode when replication cannot be performed normally,<br />

such as during WAN outages. This marking mode guarantees that all data changes are<br />

still being recorded until replication is functioning normally. When replication can perform<br />

normally again, the RAs read the dirty, or changed, data from the source disk based on data recorded in the bitmap and replicate it to the disk on the remote site. The length of

time for this process to complete depends on the amount of dirty, or changed, data as<br />

well as the performance of other components in the configuration, such as bandwidth<br />

and the storage subsystem.<br />

A high-load state can also cause the replication to move to marking mode. A high-load<br />

state occurs when write activity to the source disks exceeds the limits that the<br />

replication, bandwidth, or remote disks can handle. Replication moves into marking<br />

mode at this time until the replication determines the activity has reached a level at<br />

which it can continue normal replication. The replication then exits the high-load state<br />

and an initialization from marking occurs.<br />

See Section 10, “Solving Performance Problems,” for more information on high-load<br />

conditions and problems.<br />

Behavior of <strong>SafeGuard</strong> 30m Control During a<br />

Move-Group Operation<br />

During a move-group operation, the Unisys <strong>SafeGuard</strong> 30m Control resource in a<br />

clustered environment behaves as follows. Be aware of this information when dealing with various failure scenarios.

1. MSCS issues an offline request because of a failure with a group resource—for<br />

example, a physical disk—or an MSCS move group. The request is sent to the<br />

Unisys <strong>SafeGuard</strong> 30m Control resource on the node that owns the group.<br />

The MSCS resources that are dependent on the Unisys <strong>SafeGuard</strong> 30m Control<br />

resource, such as physical disk resources, are taken offline first. Taking the<br />

resources offline does not issue any commands to the RA.<br />

2. MSCS issues an online request to the Unisys <strong>SafeGuard</strong> 30m Control resource on<br />

the node to which a group was moved, or in the case of failure, to the next node in<br />

the preferred owners list.<br />

3. When the resource receives an online request from MSCS, the Unisys <strong>SafeGuard</strong><br />

30m Control resource issues two commands to control the access to disks:<br />

initiate_failover and verify_failover.<br />

Initiate_Failover Command<br />

This command changes the replication direction from one site to another.<br />

• If a same-site failover is requested, the command completes successfully with<br />

no action performed by the RA.<br />

• The resource issues the verify_failover command to see if the RA performed<br />

the operations successfully.<br />

• If a different-site failover is requested, the RA starts changing direction between<br />

sites and returns successfully. In certain circumstances, the RA returns a failure<br />

when the WAN is down or a long resynchronization occurs.<br />

• If the RA returns a failure to the Unisys <strong>SafeGuard</strong> 30m Control resource, the<br />

resource logs the failure in the Windows application event log and retries the<br />

command continuously until the cluster pending timeout is reached. When a move-group operation fails, check the application event log to view the events posted by the resource. The event source of the event entries is the 30m Control.

Verify_Failover Command<br />

This command enables the Unisys <strong>SafeGuard</strong> 30m Control resource to determine<br />

the time at which the change of the replication direction completes.<br />

• If a same-site failover is requested, the command completes successfully with<br />

no action performed by the RA.<br />

• If a different-site failover is requested, the verify_failover command returns a<br />

pending status until the replication direction changes. The change of direction<br />

takes from 2 to 30 minutes.<br />

• When the verify_failover command completes, write access to the physical disk<br />

is enabled to the host from the RA and the splitter.<br />

• If the time to complete the verify_failover command is within the pending<br />

timeout, the Unisys <strong>SafeGuard</strong> 30m Control resource comes online followed by<br />

all the resources dependent on this resource.<br />

All dependent disks come online using the default physical disk timeout of an<br />

MSCS cluster. The physical disk is available to the physical disk resource<br />

immediately; there is no delay. Physical disk access is available when the Unisys<br />

<strong>SafeGuard</strong> 30m Control resource comes online. You do not need to change the<br />

default resource settings for the physical disk. However, the physical disk must<br />

be dependent on the Unisys <strong>SafeGuard</strong> 30m Control resource.<br />

• If the time to complete the verify_failover command is longer than the pending<br />

timeout of the Unisys <strong>SafeGuard</strong> 30m Control resource, MSCS fails this<br />

resource.<br />

The default pending timeout for a Unisys <strong>SafeGuard</strong> 30m Control resource is<br />

15 minutes or 900 seconds. This timeout occurs before the cluster disk timeout.<br />

If you use the default retry value of 1, this resource issues the following<br />

commands:<br />

• Initiate_failover<br />

• Verify_failover<br />

• Initiate_failover<br />

• Verify_failover<br />

Using the default pending timeout, the Unisys <strong>SafeGuard</strong> 30m Control resource<br />

waits a total of 30 minutes to come online; this timeout period equals the<br />

timeout plus one retry. If the resource does not come online, MSCS attempts to<br />

move the group to the next node in the preferred owners list and then repeats<br />

this process.<br />
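Expressed as a quick calculation using the default values stated above: total wait = pending timeout × (1 + number of retries) = 900 seconds × 2 = 1,800 seconds, or 30 minutes. Increasing either the pending timeout or the retry value lengthens the time before MSCS gives up and moves the group to the next node.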

Recovering by Manually Moving an Auto-Data<br />

(Shared Quorum) Consistency Group<br />

An older image might be required to recover from a rolling disaster, human error, a virus,<br />

or any other failure that corrupts the latest snapshot image. It is impossible to recover<br />

automatically to an older image using MSCS because automatic cluster failover is<br />

designed to minimize data loss. The Unisys <strong>SafeGuard</strong> 30m solution always attempts to<br />

fail over to the latest image.<br />

Note: Manual image recovery is only for data consistency groups, not for the quorum<br />

group.<br />

To recover a data consistency group using an older image, you must complete the<br />

following tasks:<br />

• Take the cluster data group offline.<br />

• Perform a manual failover of an auto-data (shared quorum) consistency group to a<br />

selected image.<br />

• Bring the cluster group online and check the validity of the image.<br />

• Reverse the replication direction of the consistency group.<br />

Taking a Cluster Data Group Offline<br />

To take a group offline in the cluster for which you are performing a manual recovery,<br />

complete the following steps:<br />

1. Open Cluster Administrator on one of the nodes in the MSCS cluster.<br />

2. Right-click the group that you want to recover and click Take Offline.<br />

3. Wait until all resources in the group show the status as Offline.<br />
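If you prefer the command line, the same operation can be performed with the cluster.exe utility; the group name shown is an example only.

rem Take the data group offline and confirm that its resources report Offline.
cluster group "Group 0" /offline
cluster group "Group 0" /status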

Performing a Manual Failover of an Auto-Data (Shared Quorum)<br />

Consistency Group to a Selected Image<br />

1. Open the Management Console.<br />

2. Select a Consistency Group from the navigation pane.<br />

Note: Do not select the quorum group. The data consistency group you select<br />

should be the cluster data group that you took offline.<br />

3. Click the Policy tab on the selected Consistency Group.

4. Scroll down and select Advanced in the Policy tab.

5. Select Manual (shared quorum) in the Global cluster mode list.

6. Click Apply.

7. Perform the following steps to access the image:

a. Right-click the Consistency Group and select Pause Transfer. Click Yes<br />

when the system prompts that the group activity will be paused.<br />

b. Right-click the Consistency Group and scroll down.<br />

c. Select the Remote Copy name and click Enable Image Access.<br />

The Enable Image Access dialog box appears.<br />

d. Choose Select an image from the list and click Next.<br />

The Select Explicit Image dialog box appears and displays the available<br />

images.<br />

e. Select the desired image from the list and click Next.<br />

The Image Access Mode dialog box appears.<br />

f. Select the option Logged access (physical) and click Next.<br />

The Summary screen displays the Image name and the Image Access mode.<br />

g. Click Finish.<br />

Note: This process might take a long time to complete depending on the value<br />

of the journal lag setting in the group policy of the consistency group. The<br />

following message appears during the process:<br />

Enabling log access<br />

h. Verify the target image name displayed below the bitmap in the components pane on the Status tab.

The Transfer: Paused status appears at the bottom of the components pane on the Status tab.

Bringing a Cluster Data Group Online and Checking the Validity<br />

of the Image<br />

1. Open the Cluster Administrator window on the Management Console.<br />

2. Move the group to the node on the recovered site by right-clicking the group that<br />

you previously took offline and then clicking Move Group.<br />

• If the cluster has more than two nodes, a list of possible owner target nodes<br />

appears. Select the node to which you want to move the group.<br />

• If the cluster has only two nodes, the move starts immediately. Go to step 3.<br />

3. Bring the group online by right-clicking the group name and then clicking Bring<br />

Online.<br />

4. Ensure that the selected image is valid; that is, verify that<br />

• All applications start successfully using the selected image.<br />

• The data in the image is consistent and valid.<br />

For example, you might want to test whether you can start a database application on<br />

this image. You might also want to run proprietary test procedures to validate the<br />

data.<br />

5. If you tested the validity of the image and the test completed successfully, skip to<br />

“Reversing the Replication Direction of the Consistency Group.”<br />

6. If the validity of the image fails and you choose to test a different image, perform the<br />

following steps:<br />

a. To take the group offline, right-click the group name and then click Take<br />

Offline on the Cluster Administrator.<br />

b. Select one of the Consistency Groups in the navigation pane on the<br />

Management Console.<br />

c. Right-click the Consistency Group and scroll down.<br />

d. Select the Remote Copy name and click Disable Image Access.<br />

e. Click Yes when the system prompts you to ensure that all group volumes are<br />

unmounted.<br />

7. Perform the following steps if you want to choose a different image:<br />

a. Right-click the Consistency Group and select Pause Transfer. Click Yes<br />

when the system prompts that the group activity will be paused.<br />

b. Right-click the Consistency Group and scroll down.<br />

c. Select the Remote Copy name and click Enable Image Access.<br />

The Enable Image Access dialog box appears.<br />

d. Choose Select an image from the list and click Next.<br />

The Select Explicit Image dialog box appears and displays the available<br />

images.<br />

e. Select the desired image from the list and click Next.<br />

The Image Access Mode dialog box appears.<br />

f. Select the option Logged access (physical) and click Next.<br />

The Summary screen displays the Image name and the Image Access mode.<br />

g. Click Finish.<br />

Note: This process might take a long time to complete depending on the value<br />

of the journal lag setting in the group policy of the consistency group. The<br />

following message appears during the process:<br />

Enabling log access<br />

h. Verify the target image name displayed below the bitmap in the components pane on the Status tab.

The Transfer: Paused status appears at the bottom of the components pane on the Status tab.

8. To bring the cluster group online, right-click the group name in the Cluster Administrator and then click Bring Online.

9. Ensure that the selected image is valid. Verify that<br />

• All applications start successfully using the selected image.<br />

• The data in the image is consistent and valid.<br />

For example, you might want to test whether you can start a database application on<br />

this image. You might also want to run proprietary test procedures to validate the<br />

data.<br />

10. If you tested the validity of the image and the test completed successfully, skip to<br />

“Reversing the Replication Direction of the Consistency Group.”<br />

11. If the image is not valid, repeat steps 6 through 9 as necessary.<br />

Reversing the Replication Direction of the Consistency Group<br />

1. Select the Consistency Group from the navigation pane.<br />

2. Right-click the Group and select Pause Transfer. Click Yes when the system<br />

prompts that the group activity will be paused.<br />

3. Click the Status tab. The status of the transfer must display Paused.

4. Click the Policy tab and expand the Advanced Settings (if it is not expanded).<br />

5. Select Auto data (shared quorum) from the Global Cluster mode list.<br />

6. Right-click the Consistency Group and select Failover to .<br />

7. Click Yes when the system prompts you to confirm failover.<br />

8. Ensure that the Start data transfer immediately check box is selected.

The following warning message appears:

Warning: Journal will be erased. Do you wish to continue?

9. Click Yes to continue.

Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner)

Problem Description

The following points describe the behavior of the components in this event:

• When the quorum group is running on the site where the RAs failed (site 1), the cluster nodes on site 1 fail because of quorum lost reservations, and cluster nodes on site 2 attempt to arbitrate for the quorum resource.

• To prevent a “split brain” scenario, the RAs assume that the other site is active when a WAN failure occurs. (A WAN failure occurs if the RAs cannot communicate to at least one RA at the other site.)

• When the MSCS Reservation Manager on the surviving site (site 2) attempts the quorum arbitration request, the RA prevents access. Eventually, all cluster services stop and manual intervention is required to bring up the cluster service.

Figure 4–1 illustrates this failure.

Figure 4–1. All RAs Fail on Site 1 (Site 1 Quorum Owner)

Symptoms<br />

The following symptoms might help you identify this failure:<br />

• The management console display shows errors and messages similar to those for<br />

“Total Communication Failure in a Geographic Clustered Environment” in Section 7.<br />

• If you review the system event log, you find messages similar to the following<br />

examples:<br />

System Event Log for Usmv-East2 Host (Surviving Host)<br />

8/2/2008 1:35:48 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-WEST2 Reservation of<br />

cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.<br />

8/2/2008 1:35:48 PM Ftdisk Warning Disk 57 N/A USMV-WEST2 The system failed to flush data to<br />

the transaction log. Corruption may occur.<br />

System Event Log for Usmv-West2 (Failure Host)<br />

8/2/2008 1:35:48 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-WEST2 Reservation of<br />

cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.<br />

8/2/2008 1:35:48 PM Ftdisk Warning Disk 57 N/A USMV-WEST2 The system failed to flush data to<br />

the transaction log. Corruption may occur.<br />

• If you review the cluster log, you find messages similar to the following examples:<br />

Cluster Log for Usmv-East2 (Surviving Host)<br />

The cluster attempted the operation five times before timing out. The following entries were recorded five times in the log:

00000f90.00000620::2008/02/02-20:36:06.195 ERR Physical Disk : [DiskArb] GetPartInfo<br />

completed, status 170 (The requested resource is in use).<br />

00000f90.00000620::2008/02/02-20:36:06.195 ERR Physical Disk : [DiskArb] Failed to read<br />

(sector 12), error 170.<br />

00000f90.00000620::2008/02/02-20:36:08.210 ERR Physical Disk : [DiskArb] GetPartInfo<br />

completed, status 170.<br />

00000f90.00000620::2008/02/02-20:36:08.210 ERR Physical Disk : [DiskArb] Failed to write<br />

(sector 12), error 170.<br />

00000638.00000b10::2008/02/02-20:36:18.273 ERR [FM] Failed to arbitrate quorum resource c336021a-<br />

083e-4fa0-9d37-7077a590c206, error 170.<br />

00000638.00000b10::2008/02/02-20:36:18.273 ERR [RGP] Node 2: REGROUP ERROR: arbitration failed.<br />

00000638.00000b10::2008/02/02-20:36:18.273 ERR [CS] Halting this node to prevent an inconsistency<br />

within the cluster. Error status = 5892 (The membership engine requested shutdown of the cluster<br />

service on this node).<br />

00000684.000005a8::2008/02/02-20:37:53.473 ERR [JOIN] Unable to connect to any sponsor node.<br />

00000684.000005a8::2008/02/02-20:38:06.020 ERR [FM] FmGetQuorumResource failed, error 170.<br />

00000684.000005a8::2008/02/02-20:38:06.020 ERR [INIT] ClusterForm: Could not get quorum resource.<br />

No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service).<br />

00000684.000005a8::2008/02/02-20:38:06.020 ERR [INIT] Failed to form cluster, status 5086 (The<br />

quorum disk could not be located by the cluster service).<br />

Cluster Log for Usmv-West2 (Failure Host)<br />

00000d80.00000bbc::2008/02/02-20:31:21.257 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />

MM_INVALID_NODE, chose the default target<br />

00000da0.00000130::2008/02/02-20:35:48.395 ERR Physical Disk : [DiskArb]<br />

CompletionRoutine: reservation lost! Status 170 (The requested resource is in use)<br />

00000da0.00000130::2008/02/02-20:35:48.395 ERR [RM] LostQuorumResource, cluster service<br />

terminated...<br />

00000da0.00000b80::2008/02/02-20:35:49.145 ERR Network Name : Unable to open<br />

handle to cluster, status 1753 (There are no more endpoints available from the endpoint mapper).<br />

00000da0.00000c20::2008/02/02-20:35:49.145 ERR IP Address : WorkerThread:<br />

GetClusterNotify failed with status 6 (The handle is invalid).<br />

00000a04.00000a14::2008/02/02-20:37:23.456 ERR [JOIN] Unable to connect to any sponsor node.<br />

The cluster attempted the operation five times before timing out. The following entries were recorded five times in the log:

000001e4.00000598::2008/02/02-20:37:23.799 ERR Physical Disk : [DiskArb] GetPartInfo<br />

completed, status 170 (The resource is in use).<br />

000001e4.00000598::2008/02/02-20:37:23.831 ERR Physical Disk : [DiskArb] Failed to read<br />

(sector 12), error 170.<br />

000001e4.00000598::2008/02/02-20:37:23.831 ERR Physical Disk : [DiskArb] BusReset<br />

completed, status 31 (A device attached to the system is not functioning).<br />

000001e4.00000598::2008/02/02-20:37:23.831 ERR Physical Disk : [DiskArb] Failed to break<br />

reservation, error 31.<br />

00000a04.00000a14::2008/02/02-20:37:25.830 ERR [FM] FmGetQuorumResource failed, error 31.<br />

00000a04.00000a14::2008/08/02-20:37:25.830 ERR [INIT] ClusterForm: Could not get quorum resource.<br />

No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service).<br />

00000a04.00000a14::2008/02/02-20:37:25.830 ERR [INIT] Failed to form cluster, status 5086 (The<br />

quorum disk could not be located by the cluster service).<br />

00000a04.00000a14::2008/02/02-20:37:25.830 ERR [CS] ClusterInitialize failed 5086<br />

00000a04.00000a14::2008/02/02-20:37:25.846 ERR [CS] Service Stopped. exit code = 5086<br />

Actions to Resolve the Problem<br />

If all RAs on site 1 fail and site 1 owns the quorum resource, perform the following tasks<br />

to recover:<br />

1. Disable MSCS on all nodes at the site with the failed RAs.<br />

2. Perform a manual failover of the quorum consistency group.<br />

3. Reverse replication direction.<br />

4. Start MSCS on a node on the surviving site.<br />

5. Complete the recovery process.<br />

Caution<br />

Manual recovery is required only if the quorum device is lost because of a<br />

failure of an RA cluster.<br />

Before you bring the remote site online and before you perform the manual<br />

recovery procedure, ensure that MSCS is stopped and disabled on the cluster<br />

nodes at the production site (site 1 in this case). You must verify the server<br />

status with a network test.<br />

Improper use of the manual recovery procedure can lead to an inconsistent<br />

quorum disk and unpredictable results that might require a long recovery<br />

process.<br />

Disabling MSCS<br />

Stop MSCS on each node at the site where the RAs failed by completing the following<br />

steps:<br />

1. In the Control Panel, point to Administrative Tools, and then click Services.<br />

2. Right-click Cluster Service and click Stop.<br />

3. Change the startup type to Disabled.<br />

4. Repeat steps 1 through 3 for each node on the site.<br />
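The same result can be obtained from a command prompt on each node; this sketch assumes the default cluster service name, ClusSvc.

rem Stop the cluster service and prevent it from starting automatically.
net stop clussvc
sc config clussvc start= disabled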

Performing a Manual Failover of the Quorum Consistency Group<br />

1. Connect to the Management Console by opening a browser to the management IP<br />

address of the surviving site. The management console can be accessed only by the<br />

site with a functional RA cluster because the WAN is down.<br />

2. Click the Quorum Consistency Group (that is, the consistency group that holds<br />

the quorum drive) in the navigation pane.<br />

3. Click the Policy tab.<br />

4. Under Advanced, select Manual (shared quorum) in the Global cluster<br />

mode list, and click Apply.<br />

5. Right-click the Quorum Consistency Group and then select Pause Transfer.<br />

Click Yes when the system prompts that the group activity will be stopped.<br />

6. Perform the following steps to allow access to the target image:<br />

a. Right-click the Consistency Group and scroll down.<br />

b. Select the Remote Copy name and click Enable Image Access.<br />

The Enable Image Access dialog box appears.<br />

c. Choose Select an image from the list and click Next.<br />

The Select Explicit Image dialog box displays the available images.<br />

d. Select the desired image from the list and then click Next.<br />

The Image Access Mode dialog box appears.<br />

e. Select Logged access (physical) and click Next.<br />

The Summary screen shows the Image name and the Image Access mode.<br />

f. Click Finish.<br />

Note: This process might take a long time to complete depending on the value<br />

of the journal lag setting in the group policy of the consistency group.<br />

g. Verify the target image name displayed below the bitmap in the components pane on the Status tab.

The Transfer: Paused status is displayed below the bitmap in the components pane on the Status tab.

Reversing Replication Direction<br />

1. Select the Quorum Consistency Group in the navigation pane.<br />

2. Right-click the Group and select Pause Transfer. Click Yes when the system<br />

prompts that the group activity will be paused.<br />

3. Click the Status tab. The status of the transfer must show Paused.<br />

4. Right-click the Consistency Group and select Failover to .<br />

5. Click Yes when the system prompts to confirm failover.<br />

6. Ensure that the Start data transfer immediately check box is selected.<br />

The following warning message appears:<br />

Warning: Journal will be erased. Do you wish to continue?<br />

7. Click Yes to continue.<br />

Starting MSCS<br />

MSCS should start within 1 minute on the surviving nodes when the MSCS recovery<br />

setting is enabled. You can manually start MSCS on each node of the surviving site by<br />

completing the following steps:<br />

1. In the Control Panel, point to Administrative Tools, and then click Services.<br />

2. Right-click Cluster Service, and click Start.<br />

MSCS starts the cluster group and automatically moves all groups to the first-started<br />

cluster node.<br />

3. Repeat steps 1 through 2 for each node on the site.<br />
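From a command prompt, the equivalent is a single command per node (again assuming the default service name, ClusSvc):

rem Start the cluster service on this node.
net start clussvc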

Completing the Recovery Process<br />

To complete the recovery process, you must restore the global cluster mode property<br />

and start MSCS.<br />

• Restoring the Global Cluster Mode Property for the Quorum Group<br />

Once the primary site is operational and you have verified that all nodes at both sites<br />

are online in the cluster, restore the failover settings by performing the following<br />

steps:<br />

1. Click the Quorum Consistency Group (that is, the consistency group that<br />

holds the quorum device) from the navigation pane.<br />

2. Click the Policy tab.<br />

3. Under Advanced, select Auto-quorum (shared quorum) in the Global<br />

cluster mode list.<br />

4. Click Apply.<br />

5. Click Yes when the system prompts that the group activity will be stopped.<br />

• Enabling MSCS<br />

Enable and start MSCS on each node at the site where the RAs failed by completing<br />

the following steps:<br />

1. In the Control Panel, point to Administrative Tools, and then click<br />

Services.<br />

2. Right-click Cluster Service and click Properties.

3. Change the startup type to Automatic.

4. Click Start.

5. Repeat steps 1 through 4 for each node on the site.<br />

6. Open the Cluster Administrator and move the groups to the preferred node.<br />
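A rough command-line equivalent for steps 1 through 6 follows; the service name and the node and group names are examples only.

rem Re-enable automatic startup and start the cluster service on the node.
sc config clussvc start= auto
net start clussvc
rem Move a group back to its preferred node with cluster.exe.
cluster group "Group 0" /moveto:USMV-WEST2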

Recovery When All RAs Fail on Site 1 (Site 2 Quorum Owner)

Problem Description

If the quorum group is running on site 2 and the RAs fail on site 1, all cluster nodes remain in a running state. All consistency groups remain at the respective sites because all disk accesses are successful. In this case, because data is stored on the replication volumes but the corresponding marking information is not written to the repository volume, a full-sweep resynchronization is required following recovery.

An exception is if the consistency group option “Allow application to run even when Unisys SafeGuard Solutions cannot mark data” was selected. The splitter prevents access to disks when the RAs are not available to write marking data to the repository volume, and I/Os fail.

Figure 4–2 illustrates this failure.

Figure 4–2. All RAs Fail on Site 1 (Site 2 Quorum Owner)

Symptoms

The following symptoms might help you identify this failure:

• The management console display shows errors and messages similar to those for “Total Communication Failure in a Geographic Clustered Environment” in Section 7.

• If you review the system event log, you find messages similar to the following examples:

System Event Log for Usmv-East2 Host (Surviving Site—Site 2)<br />

8/2/2008 3:09:27 PM ClusSvc Information Failover Mgr 1204 N/A USMV-EAST2 "The Cluster<br />

Service brought the Resource Group ""Group 0"" offline."<br />

8/2/2008 3:09:47 PM ClusSvc Error Failover Mgr 1069 N/A USMV-WEST2 Cluster resource 'Data1' in<br />

Resource Group 'Group 0' failed.<br />

8/2/2008 3:10:29 PM ClusSvc Information Failover Mgr 1153 N/A USMV-WEST2 Cluster service is<br />

attempting to failover the Cluster Resource Group 'Group 0' from node USMV-WEST2 to node USMV-<br />

EAST2.<br />

8/2/2008 3:10:53 PM ClusSvc Information Failover Mgr 1201 N/A USMV-EAST2 "The Cluster<br />

Service brought the Resource Group ""Group 0"" online."<br />

System Event Log for Usmv-West2 Host (Failure Site—Site 1)<br />

8/2/2008 3:09:27 PM ClusSvc Information Failover Mgr 1204 N/A USMV-EAST2 "The Cluster<br />

Service brought the Resource Group ""Group 0"" offline."<br />

8/2/2008 3:09:47 PM ClusSvc Error Failover Mgr 1069 N/A USMV-WEST2 Cluster resource 'Data1' in<br />

Resource Group 'Group 0' failed.<br />

8/2/2008 3:10:29 PM ClusSvc Information Failover Mgr 1153 N/A USMV-WEST2 Cluster service is<br />

attempting to failover the Cluster Resource Group 'Group 0' from node USMV-WEST2 to node USMV-<br />

EAST2.<br />

8/2/2008 3:10:53 PM ClusSvc Information Failover Mgr 1201 N/A USMV-EAST2 "The Cluster<br />

Service brought the Resource Group ""Group 0"" online."<br />

• If you review the cluster log, you find messages similar to the following examples:<br />

Cluster Log for Surviving Site (Site 2)<br />

000005a0.00000fdc::2008/02/02-21:57:33.543 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />

MM_INVALID_NODE, chose the default target<br />

00000ec8.000008b4::2008/02/02-22:09:03.139 ERR Unisys <strong>SafeGuard</strong> 30m Control : KfLogit:<br />

Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the<br />

Management Console that the WAN connection is operational.<br />

00000ec8.00000f48::2008/02/02-22:10:39.715 ERR Unisys <strong>SafeGuard</strong> 30m Control : KfLogit:<br />

Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the<br />

Management Console that the WAN connection is operational.<br />

Cluster Log for Failure Site (Site 1)<br />

0000033c.000008e4::2008/02/02-22:09:47.159 ERR Unisys <strong>SafeGuard</strong> 30m Control :<br />

KfGetKboxData: get_system_settings command failed. Error: (2685470674).<br />

0000033c.000008e4::2008/02/02-22:09:47.159 ERR Unisys <strong>SafeGuard</strong> 30m Control :<br />

UrcfKConGroupOnlineThread: Error 1117 bringing resource online. (Error 1117: the request could not be<br />

performed because of an I/O device error).<br />

0000033c.00000b8c::2008/02/02-22:10:08.168 ERR Unisys <strong>SafeGuard</strong> 30m Control :<br />

KfGetKboxData: get_version command failed. Error: (2685470674).<br />

0000033c.00000b8c::2008/02/02-22:10:29.146 ERR Unisys <strong>SafeGuard</strong> 30m Control :<br />

KfGetKboxData: get_system_settings command failed. Error: (2685470674).<br />

0000033c.00000b8c::2008/02/02-22:10:29.146 ERR Unisys <strong>SafeGuard</strong> 30m Control :<br />

UrcfKConGroupOnlineThread: Error 1117 bringing resource online. (Error 1117: the request could not be<br />

performed because of an I/O device error).<br />
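If the cluster logs have been exported to text files, the signatures above can be located quickly with a small script. The following is a minimal sketch, assuming the log has been saved as a plain-text file whose path you supply; the search strings are taken from the log examples in this section.

# Sketch: scan an exported cluster log for the error signatures shown above.
# The file path is supplied by you; the signatures come from the examples
# in this section.
import sys

SIGNATURES = (
    "Cannot complete transfer for auto failover",
    "get_system_settings command failed",
    "Error 1117 bringing resource online",
    "MM_INVALID_NODE",
)

def scan_cluster_log(path):
    """Print every log line that contains one of the known signatures."""
    with open(path, "r", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if any(signature in line for signature in SIGNATURES):
                print(f"{number}: {line.rstrip()}")

if __name__ == "__main__":
    scan_cluster_log(sys.argv[1])   # for example: python scan_log.py cluster.log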



Actions to Resolve the Problem<br />


If all RAs on site 1 fail and site 2 owns the quorum resource, you do not need to perform<br />

manual recovery. Because the surviving site owns the quorum consistency group, MSCS<br />

automatically restarts, and the data consistency group fails over on the surviving site.<br />

Recovery When All RAs and All Servers Fail on One<br />

Site<br />

The following two cases describe an event in which a complete site fails (for example,<br />

site 1) and all data I/O, cluster node communication, disk reservations, and so forth, stop<br />

responding. MSCS nodes on site 2 detect a network heartbeat loss and loss of disk<br />

reservations, and try to take over the cluster groups that had been running on the nodes<br />

that failed.<br />

There are two cases for recovering from this failure based on which site owns the<br />

quorum group:<br />

• The RAs and servers fail on site 1 and that site owns the quorum group.<br />

• The RAs and servers fail on site 1 and site 2 owns the quorum group.<br />

Manual recovery of MSCS is required as described in the following topic, “Site 1 Failure<br />

(Site 1 Quorum Owner).”<br />

If the site can recover in an acceptable amount of time and the quorum owner does not<br />

reside on the failed site, manual recovery should not be performed.<br />

The two cases that follow respond differently and are solved differently based on where<br />

the quorum owner resides.<br />

Site 1 Failure (Site 1 Quorum Owner)<br />

Problem Description<br />

In the first failure case, all nodes at site 1 fail as well as the RAs. Thus, the RAs must fail<br />

quorum arbitration attempts initiated by nodes on the surviving site. Because the RAs on<br />

the surviving site (site 2) are not able to communicate over the communication<br />

networks, the RAs assume that it is a WAN network failure and do not allow automatic<br />

failover of cluster resources.<br />

MSCS attempts to fail over to a node at site 2. Because the quorum resource was<br />

owned by site 1, site 2 must be brought up using the manual quorum recovery<br />

procedure.<br />

Figure 4–3 illustrates this case.<br />



Figure 4–3. All RAs and Servers Fail on Site 1 (Site 1 Quorum Owner)


Symptoms<br />


The following symptoms might help you identify this failure:<br />

• The management console display shows errors and messages similar to those for<br />

“Total Communication Failure in a Geographic Clustered Environment” in Section 7.<br />

• If you review the system event log, you find messages similar to the following<br />

examples:<br />

System Event Log for Usmv-East2 Host (Failure Site)<br />

8/3/2008 10:46:01 AM ClusSvc Error Startup/Shutdown 1073 N/A USMV-EAST2 Cluster service<br />

was halted to prevent an inconsistency within the server cluster. The error code was 5892 (The<br />

membership engine requested shutdown of the cluster service on this node).<br />

8/3/2008 10:46:00 AM ClusSvc Error Membership Mgr 1177 N/A USMV-EAST2 Cluster service is<br />

shutting down because the membership engine failed to arbitrate for the quorum device. This could be<br />

due to the loss of network connectivity with the current quorum owner. Check your physical network<br />

infrastructure to ensure that communication between this node and all other nodes in the server cluster is<br />

intact.<br />

8/3/2008 10:47:40 AM ClusSvc Error Startup/Shutdown 1009 N/A USMV-EAST2 Cluster service<br />

could not join an existing server cluster and could not form a new server cluster. Cluster service has<br />

terminated.<br />

8/3/2008 10:50:16 AM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting<br />

a bus reset for device \Device\ClusDisk0.<br />

• If you review the cluster log, you find messages similar to the following examples:<br />

Cluster Log for Surviving Site (Site 2)<br />

00000c54.000008f4::2008/02/02-17:13:31.901 ERR [NMJOIN] Unable to begin join, status 1717 (the NIC<br />

interface is unknown).<br />

00000c54.000008f4::2008/02/02-17:13:31.901 ERR [CS] ClusterInitialize failed 1717<br />

00000c54.000008f4::2008/02/02-17:13:31.917 ERR [CS] Service Stopped. exit code = 1717<br />

00000be0.000008e0::2008/02/02-17:14:53.686 ERR [JOIN] Unable to connect to any sponsor node.<br />

00000be0.000008e0::2008/02/02-17:14:56.374 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />

MM_INVALID_NODE, chose the default target<br />

000001e0.00000bac::2008/02/02-17:16:37.563 ERR IP Address : WorkerThread:<br />

GetClusterNotify failed with status 6.<br />

00000e8c.00000ea8::2008/02/02-17:30:20.275 ERR Physical Disk : [DiskArb] Signature of disk<br />

has changed or failed to find disk with id, old signature 0xe1e7208e new signature 0xe1e7208e, status 2<br />

(the system cannot find the file specified).<br />

00000e8c.00000ea8::2008/02/02-17:30:20.289 ERR Physical Disk : SCSI: Attach, error<br />

attaching to signature e1e7208e, error 2.<br />

000008e8.000008fc::2008/02/02-17:30:20.289 ERR [FM] FmGetQuorumResource failed, error 2.<br />

000008e8.000008fc::2008/02/02-17:30:20.289 ERR [INIT] ClusterForm: Could not get quorum resource.<br />

No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service).<br />

000008e8.000008fc::2008/02/0-17:30:20.289 ERR [INIT] Failed to form cluster, status 5086.<br />

000008e8.000008fc::2008/02/02-17:30:20.289 ERR [CS] ClusterInitialize failed 5086<br />

000008e8.000008fc::2008/02/02-17:30:20.360 ERR [CS] Service Stopped. exit code = 5086<br />

00000710.00000e80::2008/02/02-17:55:02.092 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />

MM_INVALID_NODE, chose the default target<br />

000009cc.00000884::2008/02/02-17:55:12.413 ERR Unisys <strong>SafeGuard</strong> 30m Control : KfLogit:<br />




Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the<br />

Management Console that the WAN connection is operational.<br />

Cluster Log for Failure Site (Site 1)<br />

00000dc8.00000c48::2008/02/02-17:12:53.942 ERR Physical Disk : [DiskArb] GetPartInfo<br />

completed, status 2.<br />

00000dc8.00000c48::2008/02/02-17:12:53.942 ERR Physical Disk : [DiskArb] Failed to read<br />

(sector 12), error 2.<br />

00000dc8.00000c48::2008/02/02-17:12:55.942 ERR Physical Disk : [DiskArb] GetPartInfo<br />

completed, status 2.<br />

00000dc8.00000c48::2008/02/02-17:12:55.942 ERR Physical Disk : [DiskArb] Failed to write<br />

(sector 12), error 2.<br />

00000fe4.00000810::2008/02/02-17:13:20.030 ERR [FM] Failed to arbitrate quorum resource c336021a-<br />

083e-4fa0-9d37-7077a590c206, error 2.<br />

00000fe4.00000810::2008/02/02-17:13:20.030 ERR [RGP] Node 1: REGROUP ERROR: arbitration failed.<br />

00000fe4.00000810::2008/02/02-17:13:20.030 ERR [NM] Halting this node due to membership or<br />

communications error. Halt code = 1000<br />

00000fe4.00000810::2008/02/02-17:13:20.030 ERR [CS] Halting this node to prevent an inconsistency<br />

within the cluster. Error status = 5892 (The membership engine requested shutdown of the cluster<br />

service on this node).<br />

00000dc8.00000f34::2008/02/02-17:13:20.670 ERR Unisys <strong>SafeGuard</strong> 30m Control : KfLogit:<br />

Online resource failed. Pending processing terminated by resource monitor.<br />

00000dc8.00000f34::2008/02/02-17:13:20.670 ERR Unisys <strong>SafeGuard</strong> 30m Control :<br />

UrcfKConGroupOnlineThread: Error 1117 bringing resource online.<br />

000009e4::2008/02/02-17:29:20.587 ERR [FM] FmGetQuorumResource failed, error 2.<br />

000008e4.000009e4::2008/02/02-17:29:20.587 ERR [INIT] ClusterForm: Could not get quorum resource.<br />

No fixup attempted. Status = 5086<br />

000008e4.000009e4::2008/02/02-17:29:20.587 ERR [INIT] Failed to form cluster, status 5086.<br />

000008e4.000009e4::2008/02/02-17:29:20.587 ERR [CS] ClusterInitialize failed 5086<br />

000008e4.000009e4::2008/02/02-17:29:20.602 ERR [CS] Service Stopped. exit code = 5086<br />

000005b4.000008cc::2008/02/02-17:31:11.075 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />

MM_INVALID_NODE, chose the default target<br />

00000ff4.000008d8::2008/02/02-17:31:19.901 ERR Unisys <strong>SafeGuard</strong> 30m Control : KfLogit:<br />

Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the<br />

Management Console that the WAN connection is operational.<br />

Actions to Resolve the Problem<br />

If all RAs and servers on site 1 fail and site 1 owns the quorum resource, perform the<br />

following tasks to recover:<br />

1. Perform a manual failover of the quorum consistency group.<br />

2. Reverse replication direction.<br />

3. Start MSCS.<br />

4. Power on the site if a power failure occurred.<br />

5. Restore the failover settings.<br />

Note: Do not bring up any nodes until the manual recovery process is complete.<br />




Caution<br />

Manual recovery is required only if the quorum device is lost because of a<br />

failure of an RA cluster.<br />

If the cluster nodes at the production site are operational, you must disable<br />

MSCS. You must verify the server status with a network test or attempt to<br />

log in to the server. Use the procedure in ”Recovery When All RAs Fail on<br />

Site 1 (Site 1 Quorum Owner).”<br />

Improper use of the manual recovery procedure can lead to an inconsistent<br />

quorum disk and unpredictable results that might require a long recovery<br />

process.<br />

Performing a Manual Failover of the Quorum Consistency Group<br />

To perform a manual failover of the quorum consistency group, follow the procedure<br />

given in the “Actions to Resolve the Problem” for “Recovery When All RAs Fail on Site 1<br />

(Site 1 Quorum Owner)” earlier in this section.<br />

Reversing Replication Direction<br />

1. Select the Consistency Group from the navigation pane.<br />

2. Right-click the Group and select Pause Transfer. Click Yes when the system<br />

prompts that the group activity will be paused.<br />

3. Click the Status tab. The status of the transfer must display Paused.<br />

4. Right-click the Consistency Group and select Failover to <br />

5. Click Yes when the system prompts to confirm failover.<br />

6. Ensure that the Start data transfer immediately check box is selected.<br />

The following warning message appears:<br />

Warning: Journal will be erased. Do you wish to continue?<br />

7. Click Yes to continue.<br />

Starting MSCS<br />

MSCS should start within 1 minute on the surviving nodes when the MSCS recovery<br />

setting is enabled. You can manually start MSCS on each node of the surviving site by<br />

completing the following steps:<br />

1. In the Control Panel, point to Administrative Tools, and then click<br />

Services.<br />

2. Right-click Cluster Service, and click Start.<br />

MSCS starts the cluster group and automatically moves all groups to the<br />

first-started cluster node.<br />

3. Repeat steps 1 through 2 for each node on the site.<br />
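If you prefer to issue the start request from a command prompt instead of the Services console, a small script can send the same request to each surviving node. This is a sketch only; the node name is a placeholder, and it assumes the standard Cluster service name (ClusSvc) and that remote service control is permitted in your environment.

# Sketch: start the Cluster service (ClusSvc) on each surviving node
# using the Windows sc.exe service control utility.
# Node names below are placeholders for this example.
import subprocess

SURVIVING_NODES = ("USMV-EAST2",)   # replace with the nodes of the surviving site

def start_cluster_service(node):
    """Ask sc.exe to start the Cluster service on the given node."""
    return subprocess.run(["sc", rf"\\{node}", "start", "clussvc"],
                          capture_output=True, text=True)

if __name__ == "__main__":
    for node in SURVIVING_NODES:
        result = start_cluster_service(node)
        print(node, "return code", result.returncode)
        print(result.stdout or result.stderr)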




Powering-on a Site<br />

If a site experienced a power failure, power on the site in the following order:<br />

• Switches<br />

• Storage<br />

Note: Wait until all switches and storage units are initialized before continuing to<br />

power on the site.<br />

• RAs<br />

Note: Wait 10 minutes after you power on the RAs before you power on the hosts.<br />

• Hosts<br />
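The 10-minute wait between powering on the RAs and powering on the hosts can be timed and verified from a management workstation. The following is a small sketch only; the RA addresses are placeholders you replace, and it simply waits and then confirms that the RAs answer a ping before you continue with the hosts.

# Sketch: time the 10-minute wait after powering on the RAs, then confirm
# the RAs respond to ping before the hosts are powered on.
# Addresses are placeholders for this example.
import subprocess
import time

RA_ADDRESSES = ("10.0.0.10", "10.0.0.11")   # replace with your RA addresses

def ping(address):
    """Return True if one ping succeeds (Windows-style 'ping -n 1')."""
    result = subprocess.run(["ping", "-n", "1", address], capture_output=True)
    return result.returncode == 0

if __name__ == "__main__":
    time.sleep(10 * 60)                      # the 10-minute wait noted above
    for address in RA_ADDRESSES:
        status = "responding" if ping(address) else "not responding"
        print(address, status)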

Restoring the Global Cluster Mode Property for the Quorum Group<br />

Once the primary site is again operational and you have verified that all nodes at both<br />

sites are online in the cluster, restore the failover settings by completing the following<br />

steps:<br />

1. Click the Quorum Consistency Group (that is, the consistency group that holds<br />

the quorum drive) from the navigation pane.<br />

2. Click the Policy tab.<br />

3. Under Advanced, select Auto-quorum (shared quorum) in the Global<br />

cluster mode list.<br />

4. Ensure that the Allow Regulation box check box is selected.<br />

5. Click Apply.<br />



Site 1 Failure (Site 2 Quorum Owner)

Problem Description

If the quorum group is running on site 2 and a complete site failure occurs on site 1, a quorum failover is not required. Only data groups on the failed site will require failover. All data that is not mirrored and was in the failed RA cache is lost; the latest image on the remote site is used to recover. Cluster services will be up on all nodes on site 2, and cluster nodes will fail on site 1. You cannot move a group to nodes on a site where the RAs are down (site 1).

MSCS attempts to fail over to a node at site 2. An e-mail alert is sent stating that a site or RA cluster has failed.

Figure 4–4 illustrates this case.

Figure 4–4. All RAs and Servers Fail on Site 1 (Site 2 Quorum Owner)



Symptoms<br />

The following symptoms might help you identify this failure:<br />

• The management console display shows errors and messages similar to those for<br />

“Total Communication Failure in a Geographic Clustered Environment” in Section 7.<br />

• If you review the system event log, you find messages similar to the following<br />

examples:<br />

System Event Log for Usmv-West2 (Failure Site)<br />

8/3/2006 1:49:26 PM ClusSvc Information Failover Mgr 1205 N/A USMV-WEST2 "The Cluster<br />

Service failed to bring the Resource Group ""Cluster Group"" completely online or offline."<br />

8/3/2008 1:49:26 PM ClusSvc Information Failover Mgr 1203 N/A USMV-WEST2 "The Cluster<br />

Service is attempting to offline the Resource Group ""Cluster Group""."<br />

8/3/2008 1:50:46 PM ClusDisk Error None 1209 N/A USMV-WEST2 Cluster service is requesting a<br />

bus reset for device \Device\ClusDisk0.<br />

• If you review the cluster log, you find messages similar to the following examples:<br />

Cluster Log for Failure Site (Site 1)<br />

00000e50.00000c10::2008/02/02-20:50:46.165 ERR Physical Disk : [DiskArb] GetPartInfo<br />

completed, status 170 (the requested resource is in use).<br />

00000e50.00000c10::2008/02/02-20:50:46.165 ERR Physical Disk : [DiskArb] Failed to read<br />

(sector 12), error 170.<br />

00000e50.00000fb4::2008/02/02-20:52:05.133 ERR IP Address : WorkerThread:<br />

GetClusterNotify failed with status 6 (the handle is invalid).<br />

Cluster Log for Surviving Site (Site 2)<br />

00000178.00000dd8::2008/02/02-20:49:30.976 ERR Physical Disk : [DiskArb] GetPartInfo<br />

completed, status 170.<br />

00000178.00000dd8::2008/02/02-20:49:30.992 ERR Physical Disk : [DiskArb] Failed to read<br />

(sector 12), error 170.<br />

00000d80.00000e68::2008/02/02-20:49:57.679 ERR [GUM] GumSendUpdate: GumQueueLocking update<br />

to node 1 failed with 1818 (The remote procedure call was cancelled).<br />

00000d80.00000e68::2008/02/02-20:49:57.679 ERR [GUM] GumpCommFailure 1818 communicating<br />

with node 1<br />

00000178.00000810::2008/02/02-20:50:45.492 ERR IP Address : WorkerThread:<br />

GetClusterNotify failed with status 6 (The handle is invalid).<br />

Actions to Resolve the Problem<br />

If all RAs and all servers on site 1 fail and site 2 owns the quorum resource, you do not<br />

need to perform manual recovery. Because the surviving site owns the quorum<br />

consistency group, MSCS automatically restarts, and the data consistency group fails<br />

over on the surviving site.<br />

4–26 6872 5688–002


Section 5<br />

Solving Storage Problems<br />

This section lists symptoms that usually indicate problems with storage. Table 5–1 lists<br />

symptoms and possible problems indicated by the symptom. The problems and their<br />

solutions are described in this section. The graphics, behaviors, and examples in this<br />

section are similar to what you observe with your system but might differ in some<br />

details.<br />

In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />

for possible problems. Also, messages similar to e-mail notifications might be displayed<br />

on the management console. If you do not see the messages, they might have already<br />

dropped off the display. Review the management console logs for messages that have<br />

dropped off the display.<br />

Table 5–1. Possible Storage Problems with Symptoms

Possible problem: User or replication volume not accessible
Symptoms:
• The system pauses the transfer for the relevant consistency group.
• The server cannot access this volume; writes to this volume fail; the file system cannot be mounted; and so forth.
• The management console shows an error for all connections to this volume, that is, all RAs on the relevant site and all splitters attached to this volume.

Possible problem: Repository volume not accessible
Symptoms:
• The system pauses the transfer for all consistency groups.
• The management console shows an error for all connections to this volume, that is, all RAs on the relevant site and all splitters attached to this volume.
• The event log reports that the repository volume is inaccessible.
• The event log indicates that the repository volume is corrupted.

Possible problem: Journal not accessible
Symptoms:
• The management console shows an error for the connections between this volume and all RAs on the relevant site.
• The system pauses the transfer for the relevant consistency group.
• The event log indicates that the journal was lost or corrupted.

Possible problem: Total storage loss in a geographic replicated environment
Symptoms:
• No volumes from the relevant target and worldwide name (WWN) are accessible to any initiator on the SAN.

Possible problem: Storage failure on one site with quorum owner on failed site in a geographic clustered environment
Symptoms:
• The cluster regroup process begins and the quorum device fails over to a site without failed storage.
• The management console shows a storage error and replication has stopped.
• Servers report multipath software errors.

Possible problem: Storage failure on one site with quorum owner on surviving site in a geographic clustered environment
Symptoms:
• Applications that depend on physical disk resources go offline and fail when attempting to come online.
• Once resource retry threshold parameters are reached, site 1 fails over to site 2. With the default settings, this timing is about 30 minutes.


Table 5–2 lists specific storage volume failures and the types of errors and indicators on the management console that distinguish each failure.

Table 5–2. Indicators and Management Console Errors to Distinguish Different Storage Volume Failures

Failure: Data volume lost or failed
• Groups Paused Status: Relevant Data Group
• System Status: Storage error
• Volumes Tab: Replication volume with error status
• Logs Tab: Error 3012

Failure: Journal volume lost, failed, or corrupt
• Groups Paused Status: Relevant Data Group
• System Status: Storage error
• Volumes Tab: Journal volume with error status
• Logs Tab: Error 3012

Failure: Repository volume lost, failed, or corrupt
• Groups Paused Status: All
• System Status: Storage and RA error/failure
• Volumes Tab: Repository volume with error status
• Logs Tab: Error 3014
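The event IDs in Table 5–2 can help triage entries reviewed from the Logs tab. The following is a minimal sketch that maps those IDs to the failures described in this section; the helper name and the example log review workflow are assumptions, not product specifics.

# Sketch: map the event IDs listed in Table 5-2 to the storage failures
# described in this section.

STORAGE_EVENT_IDS = {
    3012: "RA unable to access a replication or journal volume "
          "(data volume or journal volume lost, failed, or corrupt)",
    3014: "RA unable to access the repository volume, or the repository "
          "volume is corrupted",
}

def triage(event_id: int) -> str:
    """Return the likely storage failure for a Logs tab event ID."""
    return STORAGE_EVENT_IDS.get(event_id, "Not a storage event ID from Table 5-2")

if __name__ == "__main__":
    for eid in (3012, 3014, 4003):
        print(eid, "->", triage(eid))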



User or Replication Volume Not Accessible<br />

Problem Description<br />

Symptoms<br />

The replication volume is not accessible to any host or splitter.<br />

The following symptoms might help you identify this failure:<br />

• The management console shows an error for storage and the Volumes tab (status<br />

column) shows additional errors (See Figure 5–1).<br />

Figure 5–1. Volumes Tab Showing Volume Connection Errors

• Warnings and informational messages similar to those shown in Figure 5–2 appear<br />

on the management console. See the table after the figure for an explanation of the<br />

numbered console messages.<br />

5–4 6872 5688–002


Solving Storage Problems<br />

Figure 5–2. Management Console Messages for the User Volume Not Accessible<br />

Problem<br />

The following table explains the numbered messages in Figure 5–2. Each message also generates an e-mail notification (immediate or daily summary).

Reference No. 1, Event ID 4003: Group capabilities problem, with the details showing that the RA is unable to access the volume.
Reference No. 2, Event ID 3012: The RA is unable to access the volume.

• The Groups tab on the management console shows that the system paused the<br />

transfer for the relevant consistency group. (See Figure 5–3.)<br />

Figure 5–3. Groups Tab Shows “Paused by System”<br />

• The server cannot access this volume; writes to this volume fail; the file system<br />

cannot be mounted; and so forth.<br />




Actions to Resolve<br />

Perform the following actions to isolate and resolve the problem:<br />

• Determine whether other volumes from the same storage device are accessible to<br />

the same RAs to rule out a total storage loss. If no volumes are seen by an RA, refer<br />

to “Total Storage Loss in a Geographic Replicated Environment.”<br />

• Verify that this LUN still exists and has not failed or been removed from the storage<br />

device.<br />

• Verify that the LUN is masked to the proper splitter or splitters and RAs.<br />

• Verify that other servers in the SAN do not use this volume. For example, if an<br />

MSCS cluster in the SAN acquired ownership of this volume, it might reserve the<br />

volume and block other initiators from seeing the volume.<br />

• Verify that the volume has read and write permissions on the storage system.<br />

• Verify that the volume, as configured in the management console, has the expected<br />

WWN and LUN.<br />

Repository Volume Not Accessible<br />

Problem Description<br />

Symptoms<br />

The repository volume is not accessible to any SAN-attached initiator, including the<br />

splitter and RAs.<br />

Or, the repository volume is corrupted, either by another initiator because of storage changes or as a result of storage failure. You must reformat the repository volume before replication can proceed normally.

The following symptoms might help you identify this failure:<br />

• The management console shows an error for all connections to this volume—that is,<br />

all RAs on the relevant site and all splitters attached to this volume. The RAs tab on<br />

the management console shows errors for the volume. (See Figure 5–4.)<br />

The following error messages appear for the RAs error condition when you click<br />

Details:<br />

Error: RA 1 in Sydney can't access repository volume<br />

Error: RA 2 in Sydney can't access repository volume<br />

The following error message appears for the storage error condition, when you click<br />

Details:<br />

Error: Repository volume can't be accessed by any RAs<br />

5–6 6872 5688–002



Figure 5–4. Management Console Display: Storage Error and RAs Tab Shows<br />

Volume Errors<br />

• The Volumes tab on the management console shows an error for the repository<br />

volume, as shown in Figure 5–5.<br />

Figure 5–5. Volumes Tab Shows Error for Repository Volume<br />

• The Groups tab on the management console shows that the system paused the<br />

transfer for all consistency groups, as shown in Figure 5–6.<br />

Figure 5–6. Groups Tab Shows All Groups Paused by System<br />

• The Logs tab on the management console lists a message for event ID 3014. This<br />

message indicates that the RA is unable to access the repository volume or the<br />

repository volume is corrupted. (See Figure 5–7.)<br />




Figure 5–7. Management Console Messages for the Repository Volume not<br />

Accessible Problem<br />

Actions to Resolve<br />

Perform the following actions to isolate and resolve the problem:<br />

• Determine whether other volumes from the same storage device are accessible to<br />

the same RAs to rule out a total storage loss. If no volumes are seen by an RA, refer<br />

to “Total Storage Loss in a Geographic Replicated Environment.”<br />

• Verify that this LUN still exists and has not failed or been removed from the storage<br />

device.<br />

• Verify that the LUN is masked to the proper splitter or splitters and RAs.<br />

• Verify that other servers in the SAN do not use this volume. For example, if an<br />

MSCS cluster in the SAN acquired ownership of this volume, it might reserve the<br />

volume and block other initiators from seeing the volume.<br />

• Verify that the volume has read and write permissions on the storage system.<br />

• Verify that the volume, as configured in the management console, has the expected<br />

WWN and LUN.<br />

• If the volume is corrupted or you determine that it must be reformatted, perform the<br />

steps in “Reformatting the Repository Volume.”<br />

Reformatting the Repository Volume<br />

Before you begin the reformatting process in a geographic clustered environment, be<br />

sure that all groups are located at the site for which the repository volume is not to be<br />

formatted.<br />

On RA 1 at the site for which the repository volume is to be formatted, determine from<br />

the Site Planning <strong>Guide</strong> which LUN is used for the repository volume. If the LUN is not<br />

recorded for the repository volume, a list is presented during the volume formatting<br />

process that shows LUNs and the previously used repository volume is identified.<br />




Perform the following steps to reformat a repository volume for a particular site:<br />

1. Click the Data Group in the Management Console, and perform the following<br />

steps:<br />

a. Click Policy in the right pane and change the Global Cluster mode<br />

selection to Manual.<br />

b. Click Apply.<br />

c. Right-click the Data Group and select Disable Group.<br />

d. Click Yes when the system prompts that the copy activities will be stopped.<br />

2. Skip to step 6 for geographic replication environments.<br />

3. Perform the following steps for geographic clustered environments:<br />

a. Open the Group Policy window for the quorum group.<br />

b. Change the Global Cluster mode selection to Manual.<br />

c. Click Apply.<br />

4. Right-click the Consistency Group and select Disable Group.<br />

5. Click Yes when the system prompts that the copy activities will be stopped.<br />

6. Select the Splitters tab.<br />

a. Open the Splitter Properties window for the splitter.<br />

b. Select all the attached volumes.<br />

c. Click Detach and then click Apply.<br />

d. Click OK to close the window.<br />

e. Delete the splitter at the site for which the repository volume is to be<br />

reformatted.<br />

7. Open the PuTTY session on RA1 for the site.<br />

a. Log on with boxmgmt as the User ID and boxmgmt as the password.<br />

The Main menu is displayed.<br />

b. At the prompt, type 2 (Setup) and press Enter.<br />

c. On the Setup menu, type 2 (Configure repository volume) and press Enter.<br />

d. Type 1 (Format repository volume) and press Enter.<br />

e. Enter the appropriate number from the list to select the LUN. Ensure that the WWN and LUN are for the volume that you want to format. The LUN and identifier are displayed.

f. Confirm the volume to format.<br />

All data is removed from the volume.<br />

g. Verify that the operation succeeds and press Enter.<br />

h. On the Main Menu, type Q (quit) and press Enter.<br />

8. Open a PuTTY session on each additional RA at the site for which the repository<br />

volume is to be formatted.<br />




9. Log on with boxmgmt as the user ID and boxmgmt as the password. The Main menu is displayed. (A scripted alternative to this menu navigation is sketched after this procedure.)

a. At the prompt, type 2 (Setup) and press Enter.<br />

b. On the Setup menu, type 2 (Configure repository volume) and press Enter.<br />

c. Type 2 (Select a previously formatted repository volume) and press Enter.<br />

d. Enter the appropriate number from the list to select the LUN. Ensure that the WWN and LUN are for the volume that you want to format. The LUN and identifier are displayed.

e. Confirm the volume to format. All data is removed from the volume.<br />

f. Verify that the operation succeeds and press Enter.<br />

g. On the Main menu, type Q (quit) and press Enter.<br />

Note: Complete step 9 for each additional RA at the site.<br />

10. On the Management Console, select the Splitters tab.<br />

a. Click the Add New Splitter icon to open the Add splitter window.<br />

b. Click Rescan and select the splitter.<br />

11. Open the Group Properties window and click the Policy tab and perform the<br />

following steps for each data group:<br />

a. Change the Global cluster mode selection to auto-data (shared<br />

quorum).<br />

b. Right-click the Data Group and click Enable Group.<br />

12. Skip to step 16 for geographic replication environments.<br />

13. Perform the following steps for geographic clustered environments.<br />

a. Right-click the Quorum Group and click Enable Group.<br />

b. Click the Quorum Group and select Policy in the right pane.<br />

c. Change the Global Cluster mode selection to Auto-quorum (shared<br />

quorum).<br />

14. Verify that initialization completes for all the groups.<br />

15. Review the Management Console event log.<br />

16. Ensure that no storage error or other component error appears.<br />
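For sites with several RAs, the PuTTY menu navigation in steps 7 through 9 can be scripted over SSH. The following is a rough sketch only, assuming the RA accepts a standard SSH login for the boxmgmt account and presents the menus exactly as described in the steps above; the RA address is a placeholder, and the interaction should be verified manually before any repository volume is formatted.

# Rough sketch: walk the boxmgmt setup menu over SSH (see steps 7-9 above).
# The RA address is a placeholder; menu choices follow the procedure text.
import time
import paramiko

def send_menu_choice(shell, choice, wait=2.0):
    """Send one menu selection followed by Enter and collect the response."""
    shell.send((choice + "\n").encode())
    time.sleep(wait)
    return shell.recv(65535).decode(errors="replace")

def open_repository_menu(ra_address):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    # Credentials as given in the procedure (boxmgmt / boxmgmt).
    client.connect(ra_address, username="boxmgmt", password="boxmgmt")
    shell = client.invoke_shell()
    time.sleep(2.0)                      # wait for the Main menu
    print(send_menu_choice(shell, "2"))  # 2 = Setup
    print(send_menu_choice(shell, "2"))  # 2 = Configure repository volume
    # Review the output, then choose option 1 (format repository volume) or
    # option 2 (select a previously formatted repository volume) interactively.
    client.close()

# Example (placeholder address):
# open_repository_menu("10.0.0.10")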



Journal Not Accessible

Problem Description

The journal is not accessible to either RA.

A journal for one of the consistency groups is corrupted. The corruption results from another initiator because of storage changes or as a result of storage failure. Because the snapshot history is corrupted, replication for the relevant consistency group cannot proceed.

Symptoms

The following symptoms might help you identify this failure:<br />

• The Volumes tab on the management console shows an error for the journal volume.<br />

(See Figure 5–8.)<br />

Figure 5–8. Volumes Tab Shows Journal Volume Error<br />

• The RAs tab on the management console shows errors for connections between<br />

this volume and the RAs. (See Figure 5–9.)<br />

Figure 5–9. RAs Tab Shows Connection Errors<br />




• The Groups tab on the management console shows that the system paused the<br />

transfer for the relevant consistency group, as shown in Figure 5–10.<br />

Figure 5–10. Groups Tab Shows Group Paused by System<br />

• The Logs tab on the management console lists a message for event ID 3012. This<br />

message indicates that the RA is unable to access the volume. (See Figure 5–11.)<br />

Figure 5–11. Management Console Messages for the Journal Not Accessible<br />

Problem<br />

Actions to Resolve<br />

Perform the following actions to isolate and resolve the problem:<br />

• Determine whether other volumes from the same storage device are accessible to<br />

the same RAs to rule out a total storage loss. If no volumes are seen by an RA, refer<br />

to “Total Storage Loss in a Geographic Replicated Environment.”<br />

• Verify that this LUN still exists on the storage device and that it is only masked to<br />

the RAs.<br />

• Verify that the volume has read and write permissions on the storage system.<br />

• Verify that the volume, as configured in the management console, has the expected<br />

WWN and LUN.<br />

• For a corrupted journal, check that the system recovers automatically by re-creating the data structures for the corrupted journal and that the system then initiates a full-sweep resynchronization. No manual intervention is needed.



Journal Volume Lost Scenarios

Problem Description

The journal volume is lost and will not be available in the scenarios described below.

Scenarios

• Data is written to the journal volume faster than the journal data is distributed to the replication volume, which results in journal data loss. In this case, the journal volume can become full, and any attempt to perform a write operation on it creates a problem.

• The user performs the following operations:

− Failover

− Recover production

Actions to Resolve

You can minimize the occurrence of this problem in the first scenario by carefully configuring the Journal Lag. It is unavoidable in the second scenario.

Total Storage Loss in a Geographic Replicated Environment

Problem Description

All volumes belonging to a certain storage target and WWN (or controller, device) have been lost.

Symptoms

The following symptoms might help you identify this failure:

• The symptoms can be the same as those from any of the volume failure problems<br />

listed previously (or a subset of those symptoms), if the symptoms are relevant to<br />

the volumes that were used on this target. All volumes common to a particular<br />

storage array have failed.<br />

The Volumes tab on the management console shows errors for all volumes. (See<br />

Figure 5–12.)<br />




Figure 5–12. Management Console Volumes Tab Shows Errors for All Volumes<br />

• No volumes from the relevant target and WWN are accessible to any initiator on the<br />

SAN, as shown on the RAs tab on the management console. (See Figure 5–13.)<br />

Figure 5–13. RAs Tab Shows Volumes That Are Not Accessible<br />

• Multipathing software (such as EMC PowerPath Administrator) reports failed paths<br />

to the storage device, as shown in Figure 5–14.<br />



Figure 5–14. Multipathing Software Reports Failed Paths to Storage Device

Actions to Resolve

Perform the following actions to isolate and resolve the problem:

• Verify that the storage device has not experienced a power outage and that the device is functioning normally according to all external indicators.

• Verify that the Fibre Channel switch and the storage device indicate an operating Fibre Channel connection (that is, the relevant LEDs show OK). If the indicators are not OK, the problem might be a faulty Fibre Channel port (storage, switch, or patch panel) or a faulty Fibre Channel cable.

• Verify that the initiator can be seen from the switch name server. If not, the problem could be a Fibre Channel port or cable problem (as in the preceding item). Otherwise, the problem could be a misconfiguration of the port on the switch (for example, type or speed could be wrong).

• Verify that the target WWN is included in the relevant zones (that is, hosts and RA). Verify also that the current zoning configuration is the active configuration. If you use the default zone, verify that it is set to permit by default.

• Verify that the relevant LUNs still exist on the storage device and are masked to the proper splitters and RAs.

• Verify that volumes have read and write permissions on the storage system.

• Verify that these volumes are exposed and managed by the proper hosts and that there are no other hosts on the SAN that use this volume.


Storage Failure on One Site in a Geographic Clustered Environment

In a geographic clustered environment where MSCS is running, if the storage subsystem on one site fails, the symptoms and resulting actions depend on whether the quorum owner resided on the failed storage subsystem.

To understand the two scenarios and to follow the actions for both possibilities, review Figure 5–15.

Figure 5–15. Storage on Site 1 Fails


Storage Failure on One Site with Quorum Owner on Failed Site

Problem Description

In this case, the cluster quorum owner as well as the quorum resource resides on the failed storage subsystem.

The quorum and resource automatically fail over to the node that gains control through MSCS arbitration. This node resides on the site without the storage failure.

The RAs use the last available image. This action results in a loss of data that has yet to be replicated. The resources cannot fail back to the failed site until the storage subsystem is restored.

Symptoms

The following symptoms might help you identify this failure.

• A node on which the cluster was running might report a delayed write failure or similar error.

• The quorum reservation is lost, and MSCS stops on the cluster node that owned the quorum resource. This action triggers a cluster “regroup” process, which allows other cluster nodes to arbitrate for the quorum device. Figure 5–16 shows typical listings for the cluster regroup process.

Figure 5–16. Cluster “Regroup” Process



• Cluster nodes located on the failed storage subsystem fail quorum arbitration<br />

because the service cannot provide a reservation on the quorum volume. The<br />

resources fail over to the site without a storage failure. The first cluster node on the<br />

site without the storage failure that successfully completes arbitration of the quorum<br />

device assumes ownership of the cluster.<br />

The following messages illustrate this process.<br />

Cluster Log Entries<br />

INFO Physical Disk : [DiskArb]------- DisksArbitrate -------.<br />

INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Attaching to disk with<br />

signature f6fb216<br />

INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Disk unique id present<br />

trying new attach<br />

INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Retrieving disk number<br />

from ClusDisk registry key<br />

INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Retrieving handle to<br />

PhysicalDrive9<br />

INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Returns success.<br />

INFO Physical Disk : [DiskArb] Arbitration Parameters: ArbAttempts 5,<br />

SleepBeforeRetry 500 ms.<br />

INFO Physical Disk : [DiskArb] Read the partition info to insure the disk is<br />

accessible.<br />

INFO Physical Disk : [DiskArb] Issuing GetPartInfo on signature f6fb216.<br />

INFO Physical Disk : [DiskArb] GetPartInfo completed, status 0.<br />

INFO Physical Disk : [DiskArb] Arbitrate for ownership of the disk by<br />

reading/writing various disk sectors.<br />

INFO Physical Disk : [DiskArb] Successful read (sector 12) [:0]<br />

(0,00000000:00000000).<br />

INFO Physical Disk : [DiskArb] Successful write (sector 11) [USMV-DL580:0]<br />

(0,6ddd5cac:01c6d778).<br />

INFO Physical Disk : [DiskArb] Successful read (sector 12) [:0]<br />

(0,00000000:00000000).<br />

INFO Physical Disk : [DiskArb] Successful write (sector 12) [USMV-DL580:0]<br />

(0,6ddd5cac:01c6d778).<br />

INFO Physical Disk : [DiskArb] Successful read (sector 11) [USMV-DL580:0]<br />

(0,6ddd5cac:01c6d778).<br />

INFO Physical Disk : [DiskArb] Issuing Reserve on signature f6fb216.<br />

INFO Physical Disk : [DiskArb] Reserve completed, status 0.<br />

INFO Physical Disk : [DiskArb] CompletionRoutine starts.<br />

INFO Physical Disk : [DiskArb] Posting request to check reserve progress.<br />

INFO Physical Disk : [DiskArb] ********* IO_PENDING ********** - Request to insure<br />

reserves working is now posted.<br />

WARN Physical Disk : [DiskArb] Assume ownership of the device.<br />

INFO Physical Disk : [DiskArb] Arbitrate returned status 0.<br />

5–18 6872 5688–002


• In Cluster Administrator, the groups that were online on one node change to the node that wins arbitration, as shown in Figure 5–17.

Figure 5–17. Cluster Administrator Displays

• Multipathing software, if present, reports errors on the host servers of the site for which the storage subsystem failed. Figure 5–18 shows errors for failed storage devices.

Figure 5–18. Multipathing Software Shows Server Errors for Failed Storage Subsystem



Actions to Resolve<br />

Perform the following actions to isolate and resolve the problem:<br />

• Verify that all cluster resources failed over to a node on the site for which the<br />

storage subsystem did not fail and that these resources are online. If the cluster is<br />

running and no additional errors are reported, the problem has probably been isolated<br />

to a total site storage failure.<br />

• Log in to the storage subsystem, and verify that all LUNs are present and configured<br />

properly.<br />

• If the storage subsystem appears to be operating, the problem is most likely<br />

because of a failed SAN switch. See “Total SAN Switch Failure on One Site in a<br />

Geographic Clustered Environment” in Section 6.<br />

• Resolve the failure of the storage subsystem before attempting failback. Once the<br />

storage subsystem is working and the RAs and host can access it, a full initialization<br />

is initiated.<br />

Storage Failure on One Site with Quorum Owner on Surviving<br />

Site<br />

Problem Description<br />

Symptoms<br />

In this case, the cluster quorum owner does not reside on the failed storage subsystem,<br />

but other resources do reside on the failed storage subsystem.<br />

The cluster resources fail over to a site without a failed storage subsystem. The RAs use<br />

the last available image. This action results in a loss of data that has yet to be replicated<br />

(if not synchronous). The resources cannot fail back to the failed site until the storage<br />

subsystem is restored.<br />

The following symptoms might help you identify this failure:<br />

• The cluster marks the data groups containing the physical disk resources as failed.<br />

• Applications dependent on the physical disk resource go offline. Failed resources<br />

attempt to come online on the failed site, but fail. Then the resources fail over to the<br />

site with a valid storage subsystem.<br />

Actions to Resolve<br />

Perform the following actions to isolate and resolve the problem:<br />

• Verify that multipathing software, if present, reports errors on the host servers at the<br />

site with the suspected failed storage subsystem. (See Figure 5–19.)<br />

• Verify that all cluster resources failed over to site 2 in Cluster Administrator; a command-line check is sketched after this list. Entries similar to the following occur in the cluster log for a host at the site with a failed storage subsystem (thread ID and timestamp removed).

5–20 6872 5688–002


Cluster Log<br />


Disk reservation lost ..<br />

ERR Physical Disk : [DiskArb] CompletionRoutine: reservation lost! Status 2<br />

Arbitrate for disk ....<br />

INFO Physical Disk : [DiskArb] Arbitration Parameters: ArbAttempts 5,<br />

SleepBeforeRetry 500 ms.<br />

INFO Physical Disk : [DiskArb] Read the partition info to insure the disk is<br />

accessible.<br />

INFO Physical Disk : [DiskArb] Issuing GetPartInfo on signature f6fb211.<br />

ERR Physical Disk : [DiskArb] GetPartInfo completed, status 2.<br />

INFO Physical Disk : [DiskArb] Arbitrate for ownership of the disk by<br />

reading/writing various disk sectors.<br />

ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 2.<br />

INFO Physical Disk : [DiskArb] We are about to break reserve.<br />

INFO Physical Disk : [DiskArb] Issuing BusReset on signature f6fb211.<br />

Give up after 5 re-tries ...<br />

INFO Physical Disk : [DiskArb] We are about to break reserve.<br />

INFO Physical Disk : [DiskArb] Issuing BusReset on signature f6fb211.<br />

INFO Physical Disk : [DiskArb] BusReset completed, status 0.<br />

INFO Physical Disk : [DiskArb] Read the partition info from the disk to insure<br />

disk is accessible.<br />

INFO Physical Disk : [DiskArb] Issuing GetPartInfo on signature f6fb211.<br />

ERR Physical Disk : [DiskArb] GetPartInfo completed, status 2.<br />

ERR Physical Disk : [DiskArb] Failed to write (sector 12), error 2.<br />

ERR Physical Disk : Online, arbitration failed. Error: 2.<br />

INFO Physical Disk : Online, setting ResourceState 4 .<br />

Control goes offline at failed site...<br />

INFO [FM] FmpDoMoveGroup: Entry<br />

INFO [FM] FmpMoveGroup: Entry<br />

INFO [FM] FmpMoveGroup: Moving group 97ac3c3b-6985-44dd-bacd-a26e14966572 to node 4 (4)<br />

INFO [FM] FmpOfflineResource: Disk R: depends on Data1. Shut down first.<br />

INFO Unisys <strong>SafeGuard</strong> 30m Control : KfResourceOffline: Resource 'Data1' going<br />

offline.<br />

After trying other nodes at site move to remote site ...<br />

INFO [FM] FmpMoveGroup: Take group 97ac3c3b-6985-44dd-bacd-a26e14966572 request to remote<br />

node 4<br />

Move succeeds ...<br />

INFO [FM] FmpMoveGroup: Exit group , status = 0<br />

INFO [FM] FmpDoMoveGroup: Exit, status = 0<br />

INFO [FM] FmpDoMoveGroupOnFailure: FmpDoMoveGroup returns 0<br />

INFO [FM] FmpDoMoveGroupOnFailure Exit.<br />

INFO [GUM] s_GumUpdateNode: dispatching seq 5720 type 0 context 9<br />

INFO [FM] GUM update group 97ac3c3b-6985-44dd-bacd-a26e14966572, state 0<br />

INFO [FM] New owner of Group 97ac3c3b-6985-44dd-bacd-a26e14966572 is 2, state 0, curstate<br />

0.<br />

• Log in to the failed storage subsystem and determine whether the storage reports<br />

failed or missing disks. If the storage subsystem appears to be fine, the problem is<br />

most likely because of a SAN switch failure. See “Total SAN Switch Failure on One<br />

Site in a Geographic Clustered Environment” in Section 6.<br />

• Once the storage for the site that failed is back online, a full sweep is initiated. Check that the messages “Starting volume sweep” and “Starting full sweep” are displayed as an Events Notice.
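To confirm the failover from a command prompt rather than Cluster Administrator, the group owners can be listed with the cluster.exe utility. The following is a minimal sketch, assuming cluster.exe is available on the node where it runs; the site 2 node names are placeholders taken from the examples in this guide and should be replaced with your own.

# Sketch: confirm cluster groups are owned by surviving-site (site 2) nodes
# using cluster.exe. Node names are placeholders for this example.
import subprocess

SITE2_NODES = ("USMV-EAST2", "USMV-X455")   # replace with your site 2 node names

def cluster_group_listing() -> str:
    """Return the output of 'cluster group', which lists each group, its owner node, and its status."""
    result = subprocess.run(["cluster", "group"],
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    listing = cluster_group_listing()
    print(listing)
    for line in listing.splitlines()[2:]:          # skip the header lines
        if line.strip() and not any(node in line.upper() for node in SITE2_NODES):
            print("Check this group; it is not owned by a site 2 node:", line.strip())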

6872 5688–002 5–21




Section 6<br />

Solving SAN Connectivity Problems<br />

This section lists symptoms that usually indicate problems with connections to the<br />

storage subsystem. Table 6–1 lists symptoms and possible problems indicated by the<br />

symptom. The problems and their solutions are described in this section. The graphics,<br />

behaviors, and examples in this section are similar to what you observe with your<br />

system but might differ in some details.<br />

In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />

for possible problems. Also, messages similar to e-mail notifications might be displayed<br />

on the management console. If you do not see the messages, they might have already<br />

dropped off the display. Review the management console logs for messages that have<br />

dropped off the display.<br />

Table 6–1. Possible SAN Connectivity Problems

Possible problem: Volume not accessible to RAs
Symptoms:
• The system pauses the transfer. If the volume is accessible to another RA, a switchover occurs, and the relevant groups start running on the new RA.
• The relevant message appears in the event log.
• The link to the volume from the disconnected RA or RAs shows an error.
• The volume is accessible to the splitters that are attached to it.

Possible problem: Volume not accessible to SafeGuard 30m splitter
Symptoms:
• The system pauses the transfer for the relevant groups.
• If the volume is not accessible, the management console shows an error for the splitter. If a replication volume is not accessible, the splitter connection to that volume shows an error.




Possible problem: RAs not accessible to SafeGuard 30m splitter
Symptoms:
• The system pauses the transfer for the relevant group or groups. If the connection with only one of the RAs is lost, the group or groups can restart the transfer by means of another RA, beginning with a short initialization.
• The splitter connection to the relevant RAs shows an error.
• The relevant message describes the lost connection in the event log.

Possible problem: Server unable to connect with SAN (see “Server Unable to Connect with SAN” in Section 9; this problem is not described in this section)
Symptoms:
• The management console shows a server down.
• Messages on the management console show that the splitter is down and that the node fails over.
• Multipathing software (such as EMC PowerPath Administrator) messages report an error.

Possible problem: Total SAN switch failure on one site in a geographic clustered environment
Symptoms:
• Cluster nodes fail and the cluster regroup process begins.
• Applications fail and attempt to restart.
• Messages regarding failed physical disks are displayed on the management console.
• The cluster resources fail over to the remote site.



Volume Not Accessible to RAs

Problem Description

A volume (repository volume, replication volume, or journal) is not accessible to one or more RAs, but it is accessible to all other relevant initiators, that is, the splitter.

Symptoms

The following symptoms might help you identify this failure:<br />

• The system pauses the transfer. If the volume is accessible to another RA, a<br />

switchover occurs, and the relevant group or groups start running on the new RA.<br />

• The management console displays failures similar to those in Figure 6–1.<br />

Figure 6–1. Management Console Showing “Inaccessible Volume” Errors<br />

• Warnings and informational messages similar to those shown in Figure 6–2 appear<br />

on the management console. See the table after the figure for an explanation of the<br />

numbered console messages.<br />

Figure 6–2. Management Console Messages for Inaccessible Volumes<br />




The following table explains the numbered messages shown in Figure 6–2. Each message also generates an e-mail notification (immediate or daily summary).

Reference No. 1, Event ID 3012: The RA is unable to access the volume (RA 2, quorum).
Reference No. 2, Event ID 5049: Splitter writer to RA failed.
Reference No. 3, Event ID 4003: For each consistency group, the surviving site reports a group consistency problem. The details show a WAN problem.
Reference No. 4, Event ID 4044: The group is deactivated indefinitely by the system.
Reference No. 5, Event ID 4003: For each consistency group, a minor problem is reported. The details show that sides are not linked and also cannot transfer data.
Reference No. 6, Event ID 4001: For each consistency group, a minor problem is reported. The details show that sides are not linked and also cannot transfer data.
Reference No. 7, Event ID 5032: The splitter is splitting to replication volumes.

To see the details of the messages listed on the management console display, you<br />

must collect the logs and then review the messages for the time of the failure.<br />

Appendix A explains how to collect the management console logs, and Appendix E<br />

lists the event IDs with explanations.<br />

• If you review the Windows system event log, you can find messages similar to the<br />

following examples that are based on the testing cases used to generate the<br />

previous management console images:<br />

System Event Log for USMV-SYDNEY Host (Host on Failure Site)<br />

5/28/2008 9:31:53 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-SYDNEY<br />

Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration<br />

5/28/2008 9:31:53 PM Srv Warning None 2012 N/A USMV-SYDNEY While transmitting or receiving<br />

data, the server encountered a network error. Occasional errors are expected, but large amounts of these<br />

indicate a possible error in your network configuration. The error status code is contained within the<br />

returned data (formatted as Words) and may point you towards the problem.<br />

5/28/2008 9:31:54 PM Ftdisk Warning Disk 57 N/A USMV CAS100P2 the system failed to<br />

flush data to the transaction log. Corruption may occur.<br />

5/28/2008 9:32:54 PM Service Control Manager Information None 7035 CLUSTERNET\clusadminUSMV-<br />

SYDNEY The Windows Internet Name Service (WINS) service was successfully sent a stop control.<br />




System Event Log for Usmv-x455 Host (Host on Surviving Site)

5/28/2008 9:32:56 PM ClusSvc Warning Node Mgr 1123 N/A USMV-X455 The node lost communication with cluster node 'USMV-SYDNEY' on network 'Public'.
5/28/2008 9:32:56 PM ClusSvc Warning Node Mgr 1123 N/A USMV-X455 The node lost communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.
5/28/2008 9:33:10 PM ClusDisk Error None 1209 N/A USMV-X455 Cluster service is requesting a bus reset for device \Device\ClusDisk0.
5/28/2008 9:33:30 PM ClusSvc Warning Node Mgr 1135 N/A USMV-X455 Cluster node USMV-SYDNEY was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
5/28/2008 9:33:30 PM ClusSvc Information Failover Mgr 1200 N/A USMV-X455 "The Cluster Service is attempting to bring online the Resource Group ""Cluster Group""."
5/28/2008 9:33:34 PM ClusSvc Information Failover Mgr 1201 N/A USMV-X455 "The Cluster Service brought the Resource Group ""Cluster Group"" online."
5/28/2008 9:34:08 PM Service Control Manager Information None 7036 N/A USMV-X455 The Windows Internet Name Service (WINS) service entered the running state.

• If you review the cluster log, you can find messages similar to the following examples that are based on the testing cases used to generate the previous management console images:

Cluster Log for USMV-SYDNEY Host (Host on Failure Site)

00000e44.00000380::2008/05/28-21:31:53.841 ERR Physical Disk : [DiskArb] CompletionRoutine: reservation lost! Status 170 (Error 170: the requested resource is in use)
00000e44.00000380::2008/05/28-21:31:53.841 ERR [RM] LostQuorumResource, cluster service terminated...
00000e44.00000f0c::2008/05/28-21:31:55.011 ERR Network Name : Unable to open handle to cluster, status 1753. (Error 1753: there are no more endpoints available from the endpoint mapper)
00000e44.00000f08::2008/05/28-21:31:55.341 ERR IP Address : WorkerThread: GetClusterNotify failed with status 6. (Error 6: the handle is invalid)
00000e44.00000f0c::2008/05/28-1:35:56.125 ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 170.
00000e44.00000f0c::2008/05/28-1:35:56.125 ERR Physical Disk : [DiskArb] Error cleaning arbitration sector, error 170.

Cluster Log for Usmv-x455 Host (Host on Surviving Site)

0000015c.00000234::2008/05/28-21:31:56.299 INFO [ClMsg] Received interface unreachable event for node 1 network 2
0000015c.00000234::2008/05/28-21:31:56.299 INFO [ClMsg] Received interface unreachable event for node 1 network 1
00000688.00000e10::2008/05/28-1:35:10.712 ERR Physical Disk : [DiskArb] Signature of disk has changed or failed to find disk with id, old signature 0x98f3f0b new signature 0x98f3f0b, status 2. (Error 2: The system cannot find the file specified)
0000015c.000007c8::2008/05/28-1:35:31.136 WARN [NM] Interface f409cf69-9c30-48f0-8519-ad5dd14c3300 is unavailable (node: USMV-SYDNEY, network: Private LAN).
0000015c.000004fc::2008/05/28-1:35:31.136 WARN [NM] Interface 5019923b-d7a1-4886-825f-207b5938d11e is unavailable (node: USMV-SYDNEY, network: Public).
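As a quick way to triage a cluster log of this kind, the following minimal Python sketch counts which Win32 error and status codes appear on ERR lines; codes such as 170 (resource in use) or 1117 (I/O device error) point toward reservation and device problems. The log path shown is only an assumption (on Windows Server 2003 MSCS the cluster log is typically %SystemRoot%\Cluster\cluster.log); adjust it for your environment.

import re
from collections import Counter

CLUSTER_LOG = r"C:\WINDOWS\Cluster\cluster.log"   # placeholder path; adjust as needed

# ERR lines in the cluster log usually end with "error <n>." or "status <n>".
CODE = re.compile(r"\b(?:error|status)\s+(\d+)", re.IGNORECASE)

codes = Counter()
err_count = 0

with open(CLUSTER_LOG, "r", errors="replace") as log:
    for line in log:
        if " ERR " not in line:
            continue
        err_count += 1
        for value in CODE.findall(line):
            codes[value] += 1

print(f"ERR lines found: {err_count}")
for value, count in codes.most_common(10):
    print(f"error/status code {value}: {count} occurrence(s)")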

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

• Verify that the physical connection between the inaccessible RAs and the Fibre Channel switch is healthy.

• Verify that any disconnected RA appears in the name server of the Fibre Channel switch (see the sketch after this list). If not, the problem could be caused by a bad port on the switch, a bad host bus adapter (HBA), or a bad cable.

• Verify that any disconnected RA is present in the proper zone and that the current zoning configuration is enabled.

• Verify that the correct volume is configured (WWN and LUN). To double-check, enter the Create Volume command in the management console, and verify that the same volume does not appear on the list of volumes that are available to be “created.”

• If the volume is not accessible to the RAs but is accessible to a splitter, and the server on which that splitter is installed is clustered using MSCS, Oracle RAC, or any other software that uses a reservation method, the problem probably occurs because the server has reserved the volume.

For more information about the clustered environment installation process, see the Unisys SafeGuard Solutions Planning and Installation Guide and the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide.
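The following minimal Python sketch illustrates one way to script the name-server and zoning checks from a management host. It is only a sketch under stated assumptions: it assumes a Brocade-family Fibre Channel switch reachable over SSH with key-based authentication (where nsshow lists the devices registered in the name server and zoneshow displays the zoning configuration) and that you know the WWPN of the RA's HBA port. The switch address, user name, and WWPN are placeholders.

import subprocess

SWITCH = "fcswitch-site1.example.com"        # placeholder: switch management address
USER = "admin"                                # placeholder: switch account
RA_WWPN = "10:00:00:00:c9:3c:aa:bb"           # placeholder: WWPN of the RA HBA port

def run_on_switch(command):
    """Run a CLI command on the switch over SSH and return its output in lowercase."""
    result = subprocess.run(
        ["ssh", f"{USER}@{SWITCH}", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.lower()

name_server = run_on_switch("nsshow")    # devices currently registered with the name server
zoning = run_on_switch("zoneshow")       # effective zoning configuration

if RA_WWPN.lower() in name_server:
    print("RA WWPN is registered in the switch name server.")
else:
    print("RA WWPN not found in the name server: check the switch port, HBA, or cable.")

if RA_WWPN.lower() in zoning:
    print("RA WWPN appears in the zoning configuration.")
else:
    print("RA WWPN not found in the zoning output: verify the zone members and that the configuration is enabled.")

If your zones are defined by alias rather than by WWPN, search the zoneshow output for the alias name instead.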

Volume Not Accessible to SafeGuard 30m Splitter

Problem Description

A volume (repository volume, replication volume, or journal) is not accessible to one or more splitters but is accessible to all other relevant initiators (for example, the RAs).

Symptoms

The following symptoms might help you identify this failure:

• The system pauses the transfer for the relevant groups.

• If the repository volume is not accessible, the management console shows an error for the splitter. If a replication volume is not accessible, the splitter connection to that volume shows an error.

• The management console System Status screen and the Splitter Settings screen show error indications similar to those in Figure 6–3.

Figure 6–3. Management Console Error Display Screen

• Warnings and informational messages similar to those shown in Figure 6–4 appear on the management console. See the table after the figure for an explanation of the numbered console messages.

Figure 6–4. Management Console Messages for Volumes Inaccessible to Splitter

The following table explains the numbered messages shown in Figure 6–4. These events also generate e-mail notifications, either immediately or in the daily summary.

Reference No.   Event ID   Description
1               4008       For each consistency group at the failed site, the transfer is paused to allow a failover to the surviving site.
2               5030       The splitter write operation failed.
3               4001       For each consistency group, a minor problem is reported. The details show sides are not linked and cannot transfer data.
4               4005       Negotiating transfer protocol.
5               4016       Transferring the latest snapshot before pausing the transfer (no data is lost).
6               4007       Pausing data transfer.
7               4087       For each consistency group at the failed site, initialization completes.
8               5032       The splitter is splitting to replication volumes at the surviving site.
9               5049       Splitter write to RA failed.
10              4086       For each consistency group at the failed site, the data transfer starts and then the initialization starts.
11              4104       Group started accepting writes.
12              5015       Splitter is up.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.


• The multipathing software (such as EMC PowerPath) on the server at the failed site reports disk errors as shown in Figure 6–5.

Figure 6–5. EMC PowerPath Shows Disk Error

• If you review the Windows system event log, you can find messages similar to the following examples that are based on the testing cases used to generate the previous management console images:

System Event Log for USMV-SYDNEY Host (Host on Failure Site)

5/29/2008 1:35:20 AM EmcpBase Error None 108 N/A USMV-SYDNEY Volume 6006016011321100158233EDE0B23DB11 is unbound.
5/29/2008 1:35:20 AM EmcpBase Error None 100 N/A USMV-SYDNEY Path Bus 5 Tgt 3 Lun 2 to APM00042302162 is dead.
5/29/2008 1:35:20 AM EmcpBase Error None 100 N/A USMV-SYDNEY Path Bus 5 Tgt 0 Lun 2 to APM00042302162 is dead.
5/29/2008 1:35:20 AM EmcpBase Error None 100 N/A USMV-SYDNEY Path Bus 3 Tgt 3 Lun 2 to APM00042302162 is dead.
5/29/2008 1:35:20 AM EmcpBase Error None 100 N/A USMV-SYDNEY Path Bus 3 Tgt 0 Lun 2 to APM00042302162 is dead.
5/29/2008 1:35:20 AM EmcpBase Error None 104 N/A USMV-SYDNEY All paths to 6006016011321100158233EDE0B23DB11 are dead.
5/29/2008 1:35:20 AM Srv Error None 2000 N/A USMV-SYDNEY The server's call to a system service failed unexpectedly.
5/29/2008 1:36:18 AM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to flush data to the transaction log. Corruption may occur.
5/29/2008 1:36:18 AM Srv Error None 2000 N/A USMV-SYDNEY The server's call to a system service failed unexpectedly.
5/29/2008 1:36:18 AM Ntfs Warning None 50 N/A USMV-SYDNEY {Delayed Write Failed} Windows was unable to save all the data for the file. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.


5/29/2008 1:36:18 AM Application Popup Information None 26 N/A USMV-SYDNEY Application popup: Windows - Delayed Write Failed : Windows was unable to save all the data for the file S:\$BitMap. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.
5/29/2008 1:36:19 AM Service Control Manager Information None 7035 CLUSTERNET\clusadmin USMV-SYDNEY The Windows Internet Name Service (WINS) service was successfully sent a stop control.

System Event Log for Usmv-x455 Host (Host on Surviving Site)

5/29/2008 1:35:26 AM ClusSvc Warning Node Mgr 1123 N/A USMV-X455 The node lost communication with cluster node 'USMV-SYDNEY' on network 'Public'.
5/29/2008 1:35:26 AM ClusSvc Warning Node Mgr 1123 N/A USMV-X455 The node lost communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.
5/29/2008 1:35:40 AM ClusDisk Error None 1209 N/A USMV-X455 Cluster service is requesting a bus reset for device \Device\ClusDisk0.
5/29/2008 1:36:06 AM ClusSvc Warning Node Mgr 1135 N/A USMV-X455 Cluster node USMV-SYDNEY was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
5/29/2008 1:36:06 AM ClusSvc Information Failover Mgr 1200 N/A USMV-X455 "The Cluster Service is attempting to bring online the Resource Group ""Cluster Group""."
5/29/2008 1:36:10 AM ClusSvc Information Failover Mgr 1201 N/A USMV-X455 "The Cluster Service brought the Resource Group ""Cluster Group"" online."
5/29/2008 1:36:36 AM Service Control Manager Information None 7035 CLUSTERNET\clusadmin USMV-X455 The Windows Internet Name Service (WINS) service was successfully sent a start control.

• If you review the cluster log, you can find messages similar to the following examples that are based on the testing cases used to generate the previous management console images:

Cluster Log for USMV-SYDNEY Host (Host on Failure Site)

00000d68.00000284::2008/05/29-1:35:21.703 ERR Physical Disk : [DiskArb] CompletionRoutine: reservation lost! Status 21 (Error 21: the device is not ready)
00000d68.00000284::2008/05/29-1:35:22.713 ERR Physical Disk : [DiskArb] CompletionRoutine: reservation lost! Status 2 (Error 2: the system cannot find the file specified)
00000d68.00000284::2008/05/29-1:35:22.713 ERR [RM] LostQuorumResource, cluster service terminated...
00000d68.00000e68::2008/05/29-1:35:23.133 ERR Physical Disk : LooksAlive, error checking device, error 2.
00000d68.00000e68::2008/05/29-1:35:23.133 ERR Physical Disk : IsAlive, error checking device, error 2.
00000d68.00000e68::2008/05/29-1:35:23.143 ERR Network Name : Name query request failed, status 3221225860.
00000d68.00000e68::2008/05/29-1:35:23.143 INFO Network Name : Name SYDNEY-AUCKLAND failed IsAlive/LooksAlive check, error 22. (Error 22: the device does not recognize the command)
00000d68.00000cd0::2008/05/29-1:35:23.303 ERR Network Name : Unable to open handle to cluster, status 1753.
00000d68.00000cd0::2008/05/29-21:33:19.245 ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 1117. (Error 1117: the request could not be performed because of an I/O device error)
00000d68.00000cd0::2008/05/29-21:33:19.245 ERR Physical Disk : [DiskArb] Error cleaning arbitration sector, error 1117.

Cluster Log for Usmv-x455 Host (Host on Surviving Site)

0000015c.00000234::2008/05/29-1:35:26.121 INFO [ClMsg] Received interface unreachable event for node 1 network 2
0000015c.00000234::2008/05/29-1:35:26.121 INFO [ClMsg] Received interface unreachable event for node 1 network 1
00000688.00000d08::2008/05/29-1:35:40.523 ERR Physical Disk : [DiskArb] GetPartInfo completed, status 170.
00000688.00000d08::2008/05/29-1:35:40.653 ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 170.


Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

• Verify that the physical connection between the disconnected splitter or splitters and the Fibre Channel switch is healthy.

• Verify that any host on which a disconnected splitter resides appears in the name server of the Fibre Channel switch. If not, the problem could be caused by a bad port on the switch, a bad HBA, or a bad cable.

• Verify that any host on which a disconnected splitter resides is present in the proper zone and that the current zoning configuration is enabled.

• If a replication volume is not accessible to the splitter at the source site, but appears as OK in the management console for that splitter, verify that the splitter is not functioning at the target site (TSP not enabled). During normal replication, the system prevents target-site splitters from accessing the replication volumes. (A sketch after this list shows a quick host-side check of which disks the server currently sees.)
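To complement the switch-side checks, it can help to confirm what the host itself can see. The following minimal Python sketch (an illustration only, not part of the product) runs the standard Windows wmic utility to list the physical disks currently visible to the server on which the splitter is installed; the closing comment restates the guidance above rather than describing any additional product behavior.

import subprocess

# "wmic diskdrive get ..." is a standard Windows command that lists the physical
# disks the operating system can currently see.
result = subprocess.run(
    ["wmic", "diskdrive", "get", "DeviceID,Model,Size,Status"],
    capture_output=True, text=True, check=True,
)

print("Disks visible to this host:")
for line in result.stdout.splitlines():
    if line.strip():
        print("  " + line.rstrip())

# If a replicated volume is missing from this list on the affected server although
# the switch name server and zoning look correct, suspect LUN masking/mapping on
# the storage array or a reservation held by clustering software.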

RAs Not Accessible to SafeGuard 30m Splitter

Problem Description

One or more RAs on a site are not accessible to the splitter through the Fibre Channel.

Symptoms

The following symptoms might help you identify this failure:

• The system pauses the transfer for the relevant groups. If the connection with only one of the RAs is lost, the groups can restart the transfer by means of another RA, beginning with a short initialization.

• The splitter connection to the relevant RAs shows an error.

• The management console displays error indicators similar to those in Figure 6–6.

Figure 6–6. Management Console Display Shows a Splitter Down

• Warnings and informational messages similar to those shown in Figure 6–7 appear on the management console. See the table after the figure for an explanation of the numbered console messages.

Figure 6–7. Management Console Messages for Splitter Inaccessible to RA

The following table explains the numbered messages shown in Figure 6–7. These events also generate e-mail notifications, either immediately or in the daily summary.

Reference No.   Event ID   Description
1               4005       The surviving site is negotiating the transfer protocol.
2               4008       For each consistency group at the failed site, the transfer is paused to allow a failover to the surviving site.
3               5002       The splitter for server USMV-SYDNEY is unable to access the RA.
4               4105       The failed site stops accepting writes to the consistency group.
5               4008       For each consistency group at the failed site, the transfer is paused to allow a failover to the surviving site.
6               5013       Splitter down problem.
7               4087       The synchronization completed message appears after the splitter is restored and replication completes.
8               5032       The splitter starts splitting to the replication volumes.
9               4001       Group capabilities reporting a problem.
10              5032       The splitter is splitting to replication volumes.
13              5049       The splitter is unable to write to the RAs.
14              4086       The original site starts the synchronization.
15              4104       The consistency group starts replicating.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

• If you review the Windows system event log, you can find messages similar to the following examples that are based on the testing cases used to generate the previous management console images:



System Event Log for USMV-SYDNEY Host (Host on Failure Site)

5/29/2008 2:25:20 AM ClusSvc Error Physical Disk Resource 1038 N/A USMV-SYDNEY Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.
5/29/2008 2:25:20 AM Service Control Manager Error None 7034 N/A USMV-SYDNEY The Cluster service terminated unexpectedly.
5/29/2008 2:25:50 AM Srv Warning None 2012 N/A USMV-SYDNEY While transmitting or receiving data, the server encountered a network error. Occasional errors are expected, but large amounts of these indicate a possible error in your network configuration. The error status code is contained within the returned data (formatted as Words) and may point you towards the problem.
5/29/2008 2:25:20 AM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to flush data to the transaction log. Corruption may occur.
5/29/2008 2:25:21 AM Ntfs Warning None 50 N/A USMV-SYDNEY {Delayed Write Failed} Windows was unable to save all the data for the file. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.
5/29/2008 2:25:32 AM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to flush data to the transaction log. Corruption may occur.
5/29/2008 2:25:32 AM Srv Error None 2000 N/A USMV-SYDNEY The server's call to a system service failed unexpectedly.
5/29/2008 2:25:32 AM ClusSvc Error IP Address Resource 1077 N/A USMV-SYDNEY The TCP/IP interface for Cluster IP Address '' has failed.
5/29/2008 2:25:32 AM ClusSvc Error Physical Disk Resource 1036 N/A USMV-SYDNEY Cluster disk resource '' did not respond to a SCSI maintenance command.
5/29/2008 2:25:32 AM ClusSvc Error Network Name Resource 1215 N/A USMV-SYDNEY Cluster Network Name SYDNEY-AUCKLAND is no longer registered with its hosting system. The associated resource name is ''.

System Event Log for Usmv-x455 Host (Host on Surviving Site)

5/29/2008 2:25:23 AM ClusSvc Warning Node Mgr 1123 N/A USMV-X455 The node lost communication with cluster node 'USMV-SYDNEY' on network 'Public'.
5/29/2008 2:25:23 AM ClusSvc Warning Node Mgr 1123 N/A USMV-X455 The node lost communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.
5/29/2008 2:25:37 AM ClusDisk Error None 1209 N/A USMV-X455 Cluster service is requesting a bus reset for device \Device\ClusDisk0.
5/29/2008 2:25:53 AM ClusSvc Warning Node Mgr 1135 N/A USMV-X455 Cluster node USMV-SYDNEY was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
5/29/2008 2:25:53 AM ClusSvc Information Failover Mgr 1200 N/A USMV-X455 "The Cluster Service is attempting to bring online the Resource Group ""Cluster Group""."
5/29/2008 2:25:58 AM ClusSvc Information Failover Mgr 1201 N/A USMV-X455 "The Cluster Service brought the Resource Group ""Cluster Group"" online."
5/28/2008 2:25:35 AM Service Control Manager Information None 7035 CLUSTERNET\clusadmin USMV-X455 The Windows Internet Name Service (WINS) service was successfully sent a start control.
5/29/2008 2:25:37 AM Service Control Manager Information None 7035 NT AUTHORITY\SYSTEM USMV-X455 The Windows Internet Name Service (WINS) service was successfully sent a continue control.

• If you review the cluster log, you can find messages similar to the following examples that are based on the testing cases used to generate the previous management console images:


Cluster Log for USMV-SYDNEY Host (Host on Failure Site)

00000f70.00000d10::2008/05/29-2:25:20.426 ERR Physical Disk : [DiskArb] CompletionRoutine: reservation lost! Status 31. (Error 31: a device attached to the system is not functioning)
00000f70.00000d10::2008/05/29-2:25:20.426 ERR [RM] LostQuorumResource, cluster service terminated...
00000f70.00000e78::2008/05/29-2:25:32.778 ERR Physical Disk : IsAlive, error checking device, error 995. (Error 995: The I/O operation has been aborted because of either a thread exit or an application request)
00000f70.00000e78::2008/05/29-2:25:32.778 ERR Physical Disk : LooksAlive, error checking device, error 31.
00000f70.00000e78::2008/05/29-2:25:32.778 ERR Physical Disk : IsAlive, error checking device, error 31.
00000f70.00000e78::2008/05/29-2:25:32.778 ERR Network Name : Name query request failed, status 3221225860.
00000f70.00000b54::2008/05/29-2:25:32.868 ERR Network Name : Unable to open handle to cluster, status 1753. (Error 1753: there are no more endpoints available from the endpoint mapper)
00000f70.00000b54::2008/05/29-2:25:33.258 ERR Physical Disk : Terminate, error opening \Device\Harddisk10\Partition1, error C0000022.
00000f70.00000b54::2008/05/29-2:25:33.528 ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 170. (Error 170: the requested resource is in use)
00000f70.00000b54::2008/05/29-2:25:33.528 ERR Physical Disk : [DiskArb] Error cleaning arbitration sector, error 170.

Cluster Log for Usmv-x455 Host (Host on Surviving Site)

0000015c.00000234::2008/05/29-2:25:23.496 INFO [ClMsg] Received interface unreachable event for node 1 network 2
0000015c.00000234::2008/05/29-2:25:23.496 INFO [ClMsg] Received interface unreachable event for node 1 network 1
00000688.00000d08::2008/05/29-2:25:37.898 ERR Physical Disk : [DiskArb] GetPartInfo completed, status 170.
00000688.00000d08::2008/05/29-2:25:37.898 ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 170.

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

• Identify which of the components is the problematic one. A problematic component is likely to have additional errors or problems:

− A problematic RA might not be accessible to other splitters or might not recognize certain volumes.

− A problematic splitter might not recognize any RAs or the storage subsystem.

• Connect to the storage switch to verify the status of each connection. Ensure that each connection is configured correctly.

• If you cannot find any additional problems, there is a good chance that the problem is with the zoning; that is, somehow, the splitters are not exposed to the RAs.

• Verify the physical connectivity of the RAs and the servers (those on which the potentially problematic splitters reside) to the Fibre Channel switch. For each connection, verify that it is healthy and appears correctly in the name server, zoning, and so forth.

• Verify that this is not a temporary situation. For instance, if the RAs were rebooting or recovering from another failure, the splitter might not yet identify them.


Total SAN Switch Failure on One Site in a Geographic Clustered Environment

A total SAN switch failure implies that cluster nodes and RAs have lost access to the storage device that was connected to the SAN on one site. This failure causes the cluster nodes to lose their reservation of the physical disks and triggers an MSCS failover to the remote site. In a geographic clustered environment where MSCS is running, if the connection to a storage device on one site fails, the symptoms and resulting actions depend on whether or not the quorum owner resided on the failed storage device.

To understand the two scenarios and to follow the actions for both possibilities, review Figure 6–8.

Figure 6–8. SAN Switch Failure on One Site


Cluster Quorum Owner Located on Site with Failed SAN Switch

Problem Description

The following point explains the expected behavior of the MSCS Reservation Manager when an event of this nature occurs:

• If the cluster quorum owner is located on the site with the failed SAN, the quorum reservation is lost. This loss causes the cluster nodes to fail and triggers a cluster “regroup” process. This regroup process allows other cluster nodes participating in the cluster to arbitrate for the quorum device.

Cluster nodes located on the failed SAN fail quorum arbitration because the failed SAN is not able to provide a reservation on the quorum volume. The cluster nodes in the remote location attempt to reserve the quorum device and win the arbitration. The node that owns the quorum device assumes ownership of the cluster. The cluster owner brings online the data groups that were owned by the failed site.

Symptoms

The following symptoms might help you identify this failure:

• All resources fail over to the surviving site (site 2 in this case) and come online successfully. Cluster nodes fail at the source site. If the consistency groups are configured asynchronously, this failover results in loss of data. The failover is fully automated and does not require additional downtime. The RAs cannot replicate data until the SAN is operational.

• Failures are reported on the server and the management console. Replication stops on all consistency groups.

• The management console displays error indications similar to those in Figure 6–9.

Figure 6–9. Management Console Display with Errors for Failed SAN Switch

• Warnings and informational messages similar to those shown in Figure 6–10 appear on the management console. See the table after the figure for an explanation of the numbered console messages.


Figure 6–10. Management Console Messages for Failed SAN Switch


The following table explains the numbered messages shown in Figure 6–10. These events also generate e-mail notifications, either immediately or in the daily summary.

Reference No.   Event ID   Description
1               3012       The RA is unable to access the volume.
2               5002       The RA is unable to access the splitter.
3               4001       The surviving site reports a group capabilities problem.
4               4008       The surviving site pauses the data transfer.
5               5013       The original site reports the splitter down status.
6               4003       For each consistency group, the surviving site reports a group consistency problem. The details show a WAN problem.
7               3014       The RA is unable to access the repository volume.
8               4044       The group is deactivated indefinitely by the system.
9               4007       The system is pausing data transfer on the surviving site (Quorum - South).
10              4086       Synchronization started.
11              4000       Group capabilities OK.
12              5032       The splitter starts splitting.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

• If you review the Windows system event log, you can find messages similar to the following examples that are based on the testing cases used to generate the previous management console images:



System Event Log for USMV-SYDNEY Host (Host on Failure Site)

5/29/2008 05:15:31 PM Application Popup Information None 26 N/A USMV-SYDNEY Application popup: Windows - Delayed Write Failed : Windows was unable to save all the data for the file Q:\. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.
5/29/2008 05:15:31 PM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to flush data to the transaction log. Corruption may occur.
5/29/2008 05:15:31 PM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to flush data to the transaction log. Corruption may occur.

System Event Log for USMV-AUCKLAND Host (Host on Surviving Site)

5/29/2008 05:13:33 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-SYDNEY Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.
5/29/2008 05:13:33 PM Service Control Manager Error None 7031 N/A USMV-SYDNEY The Cluster Service terminated unexpectedly. It has done this 2 time(s). The following corrective action will be taken in 120000 milliseconds: Restart the service.
5/29/2008 05:15:31 PM Application Popup Information None 26 N/A USMV-SYDNEY Application popup: Windows - Delayed Write Failed : Windows was unable to save all the data for the file Q:\$Mft. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.
5/29/2008 05:15:31 PM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to flush data to the transaction log. Corruption may occur.

• If you review the cluster log, you can find messages similar to the following examples that are based on the testing cases used to generate the previous management console images:

Cluster Log for USMV-SYDNEY Host (Host on Failure Site)

00001130.00001354::2008/5/29-17:14:33.712 ERR Physical Disk : [DiskArb] CompletionRoutine: reservation lost! Status 170 (Error 170: the requested resource is in use)
00001130.00001354::2008/5/29-17:14:33.712 ERR [RM] LostQuorumResource, cluster service terminated...
00001130.00001744::2008/5/29-17:15:31.733 ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 170.
00001130.00001744::2008/5/29-17:15:31.733 ERR Physical Disk : [DiskArb] Error cleaning arbitration sector, error 170.
00001130.00001744::2008/5/29-17:15:31.733 ERR Network Name : Unable to open handle to cluster, status 1753. (Error 1753: there are no more endpoints available from the endpoint mapper)
00001130.00000d3c::2008/5/29-17:15:31.733 ERR IP Address : WorkerThread: GetClusterNotify failed with status 6. (Error 6: the handle is invalid)


Cluster Log for USMV-AUCKLAND Host (Host on Surviving Site)

00000668.00000d90::2008/5/29-17:14:35.120 INFO [ClMsg] Received interface unreachable event for node 2 network 1
00000668.00000d90::2008/5/29-17:14:35.120 INFO [ClMsg] Received interface unreachable event for node 2 network 2
00000bb8.00000d0c::2008/5/29-17:14:49.706 ERR Physical Disk : [DiskArb] GetPartInfo completed, status 170.
00000bb8.00000d0c::2008/5/29-17:14:49.706 ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 170.

Actions to Resolve the Problem

To resolve this situation, diagnose the SAN switch failure.

Cluster Quorum Owner Not on Site with Failed SAN Switch

Problem Description

The following points explain the expected behavior of the MSCS Reservation Manager when an event of this nature occurs:

• If a SAN failure occurs and the cluster nodes do not own the quorum resource, the state of the cluster services on these nodes is not affected.

• The cluster nodes remain active cluster members; however, the data groups containing the SafeGuard 30m Control instance and the physical disk resources on these nodes are marked as failed, and any applications dependent on them are taken offline. These resources first try to restart and then eventually fail over to the surviving site.

Symptoms

The following symptoms might help you identify this failure:

• Applications fail and attempt to restart.

• The data groups containing the SafeGuard 30m Control instance and the physical disk resources on these nodes are marked as failed, and any applications dependent on them are taken offline. These resources first try to restart and then eventually fail over to the surviving site. The cluster nodes remain active cluster members.

• The management console displays error indications similar to those in Figure 6–9.

• Warnings and informational messages similar to those shown in Figure 6–11 appear on the management console. See the table after the figure for an explanation of the numbered console messages.


Figure 6–11. Management Console Messages for Failed SAN Switch with Quorum Owner on Surviving Site


The following table explains the numbered messages shown in Figure 6–11. These events also generate e-mail notifications, either immediately or in the daily summary.

Reference No.   Event ID   Description
1               5002       The RA is unable to access the splitter.
2               3012       The RA is unable to access the volume (RA 2, Quorum).
3               4003       For each consistency group, the surviving site reports a group consistency problem. The details show a WAN problem.
4               3014       The RA is unable to access the repository volume (RA 2).
5               4009       The system is pausing data transfer on the failure site.
6               4044       The group is deactivated indefinitely by the system.

To see the details of the messages listed on the management console display, you must collect the logs and then review the messages for the time of the failure. Appendix A explains how to collect the management console logs, and Appendix E lists the event IDs with explanations.

• If you review the Windows system event log, you can find messages similar to the following examples that are based on the testing cases used to generate the previous management console images:

System Event Log for USMV-SYDNEY Host (Host on Failure Site)

5/29/2008 5:14:24 PM ClusDisk Error None 1209 N/A USMV-AUCKLAND Cluster service is requesting a bus reset for device \Device\ClusDisk0.
5/29/2008 5:14:38 PM ClusSvc Warning Node Mgr 1123 N/A USMV-AUCKLAND The node lost communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.
5/29/2008 5:14:39 PM ClusSvc Information Node Mgr 1122 N/A USMV-AUCKLAND The node (re)established communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.



System Event Log for Usmv-Auckland Host (Host on Surviving Site)

5/29/2008 5:14:38 PM ClusSvc Warning Node Mgr 1123 N/A USMV-AUCKLAND The node lost communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.
5/29/2008 5:14:39 PM ClusSvc Information Node Mgr 1122 N/A USMV-AUCKLAND The node (re)established communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.

• If you review the cluster log, you can find messages similar to the following examples that are based on the testing cases used to generate the previous management console images:

Cluster Log for USMV-SYDNEY Host (Host on Failure Site)

00001524.00000360::2008/5/29-17:14:56.750 ERR Physical Disk : [DiskArb] GetPartInfo completed, status 170.
00001524.00000360::2008/5/29-17:14:56.750 ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 170.
00001524.000017e4::2008/5/29-17:15:22.899 ERR IP Address : WorkerThread: GetClusterNotify failed with status 6.

Cluster Log for USMV-AUCKLAND Host (Host on Surviving Site)

00000bb8.00000c5c::2008/5/29-17:14:14.596 ERR IP Address : WorkerThread: GetClusterNotify failed with status 6.
00000bb8.0000026c::2008/5/29-17:14:24.188 ERR Physical Disk : [DiskArb] GetPartInfo completed, status 170.
00000bb8.0000026c::2008/5/29-17:14:24.188 ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 170.

Actions to Resolve the Problem

To resolve this situation, diagnose the SAN switch failure.


Section 7
Solving Network Problems

This section lists symptoms that usually indicate networking problems. Table 7–1 lists symptoms and the possible problems indicated by each symptom. The problems and their solutions are described in this section. The graphics, behaviors, and examples in this section are similar to what you observe with your system but might differ in some details.

In addition to the symptoms listed, you might receive e-mail messages or SNMP traps for possible problems. Also, messages similar to the e-mail messages are displayed on the management console. If you do not see the messages, they might have already dropped off the display. Review the management console logs for messages that have dropped off the display.

Table 7–1. Possible Networking Problems with Symptoms

Possible problem: Public NIC failure on a cluster node in a geographic clustered environment
Symptoms:
• The cluster groups with the failed network connection fail over to the next preferred node. If only one node is configured at the site with the failure, the replication direction changes and applications run on the backup site.
• If the NIC is teamed, no failover occurs and no symptoms are obvious.
• The networks on the Cluster Administrator screen show an error.
• Host system and application event log messages contain error or warning messages.

Possible problem: Public or client WAN failure in a geographic clustered environment
Symptoms:
• Clients on site 2 are not able to access resources associated with the IP resource located on site 1.
• Public communication between the two sites fails, allowing only local cluster public communication between cluster nodes and local clients.
• The networks on the Cluster Administrator screen show an error.

Possible problem: Management network failure in a geographic clustered environment
Symptoms:
• You cannot access the management console or initiate an SSH session through PuTTY using the management IP address of the remote site.

Possible problem: Replication network failure in a geographic clustered environment
Symptoms:
• The management console log indicates that the WAN data links to the RAs are down.
• All consistency groups show the transfer status as “Paused by system.”

Possible problem: Temporary WAN failures
Symptoms:
• On the management console, all consistency groups show the transfer status switching between “Paused by system” and “initializing/active.” All groups appear unstable over the WAN connection.

Possible problem: Private cluster network failure in a geographic clustered environment
Symptoms:
• The networks on the Cluster Administrator screen show an error.

Possible problem: Total communication failure in a geographic clustered environment
Symptoms:
• You cannot access the management console using the management IP address of the remote site.
• The cluster is no longer accessible from any node except one surviving node.

Possible problem: Port information
Symptoms:
• Unable to reach the DNS server.
• Unable to communicate with the NTP server.
• Unable to reach the mail server.
• The management console shows errors for the WAN or for RA data links.
• The management console logs show RA communication errors.


Public NIC Failure on a Cluster Node in a Geographic Clustered Environment

Problem Description

If a public network interface card (NIC) of a cluster node fails, the cluster node with the failed public NIC cannot access clients. The cluster node with the failed NIC can participate in the cluster as a member because it can communicate over the private cluster network. Other cluster nodes are not affected by this error.

The MSCS software detects a failed network, and the cluster resources fail over to the next preferred node. All cluster groups used for replication that contain a virtual IP address for the failed network connection succeed in failing over to the next preferred node. However, the Unisys SafeGuard 30m Control resources cannot fail back to the node with a failed public network because they cannot communicate with the site management IP address of the RAs.

Note: A teamed public network interface does not experience this problem and therefore is the recommended configuration.

Figure 7–1 illustrates this failure.

Figure 7–1. Public NIC Failure of a Cluster Node


Symptoms

The following symptoms might help you identify this failure:

• All cluster groups used for replication that contain a virtual IP address for the failed network connection fail over to the next preferred node.

• If no other node exists at the same site, the replication direction changes and the applications run at the backup site.

• If you review the host system event log, you can find messages similar to the following examples:

Windows System Event Log Messages on Host Server

Type: error
Source: ClusSvc
EventID: 1077, 1069
Description: The TCP/IP interface for Cluster IP Address “xxx” has failed.

Type: error
Source: ClusSvc
EventID: 1069
Description: Cluster resource ‘xxx’ in Resource Group ‘xxx’ failed.

Type: error
Source: ClusSvc
EventID: 1127
Description: The interface for cluster node ‘xxx’ on network ‘xxx’ failed. If the condition persists, check the cabling connecting the node to the network. Next, check for hardware or software errors in the node's network adapter.

• If you attempt to move a cluster group to the node with the failing public NIC, the event 2002 message is displayed in the host application event log.

Application Event Log Message on Host Server

Type: warning
Source: 30mControl
Event Category: None
EventID: 2002
Date: 05/30/2008
Time: 11:12:02 AM
User: N/A
Computer: USMV-DL580
Description: Online resource failed. RA CLI command failed because of a network communication error or invalid IP address.
Action: Verify the network connection between the system and the site management IP Address specified for the resource. Ping each site management IP Address specified for the specified resource.

Note: The preceding information can also be viewed in the cluster log.


• The management console display and management console logs do not show any errors.

• When the public NIC fails on a node that does not use teaming, the Cluster Administrator displays an error indicator similar to Figure 7–2. If the public NIC interface is teamed, you do not see error messages in the Cluster Administrator.

Figure 7–2. Public NIC Error Shown in the Cluster Administrator

Actions to Resolve the Problem

Perform the following actions to isolate and resolve the problem:

1. In the Cluster Administrator, verify that the public interface for all nodes is in an “Up” state. If multiple nodes at a site show public connections failed in the Cluster Administrator, physically check the network switch for connection errors. If the private network also shows errors, physically check the network switch for connection errors.

2. Inspect the NIC link indicators on the host and, from a client, use the Ping command to verify the physical IP address of the adapter (not the virtual IP address). A sketch following this list shows one way to script this check.

3. Isolate a NIC or cabling issue by moving cables at the network switch and at the NIC.

4. Replace the NIC in the host if necessary. No configuration of the replaced NIC is necessary.

5. Move the cluster resources back to the original node after the resolution of the failure.
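The following minimal Python sketch shows one way to script the ping check in step 2 from a client. The node names and addresses are placeholders only; substitute the physical (not virtual) public IP address of each cluster node.

import platform
import subprocess

# Placeholder mapping of node name to the physical public IP address of its adapter.
NODE_ADDRESSES = {
    "USMV-SYDNEY": "192.0.2.11",
    "USMV-AUCKLAND": "192.0.2.12",
}

# The ping count option differs between Windows (-n) and Unix-like clients (-c).
COUNT_FLAG = "-n" if platform.system().lower() == "windows" else "-c"

for node, address in NODE_ADDRESSES.items():
    result = subprocess.run(
        ["ping", COUNT_FLAG, "2", address],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    status = "responds" if result.returncode == 0 else "does NOT respond"
    print(f"{node} ({address}) {status} to ping.")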



Public or Client WAN Failure in a Geographic Clustered Environment

Problem Description

When the public or client WAN fails, some clients cannot access virtual IP networks that are associated with the cluster. The WAN components involved in this failure might be two switches, possibly on different subnets connected through gateways. This failure results from connectivity issues. The MSCS cluster would detect and fail the associated node if the failure resulted from an adapter failure or from a media failure to the adapter. Instead, cluster groups do not fail, and the public LAN shows as unreachable for this failure.

Public communication between the two sites fails, allowing only local cluster public communication between cluster nodes and local clients. The cluster node state does not change on either site because all cluster nodes are able to communicate over the private cluster network.

All resources remain online and no cluster group errors are reported in the Cluster Administrator. Clients on the remote site cannot access resources associated with the IP resource located on the local site until the public or client network is again operational.

Depending on the cause of the failure and the network configuration, the SafeGuard 30m Control might fail to move a cluster group because the management network might be the same physical network as the public network. Whether this failure to move the group occurs or not depends on how the RAs are physically wired to the network.

Figure 7–3 illustrates this scenario.

Figure 7–3. Public or Client WAN Failure

Symptoms

The following symptoms might help you identify this failure:

• Clients on site 2 are not able to access resources associated with the IP resource located on site 1.

• Public communication between the two sites displays as “unreachable,” allowing local cluster public communication between cluster nodes and local clients.

• When the public cluster network fails, the Cluster Administrator displays an error indicator similar to Figure 7–4.

All private network connections show as “unreachable” when the problem is a WAN issue.

If only two of the connections show as failed (and the nodes are physically located at the same site), the issue is probably local to the site.

If only one connection failed, the issue is probably a host network adapter.

Figure 7–4. Cluster Administrator Showing Public LAN Network Error

• If you review the system event log, messages similar to the following examples are displayed:

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1123
Date: 05/30/2008
Time: 9:49:34 AM
User: N/A
Computer: USMV-WEST2
Description:
The node lost communication with cluster node 'USMV-EAST2' on network 'Public LAN'.

Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1126
Date: 05/30/2008
Time: 9:49:36 AM
User: N/A
Computer: USMV-WEST2
Description:
The interface for cluster node 'USMV-WEST2' on network 'Public LAN' is unreachable by at least one other cluster node attached to the network. The server cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node USMV-WEST2. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.


Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1130
Date: 05/30/2008
Time: 9:49:36 AM
User: N/A
Computer: USMV-WEST2
Description:
Cluster network 'Public LAN' is down. None of the available nodes can communicate using this network. If the condition persists, check for failures in any network components to which the nodes are connected such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the network. Finally, check for hardware or software errors in the adapters that attach the nodes to the network.

• A cluster group containing a SafeGuard 30m Control resource might fail to move to another node when the management network has network components common to the public network. (Refer to “Management Network Failure in a Geographic Clustered Environment.”)

• Symptoms might include those in “Management Network Failure in a Geographic Clustered Environment” when these networks are physically the same network. Refer to that topic if the clients at one site are not able to access the IP resources at another site.

• The management console logs might display the messages in the following table when this connection fails and is then restored. These events also generate e-mail notifications, either immediately or in the daily summary.

Event ID   Description
3023       For each RA at the site, this console log message is displayed: Error in LAN link to RA. (RA)
3022       When the LAN link is restored, this console log message is displayed: LAN link to RA restored. (RA)


Solving Network Problems<br />

Actions to Resolve the Problem<br />

Note: Typically, a network administrator for the site is required to diagnose which<br />

network switch, gateway, or connection is the cause of this failure.<br />

Perform the following actions to isolate and resolve the problem:<br />

1. In the Cluster Administrator, view the network properties of the public and private<br />

network.<br />

The private network should be operational with no failure indications.<br />

The public network should display errors. Refer to the previous symptoms to identify<br />

that this is a WAN issue. If the error is limited to one host, the problem might be a<br />

host network adapter. See “Cluster Node <strong>Public</strong> NIC Failure in a Geographic<br />

Clustered Environment.”<br />

2. Check for network problems using a method such as isolating the failure to the<br />

network switch or gateway by pinging from the cluster node to the gateway at each<br />

site.<br />

3. Use the Installation Manager site connectivity IP diagnostic from the RAs to the<br />

gateway at each site by performing the following steps. (For more information, see<br />

Appendix C.)<br />

a. Log on to an RA with user ID as boxmgmt and password as boxmgmt.<br />

b. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />

c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.<br />

d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.<br />

e. When asked to select a target for the tests, type 5 (Other host) and press<br />

Enter.<br />

f. Enter the IP address for the gateway that you want to test.<br />

g. Repeat steps a through f for each RA.<br />

4. Isolate the site by determining which gateway or network switch failed. Use<br />

standard network methods such as pinging to make the determination.<br />
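As a hedged illustration of the ping check in step 2, the following commands could be run from a command prompt on a cluster node at each site. The gateway addresses 192.168.10.1 (local site) and 192.168.20.1 (remote site) are placeholders only; substitute the gateway addresses defined for your sites:

    ping -n 4 192.168.10.1    (local-site gateway; replies indicate the local LAN path is healthy)
    ping -n 4 192.168.20.1    (remote-site gateway; timeouts here point to the WAN link or gateway)

If the local gateway replies from both sites but the remote gateway does not, the failure is most likely in the WAN components between the gateways.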



Management Network Failure in a Geographic Clustered Environment

Problem Description

When the management network fails in a geographic clustered environment, you cannot access the management console for the affected site. The replication environment is not affected. If you try to move a cluster group to the site with the failed management network, the move fails.

Figure 7–5 illustrates this scenario.

Figure 7–5. Management Network Failure

Symptoms

The following symptoms might help you identify this failure:

• The indicators for the onboard management network adapter of the RA are not illuminated.

• Network switch port lights show that no link exists with the host adapter.

• You cannot access the management console or initiate a SSH session through<br />

PuTTY using the management IP address of the failed site from remote site. You can<br />

access the management console from a client local to the site. If you cannot access<br />

the management IP address from either site, see Section 8, “Solving Replication<br />

Appliance (RA) Problems.”<br />

• A cluster move operation to the site with the failed management network might fail.<br />

The event ID 2002 message is displayed in the host application event log.<br />

Application Event Log Message on Host Server<br />

Type : warning<br />

Source : 30mControl<br />

Event Category: None<br />

EventID : 2002<br />

Date : 05/30/2008<br />

Time : 2:46:29 PM<br />

User : N/A<br />

Computer : USMV-SYDNEY<br />

Description : Online resource failed. RA CLI command failed because of a network communication<br />

error or invalid IP address.<br />

Action : Verify the network connection between the system and the site management IP Address<br />

specified for the resource. Ping each site management IP Address mentioned for the specified resource.<br />

Note: The preceding information can also be viewed in the cluster log.<br />

• If the management console was open with the IP address of the failed site, the<br />

message “Connection with RA was lost, please check RA and network settings” is<br />

displayed. The management console display shows “not connected,” and the<br />

components have a question mark “Unknown” status as illustrated in Figure 7–6.<br />


Figure 7–6. Management Console Display: “Not Connected”<br />

• The management console log displays a message for event 3023 as shown in<br />

Figure 7–7.<br />

Figure 7–7. Management Console Message for Event 3023<br />


• The management console log messages might appear as in the following table.

Event ID    Description
3023        For each RA at the site, this console log message is displayed:
            Error in LAN link to RA. (RA )
3022        When the LAN link is restored, a management console log displays:
            LAN link to RA restored. (RA )

Each of these events also generates an e-mail notification (immediate or daily summary).

Actions to Resolve the Problem

Note: Typically, a network administrator for the site is required to diagnose which<br />

network switch, gateway, or connection is the cause of this failure.<br />

Perform the following actions to isolate and resolve the problem:<br />

1. Ping from the cluster node to the RA box management IP address at the same site.<br />

Repeat this action for the other site. If the local connections are working at both<br />

sites, the problem is with the WAN connection such as a network switch or gateway<br />

connection.<br />

2. If one site from step 1 fails, ping from the cluster node to the gateway of that site. If<br />

the ping completes, then proceed to step 3.<br />

3. Use the Installation Manager site connectivity IP diagnostic from the RAs to the<br />

gateway at each site by performing the following steps. (For more information, see<br />

Appendix C.)<br />

a. Log in to an RA as user boxmgmt with the password boxmgmt.<br />

b. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />

c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.<br />

d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.<br />

e. When asked to select a target for the tests, type 5 (Other host) and press<br />

Enter.<br />

f. Enter the IP address for the gateway that you want to test.<br />

g. Repeat steps a through f for each RA.<br />

4. Isolate the site by determining which gateway failed. Use standard network methods<br />

such as pinging to make the determination.<br />



Replication Network Failure in a Geographic Clustered Environment

Problem Description

This type of event occurs when the RA cannot replicate data to the remote site because of a replication network (WAN) failure. Because this error is transparent to MSCS and the cluster nodes, cluster resources and nodes are not affected. Each cluster node continues to run, and data transactions sent to their local cluster disk are completed.

Figure 7–8 illustrates this failure.

Figure 7–8. Replication Network Failure

The RA cannot replicate data while the WAN is down. During this failure, the RA keeps a record of data written to local storage. Once the WAN is restored, the RA updates the replication volumes on the remote site.

During the replication network failure, the RAs prevent the quorum and data resources from failing over to the remote site. This behavior differs from a total communication failure or a total site failure in which the data groups are allowed to fail over. The quorum group is never allowed to fail over automatically when the RAs cannot communicate over the WAN.

Symptoms<br />

Notes:<br />

• If the management network has also failed, see “Total Communication Failure in a<br />

Geographic Clustered Environment” later in this section.<br />

• If all RAs at a site have failed, see “Failure of All RAs at One Site” in Section 8.<br />

If the administrator issues a move-group operation from the Cluster Administrator for a<br />

data or quorum group, the cluster accepts failover only to another node within the same<br />

site. Group failover to the remote site is not allowed, and the resource group fails back<br />

to a node on the source site.<br />

Although automatic failover is not allowed, the administrator can perform a manual<br />

failover to the remote site. Performing a manual failover results in a loss of data. The<br />

administrator chooses an available image for the failover.<br />
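If you decide to perform the manual move (the full manual failover procedure is in Section 4), the group move itself can be initiated either from Cluster Administrator or from the cluster.exe command line. The following command is only a sketch; the group name "Group 0" and node name USMV-SYDNEY are illustrative values, and you should verify the exact cluster.exe syntax for your Windows version:

    cluster group "Group 0" /move:USMV-SYDNEY

Remember that performing the manual failover results in the loss of data described above.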

Important considerations for this type of failure are as follows:

• This type of failure does not have an immediate effect on the cluster service or the<br />

cluster nodes. The quorum group cannot fail over to the remote site and goes back<br />

online at the source site.<br />

• Only local failovers are permitted. Remote failovers require that the administrator<br />

perform the manual failover process.<br />

• The <strong>SafeGuard</strong> 30m Control resource and the data consistency groups cannot fail<br />

over to the remote site while the WAN is down; they go back online at the source<br />

site.<br />

• Only one site has up-to-date data. Replication does not occur until the WAN is<br />

restored.<br />

• If the administrator manually chooses to use remote data instead of the source data,<br />

data loss occurs.<br />

• Once the WAN is restored, normal operation continues; however, the groups might<br />

initiate a long resynchronization.<br />

The following symptoms might help you identify this failure:<br />

• The management console display shows errors similar to the image in Figure 7–9.<br />

This image shows the dialog box displayed after clicking the red Errors in the right<br />

column. The More Info message box is displayed with messages similar to those in<br />

the figure but appropriate for your site. If only one RA is down, see Section 8 for<br />

resolution actions. Notice in the figure that all RA data links at the site are down.<br />



Figure 7–9. Management Console Display: WAN Down<br />


This figure also shows the Groups tab and the messages that the data consistency<br />

groups and the quorum group are “Paused by system.” If the groups are not paused<br />

by the system, a switchover might have occurred. See Section 8 for more<br />

information. If all groups are not paused, see Section 5, “Solving Storage Problems.”<br />

• Warnings and informational messages similar to those shown in Figure 7–10 appear<br />

on the management console when the WAN is down. See the table after the figure<br />

for an explanation of the numbered console messages.<br />

Figure 7–10. Management Console Log Messages: WAN Down<br />

The following table explains the numbers in Figure 7–10. You might also see the<br />

events in the table denoted by an asterisk (*) in the management console log.<br />


Reference No./Legend    Event ID    Description
*                       3001        The RA is currently experiencing a problem communicating with its cluster. The details explain that an event 3000 means that the RA functionality will be restored.
*                       3000        The RA is successfully communicating with its cluster. In this case, the RA communicates by means of the management link.
1                       4001        For each consistency group on the Auckland and the Sydney sites, the transfer is paused.
2                       4008        For each quorum group on the Auckland and the Sydney sites, the transfer is paused.
*                       4043        For each group on the Auckland and Sydney sites, the "group site is deactivated" message might appear with the detail showing the reason for the switchover. The RA attempts to switch over to resolve the problem.
3                       4001        The event is repeated after the switchover attempt.

Each of these events also generates an e-mail notification (immediate or daily summary).

• If you review the management console RAs tab, the data link column lists errors for<br />

all RAs, as shown in Figure 7–11. The data link is the replication link between peer<br />

RAs. Notice that the WAN link shows OK because the RAs can still communicate<br />

over the management link. There is no column for the management link.<br />

Figure 7–11. Management Console RAs Tab: All RAs Data Link Down<br />

• If you review the host application event log, no messages appear for this failure<br />

unless a data resource move-group operation is attempted. If this move-group<br />

operation is attempted, then messages similar to the following are listed:<br />

Application event log<br />

Event Type : Warning<br />

Event Source : 30mControl<br />

Event Category: None<br />

Event ID : 1119<br />

Date : 5/30/2008<br />

Time : 3:27:49 PM<br />

User : N/A<br />

Computer : USMV-SYDNEY<br />



Description : Online resource failed.<br />

Cannot complete transfer for auto failover (7).<br />

The following could cause this error:<br />

1. Wan is down.<br />

2. Long resynchronization might be in progress.<br />

The resource might have to be brought online manually.<br />


RA Version: 3.0(g.60)<br />

Resource name: Data1<br />

RA CLI command: UCLI -ssh -l plugin -pw **** -superbatch 172.16.25.50 initiate_failover group=Data1<br />

active_site=Sydney cluster_owner=USMV-SYDNEY<br />

• If you review the system event log, a message similar to the following example is<br />

displayed:<br />

System Event Log<br />

Event Type : Error<br />

Event Source : ClusSvc<br />

Event Category: Failover Mgr<br />

Event ID : 1069<br />

Date : 5/30/2008<br />

Time : 3:27:50 PM<br />

User : N/A<br />

Computer : USMV-SYDNEY<br />

Description : Cluster resource 'Data1' in Resource Group 'Group 0' failed.<br />

Note: Data1 would change to the Quorum drive if the quorum was moved.<br />

• If you review the cluster log, you can see an error if a data or a quorum move-group<br />

operation is attempted. Messages similar to the following are listed:<br />

Cluster Log for the Node to which the Move Was Attempted<br />

Key messages<br />

00000d4c.00000910::2008/05/30-15:27:22.077 INFO Physical Disk : [DiskArb]-------<br />

DisksArbitrate -------.<br />

………………..<br />

00000d4c.00000910::2008/05/30-15:27:35.608 ERR Physical Disk : [DiskArb] Failed to write<br />

(sector 12), error 170.<br />

00000d4c.00000910::2008/05/30-15:27:35.608 INFO Physical Disk : [DiskArb] Arbitrate returned<br />

status 170.<br />

Cluster Log for the Node to which the Data Group Move Was Attempted<br />

00000e60.00000940::2008/05/30-15:53:38.470 INFO Unisys <strong>SafeGuard</strong> 30m Control :<br />

KfResourceTerminate: Resource 'Data1' terminated. AbortOnline=1 CancelConnect=0<br />

terminateProcess=0.<br />

0000099c.00000dd4::2008/05/30-15:53:38.470 INFO [CP] CppResourceNotify for resource Data1<br />

0000099c.00000dd4::2008/05/30-15:53:38.470 INFO [FM] RmTerminateResource: a16fc059-e4d3-4bc8a15a-6440e9b2f976<br />

is now offline<br />

0000099c.00000dd4::2008/05/30-15:53:38.470 WARN [FM] Group failure for group . Create thread to take offline and move<br />


Actions to Resolve the Problem<br />

Note: Typically, a network administrator for the site is required to diagnose which<br />

network switch, gateway, or connection is the cause of this failure.<br />

Perform the following actions to isolate and resolve the problem:<br />

1. On the management console, observe that a WAN error occurred for all RAs and that<br />

the data link is in error for all RAs. If that is not the case, see Section 8 for resolution<br />

actions.<br />

2. Use the Installation Manager site connectivity IP diagnostic from the RAs to the<br />

gateway at each site by performing the following steps. (For more information, see<br />

Appendix C.)<br />

a. Log in to an RA as user boxmgmt with the password boxmgmt. (An example command-line SSH session follows this list.)

b. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />

c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.<br />

d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.<br />

e. When asked to select a target for the tests, type 5 (Other host) and press<br />

Enter.<br />

f. Enter the IP address for the gateway that you want to test.<br />

g. Repeat steps a through f for each RA.<br />

3. Isolate the site by determining which network switch or gateway failed. Use<br />

standard network methods such as pinging to make the determination.<br />

4. In some cases, the WAN connection might appear to be down because a firewall is<br />

blocking ports. See “Port Information” later in this section.<br />

5. If all RAs at both sites can connect to the gateway, the problem is related to the link.<br />

In this case, check the connectivity between subnets by pinging between machines<br />

on the same subnet (not RAs) and between a non-RA machine at one site and an RA<br />

at the other site.<br />

6. Verify that no routing problems exist between the sites.<br />

7. Optionally, follow the recovery actions to manually move cluster and data resource<br />

groups to the other site if necessary. This action results in a loss of data. Do not<br />

attempt this manual recovery unless the WAN failure has affected applications.<br />

If you choose to manually move groups, refer to Section 4 for the procedures.<br />

Once you observe on the management console that the WAN error is gone, verify<br />

that the consistency groups are resynchronizing.<br />

If a move-group operation is issued to the other site while the group is resynchronizing, the command fails with a return code 7 (long resync in progress), and the group moves back to the original node.
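The Installation Manager session in step 2 is normally opened with PuTTY. If you prefer a command line, the plink utility supplied with PuTTY can open the same SSH session; the address 10.0.0.11 below is a placeholder for the box management IP address of the RA being tested:

    plink -ssh -l boxmgmt 10.0.0.11

When prompted, enter the boxmgmt password and then make the menu selections listed in step 2 (3, 1, 1, 5, and the gateway IP address).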



Temporary WAN Failures<br />

Problem Description

All applications are unaffected. The target image is not up-to-date.

Symptoms

On the management console, messages show the transfer between sites switching between "Paused by system" and "Initializing/Active." All groups appear unstable over the WAN connection.

Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve this problem:<br />

1. If the connection problem is temporary but recurs, check for a problematic network, such as a high percentage of packet loss caused by bad network connections, insufficient bandwidth that is causing an overloaded network, and so on. (An example packet-loss check follows this list.)

2. Verify that the bandwidth allocated to this link is reasonable and that no<br />

unreasonable external or internal (consistency group bandwidth policy) limits are<br />

causing an overloaded network.<br />
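One way to quantify the packet loss mentioned in step 1 is to send a longer series of pings across the WAN and read the loss statistics that ping reports. The address 10.1.1.50 is a placeholder for a host at the remote site:

    ping -n 100 10.1.1.50

A loss percentage that is consistently more than a few percent, or round-trip times that vary widely, usually indicates the kind of unstable link that causes groups to alternate between "Paused by system" and "Initializing/Active."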


Private Cluster Network Failure in a Geographic Clustered Environment

Problem Description

When the private cluster network fails, the cluster nodes are able to communicate with the public cluster network if the cluster public address is set for all communication. No cluster resources fail over, and current processing on the cluster nodes continues. Clients do not experience any impact by this failure.

Figure 7–12 illustrates this scenario.

Figure 7–12. Private Cluster Network Failure

Unisys recommends that the public cluster network be set for "All communications" and the private cluster LAN be set for "internal cluster communications only…" You can verify these settings in the "Networks" properties section within Cluster Administrator. See "Checking the Cluster Setup" in Section 4.
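These role settings can also be checked from a command prompt on a cluster node, assuming the cluster.exe command-line tool is available there. The network names "Public" and "Private" are the names used in the examples in this guide; substitute the names defined in your cluster, and verify the exact cluster.exe syntax for your Windows version:

    cluster network "Public" /prop
    cluster network "Private" /prop

The Role property reported for each network indicates whether it is used for client access, internal cluster communication, or all communications.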

If the public cluster network was not set for "All communications" but instead was set for "Client access only," the following symptoms occur:

• All nodes except the node that owned the quorum stop MSCS. This action is completed to prevent a "split brain" situation.

• All resources move to the surviving node.



Symptoms<br />

The following symptoms might help you identify this failure:<br />


• When the private cluster network fails, the Cluster Administrator displays an error<br />

indicator similar to Figure 7–13.<br />

All private network connections show a status of “Unknown” when the problem is a<br />

WAN issue.<br />

If only two of the connections failed (and the nodes are physically located at the<br />

same site), the issue is probably local to the site.<br />

If only one connection failed, the issue is probably a host network adapter.<br />

Figure 7–13. Cluster Administrator Display with Failures<br />

• On the cluster nodes at both sites, the system event log contains entries from the

cluster service similar to the following:<br />

Event Type : Warning<br />

Event Source : ClusSvc<br />

Event Category: Node Mgr<br />

Event ID : 1123<br />

Date : 5/30/2008<br />

Time : 4:03:10 PM<br />

User : N/A<br />


Computer : USMV-SYDNEY

Description:

The node lost communication with cluster node 'USMV-AUCKLAND' on network 'Private'.<br />

Event Type : Warning<br />

Event Source : ClusSvc<br />

Event Category: Node Mgr<br />

Event ID : 1126<br />

Date : 5/30/2008<br />

Time : 4:03:12 PM

User : N/A<br />

Computer : USMV-SYDNEY

Description:

The interface for cluster node 'USMV-AUCKLAND' on network 'Private' is unreachable by at least one<br />

other cluster node attached to the network. The server cluster was not able to determine the location of<br />

the failure. Look for additional entries in the system event log indicating which other nodes have lost<br />

communication with node USMV-AUCKLAND. If the condition persists, check the cable connecting the<br />

node to the network. Then, check for hardware or software errors in the node's network adapter. Finally,<br />

check for failures in any other network components to which the node is connected such as hubs,<br />

switches, or bridges.<br />

Event Type : Warning<br />

Event Source : ClusSvc<br />

Event Category: Node Mgr<br />

Event ID : 1130<br />

Date : 5/30/2008<br />

Time : 4:03:12 PM<br />

User : N/A<br />

Computer : USMV-SYDNEY

Description:

Cluster network 'Private' is down. None of the available nodes can communicate using this network. If

the condition persists, check for failures in any network components to which the nodes are connected<br />

such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the network. Finally,<br />

check for hardware or software errors in the adapters that attach the nodes to the network.<br />



Actions to Resolve the Problem<br />


Note: Typically, a network administrator for the site is required to diagnose which<br />

network switch, gateway, or connection is the cause of this failure.<br />

Perform the following actions to isolate and resolve the problem:<br />

1. In the Cluster Administrator, view the network properties of the public and private<br />

network.<br />

The public network should be operational with no failure indications.<br />

The private network should display errors. Refer to the previous symptoms to<br />

identify that this is a WAN issue. If the error is limited to one host, the problem<br />

might be a host network adapter. See “<strong>Public</strong> NIC Failure on a Cluster Node in a<br />

Geographic Clustered Environment” for action to resolve a host network problem.<br />

2. Check for network problems using methods such as isolating the failure to the<br />

network switch or gateway with the problem.<br />


Total Communication Failure in a Geographic Clustered Environment

Problem Description

A total communication failure implies that the cluster nodes and RAs are no longer able to communicate with each other over the public and private network interfaces.

Figure 7–14 illustrates this failure.

Figure 7–14. Total Communication Failure

When this failure occurs, the cluster nodes on both sites detect that the cluster heartbeat has been broken. After six missed heartbeats, the cluster nodes go into a "regroup" process to determine which node takes ownership of all cluster resources. This process consists of checking network interface states and then arbitrating for the quorum device.

During the network interface detection phase, all nodes perform a network interface check to determine that the node is communicating through at least one network interface dedicated for client access, assuming the network interface is set for "All communications" or "Client access only." If this process determines that the node is not communicating through any viable network, the cluster node voluntarily stops cluster service and drops out of the quorum arbitration process. The remaining nodes then attempt to arbitrate for the quorum device.



Quorum arbitration succeeds on the site that originally owned the quorum consistency<br />

group and fails on the nodes that did not own the quorum consistency group. Cluster<br />

service then shuts itself down on the nodes where quorum arbitration fails.<br />

In Microsoft Windows 2000 environments, MSCS does not check for network interface<br />

availability during the regroup process and starts the quorum arbitration process<br />

immediately after a regroup process is initiated—that is, after six missed heartbeats.<br />

Once the cluster has determined which nodes are allowed to remain active in the<br />

cluster, the cluster node attempts to bring online all data groups previously owned by the<br />

other cluster nodes. The <strong>SafeGuard</strong> 30m Control resource and its associated dependent<br />

resources will come online.<br />

During this total communication failure, replication is “Paused by system.” An extended<br />

outage requires a full volume sweep. Refer to Section 4 for more information.<br />

Symptoms

The following symptoms might help you identify this failure:

• The management console shows a WAN error; all groups are paused. The other site<br />

shows a status of “Unknown.” Figure 7–15 illustrates one site.<br />

Figure 7–15. Management Console Display Showing WAN Error<br />


• The RAs tab on the management console lists errors as shown in Figure 7–16.<br />

Figure 7–16. RAs Tab for Total Communication Failure<br />

• Warnings and informational messages similar to those shown in Figure 7–17 appear<br />

on the management console. See the table after the figure for an explanation of the<br />

numbered console messages.<br />

Figure 7–17. Management Console Messages for Total Communication Failure<br />



The following table explains the numbered messages in Figure 7–17.

Reference No.    Event ID    Description
1                4001        For each consistency group, a group capabilities minor problem is reported. The details indicate that a WAN problem is suspected on both RAs.
2                4008        For each consistency group on the West and the East sites, the transfer is paused. The details indicate a WAN problem is suspected.
3                3021        For each RA at each site, the following error message is reported: Error in WAN link to RA at other site (RA x)
4                1008        The following message is displayed: User action succeeded. The details indicate that a failover was initiated. This message appears when the groups are moved by the SafeGuard Control resource to the surviving cluster node.

Each of these events also generates an e-mail notification (immediate or daily summary).

• All cluster resources appear online after successfully failing over to the surviving<br />

node.<br />

• The cluster service stops on all nodes except the surviving node.<br />

• From the surviving node, the host system event log has entries similar to the<br />

following:<br />

Event Type : Warning<br />

Event Source : ClusSvc<br />

Event Category: Node Mgr<br />

Event ID : 1123<br />

Date : 6/1/2008<br />

Time : 12:58:55 PM<br />

User : N/A<br />

Computer : USMV-WEST2<br />

Description:<br />

The node lost communication with cluster node 'USMV-EAST2' on <strong>Public</strong> network.<br />

Event Type : Warning<br />

Event Source : ClusSvc<br />

Event Category: Node Mgr<br />

Event ID : 1123<br />

Date : 6/1/2008<br />


Time : 12:58:55 PM<br />

User : N/A<br />

Computer : USMV-WEST2<br />

Description:<br />

The node lost communication with cluster node 'USMV-EAST2' on Private network.<br />

Event Type : Warning<br />

Event Source : ClusSvc<br />

Event Category: Node Mgr<br />

Event ID : 1135<br />

Date : 6/1/2008<br />

Time : 12:58:16 PM<br />

User : N/A<br />

Computer : USMV-WEST2<br />

Description:<br />

Cluster node USMV-EAST2 was removed from the active server cluster membership. Cluster service may<br />

have been stopped on the node, the node may have failed, or the node may have lost communication<br />

with the other active server cluster nodes.<br />

Event Type : Information<br />

Event Source : ClusSvc<br />

Event Category: Failover Mgr<br />

Event ID : 1200<br />

Date : 6/1/2008<br />

Time : 12:58:21 PM<br />

User : N/A<br />

Computer : USMV-WEST2

Description:

The Cluster Service is attempting to bring online the Resource Group "Group 1".<br />

Event Type : Information<br />

Event Source : ClusSvc<br />

Event Category: Failover Mgr<br />

Event ID : 1201<br />

Date : 6/1/2008<br />

Time : 1:02:54 PM<br />

User : N/A<br />

Computer : USMV-WEST2

Description:

The Cluster Service brought the Resource Group "Group 1" online.<br />


• From the surviving node, the private and public network connections show an<br />

exclamation mark “Unknown” status as shown in Figures 7–18 and 7–19.<br />

Figure 7–18. Cluster Administrator Showing Private Network Down<br />

Figure 7–19. Cluster Administrator Showing <strong>Public</strong> Network Down<br />


Actions to Resolve the Problem<br />

Note: Typically, a network administrator for the site is required to diagnose which<br />

network switch, gateway, or connection is the cause of this failure.<br />

Perform the following actions to isolate and resolve the problem:<br />

1. When you observe on the management console that a WAN error occurred on site 1<br />

and on site 2, call the other site to verify that each management console is available<br />

and shows a WAN down because of the failure. If only one site can access the<br />

management console, the problem is probably not a total WAN failure but rather a<br />

management network failure. In that case, see “Management Network Failure in a<br />

Geographic Clustered Environment.”<br />

2. In the Cluster Administrator, verify that only one node is active in the cluster.<br />

3. View the network properties of the public and private network.<br />

The display should show an “Unknown” status for the private and public network.<br />

4. Check for network problems using methods such as isolating the failure to the<br />

network switch or gateway by pinging from the cluster node to the gateway at each<br />

site.<br />

Port Information<br />

Problem Description

Communications problems might occur because of firewall settings that prevent all necessary communication.

Symptoms

The following symptoms might help you identify this problem:

• Unable to reach the DNS server.<br />

• Unable to communicate to the NTP server.<br />

• Unable to reach the mail server.<br />

• The RAs tab shows RA data link errors.<br />

• The management console shows errors for the WAN.<br />

• The management console logs show RA communications errors.<br />



Actions to Resolve<br />


Perform the port diagnostics from each of the RAs by following the steps given in<br />

Appendix C.<br />

The following tables provide port information that you can use in troubleshooting the status of connections. (An example port connectivity check follows the tables.)

Table 7–2. Ports for Internet Communication

Port Numbers    Protocol or Protocols                Unisys Product Support IP Address
21              FTP                                  192.61.61.78
443             Used for remote maintenance (TCP)    129.225.216.130

The following tables list ports used for communication other than Internet<br />

communication.<br />

Table 7–3. Ports for Management LAN<br />

Communication and Notification<br />

Port Numbers Protocol or Protocols<br />

21 Default FTP port (needed for collecting system<br />

information)<br />

22 Default SSH and communications between RAs<br />

25 Default outgoing mail (SMTP) port, used when e-mail alerts from the RA are configured

80 Web server for management (TCP)<br />

123 Default NTP port<br />

161 Default SNMP port<br />

443 Secure Web server for management (TCP)<br />

514 Syslog (UDP)<br />

1097 RMI (TCP)<br />

1099 RMI (TCP)<br />

4401 RMI (TCP)<br />

4405 Host-to-RA kutils communications (SQL<br />

commands) and KVSS (TCP)<br />

7777 Automatic host information collection<br />


The ports listed in Table 7–4 are used for both the management LAN and WAN.<br />

Table 7–4. Ports for RA-to-RA Internal<br />

Communication<br />

Port Numbers Protocol or Protocols<br />

23 telnet<br />

123 NTP (UDP)<br />

1097 RMI (TCP)<br />

1099 RMI (TCP)<br />

4444 TCP<br />

5001 TCP (default iperf port for performance<br />

measuring between RAs)<br />

5010 Management server (UDP, TCP)<br />

5020 Control (UDP, TCP)<br />

5030 RMI (TCP)<br />

5040 Replication (UDP, TCP)<br />

5060 Mpi_perf (TCP)<br />

5080 Connectivity diagnostics tool<br />
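As a quick, hedged check of an individual TCP port when a firewall is suspected, the Windows telnet client can be pointed at the port: the session either opens (port reachable) or times out (port blocked). The address 10.0.0.11 stands in for an RA management or WAN IP address:

    telnet 10.0.0.11 5040    (replication port from Table 7–4)
    telnet 10.0.0.11 443     (secure management port from Table 7–3)

UDP ports, such as 123 (NTP) and 514 (syslog), cannot be tested this way; use the port diagnostics described in Appendix C for those.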



Section 8<br />

Solving Replication Appliance (RA)<br />

Problems<br />

This section lists symptoms that usually indicate problems with one or more Unisys<br />

<strong>SafeGuard</strong> 30m replication appliances (RAs). The problems include hardware failures.<br />

The graphics, behaviors, and examples in this section are similar to what you observe<br />

with your system but might differ in some details.<br />

For problems relating to RAs, gather the RA logs and ask the following questions:<br />

• Are any errors displayed on the management console?<br />

• Is the issue constant? Is the issue a one-time occurrence? Does the issue occur at<br />

intervals?<br />

• What are the states of the consistency groups?<br />

• What is the timeframe in which the problem occurred?<br />

• When was the first occurrence of the problem?<br />

• What actions were taken as a result of the problem or issue?<br />

• Were any recent changes made in the replication environment? If so, what?<br />

Table 8–1 lists symptoms and possible causes for the failure of a single RA on one site<br />

with a switchover as a symptom. Table 8–2 lists symptoms and possible causes for the<br />

failure of a single RA on one site without switchover symptoms. Table 8–3 lists<br />

symptoms and other possible problems regarding multiple RA failures. Each problem and<br />

the actions to resolve it are described in this section.<br />

In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />

for possible problems. Also, messages similar to e-mail notifications might be displayed<br />

on the management console. If you do not see the messages, they might have already<br />

dropped off the display. Review the management console logs for messages that have<br />

dropped off the display.<br />


Table 8–1. Possible Problems for Single RA Failure with a Switchover

Symptoms: The management console shows RA failure.
Possible Problem: Single RA failure

Possible Contributing Causes to Single RA Failure with a Switchover

Symptoms: The system frequently pauses transfer for all consistency groups. If you log in to the failed RA as the boxmgmt user, a message is displayed explaining that the reboot regulation limit has been exceeded. The management console shows repeated events that report an RA is up followed by an RA is down.
Possible Problem: Reboot regulation failover

Symptoms: The link indicator lights on all host bus adapters (HBAs) are not illuminated. The port indicator lights on the Fibre Channel switch no longer show a link to the RA. Port errors occur or there is no target when running the SAN diagnostics. The management console shows RA failure with details pointing to a problem with the repository volume.
Possible Problem: Failure of all SAN Fibre Channel HBAs on one RA

Symptoms: The link indicator lights on the HBA or HBAs are not illuminated. The port indicator lights on the network switch or hub no longer show a link to the RA.
Possible Problem: Onboard WAN network adapter failure (or failure of the optional gigabit Fibre Channel WAN network adapter)


Table 8–2. Possible Problems for Single RA Failure Without a Switchover

Symptoms: The link indicator lights on the onboard management network adapter are not illuminated.
Possible Problem: Onboard management network adapter failure

Symptoms: The failure light for the hard disk indicates a failure. An error message that appears during a boot operation indicates failure of one of the internal disks.
Possible Problem: Single hard-disk failure

Symptoms: The link indicator lights on the HBA are not illuminated. The port indicator lights on the Fibre Channel switch no longer show a link to the RA. For one of the ports on the relevant RA, errors appear when running the SAN diagnostics.
Possible Problem: Port failure of a single SAN Fibre Channel HBA on one RA

Table 8–3. Possible Problems for Multiple RA Failures with Symptoms

Symptoms: Replication has stopped on all groups. MSCS fails over groups to the other site, or MSCS fails on all nodes. The management console displays a WAN error to the other site.
Possible Problem: Failure of all RAs on one site

Symptoms: Replication has stopped on all groups. MSCS fails over groups to the other site, or MSCS fails on all nodes. The management console displays a WAN error to the other site.
Possible Problem: All RAs on one site are not attached


Single RA Failures<br />

Problem Description<br />

When an RA fails, a switchover might occur. In some cases, a switchover does not<br />

occur. See “Single RA Failures With Switchover” and “Single RA Failures Without<br />

Switchover.”<br />

Understanding Management Console Access<br />

If the RA that failed had been running site control—that is, the RA owned the virtual<br />

management IP network—and a switchover occurs, the virtual IP address moves to the<br />

new RA.<br />

If you attempt to connect to the management console using one of the static<br />

management IP addresses of the RAs, a connection error occurs if the RA does not have<br />

site control. Thus, you should use the site management IP address to connect to the<br />

management console.<br />

At least one RA (either RA 1 or RA 2) must be attached to the RA cluster for the<br />

management console to function.<br />

If the RA that failed was running site control and a switchover does not occur (such as<br />

with an onboard management network connection failure), the management console<br />

might not be accessible. Also, attempts to log in using PuTTY fail if you use the<br />

boxmgmt log-in account. When an RA does not have site control, you can always log in<br />

using PuTTY and the boxmgmt log-in account.<br />

You cannot determine which RA owns site control unless the management console is<br />

accessible. The site control RA is designated at the bottom of the management console display.

Another situation in which you cannot log in to the management console is when the<br />

user account has been locked. In this case, follow these steps:<br />

1. Log in interactively using PuTTY with another unlocked user account.<br />

2. Enter unlock_user.<br />

3. Determine whether any users are listed, and follow the messages to unlock the<br />

locked user accounts.<br />

Figure 8–1 illustrates a single RA failure.

Figure 8–1. Single RA Failure

Single RA Failure with Switchover

In this case, a single RA fails, and there is an automatic switchover to a surviving RA on the same site. Any groups that had been running on the failed RA run on a surviving RA at the same site.

Each RA handles the replicating activities of the consistency groups for which it is designated as the preferred RA. The consistency groups that are affected are those that were configured with the failed RA as the preferred RA. Thus, whenever an RA becomes inoperable, the handling of the consistency groups for that RA switches over automatically to the functioning RAs in the same RA cluster.

During the RA switchover process, the server applications do not experience any I/O failures. In a geographic clustered environment, MSCS is not aware of the RA failure, and all application and replication operations continue to function normally. However, performance might be affected because the I/O load on the surviving RAs is now increased.



Symptoms<br />

Failures of an RA that cause a switchover are as follows:<br />

• RA hardware issues (such as memory, motherboard, and so forth)<br />

• Reboot regulation failover<br />

• Failure of all SAN Fibre Channel HBAs on one RA<br />

• Onboard WAN network adapter failure (or failure of the optional gigabit Fibre Channel<br />

WAN network adapter)<br />

The following symptoms might help you identify this failure:<br />

• The RA does not boot.<br />

From a power-on reset, the BIOS display shows the BIOS information, RAID adapter<br />

utility prompt, logical drives found, and so forth. The display is similar to the<br />

information shown in Figure 8–2.<br />

Figure 8–2. Sample BIOS Display<br />

Once the RA initializes, the log-in screen is displayed.<br />

Note: Because status messages normally scroll on the screen, you might need to<br />

press Enter to see the log-in screen.<br />

• The management console system status shows an RA failure. (See Figure 8–3.)<br />

To display more information about the error, click the red error in the right column.<br />

The More Info dialog box is displayed with a message similar to the following:<br />

RA 1 in West is down<br />


Figure 8–3. Management Console Display Showing RA Error and RAs Tab<br />

• The RAs tab on the management console shows information similar to that in<br />

Figure 8–3, specifically<br />

− The RA status for RA 1 on the West site shows an error.<br />

− The peer RA on the East site (RA 1) shows a data link error.<br />

− Each RA on the East site shows a WAN connection failure.<br />

− The surviving RA at the failed site (West) does not show any errors.<br />

• Warnings and informational messages similar to those shown in Figure 8–4 appear<br />

on the management console when an RA fails and a switchover occurs. See the<br />

table after the figure for an explanation of the numbered console messages. In your<br />

environment, the messages pertain only to the groups configured to use the failed<br />

RA as the preferred RA.<br />


Figure 8–4. Management Console Messages for Single RA Failure with Switchover<br />

The following table explains the numbered messages shown in Figure 8–4.

Reference No.    Event ID    Description
1                3023        At the same site, the other RA reports a problem getting to the LAN of the failed RA.
2                3008        The site with the failed RA reports that the RA is probably down.
3                2000        The management console is now running on RA 2.
4                4001        For each consistency group, a minor problem is reported. The details show that the RA is down or not a cluster member.
5                4008        For each consistency group, the transfer is paused at the surviving site to allow a switchover. The details show the reason for the pause as switchover.
6                4041        For each consistency group at the same site, the groups are activated at the surviving RA. This probably means that a switchover to RA 2 at the failed site was successful.
7                5032        For each consistency group at the failed site, the splitter is again splitting.
8                3021        A WAN link error is reported from each RA at the surviving site regarding the failed RA at the other site.
9                4010        For each consistency group at the failed site, the transfer is started.
10               4086        For each consistency group at the failed site, an initialization is performed.
11               4087        For each consistency group at the failed site, the initialization completes.
12               3007        The failed RA (RA 1) is now restored.

Each of these events also generates an e-mail notification (immediate or daily summary).

To see the details of the messages listed on the management console display, you must<br />

collect the logs and then review the messages for the time of the failure. Appendix A<br />

explains how to collect the management console logs, and Appendix E lists the event<br />

IDs with explanations.<br />

Actions to Resolve the Problem<br />

The following list summarizes the actions you need to perform to isolate and resolve the<br />

problem:<br />

• Check the LCD display on the front panel of the RA. See “LCD Status Messages” in<br />

Appendix B for more information.<br />

If the LCD display shows an error, run the RA diagnostics. See Appendix B for more<br />

information.<br />

• Check all indicator lights on the rear panel of the RA.<br />

• Review the symptoms and actions in the following topics:<br />

− Reboot Regulation<br />

− Onboard WAN Network Adapter Failure<br />

• If you determine that the failed RA must be replaced, contact the Unisys service<br />

representative for a replacement RA.<br />

After you receive the replacement RA, follow the steps in Appendix D to install and<br />

configure it.<br />


The following procedure provides a detailed description of the actions to perform:<br />

1. Remove the front bezel of the RA and look at the LCD display. During normal<br />

operation, the illuminated message should identify the system.<br />

If the LCD display flashes amber, the system needs attention because of a problem<br />

with power supplies, fans, system temperature, or hard drives.<br />

Figure 8–5 shows the location of the LCD display.<br />

Figure 8–5. LCD Display on Front Panel of RA<br />

If an error message is displayed, check Table B–1. For example, the message E0D76<br />

indicates a drive failure. (Refer to “Single Hard Disk Failure” in this section.)<br />

If the message code is not listed in Table B–1, run the RA diagnostics (see Appendix B).

2. Check the indicators at the rear of the RA as described in the following steps and<br />

visually verify that all are working correctly.<br />

Figure 8–6 illustrates the rear panel of the RA.<br />

Note: The network connections on the rear panel labeled 1 and 2 in the following<br />

illustration might appear different on your RA. The connection labeled 1 is always the RA<br />

replication network, and the connection labeled 2 is always the RA management<br />

network. Pay special attention to the labeling when checking the network connections.<br />



Figure 8–6. Rear Panel of RA Showing Indicators

• Ping each network connection (management network and replication network), and visually verify that the LEDs on either side of the cable on the back panel are illuminated. Figure 8–7 shows the location of these LEDs.

If the LEDs are off, the network is not connected. The green LED is lit if the network is connected to a valid link partner on the network. The amber LED blinks when network data is being sent or received.

If the management network LEDs indicate a problem, refer to "Onboard Management Network Adapter Failure" in this section.

If the replication network LEDs indicate a problem, refer to "Onboard WAN Network Adapter Failure" in this section.

Figure 8–7. Location of Network LEDs

• Check that the green LEDs for the SAN Fibre Channel HBAs are illuminated as shown in Figure 8–8.



Figure 8–8. Location of SAN Fibre Channel HBA LEDs<br />

The following table explains the LED patterns and their meanings. If the LEDs<br />

indicate a problem, refer to the two topics for SAN Fibre Channel HBA failures in this

section.<br />

Green LED Amber LED Activity<br />

On On Power<br />

On Off Online<br />

Off On Signal acquired<br />

Off Flashing Loss of synchronization<br />

Flashing Flashing Firmware error<br />

Reboot Regulation<br />

Problem Description<br />

After frequent, unexplained reboots or restarts of the replication process, the RA<br />

automatically detaches from the RA cluster.<br />

When installing the RAs, you can enable or disable this reboot regulation feature. The<br />

factory default is for the feature to be enabled so that reboot regulation is triggered<br />

whenever a specified number of reboots or failures occur within the specified time<br />

interval.<br />

The two parameters available for the reboot regulation feature are the number of reboots<br />

(including internal failures) and the time interval. The default value for the number of<br />

reboots is 10, and the default value for the time interval is 2 hours.<br />

Only Unisys personnel should change these values. Use the Installation Manager to<br />

change the parameter values or disable the feature. See the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong><br />

Replication Appliance Installation <strong>Guide</strong> for information about using the Installation<br />

Manager tools to make these changes.<br />



Symptoms<br />

The following symptoms might help you identify this failure:<br />


• Frequent transfer pauses for all consistency groups that have the same preferred<br />

RA.<br />

• If you log in to the RA as the boxmgmt user, the following message is displayed:<br />

Reboot regulation limit has been exceeded<br />

• Several messages might be displayed on the Logs tab of the management console<br />

as an RA reboots to try to correct a problem. These messages are listed in<br />

Table 8–4.<br />

Table 8–4. Management Console Messages Pertaining to Reboots

Reference No./Legend    Event ID    Description
*                       3008        The RA appears to be down. The RA might attempt to perform a reboot to correct the problem.
*                       3023        Error in LAN link (as RA reboots).
*                       3021        Error in WAN link (as RA reboots).
*                       3007        The RA is up (the reboot completes).
*                       3022        The LAN link is restored (the reboot has completed).
*                       3020        The WAN link at other site is restored (the reboot has completed).

Each of these events also generates an e-mail notification (immediate or daily summary).

When any of these messages appear multiple times in a short time period, they<br />

might indicate an RA that has continuously rebooted and might have reached the<br />

reboot regulation limit.<br />

Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

1. Collect the RA logs before you attempt to resolve the problem. See Appendix A for<br />

information about collecting logs.<br />

2. To determine whether the hardware is faulty, run the RA diagnostics described in<br />

Appendix B.<br />

3. If the problem remains, submit the RA logs to Unisys for analysis.<br />


4. Once the problem is corrected, the RA automatically attaches to the RA cluster after<br />

a power-on reset. If necessary, reattach the RA to the RA cluster manually by<br />

following these steps:<br />

a. Log in as boxmgmt to the RA through an SSH session using PuTTY.<br />

b. At the prompt, type 4 (Cluster operations) and press Enter.<br />

c. At the prompt, type 1 (Attach to cluster) to attach the RA to the cluster.<br />

d. At the prompt, type Q (Quit).<br />

Failure of All SAN Fibre Channel Host Bus Adapters (HBAs)

Problem Description

All SAN Fibre Channel HBAs or adapter ports on the RA fail. This scenario is unlikely<br />

because the RA has redundant ports that are located on different physical adapters. A<br />

SAN connectivity problem is more likely.<br />

Note: A single redundant path does not show errors on the management console<br />

display. See “Port Failure on a Single SAN Fibre Channel HBA on One RA.”<br />

Symptoms

The following symptoms might help you identify this failure:

• The link indicator lights on all SAN Fibre Channel HBAs are not illuminated. (Refer to<br />

Figure 8–8 for the location of these LEDs.)<br />

• The port indicator lights on the Fibre Channel switch no longer show a link to the RA.<br />

• Port errors occur or no target appears when running the Installation Manager SAN<br />

diagnostics.<br />

• Information on the Volumes tab of the management console is inconsistent or<br />

periodically changing.<br />

• The management console shows failures for RAs, storage, and hosts. (See<br />

Figure 8–9.)<br />


Figure 8–9. Management Console Display: Host Connection with RA Is<br />

Down<br />

If you click the red error indication for RAs in the right column, the message is<br />

RA 2 in East can’t access repository volume<br />

If you click the red error indication for storage in the right column, the following<br />

messages are displayed:<br />

If you click the red error indication in the right column for splitters, the message is<br />

ERROR: USMV-EAST2's connection with RA2 is down<br />

• Warnings and informational messages similar to those shown in Figure 8–10 appear<br />

on the management console when an RA fails with this type of problem. See the<br />

table after the figure for an explanation of the numbered console messages.<br />

Also, refer to Figure 8–4 and the table that explains the messages for information<br />

about an RA failure with a generic switchover.<br />

Refer to Table 8–4 for other messages that might occur whenever an RA reboots to<br />

try to correct the problem.<br />


Figure 8–10. Management Console Messages for Failed RA (All SAN HBAs Fail)<br />



The following table explains the numbered messages shown in Figure 8–10. You might also see the messages denoted with an asterisk (*).

Reference 1, event ID 3014: The RA is unable to access the repository volume (RA 2).
Reference 2, event ID 4003: For each consistency group that had the failed RA as the preferred RA, a group consistency problem is reported. The details show a repository volume problem.
Reference 3, event ID 3012: The RA is unable to access volumes (all volumes for repository, journal, and data are listed).
Reference 4, event ID 4086: Initialization started (RA 1, Quorum - West).
Reference 5, event ID 4087: Initialization complete (RA 1, Quorum - West). The group has completed the switchover.

To see the details of the messages listed on the management console display, you<br />

must collect the logs and then review the messages for the time of the failure.<br />

Appendix A explains how to collect the management console logs, and Appendix E<br />

lists the event IDs with explanations.<br />

Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

1. Refer to Section 6, “Solving SAN Connectivity Problems,” to determine whether the<br />

problem is described there.<br />

2. If you determine that the SAN Fibre Channel HBA failed and must be replaced,<br />

contact a Unisys service representative for a replacement adapter.<br />

3. Once the replacement adapter is received, perform the following steps to replace the<br />

failed HBA:<br />

a. Open a PuTTY session using the IP address of the RA and log in as<br />

boxmgmt/boxmgmt.<br />

Appendix C provides additional information about the Installation Manager<br />

diagnostics.<br />

b. On the Main menu, type 3 (Diagnostics) and press Enter.<br />

c. On the Diagnostics menu, type 2 (Fibre Channel diagnostics) and press<br />

Enter.<br />

d. On the Fibre Channel Diagnostics menu, type 4 (View Fibre Channel<br />

details) and press Enter.<br />


Information similar to the following is displayed:<br />

>>Site1 Box 1>>3
Port 0
wwn = 50012482001c6fb0
node_wwn = 50012482001c6fb1
Port id = 0x20100
operating mode = point to point
speed = 2 GB
---------------------------------
Port 1
wwn = 50012482001ce3c4
node_wwn = 50012482001ce3c5
Port id = 0x10100
operating mode = point to point
speed = 2 GB

e. Write down the port information.<br />

f. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.<br />

g. On the Diagnostics menu, type B (Back) and press Enter.<br />

h. On the Main Menu, type 4 (Cluster operations) and press Enter.<br />

i. On the Cluster Operations menu, type 2 (Detach from cluster) and press<br />

Enter.<br />

j. Shut down the RA.<br />

k. Replace the failed adapter with the replacement and then boot the RA.<br />

Note: The replacement adapter does not require any settings to be changed.<br />

l. Repeat steps a through d, and again view the Fibre Channel details to see the<br />

new WWN for the replaced HBA.<br />

m. Using the management interface of the SAN switch, modify the zoning as needed to replace the failed WWN with the new WWN. (A switch-side verification example follows this procedure.)<br />

n. Use the new WWN to configure the storage.<br />

o. On the Fibre Channel Diagnostics menu, type 1 (SAN diagnostics) and<br />

press Enter. (Refer to steps a through c to access the Fibre Channel<br />

Diagnostics menu.)<br />

When you select the SAN diagnostics option, the system conducts automatic<br />

tests that are designed to identify the most common problems encountered in<br />

the configuration of SAN environments.<br />

Once the tests complete, a message is displayed confirming the successful<br />

completion of SAN diagnostics, or a report is displayed that details any critical<br />

configuration problems.<br />

p. Once no problems are reported from the SAN diagnostics, type B (Back) and<br />

press Enter.<br />

q. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.<br />


r. On the Diagnostics menu, type B (Back) and press Enter.<br />

s. On the Main Menu, type 4 (Cluster operations) and press Enter.<br />

t. On the Cluster Operations menu, type 1 (Attach to cluster) and press Enter.<br />

This action reattaches the RA, which automatically reboots and restarts<br />

replication.<br />

Note: The replacement Fibre Channel HBA does not need any configuration changes.<br />
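As an optional cross-check after step m, you can confirm from the switch side that the new WWN has logged in to the fabric and is zoned correctly. The following sketch assumes a Brocade Fibre Channel switch; the commands and prompt differ on other switch vendors, so treat this as an example only.

switch:admin> switchshow      (verify that the port connected to the replaced HBA is Online as an F-Port)
switch:admin> nsshow          (verify that the new WWN appears in the fabric name server)
switch:admin> zoneshow        (verify that the active zoning configuration references the new WWN)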

Failure of Onboard WAN Adapter or Failure of Optional Gigabit<br />

Fibre Channel WAN Adapter<br />

Problem Description<br />

Symptoms<br />

The onboard WAN adapter failed. This capability serves the replication network.<br />

Notes:<br />

• The gigabit Fibre Channel WAN adapter is an optional component found in some<br />

environments. When this board fails, the symptoms are the same as those observed<br />

when the onboard WAN adapter fails. In that case, the indicator lights pertain to the<br />

gigabit Fibre Channel WAN board instead of the onboard capability.<br />

• The actions to resolve the problem are similar once you isolate the board as the<br />

problem. That is, contact a Unisys service representative for a replacement part.<br />

The following symptoms might help you identify this failure:<br />

• Transfer between sites pauses temporarily for all consistency groups for which this<br />

is the preferred RA while an RA switchover occurs.<br />

• Applications continue to run. High loads might occur because of reduced total<br />

throughput capacity.<br />

• The link indicators on the onboard WAN adapter might not be illuminated. (See<br />

Figure 8–6 for the location of the connector for the replication network WAN.<br />

Figure 8–7 illustrates the LEDs.)<br />

• The port lights on the network switch might indicate that there is no link to the<br />

onboard WAN adapter.<br />

• The management console shows a WAN data link failure for RA 1. The More<br />

information for this error provides the message: “RA-x WAN data link is down.” (See<br />

Figure 8–11.)<br />


Figure 8–11. Management Console Showing WAN Data Link Failure<br />

• The RAs tab on the management console (Figure 8–11) shows an error for the same<br />

RA at each site, indicating that the connectivity between them has been lost.<br />

• Warnings and informational messages similar to those shown in Figure 8–4 for an<br />

RA failure are displayed for this failure. Refer to the table after Figure 8–4 for<br />

descriptions of the messages. For this failure, the details of event ID 4001 show a<br />

WAN data path problem.<br />

Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

• Isolate the problem to the onboard WAN adapter by performing the actions in<br />

“Replication Network Failure in a Geographic Clustered Environment” in<br />

Section 7.<br />

• If you determine that the motherboard must be replaced, contact a Unisys service<br />

representative for a replacement part.<br />

• Contact the Unisys <strong>Support</strong> Center for the appropriate BIOS for the replacement<br />

part.<br />

Note: The replacement motherboard might not have the disk controller set for<br />

RAID1 (mirroring). Check the setting and change it if necessary.<br />

• In rare cases, you might need to obtain a replacement RA from a Unisys service<br />

representative. After you receive the replacement RA, follow the steps in Appendix<br />

D to install and configure it.<br />


Single RA Failures Without a Switchover<br />

Problem Description<br />

Some failures that might occur on an RA do not cause a switchover. These failures are<br />

• Port failure on a single SAN Fibre Channel HBA on one RA<br />

• Onboard management network adapter failure<br />

• Single hard disk failure<br />

Port Failure on a Single SAN Fibre Channel HBA on One RA<br />

Problem Description<br />

Symptoms<br />

One SAN Fibre Channel HBA port on the RA failed.<br />

The following symptoms might help you identify this failure:<br />

• The Logs tab on the management console displays a message for event ID 3030—<br />

Warning RA switched path to storage. (RA , Volumes )—only if the<br />

connection failed during an I/O operation.<br />

• The link indicator lights on the SAN Fibre Channel HBA are not illuminated. (Refer to<br />

Figure 8–8 for the location of these LEDs.)<br />

• The port indicator lights on the Fibre Channel switch no longer show a link to the RA.<br />

• For one port on the relevant RA, errors occur when running the Installation Manager<br />

SAN diagnostics. See Appendix C for information about these diagnostics.<br />

Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

1. If you determine that the SAN Fibre Channel HBA failed and must be replaced,<br />

contact a Unisys service representative for a replacement part.<br />

2. Once the replacement adapter is received, perform the following steps to replace<br />

the failed HBA:<br />

a. Open a PuTTY session using the IP address of the RA, and log in as<br />

boxmgmt/boxmgmt.<br />

Appendix C provides additional information about the Installation Manager<br />

diagnostics.<br />

b. On the Main menu, type 3 (Diagnostics) and press Enter.<br />

c. On the Diagnostics menu, type 2 (Fibre Channel diagnostics) and press<br />

Enter.<br />

d. On the Fibre Channel Diagnostics menu, type 4 (View Fibre Channel<br />

details) and press Enter.<br />


Information similar to the following is displayed:<br />

>>Site1 Box 1>>3
Port 0
wwn = 50012482001c6fb0
node_wwn = 50012482001c6fb1
Port id = 0x20100
operating mode = point to point
speed = 2 GB
---------------------------------
Port 1
wwn = 50012482001ce3c4
node_wwn = 50012482001ce3c5
Port id = 0x10100
operating mode = point to point
speed = 2 GB

e. Write down the port information.<br />

f. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.<br />

g. On the Diagnostics menu, type B (Back) and press Enter.<br />

h. On the Main Menu, type 4 (Cluster operations) and press Enter.<br />

i. On the Cluster Operations menu, type 2 (Detach from cluster) and press<br />

Enter.<br />

j. Shut down the RA.<br />

k. Replace the failed adapter with the replacement and then boot the RA.<br />

Note: The replacement adapter does not require any settings to be changed.<br />

l. Repeat steps a through d and again view the Fibre Channel details to see the<br />

new WWN for the replaced HBA.<br />

m. Using the management interface of the SAN switch, modify the zoning as needed to replace the failed WWN with the new WWN.<br />

n. Use the new WWN to configure the storage.<br />

o. On the Fibre Channel Diagnostics menu, type 1 (SAN diagnostics) and<br />

press Enter. (Refer to steps a through c to access the Fibre Channel<br />

Diagnostics menu.)<br />

When you select the SAN diagnostics option, the system conducts automatic<br />

tests that are designed to identify the most common problems encountered in<br />

the configuration of SAN environments.<br />

Once the tests complete, a message is displayed confirming the successful<br />

completion of SAN diagnostics, or a report is displayed that details any critical<br />

configuration problems.<br />

p. Once no problems are reported from the SAN diagnostics, type B (Back) and<br />

press Enter.<br />

q. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.<br />


r. On the Diagnostics menu, type B (Back) and press Enter.<br />

s. On the Main Menu, type 4 (Cluster operations) and press Enter.<br />

t. On the Cluster Operations menu, type 1 (Attach to cluster) and press Enter.<br />

This action reattaches the RA, which automatically reboots and restarts<br />

replication.<br />

Note: The replacement Fibre Channel HBA does not need any configuration changes.<br />

Onboard Management Network Adapter Failure<br />

Problem Description<br />

Symptoms<br />

The onboard management network adapter failed.<br />

The following symptoms might help you identify this failure:<br />

• On the management console, the system status and RA status do not display any<br />

error indications.<br />

• The link indicators on the onboard management network adapter are not illuminated.<br />

(See Figure 8–6 for the location of the connector for the onboard management<br />

network adapter. Figure 8–7 illustrates the LEDs.)<br />

• If RA site control was running on the failed RA, you cannot access the management console; if the management console was already open, a banner showing “not connected” is displayed.<br />

• If RA site control was not running on the failed RA, you can access the management<br />

console.<br />

• You cannot determine which RA owns site control unless the management console<br />

is accessible. The RA site control is designated at the bottom of the display as<br />

follows:<br />

• See “Management Network Failure in a Geographic Clustered Environment” in<br />

Section 7 for additional symptoms.<br />

• The Logs tab on the management console might display a message for event ID<br />

3023—Error in LAN link to RA (RA1)—for this failure.<br />


Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

• Isolate the problem to the onboard management network adapter by performing the<br />

actions in “Management Network Failure in a Geographic Clustered Environment” in<br />

Section 7.<br />

• If you determine the motherboard must be replaced, contact a Unisys service<br />

representative for a replacement part.<br />

• Contact the Unisys <strong>Support</strong> Center for the appropriate BIOS for the replacement<br />

part.<br />

Note: The replacement motherboard might not have the disk controller set for<br />

RAID1 (mirroring). Check the setting and change it if necessary.<br />

• In rare cases, you might need to obtain a replacement RA from a Unisys service<br />

representative. After you receive the replacement RA, follow the steps in Appendix<br />

D to install and configure it.<br />

Single Hard Disk Failure<br />

Problem Description<br />

Symptoms<br />

One of the mirrored internal hard disks for the RA failed.<br />

The following symptoms might help you identify this failure:<br />

• The failure light for a hard disk indicates a failure. Figure 8–12 illustrates the location<br />

of the LEDs for hard disks in the RA.<br />


Figure 8–12. Location of Hard Drive LEDs<br />

• An error message that appears during boot indicates failure of one of the internal<br />

disks.<br />

• The LCD display on the front panel of the RA indicates a drive failure. This error code<br />

is E0D76 as shown in Figure 8–5.<br />

Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

• If the drive failed, you must replace the hard drive. Contact a Unisys service<br />

representative for a replacement part.<br />

• Install the new drive; resynchronization occurs automatically.<br />

Do not power off or reboot the RA while resynchronization is taking place.<br />

Failure of All RAs at One Site<br />

Problem Description<br />

If all RAs fail on one site, replication stops, and the data that are currently changing on the remote site are marked for synchronization. Once the RAs are restored, synchronization occurs through a full-sweep operation.<br />

This type of failure is unlikely unless the power source fails.<br />


Symptoms<br />

The following symptoms might help you identify this failure:<br />

• Transfer is paused for all consistency groups.<br />

• Depending on the environment and group settings, applications that were running on<br />

the failed site might stop.<br />

• If the quorum resource belonged to a node at the failed site, MSCS might fail.<br />

• The symptoms of this failure are similar to those of a total site failure and of a network failure on both the management network and the WAN. Because the WAN link is functioning, the difference is that the following are true:<br />

− Neither site can access the management console using the site management IP<br />

address of the site with the failed RAs.<br />

− Both sites can access the management console using the site management IP<br />

address of the site with the functioning RAs.<br />

Communicate with the administrator at the other site to determine whether that site<br />

can access the management console. Both sites should see a display similar to<br />

Figure 8–13.<br />

Figure 8–13. Management Console Showing All RAs Down<br />

Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

1. Restore power to the failed RAs.<br />

2. If recovery of applications is needed prior to restoring the RAs, see the recovery<br />

topics in Section 3 for geographic replication environments and in Section 4 for<br />

geographic clustered environments.<br />



All RAs Are Not Attached<br />

Problem Description<br />

Symptoms<br />


If all RAs at a site are not attached, connection to the management console is not<br />

available. Also, you cannot access the RA using a PuTTY session and the site<br />

management IP address. You cannot log into the RA using the RA management IP<br />

address and the admin user account. The RA that runs site control is assigned a virtual IP<br />

address that is the site management IP address. Either RA 1 or RA 2 must be attached<br />

to the cluster to have an RA cluster with site control running.<br />

The following symptoms might help you identify this failure:<br />

• You cannot log in to the management console using the site management IP<br />

addresses of the failed sites.<br />

• You cannot initiate an SSH session through PuTTY using the admin account to either<br />

RA management IP address or the site management IP address.<br />

• From the management console of the other site, the WAN appears to be down. (See<br />

Figure 8–11.)<br />

Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

1. Ping the RA using the management IP address (see the example after these steps). If the ping is not successful, refer to “Management Network Failure in a Geographic Clustered Environment” in Section 7. If the ping completes successfully, continue with steps 2 through 5.<br />

2. Log in as boxmgmt to each RA management IP address through an SSH session<br />

using PuTTY. (See “Using the SSH Client” in Appendix C for more information.) If<br />

this is not successful, the RA is probably not attached.<br />

3. To verify that the RA is not attached, follow these steps:<br />

a. Log in as boxmgmt to the RA.<br />

b. At the prompt, type 4 (Cluster operations) and press Enter.<br />

Note: The “reboot regulation limit has been exceeded” message might be displayed when you log in as boxmgmt. In that case, see “Reboot Regulation” in this section.<br />

c. At the prompt, type 2 (Detach from cluster) and press Enter.<br />

Do not type y to detach. If the RA was not attached, a message is displayed<br />

stating that it is not detached.<br />


Note: Either RA 1 or RA 2 must be attached to have a cluster. RAs 3 through 8<br />

cannot become cluster masters.<br />

4. If the RA is not attached, then type B (Back) and press Enter.<br />

5. At the prompt, type 1 (Attach to cluster) to attach the RA to the cluster.<br />

6. At the prompt, type Q (Quit).<br />

7. Once the RA is attached, log in as admin to the management console and also<br />

initiate a SSH session to the management IP address to ensure that both are<br />

operational.<br />

8. At the management console, click the RAs tab and check that all connections are<br />

working.<br />
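For steps 1 and 2, the checks from the management workstation might look similar to the following sketch. The IP address is a placeholder; substitute the RA management IP address of the RA you are testing.

C:\> ping 10.10.10.11                    (step 1: the RA management IP address should reply)
C:\> plink -ssh boxmgmt@10.10.10.11      (step 2: log in as boxmgmt; if this login succeeds while the
                                          admin login and the site management IP address do not
                                          respond, the RA is probably not attached)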



Section 9<br />

Solving Server Problems<br />

This section lists symptoms that usually indicate problems with one or more servers.<br />

The problems listed in this section include hardware failure problems. Table 9–1 lists<br />

symptoms and possible problems indicated by the symptom. The problems and their<br />

solutions are described in this section. The graphics, behaviors, and examples in this<br />

section are similar to what you observe with your system but might differ in some<br />

details.<br />

In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />

for any of the possible problems or causes. Also, messages similar to e-mail notifications<br />

are displayed on the management console. If you do not see the messages, they might<br />

have already dropped off the display. Review the management console logs for<br />

messages that have dropped off the display.<br />

Table 9–1. Possible Server Problems with Symptoms

Symptoms:
• The management console shows a server down.
• Messages on the management console show the splitter is down and that the node fails over.
• Multipathing software (such as EMC PowerPath Administrator) messages report errors. (This symptom might occur if the server is unable to connect with the SAN or if the server HBA fails.)
Possible problem: Cluster node failure (hardware or software) in a geographic clustered environment, possibly resulting from
• Windows server reboot
• Unexpected server shutdown because of a bug check
• Server crash or restart
• Server unable to connect with SAN
• Server HBA failure

Symptom: Host logs and RA log timestamps are not synchronized.
Possible problem: Infrastructure (NTP) server failure

Symptom: Applications are down.
Possible problem: Server failure (hardware or software) in a geographic replication environment, possibly resulting from
• Windows server reboot
• Unexpected server shutdown because of a bug check
• Server crash or restart
• Server unable to connect with SAN
• Server HBA failure



Cluster Node Failure (Hardware or Software) in a Geographic Clustered Environment

Problem Description

MSCS uses several heartbeat mechanisms to detect whether a node is still actively responding to cluster activities. MSCS assumes a cluster node has failed when the cluster node no longer responds to heartbeats that are broadcast over the public/private cluster networks and when a SCSI reservation is lost on the quorum volume. Figure 9–1 illustrates this failure.

Figure 9–1. Cluster Node Failure

If the server that crashed was the MSCS leader (quorum owner), another cluster node (the challenger) tries to become leader and arbitrate for the quorum device. Because the failed server is no longer the quorum device owner in the reservation manager, the arbitration by the challenger instantly succeeds.

If the challenger node is from the same site as the failed server, arbitration instantly succeeds, and no failover of the quorum device to the remote site is required.

If the challenger node is from the remote site, the RA reverses the replication direction of the quorum consistency group. Once failover completes, the challenger arbitration is completed.

When a nonleader MSCS node fails, the data groups move to the remaining MSCS local<br />

or remote nodes, depending on preferred ownership settings. From the perspective of<br />

the RA, this situation is equivalent to a user-initiated move of the data groups. That is,<br />

the <strong>SafeGuard</strong> 30m Control resource on the node that tries to bring the group online<br />

sends a command to fail over the group to its site. If the group fails over to a cluster<br />

node on the same site, failover occurs instantly. Otherwise, a consistency group failover<br />

is initiated to the remote site. The <strong>SafeGuard</strong> 30m Control resource does not come<br />

online until the consistency group has completed failover.<br />
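After a node failure or a group move, you can confirm node states and current group ownership from any surviving cluster node with the MSCS command-line tool, as in the following sketch (the exact output columns vary by Windows version):

C:\> cluster node       (lists each cluster node and whether it is Up, Down, or Paused)
C:\> cluster group      (lists each cluster group, the node that currently owns it, and its state)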

Possible Subset Scenarios<br />

The symptoms of a server failure vary based on the reasons that the server went down.<br />

Five different scenarios are described as subsets of this type of failure:<br />

• Windows Server Reboot<br />

• Unexpected Server Shutdown Because of a Bug Check<br />

• Server Crash or Restart<br />

• Server Unable to Connect with SAN<br />

• Server HBA Failure<br />

One of the first things to determine in troubleshooting a server failure is whether the<br />

failure was an unexpected event (a “crash”) or an orderly event such as an operator<br />

reboot. When the server crashes, you usually see a “blue screen” and do not have<br />

access to messages. Once the server comes up again, then you can view messages<br />

regarding the reason it crashed. These messages help diagnose the reason for the initial<br />

shutdown or failure.<br />

In an orderly event, the Windows event log is stopped, and you can view events that<br />

point to the reason for the reboot or restart.<br />

Windows Server Reboot<br />

Problem Description<br />

The consistency groups fail over to another local node or to the other site because a<br />

server fails or goes down. In this scenario, the shutdown is an orderly event and thus<br />

causes the Windows event log service to stop.<br />


Symptoms<br />

The following symptoms might help you identify this failure:<br />

• The management console display shows a server failure similar to that shown in<br />

Figure 9–2.<br />

Figure 9–2. Management Console Display with Server Error<br />

• Warning and informational messages similar to those shown in Figure 9–3 appear on<br />

the management console when a server fails. See the table after the figure for an<br />

explanation of the numbered console messages.<br />


Figure 9–3. Management Console Messages for Server Down<br />



The following table explains the numbered messages shown in Figure 9–3.

Reference 1, event ID 5008: The source site reports that server USMV-CAS100P2 performed an orderly shutdown.
Reference 2, event ID 4062: The surviving site accesses the latest image of the consistency group during the failover.
Reference 3, event ID 5032: For each consistency group that moves to a surviving node, the splitter is again splitting.
Reference 4, event ID 4008: For each consistency group that moves to a surviving node, the transfer is paused. In the details of this message, the reason for the pause is given.
Reference 5, event ID 1008: The Unisys SafeGuard 30m Control resource successfully issued an initiate_failover command.
Reference 6, event ID 4086: For each consistency group that moves to a surviving node, data transfer starts and then a quick initialization starts.
Reference 7, event ID 4087: For each consistency group that moves to a surviving node, initialization completes.

To see the details of the messages listed on the management console display, you<br />

must collect the logs and then review the messages for the time of the failure.<br />

Appendix A explains how to collect the management console logs, and Appendix E<br />

lists the event IDs with explanations.<br />

• If you review the system event logs, you can find messages similar to the following<br />

examples that are based on the testing cases used to generate the previous<br />

management console images.<br />

System Event Log for Usmv-Cas100p2 Host (Failure Host on Site 1)<br />

6/01/2008 16:19:13 PM EventLog Information None 6006 N/A USMV-WEST2 The Event log<br />

service was stopped.<br />

6/01/2008 16:19:48 PM EventLog Information None 6009 N/A USMV-WEST2 Microsoft (R)<br />

Windows (R) 5.02. 3790 Service Pack 1 Multiprocessor Free.<br />

6/01/2008 16:19:48 PM EventLog Information None 6005 N/A USMV-USMV-WEST2. The Event<br />

log service was started.<br />


System Event Log for Usmv-x455 Host (Surviving Host on Site 2)<br />

6/01/2008 16:19:15 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />

communication with cluster node 'USMV-WEST2' on network '<strong>Public</strong>'.<br />

6/01/2008 16:19:15 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />

communication with cluster node 'USMV-WEST2' on network 'Private'.<br />

6/01/2008 16:19:56 PM ClusSvc Warning Node Mgr 1135 N/A USMV-EAST2 Cluster node<br />

USMV-WEST2 was removed from the active server cluster membership. Cluster service may have been<br />

stopped on the node, the node may have failed, or the node may have lost communication with the other<br />

active server cluster nodes.<br />

• If you review the cluster log, you can find messages similar to the following<br />

examples that are based on the failed node owning the quorum used to generate the<br />

previous management console images:<br />

Cluster Log for Usmv-West2 Host (Failure Host on Site 1)<br />

0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [GUM]GumUpdateRemoteNode: Failed to get<br />

completion status for async RPC call,status 1115.(Error 1115: A system shutdown is in progress)<br />

0000089c.00000a54::2008/05/25-10:31:42.107 ERR [GUM] GumSendUpdate: Update on node 2 failed<br />

with 1115 when it must succeed<br />

0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [GUM] GumpCommFailure 1115 communicating with node 2<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [NM] Banishing node 1 from active cluster membership.<br />

0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [RGP] Node 1: REGROUP WARNING: reload failed.<br />

0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [NM] Halting this node due to membership or<br />

communications error. Halt code = 1.<br />

0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [CS] Halting this node to prevent an inconsistency<br />

within the cluster. Error status = 5890. (Error 5890: An operation was attempted that is incompatible with<br />

the current membership state of the node)<br />

0000091c.00000fe4:: 2008/05/25-10:31:42.107 ERR [RM] LostQuorumResource, cluster service<br />

terminated...<br />

Cluster Log for Usmv-East2 Host (Surviving Host on Site 2)<br />

00000268.00000c38::2008/05/25-10:31:42.107 INFO [ClMsg] Received interface unreachable event for<br />

node 1 network 2<br />

00000268.00000c38::2008/05/25-10:31:42.107 INFO [ClMsg] Received interface unreachable event for<br />

node 1 network 1<br />

00000268.00000b70::2008/05/25-10:31:42.107 WARN [NM] Communication was lost with interface<br />

374359a2-5782-4b1d-a863-07f84f8c97d9 (node: USMV-WEST2, network: private)<br />

00000268.00000b70::2008/05/25-10:31:42.107 INFO [NM] Updating local connectivity info for network<br />

afe1f350-f66a-460a-a526-6f58987b911d.<br />

00000268.00000b70::2008/05/25-10:31:42.107 INFO [NM] Started state recalculation timer (2400ms) for<br />

network afe1f350-f66a-460a-a526-6f58987b911d (private)<br />

00000268.00000b70::2008/05/25-10:31:42.107 WARN [NM] Communication was lost with interface<br />

15b9fbe1-c05f-4e90-b937-17fdc27c133e (node: USMV-WEST2, network: public)<br />

00000268.00000b70::2008/05/25-10:31:42.107 INFO [NM] Updating local connectivity info for network<br />

9d905035-8105-4c87-a5bc-ce82e49e764a.<br />

00000268.00000b70::2008/05/25-10:31:42.107 INFO [NM] Started state recalculation timer (2400ms) for<br />

network 9d905035-8105-4c87-a5bc-ce82e49e764a (public)<br />

00000268.000005d0::2008/05/25-10:31:39.733 INFO [NM] We own the quorum resource..<br />


Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

• Check for event 5008 in the management console logs. If this event is replaced by<br />

event 5013, the host probably crashed. See “Unexpected Server Shutdown Because<br />

of a Bug Check” and “Server Crash or Restart.”<br />

• Review the cluster log and check for the system shutdown message as shown in<br />

the preceding examples. Determine whether the quorum resource moved by<br />

checking the surviving nodes for the message “We own the quorum resource.”<br />

• Review the Windows system event log messages and determine whether the server failure was a crash or an orderly event. (A command-line example follows this list.)<br />

In this case, based on the example messages, the Windows system event log<br />

shows that the system started the reboot or shutdown in an orderly manner at<br />

6:19:13 p.m. (message 6006). Because the event log service was shut down, the<br />

events that follow show that the event log service restarted.<br />

For an orderly event, often an operator shuts down the system for some planned<br />

reason.<br />

• If the event log messages do not point to an orderly event, then review<br />

“Unexpected Server Shutdown Because of a Bug Check” and “Server Crash or<br />

Restart” as possible scenarios that fit the circumstances.<br />
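As an aid to the event log review described in the preceding list, on hosts where Windows PowerShell is available you can pull the relevant System log entries with a query similar to the following sketch. The event IDs are the ones shown in the log examples in this section and the next; adjust the -Newest count to cover the failure window.

# List recent shutdown, startup, and Save Dump events (IDs taken from the examples in this guide)
Get-EventLog -LogName System -Newest 500 |
    Where-Object { 6005,6006,6008,6009,1001 -contains $_.EventID } |
    Select-Object TimeGenerated, EventID, Source, Message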

Unexpected Server Shutdown Because of a Bug Check<br />

Problem Description<br />

Symptoms<br />

The consistency groups fail over to another local node or to the other site because a<br />

server fails or shuts down unexpectedly and then reboots after the “blue screen” event.<br />

The following symptoms might help you identify this failure:<br />

• The management console display shows a server failure similar to that shown in<br />

Figure 9–2.<br />

• Warning and informational messages similar to those shown in Figure 9–4 appear on<br />

the management console when a server fails. See the table after the figure for an<br />

explanation of the numbered console messages.<br />


Figure 9–4. Management Console Messages for Server Down for Bug Check<br />



The following table explains the numbered messages shown in Figure 9–4.

Reference 1, event ID 5013: The splitter for the server USMV-WEST2 is down unexpectedly.
Reference 2, event ID 4008: For each consistency group, the transfer is paused at the source (down) site. In the details of this message, the reason for the pause is given.
Reference 3, event ID 5002: The splitter for server USMV-WEST2 is unable to access the RA unexpectedly.
Reference 4, event ID 4008: For each consistency group, the transfer is paused at the surviving site to allow a switchover. In the details of this message, the reason for the pause is given.
Reference 5, event ID 4062: The surviving site accesses the latest image of the consistency group during the failover.
Reference 6, event ID 5032: For each consistency group at the surviving site, the splitter is splitting to the replication volumes.
Reference 7, event ID 5002: The RA at the source (down) site cannot access the splitter for server USMV-WEST2.
Reference 8, event ID 4010: For each consistency group at the source site, the transfer is started.
Reference 9, event ID 4086: For each consistency group at the source site, data transfer starts and then initialization starts.
Reference 10, event ID 4087: For each consistency group at the source site, initialization completes.

To see the details of the messages listed on the management console display, you<br />

must collect the logs and then review the messages for the time of the failure.<br />

Appendix A explains how to collect the management console logs, and Appendix E<br />

lists the event IDs with explanations.<br />

• If you review the Windows system event logs after the system reboots, you can find<br />

messages similar to the following examples that are based on the testing cases<br />

used to generate the previous management console images.<br />



System Log for Usmv-West2 Host (Failure Host on Site 1)<br />


6/01/2008 18:12:42 PM EventLog Error None 6008 N/A USMV-WEST2 The previous system<br />

shutdown at 18:02:42 PM on 6/01/2008 was unexpected.<br />

6/01/2008 18:12:42 PM EventLog Information None 6009 N/A USMV-WEST2 Microsoft (R)<br />

Windows (R) 5.02. 3790 Service Pack 1 Multiprocessor Free.<br />

6/01/2008 18:12:42 PM EventLog Information None 6005 N/A USMV-WEST2 The Event log<br />

service was started.<br />

6/01/2008 18:12:42 PM Save Dump Information None 1001 N/A USMV-WEST2 The<br />

computer has rebooted from a bugcheck. The bugcheck was: 0x0000007e (0xffffffffc0000005,<br />

0xe000015f97c8a664, 0xe000015f9e52be68, 0xe000015f9e52afb0). A dump was saved in:<br />

C:\WINDOWS\MEMORY.DMP.<br />

System Log for Usmv-East2 Host (Surviving Host on Site 2)<br />

6/01/2008 18:02:42 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />

communication with cluster node 'USMV-WEST2' on network '<strong>Public</strong>'.<br />

6/01/2008 18:02:42 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />

communication with cluster node 'USMV-WEST2' on network 'Private'.<br />

6/01/2008 18:02:42 PM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting<br />

a bus reset for device\Device\ClusDisk0.<br />

6/01/2008 18:02:42 PM ClusSvc Warning Node Mgr 1135 N/A USMV-EAST2 Cluster node<br />

USMV-WEST2 was removed from the active server cluster membership. Cluster service may have been<br />

stopped on the node, the node may have failed, or the node may have lost communication with the other<br />

active server cluster nodes.<br />

• If you review the cluster log, you can find messages similar to the following<br />

examples that are based on the testing cases used to generate the previous<br />

management console images:<br />

Cluster Log for Usmv-West2 Host (Failure Host on Site 1)<br />

For this error situation, no entries appear in the cluster log.<br />

Cluster Log for Usmv-East2 Host (Surviving Host on Site 2)<br />

000007e0.00000138::2008/06/01-18:02:42.104 INFO [ClMsg] Received interface unreachable event for<br />

node 1 network 2<br />

000007e0.00000138:: 2008/06/01-18:02:42.104 INFO [ClMsg] Received interface unreachable event for<br />

node 1 network 1<br />

000007e0.00000124:: 2008/06/01-18:02:42.104 WARN [NM] Communication was lost with interface<br />

5019923b-d7a1-4886-825f-207b5938d11e (node: USMV-WEST2, network: <strong>Public</strong>)<br />

000007e0.00000124:: 2008/06/01-18:02:42.104 WARN [NM] Communication was lost with interface<br />

f409cf69-9c30-48f0-8519-ad5dd14c3300 (node: B)<br />

000001c0.00000664:: 2008/06/01-18:02:42.507 ERR Physical Disk : [DiskArb] GetPartInfo<br />

completed, status 170. (Error 170: the request resource is in use)<br />

000001c0.00000664:: 2008/06/01-18:02:42.507 ERR Physical Disk : [DiskArb] Failed to read<br />

(sector 12), error 170.<br />

000001c0.00000664:: 2008/06/01-18:02:42.507 INFO Physical Disk : [DiskArb] We are about to<br />

break reserve.<br />

000007e0.00000a0c:: 2008/06/01-18:02:42.881 INFO [NM] We own the quorum resource.<br />


Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

1. Review the Windows application event log messages to determine the cause of the<br />

unexpected event.<br />

In this case, based on the four example messages, the first Windows system event<br />

log shows event 6008 in which the system unexpectedly shut down; it was not a<br />

reboot.<br />

Then event 6009 is typically displayed as a reboot message. This event occurs<br />

regardless of the reason for the reboot. The same is true for event 6005.<br />

The Save Dump event 1001 shows that a memory dump was saved. Based on this<br />

message, consult the Microsoft Knowledge Base regarding bug checks.<br />

(http://support.microsoft.com/). Search for bug check 0x0000007e, or stop<br />

error 0x0000007e and replace the stop number with the one displayed.<br />

2. Once you have the appropriate Knowledge Base article from the Microsoft site,<br />

follow the recommendations in the article to resolve the issue.<br />

3. If the information from the Knowledge Base article does not resolve the<br />

problem, collect and save the memory dump file and then submit it to the Unisys<br />

<strong>Support</strong> Center.<br />

Server Crash or Restart<br />

Problem Description<br />

Symptoms<br />

When the server goes down for whatever reason and then restarts in a geographic<br />

clustered environment, the consistency groups fail over to the other site and then fail<br />

over to the original site once the server is restarted.<br />

The following symptoms might help you identify this failure:<br />

• The management console display shows a server failure similar to that shown in<br />

Figure 9–2.<br />

• Warnings and informational messages similar to those shown in Figure 9–4 appear<br />

on the management console when the server fails. See the table after that figure for<br />

an explanation of the numbered console messages.<br />

• If you review the Windows system event log, you can find messages similar to the<br />

following examples that are based on the testing cases used to generate the<br />

management console images for Figures 9–2 and 9–4:<br />



System Log for Usmv-West2 Host (Failure Host on Site 1)<br />


6/01/2008 18:42:39 PM EventLog Error None 6008 N/A USMV-WEST2 The previous system<br />

shutdown at 18:05:55 PM on 6/01/2008 was unexpected.<br />

6/01/2008 18:42:39 PM EventLog Information None 6009 N/A USMV-WEST2 Microsoft (R)<br />

Windows (R) 5.02. 3790 Service Pack 2 Multiprocessor Free.<br />

6/01/2008 18:42:39 PM EventLog Information None 6005 N/A USMV-WEST2 The Event log<br />

service was started.<br />

System Log for Usmv-East2 Host (Surviving Host on Site 2)<br />

6/01/2008 18:05:55 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />

communication with cluster node 'USMV-WEST2' on network '<strong>Public</strong>'.<br />

6/01/2008 18:05:55 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />

communication with cluster node 'USMV-WEST2' on network 'Private'.<br />

6/01/2008 18:05:55 PM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting<br />

a bus reset for device \Device\ClusDisk0.<br />

6/01/2008 18:05:55 PM ClusSvc Warning Node Mgr 1135 N/A USMV-EAST2 Cluster node<br />

USMV-WEST2 was removed from the active server cluster membership. Cluster service may have been<br />

stopped on the node, the node may have failed, or the node may have lost communication with the other<br />

active server cluster nodes.<br />

• If you review the cluster log, you can find messages similar to the following<br />

examples that are based on the testing cases used to generate the management<br />

console images for Figures 9–2 and 9–4:<br />

Cluster Log for Usmv-West2 Host (Failure Host on Site 1)<br />

For this error situation, no entries appear in the cluster log.<br />

Cluster Log for Usmv-East2 Host (Surviving Host on Site 2)<br />

000007e0.00000138::2008/06/01-18:05:55.102 INFO [ClMsg] Received interface unreachable event for<br />

node 1 network 2<br />

000007e0.00000138:: 2008/06/01-18:05:55.102 INFO [ClMsg] Received interface unreachable event for<br />

node 1 network 1<br />

000007e0.00000124:: 2008/06/01-18:05:55.102 WARN [NM] Communication was lost with interface<br />

5019923b-d7a1-4886-825f-207b5938d11e (node: USMV-WEST2, network: <strong>Public</strong>)<br />

000007e0.00000124:: 2008/06/01-18:05:55.102 WARN [NM] Communication was lost with interface<br />

f409cf69-9c30-48f0-8519-ad5dd14c3300 (node: USMV-WEST2, network: Private LAN)<br />

000001c0.00000168:: 2008/06/01-18:05:55.504 ERR Physical Disk : [DiskArb] GetPartInfo<br />

completed, status 170. (Error 170: the requested resource is in use)<br />

000001c0.00000168:: 2008/06/01-18:05:55.504 ERR Physical Disk : [DiskArb] Failed to read<br />

(sector 12), error 170.<br />

000001c0.00000168:: 2008/06/01-18:05:55.504 INFO Physical Disk : [DiskArb] We are about to<br />

break reserve.<br />

000007e0.00000764:: 2008/06/01-18:05:55.079 INFO [NM] We own the quorum resource.<br />


Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

1. Run the Microsoft Product <strong>Support</strong> MPS Report Utility to gather system information.<br />

(See “Using the MPS Report Utility” in Appendix A.)<br />

2. Submit the MPS report to the Unisys <strong>Support</strong> Center.<br />

Server Unable to Connect with SAN<br />

Problem Description<br />

Symptoms<br />

The server is unable to connect to the SAN.<br />

The following symptoms might help you identify this failure:<br />

• The management console display shows a server failure similar to that shown in<br />

Figure 9–5.<br />

Figure 9–5. Management Console Display Showing LA Site Server Down<br />

To display more information about the error, click on More in the right column. A<br />

message similar to the following is displayed:<br />

ERROR: Splitter USMV-WEST2 is down<br />

• Warnings and informational messages similar to those shown in Figure 9–6 appear<br />

on the management console when the server fails. See the table after the figure for<br />

an explanation of the numbered console messages.<br />


Figure 9–6. Management Console Images Showing Messages for Server Unable to<br />

Connect to SAN<br />

The following table explains the numbered messages in Figure 9–6.

Reference 1, event ID 5013: The splitter for the server USMV-WEST2 is down.
Reference 2, event ID 4008: For each consistency group at the failed site, the transfer is paused to allow a failover to the surviving site.
Reference 3, event ID 4008: For each consistency group, the transfer is paused at the surviving site to allow a failover. In the details of this message, the reason for the pause is given.
Reference 4, event ID 5002: The splitter for the server USMV-WEST2 is unable to access the RA.
Reference 5, event ID 4010: The consistency groups on the original failed site start data transfer.
Reference 6, event ID 4086: For each consistency group at the failed site, data transfer starts and then initialization starts.
Reference 7, event ID 4087: For each consistency group at the failed site, data transfer completes.

• The multipathing software (EMC PowerPath Administrator) flashes a red X on the<br />

right side of the toolbar.<br />


• The PowerPath Administrator Console reports failures similar to those shown in<br />

Figure 9–7.<br />

Figure 9–7. PowerPath Administrator Console Showing Failures<br />

• If you review the server system event log, you can find error messages similar to the<br />

following examples that are based on the testing cases used to generate the<br />

previous management console images.<br />

Type : warning<br />

Source : Ftdisk<br />

EventID : 57<br />

Description : The system failed to flush data to the transaction log. Corruption may occur.<br />

Type : error<br />

Source : Emcpbase<br />

EventID : 100<br />

Description : Path Bus x Tgt y LUN z to APMxxxx is dead<br />

The event 100 will appear numerous times for each bus, target and LUN.<br />

Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

1. At the server, run a tool such as the PowerPath Administrator that might aid in diagnosing the problem (see the path-check example after these steps).<br />

2. Log in to the storage software and determine whether problems are reported. If so,<br />

use the information for that software to correct the problems.<br />

Something might have happened to the volume, or the zoning configuration on the<br />

switch might have been changed. Also, a connection issue could exist such as a<br />

fabric switch or storage cable failure.<br />


3. If the problem is not limited to one server, run the Installation Manager Fibre<br />

Channel diagnostics. Appendix C explains how to run the Installation Manager<br />

diagnostics and provides information about the various diagnostic capabilities.<br />

4. If the problem still appears at the host, an adapter with multiple ports might have<br />

failed. Replace the Fibre Channel adapter in the host if the storage, zoning, and<br />

cabling appear correct. Ensure that the storage and zoning are corrected to use the<br />

new WWN as necessary. (See “Server HBA Failure” for resolution actions.)<br />
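For step 1, if EMC PowerPath is the multipathing software in use, its command-line utility can summarize path health, as in the following sketch (device naming depends on the attached storage array):

C:\> powermt display              (summary of HBAs and live/dead path counts per storage array)
C:\> powermt display dev=all      (per-device path listing; failed paths are reported as dead)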

Server HBA Failure<br />

Problem Description<br />

Symptoms<br />

One HBA in the server failed on a host that has multiple paths to storage.<br />

The following symptoms might help you identify this failure:<br />

• The multipathing software (such as EMC PowerPath Administrator) flashes a red X<br />

on the right side of the toolbar.<br />

• The PowerPath Administrator console reports failures similar to those shown in<br />

Figure 9–8.<br />

Figure 9–8. PowerPath Administrator Console Showing Adapter Failure<br />


• If you review the server system event log, you can find error messages similar to the<br />

following example:<br />

Type : error<br />
Source : Emcpbase<br />
EventID : 100<br />
Description : Path Bus x Tgt y LUN z to APMxxxx is dead<br />

The event 100 will appear numerous times for each target and LUN.<br />

Actions to Resolve the Problem<br />

To replace an HBA in the server, perform the following steps:<br />

1. Run Emulex HBAnywhere and record the WWNs in use by the server (see the example after these steps).<br />

2. Shut down the server.<br />

3. Replace the failed HBA and then boot the server.<br />

4. Run Emulex HBAnywhere and record the new WWN.<br />

5. Using the SAN switch management interface, modify the zoning as needed to replace the failed WWN with the new WWN.<br />

6. If manual discovery was used for the storage, update the configuration to use the<br />

new WWN.<br />
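For steps 1 and 4, if the HBAnyware command-line utility (hbacmd) is installed along with the Emulex GUI, you can capture the WWNs from a command prompt; this assumes the CLI component is present, and the GUI shows the same values.

C:\> hbacmd listhbas      (lists each Emulex HBA with its Port WWN and Node WWN; record these values
                           before the replacement and again afterward)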

Infrastructure (NTP) Server Failure<br />

Problem Description<br />

Symptoms<br />

The replication environment is not affected by an NTP server failure. Timestamps of log<br />

entries are affected.<br />

The following symptoms might help you identify the failure:<br />

• When comparing log entries of a failover, the host application log and the<br />

management console entries are not synchronized.<br />

• You are unable to run the synchronization diagnostics as described in the Unisys<br />

<strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Installation <strong>Guide</strong>.<br />

Actions to Resolve the Problem<br />

To resolve an NTP server failure, perform the following steps:<br />

1. Temporarily change the cluster mode for a data consistency group to MSCS<br />

manual (for a group replicating from the source site to the target site).<br />

2. Perform a move-group operation on a cluster group that contains a Unisys <strong>SafeGuard</strong><br />

Control resource to a node at the target site.<br />

3. View the management console log for event 1009 as shown in Figure 9–9.<br />



Figure 9–9. Event 1009 Display

4. View the host application event log for event 1115, as follows:

Event Type : Warning
Event Source : 30mControl
Event Category : None
Event ID : 1115
Date : 9/10/2006
Time : 12:09:04 PM
User : N/A
Computer : USMV-EAST2
Description:
Online resource failed.
Resource name: Data1
RA CLI command: UCLI -ssh -l plugin -pw **** -superbatch 172.16.7.60 initiate_failover group=Data1 active_site=East cluster_owner=USMV-EAST2
Group is not a MSCS auto-data group (5).
Action: Verify through the Management Console that the Global cluster mode is set to MSCS auto-data. Or, if doing manual recovery, ensure an image has been selected.

5. Compare the timestamps.

If the time between the timestamps is not within a couple of minutes, the host and RAs are not synchronized.

6. Use the Installation Manager site connectivity IP diagnostic by performing the following steps. (For more information, see Appendix C.)

a. Log in to an RA as user boxmgmt with the password boxmgmt.

b. On the Main Menu, type 3 (Diagnostics) and press Enter.

c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.



d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.<br />

e. When asked to select a target for the tests, type 5 (Other host) and press<br />

Enter.<br />

f. Enter the IP address for the NTP server that you want to test.<br />

Note: In step e, you must specify 5 (Other host) rather than 4 (NTP Server). This choice is necessary because site 2 does not specify an NTP server in the configuration, and the test will fail if you use 4 (NTP Server).<br />

7. If the NTP server fails, check that the NTP service on the NTP server is functioning correctly (see the example after these steps).<br />

8. Use the Installation Manager port diagnostics IP diagnostic to ensure that no ports<br />

are blocked. (For more information about running port diagnostics, see Appendix C.)<br />

9. Check that the NTP server specified for the host is the same NTP server specified<br />

for the RAs at site 1. (If you want to view the RA configuration settings, use the<br />

Installation Manager Setup View capability. For information about that capability,<br />

refer to the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Installation <strong>Guide</strong>.)<br />

10. Repeat steps 1 through 5, choosing a group that will move from the target site to the source site.<br />
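For step 7, you can check both the configured time source and the clock offset from the NTP server directly from a Windows host, as in the following sketch. The IP address is a placeholder for your NTP server.

C:\> net time /querysntp                                  (shows the SNTP/NTP server configured on this host)
C:\> w32tm /stripchart /computer:10.10.10.5 /samples:3    (shows the clock offset between this host and the NTP server)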

Server Failure (Hardware or Software) in a<br />

Geographic Replication Environment<br />

Problem Description<br />

When a server goes down in a geographic replication environment, the circumstances<br />

and Windows event log messages are similar to those for the server failure in a<br />

geographic clustered environment. That is, the five subset scenarios previously<br />

presented apply as far as the event log messages and actions to resolve are concerned.<br />

The primary difference is that the main symptom of the server failure in this environment<br />

is that the user applications fail.<br />

Refer to the previous five subset scenarios for more details.<br />



Section 10<br />

Solving Performance Problems<br />

This section lists symptoms that usually indicate performance problems. Table 10–1 lists<br />

symptoms and possible problems indicated by the symptom. The problems and their<br />

solutions are described in this section. This section also includes a general discussion of high-load events. The graphics, behaviors, and examples in this section are similar to what

you observe with your system but might differ in some details.<br />

The management console provides graphs that you can use to evaluate performance.<br />

For more information, see the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance<br />

Administrator’s <strong>Guide</strong>.<br />

In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />

for the possible problems. Also, messages similar to e-mail notifications are displayed on<br />

the management console. If you do not see the messages, they might have already<br />

dropped off the display. Review the management console logs for messages that have<br />

dropped off the display.<br />

Table 10–1. Possible Performance Problems with Symptoms

Symptom: The initialization progression indicator (%) in the management interface progresses significantly slower than expected. Initialization completes after a significantly longer period of time than expected.
Possible Problem: Slow initialization

Symptom: The event log indicates that the disk manager has reported high load conditions for a specific consistency group or groups. A consistency group or groups start to initialize. This initialization can occur once or multiple times, depending on the circumstances.
Possible Problem: High-load (disk manager)

Symptom: The event log indicates that the distributor has reported high load conditions for a specific consistency group or groups. A consistency group or groups start to initialize. This initialization can occur once or multiple times, depending on the circumstances.
Possible Problem: High load (distributor)

Symptom: Applications are offline for a lengthy period during changes in the replication direction.
Possible Problem: Failover time lengthens

Slow Initialization

Problem Description

Initialization of a consistency group or groups takes longer than expected.<br />

Progression of initialization is reported through the management console in percentages.<br />

You might notice that the percentage for a group has not progressed in a long time or<br />

progresses at a slow rate. This progression might or might not be normal depending on<br />

several factors.<br />

For some groups, it might be natural to take a long time to advance to the next<br />

percentage. One percent of 10 TB is much larger than one percent of 100 GB; therefore,<br />

larger groups would take longer to advance in initialization.<br />

Symptoms

The following symptoms might help you identify this failure:

• The initialization progression indicator (%) in the management interface progresses<br />

significantly slower than expected.<br />

• Initialization completes after a significantly longer period of time than expected.<br />



Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />


• Verify the bandwidth of the connection between sites using the Installation Manager<br />

network diagnostic tools to test the WAN speed while there is no traffic over the<br />

WAN. Appendix C explains how to run these diagnostics.<br />

• Use the Installation Manager Fibre Channel diagnostic tools or customer<br />

storage/SAN diagnostic tools to test the performance of the source and target<br />

storage LUNs to ensure that all storage LUNs are capable of handling the observed<br />

load. Appendix C explains how to run the Installation Manager diagnostics.<br />

If storage performance on either site is poor, the replication system could be limited<br />

in its ability to read from the replication volumes on the source site or to write to the<br />

journal volume on the remote site. Poor storage performance reduces the maximum<br />

speed at which the RAs can initialize.<br />

• Verify that no bandwidth limitation is set in the properties of the relevant group or groups.

• Use the event log to verify that no other events occurred during initialization—for<br />

example, high load conditions, WAN disconnections, or storage disconnections—that<br />

could have caused the initialization to restart.<br />

• Diagnosis of these types of problems is usually specific to the environment. Collect<br />

RA logs and submit a service request to Unisys support if the cause of slow<br />

initialization cannot be determined through the actions given above. See Appendix A<br />

for information about collecting logs.<br />

General Description of High-Load Event<br />

A high-load event reports that, at the time of the event, a bottleneck existed in the<br />

replication process. To keep track of the changes being made during the bottleneck, the<br />

replication goes into “marking mode” and records the location of all changed data on the<br />

source replication volume until the activity causing the bottleneck has subsided.<br />

The three possible points at which a bottleneck might occur are<br />

• Between the host and RA—Disk Manager<br />

Of the three points, this one is the least common cause of a bottleneck. This type of bottleneck occurs when the host is writing to the storage device faster than the RA can handle.

• The WAN<br />

This type of bottleneck occurs when the host is writing to the storage device faster<br />

than the RAs can replicate over the available bandwidth. For example, a host is<br />

writing to the storage device during peak hours at a rate of 60 Mbps. The RAs<br />

compress this data down to 15 Mbps. The available bandwidth is 10 Mbps. Clearly,<br />

during peak hours, the bandwidth is not sufficient to support the write rate; therefore, a number of high-load events occur during peak hours. (A worked version of this calculation appears after this list.)


• The remote storage—Distributor<br />

This type of bottleneck occurs when the storage device containing the journal<br />

volume on the remote site cannot keep up with the speed that the data is being<br />

replicated to the remote site. To avoid this situation, configure the journal volume on<br />

the fastest possible LUNs using the fastest RAID and the most disk spindles. Also,<br />

use multiple journal volumes located on different physical disks in the storage array<br />

or use separate disk subsystems in the same consistency group so that the<br />

replication can perform an additional layer of striping. The replication stripes the<br />

images across these multiple journal volumes.<br />
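As a rough check for the WAN case, compare the compressed write rate with the available bandwidth. Using the illustrative figures from the WAN bottleneck example above:

required WAN bandwidth = peak write rate ÷ compression ratio
                       = 60 Mbps ÷ 4
                       = 15 Mbps, which exceeds the 10 Mbps available

Whenever the result exceeds the available bandwidth, high-load events during those periods are expected rather than a sign of a fault.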

High-Load (Disk Manager) Condition<br />

Problem Description

The disk manager reports high-load conditions.

Symptoms

The following symptoms might help you identify this failure:

• The event log indicates that the disk manager reported high load conditions for a<br />

specific consistency group or groups (event ID 4019).<br />

• A consistency group or groups start to initialize. This initialization can occur once or<br />

multiple times, depending on the circumstances.<br />

Actions to Resolve<br />

Perform the following actions to isolate and resolve the problem:<br />

• Use the Installation Manager network diagnostic tools to test the WAN speed while<br />

there is no traffic over the WAN. Appendix C explains how to run these diagnostics.<br />

• Analyze the performance data for the consistency groups on the RA to ensure that<br />

the incoming write rate is not outside the limits of the available bandwidth or the<br />

capabilities of the RA.<br />

• High loads can occur naturally during traffic peaks or during periods of high external<br />

activity on the WAN. If the high load events occur infrequently or can be associated<br />

with a temporal peak, consider this behavior as normal.<br />

• Diagnosis of these types of problems is usually specific to the environment. Collect<br />

RA logs and submit a service request to the Unisys <strong>Support</strong> Center if the high load<br />

events occur frequently and you cannot resolve the problem through the actions<br />

previously listed. See Appendix A for information about collecting logs.<br />



High-Load (Distributor) Condition<br />

Problem Description

The distributor reports high-load conditions.

Symptoms

The following symptoms might help you identify this failure:

• The event log indicates that the distributor reported high load conditions for a<br />

specific consistency group or groups.<br />

• A consistency group or groups start to initialize. This initialization can occur once or<br />

multiple times, depending on the circumstances.<br />

Actions to Resolve the Problem<br />

Perform the following actions to isolate and resolve the problem:<br />

• Use the Installation Manager Fibre Channel diagnostic tools or customer storage or<br />

SAN diagnostic tools to test the performance of the target-site storage LUNs.<br />

Appendix C explains how to run the Installation Manager diagnostics.<br />

• Analyze the WAN performance of the consistency group or groups, and ensure that<br />

loads are not too high for handling by the target-site storage devices.<br />

• High loads can occur naturally during traffic peaks. If the high-load events occur<br />

infrequently or can be associated with a temporal peak, consider this behavior as<br />

normal.<br />

• Diagnosis of these types of problems is usually specific to the environment. Collect<br />

RA logs and submit a service request to the Unisys <strong>Support</strong> Center if the high-load<br />

events occur frequently and you cannot resolve the problem through the actions<br />

previously listed. See Appendix A for information about collecting logs.<br />

Failover Time Lengthens<br />

Problem Description

Prior to changing the replication direction, the images must be distributed to the target-site volumes. The applications are not available during this process.

Symptoms

Applications are offline for a lengthy period during changes to the replication direction.

Actions to Resolve the Problem<br />

Refer to the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Planning and Installation <strong>Guide</strong> for more<br />

information on pending timeouts.<br />



Appendix A<br />

Collecting and Using Logs<br />

Whenever a failure occurs, you might need to collect and analyze log information to<br />

assist in diagnosing the problem. This appendix presents information on the following<br />

tasks:<br />

• Collecting RA logs<br />

• Collecting server (host) logs<br />

• Analyzing RA log collection files<br />

• Analyzing server (host) logs<br />

• Analyzing intelligent fabric switch logs<br />

Collecting RA Logs<br />

When you collect logs from one RA, you automatically collect logs from all other RAs and<br />

from the servers. Occasionally, you might need to collect logs from the servers (hosts)<br />

manually. Refer to “Collecting Server (Host) Logs” later in this appendix for more<br />

information.<br />

Each time you complete a log collection, the files are saved for a maximum of 7 days.<br />

The length of time the files remain available depends on the size and number of log<br />

collections performed. To ensure that you have the log files that you need, download and<br />

store the files locally. Log files with dates older than 7 days from the current date are<br />

automatically removed.<br />

To collect the RA logs, perform the following procedures:<br />

1. Set the Automatic Host Info Collection option<br />

2. Test FTP connectivity<br />

3. Determine when the failure occurred<br />

4. Convert local time to GMT or UTC<br />

5. Collect logs from the RA<br />


Setting the Automatic Host Info Collection Option<br />

Perform the following steps to set the Automatic Host Info Collection Option:<br />

1. In the Management Console, select System Settings on the System menu.

The System Settings page appears.<br />

2. Choose the Automatic Host Info Collection option from Miscellaneous<br />

Settings.<br />

For more information, refer to the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Planning and Installation<br />

<strong>Guide</strong>.<br />

Testing FTP Connectivity<br />

To test FTP connectivity, perform the following steps on the management PC. The<br />

information you provide depends on whether logs are being collected locally on an FTP<br />

server or sent to an FTP server at the Unisys Product <strong>Support</strong> site.<br />

1. To initiate an FTP session, type FTP at a command prompt. Press Enter.<br />

2. Type Open. Press Enter.<br />

3. At the To prompt, enter one of the following and then press Enter:<br />

• ftp.ess.unisys.com (the Unisys FTP address)<br />

• Your local FTP server IP address<br />

4. At the User prompt, enter one of the following and then press Enter:<br />

• FTP, if you specified the Unisys FTP address<br />

• Your local FTP user account<br />

5. At the Password prompt, enter one of the following and then press Enter:<br />

• Your Internet e-mail address if you specified the Unisys FTP address<br />

• Your local FTP account password<br />

6. Type bye and press Enter to log out.<br />
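For example, a test session against the Unisys FTP server might look similar to the following (when using a local FTP server, substitute its address and your local account):

C:\> ftp
ftp> open
To ftp.ess.unisys.com
User (ftp.ess.unisys.com:(none)): FTP
Password: <your Internet e-mail address>
ftp> bye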

Determining When the Failure Occurred<br />

Perform the following steps to determine when the failure occurred:<br />

Note: If you cannot determine the failure time from the RA logs, use the Windows<br />

event logs on each server (host) to determine the failure time.<br />

1. Select the Logs tab from the navigation pane in the Management Console.<br />

A list of events is displayed. Each event entry includes a Level column that indicates<br />

the severity of the event.<br />

If necessary, click View and select Detailed.<br />

2. Scan the Description column to find the event for which you want to gather logs.<br />


3. Select the event and click the Filter Log option.<br />

The Filter Log dialog box appears.<br />

4. Select any option from the scope list (normal, detailed, advanced) and the level list (info, warning, error).

5. Write down the timestamp that is displayed for the event, along with the local time zone. You must convert the displayed time to GMT—also called Coordinated Universal Time (UTC).

This timestamp is used to determine the start and end dates and times for log collection.

6. Click OK.<br />

Converting Local Time to GMT or UTC<br />

Perform the following steps to convert the time in which the failure occurred to GMT or<br />

UTC. You need the time zone you wrote down in the preceding procedure.<br />

1. In Windows Control Panel, click Date and Time.<br />

2. Select the Time Zone tab.<br />

3. Look in the list for the GMT or UTC offset value corresponding to the time zone you<br />

wrote down in the procedure “Determining When the Failure Occurred.” The offset<br />

value represents the number of hours that the time zone is ahead or behind GMT or<br />

UTC.<br />

4. Subtract the GMT or UTC offset value from the local time to obtain the GMT or UTC time. (A negative offset means the hours are added to the local time.)

Example<br />

If the time zone is Pacific Standard Time, the GMT or UTC offset value is –8:00. If the<br />

time in which the failure occurred is 13:30, then GMT or UTC is 21:30.<br />

Collecting RA Logs<br />

Use the Installation Manager, which is a centralized collection tool, to collect logs from<br />

all accessible RAs, servers (hosts), and intelligent fabric switches.<br />

Before you begin log collection, determine the failure date and time. If you have SANTap<br />

switches and want to collect information from the switches, know the user name and<br />

password to access the switches.<br />

To collect RA logs, perform the following steps:<br />

1. Start the SSH client by performing the steps in “Using the SSH Client” in<br />

Appendix C. Use the site management IP address; log in with boxmgmt as the login<br />

user name and boxmgmt as the password.<br />

2. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />

3. On the Diagnostics menu, type 4 (Collect system info) and press Enter.<br />


4. When prompted, provide the following information. Press Enter after each item.<br />

(The program displays date and time in GMT/UTC format.)<br />

a. Start date: This date specifies how far back the log collection is to start. Use<br />

the MM/DD/YYYY format. Do not accept the default date; the date should be at<br />

least 2 days earlier than the current date. The collection period must include the date and time at which the failure occurred.

b. Start time: This time specifies the GMT/UTC in which log collection is to start.<br />

Use the HH:MM:SS format.<br />

c. End date: This date specifies when log collection is to end. Accept the default<br />

date, which is the current date.<br />

d. End time: This time specifies when log collection is to end. Accept the default<br />

time, which is the current time.<br />

5. Type y to collect information from the other site.<br />

6. Type y or n, and press Enter when asked about sending the results to an FTP<br />

server.<br />

If you choose not to send the results to an FTP server, skip to step 8. The results are<br />

stored at the URL http://<IP address>/info/. You can access the collected results by logging in with webdownload as the log-in name and

webdownload as the password. (If your system is set for secure Web<br />

transactions, then the URL begins with https://.)<br />

If you choose to send the results to an FTP server and the procedure has been<br />

performed previously, all of the information is filled in. If not, provide the following<br />

information for the management PC:<br />

a. When prompted for the FTP server, type one of the following and then press<br />

Enter.<br />

• The IP address of the Unisys Product <strong>Support</strong> FTP server, 192.61.61.78, or<br />

ftp.ess.unisys.com<br />

• The IP address of your local FTP server<br />

b. Press Enter to accept the default FTP port number, or type a different port<br />

number if you are using a management PC with a nonstandard port number.<br />

c. Type the local user account when prompted for the FTP user name. Press<br />

Enter.<br />

d. If you are using the Unisys FTP server, type incoming as the folder name of<br />

the FTP location in which to store the collected information. Press Enter.<br />

If you are using a local FTP server, press Enter for none.<br />


e. Type a name for the file on the FTP server in the following format:

<file name>.tar

Example: 19557111_Company1.tar

Note: If no name is specified, the name will be similar to the following:

sysInfo-<RA identifiers>-hosts-from-<RA identifiers>-<date and time>.tar

Example: sysInfo-l1-l2-r1-r2-hosts-from-l1-r1-2006.08.17.16.28.31.tar

f. Type the appropriate password. Press Enter.<br />

7. On the Collection mode menu, type 3 (RAs and hosts) and press Enter.<br />

Note: The “hosts” part of this menu selection (RAs and hosts) collects intelligent<br />

fabric switch information.<br />

8. Type y or n, and press Enter when asked if you have SANTap switches from which<br />

you want to collect information.<br />

If you do not have SANTap switches, go to step 10.<br />

If you want to collect information from SANTap switches, enter the user name and<br />

password to access the switch when prompted.<br />

9. Type n if prompted on whether to perform a full collection, unless otherwise<br />

instructed by a Unisys service representative.<br />

10. Type n when prompted to limit collection time.<br />

The collection program checks connectivity to all RAs and then displays a list of the<br />

available hosts and SANTap switches from which to collect information.<br />

11. Type All and press Enter.<br />

The Installation Manager shows the collection progress and reports that it<br />

successfully collected data. This collection might take several minutes. Once the<br />

data collection completes, a message indicates that the collected information is available at the FTP server you specified or at the URL (http://<IP address>/info/ or https://<IP address>/info/).

12. Press Enter.<br />

13. On the Diagnostics dialog box, type Q and press Enter to exit the program.<br />

14. Type Y when prompted to quit and press Enter.<br />
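For example, if the failure occurred at 21:30 GMT/UTC on 9/10/2006 and logs are collected on 9/11/2006, entries similar to the following would satisfy the guidelines in step 4 (the values are illustrative only):

Start date: 09/08/2006
Start time: 00:00:00
End date: (accept the default, the current date)
End time: (accept the default, the current time)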

Verifying the Results<br />

• Ensure that “Failed for hosts” has no entries. The success or failure entries might be<br />

listed multiple times.<br />

For the collection to be successful for hosts and intelligent fabric switches, all entries<br />

must indicate “Succeeded for hosts.”<br />

For the collection to be successful for RAs, all entries must indicate “Collected data from <RA name>.”


• There is a 20-minute timeout on the collection process for RAs. There is a 15-minute<br />

timeout on the collection process for each host.<br />

• If the collection from the remote site failed because of a WAN failure, run the<br />

process locally at the remote site.<br />

• If the connection with an RA is lost while the collection is in process, no<br />

information is collected. Run the process again.<br />

• If you transferred the data by FTP to a management PC, you can transfer the<br />

collected data to the Unisys Product <strong>Support</strong> Web site at your convenience.<br />

Otherwise, if you are connected to the Unisys Product <strong>Support</strong> Web site, the<br />

collected data is transferred automatically to this Web site.<br />

• If you use the Web interface, you must download the collected data to the<br />

management PC and then transfer the collected data to the Unisys Product <strong>Support</strong><br />

Web site at your convenience.<br />

Collecting Server (Host) Logs<br />

Use the following utilities to collect log information:<br />

• MPS Report Utility<br />

• Host information collector (HIC) utility<br />

Using the MPS Report Utility<br />

Use the Microsoft MPS Report Utility to collect detailed information about the current<br />

host configuration. You must have administrative rights to run this utility.<br />

Unisys uses the cluster (MSCS) version of this utility if that version is available from<br />

Microsoft. This version of the utility enables you to gather cluster information as well as<br />

the standard Microsoft information. If the server is not clustered, the utility still runs, but<br />

the cluster files in the output are blank.<br />

The average time for the utility to complete is between 5 and 20 minutes. It might take<br />

longer if you run the utility during peak production time.<br />

You can download the MPS Report Utility from the Unisys FTP server at the following<br />

location: (You are not prompted for a username or password.)<br />

ftp://ftp.ntsupport.unisys.com/outbound/MPS-REPORTS/<br />

Select one of the following directories, depending on your operating system<br />

environment:<br />

• 32-BIT<br />

• 64-BIT-IA64<br />

• 64-BIT-X64 (not a clustered version)<br />


Output Files<br />

Individual output files are created by using the following directory structure. Depending<br />

on the MPS Report version, the file name and directory name might vary.<br />

Directory: %systemroot%\MPSReports, typically C:\windows\MPSReports

File name: %COMPUTERNAME%_MPSReports_xxx.CAB<br />
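After the utility finishes, one way to locate the generated CAB file from a command prompt is shown below (assuming the default directory given above):

dir /s /b "%systemroot%\MPSReports\*.CAB"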

Using the Host Information Collector (HIC) Utility<br />

Note: You can skip this procedure unless directed to complete it by Unisys support personnel. Host log collection occurs automatically if the Automatic Host Info Collection

option on the System menu of the management console is selected.<br />

Perform the following steps to collect log information from the hosts:<br />

1. At the command prompt on the host, change to the appropriate directory depending<br />

on your system:<br />

• For 32-bit and Intel Itanium 2-based systems, enter<br />

cd C:\Program Files\KDriver\hic<br />

• For x64 systems, enter<br />

cd C:\Program Files (x86)\KDriver\hic<br />

2. Type one of the following commands:<br />

• host_info_collector –n (noninteractive mode)<br />

• host_info_collector (interactive mode)<br />

If you choose the interactive mode command, provide the following site information:<br />

• Account ID: Click System Settings on the System menu of the<br />

Management Console, and click on Account Settings in the System<br />

Settings dialog box to access this information.<br />

• Account name: The name of the customer who purchased the Unisys <strong>SafeGuard</strong><br />

30m solution.<br />

• Contact name: The name of the person responsible for collecting logs.<br />

• Contact mail: The mail account of the person responsible for collecting logs.<br />

Note: Ignore messages about utilities that are not installed.<br />
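For example, a noninteractive collection on an x64 host might look similar to the following:

C:\> cd "C:\Program Files (x86)\KDriver\hic"
C:\Program Files (x86)\KDriver\hic> host_info_collector -n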


Verifying the Results<br />

• The process generates a single tar file of the host logs in the gzip format.<br />

• On 32-bit and Intel Itanium 2-based systems, the host logs are located in the<br />

following directory:<br />

C:\Program Files\KDriver\hic<br />

• On x64 systems, the host logs are located in the following directory:

C:\Program Files (x86)\KDriver\hic<br />
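To confirm that the collection produced an archive, you can list the directory from a command prompt; for example, on an x64 host:

dir "C:\Program Files (x86)\KDriver\hic\*.tar.gz"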

Analyzing RA Log Collection Files<br />

If you use the Installation Manager RA log collection process, logs are collected from all<br />

accessible RAs and servers (hosts). When the tar file is extracted using this process, the<br />

information is gathered in a file on the FTP server that is, by default, named with the<br />

following format:<br />

sysInfo-<RA identifiers>-hosts-from-<RA identifiers>-<date and time>.tar

The <date and time> is in the format yyyy.mm.dd.hh.mm.ss.

An example of such a file name is<br />

sysInfo-lr-l2-r1-r2-hosts-from-l1-r1-2007.09.07.17.37.39.tar<br />

For each RA on which logs were collected, directories are created with the following<br />

formats:<br />

extracted.<RA identifier>.<date and time>

HLR-<RA identifier>-<date and time>

The <date and time> is in the format yyyy.mm.dd.hh.mm.ss.

An example of the name of an extracted directory for the RA is<br />

extracted.l1.2007.06.05.19.25.03 (from left RA 1 on June 5, 2007 at 19:25:03)<br />

In the RA identifier information, the l1 to l8 and r1 to r8 designations refer to RAs at the left and right sites. That is, site 1 RAs 1 through 8 are designated with l, and site 2 RAs 1

through 8 are designated with r.<br />

If the RA collected a host log, the host information is collected in a directory beginning<br />

with HLR. For example, HLR-r1-2007.06.05.19.25.03 is the directory from right (site 2)<br />

RA1 on June 5, 2007 at 19:25:03.<br />

This directory is described in “Host Log Extraction Directory” later in this appendix.<br />



RA Log Extraction Directory<br />


Several files and directories are placed inside the extracted directory for the RA:<br />

• parameters: file containing the time frame for the collection<br />

• CLI: file containing the output collected by running CLI commands

• aiw: file containing the internal log of the system, which is used by third-level<br />

support<br />

• aiq: file containing the internal log of the system, which is used by third-level support<br />

• cm_cli: internal file used by third-level support<br />

• init_hl: internal file used by third-level support<br />

• kbox_status: file used by third-level support<br />

• unfinished_init_hl: file used by third-level support<br />

• log: file containing the log of the collection process itself (used only by third-level<br />

support)<br />

• summary: file containing a summary of the main events from the internal logs of the<br />

system, which is used by third-level support<br />

• files: directory containing the original directories from the appliance<br />

• processes: directory containing some internal information from the system such as<br />

network configuration, processes state, and so forth<br />

• tmp: temporary directory<br />

Of the preceding items, you should understand the time frame of the collection from the<br />

parameters file and focus on the CLI file information. To determine whether the logs<br />

were correctly collected, check that the time frame of the collection correlates with the<br />

time of the issue, and verify that logs were collected from all nodes.<br />

Root-Level Files<br />

Several files are saved at the root level of the extracted directory: parameters file, CLI<br />

file, aiw file, aiq file, cm_cli file, init_hl file, kbox_status file, unfinished_init_hl file, log file,<br />

and summary file.<br />

Parameters File<br />

The parameters file contains the parameters given to the log gathering tool. Those<br />

parameters set the time frame for the log collection and are reflected in the parameters<br />

file. The format for the date is yyyy/mm/dd.<br />

The following example illustrates the contents of a parameters file:<br />

only_connectivity=”0”<br />

min=”2007/08/03 16:25:02”<br />

max=”2007/08/04 19:25:02”<br />

withCores=”1”<br />


The value ”0” for only_connectivity in the parameters file is a standard value for logs.<br />

The value “1” for withCores means that core logs (long) were collected for the time<br />

displayed.<br />

CLI File<br />

The CLI file contains the output from executing various CLI commands. The commands issued to produce the information are saved to the runCLI file in the tmp directory. Usually

executing CLI commands in the process of collecting logs produces volumes of output.<br />

The types of information that are contained in the CLI file are as follows:<br />

• Account settings and license<br />

• Alert settings<br />

• Box states<br />

• Consistency groups, settings, and state<br />

• Consistency group statistics<br />

• Site name<br />

• Splitters<br />

• Management console logs for the period collected<br />

• Global accumulators (used by third-level support)<br />

• Various settings and system statistics<br />

• Save_settings command output<br />

• Splitters settings and state<br />

• Volumes settings and state<br />

• Available images<br />

The commands used to collect the output are listed in the runCLI file, described later in

this appendix.<br />

Log File<br />

This file contains a report of the log collection that executed. It shows the start and stop<br />

time for the log.<br />

If there is a problem running CLI commands, information appears at the end of the file<br />

similar to the following:<br />

2007/06/05 19:25:40: info: running CLI commands<br />

2007/06/05 19:25:40: info: retrieving site name<br />

2007/06/05 19:25:40: info: site name is "Tunguska"<br />

2007/06/05 19:25:40: info: retrieving groups<br />

2007/06/05 19:25:40: error: while running CLI commands: when running CLI<br />

get_groups, RC=2<br />

2007/06/05 19:25:40: error: while running CLI commands: errors retrieving<br />

groups. skipping CLI commands.<br />


Summary File<br />

The summary file is at the root of the extracted directory and contains a summary of the<br />

main events from the internal logs of the system. The format of this file is used by third-level

support. However, you might find a summary of the errors helpful in some cases.<br />

Files Directory<br />

The files directory contains several subdirectories and files in those directories. The<br />

directories are etc, home, collector, rreasons, proc, and var.<br />

etc Directory<br />

This directory contains the rc.local file, which is used by third-level support.<br />

home Directory<br />

The home directory contains the kos directory containing several files and these<br />

subdirectories: cli, connectivity_tool, control, customer_monitor, hlr, install_logs, kbox,<br />

management, monitor, mpi__perf, old_config, replication, rmi, snmp, and utils.<br />

The home directory also contains the collector and rreasons directories.<br />

collector Directory<br />

This directory contains the connectivity_tool subdirectory, which lists results from<br />

connectivity tests to configured IP addresses on the local host loopback and the specific<br />

ports on the IP addresses that require testing for various protocols.<br />

rreasons Directory<br />

This directory contains the rreasons.log file, which lists the reasons for any reboots in<br />

the specified time frame.<br />

This file is used by third-level support but can be helpful in reviewing the reboot reasons,<br />

as shown in the following sample file:<br />

***************************************************************************<br />

***************************************************************************<br />

***************************************************************************<br />

***************************************************************************<br />

***************************************************************************<br />

=== LogLT STARTED HERE - 2007/07/05 22:40:40 ===<br />

***************************************************************************<br />

***************************************************************************<br />

***************************************************************************<br />

***************************************************************************<br />

***************************************************************************<br />

Couldn't open 'logger.ini' file, so assuming default 'all' with level<br />

DEBUG2007/07/05 22:40:40.834 - #2 - 1421 - RebootReasons:<br />

getRebootReasons2007/07/05 22:40:40.834 - #2 - 1421 - rreasons: Reboot Log:<br />

[Mon Apr 16 20:33:00 2007] : kernel watchdog 0 expired (time=66714<br />

lease=1390 last_tick=65233) 0=(1390,65233) 1=(30000,63214) 2=(1400,65233)<br />


Note: In the example, the “kernel watchdog 0 expired” message indicates a typical<br />

reboot that was not a result of an error.<br />

Other Directories<br />

The proc and var directories are also contained within the files directory and are used by

third-level support.<br />

processes Directory<br />

The processes directory contains the InfoCollect, sbin, usr, home, and bin directories and<br />

several subdirectories.<br />

InfoCollect Directory<br />

Under the InfoCollect directory, the SanDiag.sh file contains the SAN diagnostic logs.<br />

The ConnectivityTest.sh file contains connection information. Connection errors in this<br />

log do not indicate an error in the configuration or function.<br />

sbin Directory<br />

This directory contains files with information pertaining to networking.<br />

• Ifconfig file: Lists configuration information as shown in the following example:<br />

eth0 Link encap:Ethernet HWaddr 00:14:22:11.DD:1B<br />

inet addr:10.10.21.51 Bcast:10.255.255.255 Mask:255.255.255.0<br />

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />

RX packets:286265797 errors:0 dropped:0 overruns:0 frame:0<br />

TX packets:228318046 errors:0 dropped:0 overruns:0 carrier:0<br />

collisions:0 txqueuelen:100<br />

RX bytes:1377792659 (1.2 GiB) TX bytes:2189256742 (2.0 GiB)<br />

Base address:0xecc0 Memory:fe6e0000-fe700000<br />

eth1 Link encap:Ethernet HWaddr 00:14:22:11.DD:1C<br />

inet addr:172.16.21.51 Bcast:172.16.255.255 Mask:255.255.0.0<br />

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />

RX packets:13341097 errors:0 dropped:0 overruns:0 frame:0<br />

TX packets:12365085 errors:0 dropped:0 overruns:0 carrier:0<br />

collisions:0 txqueuelen:5000<br />

RX bytes:4156827090 (3.8 GiB) TX bytes:4192345752 (3.9 GiB)<br />

Base address:0xdcc0 Memory:fe4e0000-fe500000<br />


lo Link encap:Local Loopback<br />

inet addr:127.0.0.1 Mask:255.0.0.0<br />

UP LOOPBACK RUNNING MTU:16436 Metric:1<br />

RX packets:11289452 errors:0 dropped:0 overruns:0 frame:0<br />

TX packets:11289452 errors:0 dropped:0 overruns:0 carrier:0<br />

collisions:0 txqueuelen:0<br />

RX bytes:3269809825 (3.0 GiB) TX bytes:3269809825 (3.0 GiB)<br />


• route file: Lists other pieces of routing information, as shown in the following<br />

example:<br />

Kernel IP routing table<br />

Destination Gateway Genmask Flags Metric Ref Use Iface<br />

10.10.21.0 * 255.255.255.0 U 0 0 0 eth0<br />

172.16.0.0 * 255.255.0.0 U 0 0 0 eth1<br />

usr Directory<br />

The usr directory contains two subdirectories: bin and sbin.<br />

The bin subdirectory contains the kps.pl file.<br />

The following is an example of the kps.pl file for an attached RA:<br />

Uptime: 7:25pm up 18:24, 1 user, load average: 5.99, 4.38, 2.05<br />

Processes:<br />

control_process - UP<br />

control_loop.tcsh - UP<br />

replication - UP<br />

mgmt_loop.tcsh - UP<br />

management_server - UP<br />

cli - down<br />

rmi_loop.tcsh - UP<br />

rmi - UP<br />

monitor_loop.tcsh - UP<br />

load_monitor.pl - UP<br />

runall - down<br />

hlr_kbox - UP<br />

rcm_run_loop.tcsh - UP<br />

customer_monitor.pl - UP<br />

Modules:<br />

st - UP<br />

sll - UP<br />

var_link - UP<br />

kaio_mod-2.4.32-k22 - UP<br />

The following is an example of the kps.pl file for a detached RA:<br />

Uptime: 7:25pm up 18:24, 1 user, load average: 5.99, 4.38, 2.05<br />

Processes:<br />

control_process - down<br />

control_loop.tcsh - down<br />

replication - down<br />

mgmt_loop.tcsh - down<br />

management_server - down<br />

cli - down<br />

rmi_loop.tcsh - down<br />

rmi - down<br />

monitor_loop.tcsh - down<br />

load_monitor.pl - down<br />

runall - down<br />

hlr_kbox - UP<br />

rcm_run_loop.tcsh - down<br />

customer_monitor.pl - down<br />

Modules:<br />

st - UP<br />


sll - UP<br />

var_link - UP<br />

kaio_mod-2.4.32-k22 - UP<br />

The sbin subdirectory contains the biosdecode and dmidecode files. The biosdecode file<br />

provides hardware-specific RA BIOS information and the pointers to locations where this<br />

information is stored. The dmidecode file provides handle and other information for<br />

components capable of passing this information to a Desktop Management Interface<br />

(DMI) agent.<br />

home Directory<br />

The home directory contains the kos subdirectory, which contains other subdirectories<br />

that yield the get_users_lock_state.tcsh file. This file contains all the users on the RA.<br />

bin Directory<br />

The bin directory contains the df-h and lspci files. The df-h file contains directory size and<br />

disk size usage statistics for the RA hard disk drive. The lspci file contains PCI bridge bus<br />

numbers, revisions, and OEM identification strings for inbuilt devices in the RA.<br />

tmp Directory<br />

The tmp directory contains the runCLI file listing the commands that generated the CLI<br />

file. It also contains the getGroups file, which is a temporary file to gather the list of<br />

consistency groups.<br />

runCLI File<br />

The following is an example of the runCLI file saved in the tmp directory that shows the<br />

CLI commands executed:<br />

• get_logs from=<time and date> to=<time and date> –n

The time and date are specified as day, month, year as follows:

get_logs from="22:03 03/08/2007" to="17:03 04/08/2007" –n

• config_io_throttling –n<br />

• config_multipath_monitoring –n<br />

• get_account_settings –n<br />

• get_alert_settings –n<br />

• get_box_states –n<br />

• get_global_policy –n<br />

• get_groups –n<br />

• get_groups_sets –n<br />

• get_group_settings –n<br />

• get_group_state –n<br />

• get_group_statistics –n<br />


• get_id_names –n<br />

• get_initiator_bindings –n<br />

• get_pairs –n<br />

• get_raw_stats –n<br />

• get_snmp_settings –n<br />

• get_syslog_settings –n<br />

• get_system_status –n<br />

• get_system_settings –n<br />

• get_system_statistics –n<br />

• get_tweak_params –n<br />

• get_version –n<br />

• get_virtual_targets –n<br />

• save_settings –n<br />

• get_splitter_settings site=""<br />

• get_splitter_states site=""<br />

• get_san_splitter_view site=""<br />

• get_san_volumes site=""<br />

• get_santap_view site=""<br />

• get_volume_settings site=""<br />

• get_volume_state site=""<br />

• get_images group="" (This command is repeated for each group.)<br />

getGroups File<br />

This internal file is used to generate the runCLI file.<br />

Host Log Extraction Directory<br />

When the RA collects a host log, the host information is collected in a directory named<br />

with the HLR-<RA identifier>-<date and time> format.

Such a directory contains a tar.gz file for servers with a name similar in format to the<br />

following:<br />

HLR-r1_USMVEAST2_1157647546524147.tar.gz<br />

When you extract a tar.gz file, you can choose to decompress the ZIP file<br />

(to_transfer.tar) to a temp folder and open it, or you can choose to extract the files to a<br />

directory.<br />

When the file is for intelligent fabric switches, the file name does not have the .gz<br />

extension.<br />
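If a tar-capable utility is available on the workstation where you review the logs, a host archive can also be unpacked from a command line in one step; for example, using the sample file name shown above:

tar -xzf HLR-r1_USMVEAST2_1157647546524147.tar.gz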


Analyzing Server (Host) Logs<br />

The output file from host collection is named<br />

Unisys_host_info___.tar.gz<br />

This file contains a folder named “collected_items,” which contains the following files<br />

and directories:<br />

• Cluster_log: a folder containing the cluster.log file generated by MSCS<br />

• Hic_logs: a folder containing logs used by third-level support<br />

• Host_logs: a folder containing logs used by third-level support<br />

• Msinfo32: information from the Msinfo32.exe file<br />

• Registry.dump: the registry dump for this server<br />

• Tweak: the internal RA parameters on this server<br />

• Watchdog log: log created by the KDriverWatchDog service<br />

• Commands: a file containing output from commands executed on this server,<br />

including<br />

− A view of the LUNs recognized by this server<br />

− Some internal RA structures<br />

− Output from the dumpcfg.exe file<br />

− Windows event logs for system, security, and applications<br />

Analyzing Intelligent Fabric Switch Logs<br />

The output file from collecting information from intelligent fabric switches is named with<br />

the following format:<br />

HLR-<RA identifier>_<switch vendor>_<switch identifier>.tar

The following name is an example of this format:<br />

HLR-l1_CISCO_232c000dec1a7a02.tar<br />

Once you extract the .tar file, some files are listed with formats similar to the following:<br />

CVT_.tar_AT__M3_tech<br />

CVT_.tar_AT__M3_isapi_tech<br />

CVT_.tar_AT__M3_santap_tech<br />



Appendix B<br />

Running Replication Appliance (RA)<br />

Diagnostics<br />

This appendix<br />

• Explains how to clear the system event log (SEL).

• Describes how to run hardware diagnostics for the RA.<br />

• Lists the LCD status messages shown on the RA.<br />

Clearing the System Event Log (SEL)<br />

Before you run the RA diagnostics, you need to clear the SEL to prevent errors from<br />

being generated during the diagnostics run.<br />

1. Insert the bootable Replication Appliance (RA) Diagnostic CD-ROM in the CD/DVD<br />

drive.<br />

2. Press Ctrl+Alt+Delete to reboot the RA.<br />

The RA boots from the CD and displays the event log menu.

3. Select Show all system event log records using the arrow keys, then press<br />

Enter.<br />

This action results in an SEL summary and indicates whether the SEL contains<br />

errors. If there are errors, an error description is given.<br />

Note: You cannot scroll up or down in this screen.<br />

A clear SEL without errors has “IPMI SEL contains 1 records” displayed in the<br />

summary. Anything greater than one record indicates that errors are present.<br />


Note: The preceding step did not clear the SEL; ignore the statement “Log area<br />

Reset/Cleared.”<br />

4. Press any key to return to the main boot menu.<br />

5. Select Clear System Event Log using the arrow keys, and press Enter to ensure<br />

that the SEL is cleared of all error entries.<br />

Note: Depending on whether there are error entries, this clearing action could take<br />

up to 1 minute to complete.<br />

6. Press any key again to return to the main boot menu.<br />

7. Select Show all system event log records using the arrow keys and press<br />

Enter. Confirm that “IPMI SEL contains 1 records” is shown.<br />

8. Press any key to return to the main boot menu.<br />

Note: If you accidentally press Escape and leave the main boot menu, a Diag<br />

prompt is displayed. Type menu to return to the main boot menu.<br />

Running Hardware Diagnostics<br />

Running the hardware diagnostics for the RA includes completing the Custom Test and<br />

Express Test diagnostics.<br />

Follow these steps to run the hardware diagnostics for the RA:<br />

1. At the main boot menu, use the arrow keys to select Run Diags …; then press<br />

Enter.<br />

2. On the Customer Diagnostic Menu, press 2 to select Run ddgui graphics-based diagnostic.

The system diagnostic files begin loading and a message is displayed giving<br />

information about the software and showing “initializing…”<br />

Once the diagnostics are loaded and ready to be executed, the Main Menu is<br />

displayed.<br />



Custom Test<br />


1. On the Main Menu, select Custom Test using the arrow keys; then press Enter.<br />

The Custom Test dialog box is displayed.

2. Expand the PCI Devices folder to view the PCI devices installed in the system<br />

including those devices that are “on-board.”<br />

3. Select the PCI Devices folder; then press Enter.<br />

This action causes each PCI device to be interrogated in turn and a message is<br />

displayed for each one. Verify that the correct number of QLogic adapters is shown.<br />

4. Press OK after each message is displayed until all PCI devices have been recognized<br />

and passed. The message “All tests passed.” is displayed.<br />

Note: If any devices fail this test, investigate and rectify the problem; then clear the<br />

SEL as explained in “Clearing the System Event Log (SEL).”<br />

5. Close the Custom Test dialog box and return to the Main Menu.<br />


Express Test<br />

1. On the Main Menu, select Express Test using the arrow keys; then press Enter.<br />

A warning is displayed advising that media must be installed on all drives or else<br />

some tests might fail.<br />

2. If a diskette drive is installed in the system, insert a blank, formatted diskette and<br />

then click OK to start the test. If no diskette drive is installed, just click OK.<br />

During testing, a status screen is displayed.<br />

If the diagnostic test run is successful, the message “All tests passed.” appears.<br />

Notes:<br />

• During the video portion of the testing, the screen typically flickers and goes<br />

blank.<br />

• If any errors occur, investigate and resolve the problem, and then rerun the<br />

diagnostic tests. Before you rerun the tests, be sure to clear the SEL as<br />

explained in “Clearing the System Event Log (SEL).”<br />

3. Click OK to exit the diagnostic tests.<br />

The Main Menu is then displayed.<br />

4. Select Exit using the arrow keys; then press Enter.<br />

The following message is displayed:<br />

Displaying the end of test result.log ddgui.txt. Strike a Key when ready.<br />

5. Press any key to display the diagnostic test summary screen.<br />

6. Verify that no errors are listed. Scroll up and down to see the different portions of<br />

the output.<br />

Note: If any errors are listed, investigate and resolve the problem; then rerun the<br />

diagnostic tests. Before you rerun the tests, be sure to clear the SEL as explained in<br />

“Clearing the System Event Log (SEL).”<br />

7. Press Escape to return to the original Customer Diagnostic Menu.<br />

8. Press 4 to quit and return to the main boot menu.<br />

9. Select Exit; then press Enter.<br />

10. Remove all media from the diskette and CD/DVD drives.<br />

LCD Status Messages<br />

The LCDs on the RA signify status messages. Table B–1 lists the LCD status messages<br />

that can occur and the probable cause for each message. The LCD messages refer to<br />

events recorded in the SEL.<br />

Note: For information about corrective actions for the messages listed in Table B–1,<br />

refer to the documentation supplied with the system.<br />

B–4 6872 5688–002


Running Replication Appliance (RA) Diagnostics<br />

Table B–1. LCD Status Messages<br />

Line 1 Message Line 2 Message Cause<br />

SYSTEM ID SYSTEM NAME The system ID is a unique name, 5 characters or less,<br />

defined by the user.<br />

The system name is a unique name, 16 characters or<br />

less, defined by the user.<br />

The system ID and name display under the following<br />

conditions:<br />

• The system is powered on.<br />

• The power is off and active POST errors are<br />

displayed.<br />

E000 OVRFLW CHECK LOG LCD overflow message. A maximum of three error<br />

messages can display sequentially on the LCD. The<br />

fourth message is displayed as the standard overflow<br />

message.<br />

E0119 TEMP AMBIENT Ambient system temperature is out of the acceptable<br />

range.<br />

E0119 TEMP BP The backplane board is out of the acceptable temperature<br />

range.<br />

E0119 TEMP CPU n The specified microprocessor is out of the acceptable<br />

temperature range.<br />

E0119 TEMP SYSTEM The system board is out of the acceptable temperature<br />

range.<br />

E0212 VOLT 3.3 The system power supply is out of the acceptable voltage<br />

range; the power supply is faulty or improperly installed.<br />

E0212 VOLT 5 The system power supply is out of the acceptable voltage<br />

range; the power supply is faulty or improperly installed.<br />

E0212 VOLT 12 The system power supply is out of the acceptable voltage<br />

range; the power supply is faulty or improperly installed.<br />

E0212 VOLT BATT Faulty battery; faulty system board.<br />

E0212 VOLT BP 12 The backplane board is out of the acceptable voltage<br />

range.<br />

E0212 VOLT BP 3.3 The backplane board is out of the acceptable voltage<br />

range.<br />

E0212 VOLT BP 5 The backplane board is out of the acceptable voltage<br />

range.<br />

E0212 VOLT CPU VRM The microprocessor voltage regulator module (VRM)<br />

voltage is out of the acceptable range. The<br />

microprocessor VRM is faulty or improperly installed. The<br />

system board is faulty.<br />

E0212 VOLT NIC 1.8V Integrated NIC voltage is out of the acceptable range; the<br />

power supply is faulty or improperly installed. The system<br />

board is faulty.<br />


E0212 VOLT NIC 2.5V Integrated NIC voltage is out of the acceptable range. The<br />

power supply is faulty or improperly installed. The system<br />

board is faulty.<br />

E0212 VOLT PLANAR REG The system board is out of the acceptable voltage range.<br />

The system board is faulty.<br />

E0276 CPU VRM n The specified microprocessor VRM is faulty,<br />

unsupported, improperly installed, or missing.<br />

E0276 MISMATCH VRM n The specified microprocessor VRM is faulty,<br />

unsupported, improperly installed, or missing.<br />

E0280 MISSING VRM n The specified microprocessor VRM is faulty,<br />

unsupported, improperly installed, or missing.<br />

E0319 PCI OVER CURRENT The expansion cord is faulty or improperly installed.<br />

E0412 RPM FAN n The specified cooling fan is faulty, improperly installed, or<br />

missing.<br />

E0780 MISSING CPU 1 Microprocessor is not installed in socket PROC_1.<br />

E07F0 CPU IERR The microprocessor is faulty or improperly installed.<br />

E07F1 TEMP CPU n HOT The specified microprocessor is out of the acceptable<br />

temperature range and has halted operation.<br />

E07F4 POST CACHE The microprocessor is faulty or improperly installed.<br />

E07F4 POST CPU REG The microprocessor is faulty or improperly installed.<br />

E07FA TEMP CPU n THERM The specified microprocessor is out of the acceptable<br />

temperature range and is operating at a reduced speed or<br />

frequency.<br />

E0876 POWER PS n No power is available from the specified power supply.<br />

The specified power supply is improperly installed or<br />

faulty.<br />

E0880 INSUFFICIENT PS Insufficient power is being supplied to the system. The<br />

power supplies are improperly installed, faulty, or<br />

missing.<br />

E0CB2 MEM SPARE ROW The correctable errors threshold was met in a memory<br />

bank; the errors were remapped to the spare row.<br />

E0CF1 MBE DIMM Bank n The memory modules installed in the specified bank are<br />

not the same type and size. The memory module or<br />

modules are faulty.<br />

E0CF1 POST MEM 64K A parity failure occurred in the first 64 KB of main<br />

memory.<br />

E0CF1 POST NO MEMORY The main-memory refresh verification failed.<br />

E0CF5 LGO DISABLE SBE Multiple single-bit errors occurred on a single memory<br />

module.<br />

B–6 6872 5688–002


Running Replication Appliance (RA) Diagnostics<br />

Table B–1. LCD Status Messages<br />

Line 1 Message Line 2 Message Cause<br />

E0D76  DRIVE FAIL  A hard drive or RAID controller is faulty or improperly installed.
E0F04  POST DMA INIT  Direct memory access (DMA) initialization failed. DMA page register write/read operation failed.
E0F04  POST MEM RFSH  The main-memory refresh verification failed.
E0F04  POST SHADOW  BIOS-shadowing failed.
E0F04  POST SHD TEST  The shutdown test failed.
E0F0B  POST ROM CHKSUM  The expansion card is faulty or improperly installed.
E0F0C  VID MATCH CPU n  The specified microprocessor is faulty, unsupported, improperly installed, or missing.
E10F3  LOG DISABLE BIOS  The BIOS disabled logging errors.
E13F2  IO CHANNEL CHECK  The expansion card is faulty or improperly installed. The system board is faulty.
E13F4  PCI PARITY
E13F5  PCI SYSTEM
E13F8  CPU BUS INIT  The microprocessor or system board is faulty or improperly installed.
E13F8  CPU MCKERR  Machine check error. The microprocessor or system board is faulty or improperly installed.
E13F8  HOST TO PCI BUS
E13F8  MEM CONTROLLER  A memory module or the system board is faulty or improperly installed.
E20F1  OS HANG  The operating system watchdog timer has timed out.
EFFF1  POST ERROR  A BIOS error occurred.
EFFF2  BP ERROR  The backplane board is faulty or improperly installed.

6872 5688–002 B–7


Running Replication Appliance (RA) Diagnostics<br />

B–8 6872 5688–002


Appendix C
Running Installation Manager Diagnostics

To determine the causes of various problems as well as perform numerous procedures, you must access the Installation Manager functions and diagnostics capabilities.

Using the SSH Client

Throughout the procedures in this guide you might need to use the secure shell (SSH) client. Perform the following steps whenever you are asked to use the SSH client or to open a PuTTY session:

1. From Windows Explorer, double-click the PuTTY.exe file.
2. When prompted, enter the applicable IP address.
3. Select SSH for the protocol and keep the default port settings (port 22).
4. Click Open.
5. If prompted by a PuTTY security dialog box, click Yes.
6. When prompted to log in, type the identified user name and then press Enter.
7. When prompted for a password, type the identified password and then press Enter.
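If you prefer a command-line client to the PuTTY window, the same session can be opened with PuTTY's plink.exe, which this guide also uses for the commands in Appendix D. The following two lines are a minimal sketch only; the IP address is a placeholder for the applicable RA or site management address, and you are still prompted for the identified user name and password:

   cd putty
   plink -ssh <applicable IP address> -P 22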

Running Diagnostics

When you open the PuTTY session and log in as boxmgmt/boxmgmt, the Main Menu of Installation Manager is displayed. This menu offers the following six choices: Installation, Setup, Diagnostics, Cluster Operations, Reboot/Shutdown, and Quit.

For more information about these capabilities, see the Unisys SafeGuard Solutions Replication Appliance Installation Guide.


To access the various diagnostic capabilities of Installation Manager, perform the following steps:

1. Open a PuTTY session using the IP address of the RA, and log in as boxmgmt/boxmgmt.

   The Main Menu is displayed, as follows:

   ** Main Menu **
   [1] Install
   [2] Setup
   [3] Diagnostics
   [4] Cluster Operations
   [5] Reboot / Shutdown
   [Q] Quit

2. Type 3 (Diagnostics) and press Enter.

   The Diagnostics menu is displayed as follows:

   ** Diagnostics **
   [1] IP diagnostics
   [2] Fibre Channel diagnostics
   [3] Synchronization diagnostics
   [4] Collect system info
   [B] Back
   [Q] Quit

The four diagnostics capabilities are explained in the following topics.

IP Diagnostics

Use the IP diagnostics when you need to check port connectivity, view IP addresses, test throughput, and review other related information.

On the Diagnostics menu, type 1 (IP diagnostics) and press Enter to access the IP Diagnostics menu as shown:

** IP Diagnostics **
[1] Site connectivity tests
[2] View IP details
[3] View routing table
[4] Test throughput
[5] Port diagnostics
[6] System connectivity
[B] Back
[Q] Quit

Site Connectivity Tests

On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter to access the Site connectivity tests menu.

Note: You must apply settings to the RA before you can test options 1 through 4 in the following list.

The options to test are as follows:

** Select the target to which to test connectivity: **
[1] Gateway
[2] Primary DNS server
[3] Secondary DNS server
[4] NTP Server
[5] Other host
[B] Back
[Q] Quit

Tests for options 1 through 4 return a result of success or failure.

For option 5, you must specify the target IP address that you want to test. The test returns the relative success of 0 through 100 percent over both the management and WAN interfaces.

View IP Details

From the IP Diagnostics menu, type 2 (View IP details) and press Enter to run an ifconfig process on the RA. The displayed results of the process are similar to the following:

eth0    Link encap:Ethernet  HWaddr 00:0F:1F:6A:03:E7
        inet addr:10.10.17.61  Bcast:10.10.17.255  Mask:255.255.255.0
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
        RX packets:12751337 errors:0 dropped:0 overruns:0 frame:0
        TX packets:13628048 errors:0 dropped:0 overruns:0 carrier:0
        collisions:0 txqueuelen:1000
        RX bytes:1084700432 (1034.4 Mb)  TX bytes:2661155798 (2537.8 Mb)
        Base address:0xecc0 Memory:fe6e0000-fe700000

eth1    Link encap:Ethernet  HWaddr 00:0F:1F:6A:03:E8
        inet addr:172.16.17.61  Bcast:172.16.255.255  Mask:255.255.0.0
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
        RX packets:10519453 errors:0 dropped:0 overruns:0 frame:0
        TX packets:10244866 errors:0 dropped:0 overruns:0 carrier:0
        collisions:0 txqueuelen:5000
        RX bytes:2846677622 (2714.8 Mb)  TX bytes:2702094827 (2576.9 Mb)
        Base address:0xdcc0 Memory:fe4e0000-fe500000

eth1:1  Link encap:Ethernet  HWaddr 00:0F:1F:6A:03:E8
        inet addr:172.16.17.60  Bcast:172.16.255.255  Mask:255.255.0.0
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
        Base address:0xdcc0 Memory:fe4e0000-fe500000

lo      Link encap:Local Loopback
        inet addr:127.0.0.1  Mask:255.0.0.0
        UP LOOPBACK RUNNING  MTU:16436  Metric:1
        RX packets:3853904 errors:0 dropped:0 overruns:0 frame:0
        TX packets:3853904 errors:0 dropped:0 overruns:0 carrier:0
        collisions:0 txqueuelen:0
        RX bytes:3312865098 (3159.3 Mb)  TX bytes:3312865098 (3159.3 Mb)

View Routing Table

On the IP Diagnostics menu, type 3 (View routing table) and press Enter to display the routing table.

Test Throughput

On the IP Diagnostics menu, type 4 (Test throughput) and press Enter to use iperf to test throughput to another RA.

Once you select this option, Installation Manager guides you through the following dialog. The bold text shows sample entries.

Note: The Fibre Channel interface only appears if the Installation Manager diagnostic capability was preconfigured to run on Fibre Channel. Then the option appears as [2] in the menu list.

Enter the IP address to which to test throughput:
>>192.168.1.86
Select the interface from which to test throughput:
** Interface **
[1] Management interface
[2] Fibre Channel Interface
[3] WAN interface
>>3
Enter the desired number of concurrent streams:
>>2
Enter the test duration (seconds):
>>10

If the test is successful, the system responds with a standard iperf output that resembles the following:

Checking connectivity to 10.10.17.51
Connection to 10.10.17.51 established.
Client Connecting to 10.10.17.51, TCP port 5001
Binding to local address 10.10.17.61
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)
[ 6] local 10.10.17.61 port 35222 connected with 10.10.17.51 port 5001
[ 5] local 10.10.17.61 port 35222 connected with 10.10.17.51 port 5001
[ ID] Interval       Transfer     Bandwidth
[ 5]  0.0-10.6 sec   59.1 Mbytes  46.9 Mbits/sec
[ 6]  0.0-10.6 sec   59.1 Mbytes  46.9 Mbits/sec
[SUM] 0.0-10.6 sec   118 Mbytes   93.9 Mbits/sec
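The throughput test wraps the standard iperf utility, so a comparable manual measurement can be taken outside Installation Manager if one is ever requested. The following commands are a sketch only and are not a documented step of this guide; they assume iperf is available on both test hosts, and the addresses are placeholders taken from the sample output above:

   On the server side (the peer test host):   iperf -s
   On the client side:                        iperf -c 10.10.17.51 -B 10.10.17.61 -P 2 -t 10

In this sketch, -B binds the client to the local WAN address, -P sets the number of concurrent streams, and -t sets the test duration in seconds, matching the values entered in the dialog.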

Port Diagnostics

On the IP Diagnostics menu, type 5 (Port diagnostics) and press Enter to check that none of the ports used by the RAs are blocked (for example, by a firewall). You must test each RA individually—that is, designate each RA, in turn, to be the server.

Once you select the option, Installation Manager guides you through one of the following dialogs, depending on whether you designate the RA to be the server or the client. In the dialogs, sample entries are bold.

For the server, the dialog is as follows:

In which mode do you want to run ports diagnostics?
** **
[1] Server
[2] Client
>>1

Note: Before you select the server designation for the RA, detach the RA that you intend to specify as the server.

After you specify the RA that you want to test as the server, move to the RA from which you wish to run the port diagnostics tests. Designate that RA as a client, as noted in the following dialog:

** **
[1] Server
[2] Client
>>2
Did you already designate another RA to be the server (y/n)
>>y
Enter the IP address to test:
>>10.10.17.51

If the test is successful, the system responds with output that resembles the following:

Port No.   TCP Connection
5030       OK
5040       OK
4401       OK
1099       OK
5060       Blocked
4405       OK
5001       OK
5010       OK
5020       OK

Correct the problem on any port that returns a Blocked response.
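While you adjust the firewall, a single suspect port can also be probed directly from the management PC with a plain TCP connection attempt. This is a sketch only and is not part of the Installation Manager dialogs; it assumes the Windows telnet client is installed, and the address and port shown are placeholders:

   telnet 10.10.17.51 5060

If the connection attempt fails while other ports connect, that port is still being blocked somewhere along the path.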

System Connectivity

Use the system connectivity options to test connections and generate reports on connections between RAs anywhere in the system. You can perform the tests during installation and during normal operation. The tests performed to verify connections are as follows:

• Ping
• TCP (to ports and IP addresses, to the specific processes of the RA, and using SSH)
• UDP (general and to RA processes)
• RA internal protocols

On the IP Diagnostics menu, type 6 (System connectivity) and press Enter to access the System Connectivity menu as follows:

** System Connectivity **
[1] System connectivity test
[2] Advanced connectivity test
[3] Show all results from last connectivity check
[B] Back
[Q] Quit

When you select System connectivity test and Full mesh network check, the test reports errors in communications from any RA to any other RA in the system.

When you select System connectivity test and Check from local RA to all other boxes, the test reports errors from the local RA to any other RA in the system.

When you select Advanced connectivity test, the test reports on the connection from an IP address that you specify on the local appliance to an IP address and port that you specify on an RA anywhere in the system. Use this option to diagnose a problem specific to a local IP address or port.

When you select Show all results from last connectivity check, the test reports all results from the previous tests—not only the errors, but also the tests that completed successfully.

You might receive one of the messages shown in Table C–1 from the connectivity test tool.

Table C–1. Messages from the Connectivity Testing Tool

Message: Machine is down.
Meaning: There is no communication with the RA. Perform the following steps to determine the problem:
• Verify that the firewall permits pinging the RA, that is, using an ICMP echo.
• Check that the RA is connected and operating.
• Check that the required ports are open. (Refer to Section 7, "Solving Networking Problems," for tables with the port information.)

Message: is down.
Meaning: The host connection exists but the RA is not responding. Perform the following steps to determine the problem:
• Check that the required ports are open. (Refer to Section 7, "Solving Networking Problems," for tables with the port information.)
• Verify that the RA is attached to an RA cluster.

Message: Connection to link: protocol: FAILED.
Meaning: No connection is available to the host through the protocol.

Message: Link () FAILED.
Meaning: The connection that was checked has failed.

Message: All OK.
Meaning: The connection is working.

To discover which port is involved in the error or failure, run the test again and select Show all results from last connectivity check. The port on which each failure occurred is shown.

Fibre Channel Diagnostics

Use the Fibre Channel diagnostics when you need to check SAN connections, review port settings, see details of the Fibre Channel, determine Fibre Channel targets and LUNs, and perform I/O operations to a LUN.

On the Diagnostics menu, type 2 (Fibre Channel diagnostics) and press Enter to access the Fibre Channel Diagnostics menu as follows:

** Fibre Channel Diagnostics **
[1] Run SAN diagnostics
[2] View Fibre Channel details
[3] Detect Fibre Channel targets
[4] Detect Fibre Channel LUNs
[5] Detect Fibre Channel SCSI-3 reserved LUNs
[6] Perform I/O to a LUN
[B] Back
[Q] Quit

Run SAN Diagnostics

On the Fibre Channel Diagnostics menu, type 1 (Run SAN diagnostics) and press Enter to run the SAN diagnostics.

When you select this option, the system conducts a series of automatic tests to identify the most common problems encountered in the configuration of SAN environments, such as the following:

• Storage inaccessible within a site
• Delays with writes or reads to disk
• Disk not accessible in the network
• Configuration issues

Once the tests complete, a message is displayed confirming the successful completion of SAN diagnostics, or a report is displayed that provides additional details.

Results similar to the following are displayed for a successful diagnostics run of port 0:

0 errors:
0 warnings:
Total=0

Sample results follow for a diagnostics run that returns errors:

ConfigB_Site2 Box2>>1
>>Running SAN diagnostics. This may take a few moments...
results of SAN diagnostics are
3 errors:
1. Found device with no guid : wwn=5006016b1060090d lun=0 port=0 vendor=DGC product=LUNZ
2. Found device with no guid : wwn=500601631060090d lun=0 port=0 vendor=DGC product=LUNZ
3. Found device with no guid : wwn=5006016b1060090d lun=0 port=1 vendor=DGC product=LUNZ
9 warnings:
1. device wwn=500601631060090d lun=8 guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,125,87,93,152,230,229,218,17)) found in port 1 and not in port 0
2. device wwn=500601631060090d lun=7 guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,127,87,93,152,230,229,218,17)) found in port 1 and not in port 0
3. device wwn=500601631060090d lun=6 guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,129,87,93,152,230,229,218,17)) found in port 1 and not in port 0
4. device wwn=500601631060090d lun=5 guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,131,87,93,152,230,229,218,17)) found in port 1 and not in port 0
5. device wwn=500601631060090d lun=4 guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,133,87,93,152,230,229,218,17)) found in port 1 and not in port 0
6. device wwn=500601631060090d lun=3 guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,135,87,93,152,230,229,218,17)) found in port 1 and not in port 0
7. device wwn=500601631060090d lun=2 guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,137,87,93,152,230,229,218,17)) found in port 1 and not in port 0
8. device wwn=500601631060090d lun=1 guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,139,87,93,152,230,229,218,17)) found in port 1 and not in port 0
9. device wwn=500601631060090d lun=0 guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,141,87,93,152,230,229,218,17)) found in port 1 and not in port 0
Total=12

View the Fibre Channel Details

On the Fibre Channel Diagnostics menu, type 2 (View Fibre Channel details) and press Enter to show the current Fibre Channel details.

The operation mode is identified automatically according to the SAN switch configuration. Usually the RA is configured for the point-to-point mode unless the SAN switch is hard-wired to port L.

Note: You can use the View Fibre Channel details capability to obtain information about WWNs that is needed for zoning.

You can check the status for the following on the Fibre Channel Diagnostics menu:

• Speed
• Operating mode
• Node WWN
• Changes made
• Connection issues
• Additions of new HBAs

Sample results showing Fibre Channel details for port 0 and port 1 follow:

ConfigB_Site2 Box2>>2
>> Port 0
------------------------------------
wwn = 5001248200875c81
node_wwn = 5001248200875c80
port id = 0x20100
operating mode = point to point
speed = 2 GB

Port 1
------------------------------------
wwn = 5001248201a75c81
node_wwn = 5001248201a75c80
port id = 0x20500
operating mode = point to point
speed = 2 GB

If all cables are disconnected, the operating mode results for all ports are disconnected. If only one cable is disconnected, then the operating mode for the affected port is disconnected, as shown in the following sample results:

ConfigB_Site2 Box2>>2
>> Port 0
------------------------------------
wwn = 5001248200875c81
node_wwn = 5001248200875c80
port id = 0x20100
operating mode = point to point
speed = 2 GB

Port 1
------------------------------------
wwn = 5001248201a75c81
node_wwn = 5001248201a75c80
port id = 0x0
operating mode = disconnected
speed = 2 GB

Detect Fibre Channel Targets

On the Fibre Channel Diagnostics menu, type 3 (Detect Fibre Channel targets) and press Enter to see a list of the targets that are accessible to the RA through ports A and B.

Some of the reasons to use this capability are as follows:

• Zoning issues
• Failure to detect a host
• SAN connection issues
• Need for WWN or storage details of each RA

The following sample results provide port WWN, node WWN, and port information:

ConfigB_Site2 Box2>>3
>>
Port 0
Port WWN              Node WWN              Port ID
----------------------------------------------------
1) 0x500601631060090d 0x500601609060090d 0x20000
2) 0x5006016b1060090d 0x500601609060090d 0x20400

Port 1
Port WWN              Node WWN              Port ID
----------------------------------------------------
1) 0x500601631060090d 0x500601609060090d 0x20000
2) 0x5006016b1060090d 0x500601609060090d 0x20400

Detect Fibre Channel LUNs

On the Fibre Channel Diagnostics menu, type 4 (Detect Fibre Channel LUNs) and press Enter to see a list of all volumes on the SAN that are visible to the RA.

Using this capability, you can detect

• Issues with volume access
• LUN repository details
• Additions of volumes

In the following sample results that show the types of information returned, the information wraps around:

ConfigB_Site2 Box2>>4
>>This operation may take a few minutes...
Size Vendor Product Serial Number Vendor Specific UID
Port WWN LUN CGs Site ID
================================================================================
1. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 127
   CLARION: 60,06,01,60,9b,c3,0e,00,8d,57,5d,98,e6,e5,da,11:0
   1 500601631060090d 0 2
2. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 125
   CLARION: 60,06,01,60,9b,c3,0e,00,8b,57,5d,98,e6,e5,da,11:0
   1 500601631060090d 1 2
3. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 123
   CLARION: 60,06,01,60,9b,c3,0e,00,89,57,5d,98,e6,e5,da,11:0
   1 500601631060090d 2 2
4. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 121
   CLARION: 60,06,01,60,9b,c3,0e,00,87,57,5d,98,e6,e5,da,11:0
   1 500601631060090d 3 2
5. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 119
   CLARION: 60,06,01,60,9b,c3,0e,00,85,57,5d,98,e6,e5,da,11:0
   1 500601631060090d 4 2
6. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 117
   CLARION: 60,06,01,60,9b,c3,0e,00,83,57,5d,98,e6,e5,da,11:0
   1 500601631060090d 5 2
7. 1.00GB DGC RAID 5 APM00031800182 LUN ID: 115
   CLARION: 60,06,01,60,9b,c3,0e,00,81,57,5d,98,e6,e5,da,11:0
   1 500601631060090d 6 0
8. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 113
   CLARION: 60,06,01,60,9b,c3,0e,00,7f,57,5d,98,e6,e5,da,11:0
   1 500601631060090d 7 2
9. 62.00GB DGC RAID 5 APM00031800182 LUN ID: 111
   CLARION: 60,06,01,60,9b,c3,0e,00,7d,57,5d,98,e6,e5,da,11:0
   1 500601631060090d 8 40
10. N/A DGC LUNZ APM00031800182 - N/A
    0 500601631060090d 0 N/A
11. N/A DGC LUNZ APM00031800182 - N/A
    0 5006016b1060090d 0 N/A
12. N/A DGC LUNZ APM00031800182 - N/A
    1 5006016b1060090d 0 N/A

Detect Fibre Channel SCSI-3 Reserved LUNs

On the Fibre Channel Diagnostics menu, type 5 (Detect Fibre Channel SCSI-3 reserved LUNs) and press Enter to list all LUNs that have SCSI-3 reservations. The information returned includes the WWN, LUN number, port number, and reservation type.

Perform I/O to a LUN

On the Fibre Channel Diagnostics menu, type 6 (Perform I/O to a LUN) and press Enter to initiate a dialog that guides you through performing an I/O operation to a LUN.

Note: The write operation removes any data that you might have. Use the write operation only when you are installing at the site.

The following example for a read operation shows sample responses in bold type.

SYDNEY Box1>>6
>>This operation may take a few minutes...
Size Vendor Product Serial Number Vendor Specific UID
Port WWN Ctrl LUN
============================================================================
1. 9.00GB DGC RAID 5 APM00024400378 LUN 29 Sydney
   JouCLARION: 60,06,01,60,db,e3,0e,00,d1,3a,e0,54,cd,b6,db,11:0
   0 500601601060009a SP-A 0
   0 500601681060009a SP-B 0
   1 500601601060009a SP-A 0
   1 500601681060009a SP-B 0
.
.
.
10. 10.00GB DGC RAID 5 APM00024400378 LUN 36 Sydney
    JouCLARION: 60,06,01,60,db,e3,0e,00,da,3a,e0,54,cd,b6,db,11:0
    0 500601601060009a SP-A 10
    0 500601681060009a SP-B 10
    1 500601601060009a SP-A 10
    1 500601681060009a SP-B 10
Select: 6
Select operation to perform:
** Operation To Perform **
[1] Read
[2] Write
SYDNEY Box1>>1
>>
Enter the desired transaction size:
SYDNEY Box1>>10485760
Do you want to read the whole LUN? (y/n)
>>y
1 buffers in
1 buffers out
total time : 0.395567 seconds
2.65082e+07 bytes/sec
25.2802 MB/sec
2.52802 IO/sec
CRC = 4126172682534249172
I/O succeeded.

The following example for a write operation shows sample responses in bold type.

SYDNEY Box1>>6
>>This operation may take a few minutes...
Size Vendor Product Serial Number Vendor Specific UID
Port WWN Ctrl LUN
============================================================================
1. 9.00GB DGC RAID 5 APM00024400378 LUN 29 Sydney
   JouCLARION: 60,06,01,60,db,e3,0e,00,d1,3a,e0,54,cd,b6,db,11:0
   0 500601601060009a SP-A 0
   0 500601681060009a SP-B 0
   1 500601601060009a SP-A 0
   1 500601681060009a SP-B 0
.
.
.
10. 10.00GB DGC RAID 5 APM00024400378 LUN 36 Sydney
    JouCLARION: 60,06,01,60,db,e3,0e,00,da,3a,e0,54,cd,b6,db,11:0
    0 500601601060009a SP-A 10
    0 500601681060009a SP-B 10
    1 500601601060009a SP-A 10
    1 500601681060009a SP-B 10
============================================================================
Select: 10
Select operation to perform:
** Operation To Perform **
[1] Read
[2] Write
SYDNEY Box1>>2
>>
Enter the desired transaction size:
SYDNEY Box1>>10485760
Enter the number of transactions to perform:
SYDNEY Box1>>100
Enter the number of blocks to skip:
SYDNEY Box1>>16
100 buffers in
100 buffers out
total time : 40.7502 seconds
2.57318e+07 bytes/sec
24.5398 MB/sec
2.45398 IO/sec
CRC = 3829111553924479115
I/O succeeded.

Synchronization Diagnostics

On the Diagnostics menu, type 3 (Synchronization diagnostics) and press Enter to verify that an RA is synchronized.

Note: The RA must be attached to run the synchronization diagnostics. Reattaching the RA causes the RA to reboot.

The results displayed are similar to the following example:

remote           refid            st t when poll reach  delay   offset  jitter
=============================================================================
*10.10.0.1       192.116.202.203   3 u  438 1024  377   0.337   12.971   6.241
+11              10.10.0.1         2 u  484 1024  376   0.090   -4.530   0.023
LOCAL(0)         LOCAL(0)         13 1    2   64  377   0.000    0.000   0.004

The columns in the previous output are defined as follows:

• remote—host names or addresses of the servers and peers used for synchronization
• refid—current source of synchronization
• st—stratum
• t—type (u=unicast, m=multicast, l=local, – =do not know)
• when—time since the peer was last heard, in seconds
• poll—poll interval, in seconds
• reach—status of the reachability register in octal format
• delay—latest delay in milliseconds
• offset—latest offset in milliseconds
• jitter—latest jitter in milliseconds

The symbol at the left margin indicates the synchronization status of each peer. The currently selected peer is marked with an asterisk (*); additional peers designated as acceptable for synchronization are marked with a plus sign (+). Peers marked with * and + are included in the weighted average computation to set the local clock. Data produced by peers marked with other symbols is discarded. The LOCAL(0) entry represents the values obtained from the internal clock on the local machine.

Collect System Info

On the Diagnostics menu, type 4 (Collect system info) and press Enter to collect system information for later processing and analysis. You specify where to place the information collected. In some cases, you might need to transfer it to a vendor for technical support. You are prompted to provide the following information:

• The time frame for log collection
• Whether to collect information from the remote site
• FTP details if you choose to send the results to an FTP server
• Which logs to collect
• Whether you have SANTap switches from which you want to collect information

Note: The dialog asks whether you want full collection. If you choose full collection, additional technical information is supplied, but the time required for the collection process is lengthened. Unless specifically instructed by a Unisys service representative, do not choose full collection.

The following dialog provides sample responses in bold type for collecting system information:

>>GMT right now is 11/24/2005 14:45:43
Enter the start date:
>>11/22/2005
Enter the start time:
>>12:00:00
Enter the end date:
>>11/24/2005
Enter the end time:
>>14:45:43
Note: The start and end times are used only for collection of the system logs. Logs from hosts are collected in their entirety.
Do you want to collect system information from the other site also? (y/n)
>>y
Do you want to send results to an ftp server? (y/n)
>>y
Enter the name of the ftp server to which you want to transfer the collected system information:
>>ftp.ess.unisys.com
Enter the port number to which to connect on the FTP server:
>>21
Enter the FTP user name:
>>MY_USERNAME
Enter the location on the FTP server in which you want to put the collected system information:
>>incoming
Enter the file on the FTP server in which you want to put the collected system information:
>>19557111_company.tar
Enter the FTP password:
>>*******
Select the logs you want to collect:
** Collection mode **
[1] Collect logs from RAs only
[2] Collect logs from hosts only
[3] Collect logs from RAs and hosts
>>3
Do you have SANTap switches from which you want to collect information?
>>n
Do you want to perform full collection? (y/n)
>>n
Do you want to limit collection time? (y/n)
>>n

Once you complete the information-entry dialog, Installation Manager checks connectivity and displays a list of accessible hosts for which the feature is enabled. (See the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide for more information.) You must indicate the hosts for which you want to collect logs. You can select one or more individual hosts or enter NONE or ALL.

Once you specify the hosts, Installation Manager returns system information and logs for all accessible RAs, including the remote RAs, if so instructed. This software also returns a success or failure status report for each RA from which it has been instructed to collect information.

Installation Manager also collects logs for the selected hosts and reports on the success or failure of each collection. The timeout on the collection process is 20 minutes.

Once the information is collected, if you requested that it be stored on an FTP server, the system reports that it is transferring the collected information to the specified FTP location. Once the transfer completes, you are prompted to press Enter to continue.

You can also open or download the stored files using your browser. Log in as webdownload/webdownload, and access the files at one of these URLs:

• For nonsecured servers: http://<RA IP address>/info/
• For secured servers: https://<RA IP address>/info/
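If a browser is not convenient, a command-line HTTP client on the management PC can retrieve the same files. The following line is a sketch only and is not part of this guide; it assumes the curl utility is installed, and the address and file name are placeholders:

   curl -u webdownload:webdownload -O http://<RA IP address>/info/<collected file name>

For a secured (https) server, curl might also need its -k option if the appliance certificate is self-signed.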

The following error conditions apply:

• If the connection with an RA is lost while information collection is in progress, no information is collected. You can run the process again. If the collection from the remote site failed because of a WAN failure, run the process locally at the remote site.
• If simultaneous information collection is occurring from the same RA, only the collector that established the first connection can succeed.
• FTP failure results in failure of the entire process.

If this process fails to collect the desired host information, you can alternatively generate host information collection directly for individual hosts. Use the Host Information Collector (HIC) utility as described in Appendix A. Also, the Unisys SafeGuard Solutions Administrator's Guide provides additional information about the HIC utility.


Appendix D
Replacing a Replication Appliance (RA)

To replace an RA at a site, you must perform the following tasks as described in this appendix:

• Save configuration settings.
• Record the group properties and save the Global cluster mode settings.
• Modify the Preferred RA setting.
• Detach the failed RA.
• Remove the Fibre Channel adapter cards.
• Install and configure the replacement RA.
• Verify the RA installation.
• Restore group properties.
• Ensure the existing RA can switch over to the new RA.

Note: During this process, be sure that the direction of all consistency groups is from the site without the failed RA to the site with the failed RA. You might need to move groups.

Saving the Configuration Settings

Before you replace an RA, Unisys recommends that you save the current environment settings to a file. The saved file is a script that contains CLI commands for all groups, volumes, and replication pairs needed to re-create the environment. The file is used for backup purposes only.

1. From a command prompt on the management PC, enter the following command to change to the directory where the plink.exe file is located:

   cd putty

2. Update the following command with your site management IP address and administrator (admin) password, and then enter the command:

   plink -ssh site management IP address -l admin -pw admin password save_settings > sitexandsitey.txt

   Note: If a message is displayed asking whether you want to add a cached registry key, type y and press Enter. The file is automatically saved to the management PC in the same directory from which the command was issued.

If you need to restore the settings saved in the previous procedure, update the following command with your site management IP address and administrator (admin) password, and then enter the command:

plink -ssh site management IP address -l admin -pw admin password -m version30.txt
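As an illustration only, with a hypothetical site management IP address of 10.10.17.60 (the password and the file names are likewise placeholders, not values from this guide), the filled-in save and restore commands take this general form:

   cd putty
   plink -ssh 10.10.17.60 -l admin -pw <admin password> save_settings > sitexandsitey.txt
   plink -ssh 10.10.17.60 -l admin -pw <admin password> -m <saved settings file>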

Recording Policy Properties and Saving Settings

Before you begin the RA replacement procedure, be sure to record the policy properties and save the Global cluster mode settings.

Perform the following steps for each consistency group to record policy properties and save settings:

1. Select the Policy tab.
2. Write down and save the current Preferred RA setting and Global cluster mode parameter for each consistency group. Use this record to restore these values after you replace the RA.
3. Click OK.
4. Repeat steps 1 through 3 for all the other groups.


Modifying the Preferred RA Setting

For each consistency group, record the Preferred RA and Global cluster mode settings so that they can be restored at the end of this procedure.

Perform the following steps to change all consistency groups that were running on the failed RA to a surviving RA:

1. Select the Policy tab.
2. Change the Preferred RA setting to a surviving RA number for all consistency groups that had the Preferred RA value set to the failed RA. Perform steps 2a through 2f for each group.

   a. If the Global cluster mode parameter is set to one of the following options, skip this step, and continue with step 2d:
      • None
      • Manual (shared quorum)
      • Manual
   b. Change the Global cluster mode parameter to
      • Manual (if using MSCS with shared quorum)
      • Manual (if using MSCS with majority node set)
   c. Click Apply.
   d. Change the Preferred RA setting, and then click Apply.
   e. Change the Global cluster mode parameter to the original setting.
   f. Click Apply.

3. Select the consistency group and click the Status tab to verify that all groups are running on the new RA number. Review the current status of the preferred RA under the components pane.
4. Detach the failed RA. If you can log on to the RA, detach the RA by performing the following steps. Otherwise, continue with "Removing Fibre Channel Adapter Cards."

   a. Use the PuTTY utility to connect to the box management IP address for the RA that is being replaced.
   b. Type boxmgmt when prompted to log in, and then type the appropriate password if it has changed from the default password boxmgmt.
      The Main Menu is displayed.
   c. Type 4 (Cluster operations) and press Enter.
   d. Type 2 (Detach from cluster) to detach the RA from the cluster, and then press Enter.
   e. Type y when prompted to detach and press Enter.
   f. Type B (Back) and press Enter to return to the Main Menu.
   g. Type quit and close the PuTTY window.

Removing Fibre Channel Adapter Cards

Perform the following steps to remove the RA and Fibre Channel host bus adapters (HBAs):

1. Power off the failed RA.
2. Physically disconnect and remove the failed RA from the rack.
3. Physically remove the Fibre Channel HBAs from the failed RA and insert them into the replacement RA.

Note: If you cannot use the cards from the existing RA, refer to "Failure of All SAN Fibre Channel Host Bus Adapters (HBAs)" in Section 8 for information about replacing a failed HBA.

Installing and Configuring the Replacement RA

To install and configure the replacement RA, you must complete several tasks, as follows:

• Complete the procedure in "Cable and Apply Power to the New RA."
• Complete the procedure in "Connecting and Accessing the RA."
• Complete the procedure in "Configuring the RA."
• Complete the procedures in "Verifying the RA Installation."

Cable and Apply Power to the New RA

1. Insert the new RA into the rack and apply power.
2. Insert the Unisys SafeGuard Solutions RA Setup Disk CD-ROM into the CD/DVD drive of the RA. Ensure that this disk is the same version that is running in the other RAs.
3. Power off and then power on the RA.
4. As the RA boots, check the BIOS level as displayed in the Unisys banner and note the level displayed. At the end of the replacement procedure, you can compare the existing RA BIOS level with the new RA BIOS level. The RA BIOS might need to be updated.

Connecting and Accessing the RA

1. Power on the appropriate RA.
2. Connect an Ethernet cable between the management PC used for installation and the WAN Ethernet segment to which the RA is connected. If you connect the management PC directly to the RA, use a crossover cable.
3. Assign the following IP address and subnet mask to the management PC (see the example after this procedure):

   10.77.77.50 (IP address)
   255.255.255.0 (subnet mask)

4. Access the RA by using the SSH client. (See Appendix C.) Use the 10.77.77.77 IP address, which has a subnet mask of 255.255.255.0.
5. Log in with the boxmgmt user name and the boxmgmt password.
6. Provide the following information for the layout of the RA installation:

   a. When prompted about the number of sites in the environment
      • Type 2 to install in a geographic replication environment or a geographic clustered environment.
      • Type 1 to install in a continuous data protection environment.
   b. Type the number of RAs at the site, and press Enter.

   The Main Menu appears.
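The static address in step 3 can also be assigned from a Windows command prompt rather than from the Network Connections dialog. This is a sketch only; the connection name "Local Area Connection" is an assumption and might differ on your management PC:

   netsh interface ip set address name="Local Area Connection" static 10.77.77.50 255.255.255.0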

Checking Storage-to-RA Access

Verify that all LUNs are accessible by using the Main Menu of Installation Manager and performing the following steps. If the LUNs are not accessible, check your switch configuration and zoning.

1. Type 3 (Diagnostics).
2. Type 2 (Fibre Channel diagnostics).
3. Type 4 (Detect Fibre Channel LUNs).
   After a few minutes, a list of detected LUNs appears.
4. Press the spacebar until all expected LUNs appear.
5. Type B (Back).
6. Type B again.
   The Main Menu appears.
7. If you do not see all Fibre Channel LUNs in step 4, correct the environment and repeat steps 1 through 6.

Enabling PCI-X Slot Functionality

If your system is configured with a gigabit (Gb) WAN, which is used for the optical WAN connection, perform the following steps on the Main Menu of the replacement RA:

1. Type 2 (Setup).
2. Type 8 (Advanced option).
3. Type 12 (Enable/disable additional remote interface).
4. Type yes when prompted on whether to enable the additional remote interface.
5. Type B twice to return to the Main Menu.

Configuring the RA

1. On the Main Menu, type 1 (Installation).
2. Type 2 (Get Setup information from an installed RA). Press Enter.
   The Get Settings Wizard menu appears with Get Settings from Installed RA selected.
3. Press Enter.
4. Type 1 (Management interface) to view the settings from the installed RA.
5. Type y when prompted to configure a temporary IP address.
6. Type the IP address.
7. Type the IP subnet mask and then press Enter.
8. Type y or n, depending on your environment, when prompted to configure a gateway.
9. Type the box management IP address of Site 1 RA 1 to import the settings from that RA.
10. Type y to import the settings.
11. Press Enter to continue when a message states that the configuration was successfully imported.
    The Get Settings Wizard menu appears with Apply selected.
12. Perform the following steps to apply the configuration to the RA:
    a. Press Enter to continue.
       The complete list of settings is displayed. These settings are the same as the ones for Site 1 RA 1.
    b. Type y to apply these settings.
    c. Type 1 or 2 when prompted for a site number, depending on the site on which the RA is located.
    d. Type the RA number when prompted.
       A confirmation message appears when the settings are applied successfully.
    e. Press Enter.
       The Get Settings Wizard menu appears with Proceed to the Complete Installation Wizard selected.
    f. Press Enter to continue.
       The Complete Installation Wizard menu appears with Configure repository volume selected.
13. Configure the repository volume by completing the following steps:
    a. Press Enter.
    b. Type 2 (Select a previously formatted repository volume).
    c. Select the number of the repository volume corresponding to the group of displayed volumes, and press Enter.
    d. Press Enter again.
       The Complete Installation Wizard menu appears with Attach to cluster selected.
14. Attach the RA to the RA cluster by completing the following steps:
    a. Press Enter.
    b. Type y at the prompt to attach to the cluster.
       The RA reboots.
    c. Close the PuTTY session if necessary.

Verifying the RA Installation

To verify that the RA is correctly installed, you must

• Verify the WAN bandwidth
• Verify the clock synchronization

Verifying WAN Bandwidth

Use the following procedure to verify the actual versus the expected WAN bandwidth.

Note: Correct any problems and rerun the verification.

1. Open an SSH session to the box management IP address for the replacement RA.
2. Type boxmgmt when prompted to log in, and then type the appropriate password if it has changed from the default password boxmgmt.
   The Main Menu is displayed.
3. Type 3 (Diagnostics) and press Enter.
   The Diagnostics menu appears.
4. Type 1 (IP diagnostics) and press Enter.
   The IP Diagnostics menu appears.
5. Type 4 (Test throughput) and press Enter.
6. Type the WAN IP address of the peer RA; for example, site 2 RA 1 is the peer for site 1 RA 1.
7. Type 2 (WAN interface).
8. At the prompt, type 20 to change the default value for the desired number of concurrent streams.
9. At the prompt for the test duration, type 60 to change the default value.
   A message is displayed that the connection was established.
10. After 60 seconds, make sure that the following information is displayed on the screen. Ignore any TCP window size warnings.
    • IP connection for every stream
    • Interval, Transfer, and Bandwidth for every stream
    • Expected bandwidth in the [SUM] display at the bottom of the screen
11. On the IP Diagnostics menu, type Q (Quit), and then type y.

Verifying Clock Synchronization

The timing of all Unisys SafeGuard 30m activities across all RAs in an installation must be synchronized against a single clock (for example, on the network time protocol [NTP] server). Consequently, you need to synchronize the replacement RA.

For the procedure to verify RA synchronization, see the Unisys SafeGuard Solutions Replication Appliance Installation Guide.

Restoring Group Properties

Perform the following steps on the Management Console for each group that needs to have the Preferred RA setting restored to an RA other than RA 1. (All Preferred RA settings are set to RA 1.)

1. Select the Policy tab for the consistency group.
2. In the General Settings section, change the Preferred RA setting to the original setting, and then click Apply.
3. Change the Global cluster mode under Advanced to the original setting if it was changed earlier.
4. Click Apply.

Ensuring the Existing RA Can Switch Over to the New RA

Once the new RA is part of the configuration, the management console does not display any errors. Shut down any other RA at the site to ensure that the newly replaced RA can successfully complete the switchover. As the existing RA reboots, check the BIOS level as displayed in the Unisys banner and note it.

Compare the BIOS level noted for the existing (rebooting) RA with the BIOS level you noted for the replacement RA. If the BIOS levels do not match, contact the Unisys Support Center to obtain the correct BIOS.


Appendix E
Understanding Events

Event Log

Various events generate entries to the Unisys SafeGuard 30m solution system log. These events are predefined in the system according to topic, level of severity, and scope. The Unisys SafeGuard 30m solution supports proactive notification of an event—either by sending e-mail messages or by generating system log events that are logged by a management application.

The system records log entries in response to a wide range of predefined events. Each event carries an event ID. For manageability, the system divides the events into general and advanced types. In most cases, you can monitor system behavior effectively by viewing the general events only. For troubleshooting a problem, technical support personnel might want to review the advanced log events.

Event Topics

Event topics correspond to the components where the events occur, including

• Management (management console and CLI)
• Site
• RA
• Consistency group
• Splitter

A single event can generate multiple log entries.

Event Levels

The levels of severity for events are defined as follows (in ascending order):

• Info
  These messages are informative in nature, usually referring to changes in the configuration or normal system state.
• Warning
  These messages indicate a warning, usually referring to a transient state or to an abnormal condition that does not degrade system performance.
• Error
  These messages indicate an important event that is likely to disrupt normal system behavior, performance, or both.

Event Scope

A single change in the system—for example, an error over a communications line—can affect a wide range of system components and cause the system to generate a large number of log events. Many of these events contain highly technical information that is intended for use by Unisys service representatives. When all of the events are displayed, you might find it difficult to identify the particular events in which you are interested.

You can use the scope to manage the type and quantity of events that are displayed in the log. An event belongs to one of the following scopes:

• Normal
  Events with a Normal scope result when the system analyzes a wide range of system data to generate a single event that explains the root cause for an entire set of Detailed and Advanced events. Usually, these events are sufficient for effective monitoring of system behavior.
• Detailed
  Events with a Detailed scope include all events for all components that are generated for users and that are not included among the events that have a Normal scope. The display of Detailed events includes Normal events also.
• Advanced
  Events with an Advanced scope contain technical information. In some cases, such as troubleshooting a problem, a Unisys service representative might need to retrieve information from the Advanced log events.


Displaying the Event Log

The event log is displayed either from the Management Console or by using the CLI.

To display event logs, select Logs in the navigation pane; the most recent events in the event log are displayed. For more information about a particular event log, double-click the event log. The Log Event Properties dialog box displays details of the individual event.

You can sort the log events according to any of the columns (that is, level, scope, time, site, ID, and topic) in ascending or descending order.

Perform the following steps to display advanced logs:

1. Click the Filter log toolbar option in the event pane.
   The Filter Log dialog box appears.
2. Change the scope to Advanced.
3. Click OK.

For more information about using the management console, see the Unisys SafeGuard Solutions Replication Appliance Administrator's Guide.

To display the event log from the CLI, run the get_logs command and specify values for each of the parameters. Specify the parameters carefully to avoid displaying unnecessary log information. You can use the terse display parameter to show more or less information for the displayed events as desired.
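The CLI can be reached over SSH in the same way as the save_settings command shown in Appendix D. The following line is a sketch only; the IP address and password are placeholders, and the get_logs parameters are omitted here because they are described in the CLI reference:

   plink -ssh <site management IP address> -l admin -pw <admin password> get_logs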

For information about the CLI, see the Unisys SafeGuard Solutions Replication Appliance Command Line Interface (CLI) Reference Guide.

Using the Event Log for Troubleshooting

The event log provides information that can be useful in determining the cause or nature of problems that might arise during operation.

The "group capabilities" events provide an important tool for understanding the behavior of a consistency group. Each group capabilities event—such as group capabilities OK, group capabilities minor problem, or group capabilities problem—provides a high-level description of a current group situation with regard to each of the RAs and identifies the RA that is currently handling the group.

The information reported for each RA includes the following:

• RA status: Indicates whether an RA is currently a member of the RA cluster (that is, alive) or not a member (that is, dead).
• Marking status: yes or no.
• Transfer status: yes, no, no data loss (that is, flushing), or yes unstable (that is, the RA cannot be initialized if closed or detached).
• Journal capability: yes (that is, distributing, logged access, and so forth), no, or static (that is, access to an image is enabled but access to a different image is not enabled, cannot distribute, and cannot support image access).
• Preferred: yes or no.

In addition, the event log reports the RA on which the group is actually running and the status of the link between the sites.

A group capabilities event is generated whenever there is a change in the capabilities of a group on any RA. The message reports on any limitations to the capabilities of the group and provides reasons for these limitations.

Tracking logged events can explain changes in a group state (for example, the reason replication was paused, the reason the group switched to another RA, and so forth).

The group capabilities events might offer reasons that particular actions are not performed. For example, if you want to know the reason the group transfer was paused, you can check the event log for the "pause replication" action. If, however, you want to know the reason a group transfer did not start, you might check the most recent group capabilities event.

The level of a group capabilities event can be INFO, WARNING, or ERROR, depending on the severity of the reported situation. These levels correspond to the OK, minor problem, and problem bookmarks that follow group capabilities in the message descriptions.

List of Events

The list of events is presented in tabular format with the following given for each event:

• Event ID
• Topic (for example, Management, Site, RA, Splitter, Group)
• Level (for example, Info, Warning, Error)
• Description
• Scope
• Time
• Site


List of Normal Events<br />

Event<br />

ID<br />

Understanding Events<br />

Normal events include both root-cause events (a single description for an event that can<br />

generate multiple events) and other selected basic events. Some Normal events do not<br />

have a topic or trigger. Table E–1 lists Normal events with their descriptions.<br />

Topic<br />

Table E–1. Normal Events<br />

Level<br />

Description<br />

1000 | Management | Info | User logged in. (User) | User log-in action
1001 | Management | Warning | Log in failed. (User) | User failed to log in
1003 | Management | Warning | Failed to generate SNMP trap. (Trap contents) | The system failed to send SNMP trap.
1004 | Management | Warning | Failed to send e-mail alert to specified address. (Address, Event summary) | The system failed to send an e-mail alert.
1005 | Management | Warning | Failed to update file. (File) | The system failed to update the local configuration file (passwords, SSH keys, system log configuration, and SNMP configuration).
1006 | Management | Info | Settings changed. (User, Settings) | The user changed settings.
1007 | Management | Warning | Settings change failed. (User, Settings, Reason) | The system failed to change settings.
1008 | Management | Info | User action succeeded. (User, Action) | The user performed one of these actions: bookmark_image, clear_markers, set_markers, undo_logged_writes, set_num_of_streams.
1009 | Management | Warning | User action failed. (User, Action, Reason) | One of these actions failed: bookmark_image, clear_markers, set_markers, undo_logged_writes, set_num_of_streams.
1011 | Management | Error | Grace period expired. You must install an activation code to activate your license. | Grace period expired
1014 | Management | Info | User bookmarked an image. (Group, Snapshot) | The user bookmarked an image.
1015 | Management | Warning | RA-to-storage multipathing problem. (RA, Volume) | Single path only, or one or more paths between the RA and volume are not available.
1016 | Management | Warning Off | RA-to-storage multipathing problem fixed. (RA, Volume) | All paths between the RA and volume are available.
1017 | Management | Warning | RA-to-splitter multipathing problem. (RA, Splitter) | One or more paths between the RA and the splitter are not available.
1018 | Management | Warning Off | RA-to-splitter multipathing problem fixed. (RA, Splitter) | All paths between the RA and the splitter are available.
1019 | Management | Warning | User action succeeded. (Markers cleared. Group) (Replication set attached as clean. Group) | User cleared markers or attached replication set as clean.
3001 | RA | Warning | RA is no longer a cluster member. (RA) | An RA is disconnected from site control.
3005 | RA | Error | Settings conflict between sites. (Reason) | A settings conflict between the sites was discovered.
3006 | RA | Error Off | Settings conflict between sites resolved by user. (Using Site <...> settings) | A settings conflict between the sites was resolved by the user.
3030 | RA | Warning | RA switched path to storage. (RA, Volume) | A storage path change was initiated by the RA.
4056 | Group | Warning | No image was found in the journal to match the query. (Group) | No image was found in the journal to match the query.
4090 | Group | Warning | Target-side log is 90 percent full. When log is full, writing by hosts at target side is disabled. (Group) | The target-side log is 90 percent full.
4106 | Group | Warning | Capacity reached; cannot write additional markers for this group to <...>. Starting full sweep. (Group) | The disk space for the markers was filled for the group.
4117 | Group | Warning | Virtual access buffer is 90 percent full. When the buffer is full, writing by hosts at the target side is disabled. (Group) | The usage of the virtual access buffer has reached 90 percent.
5008 | Splitter | Warning | Host shut down. (Host, Splitter) | The host was shut down or restarted.
5010 | Splitter | Warning | Splitter stopped; depending on policy, writing by host might be disabled for some groups, and a full sweep might be required for other groups. (Splitter) | The user stopped the splitter after removing volumes; volumes are disconnected.
5011 | Splitter | Warning | Splitter stopped; full sweep is required. (Splitter) | The user stopped the splitter after removing volumes; volumes are disconnected.
5012 | Splitter | Warning | The splitter stopped; write operations to replication volumes are disabled. (Splitter) | The splitter stopped; host access to all volumes is disabled.
10000 | — | Info | Changes are occurring in the system. Analysis in progress. | —
10001 | — | Info | System changes have occurred. The system is now stable. | —
10002 | — | Info | The system activity has not stabilized—issuing an intermediate report. | —
10101 | — | Error | The cause of the system activity is unclear. To obtain more information, filter the events log using the Detailed scope. | —
10102 | — | Info | Site control recorded internal changes that do not affect system operation. | —
10202 | — | Info | Settings have changed. | —
10203 | — | Info | The RA cluster is down. | —
10204 | — | Error | One or more RAs are disconnected from the RA cluster. | —
10205 | — | Error | A communications problem occurred in an internal process. | —
10206 | — | Info | An internal process was restarted. | —
10207 | — | Error | An internal process was restarted. | —
10210 | — | Error | Initialization is experiencing high-load conditions. | —
10211 | — | Error | A temporary problem occurred in the Fibre Channel link between the splitters and the RAs. | —
10212 | — | Error Off | The temporary problem that occurred in the Fibre Channel link between the splitters and the RAs is resolved. | —
10501 | — | Info | Synchronization completed. | —
10502 | — | Info | Access to the target-side image is enabled. | —
10503 | — | Error | The system is transferring the latest snapshot before pausing transfer (no data loss). | —
10504 | — | Info | The journal was cleared. | —
10505 | — | Info | The system completed undoing writes to the target-side log. | —
10506 | — | Info | The roll to the physical images is complete. Logged access to the physical image is now available. | —
10507 | — | Info | Because of system changes, the journal was temporarily out of service. The journal is now available. | —
10508 | — | Info | All data were flushed from the local-side RA; automatic failover proceeds. | —
10509 | — | Info | The initial long resynchronization has completed. | —
10510 | — | Info | Following a paused transfer, the system is now cleared to restart transfer. | —
10511 | — | Info | The system finished recovering the replication backlog. | —
12001 | — | Error | The splitter is down. | —
12002 | — | Error | An error occurred in all WAN links to the other site. The other site is possibly down. | —
12003 | — | Error | An error occurred in the WAN link to the RA at the other site. | —
12004 | — | Error | An error occurred in the data link over the WAN. All RAs are unable to transfer replicated data to the other site. | —
12005 | — | Error | An error occurred in the data link over the WAN. The RA is unable to transfer replicated data to the other site. | —
12006 | — | Error | The RA is disconnected from the RA cluster. | —
12007 | — | Error | All RAs are disconnected from the RA cluster. | —
12008 | — | Error | The RA is down. | —
12009 | — | Error | The group entered high load. | —
12010 | — | Error | A journal error occurred. Full sweep is to be performed after the error is corrected. | —
12011 | — | Error | The target-side log or virtual buffer is full. Writing by hosts at the target side is disabled. | —
12012 | — | Error | The system cannot enable virtual access to the image. | —
12013 | — | Error | The system cannot enable access to a specified image. | —
12014 | — | Error | The Fibre Channel link between all RAs and all splitters and storage is down. | —
12016 | — | Error | The Fibre Channel link between all RAs and all storage is down. | —
12022 | — | Error | The Fibre Channel link between the RA and splitters or storage volumes (or both) is down. | —
12023 | — | Error | The Fibre Channel link between the RA and all splitters and storage is down. | —
12024 | — | Error | The Fibre Channel link between the RA and all splitters is down. | —
12025 | — | Error | The Fibre Channel link between the RA and all storage is down. | —
12026 | — | Error | An error occurred in the WAN link to the RA at the other site. | —
12027 | — | Error | All replication volumes attached to the consistency group (or groups) are not accessible. | —
12029 | — | Error | The Fibre Channel link between all RAs and one or more volumes is down. | —
12033 | — | Error | The repository volume is not accessible; data might be lost. | —
12034 | — | Error | Writes to storage occurred without corresponding writes to the RA. | —
12035 | — | Error | An error occurred in the WAN link to the RA cluster at the other site. | —
12036 | — | Error | A renegotiation of the transfer protocol is requested. | —
12037 | — | Error | All volumes attached to the consistency group (or groups) are not accessible. | —
12038 | — | Error | All journal volumes attached to the consistency group (or groups) are not accessible. | —
12039 | — | Error | A long resynchronization started. | —
12040 | — | Error | The system detected bad sectors in a volume. | —
12041 | — | Error | The splitter is up. | —
12042 | — | Error | All WAN links to the other site are restored. | —
12043 | — | Error | The WAN link to the RA at the other site is restored. | —
12044 | — | Error | Problem with IP link between RAs (in at least one direction). | —
12045 | — | Error | Problem with all IP links between RAs. | —
12046 | — | Error | Problem with IP links between RAs. | —
12047 | — | Error | RA network interface card (NIC) problem. | —
14001 | — | Error Off | The splitter is up. | —
14002 | — | Error Off | All WAN links to the other site are restored. | —
14003 | — | Error Off | The WAN link to the RA at the other site is restored. | —
14004 | — | Error Off | The data link over the WAN is restored. All RAs can transfer replicated data to the other site. | —
14005 | — | Error Off | The data link over the WAN is restored. The RA can transfer replicated data to the other site. | —
14006 | — | Error Off | The connection of the RA to the RA cluster is restored. | —
14007 | — | Error Off | The connection of all RAs to the RA cluster is restored. | —
14008 | — | Error Off | The RA is up. | —
14009 | — | Error Off | The group exited high load. The initialization completed. | —
14010 | — | Error Off | The journal error was corrected. A full sweep operation is required. | —
14011 | — | Error Off | The target-side log or virtual buffer is no longer full. | —
14012 | — | Error Off | Virtual access to an image is enabled. | —
14013 | — | Error Off | The system is no longer trying to access a diluted image. | —
14014 | — | Error Off | The Fibre Channel link between all RAs and all splitters and storage is restored. | —
14016 | — | Error Off | The Fibre Channel link between all RAs and all storage is restored. | —
14022 | — | Error Off | The Fibre Channel link that was down between the RA and splitters or storage volumes (or both) is restored. | —
14023 | — | Error Off | The Fibre Channel link between the RA and all splitters and storage is restored. | —
14024 | — | Error Off | The Fibre Channel link between the RA and all splitters is restored. | —
14025 | — | Error Off | The Fibre Channel link between the RA and all storage is restored. | —
14026 | — | Error Off | The WAN link to the RA at the other site is restored. | —
14027 | — | Error Off | Access to all volumes attached to the consistency group (or groups) is restored. | —
14029 | — | Error Off | The Fibre Channel link between all RAs and one or more volumes is restored. | —
14033 | — | Error Off | Access to the repository volume is restored. | —
14034 | — | Error Off | Replication consistency in writes to storage is restored. | —
14035 | — | Error Off | The WAN link to the RA at the other site is restored. | —
14036 | — | Error Off | The renegotiation of the transfer protocol is complete. | —
14037 | — | Error Off | Access to all replication volumes attached to the consistency group (or groups) is restored. | —
14038 | — | Error Off | Access to all journal volumes attached to the consistency group (or groups) is restored. | —
14039 | — | Info | The long resynchronization has completed. | —
14040 | — | Error Off | The system detected a correction of bad sectors in the volume. | —
14041 | — | Error Off | The system detected that the volume is no longer read-only. | —
14042 | — | Error Off | A synchronization is in progress to restore any failed writes in the group. | —
14043 | — | Error Off | A synchronization is in progress to restore any failed writes. | —
14044 | — | Error Off | Problem with IP link between RAs (in at least one direction) corrected. | —
14045 | — | Error Off | All IP links between RAs restored. | —
14046 | — | Error Off | IP link between RAs restored. | —
14047 | — | Error Off | RA network interface card (NIC) problem corrected. | —
16000 | — | Error | Transient root cause. | —
16001 | — | Error | The splitter was down. The problem is corrected. | —
16002 | — | Error | An error occurred in all WAN links to the other site. The problem is corrected. | —
16003 | — | Error | An error occurred in the WAN link to the RA at the other site. The problem is corrected. | —
16004 | — | Error | An error occurred in the data link over the WAN. All RAs were unable to transfer replicated data to the other site. The problem is corrected. | —
16005 | — | Error | An error occurred in the data link over the WAN. The RA was unable to transfer replicated data to the other site. The problem is corrected. | —
16006 | — | Error | The RA was disconnected from the RA cluster. The connection is restored. | —
16007 | — | Error | All RAs were disconnected from the RA cluster. The problem is corrected. | —
16008 | — | Error | The RA was down. The problem is corrected. | —
16009 | — | Error | The group entered high load. The problem is corrected. | —
16010 | — | Error | A journal error occurred. The problem is corrected. A full sweep is required. | —
16011 | — | Error | The target-side log or virtual buffer was full. Writing by the hosts at the target side was disabled. The problem is corrected. | —
16012 | — | Error | The system could not enable virtual access to the image. The problem is corrected. | —
16013 | — | Error | The system could not enable access to the specified image. The problem is corrected. | —
16014 | — | Error | The Fibre Channel link between all RAs and all splitters and storage was down. The problem is corrected. | —
16016 | — | Error | The Fibre Channel link between all RAs and all storage was down. The problem is corrected. | —
16022 | — | Error | The Fibre Channel link between the RA and splitters or storage volumes (or both) was down. The problem is corrected. | —
16023 | — | Error | The Fibre Channel link between the RA and all splitters and storage was down. The problem is corrected. | —
16024 | — | Error | The Fibre Channel link between the RA and all splitters was down. The problem is corrected. | —
16025 | — | Error | The Fibre Channel link between the RA and all storage was down. The problem is corrected. | —
16026 | — | Error | An error occurred in the WAN link to the RA at the other site. The problem is corrected. | —
16027 | — | Error | All volumes attached to the consistency group (or groups) were not accessible. The problem is corrected. | —
16029 | — | Error | The Fibre Channel link between all RAs and one or more volumes was down. The problem is corrected. | —
16033 | — | Error | The repository volume was not accessible. The problem is corrected. | —
16034 | — | Error Off | Writes to storage occurred without corresponding writes to the RA. The problem is corrected. | —
16035 | — | Error | An error occurred in the WAN link to the RA at the other site. The problem is corrected. | —
16036 | — | Error | The renegotiation of the transfer protocol was requested and has been completed. | —
16037 | — | Error | All replication volumes attached to the consistency group (or groups) were not accessible. The problem is corrected. | —
16038 | — | Error | All journal volumes attached to the consistency group (or groups) were not accessible. The problem is corrected. | —
16039 | — | Info | The system ran a long resynchronization. | —
16040 | — | Error | The system detected bad sectors in the volume. The problem is corrected. | —
16041 | — | Error | The system detected that the volume was read-only. The problem is corrected. | —
16042 | — | Error | The splitter write operation might have failed while the group was transferring data. | —
16043 | — | Error | The splitter write operations might have failed. | —
16044 | — | Error | There was a problem with an IP link between RAs (in at least one direction). | —
16045 | — | Error | There was a problem with all IP links between RAs. The problem has been corrected. | —
16046 | — | Error | There was a problem with an IP link between RAs. The problem has been corrected. | —
16047 | — | Error | There was an RA network interface card (NIC) problem. The problem has been corrected. | —
18001 | — | Error Off | The splitter was temporarily up but is down again. | —
18002 | — | Error Off | All WAN links to the other site were temporarily restored, but the problem has returned. | —
18003 | — | Error Off | The WAN link to the RA at the other site was temporarily restored, but the problem has returned. | —
18004 | — | Error Off | The data link over the WAN was temporarily restored, but the problem has returned. All RAs are unable to transfer replicated data to the other site. | —
18005 | — | Error Off | The data link over the WAN was temporarily restored, but the problem has returned. The RA is currently unable to transfer replicated data to the other site. | —
18006 | — | Error Off | The connection of the RA to the RA cluster was temporarily restored, but the problem has returned. | —
18007 | — | Error Off | All RAs were temporarily restored to the RA cluster, but the problem has returned. | —
18008 | — | Error Off | The RA was temporarily up, but is down again. | —
18009 | — | Error Off | The group temporarily exited high load, but the problem has returned. | —
18010 | — | Error Off | The journal error was temporarily corrected, but the problem has returned. | —
18011 | — | Error Off | The target-side log or virtual buffer was temporarily no longer full, and write operations by the hosts at the target side were re-enabled. However, the problem has returned. | —
18012 | — | Error Off | Virtual access to the image was temporarily enabled, but the problem has returned. | —
18013 | — | Error Off | Access to an image was temporarily enabled, but the problem has returned. | —
18014 | — | Error Off | The Fibre Channel link between all RAs and all splitters and storage was temporarily restored, but the problem has returned. | —
18016 | — | Error Off | The Fibre Channel link between all splitters and all storage was temporarily restored, but the problem has returned. | —
18022 | — | Error Off | The Fibre Channel link that was down between the RA and splitters or storage volumes (or both) was temporarily restored, but the problem has returned. | —
18023 | — | Error Off | The Fibre Channel link between the RA and all storage was temporarily restored, but the problem has returned. | —
18024 | — | Error Off | The Fibre Channel link between the RA and all splitters was temporarily restored, but the problem has returned. | —
18025 | — | Error Off | The Fibre Channel link between the RA and all storage was temporarily restored, but the problem has returned. | —
18026 | — | Error | The WAN link to the RA at the other site was temporarily restored, but the problem has returned. | —
18027 | — | Error Off | Access to all journal volumes attached to the consistency group (or groups) was temporarily restored, but the problem has returned. | —
18029 | — | Error Off | The Fibre Channel link between all RAs and one or more volumes was temporarily restored, but the problem has returned. | —
18033 | — | Error Off | Access to the repository volume was temporarily restored, but the problem has returned. | —
18034 | — | Error Off | Replication consistency in write operations to storage and to RAs was temporarily restored, but the problem has returned. | —
18035 | — | Error Off | The WAN link to the RA at the other site was temporarily restored, but the problem has returned. | —
18036 | — | Error Off | The negotiation of the transfer protocol was completed but is again requested. | —
18037 | — | Error Off | Access to all volumes attached to the consistency group (or groups) was temporarily restored, but the problem has returned. | —
18038 | — | Error Off | Access to all replication volumes attached to the consistency group (or groups) was temporarily restored, but the problem has returned. | —
18039 | — | Info | The long resynchronization completed but has now restarted. | —
18040 | — | Error Off | The user marked the volume as OK, but the bad-sectors problem persists. | —
18041 | — | Error Off | The user marked the volume as OK, but the read-only problem persists. | —
18042 | — | Error Off | The synchronization restored any failed write operations in the group, but the problem has returned. | —
18043 | — | Error Off | An internal problem has occurred. | —
18044 | — | Error Off | Problem with IP link between RAs (in at least one direction) was corrected, but problem has returned. | —
18045 | — | Error Off | Problem with all IP links between RAs (in at least one direction) was corrected, but problem has returned. | —
18046 | — | Error Off | Problem with IP link between RAs was corrected, but problem has returned. | —
18047 | — | Error Off | RA network interface card (NIC) problem was corrected, but problem has returned. | —

List of Detailed Events

Detailed events are all component-level events that are generated for use by users and do not have a Normal scope. Table E–2 lists these events and their descriptions.


Table E–2. Detailed Events

Event ID | Topic | Level | Description | Trigger
1002 | Management | Info | User logged out. (User) | The user logged out of the system.
1010 | Management | Warning | Grace period expires in 1 day. You must install an activation code to activate your Unisys SafeGuard solution license. | The grace period expires in 1 day.
1012 | Management | Warning | License expires in 1 day. You must obtain a new Unisys SafeGuard 30m solution license. | The Unisys SafeGuard 30m solution license expires in 1 day.
1013 | Management | Error | License expired. You must obtain a new Unisys SafeGuard 30m solution license. | The Unisys SafeGuard 30m solution license expired.
2000 | Site | Info | Site management running on <...>. | Site control is open; the RA has become the cluster leader.
3000 | RA | Info | RA has become a cluster member. (RA) | The RA is connected to site control.
3002 | RA | Warning | Site management switched over to this RA. (RA, Reason) | Leadership is transferred from an RA to another RA.
3007 | RA | Warning Off | RA is up. (RA) | The RA that was previously down came up.
3008 | RA | Warning | RA appears to be down. (RA) | An RA suspects that the other RA is down.
3011 | RA | Info | RA access to a volume or volumes restored. (RA, Volume, Volume Type) | Volumes that were inaccessible became accessible.
3012 | RA | Warning | RA unable to access a volume or volumes. (RA, Volume, Volume Type) | Volumes ceased to be accessible to the RA.
3013 | RA | Warning Off | RA access to <...> restored. (RA, Volume) | The repository volume that was inaccessible became accessible.
3014 | RA | Warning | RA unable to access <...>. (RA, Volume) | The repository volume became inaccessible to a single RA.
3020 | RA | Warning Off | WAN connection to an RA at other site is restored. (RA at other site) | The RA regained the WAN connection to an RA at the other site.
3021 | RA | Warning | Error in WAN connection to an RA at other site. (RA at other site) | The RA lost the WAN connection to an RA at the other site.
3022 | RA | Warning Off | LAN connection to RA restored. (RA) | The RA regained the LAN connection to an RA at the local site.
3023 | RA | Warning | Error in LAN connection to an RA. (RA) | The RA lost the LAN connection to an RA at the local site, without losing the connection through the repository volume.
4000 | Group | Info | Group capabilities OK. (Group) | Capabilities are full and previous capabilities are unknown.
4001 | Group | Warning | Group capabilities minor problem. (Group) | Capabilities are either temporarily not full on the RA on which the group is currently running, or indefinitely not full on the RA on which the group is not running.
4003 | Group | Error | Group capabilities problem. (Group) | Capabilities are not full indefinitely on the RA on which the group is running.
4007 | Group | Info | Pausing data transfer. (Group, Reason) | The user stopped the transfer.
4008 | Group | Warning | Pausing data transfer. (Group, Reason) | The system temporarily stopped the transfer.
4009 | Group | Error | Pausing data transfer. (Group, Reason) | The system stopped the transfer indefinitely.
4010 | Group | Info | Starting data transfer. (Group) | The user requested a start transfer.
4015 | Group | Info | Transferring latest snapshot before pausing transfer (no data loss). (Group) | In a total storage disaster, the system flushed the buffer before stopping replication.
4016 | Group | Warning | Transferring latest snapshot before pausing transfer (no data loss). (Group) | In a total storage disaster, the system flushed the buffer before stopping replication.
4017 | Group | Error | Transferring latest snapshot before pausing transfer (no data loss). (Group) | In a total storage disaster, the system flushed the buffer before stopping replication.
4018 | Group | Warning | Transfer of latest snapshot from source is complete (no data loss). (Group) | In a total storage disaster, the last snapshot from the source site is available at the target site.
4019 | Group | Warning | Group in high load; transfer is to be paused temporarily. (Group) | The disk manager has a high load.
4020 | Group | Warning Off | Group is no longer in high load. (Group) | The disk manager no longer has a high load.
4021 | Group | Error | Journal full—initialization paused. To complete initialization, enlarge the journal or allow long resynchronization. (Group) | In initialization, the journal is full and a long resynchronization is not allowed.
4022 | Group | Error Off | Initialization resumed. (Group) | End of an initialization situation in which the journal is full and a long resynchronization was not allowed.
4023 | Group | Error | Journal full—transfer paused. To restart the transfer, first disable access to image. (Group) | Access to the image is enabled and the journal is full.
4024 | Group | Error Off | Transfer restarted. (Group) | End of a situation in which access to the image is enabled and the journal is full.
4025 | Group | Warning | Group in high load—initialization to be restarted. (Group) | The group has a high load; initialization is to be restarted.
4026 | Group | Warning Off | Group no longer in high load. (Group) | The group no longer has a high load.
4027 | Group | Error | Group in high load—the journal is full. The roll to physical image is paused, and transfer is paused. (Group) | No space remains to which to write during roll.
4028 | Group | Error Off | Group no longer in high load. (Group) | Journal capacity was added, or image access was disabled.
4040 | Group | Error | Journal error—full sweep to be performed. (Group) | A journal volume error occurred.
4041 | Group | Info | Group activated. (Group, RA) | The group is replication-ready; that is, replication could take place if other factors are acceptable, such as RAs, network, and storage access.
4042 | Group | Info | Group deactivated. (Group, RA) | A user action deactivated the group.
4043 | Group | Warning | Group deactivated. (Group, RA) | The system temporarily deactivated the group.
4044 | Group | Error | Group deactivated. (Group, RA) | The system deactivated the group indefinitely.
4051 | Group | Info | Disabling access to image—resuming distribution. (Group) | The user disabled access to an image (that is, distribution is resumed).
4054 | Group | Error | Enabling access to image. (Group) | The system enabled access to an image indefinitely.
4057 | Group | Warning | Specified image was removed from the journal. Try a later image. (Group) | The specified image was removed from the journal (that is, FIFO).
4062 | Group | Info | Access enabled to latest image. (Group, Failover site) | Access was enabled to the latest image during automatic failover.
4063 | Group | Warning | Access enabled to latest image. (Group, Failover site) | Access was enabled to the latest image during automatic failover.
4064 | Group | Error | Access enabled to latest image. (Group, Failover site) | Access was enabled to the latest image during automatic failover.
4080 | Group | Warning | Current lag exceeds maximum lag. (Group, Lag, Maximum lag) | The group lag exceeds the maximum lag (when not regulating an application).
4081 | Group | Warning Off | Current lag within policy. (Group, Lag, Maximum lag) | The group lag drops from above the maximum lag to below 90 percent of the maximum.
4082 | Group | Warning | Starting full sweep. (Group) | Group markers were set.
4083 | Group | Warning | Starting volume sweep. (Group, Pair) | Volume markers were set.
4084 | Group | Info | Markers cleared. (Group) | Group markers were cleared.
4085 | Group | Warning | Unable to clear markers. (Group) | An attempt to clear the group markers failed.
4086 | Group | Info | Initialization started. (Group) | Initialization started.
4087 | Group | Info | Initialization completed. (Group) | Initialization completed.
4091 | Group | Error | Target-side log is full; write operations by the hosts at the target side are disabled. (Group, Site) | The target-side log is full.
4095 | Group | Info | Writing target-side log to storage; writes to log cannot be undone. (Group) | Started marking to retain write operations in the target-side log.
4097 | Group | Warning | Maximum journal lag exceeded. Distribution in fast-forward—older images removed from journal. (Group) | Fast-forward action started (causing a loss of snapshots taken before, as the maximum journal lag was exceeded).
4098 | Group | Warning Off | Maximum journal lag within limit. Distribution normal—rollback information retained. (Group) | Five minutes have passed since the fast-forward action stopped.
4099 | Group | Info | Initializing in long resynchronization mode. (Group) | The system started a long resynchronization.
4110 | Group | Info | Enabling virtual access to image. (Group) | The user initiated enabling virtual access to an image.
4111 | Group | Info | Virtual access to image enabled. (Group) | The user enabled virtual access to an image.
4112 | Group | Info | Rolling to physical image. (Group) | Rolling to the image (in the background) while virtual access to the image is enabled.
4113 | Group | Info | Roll to physical image stopped. (Group) | Rolling to the image (in the background, while virtual access to the image is enabled) is stopped.
4114 | Group | Info | Roll to physical image complete—logged access to physical image is now enabled. (Group) | The system completed the roll to the physical image.
4115 | Group | Error | Unable to enable access to virtual image because of partition table error. (The partition table on at least one of the volumes in group <...> has been modified since logged access was last enabled to a physical image. To enable access to a virtual image, first enable logged access to a physical image.) | An attempt to pause on a virtual image is unsuccessful because of a change in the partition table of a volume or volumes in the group.
4116 | Group | Error | Virtual access buffer is full—writing by hosts at the target side is disabled. (Group) | An attempt to write to the virtual image is unsuccessful because the virtual access buffer usage is 100 percent.
4118 | Group | Error | Cannot enable virtual access to an image. (Group) | An attempt to enable virtual access to the image is unsuccessful because of insufficient memory.
4119 | Group | Error | Initiator issued an out-of-bounds I/O operation. Contact technical support. (Initiator, Group, Volume) | A configuration problem exists.
4120 | Group | Warning | Journal usage (with logged access enabled) now exceeds this threshold. (Group, <...>) | Journal usage (with logged access enabled) has passed a specified threshold.
4121 | Group | Error | Unable to gain permissions to write to replica. | RAs unable to write to replication or journal volumes because they do not have proper permissions.
4122 | Group | — | Trying to regain permissions to write to replica. | User has indicated that the permissions problem has been corrected.
4123 | Group | Error | Unable to access volumes – bad sectors encountered. | RAs unable to write to replication or journal volumes due to bad sectors on the storage.
4124 | Group | Error Off | Trying to access volumes that previously had bad sectors. | User has indicated that the bad sectors problem has been corrected.
5000 | Splitter | Info | Splitter or splitters are attached to a volume. (Splitter, Volume) | The user attached a splitter to a volume.
5001 | Splitter | Info | Splitter or splitters are detached from a volume. (Splitter, Volume) | The user detached a splitter from a volume.
5002 | Splitter | Error | RA is unable to access splitter. (Splitter, RA) | The RA is unable to access a splitter.
5003 | Splitter | Error Off | RA access to splitter is restored. (Splitter, RA) | The RA can access a splitter that was previously inaccessible.
5004 | Splitter | Error | Splitter is unable to access a replication volume or volumes. (Splitter, Volume) | The splitter cannot access a volume.
5005 | Splitter | Error Off | Splitter access to replication volume or volumes is restored. (Splitter, Volume) | The splitter can access a volume that was previously inaccessible.
5006 | OBSOLETE | — | — | —
5007 | OBSOLETE | — | — | —
5013 | Splitter | Error | Splitter is down. (Splitter) | Connection to the splitter was lost with no warning; splitter crashed or the connection is down.
5015 | Splitter | Error Off | Splitter is up. (Splitter) | Connection to the splitter was regained after a splitter crash.
5016 | Splitter | Warning | Splitter has restarted. (Splitter) | The boot timestamp of the splitter has changed.
5030 | Splitter | Error | Splitter write failed. (Splitter, Group) | The splitter write operation to the RA was successful; the write operation to the storage device was not successful.
5031 | Splitter | Warning | Splitter is not splitting to replication volumes; volume sweeps are required. (Host, Volumes, Groups) | The splitter is not splitting to the replication volumes.
5032 | Splitter | Info | Splitter is splitting to replication volumes. (Host, Volumes, Groups) | The splitter started splitting to the replication volumes.
5035 | Splitter | Info | Writes to replication volumes are disabled. (Splitter, Volumes, Groups) | Write operations to the replication volumes are disabled.
5036 | Splitter | Warning | Writes to replication volumes are disabled. (Host, Volumes, Groups) | Write operations to the replication volumes are disabled.
5037 | Splitter | Error | Writes to replication volumes are disabled. (Splitter, Volumes, Groups) | Write operations to the replication volumes are disabled.
5038 | Splitter | Info | Splitter delaying writes. (Splitter, Volumes, Groups) | —
5039 | Splitter | Warning | Splitter delaying writes. (Splitter, Volumes, Groups) | —
5040 | Splitter | Error | Splitter delaying writes. (Splitter, Volumes, Groups) | —
5041 | Splitter | Info | Splitter is not splitting to replication volumes. (Splitter, Volumes, Groups) | The splitter is not splitting to the replication volumes because of a user decision.
5042 | Splitter | Warning | Splitter is not splitting to replication volumes. (Splitter, Volumes, Groups) | The splitter is not splitting to the replication volumes.
5043 | Splitter | Error | Splitter not splitting to replication volumes. (Splitter, Volumes, Groups) | The splitter is not splitting to the replication volumes because of a system action.
5045 | Splitter | Warning | Simultaneous problems reported in splitter and RA. Full-sweep resynchronization is required after restarting data transfer. | The marking backlog on the splitter was lost as a result of concurrent disasters to the splitter and the RA.
5046 | Splitter | Warning | Transient error—reissuing splitter write. | —


Appendix F
Configuring and Using SNMP Traps

The RA in the Unisys SafeGuard 30m solution is SNMP capable; that is, the solution supports monitoring and problem notification using the standard Simple Network Management Protocol (SNMP), including support for SNMPv3. The solution supports various SNMP queries to the agent and can be configured so that events generate SNMP traps, which are sent to designated servers.

Software Monitoring

To configure SNMP traps for monitoring, see the Unisys SafeGuard 30m Solution Planning and Installation Guide.

You cannot query the RA software management information base (MIB). You can query MIB-II; the RA SNMP agent includes MIB-II support. Also see "Hardware Monitoring." For more information on MIB-II, see the document at http://www.faqs.org/rfcs/rfc1213.html

All of the management console log events listed in Appendix E can generate SNMP traps, depending on the severity threshold in the trap configuration.

The Unisys MIB OID is 1.3.6.1.4.1.21658.

The trap identifiers for Unisys traps are as follows:

1: Info
2: Warning
3: Error


The Unisys trap variables and their possible values are defined in Table F–1.

Table F–1. Trap Variables and Values

Variable | OID | Description | Value
dateAndTime | 3.1.1.1 | Date and time that the trap was sent | —
eventID | 3.1.1.2 | Unique event identifier (see values in "List of Events" in Appendix E) | —
siteName | 3.1.1.3 | Name of site where event occurred | —
eventLevel | 3.1.1.4 | See values | 1: info; 2: warning; 3: warning off; 4: error; 5: error off
eventTopic | 3.1.1.5 | See values | 1: site; 2: K-Box; 3: group; 4: splitter; 5: management
hostName | 3.1.1.6 | Name of host | —
kboxName | 3.1.1.7 | Name of RA | —
volumeName | 3.1.1.8 | Name of volume | —
groupName | 3.1.1.9 | Name of group | —
eventSummary | 3.1.1.10 | Short description of event | —
eventDescription | 3.1.1.11 | More detailed description of event | —
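The variables in Table F–1 are most useful when you decode received traps programmatically. The sketch below is illustrative only: it assumes the listed OIDs are suffixes under the Unisys enterprise OID 1.3.6.1.4.1.21658 (verify this against the MIB file described in "Installing MIB Files on an SNMP Browser"), and it works on whatever (OID, value) pairs your SNMP receiver hands you. The eventLevel and eventTopic maps come straight from the Value column above.

```python
# Illustrative sketch only: assumes the Table F-1 OIDs sit under the Unisys
# enterprise OID; confirm the exact layout against the shipped MIB file.
UNISYS_ENTERPRISE_OID = "1.3.6.1.4.1.21658"

TRAP_VARIABLES = {
    "3.1.1.1": "dateAndTime",
    "3.1.1.2": "eventID",
    "3.1.1.3": "siteName",
    "3.1.1.4": "eventLevel",
    "3.1.1.5": "eventTopic",
    "3.1.1.6": "hostName",
    "3.1.1.7": "kboxName",
    "3.1.1.8": "volumeName",
    "3.1.1.9": "groupName",
    "3.1.1.10": "eventSummary",
    "3.1.1.11": "eventDescription",
}

EVENT_LEVELS = {1: "info", 2: "warning", 3: "warning off", 4: "error", 5: "error off"}
EVENT_TOPICS = {1: "site", 2: "K-Box", 3: "group", 4: "splitter", 5: "management"}


def decode_trap(varbinds):
    """Map (oid, value) pairs from a received trap to named fields.

    `varbinds` is whatever your SNMP receiver provides, for example
    [("1.3.6.1.4.1.21658.3.1.1.2", "4001"), ...].
    """
    decoded = {}
    for oid, value in varbinds:
        # Strip the enterprise prefix so the suffix matches Table F-1.
        suffix = oid[len(UNISYS_ENTERPRISE_OID) + 1:] if oid.startswith(UNISYS_ENTERPRISE_OID) else oid
        name = TRAP_VARIABLES.get(suffix)
        if name is None:
            continue  # not a Unisys trap variable
        if name in ("eventLevel", "eventTopic"):
            mapping = EVENT_LEVELS if name == "eventLevel" else EVENT_TOPICS
            try:
                value = mapping.get(int(value), value)
            except (TypeError, ValueError):
                pass  # leave non-numeric values untouched
        decoded[name] = value
    return decoded
```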


SNMP Monitoring and Trap Configuration

To configure SNMP traps, see the Unisys SafeGuard Solutions Planning and Installation Guide.

On the management console, use the SNMP Settings menu (in the System menu) to manage the SNMP capabilities. Through that menu, you can enable and disable the agent or the SNMP traps feature, modify the configuration for SNMP traps, and add or remove SNMP users.

In addition, the RA provides several CLI commands for SNMP, as follows:

• The enable_snmp command to enable the SNMP agent
• The disable_snmp command to disable the SNMP agent
• The set_snmp_community command to define a community of users (for SNMPv1)
• The add_snmp_user command to add SNMP users (for SNMPv3)
• The remove_snmp_user command to remove SNMP users (for SNMPv3)
• The get_snmp_settings command to display whether the agent is currently set to be enabled, the current configuration for SNMP traps, and the list of registered SNMP users
• The config_snmp_traps command to configure the SNMP traps feature so that events generate traps. Before you enable the feature, you must designate the IP address or DNS name of a host at one or more sites to receive the SNMP traps.
  Note: You can designate a DNS name for a host only in installations for which a DNS has been configured.
• The test_snmp_trap command to send a test SNMP trap

When the SNMP agent is enabled, SNMP users can submit queries to retrieve various types of information about the RA.

You can also designate the minimum severity for which an event should generate an SNMP trap (info, warning, or error, in order from less severe to more severe, with error as the initial default). Once the SNMP traps feature is enabled, the system sends an SNMP trap to the designated host whenever an event of sufficient severity occurs.
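As an illustration of how that minimum-severity rule behaves, the following sketch models it in Python. This is not product code (the filtering happens on the RA itself); it only makes the ordering info, warning, error, and the default of error, concrete.

```python
# Illustrative model of the trap severity threshold described above.
# The real decision is made by the RA's SNMP traps feature, not by user code.
SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2}


def should_send_trap(event_level: str, minimum_severity: str = "error") -> bool:
    """Return True if an event of event_level meets the configured minimum."""
    return SEVERITY_ORDER[event_level] >= SEVERITY_ORDER[minimum_severity]


# With the initial default of "error", only error events generate traps.
assert should_send_trap("error") is True
assert should_send_trap("warning") is False
# Lowering the minimum to "warning" also traps warning events.
assert should_send_trap("warning", minimum_severity="warning") is True
```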

Installing MIB Files on an SNMP Browser

Install the RA MIB file (\MIBS\mib.txt on the Unisys SafeGuard Solutions Splitter Install Disk CD-ROM) on an SNMP browser. Follow the instructions for your browser to load the MIB file.


Resolving SNMP Issues

For SNMP issues, first determine whether the issue is an SNMP trap issue or an SNMP monitoring issue by performing the procedure for verifying SNMP traps in the Unisys SafeGuard Solutions Planning and Installation Guide.

If you do not receive traps, perform the steps in "Monitoring Issues" and then in "Trap Issues."

Monitoring Issues

1. Ping the RA management IP address from the management server that has the SNMP browser.
2. Ensure that the community name configured on the RA matches the one used by the management server running the SNMP browser (versions 1 and 2). Use public as the community name.
3. Ensure that the user name and password configured on the RA match those used by the management server running the SNMP browser (version 3).
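If you prefer to script the first two checks from the management server, the sketch below is one hedged option. It assumes the standard net-snmp command-line tools are installed there, reuses the public community from step 2, and would need different snmpwalk options for SNMPv3 environments.

```python
# Hedged helper for the monitoring checks above: a successful MIB-II "system"
# walk confirms that the RA management IP is reachable and that the community
# string matches. Assumes the net-snmp tools (snmpwalk) are installed.
import subprocess


def walk_mib2_system(ra_mgmt_ip: str, community: str = "public") -> bool:
    result = subprocess.run(
        ["snmpwalk", "-v", "2c", "-c", community, ra_mgmt_ip, "system"],
        capture_output=True,
        text=True,
    )
    print(result.stdout or result.stderr)
    return result.returncode == 0


# Example call with a hypothetical management address:
# walk_mib2_system("10.0.0.50")
```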

Trap Issues

1. Ensure that the trap destination is on the same network as the management network and that a firewall has not blocked SNMP traffic.
2. Ensure that the same version of SNMP is configured in the management software that receives traps.


Appendix G
Using the Unisys SafeGuard 30m Collector

The Unisys SafeGuard 30m Collector utility enables you to easily collect information about the environment so that you can solve problems. An enterprise solution requires many logs, and gathering the log information can be time intensive. Often the person who collects the information is not familiar with all the interfaces to the hardware. The Collector solves these problems: an experienced installer configures log collection one time, and then other personnel can use a "one-button" approach to log collection.

You can use this utility to create custom scripts to complete tasks tailored to your environment. You choose which CLI commands to include in the custom scripts so that you build the capabilities you need. Refer to the Unisys SafeGuard Solutions Introduction to Replication Appliance Command Line Interface (CLI) for more information about CLI commands.

The Collector gathers configuration information from RAs, storage subsystems, and switches. No information is collected from the servers in the environment.

Installing the <strong>SafeGuard</strong> 30m Collector<br />

This utility offers two modes: Collector and View. You determine the available modes<br />

when you install the program. If you install the Collector and specify Collector mode,<br />

both modes are enabled. If you install the Collector and specify View mode, the Collector<br />

mode functions are disabled. The View mode is primarily used by support personnel at<br />

the Unisys <strong>Support</strong> Center.<br />

If you are installing the Collector at a customer installation, be sure to install the utility on<br />

PCs at both sites.<br />

The utility requires .NET Framework 2.0 and J# redistributable, which are on the Unisys<br />

<strong>SafeGuard</strong> 30m Solution Control Install Disk CD-ROM in the Redistributable folder.<br />

The directories under this folder are dotNet Framework 2.0 and JSharp.<br />


Notes:<br />

• The readme file on that CD-ROM contains the same information as this appendix.<br />

• If you installed a previous version of the Collector, uninstall this utility and remove<br />

the folder and all of the files in the folder before you begin this installation.<br />

Perform the following steps to install the Collector:<br />

1. Insert the CD-ROM in the CD/DVD drive, and start the file Unisys <strong>SafeGuard</strong> 30m<br />

Collector.msi.<br />

2. On the Installation Wizard welcome screen, click Next.<br />

3. On the Customer Information screen, type the user name and organization, and<br />

click Next.<br />

4. On the Destination Folder screen, select a destination folder and click Next.<br />

Note: If you are using the Windows Vista operating system, install the Collector<br />

into a separate directory named C:\Unisys\30m\Collector.<br />

5. On the Select Options: screen, select Collector mode – install at site or select View mode – install at support center, and then click Next.

6. On the Ready to Install the Program screen, click Install.<br />

The Installation wizard begins installing the files, and the Installing Unisys<br />

<strong>SafeGuard</strong> 30m Collector screen is displayed to indicate the status of the<br />

installation.<br />

After the files are installed, the Installation Wizard Completed screen is<br />

displayed.<br />

7. Click Finish.<br />

Before You Begin the Configuration<br />

Before you begin configuring the Collector, be sure you have the following information:<br />

• IP addresses<br />

− SAN switches<br />

− Network switches<br />

− RA site management<br />

• Log-in names<br />

− SAN switches<br />

− Network switches<br />

− RA (for custom scripts only)<br />


• Passwords<br />

− SAN switches<br />

− Network switches<br />

− RA (for custom scripts only)<br />

• EMC Navisphere CLI<br />

− Storage<br />

• Autologon configuration<br />

− SAN switches (Consult your SAN switch documentation for the autologon<br />

configuration.)<br />

If you are using a Cisco SAN switch, enable the SSH server before you begin the<br />

configuration. See “Configuring RA, Storage, and SAN Switch Component Types Using<br />

Built-Ins” in this appendix.<br />

Handling the Security Breach Warning<br />

If you previously installed the Collector and have uninstalled the utility and all the files,<br />

when you begin configuring RAs or adding RAs, you might get this message:<br />

WARNING – POTENTIAL SECURITY BREACH!<br />

If you receive this message, complete these steps:<br />

1. Delete the IP address for the RA.<br />

2. Use the following plink command, where <RA IP address> is a placeholder for the IP address of the RA:

C:\>plink -l admin -pw admin <RA IP address> get_version

Messages about the host key and a new key are displayed.<br />

3. Type Y in response to the message “Update cached key?”<br />

Once you have updated the cached key, complete the steps in “Configuring RAs” to<br />

discover the IP addresses for the RAs.<br />


Using Collector Mode<br />

Installing the utility in Collector mode enables all the capabilities to gather log information<br />

using scripts and also enables View mode.<br />

Getting Started<br />

To access the Collector, follow these steps:<br />

1. On the Start menu, point to Programs, then Unisys, then SafeGuard 30m Collector; and then click SafeGuard 30m Collector.

2. Select the Components.ssc file on the Open Unisys <strong>SafeGuard</strong> 30m Collector<br />

File dialog box.<br />

The Unisys <strong>SafeGuard</strong> 30m Collector program window is displayed with two panes<br />

open.<br />

Configuring RAs<br />

To collect data, specify the site management IP address of either of the RA clusters for a<br />

site. The “built-in” scripts are a preconfigured set of CLI commands that facilitate easy<br />

data collection.<br />

The other site management IP address is automatically discovered when you specify<br />

either of the RA site management addresses.<br />

To configure the RA, perform these steps:<br />

1. Start the Collector.<br />

2. If needed, expand the Components tree in the left pane.<br />

3. Select BI Built-In (under RA), right-click, and click Copy Built-In (Discover RA).<br />

4. On the Script dialog box, type the RA site management IP address in the IP<br />

Address field and click Save.<br />

If you have multiple <strong>SafeGuard</strong> solutions, repeat steps 3 and 4 for each set of RA<br />

clusters.<br />

After you enter the IP address, the Collector window is updated with the folder of each<br />

site management IP address appearing below the RA folder. Each IP folder contains the<br />

built-in scripts that are enabled.<br />

The following sample window shows the IP address folders listed in the left pane. In this<br />

figure, two <strong>SafeGuard</strong> solutions are configured—the set of IP addresses (172.16.17.50<br />

and 172.16.17.60) for the two RA clusters in solution 1 and the IP address 172.16.7.50<br />

for the continuous data protection (CDP) solution, which always has only one RA cluster.<br />



Adding Customer Information<br />


Add information about the Unisys service representative, customer, and architect so that<br />

the Unisys <strong>Support</strong> Center can contact the site easily. To add the information, perform<br />

the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />

1. On the File menu, click Properties.<br />

2. On the Properties dialog box, select the appropriate tab: Customer, Architect,<br />

or CIR.<br />

3. Type in the information for each field on each tab. (For instance, type text in the<br />

Name, Office, Mobile, E-mail, and Additional Info fields for the CIR tab.)<br />

The Architect tab provides an Installed Date field. Use the Additional Info field for any<br />

other information that the Unisys <strong>Support</strong> Center might need, such as a support<br />

request number.<br />

4. Click OK.<br />


Running All Scripts<br />

To collect data from all enabled scripts in a <strong>SafeGuard</strong> <strong>Solutions</strong> Components (SSC) file,<br />

perform these steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />

1. Select Components.<br />

2. Right-click, and click Run, or click the Run button.<br />

Note: The status bar shows the progress of script executions and the amount of data<br />

collected.<br />

Compressing an SSC File to Send to the <strong>Support</strong> Center<br />

Once you run the utility to collect information, you can compress the SSC file to send to<br />

the Unisys <strong>Support</strong> Center.<br />

Note: A Collector components file has the .ssc suffix. Once an SSC file is compressed,<br />

the corresponding <strong>SafeGuard</strong> <strong>Solutions</strong> Data (SSD) file has the .ssd suffix.<br />

On the Unisys <strong>SafeGuard</strong> 30m Collector program window, follow these steps to<br />

compress an SSC file:<br />

1. Click Compress SSC on the File menu.<br />

Once the file is compressed, the file name and path are displayed at the top in the<br />

right pane of the window. The data is exported to the file named Components.ssd in<br />

the directory C:\Program Files\Unisys\30m\Collector\Data.<br />

Note: For the Microsoft Vista operating system, the SSD file resides in the<br />

directory where the Collector is installed. A typical location for this file is<br />

C:\Unisys\30m\Collector\Components.ssd.<br />

2. Send the SSD file to the Unisys <strong>Support</strong> Center at<br />

Safeguard30msupport@unisys.com.<br />

Duplicating the Installation on Another PC<br />

To duplicate the installation of the Collector at a different PC (for example, on the second<br />

site), perform these steps:<br />

1. Copy the SSD file from the PC with the installed Collector to the second PC, placing<br />

it in the C:\Program Files\Unisys\30m\Collector\Data directory.<br />

2. Start the Collector.<br />

3. Click Cancel on the Open Unisys <strong>SafeGuard</strong> 30m Collector File dialog box.<br />

The Unisys <strong>SafeGuard</strong> 30m Collector program window is displayed.<br />

Note: Once an SSD file is extracted, you can select the .ssc file.<br />

4. On the File menu, select Uncompress SSD.<br />


5. On the Open <strong>SafeGuard</strong> 30m Data File dialog box, select from the list of<br />

available files the SSD file that you wish to uncompress.<br />

If a message appears asking about overwriting the SSC file, click Yes.<br />

6. Ensure that all scripts run from this PC by selecting each component type and<br />

running the scripts for each component.<br />

Understanding Operations in Collector Mode<br />

The Components.ssc file contains the configuration information. If you make changes to<br />

the Components.ssc file—such as adding, deleting, editing, enabling, and disabling<br />

scripts—these changes are automatically saved. You can also make these changes to a<br />

saved SSC file except that you cannot delete scripts from a saved SSC file. You must<br />

open the Components.ssc file to delete scripts.<br />

Understanding and Saving SSC Files<br />

Because you can enable and disable scripts in any SSC file, you can create saved SSC<br />

files for specific uses. If you want to run a subset of the available scripts, save the<br />

Components.ssc file as a new SSC file with a unique name. You can then enable or<br />

disable scripts in the saved SSC file. The saved SSC file is always updated from the<br />

Components.ssc file for information such as the available scripts and the details within<br />

each script. In addition, all changes that are made to any SSC file are updated in the<br />

Components.ssc file. Only scripts that were enabled in the saved SSC file are enabled<br />

when updated from a Components.ssc file.<br />

For example, you could save an SSC file with all RAs except one disabled. You might<br />

name it “radisabled.ssc”. If you have the radisabled.ssc file open and add a new script to<br />

it, the script is automatically added to the Components.ssc file.<br />

Whenever the Components.ssc file is updated with a new script, that script is<br />

automatically added to any saved SSC files.<br />

If you add a new RA to the configuration, the Components.ssc file and any existing saved SSC files are updated with the new component, and its scripts are disabled.

If you make deletions to the Components.ssc file, the deletions are automatically<br />

removed from any saved SSC files.<br />


Sample Scenario<br />

If you want to collect data at one site only or if you want to view the data from one site,<br />

you can create a new saved SSC file for each site. Follow these steps to create the<br />

saved SSC files.<br />

1. Add any desired scripts to the Components.ssc file.<br />

2. Open an SSC file.<br />

3. Click Save As on the File menu, and enter a unique name for the file.<br />

4. Enable and disable scripts as desired.<br />

For example, you might disable one site. To do so, follow these steps:<br />

a. Select the IP address of a component (perhaps Site 1 RA cluster management<br />

IP.)<br />

b. Right-click and click Disable.<br />

Repeat steps 2 through 4 to create additional customized files.<br />

Opening an SSC File<br />

On the Unisys <strong>SafeGuard</strong> 30m Collector program window, perform the following steps<br />

to open an SSC file:<br />

1. Click Open on the File menu.<br />

2. Select an SSC file and click Open.<br />

Configuring RA, Storage, and SAN Switch Component Types Using<br />

Built-In Scripts<br />

The built-in scripts are preconfigured; they contain CLI commands for RAs, navicli<br />

commands for Clariion storage, and CLI commands for switches that facilitate easy data<br />

collection. It takes about 4 minutes for the built-in scripts for one RA to run and about 2<br />

minutes for the built-in scripts for a SAN switch to run.<br />

After you configure built-in scripts, the left pane is updated with the IP addresses below<br />

the component type. Each IP folder contains the built-in scripts that are enabled.<br />

See the previous sample window with the IP address folders listed in the left pane. In<br />

that figure, two <strong>SafeGuard</strong> solutions are configured—the set of IP addresses<br />

(172.16.17.50 and 172.16.17.60) for the two RA clusters and the IP address 172.16.7.50<br />

for the continuous data protection (CDP) setup, which always has only one RA cluster.<br />


On the Unisys <strong>SafeGuard</strong> 30m Collector program window, follow these steps to use<br />

built-in scripts to configure RA, Storage, and SAN Switch component types:<br />

1. Expand a component type—RA, Storage, or SAN Switch—and select BI Built-<br />

In.<br />

2. Right-click and click Copy Built-In.<br />

3. On the Script dialog box, complete the available fields and click Save.<br />

Note: You can select one script instead of all scripts by selecting a script name instead<br />

of selecting BI-Built-In.<br />

For the RA Component Type<br />

To collect data, specify the site management IP address of either of the RA clusters for a<br />

site. The other site management IP address is automatically discovered when you<br />

specify either of the RA site management addresses.<br />

If you have multiple <strong>SafeGuard</strong> solutions, repeat the three previous steps for each set of<br />

RA clusters.<br />

For the Storage Component Type<br />

Clariion is the only storage component with built-in scripts available.<br />

For the SAN Switch Component Type<br />

Before configuring a Cisco SAN switch, enter config mode on the switch and type #ssh<br />

server enable. To determine the state of the SSH server, type show ssh server<br />

when not in config mode. Refer to the Cisco MDS 9020 Fabric Switch Configuration<br />

<strong>Guide</strong> and Command Reference for more information about switch commands.<br />
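For reference, a typical console sequence on the Cisco switch looks like the following. The prompts are illustrative; confirm the exact commands for your switch model and firmware in the Cisco guide referenced above.

switch# config terminal
switch(config)# ssh server enable
switch(config)# exit
switch# show ssh server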

If you run the tech-support command under SAN Switch from the Collector, the data<br />

capture might take a long time. You can follow the progress in the status bar of the<br />

window.<br />

If you run commands for a Brocade switch and receive the following message, the<br />

Brocade switch is downlevel and does not support the SSH protocol:<br />

rbash: switchShow: command not found<br />

Upgrade the switch software to a later version that supports the SSH protocol.<br />


Enabling Scripts<br />

You can interactively enable all the scripts in any SSC file, the scripts for one component<br />

in the SSC file, or a single script. To enable a disabled script, you must open the SSC file<br />

containing the script. Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m<br />

Collector program window.<br />

Enable All Scripts<br />

1. Select Components.<br />

2. Right-click and click Enable.<br />

Enabled scripts are shown in green.<br />

Enable Scripts for One Component<br />

1. Select the IP address of the component.<br />

2. Right-click and click Enable.<br />

Enabled scripts are shown in green.<br />

Enable a Single Script<br />

1. Select the script name.<br />

2. Right-click and click Enable.<br />

The enabled script is shown in green.<br />

Disabling Scripts<br />

You can interactively disable all the scripts in any SSC file, the scripts for one component<br />

in the SSC file, or a single script. Perform the following steps on the Open Unisys<br />

<strong>SafeGuard</strong> 30m Collector program window.<br />

Disable All Scripts<br />

1. Select Components.<br />

2. Right-click and click Disable.<br />

Disabled scripts are shown in red.<br />

Disable Scripts for One Component<br />

1. Select the IP address of the component.<br />

2. Right-click and click Disable.<br />

Disabled scripts are shown in red.<br />

Disable a Single Script<br />

1. Select the script name.<br />

2. Right-click and click Disable.<br />

The disabled script is shown in red.<br />

Running Scripts

You can interactively run all the scripts in any SSC file; the scripts for one component<br />

type such as RA, Storage, SAN Switch, or Other; the scripts for one component in the<br />

SSC file; or a single script.<br />

Note: You can use the Run button on the Collector toolbar or the Run command in the<br />

following procedures.<br />

Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />

Run All Scripts<br />

1. Select Components.<br />

2. Right-click and click Run.<br />

Run Scripts for One Component Type<br />

1. Select a component type—RA, Storage, SAN Switch, or Other.<br />

2. Right-click and click Run.<br />

The status of the executing scripts is displayed in the right pane. The status bar<br />

shows the component type that is running, the IP address, the script name, and<br />

instructions for halting script execution. A progress bar indicates that the Collector is<br />

running the script and shows the amount of data being captured by the script. Once<br />

script execution completes, the status bar shows the last script run.<br />

Run Scripts for One Component<br />

1. Select either the IP address or custom-named component.<br />

2. Right-click and click Run.<br />

The status of the executing scripts is displayed in the right pane. The status bar<br />

shows the component type that is running, the IP address, the script name, and<br />

instructions for halting script execution. A progress bar indicates that the Collector is<br />

running the script and shows the amount of data being captured by the script. Once<br />

script execution completes, the status bar shows the last script run.<br />

Run a Single Script<br />

1. Select a script name.<br />

2. Right-click and click Run.<br />

The status of the executing scripts is displayed in the right pane. The status bar<br />

shows the component type that is running, the IP address, the script name, and<br />

instructions for halting script execution. A progress bar indicates that the Collector is<br />

running the script and shows the amount of data being captured by the script. Once<br />

script execution completes, the status bar shows the last script run.<br />

Stopping Script Execution<br />

To stop a script while it is executing, click Stop on the Collector toolbar. All scripts that<br />

have been stopped are marked with a green X. The status of the stopped script is<br />

displayed in the right pane.<br />


Deleting Scripts<br />

You can interactively delete scripts only in the Components.ssc file. Perform the<br />

following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />

Delete Scripts for One Component<br />

1. Select the IP address or custom-named component.<br />

2. Right-click and click Delete.<br />

Delete a Single Script<br />

1. Expand an IP address or a custom-named component; then select a script name.<br />

2. Right-click and click Delete.<br />

Adding Scripts for RA, Storage, and SAN Switch Component Types<br />

You can interactively add custom scripts to any SSC file by copying an existing script or<br />

by specifying a new script. Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m<br />

Collector program window.<br />

Add New Script for a Component Type<br />

1. Select a component type—RA, Storage, or SAN Switch.<br />

2. Right-click and click New.<br />

3. Complete the script form.<br />

4. Click Save.<br />

Add a New Script Based on an Existing Custom Script<br />

1. Select a script name.<br />

2. Right-click and click New.<br />

3. Complete the form. Change the script name and the command.<br />

4. Click Save.<br />

Adding Scripts for the Other Component Type<br />

Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />

1. Select the component type Other.<br />

2. Right-click and click New.<br />

3. On the Select Program dialog box, navigate to the appropriate directory and<br />

choose the file to run. Then click Open.<br />

4. On the Script dialog box, type a component name in the Component field.<br />

5. Type a unique name for the script in the Script Name field.<br />

6. Review the selected file name that is displayed in the Command field. Modify the<br />

file name as necessary.<br />


The following example illustrates using a custom component (adding a new script as<br />

shown in the previous procedure) to mount and unmount drives.<br />

Note: In this example, the Collector must be installed on the server with the kutils<br />

utility installed or with the stand-alone kutils utility installed.<br />

C:\batch_File\mount_r.bat

REM This batch file, when run, unmounts and then remounts the specified drive
@echo on
cd /d "C:\Program Files\kdriver\kutils"
kutils.exe umount r:
kutils.exe mount r:
echo Finished

C:\batch_File\unmount_r.bat

REM This batch file, when run, flushes and unmounts the specified drive
cd /d "C:\Program Files\kdriver\kutils"
kutils.exe flushFS r:
kutils.exe umount r:

Scheduling an SSC File<br />

Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />

1. Click Schedule on the menu bar.<br />

2. On the Schedule Unisys <strong>SafeGuard</strong> 30m Collector File dialog box, enter the<br />

information required for each field as follows:<br />

a. Type the password.<br />

b. Type the date and start time.<br />

c. Select a Perform task option, which determines how often the schedule runs.<br />

d. Enter the end date if shown. (You do not need an end date for a Perform task of<br />

Once.)<br />

3. Click Select.<br />

4. On the Select Unisys <strong>SafeGuard</strong> 30m Collector dialog box, select the<br />

appropriate SSC file for which you wish to run the schedule, and then click Open.<br />

The Schedule Unisys <strong>SafeGuard</strong> 30m Collector File dialog box is again<br />

displayed. The Collector opens the selected SSC file as the current SSC file.<br />

5. Click Add.<br />

6. Click Exit.<br />

Note: You can create one schedule for an SSC file. To create additional schedules,<br />

create additional SSC files with the desired scripts enabled. The resultant scheduled data<br />

is appended to any current data (if available). For example, if you run the Collector using<br />

Windows Scheduler three times, three outputs are displayed in the right pane one after<br />

another with the timestamps for each.<br />


Querying a Scheduled SSC File<br />

Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />

1. Click Schedule from the menu bar.<br />

2. On the Schedule Unisys <strong>SafeGuard</strong> 30m Collector File dialog box, click<br />

Query.<br />

3. On the Tasks window, select the task name that is the same as the scheduled SSC<br />

file.<br />

4. Right-click and click Properties.<br />

5. View the details of the scheduled task in the window; then click OK to close the<br />

task Properties window.<br />

6. Close the Tasks window and then select the Schedule Unisys <strong>SafeGuard</strong> 30m<br />

Collector window.<br />

7. Click Exit.<br />

Note: For the Microsoft Vista operating system, if you want to see the scheduled task<br />

after scheduling a task, click Query on the Schedule Unisys <strong>SafeGuard</strong> 30m<br />

Collector File dialog box. The Vista Microsoft Management Console (MMC) window is

displayed. Press F5 to see the scheduled task.<br />

Deleting a Scheduled SSC File<br />

Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />

1. Select Schedule from the menu bar.<br />

2. On the Schedule Unisys <strong>SafeGuard</strong> 30m Collector File dialog box, click<br />

Query.<br />

3. On the Tasks window, select the task name that is the same as the scheduled SSC<br />

file.<br />

4. Right-click and click Delete.<br />

5. Close the Tasks window and then select the Schedule Unisys <strong>SafeGuard</strong> 30m<br />

Collector window.<br />

6. Click Exit.<br />



Using View Mode

If you installed the Collector in View mode, the support personnel at the Unisys Support Center can use View mode to view the information. To access the Collector, follow these steps:

1. Start the Collector.

2. On the Open Unisys SafeGuard 30m Collector File dialog box, click Cancel.

The Unisys SafeGuard 30m Collector program window is displayed.

Note: Once an SSD file is extracted, you can select the .ssc file.

3. On the File menu, click Uncompress SSD.

4. On the Open SafeGuard 30m Data File dialog box, select from the list of available files the SSD file that you wish to uncompress.

5. In View mode, expand the components tree and then expand a component type: RA, Storage, SAN Switch, or Other.

6. Click a script name from those displayed to view the data collected from that script.

The data is displayed in the right pane.

The following figure displays a sample of View mode with data displayed in the right pane.

7. On the File menu, click Exit.




Appendix H<br />

Using kutils<br />


The server-based kutils utility enables you to manage host splitters across all platforms.<br />

This utility is installed automatically when you install the Unisys <strong>SafeGuard</strong> 30m splitter<br />

on a host machine. When the splitting function is performed by an intelligent fabric<br />

switch, you can install a stand-alone version of the kutils utility separately on host<br />

machines.<br />

For details on the syntax and use of the kutils commands, see the Unisys SafeGuard

<strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong>.<br />

Usage

A kutils command is always introduced with the kutils string. If you enter the string independently—that is, without any parameters—the kutils utility returns usage notes, as follows:

C:\program files\kdriver\kutils>kutils<br />

Usage: kutils <command>

Path Designations<br />

You can designate the path to a device in the following ways:<br />

• Device path example<br />

“SCSI\DISK&VEN_KASHYA&PROD_MASTER_RIGHT&REV_0001\5&133EF78A&0&000”<br />

• Storage path example<br />

“SCSI#DISK&VEN_KASHYA&PROD_MASTER_RIGHT&REV_0001#5&133EF78A&0&000#{53<br />

f56307-b6bf-11d0-94f2-00a0c91efb8b}”<br />

• Volume path example<br />

“\\?\Volume{33b4a391-26af-11d9-b57b-505054503030}”<br />

Each command notes the particular designation to use. In addition, some commands,<br />

such as showDevices and showFS, return the symbolic link for a device. The symbolic<br />

link generally provides additional information about the characteristics of the specific<br />

devices.<br />


The following are examples of symbolic links:<br />

“\Device\0000005c”<br />

“\Device\EmcPower\Power2”<br />

“\Device\Scsi\q123001Port2Path0Target0Lun2”<br />

Command Summary<br />

The kutils utility offers the following commands:<br />

• disable: Removes host access to the specified device or volume (Windows only).<br />

• enable: Restores host access to a specified device or volume (Windows only).<br />

• flushFS: Initiates an operating system flush of the file system (Windows only).<br />

• manage_auto_host_info_collection: Indicates whether the automatic host<br />

information collection is enabled or disabled, or enables or disables automatic host<br />

information collection.<br />

• mount: Mounts a file system (Windows only).<br />

• rescan: Scans storage for all existing disks (Windows only).<br />

• showDevices: Presents a list of physical devices to which the host has access,<br />

providing (as available) the device path, storage path, and symbolic link for each<br />

device (Windows only).<br />

• showFS: Presents the drive designation and, as available, the device path, storage<br />

path, and symbolic link for each mounted physical device (Windows only).<br />

• show_vol_info: Presents information on the specified volume, including: the Unisys<br />

<strong>SafeGuard</strong> 30m solution name (if “created” in Unisys <strong>SafeGuard</strong> <strong>Solutions</strong>), size, and<br />

storage path.<br />

• show_vols: Presents information on all volumes to which the host has access<br />

including: the Unisys <strong>SafeGuard</strong> 30m solution name (if “created” in Unisys<br />

<strong>SafeGuard</strong> <strong>Solutions</strong>), size, and storage path<br />

• sqlRestore: Restores an image previously created by the sqlSnap command<br />

(Windows only)<br />

• sqlSnap: Creates a VDI-based SQL Server image (Windows only).

• start: Resumes the splitting of write operations.<br />

• stop: Discontinues the splitting of write operations to an RA (that is, places the host<br />

splitter in pass-through mode in which data is written to storage only).<br />

• umount: Unmounts the file system (Windows only).<br />
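As an illustration only, the following batch-style sequence shows how several of these commands might be combined on a Windows host. The installation path matches the default used elsewhere in this guide, and drive R: is a placeholder for a replicated volume in your environment; refer to the Administrator's Guide for the exact syntax of each command.

REM Change to the kutils installation directory
cd /d "C:\Program Files\kdriver\kutils"
REM List the mounted file systems and the devices that the host can access
kutils.exe showFS
kutils.exe showDevices
REM Flush the file system on drive R: and then unmount it
kutils.exe flushFS r:
kutils.exe umount r:
REM Later, remount the drive when host access is needed again
kutils.exe mount r: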



Appendix I<br />

Analyzing Cluster Logs<br />

Samples of cluster log messages for problems and situations are listed throughout this<br />

guide. You can search on text strings from cluster log messages to find specific<br />

references.<br />

The information gathered in cluster logs is critical in determining the cause of a given<br />

cluster problem. Without the diagnostic information from the cluster logs, you might find<br />

it difficult to determine the root cause of a cluster problem.<br />

This appendix provides information to help you use the cluster log as a diagnostic tool.<br />

Introduction to Cluster Logs<br />

The cluster log is a text log file updated by the Microsoft Cluster Service (MSCS) and its<br />

associated cluster resource. The cluster log contains diagnostic messages about cluster<br />

events that occur on an individual cluster member or node. This file provides more<br />

detailed information than the cluster events written in the system event log.<br />

A cluster log reports activity for one node. All member nodes in a cluster perform as a<br />

single unit. Therefore, when a problem occurs, it is important to gather log information<br />

from all member nodes in the cluster. This information gathering is typically done using<br />

the Microsoft MPS Report Utility. Gather the information immediately after a problem<br />

occurs to ensure cluster log data is not overwritten.<br />

By default, the cluster log name and location are as follows:<br />

• C:\Winnt\Cluster\cluster.log<br />

Note: For Windows 2003, the cluster.log file is located in the following path:

C:\WINDOWS\Cluster<br />

• Captured with MPS Report Utility: _Cluster.log<br />


Creating the Cluster Log<br />

In Windows 2000 Advanced Server and Windows 2000 Datacenter Server, by default,<br />

cluster logging is enabled on all nodes. You can define the characteristics and behavior of<br />

the cluster log with system environment variables.<br />

To access the system environment variables, perform the following actions:<br />

1. In Control Panel, double-click System.<br />

2. Select the Advanced tab.<br />

3. Click Environment Variables.<br />

You can get additional information regarding the system environment variables in<br />

Microsoft Knowledge Base article 16880, “How to Turn On Cluster Logging in Microsoft<br />

Cluster Server” at this URL:<br />

http://support.microsoft.com/default.aspx?scid=kb;en-us;168801<br />

The default cluster settings are listed in Table I–1. Some parameters might not be listed<br />

when viewing the system environment variables. If a variable is not listed, its default<br />

value is still in effect.<br />

Table I–1. System Environment Variables Related to Clustering

ClusterLog
Default setting: %SystemRoot%\Cluster\Cluster.log
Determines the location and name of the cluster log file.

ClusterLogSize
Default setting: 8 MB
Determines the size of the cluster log. The default size is usually not large enough to retain history on enterprise systems. The recommended setting is 64 MB.

ClusterLogLevel
Default setting: 2
Sets the level of detail for log entries, as follows:
0 = No logging
1 = Errors only
2 = Errors and Warnings
3 = Everything that occurs
Used only with the /debug parameter on MSCS startup. Review Microsoft Knowledge Base article 258078 for more information about using the /debug parameter.

ClusterLogOverwrite
Default setting: 0
Determines whether a new cluster log is to be created when MSCS starts.
0 = Disabled
1 = Enabled
Note: By default, the ClusterLogOverwrite setting is disabled. Unisys recommends that this setting remain disabled. When this setting is enabled, all cluster log history is lost if MSCS is restarted twice in succession.

Understanding the Cluster Log Layout

Figure I–1 illustrates the layout of the cluster log. The paragraphs following the figure explain the various parts of the layout.

Figure I–1. Layout of the Cluster Log

Process ID
The process ID is the process number assigned by the operating system to a service or application.

Thread ID
The thread ID is a thread of a particular process. A process typically has multiple threads listed. Within a large cluster log, it is particularly useful to search by thread ID to find the messages related to the same thread.

Date
The date listed is the date of the entry. You can use this date to match the date of the problem in the system event log.

GMT
The time entered in the Windows 2000 cluster log is always in Greenwich Mean Time (GMT). The format of the entry is HH:MM:SS.SSS. The SS.SSS entry represents seconds carried out to the thousandths of a second. There can be multiple .SSS entries for the same thousandth of a second. Therefore, more than 999 cluster log entries can exist for any given second.


Cluster Module<br />

Table I–2 lists the various modules of MSCS. These module names are logged within<br />

square brackets in the cluster log.<br />

Table I–2. Modules of MSCS<br />

API API <strong>Support</strong><br />

ClMsg Cluster messaging<br />

ClNet Cluster network engine<br />

CP Checkpoint Manager<br />

CS Cluster service<br />

DM Database Manager<br />

EP Event Processor<br />

FM Failover Manager<br />

GUM Global Update Manager<br />

INIT Initialization<br />

JOIN Join<br />

LM Log Manager<br />

MM Membership Manager<br />

NM Node Manager<br />

OM Object Manager<br />

RGP Regroup<br />

RM Resource Monitor<br />

For additional descriptions of the cluster components, refer to the Windows 2000 Server<br />

Resource Kit at this URL:<br />

http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/default.mspx?mfr=true

In the left navigation pane, click on Windows 2000 Server Resource Kit and click on Distributed Systems Guide, then Enterprise Technologies, and then Interpreting the Cluster Log.

Click the following link for Windows 2003 to refer to the Windows 2003 Server Resource<br />

Kit:<br />

http://www.microsoft.com/windowsserver2003/techinfo/reskit/tools/default.mspx<br />

Click the following link to interpret the cluster logs:<br />

http://technet2.microsoft.com/windowsserver/en/library/16eb134d-584e-46d9-9bf4-6836698cd26a1033.mspx?mfr=true

Sample Cluster Log

The sample cluster log that follows illustrates the component names in brackets.<br />

00000848.00000ba0::2008/05/05-16:11:31.000 [RGP] Node 1: REGROUP INFO:
regroup engine requested immediate shutdown.

00000848.00000ba0::2008/05/05-16:11:31.000 [NM] Prompt shutdown is requested
by a membership engine

00000adc.00000acc::2008/05/05-16:11:31.234 [RM] Going away, Status = 1,
Shutdown = 0.

Cluster Operation

The cluster operation is the task currently being performed by the cluster. Each cluster

module (listed in Table I–2) can perform hundreds of operations, such as forming a<br />

cluster, joining a cluster, checkpointing, moving a group manually, and moving a group<br />

because of a failure.<br />

Posting Information to the Cluster Log<br />

The cluster log file is organized by date and time. Process threads of MSCS and<br />

resources post entries in an intermixed fashion. As the threads are performing various<br />

cluster functions, they constantly post entries to the cluster log in an interspersed<br />

manner.<br />

The following sample cluster log shows various disks in the process of coming online.<br />

The entries are not logically grouped by disk; rather, the entries are logged as each<br />

thread posts its unique information.<br />


Sample Cluster Log<br />

Thread ID<br />

↓<br />

00000444.00000600::2008/11/18-18:23:48.307 Physical Disk :<br />

[DiskArb] Issuing GetSectorSize on signature 9a042144.<br />

00000444.000005e0::2008/11/18-18:23:48.307 Physical Disk :<br />

[DiskArb]Successful read (sector 12) [:0] (0,00000000:00000000).<br />

00000444.00000608::2008/11/18-18:23:48.307 Physical Disk :<br />

[DiskArb]DisksOpenResourceFileHandle: CreateFile successful.<br />


00000444.00000600::2008/11/18-18:23:48.307 Physical Disk :<br />

[DiskArb] GetSectorSize completed, status 0.<br />

00000444.00000608::2008/11/18-18:23:48.307 Physical Disk :<br />

DiskArbitration must be called before DisksOnline.<br />

00000444.00000600::2008/11/18-18:23:48.307 Physical Disk :<br />

[DiskArb] ArbitrationInfo.SectorSize is 512<br />

00000444.00000608::2008/11/18-18:23:48.307 Physical Disk :<br />

[DiskArb] Arbitration Parameters (1 9999).<br />

00000444.00000600::2008/11/18-18:23:48.307 Physical Disk :<br />

[DiskArb] Issuing GetPartInfo on signature 9a042144.<br />

Because the cluster performs many operations simultaneously, the log entries pertaining<br />

to a particular thread are interwoven along with the threads of the other cluster<br />

operations. Depending on the number of cluster groups and resources, reading a cluster<br />

log can become difficult.<br />

Tip: To follow a particular operation, search by the thread ID. For instance, to follow<br />

online events for Physical Disk V, perform these steps using the preceding sample<br />

cluster log:<br />

1. Anchor the cursor in the desired area.<br />

2. Search up or down for thread 00000600.<br />
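If you prefer the command line, you can extract the entries for one thread into a separate file instead of searching interactively. This sketch assumes you are working on a copy of the cluster log saved as C:\Temp\cluster_copy.log (the paths are only examples) and uses the thread ID from the preceding sample log.

REM Pull every entry posted by thread 00000600 into its own file, then open that file
findstr /c:".00000600::" C:\Temp\cluster_copy.log > C:\Temp\thread_00000600.txt
notepad C:\Temp\thread_00000600.txt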

Diagnosing a Problem Using Cluster Logs<br />

The following topics provide you with useful information for diagnosing problems using<br />

cluster logs:<br />

• Gathering Materials<br />

• Opening the Cluster Log<br />

• Converting GMT to Local Time<br />

• Converting Cluster Log GUIDs to Text Resource Names<br />

• Understanding State Codes<br />

• Understanding Persistent State<br />

• Understanding Error and Status Codes<br />

Gathering Materials

You need to gather the following pieces of information, tools, and files to use with the<br />

cluster logs to diagnose problems:<br />

• Information<br />

− Date and time of problem occurrence<br />

− Server time zone<br />

• Tools<br />

− Notepad or Wordpad text viewer<br />

− The Net Helpmsg command. This command-line tool is embedded in Windows. The command syntax is Net Helpmsg <error code>.

• Output from the MPS Report Utility from all cluster nodes<br />

• Files from the MPS Report Utility run<br />

− Cluster log (Mandatory)<br />

The file name is _Cluster.log.<br />

− System event log (Mandatory)<br />

The file name is _Event_Log_System.txt.<br />

− .nfo system information file for installed adapters and driver versions (Reference)<br />

The file name is _Msinfo.nfo.<br />

− Cluster registry hive for cross-referencing information used in the cluster log<br />

(Reference)<br />

The file name is _Cluster_Registry.hiv.<br />

− Cluster configuration file for a basic listing of cluster nodes, groups, resources,<br />

and dependencies (available in MPS Report Utility version 7.2 or later)<br />

The file name is _Cluster_mps_Information.txt.<br />

Opening the Cluster Log<br />

Use a text editor to view the cluster log file in the MPS Report Utility. Notepad or<br />

Wordpad works well. Notepad allows text searches up or down the document. Wordpad<br />

allows text searches only down the document.<br />

Note: Do not open the cluster.log file on a production cluster. Logging stops while the<br />

file is open. Instead, copy the cluster.log file first and then open the copy to read the file.<br />

The cluster log is on the local system in the directory C:\Winnt\Cluster (C:\WINDOWS\Cluster on Windows 2003).
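For example, from a command prompt on the cluster node (the paths are illustrative, and the example assumes a C:\Temp working folder already exists; use the Windows 2003 location if it applies):

REM Copy the live log first so that cluster logging is not interrupted, then open the copy
copy /y C:\Winnt\Cluster\cluster.log C:\Temp\cluster_copy.log
notepad C:\Temp\cluster_copy.log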


Converting GMT/UTC to Local Time

The time posted in the cluster log is given as GMT/UTC. You must convert GMT/UTC to

the local time to cross-reference cluster log entries with system and application event<br />

log entries.<br />

You can find the local time zone in the .nfo file in MPS Reports under system summary.<br />

You can also use the Web site www.worldtimeserver.com to find the accurate local time for a given city, GMT/UTC, and the difference between the two in hours.
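For example, if the server is configured for a time zone that is 5 hours behind GMT/UTC, a cluster log entry stamped 17:37:29.125 corresponds to 12:37:29.125 local time. On Windows Server 2003 you can also display the time zone configured on the node from a command prompt, as sketched below (assuming the w32tm tool that ships with the Windows Time service is present).

REM Display the time zone currently configured on this node
w32tm /tz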

Converting Cluster Log GUIDs to Text Resource Names<br />

A globally unique identifier (GUID) is a 32-character hexadecimal string used to identify a<br />

unique entity in the cluster. A unique entity can be a node name, group name, resource

name, or cluster name.<br />

The GUID format is nnnnnnnn-nnnn-nnnn-nnnn-nnnnnnnnnnnn.<br />

The following are examples of GUIDs in the cluster log:<br />

000007d0.00000808::2008/04/23-21:48:23.105 [FM] FmpHandleResourceTransition: resource<br />

Name = ae775058-af20-4ba2-a911-af138b1f65bd old state=130 new state=3<br />

000007d0.00000808::2008/04/23-21:48:23.448 [FM] FmpRmOfflineResource: RMOffline() for<br />

6060dc33-5737-4277-b2f2-9cc45629ef0 returned error 997<br />

000007d0.00001970::2008/05/02-21:41:58.846 [FM] OnlineResource: e65bc275-66d1-41ff-<br />

8a4e-89ad6643838b depends on 758bb9bb-7d1f-4148-a994-684dd4f8c969. Bring online<br />

first.<br />

000007d0.0000081::2008/05/04-17:21:06.888 [FM] New owner of Group b072608c-b7f3-48b0-<br />

83f8-7c922c14e709 is 2, state 0, curstate 1.<br />

Mapping a Text Name to a GUID<br />

The two methods for mapping a text name to a GUID are<br />

• Automatic mapping<br />

• Reviewing the cluster registry hive<br />

Automatic Mapping

The simplest method of mapping a text name to a GUID is the automatic mapping<br />

performed by some versions of the MPS Report tool. However, most versions of the<br />

MPS Report tool do not perform this automatic function.<br />

For those versions with the automatic mapping feature, you can find the information in<br />

the cluster configuration file (_Cluster_Mps_Information.txt). The<br />

following listing shows this mapping:<br />

f9f0b528-b674-40fb-9770-c65e17a2a387 = SQL Network Name<br />

f0dd1852-acc8-4921-b33a-a77dd5cdcfee = SQL Server Fulltext (SQL1)<br />

f0aca2c4-049f-4255-9332-92a69cc07326 = MSDTC<br />

eff360f3-d987-4a020-8f3c-4118056a50b2 = MSDTC IP Address<br />

e74769f8-67e1-43b2-9bec-93171c31d182 = SQL IP Address 1<br />

e09f61cf-8ebf-4cd1-9ae3-58ed4d2b0fbc = Disk K:<br />
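Once you know which GUID corresponds to a resource, you can pull every cluster log entry that references it. The following sketch uses the SQL Network Name GUID from the listing above and a copy of the cluster log; the paths are only examples.

REM List all cluster log entries that reference the SQL Network Name resource
findstr /i /c:"f9f0b528-b674-40fb-9770-c65e17a2a387" C:\Temp\cluster_copy.log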

Reviewing the Cluster Registry Hive<br />

The second method of mapping a text name to a GUID is more complex and involves<br />

opening the cluster registry hive from the MPS Report tool and then reviewing the<br />

contents.<br />

Follow these steps to open and review the cluster registry hive:<br />

1. Start the Registry Editor (Regedt32.exe).<br />

2. Click the HKEY_LOCL_MACHINE hive.<br />

3. Click the HKEY_LOCAL_MACHINE root folder.<br />

4. Click Load Hive on the Registry menu.<br />

5. Select the _Cluster_Registry.hiv file; then press Ctrl-C.<br />

6. Select Open.<br />

7. Press Ctrl-V to obtain the key name.<br />

8. Expand the cluster hive and review the GUIDS, which are located in the subkeys<br />

Groups, Resources, Networks, and NetworkInterfaces, as shown in Figure I–2.<br />

Figure I–2. Expanded Cluster Hive (in Windows 2000 Server)

Scroll through the GUIDs until you find the one that matches the GUID from the cluster log. You can also open each key until you find the matching GUID.

Tip: Under each GUID is a TYPE field. This field identifies a resource type such as physical disk, IP address, network name, generic application, generic service, and so forth. You can use this field to find a specific resource type and then map it to the GUID.

Understanding State Codes

MSCS uses state codes to determine the status of a cluster component. The state varies depending on the type of cluster components, which are nodes, groups, resources, networks, and network interfaces. Some state codes are posted in the cluster log using the numeric code and others using the actual value for the code.


Examples of State Codes in the Cluster Log<br />


The following example entries show state codes for the resource, group, network<br />

interface, node, and network types of cluster component:<br />

• Resource<br />

In this example, the resource is changing states from online pending (129) to online<br />

(2).<br />

00000850.00000888::2008/05/05-17:37:29.125 [FM] FmpHandleResource<br />

Transition: Resource Name = 87e55402-87cb-4354-95e7-6dd864b79039 old state =<br />

129 new state=2<br />

• Group<br />

In this example, the group state is set to offline (1).<br />

00000898.000008a0::2008/05/05-06:25:55:062 [FM] Setting group 1951e272-6271-<br />

4ea3-b0f9-cd767537f245 owner to node 2, state 1<br />

• Network interface<br />

This example provides the actual value of the state code, not the numeric code.<br />

00000898.00000598:2008/05/05-06:28:40;921 [ClMsg] Received interface<br />

unreachable event for node 2 network 1<br />

• Node<br />

This example provides the actual value of the state code, not the numeric code.<br />

00000898.0000060c::2008/05/05-06:28:45:953 [EP] Node down event received<br />

00000898.000008a8:2008/05/05-06:28:45:953 [Gum] Nodes down: 0002. Locker=1,<br />

Locking=1<br />

• Network<br />

This example provides the actual value of the state code, not the numeric code.<br />

00000898.000008a4::2008/05/05-06:25:53:703 [NM] Processing local interface<br />

up event for network 0433c4e2-a577-4325-9ebd-a9d3d2b9b81f.<br />


State Codes<br />

Table I–3 lists the state codes from the Windows 2000 Resource Kit for nodes.<br />

Table I–3. Node State Codes<br />

State Code State<br />

–1 ClusterNodeStateUnknown<br />

0 ClusterNodeUp<br />

1 ClusterNodeDown<br />

2 ClusterNodePaused<br />

3 ClusterNodeJoining<br />

Table I–4 lists the state codes from the Windows 2000 Resource Kit for groups.<br />

Table I–4. Group State Codes<br />

State Code State<br />

–1 ClusterGroupStateUnknown<br />

0 ClusterGroupOnline<br />

1 ClusterGroupOffline<br />

2 ClusterGroupFailed<br />

3 ClusterGroupPartialOnline<br />

Table I–5 lists the state codes from the Windows 2000 Resource Kit for resources.<br />

Table I–5. Resource State Codes<br />

State Code State<br />

–1 ClusterResourceStateUnknown<br />

0 ClusterResourceInherited<br />

1 ClusterResourceInitializing<br />

2 ClusterResourceOnline<br />

3 ClusterResourceOffline<br />

4 ClusterResourceFailed<br />

128 ClusterResourcePending<br />

129 ClusterResourceOnlinePending

130 ClusterResourceOfflinePending

Table I–6 lists the state codes from the Windows 2000 Resource Kit for network<br />

interfaces.<br />

Table I–6. Network Interface State Codes<br />

State Code State<br />

–1 ClusterNetInterfaceStateUnknown<br />

0 ClusterNetInterfaceUnavailable<br />

1 ClusterNetInterfaceFailed<br />

2 ClusterNetInterfaceUnreachable<br />

3 ClusterNetInterfaceUp<br />

Table I–7 lists the state codes from the Windows 2000 Resource Kit for networks.

Table I–7. Network State Codes<br />

State Code State<br />

–1 ClusterNetworkStateUnknown<br />

0 ClusterNetworkUnavailable<br />

1 ClusterNetworkDown<br />

2 ClusterNetworkPartitioned<br />

3 ClusterNetworkUp<br />
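When the cluster service is running, you can compare the state codes recorded in the log with the current live states by using the cluster.exe command-line tool included with MSCS. A minimal sketch, run on one of the cluster nodes:

REM List the current state of each node, group, and resource in the cluster
cluster node
cluster group
cluster resource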


Understanding Persistent State<br />

Persistent state is not a state code, but rather a key in the cluster registry hive for groups<br />

and resources. The persistent state key reflects the current state of a resource or group.<br />

This key is not a permanent value; it changes value when a group or resource changes<br />

states.<br />

You can change the value of the persistent state key, which can be useful for<br />

troubleshooting or managing the cluster. For example, you can change the value before a<br />

manual failover or shutdown to prevent a particular group or resource from starting<br />

automatically.<br />

The value for the persistent state can be 0 (disabled or offline) or 1 (enabled or online).<br />

The default value is 1.<br />

If the value for persistent state is 0, the group or resource remains in an offline state<br />

until it is manually brought online.<br />

The following is an example cluster log reference to persistent state:<br />

000008bc.00000908::2008/05/12-23:45:36/687 [FM] FmpPropagateGroupState:<br />

Group 1951e272-6271-4ea3-b0f9-cd767537f245 state = 3, persistent state = 1<br />

For more information about persistent state, view Microsoft Knowledge Base article<br />

259243, “How to Set the Startup Value for a Resource on a Clustered Server” at this<br />

URL:<br />

http://support.microsoft.com/default.aspx?scid=kb;en-us;259243<br />

Understanding Error and Status Codes

You can easily interpret error and status codes that occur in cluster log entries by issuing<br />

the following command from the command line:<br />

Net Helpmsg <error code>

This command returns a line of explanatory text that corresponds to the number.<br />
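For instance, from a command prompt on the cluster node:

REM Translate error code 5 into its descriptive text
net helpmsg 5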

Examples<br />

• For the error code value of 5 as shown in the following example, the Net Helpmsg<br />

command returns “Access is denied.”<br />

00000898.000008f0:2008/30-16:03:31.979 [DM] DmpCheckpointTimerCb -Failed to<br />

reset log, error=5<br />

• For the status code value of 997 as shown in the following example, the Net<br />

Helpmsg command returns “Overlapped I/O operation is in progress.” This status<br />

code is also known as “I/O pending.”<br />

00000898.00000a8c::2008/05/05-06:38:14.187 [FM] FmpOnlineResource: Returning<br />

Resource 87e55402-87cb-4354-95e7-6dd864b79039, state 129, status 997

• For the status code value of 170 as shown in the following example, the Net<br />

Helpmsg command returns “The requested resource is in use.”<br />

000009a4.000009c4::2008/05/15-07:28:42.303 Physical Disk :[DiskArb]

CompletionRoutine, status 170<br />



Index<br />

A<br />

accessing an image, 3-2<br />

analyzing<br />

intelligent fabric switch logs, A-16<br />

RA log collection files, A-8<br />

server (host) logs, A-16<br />

B<br />

bandwidth, verifying, D-7<br />

bin directory, A-14<br />

C<br />

changes for this release, 1-2<br />

clearing the system event log (SEL), B-1<br />

ClearPath MCP<br />

bringing data consistency group online, 3-5<br />

manual failover, 3-5<br />

recovery tasks, 3-5<br />

CLI file, A-10<br />

clock synchronization, verifying, D-8<br />

cluster failure, recovering, 4-19<br />

cluster log<br />

cluster registry hive, I-9<br />

definition, I-1<br />

error and status codes, I-15<br />

GUID format, I-8<br />

GUIDs, I-8<br />

layout, I-3<br />

mapping GUID to text name, I-8<br />

name and location, I-1<br />

opening, I-7<br />

overview, 2-9<br />

persistent state, I-14<br />

state codes, I-10, I-12<br />

cluster registry hive, I-9<br />

cluster service modules, I-4<br />

cluster settings<br />

system environment variables, I-2<br />

cluster setup, checking, 4-1<br />

collecting host logs<br />

using host information collector (HIC)<br />

utility, A-7<br />

using MPS utility, A-6<br />

collecting RA logs, A-1, A-3<br />

Collector (See Unisys <strong>SafeGuard</strong> 30m<br />

Collector)<br />

collector directory, A-11<br />

configuration settings, saving, D-2<br />

configuring additional RAs, D-4<br />

configuring the replacement RA, D-6<br />

connecting, accessing the replacement<br />

RA, D-4<br />

connectivity testing tool messages, C-8<br />

converting local time to GMT or UTC, A-3<br />


D<br />

data consistency group<br />

bringing online, 3-3, 4-9<br />

bringing online for ClearPath MCP, 3-5<br />

manual failover, 3-2, 4-8<br />

manual failover for ClearPath MCP, 3-5<br />

recovery tasks, 3-2, 3-5, 4-7<br />

recovery tasks for ClearPath MCP, 3-5<br />

taking offline, 4-7, 5-9<br />

data flow, overview, 2-3<br />

detaching the failed RA, D-3<br />

determining when the failure occurred, A-2<br />

diagnostics<br />

Installation Manager, C-1<br />

RA hardware, B-2<br />

directory<br />

bin, A-14<br />

collector, A-11<br />

etc, A-11<br />

files, A-11<br />

home, A-11, A-14<br />

host log extraction, A-15<br />

InfoCollect, A-12<br />

processes, A-12<br />

rreasons, A-11


sbin, A-12
tmp, A-14
usr, A-13

E

e-mail notifications<br />

configuring a diagnostic e-mail<br />

notification, 2-8<br />

overview, 2-8<br />

enabling PCI-X slot functionality, D-5<br />

environment settings, restoring, D-2<br />

etc directory, A-11<br />

event log, E-1<br />

displaying, E-3<br />

event levels, E-2<br />

event scope, E-2<br />

event topics, E-1<br />

list of Detailed events, E-22<br />

list of Normal events, E-5<br />

overview, 2-7<br />

using for troubleshooting, E-3<br />

events<br />

event log, E-1<br />

understanding, E-1<br />

events that cause journal distribution, 2-10<br />

F<br />

Fabric Splitter, 2-4<br />

Fibre Channel diagnostics<br />

detecting Fibre Channel LUNs, C-13<br />

detecting Fibre Channel Scsi3 Reserved<br />

LUNs, C-15<br />

detecting Fibre Channel targets, C-12<br />

performing I/O to LUN, C-15<br />

running SAN diagnostics, C-9<br />

viewing Fibre Channel details, C-11<br />

Fibre Channel HBA LEDs<br />

location, 8-12<br />

files directory, A-11<br />

full-sweep initialization, 4-4<br />

G<br />

geographic clustered environment<br />

basic configuration diagram, 2-2<br />

definition, 2-1<br />

overview, 2-2<br />

recovery from total failure of one site, 4-19<br />

geographic replication environment, 2-1<br />

definition, 2-1<br />

server failure, 9-20<br />

total storage loss, 5-13<br />

GMT<br />

converting local time to, A-3<br />

example of local time conversion, A-3<br />

group initialization effects on move-group<br />

operation, 4-3<br />


H<br />

HIC (See host information collector (HIC)<br />

utility)<br />

high load<br />

disk manager reports, 10-4<br />

general description, 10-3<br />

home directory, A-11, A-14<br />

host information collector (HIC) utility<br />

overview, 2-9<br />

using, A-7<br />

host logs collection<br />

using host information collector (HIC)<br />

utility, A-7<br />

using MPS utility, A-6<br />

I<br />

InfoCollect directory, A-12<br />

initialization<br />

from marking mode, 4-5<br />

full sweep, 4-4<br />

long resynchronization, 4-4<br />

initiate_failover command, 4-6<br />

Installation Manager<br />

diagnostics, 2-9<br />

Diagnostics menu, 8-17, 8-21, C-2<br />

steps to run, C-2<br />

Installation Manager diagnostics<br />

collect system info, C-18<br />

Fibre Channel diagnostics, C-9<br />

IP diagnostics, C-2<br />

synchronization diagnostics, C-17<br />

installing and configuring the replacement<br />

RA, D-4<br />

IP diagnostics<br />

port diagnostics, C-5<br />

site connectivity tests, C-3<br />

system connectivity, C-6, C-7


test throughput, C-4
view IP details, C-3
view routing table, C-4

K

kutils<br />

command summary, H-2<br />

overview, 2-10<br />

path designations, H-1<br />

string, H-1<br />

using, H-1<br />

L<br />

Local Replication by CDP, 2-5<br />

log extraction directory<br />

host, A-15<br />

RA, A-9<br />

log file, A-10<br />

long resynchronization, 4-4<br />

M<br />

management console<br />

locked user, 8-4<br />

RA attached to cluster, 8-4<br />

understanding access, 8-4<br />

manual failover<br />

data consistency group, 3-2, 4-8<br />

performing, 4-7<br />

performing with data consistency group<br />

(older image), 4-8<br />

quorum consistency groups, 4-14, 4-23<br />

manual failover for ClearPath MCP<br />

data consistency group, 3-5<br />

manual failover of volumes and data<br />

consistency groups<br />

accessing an image, 3-2<br />

marking mode, initializing from, 4-5<br />

MIB<br />

OID Unisys, F-1<br />

RA file, F-3<br />

MIB II, F-1<br />

Microsoft Cluster Service, 2-1<br />

modifying the Preferred RA setting, D-3<br />

move group operation, initialization<br />

effects, 4-3<br />

MPS utility, A-6<br />

MSCS (See Microsoft Cluster Service)<br />

MSCS properties, checking, 4-1<br />


N<br />

network bindings<br />

checking, 4-2<br />

cluster specific, 4-3<br />

host network specific, 4-2<br />

network LEDs<br />

location, 8-11<br />

networking problem<br />

cluster node public NIC failure (geographic<br />

clustered environment), 7-3<br />

management network failure (geographic<br />

clustered environment), 7-11<br />

port information, 7-32<br />

private cluster network failure (geographic<br />

clustered environment), 7-22<br />

public or client WAN failure (geographic<br />

clustered environment), 7-6<br />

replication network failure (geographic<br />

clustered environment), 7-15<br />

temporary WAN failures, 7-21<br />

total communication failure (geographic<br />

clustered environment), 7-26<br />

new for this release, 1-2<br />

P<br />

parameters file, A-9<br />

performance problem<br />

failover time lengthens, 10-5<br />

high load<br />

disk manager, 10-4<br />

distributer, 10-5<br />

slow initialization, 10-2<br />

persistent state key, I-14<br />

port information, 7-32<br />

processes directory, A-12<br />

Q<br />

quorum consistency group<br />

manual failover, 4-14, 4-23



R<br />

RA problem<br />

all RAs at one site fail, 8-25<br />

all RAs not attached, 8-27<br />

all SAN Fibre Channel HBAs fail, 8-14<br />

onboard management network adapter<br />

fails, 8-23<br />

onboard WAN network adapter fails, 8-19<br />

optional Gigabit Fibre Channel WAN<br />

network adapter fails, 8-19<br />

reboot regulation failover, 8-12<br />

single hard disk fails, 8-24<br />

single RA failure, 8-4<br />

single RA failures with switchover, 8-5<br />

single RA failures without switchover, 8-21<br />

single SAN Fibre Channel HBA on one RA<br />

fails, 8-21<br />

rear panel indicators, 8-11<br />

recording group properties and saving<br />

settings, D-2<br />

recovery<br />

all RAs fail on site, 4-11<br />

from site failure, 4-19<br />

from total failure of one site in geographic<br />

clustered environment, 4-19<br />

site 1 failure with quorum owner located<br />

on site 2, 4-25<br />

site 1 failure with quorum resource owned<br />

by site 1, 4-19<br />

using older image, 4-7<br />

recovery tasks<br />

data consistency group, 3-2, 4-7<br />

data consistency group for ClearPath<br />

MCP, 3-5<br />

reformatting the repository volume, 5-8<br />

removing Fibre Channel host bus<br />

adapters, D-4<br />

replacing an RA, D-1<br />

replication appliance (RA)
analyzing logs from, A-8
collecting logs from, A-1
connecting, accessing, D-4
diagnostics, B-2
LCD status messages, B-4
replacing, D-1

replication, reversing direction, 4-10, 4-15<br />

repository volume<br />

not accessible, 5-6<br />

reformatting, 5-8<br />

restoring environment settings, D-2<br />

restoring failover settings, 4-24<br />

restoring group properties, D-8<br />

resynchronization, long, 4-4<br />

rreasons directory, A-11<br />

runCLI file, A-14<br />


S<br />

<strong>SafeGuard</strong> 30m Control<br />

behavior during move group, 4-5<br />

SAN connectivity problem<br />

RAs not accessible to splitter, 6-12<br />

total SAN switch failure (geographic<br />

clustered environment), 6-17<br />

volume not accessible to RAs, 6-3<br />

volume not accessible to splitter, 6-7<br />

saving configuration settings, D-2<br />

sbin directory, A-12<br />

server problem<br />

cluster node failure (geographic clustered environment), 9-2

infrastructure (NTP) server fails, 9-18<br />

server crash or restart, 9-12<br />

server failure (geographic replication environment), 9-20

server HBA fails, 9-17<br />

server unable to connect with SAN, 9-14<br />

unexpected server shutdown because of a<br />

bug check, 9-8<br />

Windows server reboot, 9-3<br />

SNMP traps<br />

configuring and using, F-1<br />

MIB, F-1<br />

resolving issues, F-4<br />

variables and values, F-2<br />

SSH client, using, C-1<br />

state codes, I-10, I-12<br />

storage problem<br />

journal volume not accessible, 5-11<br />

repository volume not accessible, 5-6<br />

storage failure on one site (geographic<br />

clustered environment), 5-16<br />

total storage loss (geographic replicated<br />

environment), 5-13<br />

user or replication volume not<br />

accessible, 5-4<br />

storage-to-RA access, checking, D-5<br />

summary file, A-11<br />

system event log (SEL), clearing, B-1<br />

system status<br />

using CLI commands, 2-8


using the management console, 2-7

T

tar file, A-15<br />

testing FTP connectivity, A-2<br />

tmp directory, A-14<br />

troubleshooting<br />

general procedures, 2-11<br />

recovering from site failure, 4-19<br />

U<br />

Unisys <strong>SafeGuard</strong> 30m Collector, G-1<br />

Collector mode, G-4<br />

adding customer information, G-5<br />

adding scripts, G-12<br />

automatic discovery of RAs, G-4<br />

compressing an SSC file, G-6<br />

configuring component types using<br />

built-ins scripts, G-8<br />

configuring RAs, G-4<br />

configuring SAN switches, G-9<br />

deleting a scheduled SSC file, G-14<br />

deleting scripts, G-12<br />

disabling scripts, G-10<br />

duplicating installation on another<br />

PC, G-6<br />

enabling scripts, G-10<br />

opening an SSC file, G-8<br />

querying a scheduled SSC file, G-14<br />

running all scripts, G-6<br />

running scripts, G-11<br />

scheduling an SSC file, G-13<br />

stopping script execution, G-11<br />

installing, G-1<br />

prior to configuring, G-2<br />

security breach warning, G-3<br />

View mode, G-15<br />

Unisys <strong>SafeGuard</strong> 30m solution<br />

definition, 2-1<br />

unmounting volumes<br />

at production site, 3-4<br />

at remote site, 3-3<br />

unmounting volumes at source site, 3-4<br />

user types, preconfigured for RAs, 2-8<br />

using the SSH client, C-1<br />

using this guide, 1-3<br />

usr directory, A-13<br />

UTC<br />

converting local time to, A-3<br />

example of local time conversion, A-3<br />


V<br />

verify_failover command, 4-6<br />

verifying clock synchronization, D-8<br />

verifying the replacement RA installation, D-7<br />

volumes<br />

unmounting at source site, 3-4<br />

W<br />

WAN bandwidth, verifying, D-7<br />

webdownload/webdownload, 2-8, C-20




© 2008 Unisys Corporation.

All rights reserved.

6872 5688–002
