Unisys SafeGuard Solutions
Troubleshooting Guide

Unisys SafeGuard Solutions Release 6.0

June 2008

6872 5688–002
NO WARRANTIES OF ANY NATURE ARE EXTENDED BY THIS DOCUMENT. Any product or related information described herein is only furnished pursuant and subject to the terms and conditions of a duly executed agreement to purchase or lease equipment or to license software. The only warranties made by Unisys, if any, with respect to the products described in this document are set forth in such agreement. Unisys cannot accept any financial or other responsibility that may be the result of your use of the information in this document or software material, including direct, special, or consequential damages.

You should be very careful to ensure that the use of this information and/or software material complies with the laws, rules, and regulations of the jurisdictions with respect to which it is used.

The information contained herein is subject to change without notice. Revisions may be issued to advise of such changes and/or additions.

Notice to U.S. Government End Users: This is commercial computer software or hardware documentation developed at private expense. Use, reproduction, or disclosure by the Government is subject to the terms of Unisys standard commercial license for the products, and where applicable, the restricted/limited rights provisions of the contract data rights clauses.

Unisys is a registered trademark of Unisys Corporation in the United States and other countries.

All other brands and products referenced in this document are acknowledged to be the trademarks or registered trademarks of their respective holders.
Contents

Section 1. About This Guide
Purpose and Audience .......................................................................... 1–1
Related Product Information ................................................................. 1–1
Documentation Updates ....................................................................... 1–1
What’s New in This Release ................................................................. 1–2
Using This Guide ................................................................................... 1–3

Section 2. Overview
Geographic Replication Environment .................................................... 2–1
Geographic Clustered Environment ...................................................... 2–2
Data Flow .............................................................................................. 2–3
Diagnostic Tools and Capabilities .......................................................... 2–7
Event Log ............................................................................. 2–7
System Status ..................................................................... 2–7
E-mail Notifications .............................................................. 2–8
Installation Diagnostics ........................................................ 2–9
Host Information Collector (HIC) ......................................... 2–9
Cluster Logs ......................................................................... 2–9
Unisys SafeGuard 30m Collector ......................................... 2–9
RA Diagnostics .................................................................... 2–9
Hardware Indicators ............................................................ 2–9
SNMP Support ................................................................... 2–10
kutils Utility ........................................................................ 2–10
Discovering Problems ......................................................................... 2–10
Events That Cause Journal Distribution ............................ 2–10
Troubleshooting Procedures ............................................................... 2–11
Identifying the Main Components and Connectivity of the Configuration ........................................................ 2–11
Understanding the Current State of the System ............... 2–12
Verifying the System Connectivity .................................... 2–12
Analyzing the Configuration Settings ................................ 2–13

Section 3. Recovering in a Geographic Replication Environment
Manual Failover of Volumes and Data Consistency Groups ................. 3–2
Accessing an Image ............................................................ 3–2
Testing the Selected Image at Remote Site ....................... 3–3
Manual Failover of Volumes and Data Consistency Groups for ClearPath MCP Hosts ....................................................................... 3–5
Accessing an Image ............................................................. 3–5
Testing the Selected Image at Remote Site ........................ 3–5

Section 4. Recovering in a Geographic Clustered Environment
Checking the Cluster Setup ................................................................... 4–1
MSCS Properties .................................................................. 4–1
Network Bindings ................................................................. 4–2
Group Initialization Effects on a Cluster Move-Group Operation .......................................................................................... 4–3
Full-Sweep Initialization ........................................................ 4–4
Long Resynchronization ....................................................... 4–4
Initialization from Marking Mode .......................................... 4–5
Behavior of SafeGuard 30m Control During a Move-Group Operation .......................................................................................... 4–5
Recovering by Manually Moving an Auto-Data (Shared Quorum) Consistency Group ............................................................................ 4–7
Taking a Cluster Data Group Offline ..................................... 4–7
Performing a Manual Failover of an Auto-Data (Shared Quorum) Consistency Group to a Selected Image ............... 4–8
Bringing a Cluster Data Group Online and Checking the Validity of the Image .......................................................... 4–9
Reversing the Replication Direction of the Consistency Group ............................................................................... 4–10
Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner) .......... 4–11
Recovery When All RAs Fail on Site 1 (Site 2 Quorum Owner) .......... 4–17
Recovery When All RAs and All Servers Fail on One Site ................... 4–19
Site 1 Failure (Site 1 Quorum Owner) ................................ 4–19
Site 1 Failure (Site 2 Quorum Owner) ................................ 4–25

Section 5. Solving Storage Problems
User or Replication Volume Not Accessible .......................................... 5–4
Repository Volume Not Accessible ....................................................... 5–6
Reformatting the Repository Volume ................................... 5–8
Journal Not Accessible ........................................................................ 5–11
Journal Volume Lost Scenarios ........................................................... 5–13
Total Storage Loss in a Geographic Replicated Environment ............. 5–13
Storage Failure on One Site in a Geographic Clustered Environment .................................................................................... 5–16
Storage Failure on One Site with Quorum Owner on Failed Site ...................................................................... 5–17
Storage Failure on One Site with Quorum Owner on Surviving Site ................................................................ 5–20

Section 6. Solving SAN Connectivity Problems
Volume Not Accessible to RAs .............................................................. 6–3
Volume Not Accessible to SafeGuard 30m Splitter ............................... 6–7
RAs Not Accessible to SafeGuard 30m Splitter .................................. 6–12
Total SAN Switch Failure on One Site in a Geographic Clustered Environment ................................................................... 6–17
Cluster Quorum Owner Located on Site with Failed SAN Switch ..................................................................... 6–18
Cluster Quorum Owner Not on Site with Failed SAN Switch ..................................................................... 6–22

Section 7. Solving Network Problems
Public NIC Failure on a Cluster Node in a Geographic Clustered Environment ..................................................................... 7–3
Public or Client WAN Failure in a Geographic Clustered Environment ..................................................................................... 7–6
Management Network Failure in a Geographic Clustered Environment ................................................................................... 7–11
Replication Network Failure in a Geographic Clustered Environment ................................................................................... 7–15
Temporary WAN Failures .................................................................... 7–21
Private Cluster Network Failure in a Geographic Clustered Environment ................................................................................... 7–22
Total Communication Failure in a Geographic Clustered Environment ................................................................................... 7–26
Port Information .................................................................................. 7–32

Section 8. Solving Replication Appliance (RA) Problems
Single RA Failures ................................................................................. 8–4
Single RA Failure with Switchover ...................................... 8–5
Reboot Regulation ............................................................. 8–12
Failure of All SAN Fibre Channel Host Bus Adapters (HBAs) ........................................................................ 8–14
Failure of Onboard WAN Adapter or Failure of Optional Gigabit Fibre Channel WAN Adapter ............... 8–19
Single RA Failures Without a Switchover ........................................... 8–21
Port Failure on a Single SAN Fibre Channel HBA on One RA .......................................................................... 8–21
Onboard Management Network Adapter Failure .............. 8–23
Single Hard Disk Failure ..................................................... 8–24
Failure of All RAs at One Site .............................................................. 8–25
All RAs Are Not Attached .................................................................... 8–27

Section 9. Solving Server Problems
Cluster Node Failure (Hardware or Software) in a Geographic Clustered Environment ..................................................................... 9–2
Possible Subset Scenarios .................................................. 9–3
Windows Server Reboot ..................................................... 9–3
Unexpected Server Shutdown Because of a Bug Check ............................................................................... 9–8
Server Crash or Restart ...................................................... 9–12
Server Unable to Connect with SAN .................................. 9–14
Server HBA Failure ............................................................. 9–17
Infrastructure (NTP) Server Failure ...................................................... 9–18
Server Failure (Hardware or Software) in a Geographic Replication Environment ................................................................. 9–20

Section 10. Solving Performance Problems
Slow Initialization ................................................................................. 10–2
General Description of High-Load Event ............................................. 10–3
High-Load (Disk Manager) Condition ................................................... 10–4
High-Load (Distributor) Condition ........................................................ 10–5
Failover Time Lengthens ..................................................................... 10–5

Appendix A. Collecting and Using Logs
Collecting RA Logs ............................................................................... A–1
Setting the Automatic Host Info Collection Option ............. A–2
Testing FTP Connectivity .................................................... A–2
Determining When the Failure Occurred ............................ A–2
Converting Local Time to GMT or UTC ............................... A–3
Collecting RA Logs .............................................................. A–3
Collecting Server (Host) Logs ............................................................... A–6
Using the MPS Report Utility .............................................. A–6
Using the Host Information Collector (HIC) Utility .............. A–7
Analyzing RA Log Collection Files ........................................................ A–8
RA Log Extraction Directory ................................................ A–9
tmp Directory .................................................................... A–14
Host Log Extraction Directory ........................................... A–15
Analyzing Server (Host) Logs .............................................................. A–16
Analyzing Intelligent Fabric Switch Logs ............................................ A–16

Appendix B. Running Replication Appliance (RA) Diagnostics
Clearing the System Event Log (SEL) ................................................... B–1
Running Hardware Diagnostics ............................................................ B–2
Custom Test ........................................................................ B–3
Express Test ........................................................................ B–4
LCD Status Messages .......................................................................... B–4

Appendix C. Running Installation Manager Diagnostics
Using the SSH Client ............................................................................ C–1
Running Diagnostics ............................................................................. C–1
IP Diagnostics ...................................................................... C–2
Fibre Channel Diagnostics ................................................... C–9
Synchronization Diagnostics ............................................. C–17
Collect System Info ........................................................... C–18
Appendix D. Replacing a Replication Appliance (RA)
Saving the Configuration Settings ........................................................ D–2
Recording Policy Properties and Saving Settings ................................. D–2
Modifying the Preferred RA Setting ..................................................... D–3
Removing Fibre Channel Adapter Cards ............................................... D–4
Installing and Configuring the Replacement RA ................................... D–4
Cable and Apply Power to the New RA .............................. D–4
Connecting and Accessing the RA ...................................... D–4
Checking Storage-to-RA Access .......................................... D–5
Enabling PCI-X Slot Functionality ......................................... D–5
Configuring the RA .............................................................. D–6
Verifying the RA Installation .................................................................. D–7
Restoring Group Properties .................................................................. D–8
Ensuring the Existing RA Can Switch Over to the New RA ................. D–8

Appendix E. Understanding Events
Event Log .............................................................................................. E–1
Event Topics ........................................................................ E–1
Event Levels ........................................................................ E–2
Event Scope ......................................................................... E–2
Displaying the Event Log ..................................................... E–3
Using the Event Log for Troubleshooting ............................ E–3
List of Events ........................................................................................ E–4
List of Normal Events .......................................................... E–5
List of Detailed Events ...................................................... E–22

Appendix F. Configuring and Using SNMP Traps
Software Monitoring ............................................................................. F–1
SNMP Monitoring and Trap Configuration ............................................ F–3
Installing MIB Files on an SNMP Browser ............................................ F–3
Resolving SNMP Issues ........................................................................ F–4
Appendix G. Using the Unisys SafeGuard 30m Collector
Installing the SafeGuard 30m Collector ................................................ G–1
Before You Begin the Configuration ..................................................... G–2
Handling the Security Breach Warning ................................ G–3
Using Collector Mode ........................................................................... G–4
Getting Started .................................................................... G–4
Understanding Operations in Collector Mode ..................... G–7
Using View Mode ............................................................................... G–15

Appendix H. Using kutils
Usage .................................................................................................... H–1
Path Designations ................................................................................. H–1
Command Summary ............................................................................. H–2
Appendix I. Analyzing Cluster Logs
Introduction to Cluster Logs ................................................................... I–1
Creating the Cluster Log ....................................................... I–2
Understanding the Cluster Log Layout ................................. I–3
Sample Cluster Log ................................................................................ I–5
Posting Information to the Cluster Log ................................. I–5
Diagnosing a Problem Using Cluster Logs ............................................. I–6
Gathering Materials ............................................................... I–7
Opening the Cluster Log ....................................................... I–7
Converting GMT/UTC to Local Time ..................................... I–8
Converting Cluster Log GUIDs to Text Resource Names .................................................................................. I–8
Understanding State Codes ................................................ I–10
Understanding Persistent State .......................................... I–14
Understanding Error and Status Codes ............................... I–15

Index ............................................................................................. 1
Figures

2–1. Basic Geographic Clustered Environment ......................................................... 2–2
2–2. Data Flow ........................................................................................................... 2–3
2–3. Data Flow with Fabric Splitter ............................................................................ 2–5
2–4. Data Flow in CDP ............................................................................................... 2–6
4–1. All RAs Fail on Site 1 (Site 1 Quorum Owner) ................................................. 4–11
4–2. All RAs Fail on Site 1 (Site 2 Quorum Owner) ................................................. 4–17
4–3. All RAs and Servers Fail on Site 1 (Site 1 Quorum Owner) ............................. 4–20
4–4. All RAs and Servers Fail on Site 1 (Site 2 Quorum Owner) ............................. 4–25
5–1. Volumes Tab Showing Volume Connection Errors ............................................ 5–4
5–2. Management Console Messages for the User Volume Not Accessible Problem ......................................................................................................... 5–5
5–3. Groups Tab Shows “Paused by System” .......................................................... 5–5
5–4. Management Console Display: Storage Error and RAs Tab Shows Volume Errors ................................................................................................ 5–7
5–5. Volumes Tab Shows Error for Repository Volume ............................................ 5–7
5–6. Groups Tab Shows All Groups Paused by System ............................................ 5–7
5–7. Management Console Messages for the Repository Volume Not Accessible Problem ....................................................................................... 5–8
5–8. Volumes Tab Shows Journal Volume Error ..................................................... 5–11
5–9. RAs Tab Shows Connection Errors .................................................................. 5–11
5–10. Groups Tab Shows Group Paused by System ................................................. 5–12
5–11. Management Console Messages for the Journal Not Accessible Problem ....................................................................................................... 5–12
5–12. Management Console Volumes Tab Shows Errors for All Volumes ............... 5–14
5–13. RAs Tab Shows Volumes That Are Not Accessible ......................................... 5–14
5–14. Multipathing Software Reports Failed Paths to Storage Device ..................... 5–15
5–15. Storage on Site 1 Fails ..................................................................................... 5–16
5–16. Cluster “Regroup” Process ............................................................................. 5–17
5–17. Cluster Administrator Displays ......................................................................... 5–19
5–18. Multipathing Software Shows Server Errors for Failed Storage Subsystem ................................................................................................... 5–19
6–1. Management Console Showing “Inaccessible Volume” Errors ........................ 6–3
6–2. Management Console Messages for Inaccessible Volumes ............................. 6–3
6–3. Management Console Error Display Screen ...................................................... 6–7
6–4. Management Console Messages for Volumes Inaccessible to Splitter ............ 6–8
6–5. EMC PowerPath Shows Disk Error .................................................................. 6–10
6–6. Management Console Display Shows a Splitter Down ................................... 6–12
6–7. Management Console Messages for Splitter Inaccessible to RA ................... 6–13
6–8. SAN Switch Failure on One Site ...................................................................... 6–17
6–9. Management Console Display with Errors for Failed SAN Switch .................. 6–18
6–10. Management Console Messages for Failed SAN Switch ................................ 6–19<br />
6–11. Management Console Messages for Failed SAN Switch with Quorum<br />
Owner on Surviving Site ............................................................................... 6–23<br />
7–1. <strong>Public</strong> NIC Failure of a Cluster Node .................................................................. 7–3<br />
7–2. <strong>Public</strong> NIC Error Shown in the Cluster Administrator ......................................... 7–5<br />
7–3. <strong>Public</strong> or Client WAN Failure............................................................................... 7–7<br />
7–4. Cluster Administrator Showing <strong>Public</strong> LAN Network Error ................................ 7–8<br />
7–5. Management Network Failure .......................................................................... 7–11<br />
7–6. Management Console Display: “Not Connected” ........................................... 7–13<br />
7–7. Management Console Message for Event 3023 .............................................. 7–13<br />
7–8. Replication Network Failure .............................................................................. 7–15<br />
7–9. Management Console Display: WAN Down .................................................... 7–17<br />
7–10. Management Console Log Messages: WAN Down ........................................ 7–17<br />
7–11. Management Console RAs Tab: All RAs Data Link Down ............................... 7–18<br />
7–12. Private Cluster Network Failure ........................................................................ 7–22<br />
7–13. Cluster Administrator Display with Failures ...................................................... 7–23<br />
7–14. Total Communication Failure ............................................................................ 7–26<br />
7–15. Management Console Display Showing WAN Error ........................................ 7–27<br />
7–16. RAs Tab for Total Communication Failure ........................................................ 7–28<br />
7–17. Management Console Messages for Total Communication Failure ................ 7–28<br />
7–18. Cluster Administrator Showing Private Network Down ................................... 7–31<br />
7–19. Cluster Administrator Showing <strong>Public</strong> Network Down .................................... 7–31<br />
8–1. Single RA Failure ................................................................................................. 8–5<br />
8–2. Sample BIOS Display .......................................................................................... 8–6<br />
8–3. Management Console Display Showing RA Error and RAs Tab......................... 8–7<br />
8–4. Management Console Messages for Single RA Failure with<br />
Switchover...................................................................................................... 8–8<br />
8–5. LCD Display on Front Panel of RA .................................................................... 8–10<br />
8–6. Rear Panel of RA Showing Indicators ............................................................... 8–11<br />
8–7. Location of Network LEDs................................................................................ 8–11<br />
8–8. Location of SAN Fibre Channel HBA LEDs ....................................................... 8–12<br />
8–9. Management Console Display: Host Connection with RA Is Down ................ 8–15<br />
8–10. Management Console Messages for Failed RA (All SAN HBAs Fail) ............... 8–16<br />
8–11. Management Console Showing WAN Data Link Failure .................................. 8–20<br />
8–12. Location of Hard Drive LEDs ............................................................................ 8–25<br />
8–13. Management Console Showing All RAs Down ................................................ 8–26<br />
9–1. Cluster Node Failure ........................................................................................... 9–2<br />
9–2. Management Console Display with Server Error ............................................... 9–4<br />
9–3. Management Console Messages for Server Down ........................................... 9–5<br />
9–4. Management Console Messages for Server Down for Bug Check ................... 9–9<br />
9–5. Management Console Display Showing LA Site Server Down ........................ 9–14<br />
9–6. Management Console Images Showing Messages for Server Unable<br />
to Connect to SAN ....................................................................................... 9–15<br />
9–7. PowerPath Administrator Console Showing Failures ....................................... 9–16<br />
9–8. PowerPath Administrator Console Showing Adapter Failure ........................... 9–17<br />
9–9. Event 1009 Display ........................................................................................... 9–19<br />
I–1. Layout of the Cluster Log .................................................................................... I–3<br />
I–2. Expanded Cluster Hive (in Windows 2000 Server) ............................................ I–10<br />
Tables<br />
2–1. User Types ......................................................................................................... 2–8<br />
2–2. Events That Cause Journal Distribution ........................................................... 2–11<br />
5–1. Possible Storage Problems with Symptoms ..................................................... 5–1<br />
5–2. Indicators and Management Console Errors to Distinguish Different<br />
Storage Volume Failures ................................................................................ 5–3<br />
6–1. Possible SAN Connectivity Problems ................................................................ 6–1<br />
7–1. Possible Networking Problems with Symptoms ............................................... 7–1<br />
7–2. Ports for Internet Communication ................................................................... 7–33<br />
7–3. Ports for Management LAN Communication and Notification ........................ 7–33<br />
7–4. Ports for RA-to-RA Internal Communication .................................................... 7–34<br />
8–1. Possible Problems for Single RA Failure with a Switchover .............................. 8–2<br />
8–2. Possible Problems for Single RA Failure Without a Switchover ........................ 8–3<br />
8–3. Possible Problems for Multiple RA Failures with Symptoms ............................ 8–3<br />
8–4. Management Console Messages Pertaining to Reboots ................................ 8–13<br />
9–1. Possible Server Problems with Symptoms ....................................................... 9–1<br />
10–1. Possible Performance Problems with Symptoms ........................................... 10–1<br />
B–1. LCD Status Messages ....................................................................................... B–5<br />
C–1. Messages from the Connectivity Testing Tool .................................................. C–8<br />
E–1. Normal Events .................................................................................................... E–5<br />
E–2. Detailed Events ................................................................................................ E–23<br />
F–1. Trap Variables and Values .................................................................................. F–2<br />
I–1. System Environment Variables Related to Clustering ........................................ I–2<br />
I–2. Modules of MSCS ............................................................................................... I–4<br />
I–3. Node State Codes ............................................................................................. I–12<br />
I–4. Group State Codes ............................................................................................ I–12<br />
I–5. Resource State Codes ...................................................................................... I–12<br />
I–6. Network Interface State Codes ........................................................................ I–13<br />
I–7. Network State Codes ........................................................................................ I–13<br />
Section 1<br />
About This <strong>Guide</strong><br />
Purpose and Audience<br />
This document presents procedures for problem analysis and troubleshooting of the<br />
Unisys <strong>SafeGuard</strong> 30m solution. It is intended for Unisys service representatives and<br />
other technical personnel who are responsible for maintaining the Unisys <strong>SafeGuard</strong><br />
30m solution installation.<br />
Related Product Information<br />
The methods described in this document are based on support and diagnostic tools that<br />
are provided as standard components of the Unisys <strong>SafeGuard</strong> 30m solution. You can<br />
find additional information about these tools in the following documents:<br />
• Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Planning and Installation <strong>Guide</strong><br />
• Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong><br />
• Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Introduction to Replication Appliance Command Line<br />
Interface (CLI)<br />
• Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Installation <strong>Guide</strong><br />
Note: Review the information in the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Planning and<br />
Installation <strong>Guide</strong> about making configuration changes before you begin troubleshooting<br />
a problem.<br />
Documentation Updates<br />
This document contains all the information that was available at the time of<br />
publication. Changes identified after release of this document are included in problem list<br />
entry (PLE) 18609274. To obtain a copy of the PLE, contact your Unisys service<br />
representative or access the current PLE from the Unisys Product <strong>Support</strong> Web site:<br />
http://www.support.unisys.com/all/ple/18609274<br />
Note: If you are not logged into the Product <strong>Support</strong> site, you will be asked to do so.<br />
What’s New in This Release<br />
Some of the important changes in the 6.0 release are summarized in the following table.<br />
Change: Unisys <strong>SafeGuard</strong> Continuous Data Protection (CDP)<br />
Notes: A Unisys <strong>SafeGuard</strong> Duplex solution that uses one Replication Appliance (RA) cluster to replicate data across the Storage Area Network (SAN).<br />
Change: <strong>Support</strong> for Concurrent Local and Remote (CLR)<br />
Notes: Concurrent Local (CDP) and Concurrent Remote Replication (CRR) of the same production volumes.<br />
Change: <strong>Support</strong> for CLARiiON splitter<br />
Notes: Unisys <strong>SafeGuard</strong> solutions work with the CLARiiON CX3 Series CLARiiON Splitter service to deliver a fully heterogeneous array-based data replication solution that is achieved without the need for host-based agents.<br />
Change: <strong>Support</strong> for Brocade intelligent fabric splitting (multi-VI mode only), using the Brocade 7500 SAN Router<br />
Notes: To support the heterogeneous environment at the switch level, the <strong>SafeGuard</strong> solution supports intelligent fabric splitting with Brocade switches.<br />
Change: <strong>Support</strong> for configurations using a mix of splitters within the same RA cluster and across RA clusters at different sites<br />
Notes: <strong>SafeGuard</strong> solutions can support mixed splitters in a given solution configuration.<br />
Change: Redesign of the Management Console GUI for greater ease-of-use<br />
Notes: The new RA GUI is easier to navigate and clearer to use.<br />
Change: SNMP trap viewer, log collection and analysis, and auto-discovery of <strong>SafeGuard</strong> components in the <strong>SafeGuard</strong> Command Center<br />
Notes: Command Center now provides log collection and automatic discovery of devices.<br />
Using This <strong>Guide</strong><br />
This guide offers general information in the first four sections. Read Section 2 to<br />
understand the overall approach to troubleshooting and to gain an understanding of the<br />
Unisys <strong>SafeGuard</strong> 30m solution architecture.<br />
Section 3 describes recovery in a geographic replication environment, and Section 4<br />
offers information and recovery procedures for geographic clustered environments.<br />
Sections 5 through 10 group potential problems into categories and describe the<br />
problems. You must recognize symptoms, identify the problem or failed component, and<br />
then decide what to do to correct the problem. Sections 5 through 10 include a table at<br />
the beginning of each section that lists symptoms and potential problems.<br />
Each problem is then presented in the following format:<br />
• Problem Description: Description of the problem<br />
• Symptoms: List of symptoms that are typical for this problem<br />
• Actions to Resolve the Problem: Steps recommended to solve the problem<br />
The appendixes provide information about using tools and offer reference information<br />
that you might find useful in different situations.<br />
Section 2<br />
Overview<br />
The Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> are flexible, integrated business continuance solutions<br />
especially suitable for protecting business-critical application environments. The Unisys<br />
<strong>SafeGuard</strong> 30m solution provides two distinct functions that act in concert: replication of<br />
data and automated application recovery through clustering over great distances.<br />
Typically, the Unisys <strong>SafeGuard</strong> 30m solution is implemented in one of these<br />
environments:<br />
• Geographic replication environment: In this replication environment, data from<br />
servers at one site are replicated to a remote site.<br />
• Geographic clustered environment: In this replication environment, Microsoft Cluster<br />
Service (MSCS) is installed on servers that span sites and that participate in one<br />
cluster. The use of a Unisys <strong>SafeGuard</strong> 30m Control resource allows automated<br />
failover and recovery by controlling the replication direction with a MSCS resource.<br />
The resource is used in this environment only.<br />
Geographic Replication Environment<br />
Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> supports replication of data over Fibre Channel to local SAN-attached<br />
storage and over WAN to remote sites. It also allows failover to a secondary site so that<br />
operations can continue in the event of a disaster at the primary site.<br />
Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> replicates data over any distance:<br />
• within the same site (CDP), or<br />
• to another site halfway around the globe (CRR), or<br />
• both (CLR).<br />
Geographic Clustered Environment<br />
In the geographic clustered environment, MSCS and cluster nodes are part of the<br />
environment. Figure 2–1 illustrates a basic geographic clustered environment that<br />
consists of two sites. In addition to server clusters, the typical configuration is made up<br />
of an RA cluster (RA 1 and RA 2) at each of the two sites. However, multiple RA cluster<br />
configurations are also possible.<br />
Note: The dashed lines in Figure 2–1 represent the server WAN connections. To<br />
simplify the view, redundant and physical connections are not shown.<br />
Figure 2–1. Basic Geographic Clustered Environment<br />
Data Flow<br />
Figure 2–2 shows the data flow in the basic system configuration for data written by the<br />
server. The system replicates the data in snapshot replication mode to a remote site.<br />
The data flow is divided into the following segments: write, transfer, and distribute.<br />
Figure 2–2. Data Flow<br />
Write<br />
The flow of data for a write transaction is as follows:<br />
1. The host writes data to the splitter (either on the host or the fabric) that immediately<br />
sends it to the RA and to the production site replication volume (storage system).<br />
2. After receiving the data, the RA returns an acknowledgement (ACK) to the splitter.<br />
The storage system returns an ACK after successfully writing the data to storage.<br />
3. The splitter sends an ACK to the host that the write operation has been completed<br />
successfully.<br />
In snapshot replication mode, this sequence of events (steps 1 to 3) can be repeated<br />
multiple times before the snapshot is closed.<br />
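The three-step write sequence above can be sketched as a small, single-threaded simulation. This is an illustration of the ACK ordering only, using invented names; it is not the actual splitter or RA implementation, which is distributed and asynchronous:<br />

```python
# Illustrative simulation of the split-write sequence (invented names).

class RA:
    """Receives the split copy and accumulates it until snapshot close."""
    def __init__(self):
        self.snapshot = []
    def receive(self, data):
        self.snapshot.append(data)
        return True  # step 2: the RA ACKs the splitter on receipt

class Storage:
    """Production-site replication volume."""
    def __init__(self):
        self.blocks = []
    def write(self, data):
        self.blocks.append(data)
        return True  # step 2: storage ACKs after a successful write

def host_write(data, ra, storage):
    # Step 1: the splitter forwards the write to both the RA and storage.
    ra_ack = ra.receive(data)
    storage_ack = storage.write(data)
    # Step 3: the splitter ACKs the host only after both ACKs arrive.
    return ra_ack and storage_ack

ra, storage = RA(), Storage()
for block in ("w1", "w2", "w3"):  # steps 1-3 repeat until the snapshot closes
    assert host_write(block, ra, storage)
print(ra.snapshot)  # ['w1', 'w2', 'w3']
```

In snapshot mode, the accumulated writes are then processed and transferred as a unit, as described under "Transfer."<br />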
Transfer<br />
The flow of data for transfer is as follows:<br />
1. After processing the snapshot data (that is, applying the various compression<br />
techniques), the RA sends the snapshot over the WAN to its peer RA at the remote<br />
site.<br />
2. The RA at the remote site writes the snapshot to the journal. At the same time, the<br />
remote RA returns an ACK to its peer at the production site.<br />
Note: Alternatively, you can set an advanced policy parameter so that lag is<br />
measured to the journal. In that case, the RA at the target site returns an ACK to its<br />
peer at the source site only after it receives an ACK from the journal (step 3).<br />
3. After the complete snapshot is written to the journal, the journal returns an ACK to<br />
the RA.<br />
Distribute<br />
When possible, and unless instructed otherwise, the Unisys <strong>SafeGuard</strong> 30m solution<br />
proceeds at first opportunity to “distribute” the image to the appropriate location on the<br />
storage system at the remote site. The logical flow of data for distribution is as follows:<br />
1. The remote RA reads the image from the journal.<br />
2. The RA reads existing information from the relevant remote replication volume.<br />
3. The RA writes “undo” information (that is, information that can support a rollback, if<br />
necessary) to the journal.<br />
Note: Steps 2 and 3 are skipped when the maximum journal lag policy parameter<br />
causes distribution to operate in fast-forward mode.<br />
(See the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong> for<br />
more information.)<br />
4. The RA writes the image to the appropriate remote replication volume.<br />
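The distribute sequence can be sketched the same way. The undo record written in step 3 is what makes a later rollback possible, and fast-forward mode simply skips steps 2 and 3. The data structures below (a journal as a list of images, a volume as a dict of block contents) are assumptions for illustration only:<br />

```python
# Illustrative distribution step (invented data layout).

def distribute(journal, volume, fast_forward=False):
    image = journal.pop(0)  # step 1: read the image from the journal
    if not fast_forward:
        # step 2: read the existing data that the image will overwrite
        undo = {block: volume.get(block) for block in image}
        # step 3: write the undo record, enabling a rollback if necessary
        journal.append(("undo", undo))
    volume.update(image)    # step 4: write the image to the volume

journal = [{"lun0": "new"}]
volume = {"lun0": "old"}
distribute(journal, volume)
print(volume)   # {'lun0': 'new'}
print(journal)  # [('undo', {'lun0': 'old'})]
```

With `fast_forward=True`, no undo record is kept, which is why images distributed in fast-forward mode cannot be rolled back.<br />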
Alternatives to the Basic System Architecture<br />
The following are derivatives of the basic system architecture:<br />
Fabric Splitter<br />
An intelligent fabric switch can perform the splitting function instead of a Unisys<br />
<strong>SafeGuard</strong> <strong>Solutions</strong> host-based Splitter installed on the host. In this case, the host<br />
sends a single write transaction to the switch on its way to storage. At the switch,<br />
however, the message is split, with a copy sent also to RA (as shown in Figure 2–3). The<br />
system behaves the same way as it does when using a Unisys <strong>SafeGuard</strong> <strong>Solutions</strong><br />
host-based splitter on the host to perform the splitting function.<br />
Figure 2–3. Data Flow with Fabric Splitter<br />
Local Replication by CDP<br />
You can use CDP to perform replication over short distances—that is, to replicate<br />
storage at the same site as CRR does over long distances. Operation of the system is<br />
similar to CRR including the ability to use the journal to recover from a corrupted data<br />
image, and the ability, if necessary, to fail over to the remote side or storage pool. In<br />
Figure 2–4, there is no WAN, the storage pools are part of the storage at the same site,<br />
and the same RA appears in each of the segments.<br />
Figure 2–4. Data flow in CDP<br />
Note: The repository volume must belong to the remote-side storage pool. Unisys<br />
<strong>SafeGuard</strong> <strong>Solutions</strong> supports a simultaneous mix of groups for remote and local<br />
replication. Individual volumes and groups, however, must be designated for either<br />
remote or local replication, but not for both. Certain policy parameters do not apply for<br />
local replication by CDP.<br />
Single RA<br />
Note: Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> does not support a single-RA configuration,<br />
whether at both sites or at a single site.<br />
Diagnostic Tools and Capabilities<br />
The Unisys <strong>SafeGuard</strong> 30m solution offers the following tools and capabilities to help<br />
you diagnose and solve problems.<br />
Event Log<br />
The replication capability of the Unisys <strong>SafeGuard</strong> 30m solution records log entries in<br />
response to a wide range of predefined events. The event log records all significant<br />
events that have recently occurred in the system. Appendix E lists and explains the<br />
events.<br />
Each event is classified by an event ID. The event ID can be used to help analyze or<br />
diagnose system behavior, including identifying the trigger for a rolling problem,<br />
understanding a sequence of events, and examining whether the system performed the<br />
correct set of actions in response to a component failure.<br />
You can monitor system behavior by viewing the event log through the management<br />
console, by issuing CLI commands, or by reading RA logs. The exact period of time<br />
covered by the log varies according to the operational state of the environment during<br />
that period or, in the case of RA logs, the time period that was specified. The capacity of<br />
the event log is 5000 events.<br />
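Because the log keeps only the most recent 5000 events, older entries are discarded as new ones arrive. A bounded ring buffer, as sketched below, models that behavior; this is an analogy for capacity planning, not the RA's actual storage format:<br />

```python
from collections import deque

# A deque with maxlen silently drops the oldest entry once full,
# mirroring the event log's fixed 5000-event capacity.
event_log = deque(maxlen=5000)

for event_id in range(6000):          # placeholder event IDs
    event_log.append({"id": event_id})

print(len(event_log))      # 5000
print(event_log[0]["id"])  # 1000 -- the oldest 1000 events were dropped
```

The practical consequence: on a busy system, events you need for diagnosis can age out, which is one reason to configure the daily e-mail summary described below.<br />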
For problems that are not readily apparent and for situations that you are monitoring for<br />
failure, you can configure an e-mail notification to send all logs to you in a daily summary.<br />
Once you resolve the problem, you can remove the event notifications. See “Configuring<br />
a Diagnostic E-mail Notification” in this section to configure a daily summary of events.<br />
System Status<br />
The management console displays an immediate indication of any problem that<br />
interferes with normal operation of the Unisys <strong>SafeGuard</strong> 30m environment. If a<br />
component fails, the indication is accompanied by an error message that provides<br />
detailed information about the failure.<br />
You must log in to the management console to monitor the environment and to view<br />
events. The RAs are preconfigured with the users defined in Table 2–1.<br />
Table 2–1. User Types<br />
User / Initial Password / Permissions<br />
boxmgmt / boxmgmt / Install<br />
admin / admin / All except install and webdownload<br />
monitor / monitor / Read only<br />
webdownload / webdownload / webdownload<br />
SE / Unisys(CSC) / All except install and webdownload<br />
Note: The password boxmgmt is not used to log in to the management console; it is<br />
only used for SSH sessions.<br />
The CLI provides all users with status commands for the complete set of Unisys<br />
<strong>SafeGuard</strong> 30m components. You can use the information and statistics provided by<br />
these commands to identify bottlenecks in the system.<br />
E-mail Notifications<br />
The e-mail notification mechanism sends specified event notifications (or alerts) to<br />
designated individuals. Also, you can set up an e-mail notification for once a day that<br />
contains a daily summary of events.<br />
Configuring a Diagnostic E-mail Notification<br />
1. From the management console, click Alert Settings on the System menu.<br />
2. Under Rules, click Add.<br />
3. Using the diagnostic rule, select the appropriate topic, level, and type options.<br />
Diagnostic Rule<br />
This rule sends all messages on a daily basis to personnel of your choice.<br />
Topics: All Topics<br />
Level: Information<br />
Scope: Detailed<br />
Type: Daily<br />
4. Under Addresses, click Add.<br />
5. In the New Address box, type the e-mail address to which you would like event<br />
notifications sent. You can specify more than one e-mail address.<br />
6. Click OK.<br />
7. Repeat steps 4 through 6 for each additional e-mail recipient.<br />
8. Click OK.<br />
9. Click OK.<br />
Installation Diagnostics<br />
The Diagnostics menu of the Installation Manager provides a suite of diagnostic tools for<br />
testing the functionality and connectivity of the installed RAs and Unisys <strong>SafeGuard</strong> 30m<br />
components. Appendix C explains how to use the Installation Manager diagnostics.<br />
Installation Manager is also used to collect RA logs and host splitter logs from one<br />
centralized location. See Appendix A for more information about collecting logs.<br />
Host Information Collector (HIC)<br />
Cluster Logs<br />
The HIC collects extensive information about the environment, operation, and<br />
performance of any server on which a splitter has been installed. You can use the<br />
Installation Manager to collect logs across the entire environment including RAs and all<br />
servers on which the HIC feature is enabled. The HIC can also be used at the server. See<br />
Appendix A for more information about collecting logs.<br />
In a geographic clustered environment, MSCS maintains logs of events for the clustered<br />
environment. Analyzing these logs is helpful in diagnosing certain problems. Appendix I<br />
explains how to analyze these logs.<br />
Unisys <strong>SafeGuard</strong> 30m Collector<br />
The Unisys <strong>SafeGuard</strong> 30m Collector utility enables you to easily collect various pieces of<br />
information about the environment that can help in solving problems. Appendix G<br />
describes this utility.<br />
RA Diagnostics<br />
Diagnostics specific to the RAs are available to aid in identifying problems. Appendix B<br />
explains how to use the RA diagnostics.<br />
Hardware Indicators<br />
Hardware problems—for example, RA disk failures or RA power problems—are<br />
identified by status LEDs located on the RAs themselves. Several indicators are<br />
explained in Section 8, “Solving Replication Appliance (RA) Problems.”<br />
SNMP <strong>Support</strong><br />
The RAs support monitoring and problem notification using standard SNMP, including<br />
support for SNMPv3. You can issue SNMP queries to the agent on the RA. Also, you can<br />
configure the environment such that events generate SNMP traps that are then sent to<br />
designated hosts. Appendix F explains how to configure and use SNMP traps.<br />
kutils Utility<br />
The kutils utility is a proprietary server-based program that enables you to manage server<br />
splitters across all platforms. The command-line utility is installed automatically when the<br />
Unisys <strong>SafeGuard</strong> 30m splitter is installed on the application server. If the splitting<br />
function is not on a host but rather is on an intelligent switch, the kutils utility is copied<br />
from the Splitter CD-ROM. (See the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Planning and Installation<br />
<strong>Guide</strong> for more information.)<br />
Appendix H explains some kutils commands that are helpful in troubleshooting<br />
problems. See the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator’s<br />
<strong>Guide</strong> for complete reference information on the kutils utility.<br />
Discovering Problems<br />
Symptoms of problems and notifications occur in various ways with the Unisys<br />
<strong>SafeGuard</strong> 30m solution. The tools and capabilities described previously provide<br />
notifications for some conditions and events. Other problems are recognized from<br />
failures. Problems might be noted in the following ways:<br />
• Problems with data because of a rolling disaster, which means that the site needs to<br />
use a previous snapshot to recover<br />
• Problems with applications failing<br />
• Inability to switch processing to the remote or secondary site<br />
• Problems with the MSCS cluster (such as a failover to another cluster or site)<br />
• Problems reported in an e-mail notification from an RA<br />
• Problem reported in an SNMP trap notification<br />
• Problems listed on the management console as reported in the overall system status<br />
or in group state or properties<br />
• Problems reported in the daily summary of events<br />
In this guide, symptoms and notifications are often listed with potential problems.<br />
However, the messages and notifications vary based on the problem, and multiple<br />
events and notifications are possible at any given time.<br />
Events That Cause Journal Distribution<br />
Certain conditions might occur that can prevent access to the expected journal image.<br />
For instance, images might be flushed or distributed so that they are not available. Table<br />
2–2 lists events that might cause the images to be unavailable. For tables listing all<br />
events, see Appendix E.<br />
Table 2–2. Events That Cause Journal Distribution<br />
Event 4042 (Info, Detailed): Group deactivated. (Group , RA )<br />
Trigger: A user action deactivated the group.<br />
Event 4062 (Info, Detailed): Access enabled to latest image. (Group , Failover site )<br />
Trigger: Access was enabled to the latest image during automatic failover.<br />
Event 4097 (Warning, Detailed): Maximum journal lag exceeded. Distribution in fast-forward—older images removed from journal. (Group )<br />
Trigger: Fast-forward action started and caused the snapshots taken before the fast-forward action to be lost and the maximum journal lag to be exceeded.<br />
Event 4099 (Info, Detailed): Initializing in long resynchronization mode. (Group )<br />
Trigger: The system started a long resynchronization.<br />
<strong>Troubleshooting</strong> Procedures<br />
For troubleshooting, you must differentiate between two kinds of problems: those that<br />
arise from environmental changes, such as network changes (cabling, routing, and port<br />
blocking), changes related to zoning, logical unit number (LUN) masking, other devices<br />
in the SAN, and storage failures; and those that arise from misconfiguration or internal<br />
errors in the environmental setup.<br />
Refer to the preceding diagrams as you consider the general troubleshooting procedures<br />
that follow. Use the following four general tasks to help you identify symptoms and<br />
causes whenever you encounter a problem.<br />
Identifying the Main Components and Connectivity of the<br />
Configuration<br />
Knowledge of the main system components and the connectivity between these<br />
components is a key to understanding how the entire environment operates. This<br />
knowledge helps you understand where the problem exists in the overall system context<br />
and can help you correctly identify which components are affected.<br />
Identify the following components:<br />
• Storage device, controller, and the configuration of connections to the Fibre Channel<br />
(FC) switch<br />
• Switch and port types, and their connectivity<br />
• Network configuration (WAN and LAN): IP addresses, routing schemes, subnet<br />
masks, and gateways<br />
• Participating servers: operating system, host bus adapters (HBAs), connectivity to<br />
the FC switch<br />
• Participating volumes: repository volumes, journal volumes, and replication volumes<br />
Understanding the Current State of the System<br />
Use the management console and the CLI get commands to understand the current<br />
state of the system:<br />
• Is any component shown to be in an error state? If so, what is the error? Is the<br />
component down or disconnected from other components?<br />
• What is the state of the groups, splitters, volumes, transfer, and distribution?<br />
• Is the current state stable or changing within intervals of time?<br />
Verifying the System Connectivity<br />
To verify the system connectivity, use physical and tool-based verification methods to<br />
answer the following questions:<br />
• Are all the components physically connected? Are the activity or link lights active?<br />
• Are the components connected to the correct switch or switches? Are they<br />
connected to the correct ports?<br />
• Is there connectivity over the WAN between all appliances? Is there connectivity<br />
between the appliances on the same site over the management network?<br />
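A quick way to probe the WAN and management-LAN reachability questions above is a short TCP connectivity check. The addresses and port below are hypothetical; substitute the actual RA addresses for your installation and the ports listed in Tables 7–2 through 7–4:<br />

```python
import socket

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unroutable
        return False

# Hypothetical RA management addresses and port; substitute your own.
for ra in ("10.0.0.11", "10.0.0.12"):
    print(ra, tcp_reachable(ra, 443, timeout=0.5))
```

This checks only TCP reachability; blocked ports, one-way routing, and MTU problems still require the connectivity testing tool described in Appendix C.<br />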
Analyzing the Configuration Settings<br />
Many problems occur because of improper configuration settings such as improper<br />
zoning. Analyze the configuration settings to ensure they are not the cause of the<br />
problem.<br />
• Are the zones properly configured?<br />
− Splitter-to-storage?<br />
− Splitter-to-RA?<br />
− RA-to-storage?<br />
− RA-to-RA?<br />
• Are the zones in the switch config?<br />
• Has the proper switch config been applied?<br />
• Are the LUNs properly masked?<br />
− Is the splitter masked to see only the relevant replication volume or volumes?<br />
− Are the RAs masked to see the relevant replication volume or volumes,<br />
repository volume, and journal volume or volumes?<br />
• Are the network settings (such as gateway) for the RAs correct?<br />
• Are there any possible IP conflicts on the network?<br />
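The LUN-masking questions above can be codified as a quick consistency check against your configuration records. The volume names and the layout of the `visible` map below are invented for illustration; the rules they encode are the ones in the checklist:<br />

```python
# Hypothetical volume names; replace with your configuration records.
replication = {"repl_vol1", "repl_vol2"}
repository = {"repo_vol"}
journal = {"journal_vol1"}

# Which LUNs each component can actually see after masking.
visible = {
    "splitter": {"repl_vol1", "repl_vol2"},
    "RA1": {"repl_vol1", "repl_vol2", "repo_vol", "journal_vol1"},
    "RA2": {"repl_vol1", "repl_vol2", "repo_vol", "journal_vol1"},
}

problems = []
# The splitter should be masked to see only the replication volumes.
extra = visible["splitter"] - replication
if extra:
    problems.append("splitter sees non-replication LUNs: %s" % sorted(extra))
# Each RA should see the replication, repository, and journal volumes.
for ra in ("RA1", "RA2"):
    missing = (replication | repository | journal) - visible[ra]
    if missing:
        problems.append("%s cannot see required LUNs: %s" % (ra, sorted(missing)))

print(problems or "masking looks consistent")
```

An empty `problems` list means the masking matches the checklist; any entry points at the component and LUNs to re-examine in the zoning or masking configuration.<br />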
Section 3<br />
Recovering in a Geographic Replication<br />
Environment<br />
This section provides recovery procedures so that user applications can be online as<br />
quickly as possible in a geographic replication environment.<br />
An older image might be required to recover from a rolling disaster, human error, a virus,<br />
or any other failure that corrupts the latest snapshot image. Ensure that the image is<br />
tested prior to reversing direction.<br />
Complete the procedures for each group that needs to be moved based on the type of<br />
hosts in the environment:<br />
• Manual Failover of Volumes and Data Consistency Groups<br />
• Manual Failover of Volumes and Data Consistency Groups for ClearPath MCP Hosts<br />
Refer to the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong> for<br />
more information on logged and virtual (with roll or without roll) access modes. For<br />
specific environments, refer to the best practices documents listed under <strong>SafeGuard</strong><br />
<strong>Solutions</strong> documentation on the Unisys Product <strong>Support</strong> Web site,<br />
www.support.unisys.com<br />
Manual Failover of Volumes and Data Consistency<br />
Groups<br />
When you need to perform a manual failover of volumes and data consistency groups,<br />
complete the following tasks:<br />
1. Accessing an image<br />
2. Testing the selected image<br />
Accessing an Image<br />
1. From the Management Console, select any one of the data consistency groups<br />
on the navigation pane.<br />
2. Select the Status tab (if it is not already open).<br />
3. Perform the following steps to allow access to the target image:<br />
a. Right-click the Consistency Group and select Pause Transfer. Click Yes<br />
when the system prompts that the group activity will be paused.<br />
b. Right-click the Consistency Group and scroll down.<br />
c. Select the Remote Copy name and click Enable Image Access.<br />
The Enable Image Access dialog box appears.<br />
d. Choose Select an image from the list and click Next.<br />
The Select Explicit Image dialog box appears and displays the available<br />
images.<br />
e. Select the desired image from the list and click Next.<br />
The Image Access Mode dialog box appears.<br />
f. Select the option Logged access (physical) and click Next.<br />
The Summary screen displays the Image name and the Image Access mode.<br />
g. Click Finish.<br />
Note: This process might take a long time to complete depending on the value<br />
of the journal lag setting in the group policy of the consistency group. The<br />
following message appears during the process:<br />
Enabling log access<br />
h. Verify the target image name displayed below the bitmap in the components<br />
pane under the Status tab.<br />
Transfer:Paused displays at the bottom in the Status tab under the<br />
components pane.<br />
Testing the Selected Image at Remote Site<br />
Perform the following steps to test the selected image at the remote site:<br />
1. Run the following batch file to mount a volume at the remote site. If necessary,<br />
modify the program files\kdriver path to fit your environment.<br />
@echo off<br />
cd "c:\program files\kdriver\kutils"<br />
"c:\program files\kdriver\kutils\kutils.exe" umount e:<br />
"c:\program files\kdriver\kutils\kutils.exe" mount e:<br />
2. Repeat step 1 for all volumes in the group.<br />
3. Ensure that the selected image is valid; that is, verify that<br />
• All applications start successfully using the selected image.<br />
• The data in the image is consistent and valid.<br />
For example, you might want to test whether you can start a database application on<br />
this image. You might also want to run proprietary test procedures to validate the<br />
data.<br />
4. Skip to “Unmounting the Volumes at Production Site and Reversing Replication<br />
Direction” if you have tested the validity of the image and the test is successful. If<br />
the test is unsuccessful, continue with step 5.<br />
5. To test a different image, perform the procedure “Unmounting the Volumes and<br />
Disabling the Image Access at Remote Site.”<br />
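The batch files in these procedures must be run once per volume in the group. As a convenience, the per-volume command lines can be generated in one pass. The following is a minimal sketch: the kutils.exe path matches the batch files above, but the helper functions and the drive letters passed to them are illustrative assumptions, not part of the product.<br />

```python
# Sketch: generate the per-volume kutils command lines used in the batch
# files above, instead of editing a batch file by hand for each volume.
# The kutils.exe path matches the batch files; the drive letters passed in
# are hypothetical examples.
KUTILS = r'"c:\program files\kdriver\kutils\kutils.exe"'

def mount_commands(drives):
    """Remount each volume at the remote site (umount, then mount)."""
    cmds = []
    for d in drives:
        cmds.append(f"{KUTILS} umount {d}:")
        cmds.append(f"{KUTILS} mount {d}:")
    return cmds

def unmount_commands(drives):
    """Flush the file system, then unmount, before choosing another image."""
    cmds = []
    for d in drives:
        cmds.append(f"{KUTILS} flushFS {d}:")
        cmds.append(f"{KUTILS} umount {d}:")
    return cmds

# Each generated line corresponds to one line of the batch files above.
for line in mount_commands(["e"]):
    print(line)
```

Writing the generated lines to a .bat file reproduces step 1 for every volume in the group in a single run.<br />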
Unmounting the Volumes and Disabling the Image Access at Remote<br />
Site<br />
1. Before choosing another image, unmount the volume using the following batch file.<br />
If necessary, modify the program files/kdriver path to fit your environment.<br />
@echo off<br />
cd "c:\program files\kdriver\kutils"<br />
"c:\program files\kdriver\kutils\kutils.exe" flushFS e:<br />
"c:\program files\kdriver\kutils\kutils.exe" umount e:<br />
2. Repeat step 1 for all volumes in the group.<br />
3. Select one of the Consistency Groups in the navigation pane on the<br />
Management Console.<br />
4. Right-click the Consistency Group and scroll down.<br />
5. Select the Remote Copy name and click Disable Image Access.<br />
6. Click Yes when the system prompts you to ensure that all group volumes are<br />
unmounted.<br />
7. Repeat the procedures “Accessing an Image” and “Testing the Selected Image at<br />
the Remote Site”.<br />
Unmounting the Volumes at Production Site and Reversing<br />
Replication Direction<br />
Perform these steps at the host:<br />
1. To unmount a volume at the production site, run the following batch file. If<br />
necessary, modify the program files\kdriver path to fit your environment.<br />
@echo off<br />
cd "c:\program files\kdriver\kutils"<br />
"c:\program files\kdriver\kutils\kutils.exe" flushFS e:<br />
"c:\program files\kdriver\kutils\kutils.exe" umount e:<br />
2. Repeat step 1 for all volumes in the group.<br />
Perform these steps on the Management Console:<br />
1. Select a Consistency Group from the navigation pane.<br />
2. Right-click the Group and select Pause Transfer. Click Yes when the system<br />
prompts that the group activity will be paused.<br />
3. Click the Status tab. The status of the transfer must display Paused.<br />
4. Right-click the Consistency group and select Failover to the remote site.<br />
5. Click Yes when the system prompts you to confirm failover.<br />
6. Ensure that the Start data transfer immediately check box is selected.<br />
The following warning message appears:<br />
Warning: Journal will be erased. Do you wish to continue?<br />
7. Click Yes to continue.<br />
Manual Failover of Volumes and Data Consistency<br />
Groups for ClearPath MCP Hosts<br />
When you need to perform a manual failover of volumes and data consistency groups,<br />
complete the following tasks:<br />
1. Accessing an image<br />
2. Testing the selected image<br />
Note: For ClearPath MCP hosts, close and free units at the remote site before<br />
completing the following procedures. This action prevents SCSI Reserved errors from<br />
being logged for units that are no longer accessible.<br />
Accessing an Image<br />
Quiesce any databases before accessing an image. Once the pack has failed over<br />
and has been acquired, resume the databases.<br />
If the volumes to be failed over are not in use by a database, issue the CLOSE PK<br />
command from the operator display terminal (ODT) to close the<br />
volumes.<br />
For more information on how to access an image, refer to the procedure<br />
“Accessing an Image” under “Manual Failover of Volumes and Data Consistency<br />
Groups.”<br />
Testing the Selected Image at Remote Site<br />
1. Mount a volume at the remote site by issuing the ACQUIRE PK <br />
command from the remote site ODT to acquire the unit. Also acquire any controls<br />
necessary to access the unit if these controls are not automatically acquired.<br />
Verify that the MCP can access the volume using commands such as SC– and P PK<br />
to display the status of the peripherals.<br />
2. Repeat step 1 for all volumes in the group.<br />
3. Ensure that the selected image is valid; that is, verify that<br />
• All applications start successfully using the selected image.<br />
• The data in the image is consistent and valid.<br />
For example, you might want to test whether you can start a database application on<br />
this image. You might also want to run proprietary test procedures to validate the<br />
data.<br />
4. If you tested the validity of the image and the test completed successfully, skip to<br />
“Unmounting the Volumes at Source Site and Reversing Replication Direction.” If the<br />
testing is not successful, continue with step 5.<br />
5. To test a different image, perform the procedure “Unmounting the Volumes and<br />
Disabling the Image Access at Remote Site.”<br />
Unmounting the Volumes and Disabling the Image Access at Remote<br />
Site<br />
1. Before choosing another image, unmount the volume by issuing the CLOSE PK<br />
command followed by the FREE PK command from<br />
the ODT. Verify that the units are closed and freed using peripheral status<br />
commands.<br />
2. Repeat step 1 for all volumes in the group.<br />
3. Click the Status tab. The status of the transfer must display Paused.<br />
4. Right-click the Consistency group and select Failover to the remote site.<br />
5. Click Yes when the system prompts you to confirm failover.<br />
6. Ensure that the Start data transfer immediately check box is selected.<br />
The following warning message appears:<br />
Warning: Journal will be erased. Do you wish to continue?<br />
7. Click Yes to continue.<br />
Unmounting the Volumes at Source Site and Reversing Replication<br />
Direction<br />
Perform these steps at the source site host:<br />
1. Unmount a volume at the source site by issuing the CLOSE PK <br />
command followed by the FREE PK command from the ODT to close<br />
and free the volume.<br />
If the site is down when the host is recovered, use the FREE PK <br />
command to free the original source units. In response to inquiry commands, the<br />
status of the original source units is “closed.” Free the units to prevent access by<br />
the original source site host.<br />
2. Repeat step 1 for all volumes in the group.<br />
3. Select a Consistency Group from the navigation pane.<br />
4. Right-click the Group and select Pause Transfer. Click Yes when the system<br />
prompts that the group activity will be paused.<br />
5. Click the Status tab. The status of the transfer must display Paused.<br />
6. Right-click the Consistency group and select Failover to the remote site.<br />
7. Click Yes when the system prompts you to confirm failover.<br />
8. Ensure that the Start data transfer immediately check box is selected.<br />
The following warning message appears:<br />
Warning: Journal will be erased. Do you wish to continue?<br />
9. Click Yes to continue.<br />
Section 4<br />
Recovering in a Geographic Clustered<br />
Environment<br />
This section provides information and procedures that relate to geographic clustered<br />
environments running Microsoft Cluster Service (MSCS).<br />
Checking the Cluster Setup<br />
To ensure that the cluster configuration is correct, check the MSCS properties and the<br />
network bindings. For more detailed information, refer to “<strong>Guide</strong> to Creating and<br />
Configuring a Server Cluster under Windows Server 2003”, which you can download at<br />
http://www.microsoft.com/downloads/details.aspx?familyid=96F76ED7-9634-4300-<br />
9159-89638F4B4EF7&displaylang=en<br />
MSCS Properties<br />
To check the MSCS properties, enter the following command from the command<br />
prompt:<br />
Cluster /prop<br />
Output similar to the following is displayed:<br />
T Cluster Name Value<br />
-- -------------------- ------------------------------ -----------------------<br />
M AdminExtensions {4EC90FB0-D0BB-11CF-B5EF-0A0C90AB505}<br />
D DefaultNetworkRole 2 (0x2)<br />
S Description<br />
B Security 01 00 14 80 ... (148 bytes)<br />
B Security Descriptor 01 00 14 80 ... (148 bytes)<br />
M Groups\AdminExtensions<br />
M Networks\AdminExtensions<br />
M NetworkInterfaces\AdminExtensions<br />
M Nodes\AdminExtensions<br />
M Resources\AdminExtensions<br />
M ResourceTypes\AdminExtensions<br />
D EnableEventLogReplication 0 (0x0)<br />
D QuorumArbitrationTimeMax 300 (0x12c)<br />
D QuorumArbitrationTimeMin 15 (0xf)<br />
D DisableGroupPreferredOwnerRandomization 0 (0x0)<br />
D EnableEventDeltaGeneration 1 (0x1)<br />
D EnableResourceDllDeadlockDetection 0 (0x0)<br />
D ResourceDllDeadlockTimeout 240 (0xf0)<br />
D ResourceDllDeadlockThreshold 3 (0x3)<br />
D ResourceDllDeadlockPeriod 1800 (0x708)<br />
D ClusSvcHeartbeatTimeout 60 (0x3c)<br />
D HangRecoveryAction 3 (0x3)<br />
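The listing above can be checked mechanically rather than by eye. The following is a minimal sketch in Python (pure string handling): it assumes the column layout shown above (a type flag, the property name, then a decimal value) and emits corrective commands for any shared-quorum setting that is absent or wrong, mirroring the command lists that follow.<br />

```python
# Sketch: parse the DWORD ("D") lines of `Cluster /prop` output and emit
# the corrective `Cluster /prop name=value` commands for any setting that
# does not match the expected value. Assumes the column layout shown
# above; the expected values mirror the shared-quorum list below.
EXPECTED_SHARED_QUORUM = {
    "QuorumArbitrationTimeMax": 300,  # not for majority node set
    "QuorumArbitrationTimeMin": 15,
    "HangRecoveryAction": 3,
    "EnableEventLogReplication": 0,
}

def parse_cluster_props(output):
    """Extract DWORD properties: 'D Name 300 (0x12c)' -> {'Name': 300}."""
    props = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "D":
            props[parts[1]] = int(parts[2])
    return props

def correction_commands(props, expected=EXPECTED_SHARED_QUORUM):
    """Commands to run for every expected setting that is absent or wrong."""
    return [f"Cluster /prop {name}={value}"
            for name, value in expected.items()
            if props.get(name) != value]

sample = """\
D EnableEventLogReplication 0 (0x0)
D QuorumArbitrationTimeMax 300 (0x12c)
D QuorumArbitrationTimeMin 15 (0xf)
D HangRecoveryAction 3 (0x3)
"""
```

When all settings already match, the function returns an empty list; otherwise it returns exactly the commands listed in the subsections that follow.<br />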
If the properties are not set correctly, use the following commands to correct the<br />
settings.<br />
Majority Node Set Quorum<br />
Cluster /prop HangRecoveryAction=3<br />
Cluster /prop EnableEventLogReplication=0<br />
Shared Quorum<br />
Cluster /prop QuorumArbitrationTimeMax=300 (not for majority node set)<br />
Cluster /prop QuorumArbitrationTimeMin=15<br />
Cluster /prop HangRecoveryAction=3<br />
Cluster /prop EnableEventLogReplication=0<br />
Network Bindings<br />
The following binding priority order and settings are suggested as best practices for<br />
clustered configurations. These procedures assume that you can identify the public and<br />
private networks by the connection names that are referenced in the steps.<br />
Host-Specific Network Bindings and Settings<br />
1. Open the Network Connections window.<br />
2. On the Advanced menu, click Advanced Settings.<br />
3. Select the Networks and Bindings tab.<br />
This tab shows the binding order in the upper pane and specific connection<br />
properties in the lower pane.<br />
4. Verify that the public network connection is above the private network in the binding<br />
list in the upper pane.<br />
If it is not, follow these steps to change the order:<br />
a. Select a network connection in the binding list in the upper pane.<br />
b. Use the arrows to the right to move the network connection up or down in the<br />
list as appropriate.<br />
5. Select the private network in the binding list. In the lower pane, verify that the File<br />
and Print Sharing for Microsoft Networks and the Client for Microsoft<br />
Networks check boxes are cleared for the private network.<br />
6. Click OK.<br />
7. Highlight the public connections, then right-click and click Properties.<br />
8. Select Internet Protocol (TCP/IP) in the list, and click Properties.<br />
9. Click Advanced.<br />
10. Select the WINS tab.<br />
11. Ensure that Enable LMHOSTS lookup is selected.<br />
12. Ensure that Disable NetBIOS over TCP/IP is selected.<br />
13. Repeat steps 7 through 12 for the private network connection.<br />
Cluster-Specific Network Bindings and Settings<br />
1. Open the Cluster Administrator.<br />
2. Right-click the cluster (the top node in the tree structure in the left pane) and click<br />
Properties.<br />
3. Select the Networks Priority tab.<br />
4. Ensure that the private network is at the top of the list and that the public network is<br />
below the private network.<br />
If it is not, follow these steps to change the order:<br />
a. Select the private network.<br />
b. Use the command button at the right to move the private network up in the<br />
list as appropriate.<br />
5. Select the private network, and click Properties.<br />
6. Verify that the Enable this network for cluster use check box is selected and<br />
that Internal cluster communications only (private network) is selected.<br />
7. Click OK.<br />
8. Select the public network, and click Properties.<br />
9. Verify that the Enable this network for cluster use check box is selected and<br />
that All communications (mixed network) is selected.<br />
10. Click OK.<br />
Group Initialization Effects on a Cluster<br />
Move-Group Operation<br />
The following conditions affect failover times for a cluster move-group operation. A<br />
cluster move-group operation cannot complete if a lengthy consistency group<br />
initialization, such as a full-sweep initialization, long resynchronization, or initialization<br />
from marking mode, is executing in the background. Review these conditions and plan<br />
accordingly.<br />
Full-Sweep Initialization<br />
A full-sweep initialization occurs when the disks on both sites are scanned or read in<br />
their entirety and a comparison is made, using checksums, to check for differences. Any<br />
differences are then replicated from the Production site disk to the remote site disk. A<br />
full-sweep initialization generates an entry in the management console log.<br />
A full-sweep initialization occurs in the following circumstances:<br />
• Disabling or enabling a group<br />
Disabling a group causes all disk replication in the group to stop. A full-sweep<br />
initialization is performed once the group is enabled. The full-sweep initialization<br />
guarantees that the disks are consistent between the sites.<br />
• Adding a new splitter server or host that has access to the disks in the group<br />
When adding a new splitter to the replication, there is a time before the splitter is<br />
added to the configuration when activity from this splitter to the disks is not being<br />
monitored or replicated. To guarantee that no write operations were performed by<br />
the new splitter before the splitter was configured in the replication, a full-sweep<br />
initialization is required for all groups that contain disks accessed by this splitter. This<br />
initialization is done automatically by the system.<br />
• Double failure of a main component<br />
When a double failure of a main component occurs, a full-sweep initialization is<br />
required to guarantee that consistency was maintained. The main components<br />
include the host, the replication appliance (RA), and the storage subsystem.<br />
Long Resynchronization<br />
A long resynchronization occurs when the data difference that needs to be replicated to<br />
the other site cannot fit on the journal volume. The data is split into multiple snapshots<br />
for distribution to the other site, and all the previous snapshots are lost. Long<br />
resynchronization can be caused by long WAN outages, a group being disabled for a long<br />
time period, and other instances when replication has not been functional for a long time<br />
period.<br />
Long resynchronization is not connected with full-sweep initialization and can also<br />
happen during initialization from marking (see “Initialization from Marking Mode”). It is<br />
dependent only on the journal volume size and the amount of data to be replicated.<br />
A long resynchronization is identified on the Status tab in the components pane, under<br />
the remote journal bitmap in the management console. The status Performing Long<br />
Resync is visible for the group that is currently performing a long resynchronization.<br />
Initialization from Marking Mode<br />
All other instances of initialization in the replication are caused by marking. The marking<br />
mode refers to a replication mode in which the location of “dirty,” or changed, data is<br />
marked in a bitmap on the repository volume. This bitmap is a standard size—no matter<br />
how much data changes or what size disks are being monitored—so the repository<br />
volume cannot fill up during marking.<br />
The replication moves to marking mode when replication cannot be performed normally,<br />
such as during WAN outages. This marking mode guarantees that all data changes are<br />
still being recorded until replication is functioning normally. When replication can perform<br />
normally again, the RAs read the dirty, or changed, data from the source disk based on<br />
data recorded in the bitmap and replicate it to the disk on the remote site. The length of<br />
time for this process to complete depends on the amount of dirty, or changed, data as<br />
well as the performance of other components in the configuration, such as bandwidth<br />
and the storage subsystem.<br />
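As a rough illustration of the fixed-size marking bitmap described above, the following sketch records dirty locations as set bits. The region size and disk size here are hypothetical illustration values, not the appliance's actual granularity.<br />

```python
# Sketch of the marking-mode idea described above: dirty locations are
# recorded as set bits in a fixed-size bitmap, so the space used never
# grows with the amount of changed data. Region and disk sizes are
# hypothetical, not the appliance's actual values.
class MarkingBitmap:
    def __init__(self, disk_size, region_size):
        self.region_size = region_size
        n_regions = (disk_size + region_size - 1) // region_size
        self.bits = bytearray((n_regions + 7) // 8)  # fixed size

    def mark(self, offset, length):
        """Mark every region touched by a write as dirty."""
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        for r in range(first, last + 1):
            self.bits[r // 8] |= 1 << (r % 8)

    def dirty_regions(self):
        """Regions to read back and replicate when marking mode ends."""
        n = len(self.bits) * 8
        return [r for r in range(n)
                if self.bits[r // 8] & (1 << (r % 8))]

# 1 GiB disk tracked in 1 MiB regions: the bitmap is 128 bytes no matter
# how much data changes while replication is down.
bm = MarkingBitmap(disk_size=1 << 30, region_size=1 << 20)
bm.mark(offset=5 * (1 << 20) + 100, length=3 * (1 << 20))
```

When replication resumes, only the regions returned by dirty_regions need to be read from the source disk and replicated, which is why the repository volume cannot fill up during marking.<br />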
A high-load state can also cause the replication to move to marking mode. A high-load<br />
state occurs when write activity to the source disks exceeds the limits that the<br />
replication, bandwidth, or remote disks can handle. Replication moves into marking<br />
mode at this time until the replication determines the activity has reached a level at<br />
which it can continue normal replication. The replication then exits the high-load state<br />
and an initialization from marking occurs.<br />
See Section 10, “Solving Performance Problems,” for more information on high-load<br />
conditions and problems.<br />
Behavior of <strong>SafeGuard</strong> 30m Control During a<br />
Move-Group Operation<br />
During a move-group operation, the Unisys <strong>SafeGuard</strong> 30m Control resource in a<br />
clustered environment behaves as follows. Be aware of this information when dealing<br />
with various failure scenarios.<br />
1. MSCS issues an offline request because of a failure with a group resource—for<br />
example, a physical disk—or an MSCS move group. The request is sent to the<br />
Unisys <strong>SafeGuard</strong> 30m Control resource on the node that owns the group.<br />
The MSCS resources that are dependent on the Unisys <strong>SafeGuard</strong> 30m Control<br />
resource, such as physical disk resources, are taken offline first. Taking the<br />
resources offline does not issue any commands to the RA.<br />
2. MSCS issues an online request to the Unisys <strong>SafeGuard</strong> 30m Control resource on<br />
the node to which a group was moved, or in the case of failure, to the next node in<br />
the preferred owners list.<br />
3. When the resource receives an online request from MSCS, the Unisys <strong>SafeGuard</strong><br />
30m Control resource issues two commands to control the access to disks:<br />
initiate_failover and verify_failover.<br />
Initiate_Failover Command<br />
This command changes the replication direction from one site to another.<br />
• If a same-site failover is requested, the command completes successfully with<br />
no action performed by the RA.<br />
• The resource issues the verify_failover command to see if the RA performed<br />
the operations successfully.<br />
• If a different-site failover is requested, the RA starts changing direction between<br />
sites and returns successfully. In certain circumstances, the RA returns a failure<br />
when the WAN is down or a long resynchronization occurs.<br />
• If the RA returns a failure to the Unisys <strong>SafeGuard</strong> 30m Control resource, the<br />
resource logs the failure in the Windows application event log and retries the<br />
command continuously until the cluster pending timeout is reached. When a<br />
move-group operation fails, check the application event log to view the events<br />
posted by the resource. The event source of the event entry is the 30m Control.<br />
Verify_Failover Command<br />
This command enables the Unisys <strong>SafeGuard</strong> 30m Control resource to determine<br />
the time at which the change of the replication direction completes.<br />
• If a same-site failover is requested, the command completes successfully with<br />
no action performed by the RA.<br />
• If a different-site failover is requested, the verify_failover command returns a<br />
pending status until the replication direction changes. The change of direction<br />
takes from 2 to 30 minutes.<br />
• When the verify_failover command completes, write access to the physical disk<br />
is enabled to the host from the RA and the splitter.<br />
• If the time to complete the verify_failover command is within the pending<br />
timeout, the Unisys <strong>SafeGuard</strong> 30m Control resource comes online followed by<br />
all the resources dependent on this resource.<br />
All dependent disks come online using the default physical disk timeout of an<br />
MSCS cluster. The physical disk is available to the physical disk resource<br />
immediately; there is no delay. Physical disk access is available when the Unisys<br />
<strong>SafeGuard</strong> 30m Control resource comes online. You do not need to change the<br />
default resource settings for the physical disk. However, the physical disk must<br />
be dependent on the Unisys <strong>SafeGuard</strong> 30m Control resource.<br />
• If the time to complete the verify_failover command is longer than the pending<br />
timeout of the Unisys <strong>SafeGuard</strong> 30m Control resource, MSCS fails this<br />
resource.<br />
The default pending timeout for a Unisys <strong>SafeGuard</strong> 30m Control resource is<br />
15 minutes or 900 seconds. This timeout occurs before the cluster disk timeout.<br />
If you use the default retry value of 1, this resource issues the following<br />
commands:<br />
• Initiate_failover<br />
• Verify_failover<br />
• Initiate_failover<br />
• Verify_failover<br />
Using the default pending timeout, the Unisys <strong>SafeGuard</strong> 30m Control resource<br />
waits a total of 30 minutes to come online; this timeout period equals the<br />
timeout plus one retry. If the resource does not come online, MSCS attempts to<br />
move the group to the next node in the preferred owners list and then repeats<br />
this process.<br />
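The online sequence and timeout arithmetic described above (a 900-second pending timeout, one retry, about a 30-minute worst case) can be sketched as a polling loop. The ra object below is a stand-in stub, not the real Unisys SafeGuard 30m Control or RA interface.<br />

```python
# Sketch of the online-request sequence described above: issue
# initiate_failover, then poll verify_failover until it completes or the
# pending timeout (900 s by default) expires; with retry=1 the sequence
# runs twice, giving the 30-minute worst case. The `ra` object is a
# stand-in stub, not the real 30m Control / RA interface.
PENDING_TIMEOUT = 900  # seconds, default for the 30m Control resource

def bring_online(ra, retries=1, timeout=PENDING_TIMEOUT, poll_interval=10):
    for _attempt in range(retries + 1):
        ra.initiate_failover()
        waited = 0
        while waited < timeout:
            if ra.verify_failover() == "complete":
                return True          # resource comes online
            waited += poll_interval  # simulated clock, no real sleep
    return False                     # MSCS fails the resource, moves group

class StubRA:
    """Completes the direction change after a fixed number of polls."""
    def __init__(self, polls_needed):
        self.polls_needed = polls_needed
    def initiate_failover(self):
        pass                         # same-site case: no RA action needed
    def verify_failover(self):
        self.polls_needed -= 1
        return "complete" if self.polls_needed <= 0 else "pending"
```

A direction change that completes within the pending timeout brings the resource (and its dependent physical disks) online; one that never completes exhausts the timeout plus one retry, after which MSCS moves the group to the next preferred owner.<br />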
Recovering by Manually Moving an Auto-Data<br />
(Shared Quorum) Consistency Group<br />
An older image might be required to recover from a rolling disaster, human error, a virus,<br />
or any other failure that corrupts the latest snapshot image. It is impossible to recover<br />
automatically to an older image using MSCS because automatic cluster failover is<br />
designed to minimize data loss. The Unisys <strong>SafeGuard</strong> 30m solution always attempts to<br />
fail over to the latest image.<br />
Note: Manual image recovery is only for data consistency groups, not for the quorum<br />
group.<br />
To recover a data consistency group using an older image, you must complete the<br />
following tasks:<br />
• Take the cluster data group offline.<br />
• Perform a manual failover of an auto-data (shared quorum) consistency group to a<br />
selected image.<br />
• Bring the cluster group online and check the validity of the image.<br />
• Reverse the replication direction of the consistency group.<br />
Taking a Cluster Data Group Offline<br />
To take a group offline in the cluster for which you are performing a manual recovery,<br />
complete the following steps:<br />
1. Open Cluster Administrator on one of the nodes in the MSCS cluster.<br />
2. Right-click the group that you want to recover and click Take Offline.<br />
3. Wait until all resources in the group show the status as Offline.<br />
Performing a Manual Failover of an Auto-Data (Shared Quorum)<br />
Consistency Group to a Selected Image<br />
1. Open the Management Console.<br />
2. Select a Consistency Group from the navigation pane.<br />
Note: Do not select the quorum group. The data consistency group you select<br />
should be the cluster data group that you took offline.<br />
3. Click the Policy tab on the selected Consistency Group.<br />
4. Scroll down and select Advanced in the Policy tab.<br />
5. Select Manual (shared quorum) in the Global cluster mode list.<br />
6. Click Apply.<br />
7. Perform the following steps to access the image:<br />
a. Right-click the Consistency Group and select Pause Transfer. Click Yes<br />
when the system prompts that the group activity will be paused.<br />
b. Right-click the Consistency Group and scroll down.<br />
c. Select the Remote Copy name and click Enable Image Access.<br />
The Enable Image Access dialog box appears.<br />
d. Choose Select an image from the list and click Next.<br />
The Select Explicit Image dialog box appears and displays the available<br />
images.<br />
e. Select the desired image from the list and click Next.<br />
The Image Access Mode dialog box appears.<br />
f. Select the option Logged access (physical) and click Next.<br />
The Summary screen displays the Image name and the Image Access mode.<br />
g. Click Finish.<br />
Note: This process might take a long time to complete depending on the value<br />
of the journal lag setting in the group policy of the consistency group. The<br />
following message appears during the process:<br />
Enabling log access<br />
h. Verify the target image name displayed below the bitmap in the components<br />
pane under the Status tab.<br />
Transfer:Paused status appears at the bottom in the Status tab under the<br />
components pane.<br />
Bringing a Cluster Data Group Online and Checking the Validity<br />
of the Image<br />
1. Open the Cluster Administrator window on the Management Console.<br />
2. Move the group to the node on the recovered site by right-clicking the group that<br />
you previously took offline and then clicking Move Group.<br />
• If the cluster has more than two nodes, a list of possible owner target nodes<br />
appears. Select the node to which you want to move the group.<br />
• If the cluster has only two nodes, the move starts immediately. Go to step 3.<br />
3. Bring the group online by right-clicking the group name and then clicking Bring<br />
Online.<br />
4. Ensure that the selected image is valid; that is, verify that<br />
• All applications start successfully using the selected image.<br />
• The data in the image is consistent and valid.<br />
For example, you might want to test whether you can start a database application on<br />
this image. You might also want to run proprietary test procedures to validate the<br />
data.<br />
5. If you tested the validity of the image and the test completed successfully, skip to<br />
“Reversing the Replication Direction of the Consistency Group.”<br />
6. If the image is not valid and you choose to test a different image, perform the<br />
following steps:<br />
a. To take the group offline, right-click the group name and then click Take<br />
Offline on the Cluster Administrator.<br />
b. Select one of the Consistency Groups in the navigation pane on the<br />
Management Console.<br />
c. Right-click the Consistency Group and scroll down.<br />
d. Select the Remote Copy name and click Disable Image Access.<br />
e. Click Yes when the system prompts you to ensure that all group volumes are<br />
unmounted.<br />
7. Perform the following steps if you want to choose a different image:<br />
a. Right-click the Consistency Group and select Pause Transfer. Click Yes<br />
when the system prompts that the group activity will be paused.<br />
b. Right-click the Consistency Group and scroll down.<br />
c. Select the Remote Copy name and click Enable Image Access.<br />
The Enable Image Access dialog box appears.<br />
d. Choose Select an image from the list and click Next.<br />
The Select Explicit Image dialog box appears and displays the available<br />
images.<br />
e. Select the desired image from the list and click Next.<br />
The Image Access Mode dialog box appears.<br />
f. Select the option Logged access (physical) and click Next.<br />
The Summary screen displays the Image name and the Image Access mode.<br />
g. Click Finish.<br />
Note: This process might take a long time to complete depending on the value<br />
of the journal lag setting in the group policy of the consistency group. The<br />
following message appears during the process:<br />
Enabling log access<br />
h. Verify the target image name displayed below the bitmap in the components<br />
pane under Status tab.<br />
Transfer:Paused status appears at the bottom in the Status tab under the<br />
components pane.<br />
8. To bring the cluster group online, using the Cluster Administrator, right-click the<br />
group name and then click Bring Online.<br />
9. Ensure that the selected image is valid. Verify that<br />
• All applications start successfully using the selected image.<br />
• The data in the image is consistent and valid.<br />
For example, you might want to test whether you can start a database application on<br />
this image. You might also want to run proprietary test procedures to validate the<br />
data.<br />
10. If you tested the validity of the image and the test completed successfully, skip to<br />
“Reversing the Replication Direction of the Consistency Group.”<br />
11. If the image is not valid, repeat steps 6 through 9 as necessary.<br />
Reversing the Replication Direction of the Consistency Group<br />
1. Select the Consistency Group from the navigation pane.<br />
2. Right-click the Group and select Pause Transfer. Click Yes when the system<br />
prompts that the group activity will be paused.<br />
3. Click the Status tab. The status of the transfer must display Paused.<br />
4. Click the Policy tab and expand the Advanced Settings (if they are not already expanded).<br />
5. Select Auto data (shared quorum) from the Global Cluster mode list.<br />
6. Right-click the Consistency Group and select Failover to the remote site.<br />
7. Click Yes when the system prompts you to confirm failover.<br />
6872 5688–002<br />
8. Ensure that thee<br />
Start data transfer immediately check box is s selected.<br />
The following wwarning<br />
message appears:<br />
Warning: JJournal<br />
will be erased. Do you wish to continue e?<br />
9. Click Yes to coontinue.<br />
Problem Description<br />
The following points describe the behavior of the components in this event:<br />
• When the quorum group is running on the site where the RAs failed (site 1), the<br />
cluster nodes on site 1 fail because of lost quorum reservations, and cluster nodes<br />
on site 2 attempt to arbitrate for the quorum resource.<br />
• To prevent a “split brain” scenario, the RAs assume that the other site is active<br />
when a WAN failure occurs. (A WAN failure occurs if the RAs cannot communicate<br />
with at least one RA at the other site.)<br />
• When the MSCS Reservation Manager on the surviving site (site 2) attempts the<br />
quorum arbitration request, the RA prevents access. Eventually, all cluster services<br />
stop and manual intervention is required to bring up the cluster service.<br />
Figure 4–1 illustrates this failure.<br />
Recovering in a Geographic Clustered Environment<br />
Recovery When All RAs Fail on Site 1 (Site 1 Quorum Owner)<br />
Figure 4–1. All RAs Fail on Site 1 (Site 1 Quorum Owner)<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows errors and messages similar to those for<br />
“Total Communication Failure in a Geographic Clustered Environment” in Section 7.<br />
• If you review the system event log, you find messages similar to the following<br />
examples:<br />
System Event Log for Usmv-East2 Host (Surviving Host)<br />
8/2/2008 1:35:48 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-WEST2 Reservation of<br />
cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.<br />
8/2/2008 1:35:48 PM Ftdisk Warning Disk 57 N/A USMV-WEST2 The system failed to flush data to<br />
the transaction log. Corruption may occur.<br />
System Event Log for Usmv-West2 (Failure Host)<br />
8/2/2008 1:35:48 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-WEST2 Reservation of<br />
cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.<br />
8/2/2008 1:35:48 PM Ftdisk Warning Disk 57 N/A USMV-WEST2 The system failed to flush data to<br />
the transaction log. Corruption may occur.<br />
• If you review the cluster log, you find messages similar to the following examples:<br />
Cluster Log for Usmv-East2 (Surviving Host)<br />
The cluster attempted arbitration five times before timing out. The following entries were recorded five times in the log:<br />
00000f90.00000620::2008/02/02-20:36:06.195 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170 (The requested resource is in use).<br />
00000f90.00000620::2008/02/02-20:36:06.195 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
00000f90.00000620::2008/02/02-20:36:08.210 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00000f90.00000620::2008/02/02-20:36:08.210 ERR Physical Disk : [DiskArb] Failed to write<br />
(sector 12), error 170.<br />
00000638.00000b10::2008/02/02-20:36:18.273 ERR [FM] Failed to arbitrate quorum resource c336021a-<br />
083e-4fa0-9d37-7077a590c206, error 170.<br />
00000638.00000b10::2008/02/02-20:36:18.273 ERR [RGP] Node 2: REGROUP ERROR: arbitration failed.<br />
00000638.00000b10::2008/02/02-20:36:18.273 ERR [CS] Halting this node to prevent an inconsistency<br />
within the cluster. Error status = 5892 (The membership engine requested shutdown of the cluster<br />
service on this node).<br />
00000684.000005a8::2008/02/02-20:37:53.473 ERR [JOIN] Unable to connect to any sponsor node.<br />
00000684.000005a8::2008/02/02-20:38:06.020 ERR [FM] FmGetQuorumResource failed, error 170.<br />
00000684.000005a8::2008/02/02-20:38:06.020 ERR [INIT] ClusterForm: Could not get quorum resource.<br />
No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service).<br />
00000684.000005a8::2008/02/02-20:38:06.020 ERR [INIT] Failed to form cluster, status 5086 (The<br />
quorum disk could not be located by the cluster service).<br />
Cluster Log for Usmv-West2 (Failure Host)<br />
00000d80.00000bbc::2008/02/02-20:31:21.257 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />
MM_INVALID_NODE, chose the default target<br />
00000da0.00000130::2008/02/02-20:35:48.395 ERR Physical Disk : [DiskArb]<br />
CompletionRoutine: reservation lost! Status 170 (The requested resource is in use)<br />
00000da0.00000130::2008/02/02-20:35:48.395 ERR [RM] LostQuorumResource, cluster service<br />
terminated...<br />
00000da0.00000b80::2008/02/02-20:35:49.145 ERR Network Name : Unable to open<br />
handle to cluster, status 1753 (There are no more endpoints available from the endpoint mapper).<br />
00000da0.00000c20::2008/02/02-20:35:49.145 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6 (The handle is invalid).<br />
00000a04.00000a14::2008/02/02-20:37:23.456 ERR [JOIN] Unable to connect to any sponsor node.<br />
The cluster attempted arbitration five times before timing out. The following entries were recorded five times in the log:<br />
000001e4.00000598::2008/02/02-20:37:23.799 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170 (The resource is in use).<br />
000001e4.00000598::2008/02/02-20:37:23.831 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
000001e4.00000598::2008/02/02-20:37:23.831 ERR Physical Disk : [DiskArb] BusReset<br />
completed, status 31 (A device attached to the system is not functioning).<br />
000001e4.00000598::2008/02/02-20:37:23.831 ERR Physical Disk : [DiskArb] Failed to break<br />
reservation, error 31.<br />
00000a04.00000a14::2008/02/02-20:37:25.830 ERR [FM] FmGetQuorumResource failed, error 31.<br />
00000a04.00000a14::2008/08/02-20:37:25.830 ERR [INIT] ClusterForm: Could not get quorum resource.<br />
No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service).<br />
00000a04.00000a14::2008/02/02-20:37:25.830 ERR [INIT] Failed to form cluster, status 5086 (The<br />
quorum disk could not be located by the cluster service).<br />
00000a04.00000a14::2008/02/02-20:37:25.830 ERR [CS] ClusterInitialize failed 5086<br />
00000a04.00000a14::2008/02/02-20:37:25.846 ERR [CS] Service Stopped. exit code = 5086<br />
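When you review an exported cluster log for this failure, a simple text filter can pull out the arbitration and quorum errors shown in the examples above. The following is only a sketch: the message patterns are taken from the sample entries in this guide, and the log path in the usage comment is a placeholder, not an official diagnostic tool.<br />

```shell
# Filter an exported cluster log for the quorum-arbitration failures
# illustrated above. The patterns match the sample entries in this guide.
filter_quorum_errors() {
  grep -E 'DiskArb|arbitrate quorum|FmGetQuorumResource|REGROUP ERROR' "$1"
}

# Example (path is a placeholder for your exported cluster log):
# filter_quorum_errors cluster.log
```

Entries that match these patterns correspond to the reservation-loss and arbitration failures described in this scenario.<br />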
Actions to Resolve the Problem<br />
If all RAs on site 1 fail and site 1 owns the quorum resource, perform the following tasks<br />
to recover:<br />
1. Disable MSCS on all nodes at the site with the failed RAs.<br />
2. Perform a manual failover of the quorum consistency group.<br />
3. Reverse replication direction.<br />
4. Start MSCS on a node on the surviving site.<br />
5. Complete the recovery process.<br />
Caution<br />
Manual recovery is required only if the quorum device is lost because of a<br />
failure of an RA cluster.<br />
Before you bring the remote site online and before you perform the manual<br />
recovery procedure, ensure that MSCS is stopped and disabled on the cluster<br />
nodes at the production site (site 1 in this case). You must verify the server<br />
status with a network test.<br />
Improper use of the manual recovery procedure can lead to an inconsistent<br />
quorum disk and unpredictable results that might require a long recovery<br />
process.<br />
Disabling MSCS<br />
Stop MSCS on each node at the site where the RAs failed by completing the following<br />
steps:<br />
1. In the Control Panel, point to Administrative Tools, and then click Services.<br />
2. Right-click Cluster Service and click Stop.<br />
3. Change the startup type to Disabled.<br />
4. Repeat steps 1 through 3 for each node on the site.<br />
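If you manage many nodes, the same stop-and-disable sequence can be scripted rather than clicked through in Services. The sketch below is a dry run that only prints the commands for review; the node names and the psexec remote-execution wrapper are assumptions for illustration, and ClusSvc is the Cluster Service's service name.<br />

```shell
# Dry run: print the commands that stop and disable the Cluster Service
# (ClusSvc) on each node at the failed-RA site. NODES and the psexec
# wrapper are placeholders for your environment; nothing is executed here.
NODES="usmv-west1 usmv-west2"
CMDS=""
for node in $NODES; do
  CMDS="$CMDS
psexec \\\\$node net stop clussvc
psexec \\\\$node sc config clussvc start= disabled"
done
printf '%s\n' "$CMDS"
```

Review the printed commands, then run them against each node at the site before continuing with the manual failover.<br />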
Performing a Manual Failover of the Quorum Consistency Group<br />
1. Connect to the Management Console by opening a browser to the management IP<br />
address of the surviving site. The management console can be accessed only by the<br />
site with a functional RA cluster because the WAN is down.<br />
2. Click the Quorum Consistency Group (that is, the consistency group that holds<br />
the quorum drive) in the navigation pane.<br />
3. Click the Policy tab.<br />
4. Under Advanced, select Manual (shared quorum) in the Global cluster<br />
mode list, and click Apply.<br />
5. Right-click the Quorum Consistency Group and then select Pause Transfer.<br />
Click Yes when the system prompts that the group activity will be stopped.<br />
6. Perform the following steps to allow access to the target image:<br />
a. Right-click the Consistency Group and scroll down.<br />
b. Select the Remote Copy name and click Enable Image Access.<br />
The Enable Image Access dialog box appears.<br />
c. Choose Select an image from the list and click Next.<br />
The Select Explicit Image dialog box displays the available images.<br />
d. Select the desired image from the list and then click Next.<br />
The Image Access Mode dialog box appears.<br />
e. Select Logged access (physical) and click Next.<br />
The Summary screen shows the Image name and the Image Access mode.<br />
f. Click Finish.<br />
Note: This process might take a long time to complete depending on the value<br />
of the journal lag setting in the group policy of the consistency group.<br />
g. Verify the target image name displayed below the bitmap in the components<br />
pane under the Status tab.<br />
The Transfer: Paused status appears under the bitmap in the Status tab of the<br />
components pane.<br />
Reversing Replication Direction<br />
1. Select the Quorum Consistency Group in the navigation pane.<br />
2. Right-click the Group and select Pause Transfer. Click Yes when the system<br />
prompts that the group activity will be paused.<br />
3. Click the Status tab. The status of the transfer must show Paused.<br />
4. Right-click the Consistency Group and select Failover to &lt;site name&gt;.<br />
5. Click Yes when the system prompts to confirm failover.<br />
6. Ensure that the Start data transfer immediately check box is selected.<br />
The following warning message appears:<br />
Warning: Journal will be erased. Do you wish to continue?<br />
7. Click Yes to continue.<br />
Starting MSCS<br />
MSCS should start within 1 minute on the surviving nodes when the MSCS recovery<br />
setting is enabled. You can manually start MSCS on each node of the surviving site by<br />
completing the following steps:<br />
1. In the Control Panel, point to Administrative Tools, and then click Services.<br />
2. Right-click Cluster Service, and click Start.<br />
MSCS starts the cluster group and automatically moves all groups to the first-started<br />
cluster node.<br />
3. Repeat steps 1 through 2 for each node on the site.<br />
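The start sequence above can also be expressed as a small dry-run helper that prints one start command per surviving-site node. The node names and the psexec wrapper are assumptions; ClusSvc is the MSCS service name.<br />

```shell
# Dry run: emit the command that starts the Cluster Service on each node.
# Nothing is executed; review the output, then run the commands yourself.
start_cmds() {
  for node in "$@"; do
    printf 'psexec \\\\%s net start clussvc\n' "$node"
  done
}

start_cmds usmv-east1 usmv-east2
```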
Completing the Recovery Process<br />
To complete the recovery process, you must restore the global cluster mode property<br />
and start MSCS.<br />
• Restoring the Global Cluster Mode Property for the Quorum Group<br />
Once the primary site is operational and you have verified that all nodes at both sites<br />
are online in the cluster, restore the failover settings by performing the following<br />
steps:<br />
1. Click the Quorum Consistency Group (that is, the consistency group that<br />
holds the quorum device) from the navigation pane.<br />
2. Click the Policy tab.<br />
3. Under Advanced, select Auto-quorum (shared quorum) in the Global<br />
cluster mode list.<br />
4. Click Apply.<br />
5. Click Yes when the system prompts that the group activity will be stopped.<br />
• Enabling MSCS<br />
Enable and start MSCS on each node at the site where the RAs failed by completing<br />
the following steps:<br />
1. In the Control Panel, point to Administrative Tools, and then click<br />
Services.<br />
2. Right-click Cluster Service and click Properties.<br />
3. Change the startup type to Automatic.<br />
4. Click Start.<br />
5. Repeat steps 1 through 4 for each node on the site.<br />
6. Open the Cluster Administrator and move the groups to the preferred node.<br />
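Steps 1 through 5 can likewise be scripted as a dry run that prints, per node, the commands to restore Automatic startup and start the service. The node names and the psexec wrapper are placeholder assumptions for illustration.<br />

```shell
# Dry run: emit the commands that re-enable (Automatic startup) and start
# the Cluster Service on each node at the recovered site. Nothing runs here.
enable_cmds() {
  for node in "$@"; do
    printf 'psexec \\\\%s sc config clussvc start= auto\n' "$node"
    printf 'psexec \\\\%s net start clussvc\n' "$node"
  done
}

enable_cmds usmv-west1 usmv-west2
```

After the services are running, moving groups to their preferred nodes is still done in Cluster Administrator, as in step 6.<br />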
Recovery When All RAs Fail on Site 1 (Site 2 Quorum Owner)<br />
Problem Description<br />
If the quorum group is running on site 2 and the RAs fail on site 1, all cluster nodes<br />
remain in a running state. All consistency groups remain at the respective sites because<br />
all disk accesses are successful. In this case, because data is stored on the replication<br />
volumes—but the corresponding marking information is not written to the repository<br />
volume—a full-sweep resynchronization is required following recovery.<br />
An exception is if the consistency group option “Allow application to run even when<br />
Unisys SafeGuard Solutions cannot mark data” was selected. The splitter prevents<br />
access to disks when the RAs are not available to write marking data to the repository<br />
volume, and I/Os fail.<br />
Figure 4–2 illustrates this failure.<br />
Figure 4–2. All RAs Fail on Site 1 (Site 2 Quorum Owner)<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows errors and messages similar to those for<br />
“Total Communication Failure in a Geographic Clustered Environment” in Section 7.<br />
• If you review the system event log, you find messages similar to the following<br />
examples:<br />
System Event Log for Usmv-East2 Host (Surviving Site—Site 2)<br />
8/2/2008 3:09:27 PM ClusSvc Information Failover Mgr 1204 N/A USMV-EAST2 "The Cluster<br />
Service brought the Resource Group ""Group 0"" offline."<br />
8/2/2008 3:09:47 PM ClusSvc Error Failover Mgr 1069 N/A USMV-WEST2 Cluster resource 'Data1' in<br />
Resource Group 'Group 0' failed.<br />
8/2/2008 3:10:29 PM ClusSvc Information Failover Mgr 1153 N/A USMV-WEST2 Cluster service is<br />
attempting to failover the Cluster Resource Group 'Group 0' from node USMV-WEST2 to node USMV-<br />
EAST2.<br />
8/2/2008 3:10:53 PM ClusSvc Information Failover Mgr 1201 N/A USMV-EAST2 "The Cluster<br />
Service brought the Resource Group ""Group 0"" online."<br />
System Event Log for Usmv-West2 Host (Failure Site—Site 1)<br />
8/2/2008 3:09:27 PM ClusSvc Information Failover Mgr 1204 N/A USMV-EAST2 "The Cluster<br />
Service brought the Resource Group ""Group 0"" offline."<br />
8/2/2008 3:09:47 PM ClusSvc Error Failover Mgr 1069 N/A USMV-WEST2 Cluster resource 'Data1' in<br />
Resource Group 'Group 0' failed.<br />
8/2/2008 3:10:29 PM ClusSvc Information Failover Mgr 1153 N/A USMV-WEST2 Cluster service is<br />
attempting to failover the Cluster Resource Group 'Group 0' from node USMV-WEST2 to node USMV-<br />
EAST2.<br />
8/2/2008 3:10:53 PM ClusSvc Information Failover Mgr 1201 N/A USMV-EAST2 "The Cluster<br />
Service brought the Resource Group ""Group 0"" online."<br />
• If you review the cluster log, you find messages similar to the following examples:<br />
Cluster Log for Surviving Site (Site 2)<br />
000005a0.00000fdc::2008/02/02-21:57:33.543 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />
MM_INVALID_NODE, chose the default target<br />
00000ec8.000008b4::2008/02/02-22:09:03.139 ERR Unisys SafeGuard 30m Control : KfLogit:<br />
Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the<br />
Management Console that the WAN connection is operational.<br />
00000ec8.00000f48::2008/02/02-22:10:39.715 ERR Unisys SafeGuard 30m Control : KfLogit:<br />
Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the<br />
Management Console that the WAN connection is operational.<br />
Cluster Log for Failure Site (Site 1)<br />
0000033c.000008e4::2008/02/02-22:09:47.159 ERR Unisys SafeGuard 30m Control :<br />
KfGetKboxData: get_system_settings command failed. Error: (2685470674).<br />
0000033c.000008e4::2008/02/02-22:09:47.159 ERR Unisys SafeGuard 30m Control :<br />
UrcfKConGroupOnlineThread: Error 1117 bringing resource online. (Error 1117: the request could not be<br />
performed because of an I/O device error).<br />
0000033c.00000b8c::2008/02/02-22:10:08.168 ERR Unisys SafeGuard 30m Control :<br />
KfGetKboxData: get_version command failed. Error: (2685470674).<br />
0000033c.00000b8c::2008/02/02-22:10:29.146 ERR Unisys SafeGuard 30m Control :<br />
KfGetKboxData: get_system_settings command failed. Error: (2685470674).<br />
0000033c.00000b8c::2008/02/02-22:10:29.146 ERR Unisys SafeGuard 30m Control :<br />
UrcfKConGroupOnlineThread: Error 1117 bringing resource online. (Error 1117: the request could not be<br />
performed because of an I/O device error).<br />
Actions to Resolve the Problem<br />
If all RAs on site 1 fail and site 2 owns the quorum resource, you do not need to perform<br />
manual recovery. Because the surviving site owns the quorum consistency group, MSCS<br />
automatically restarts, and the data consistency group fails over on the surviving site.<br />
Recovery When All RAs and All Servers Fail on One Site<br />
The following two cases describe an event in which a complete site fails (for example,<br />
site 1) and all data I/O, cluster node communication, disk reservations, and so forth, stop<br />
responding. MSCS nodes on site 2 detect a network heartbeat loss and loss of disk<br />
reservations, and try to take over the cluster groups that had been running on the nodes<br />
that failed.<br />
There are two cases for recovering from this failure based on which site owns the<br />
quorum group:<br />
• The RAs and servers fail on site 1 and that site owns the quorum group.<br />
• The RAs and servers fail on site 1 and site 2 owns the quorum group.<br />
Manual recovery of MSCS is required as described in the following topic, “Site 1 Failure<br />
(Site 1 Quorum Owner).”<br />
If the site can recover in an acceptable amount of time and the quorum owner does not<br />
reside on the failed site, manual recovery should not be performed.<br />
The two cases that follow respond differently and are solved differently based on where<br />
the quorum owner resides.<br />
Site 1 Failure (Site 1 Quorum Owner)<br />
Problem Description<br />
In the first failure case, all nodes at site 1 fail as well as the RAs. Thus, the RAs must fail<br />
quorum arbitration attempts initiated by nodes on the surviving site. Because the RAs on<br />
the surviving site (site 2) are not able to communicate over the communication<br />
networks, the RAs assume that it is a WAN network failure and do not allow automatic<br />
failover of cluster resources.<br />
MSCS attempts to fail over to a node at site 2. Because the quorum resource was<br />
owned by site 1, site 2 must be brought up using the manual quorum recovery<br />
procedure.<br />
Figure 4–3 illustrates this case.<br />
Figure 4–3. All RAs and Servers Fail on Site 1 (Site 1 Quorum Owner)<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows errors and messages similar to those for<br />
“Total Communication Failure in a Geographic Clustered Environment” in Section 7.<br />
• If you review the system event log, you find messages similar to the following<br />
examples:<br />
System Event Log for Usmv-East2 Host (Failure Site)<br />
8/3/2008 10:46:01 AM ClusSvc Error Startup/Shutdown 1073 N/A USMV-EAST2 Cluster service<br />
was halted to prevent an inconsistency within the server cluster. The error code was 5892 (The<br />
membership engine requested shutdown of the cluster service on this node).<br />
8/3/2008 10:46:00 AM ClusSvc Error Membership Mgr 1177 N/A USMV-EAST2 Cluster service is<br />
shutting down because the membership engine failed to arbitrate for the quorum device. This could be<br />
due to the loss of network connectivity with the current quorum owner. Check your physical network<br />
infrastructure to ensure that communication between this node and all other nodes in the server cluster is<br />
intact.<br />
8/3/2008 10:47:40 AM ClusSvc Error Startup/Shutdown 1009 N/A USMV-EAST2 Cluster service<br />
could not join an existing server cluster and could not form a new server cluster. Cluster service has<br />
terminated.<br />
8/3/2008 10:50:16 AM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting<br />
a bus reset for device \Device\ClusDisk0.<br />
• If you review the cluster log, you find messages similar to the following examples:<br />
Cluster Log for Surviving Site (Site 2)<br />
00000c54.000008f4::2008/02/02-17:13:31.901 ERR [NMJOIN] Unable to begin join, status 1717 (the NIC<br />
interface is unknown).<br />
00000c54.000008f4::2008/02/02-17:13:31.901 ERR [CS] ClusterInitialize failed 1717<br />
00000c54.000008f4::2008/02/02-17:13:31.917 ERR [CS] Service Stopped. exit code = 1717<br />
00000be0.000008e0::2008/02/02-17:14:53.686 ERR [JOIN] Unable to connect to any sponsor node.<br />
00000be0.000008e0::2008/02/02-17:14:56.374 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />
MM_INVALID_NODE, chose the default target<br />
000001e0.00000bac::2008/02/02-17:16:37.563 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6.<br />
00000e8c.00000ea8::2008/02/02-17:30:20.275 ERR Physical Disk : [DiskArb] Signature of disk<br />
has changed or failed to find disk with id, old signature 0xe1e7208e new signature 0xe1e7208e, status 2<br />
(the system cannot find the file specified).<br />
00000e8c.00000ea8::2008/02/02-17:30:20.289 ERR Physical Disk : SCSI: Attach, error<br />
attaching to signature e1e7208e, error 2.<br />
000008e8.000008fc::2008/02/02-17:30:20.289 ERR [FM] FmGetQuorumResource failed, error 2.<br />
000008e8.000008fc::2008/02/02-17:30:20.289 ERR [INIT] ClusterForm: Could not get quorum resource.<br />
No fixup attempted. Status = 5086 (The quorum disk could not be located by the cluster service).<br />
000008e8.000008fc::2008/02/0-17:30:20.289 ERR [INIT] Failed to form cluster, status 5086.<br />
000008e8.000008fc::2008/02/02-17:30:20.289 ERR [CS] ClusterInitialize failed 5086<br />
000008e8.000008fc::2008/02/02-17:30:20.360 ERR [CS] Service Stopped. exit code = 5086<br />
00000710.00000e80::2008/02/02-17:55:02.092 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />
MM_INVALID_NODE, chose the default target<br />
000009cc.00000884::2008/02/02-17:55:12.413 ERR Unisys SafeGuard 30m Control : KfLogit:<br />
Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the<br />
Management Console that the WAN connection is operational.<br />
Cluster Log for Failure Site (Site 1)<br />
00000dc8.00000c48::2008/02/02-17:12:53.942 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 2.<br />
00000dc8.00000c48::2008/02/02-17:12:53.942 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 2.<br />
00000dc8.00000c48::2008/02/02-17:12:55.942 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 2.<br />
00000dc8.00000c48::2008/02/02-17:12:55.942 ERR Physical Disk : [DiskArb] Failed to write<br />
(sector 12), error 2.<br />
00000fe4.00000810::2008/02/02-17:13:20.030 ERR [FM] Failed to arbitrate quorum resource c336021a-<br />
083e-4fa0-9d37-7077a590c206, error 2.<br />
00000fe4.00000810::2008/02/02-17:13:20.030 ERR [RGP] Node 1: REGROUP ERROR: arbitration failed.<br />
00000fe4.00000810::2008/02/02-17:13:20.030 ERR [NM] Halting this node due to membership or<br />
communications error. Halt code = 1000<br />
00000fe4.00000810::2008/02/02-17:13:20.030 ERR [CS] Halting this node to prevent an inconsistency<br />
within the cluster. Error status = 5892 (The membership engine requested shutdown of the cluster<br />
service on this node).<br />
00000dc8.00000f34::2008/02/02-17:13:20.670 ERR Unisys SafeGuard 30m Control : KfLogit:<br />
Online resource failed. Pending processing terminated by resource monitor.<br />
00000dc8.00000f34::2008/02/02-17:13:20.670 ERR Unisys SafeGuard 30m Control :<br />
UrcfKConGroupOnlineThread: Error 1117 bringing resource online.<br />
000009e4::2008/02/02-17:29:20.587 ERR [FM] FmGetQuorumResource failed, error 2.<br />
000008e4.000009e4::2008/02/02-17:29:20.587 ERR [INIT] ClusterForm: Could not get quorum resource.<br />
No fixup attempted. Status = 5086<br />
000008e4.000009e4::2008/02/02-17:29:20.587 ERR [INIT] Failed to form cluster, status 5086.<br />
000008e4.000009e4::2008/02/02-17:29:20.587 ERR [CS] ClusterInitialize failed 5086<br />
000008e4.000009e4::2008/02/02-17:29:20.602 ERR [CS] Service Stopped. exit code = 5086<br />
000005b4.000008cc::2008/02/02-17:31:11.075 ERR [FM] FmpSetGroupEnumOwner:: MM returned<br />
MM_INVALID_NODE, chose the default target<br />
00000ff4.000008d8::2008/02/02-17:31:19.901 ERR Unisys SafeGuard 30m Control : KfLogit:<br />
Online resource failed. Cannot complete transfer for auto failover. Action: Verify through the<br />
Management Console that the WAN connection is operational.<br />
Actions to Resolve the Problem<br />
If all RAs and servers on site 1 fail and site 1 owns the quorum resource, perform the<br />
following tasks to recover:<br />
1. Perform a manual failover of the quorum consistency group.<br />
2. Reverse replication direction.<br />
3. Start MSCS.<br />
4. Power on the site if a power failure occurred.<br />
5. Restore the failover settings.<br />
Note: Do not bring up any nodes until the manual recovery process is complete.<br />
Caution<br />
Manual recovery is required only if the quorum device is lost because of a<br />
failure of an RA cluster.<br />
If the cluster nodes at the production site are operational, you must disable<br />
MSCS. You must verify the server status with a network test or attempt to<br />
log in to the server. Use the procedure in “Recovery When All RAs Fail on<br />
Site 1 (Site 1 Quorum Owner).”<br />
Improper use of the manual recovery procedure can lead to an inconsistent<br />
quorum disk and unpredictable results that might require a long recovery<br />
process.<br />
Performing a Manual Failover of the Quorum Consistency Group<br />
To perform a manual failover of the quorum consistency group, follow the procedure<br />
given in the “Actions to Resolve the Problem” for “Recovery When All RAs Fail on Site 1<br />
(Site 1 Quorum Owner)” earlier in this section.<br />
Reversing Replication Direction<br />
1. Select the Consistency Group from the navigation pane.<br />
2. Right-click the Group and select Pause Transfer. Click Yes when the system<br />
prompts that the group activity will be paused.<br />
3. Click the Status tab. The status of the transfer must display Paused.<br />
4. Right-click the Consistency Group and select Failover to &lt;site name&gt;.<br />
5. Click Yes when the system prompts to confirm failover.<br />
6. Ensure that the Start data transfer immediately check box is selected.<br />
The following warning message appears:<br />
Warning: Journal will be erased. Do you wish to continue?<br />
7. Click Yes to continue.<br />
Starting MSCS<br />
MSCS should start within 1 minute on the surviving nodes when the MSCS recovery<br />
setting is enabled. You can manually start MSCS on each node of the surviving site by<br />
completing the following steps:<br />
1. In the Control Panel, point to Administrative Tools, and then click<br />
Services.<br />
2. Right-click Cluster Service, and click Start.<br />
MSCS starts the cluster group and automatically moves all groups to the<br />
first-started cluster node.<br />
3. Repeat steps 1 through 2 for each node on the site.<br />
Powering-on a Site<br />
If a site experienced a power failure, power on the site in the following order:<br />
• Switches<br />
• Storage<br />
Note: Wait until all switches and storage units are initialized before continuing to<br />
power on the site.<br />
• RAs<br />
Note: Wait 10 minutes after you power on the RAs before you power on the hosts.<br />
• Hosts<br />
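The ordering above can be captured as a small checklist script. This is only a sketch: power_on and wait_ready are stubs you would replace with your site's power-control and health-check tooling, and the 10-minute RA wait from the note is encoded as a constant.<br />

```shell
# Power-on checklist enforcing the order: switches, storage, RAs, hosts.
# power_on and wait_ready are stubs; replace them with real commands.
RA_WAIT_SECS=600   # the note above calls for 10 minutes after RA power-on

power_on()   { echo "power on: $1"; }
wait_ready() { echo "waiting for: $1"; }

power_on switches
power_on storage
wait_ready "switch and storage initialization"
power_on RAs
wait_ready "${RA_WAIT_SECS} seconds after RA power-on"
power_on hosts
```

Encoding the sequence this way makes it harder to power on hosts before the RAs have finished initializing.<br />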
Restoring the Global Cluster Mode Property for the Quorum Group<br />
Once the primary site is again operational and you have verified that all nodes at both<br />
sites are online in the cluster, restore the failover settings by completing the following<br />
steps:<br />
1. Click the Quorum Consistency Group (that is, the consistency group that holds<br />
the quorum drive) from the navigation pane.<br />
2. Click the Policy tab.<br />
3. Under Advanced, select Auto-quorum (shared quorum) in the Global<br />
cluster mode list.<br />
4. Ensure that the Allow Regulation check box is selected.<br />
5. Click Apply.<br />
Site 1 Failure (Site 2 Quorum Owner)<br />
Problem Description<br />
If the quorum group is running on site 2 and a complete site failure occurs on site 1, a<br />
quorum failover is not required. Only data groups on the failed site will require failover.<br />
All data that is not mirrored and was in the failed RA cache is lost; the latest image on<br />
the remote site is used to recover. Cluster services will be up on all nodes on site 2, and<br />
cluster nodes will fail on site 1. You cannot move a group to nodes on a site where the<br />
RAs are down (site 1).<br />
MSCS attempts to fail over to a node at site 2. An e-mail alert is sent stating that a site<br />
or RA cluster has failed.<br />
Figure 4–4 illustrates this case.<br />
Figure 4–4. All RAs and Servers Fail on Site 1 (Site 2 Quorum Owner)<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows errors and messages similar to those for<br />
“Total Communication Failure in a Geographic Clustered Environment” in Section 7.<br />
• If you review the system event log, you find messages similar to the following<br />
examples:<br />
System Event Log for Usmv-West2 (Failure Site)<br />
8/3/2006 1:49:26 PM ClusSvc Information Failover Mgr 1205 N/A USMV-WEST2 "The Cluster<br />
Service failed to bring the Resource Group ""Cluster Group"" completely online or offline."<br />
8/3/2008 1:49:26 PM ClusSvc Information Failover Mgr 1203 N/A USMV-WEST2 "The Cluster<br />
Service is attempting to offline the Resource Group ""Cluster Group""."<br />
8/3/2008 1:50:46 PM ClusDisk Error None 1209 N/A USMV-WEST2 Cluster service is requesting a<br />
bus reset for device \Device\ClusDisk0.<br />
• If you review the cluster log, you find messages similar to the following examples:<br />
Cluster Log for Failure Site (Site 1)<br />
00000e50.00000c10::2008/02/02-20:50:46.165 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170 (the requested resource is in use).<br />
00000e50.00000c10::2008/02/02-20:50:46.165 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
00000e50.00000fb4::2008/02/02-20:52:05.133 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6 (the handle is invalid).<br />
Cluster Log for Surviving Site (Site 2)<br />
00000178.00000dd8::2008/02/02-20:49:30.976 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00000178.00000dd8::2008/02/02-20:49:30.992 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
00000d80.00000e68::2008/02/02-20:49:57.679 ERR [GUM] GumSendUpdate: GumQueueLocking update<br />
to node 1 failed with 1818 (The remote procedure call was cancelled).<br />
00000d80.00000e68::2008/02/02-20:49:57.679 ERR [GUM] GumpCommFailure 1818 communicating<br />
with node 1<br />
00000178.00000810::2008/02/02-20:50:45.492 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6 (The handle is invalid).<br />
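When several nodes are involved, scanning exported logs for the event IDs above is faster than reading them by hand. The following is a minimal sketch, assuming logs exported as plain text; the event IDs (1205, 1209) and descriptions are taken from the sample entries above and may differ on other releases.

```python
import re

# Event signatures drawn from the sample system event log entries above.
SIGNATURES = {
    1205: "Cluster Service failed to bring a Resource Group online or offline",
    1209: "Cluster service is requesting a bus reset",
}

def scan_event_log(lines):
    """Return (event_id, line) pairs for entries that mention a known event ID."""
    hits = []
    for line in lines:
        for event_id in SIGNATURES:
            # Match the numeric event ID as a standalone token.
            if re.search(r"\b%d\b" % event_id, line):
                hits.append((event_id, line.strip()))
    return hits

sample = [
    '8/3/2008 1:49:26 PM ClusSvc Information Failover Mgr 1205 N/A USMV-WEST2',
    '8/3/2008 1:50:46 PM ClusDisk Error None 1209 N/A USMV-WEST2',
]
for event_id, line in scan_event_log(sample):
    print(event_id, "-", SIGNATURES[event_id])
```

Extend `SIGNATURES` with any additional event IDs you care about; the word-boundary match avoids false hits inside timestamps or node names.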
Actions to Resolve the Problem<br />
If all RAs and all servers on site 1 fail and site 2 owns the quorum resource, you do not<br />
need to perform manual recovery. Because the surviving site owns the quorum<br />
resource, MSCS restarts automatically, and the data consistency groups fail over to the<br />
surviving site.<br />
Section 5<br />
Solving Storage Problems<br />
This section lists symptoms that usually indicate problems with storage. Table 5–1 lists<br />
symptoms and possible problems indicated by the symptom. The problems and their<br />
solutions are described in this section. The graphics, behaviors, and examples in this<br />
section are similar to what you observe with your system but might differ in some<br />
details.<br />
In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />
for possible problems. Also, messages similar to e-mail notifications might be displayed<br />
on the management console. If you do not see the messages, they might have already<br />
dropped off the display. Review the management console logs for messages that have<br />
dropped off the display.<br />
Table 5–1. Possible Storage Problems with Symptoms<br />
Possible problem: User or replication volume not accessible<br />
Symptoms:<br />
• The system pauses the transfer for the relevant consistency group.<br />
• The server cannot access this volume; writes to this volume fail; the file system<br />
cannot be mounted; and so forth.<br />
• The management console shows an error for all connections to this volume—that is,<br />
all RAs on the relevant site and all splitters attached to this volume.<br />
Possible problem: Repository volume not accessible<br />
Symptoms:<br />
• The system pauses the transfer for all consistency groups.<br />
• The management console shows an error for all connections to this volume—that is,<br />
all RAs on the relevant site and all splitters attached to this volume.<br />
• The event log reports that the repository volume is inaccessible.<br />
• The event log indicates that the repository volume is corrupted.<br />
Possible problem: Journal not accessible<br />
Symptoms:<br />
• The management console shows an error for the connections between this volume<br />
and all RAs on the relevant site.<br />
• The system pauses the transfer for the relevant consistency group.<br />
• The event log indicates that the journal was lost or corrupted.<br />
Possible problem: Total storage loss in a geographic replicated environment<br />
Symptoms:<br />
• No volumes from the relevant target and worldwide name (WWN) are accessible to<br />
any initiator on the SAN.<br />
Possible problem: Storage failure on one site with quorum owner on failed site in a<br />
geographic clustered environment<br />
Symptoms:<br />
• The cluster regroup process begins and the quorum device fails over to a site<br />
without failed storage.<br />
• The management console shows a storage error and replication has stopped.<br />
• Servers report multipath software errors.<br />
Possible problem: Storage failure on one site with quorum owner on surviving site in a<br />
geographic clustered environment<br />
Symptoms:<br />
• Applications that depend on physical disk resources go offline and fail when<br />
attempting to come online.<br />
• Once resource retry threshold parameters are reached, site 1 fails over to site 2.<br />
With the default settings, this timing is about 30 minutes.<br />
Table 5–2 lists specific storage volume failures and the types of errors and indicators on<br />
the management console that distinguish each failure.<br />
Table 5–2. Indicators and Management Console Errors to<br />
Distinguish Different Storage Volume Failures<br />
Failure: Data volume lost or failed<br />
• Groups Paused status: relevant data group<br />
• System status: storage error<br />
• Volumes tab: replication volume with error status<br />
• Logs tab: Error 3012<br />
Failure: Journal volume lost, failed, or corrupt<br />
• Groups Paused status: relevant data group<br />
• System status: storage error<br />
• Volumes tab: journal volume with error status<br />
• Logs tab: Error 3012<br />
Failure: Repository volume lost, failed, or corrupt<br />
• Groups Paused status: all<br />
• System status: storage and RA error/failure<br />
• Volumes tab: repository volume with error status<br />
• Logs tab: Error 3014<br />
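Table 5–2 is effectively a decision table, so its cells can be written down as a small lookup for cross-checking observed indicators. This is only a sketch; the keys and strings below are shorthand for the table's cells, not product identifiers.

```python
# Indicator sets per failure, transcribed from Table 5-2.
FAILURES = {
    "data volume lost or failed": {
        "groups_paused": "relevant data group", "log_event": 3012},
    "journal volume lost, failed, or corrupt": {
        "groups_paused": "relevant data group", "log_event": 3012},
    "repository volume lost, failed, or corrupt": {
        "groups_paused": "all", "log_event": 3014},
}

def candidate_failures(log_event, groups_paused):
    """Narrow down which volume failure matches the observed indicators."""
    return [
        name for name, ind in FAILURES.items()
        if ind["log_event"] == log_event and ind["groups_paused"] == groups_paused
    ]

print(candidate_failures(3014, "all"))  # repository volume failure
```

Note that event 3012 with only the relevant data group paused still leaves two candidates (data versus journal volume); the Volumes tab error status is what separates them.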
User or Replication Volume Not Accessible<br />
Problem Description<br />
The replication volume is not accessible to any host or splitter.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console shows an error for storage and the Volumes tab (status<br />
column) shows additional errors (See Figure 5–1).<br />
Figure 5–1. Volumes Tab Showing Volume Connection Errors<br />
• Warnings and informational messages similar to those shown in Figure 5–2 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 5–2. Management Console Messages for the User Volume Not Accessible<br />
Problem<br />
The following table explains the numbered messages in Figure 5–2.<br />
Reference No. 1, Event ID 4003: Group capabilities problem, with the details<br />
showing that the RA is unable to access the volume. E-mail: immediate.<br />
Reference No. 2, Event ID 3012: The RA is unable to access the volume.<br />
E-mail: daily summary.<br />
• The Groups tab on the management console shows that the system paused the<br />
transfer for the relevant consistency group. (See Figure 5–3.)<br />
Figure 5–3. Groups Tab Shows “Paused by System”<br />
• The server cannot access this volume; writes to this volume fail; the file system<br />
cannot be mounted; and so forth.<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Determine whether other volumes from the same storage device are accessible to<br />
the same RAs to rule out a total storage loss. If no volumes are seen by an RA, refer<br />
to “Total Storage Loss in a Geographic Replicated Environment.”<br />
• Verify that this LUN still exists and has not failed or been removed from the storage<br />
device.<br />
• Verify that the LUN is masked to the proper splitter or splitters and RAs.<br />
• Verify that other servers in the SAN do not use this volume. For example, if an<br />
MSCS cluster in the SAN acquired ownership of this volume, it might reserve the<br />
volume and block other initiators from seeing the volume.<br />
• Verify that the volume has read and write permissions on the storage system.<br />
• Verify that the volume, as configured in the management console, has the expected<br />
WWN and LUN.<br />
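The last check, comparing the WWN and LUN configured in the management console against what the storage device actually presents, can be sketched as a simple diff. The volume name, WWN, and LUN values below are illustrative only, not taken from a real configuration.

```python
def mismatches(configured, discovered):
    """Report volumes whose configured (WWN, LUN) differs from what storage presents."""
    bad = {}
    for name, expected in configured.items():
        actual = discovered.get(name)  # None if storage no longer presents it
        if actual != expected:
            bad[name] = {"expected": expected, "actual": actual}
    return bad

# Hypothetical console view versus the view discovered from the storage device.
configured = {"data1": ("50:06:01:60:88:a0:3f:41", 12)}
discovered = {"data1": ("50:06:01:60:88:a0:3f:41", 12)}
print(mismatches(configured, discovered))  # {} means the console matches storage
```

An empty result means the configured view matches storage; a `None` actual value flags a LUN that has disappeared from the device entirely.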
Repository Volume Not Accessible<br />
Problem Description<br />
The repository volume is not accessible to any SAN-attached initiator, including the<br />
splitter and RAs.<br />
Or, the repository volume is corrupted, either by another initiator because of storage<br />
changes or as a result of storage failure. You must reformat the repository volume<br />
before replication can proceed normally.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console shows an error for all connections to this volume—that is,<br />
all RAs on the relevant site and all splitters attached to this volume. The RAs tab on<br />
the management console shows errors for the volume. (See Figure 5–4.)<br />
The following error messages appear for the RAs error condition when you click<br />
Details:<br />
Error: RA 1 in Sydney can't access repository volume<br />
Error: RA 2 in Sydney can't access repository volume<br />
The following error message appears for the storage error condition, when you click<br />
Details:<br />
Error: Repository volume can't be accessed by any RAs<br />
Figure 5–4. Management Console Display: Storage Error and RAs Tab Shows<br />
Volume Errors<br />
• The Volumes tab on the management console shows an error for the repository<br />
volume, as shown in Figure 5–5.<br />
Figure 5–5. Volumes Tab Shows Error for Repository Volume<br />
• The Groups tab on the management console shows that the system paused the<br />
transfer for all consistency groups, as shown in Figure 5–6.<br />
Figure 5–6. Groups Tab Shows All Groups Paused by System<br />
• The Logs tab on the management console lists a message for event ID 3014. This<br />
message indicates that the RA is unable to access the repository volume or the<br />
repository volume is corrupted. (See Figure 5–7.)<br />
Figure 5–7. Management Console Messages for the Repository Volume Not<br />
Accessible Problem<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Determine whether other volumes from the same storage device are accessible to<br />
the same RAs to rule out a total storage loss. If no volumes are seen by an RA, refer<br />
to “Total Storage Loss in a Geographic Replicated Environment.”<br />
• Verify that this LUN still exists and has not failed or been removed from the storage<br />
device.<br />
• Verify that the LUN is masked to the proper splitter or splitters and RAs.<br />
• Verify that other servers in the SAN do not use this volume. For example, if an<br />
MSCS cluster in the SAN acquired ownership of this volume, it might reserve the<br />
volume and block other initiators from seeing the volume.<br />
• Verify that the volume has read and write permissions on the storage system.<br />
• Verify that the volume, as configured in the management console, has the expected<br />
WWN and LUN.<br />
• If the volume is corrupted or you determine that it must be reformatted, perform the<br />
steps in “Reformatting the Repository Volume.”<br />
Reformatting the Repository Volume<br />
Before you begin the reformatting process in a geographic clustered environment, be<br />
sure that all groups are located at the site for which the repository volume is not to be<br />
formatted.<br />
On RA 1 at the site for which the repository volume is to be formatted, determine from<br />
the Site Planning Guide which LUN is used for the repository volume. If the LUN is not<br />
recorded for the repository volume, a list is presented during the volume formatting<br />
process that shows LUNs and the previously used repository volume is identified.<br />
Perform the following steps to reformat a repository volume for a particular site:<br />
1. Click the Data Group in the Management Console, and perform the following<br />
steps:<br />
a. Click Policy in the right pane and change the Global Cluster mode<br />
selection to Manual.<br />
b. Click Apply.<br />
c. Right-click the Data Group and select Disable Group.<br />
d. Click Yes when the system prompts that the copy activities will be stopped.<br />
2. Skip to step 6 for geographic replication environments.<br />
3. Perform the following steps for geographic clustered environments:<br />
a. Open the Group Policy window for the quorum group.<br />
b. Change the Global Cluster mode selection to Manual.<br />
c. Click Apply.<br />
4. Right-click the Consistency Group and select Disable Group.<br />
5. Click Yes when the system prompts that the copy activities will be stopped.<br />
6. Select the Splitters tab.<br />
a. Open the Splitter Properties window for the splitter.<br />
b. Select all the attached volumes.<br />
c. Click Detach and then click Apply.<br />
d. Click OK to close the window.<br />
e. Delete the splitter at the site for which the repository volume is to be<br />
reformatted.<br />
7. Open the PuTTY session on RA1 for the site.<br />
a. Log on with boxmgmt as the User ID and boxmgmt as the password.<br />
The Main menu is displayed.<br />
b. At the prompt, type 2 (Setup) and press Enter.<br />
c. On the Setup menu, type 2 (Configure repository volume) and press Enter.<br />
d. Type 1 (Format repository volume) and press Enter.<br />
e. Enter the appropriate number from the list to select the LUN. Ensure that<br />
the WWN and LUN are for the volume that you want to format. The LUN<br />
and identifier are displayed.<br />
f. Confirm the volume to format.<br />
All data is removed from the volume.<br />
g. Verify that the operation succeeds and press Enter.<br />
h. On the Main Menu, type Q (quit) and press Enter.<br />
8. Open a PuTTY session on each additional RA at the site for which the repository<br />
volume is to be formatted.<br />
9. Log on with boxmgmt as the user ID and boxmgmt as the password.<br />
The Main menu is displayed.<br />
a. At the prompt, type 2 (Setup) and press Enter.<br />
b. On the Setup menu, type 2 (Configure repository volume) and press Enter.<br />
c. Type 2 (Select a previously formatted repository volume) and press Enter.<br />
d. Enter the appropriate number from the list to select the LUN. Ensure that<br />
the WWN and LUN are for the previously formatted volume that you want<br />
to select. The LUN and identifier are displayed.<br />
e. Confirm the volume selection.<br />
f. Verify that the operation succeeds and press Enter.<br />
g. On the Main menu, type Q (quit) and press Enter.<br />
Note: Complete step 9 for each additional RA at the site.<br />
10. On the Management Console, select the Splitters tab.<br />
a. Click the Add New Splitter icon to open the Add splitter window.<br />
b. Click Rescan and select the splitter.<br />
11. Open the Group Properties window, click the Policy tab, and perform the<br />
following steps for each data group:<br />
a. Change the Global cluster mode selection to auto-data (shared<br />
quorum).<br />
b. Right-click the Data Group and click Enable Group.<br />
12. Skip to step 16 for geographic replication environments.<br />
13. Perform the following steps for geographic clustered environments.<br />
a. Right-click the Quorum Group and click Enable Group.<br />
b. Click the Quorum Group and select Policy in the right pane.<br />
c. Change the Global Cluster mode selection to Auto-quorum (shared<br />
quorum).<br />
14. Verify that initialization completes for all the groups.<br />
15. Review the Management Console event log.<br />
16. Ensure that no storage error or other component error appears.<br />
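The boxmgmt menu navigation in steps 7 and 9 follows a fixed choice sequence, which can be captured as data if you script the session (for example over SSH). This sketch only encodes the menu choices from the steps above; the interactive LUN confirmation still belongs to the operator.

```python
def repository_menu_sequence(action):
    """Return the boxmgmt menu choices to send, in order.

    "2" (Setup), then "2" (Configure repository volume), then "1" to format
    a repository volume or "2" to select a previously formatted one, and
    "Q" to quit back to the Main menu afterward.
    """
    choices = {"format": ["2", "2", "1"], "select": ["2", "2", "2"]}
    if action not in choices:
        raise ValueError("action must be 'format' or 'select'")
    return choices[action] + ["Q"]

print(repository_menu_sequence("format"))  # on RA 1 at the affected site
print(repository_menu_sequence("select"))  # on each additional RA at the site
```

Keeping the sequences as data makes it obvious that the only difference between RA 1 and the remaining RAs is the third choice: format once, then select everywhere else.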
Journal Not Accessible<br />
Problem Description<br />
The journal is not accessible to either RA.<br />
Or, a journal for one of the consistency groups is corrupted, either by another initiator<br />
because of storage changes or as a result of storage failure. Because the snapshot<br />
history is corrupted, replication for the relevant consistency group cannot proceed.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The Volumes tab on the management console shows an error for the journal volume.<br />
(See Figure 5–8.)<br />
Figure 5–8. Volumes Tab Shows Journal Volume Error<br />
• The RAs tab on the management console shows errors for connections between<br />
this volume and the RAs. (See Figure 5–9.)<br />
Figure 5–9. RAs Tab Shows Connection Errors<br />
• The Groups tab on the management console shows that the system paused the<br />
transfer for the relevant consistency group, as shown in Figure 5–10.<br />
Figure 5–10. Groups Tab Shows Group Paused by System<br />
• The Logs tab on the management console lists a message for event ID 3012. This<br />
message indicates that the RA is unable to access the volume. (See Figure 5–11.)<br />
Figure 5–11. Management Console Messages for the Journal Not Accessible<br />
Problem<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Determine whether other volumes from the same storage device are accessible to<br />
the same RAs to rule out a total storage loss. If no volumes are seen by an RA, refer<br />
to “Total Storage Loss in a Geographic Replicated Environment.”<br />
• Verify that this LUN still exists on the storage device and that it is only masked to<br />
the RAs.<br />
• Verify that the volume has read and write permissions on the storage system.<br />
• Verify that the volume, as configured in the management console, has the expected<br />
WWN and LUN.<br />
• For a corrupted journal, check that the system recovers automatically by re-creating<br />
the data structures for the corrupted journal and that the system then initiates a<br />
full-sweep resynchronization. No manual intervention is needed.<br />
Journal Volume Lost Scenarios<br />
Problem Description<br />
The journal volume is lost and becomes unavailable in the scenarios described below.<br />
Scenarios<br />
• Writing data to the journal volume faster than the journal data can be distributed to<br />
the replication volume results in journal data loss. In this case, the journal volume<br />
can fill, and an attempt to perform a write operation on it creates a problem.<br />
• The user performs the following operations:<br />
− Failover<br />
− Recover production<br />
Actions to Resolve<br />
You can minimize the occurrence of this problem in the first scenario by carefully<br />
configuring the journal lag. The problem is unavoidable in the second scenario.<br />
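The first scenario is a rate problem: the journal consumes free space at the difference between the incoming write rate and the distribution rate. A rough back-of-the-envelope check, with illustrative numbers rather than measured values:

```python
def time_until_journal_full(capacity_gb, write_rate_mbps, distribute_rate_mbps):
    """Seconds until a journal of capacity_gb fills, or None if it never fills.

    Rates are in MB/s; a sustained write rate above the distribution rate
    consumes free journal space at the difference of the two.
    """
    surplus = write_rate_mbps - distribute_rate_mbps
    if surplus <= 0:
        return None  # distribution keeps up, so the journal drains
    return capacity_gb * 1024 / surplus

# Illustrative: a 100 GB journal, 80 MB/s writes, 60 MB/s distribution.
print(time_until_journal_full(100, 80, 60) / 60, "minutes until full")
```

With those example numbers, the journal fills in roughly 85 minutes; sizing the journal (or limiting the sustained write burst) so the surplus stays near zero is what "carefully configuring the journal lag" amounts to.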
Total Storage Loss in a Geographic Replicated<br />
Environment<br />
Problem Description<br />
All volumes belonging to a certain storage target and WWN (or controller, device) have<br />
been lost.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The symptoms can be the same as those from any of the volume failure problems<br />
listed previously (or a subset of those symptoms), if the symptoms are relevant to<br />
the volumes that were used on this target. All volumes common to a particular<br />
storage array have failed.<br />
The Volumes tab on the management console shows errors for all volumes. (See<br />
Figure 5–12.)<br />
Figure 5–12. Management Console Volumes Tab Shows Errors for All Volumes<br />
• No volumes from the relevant target and WWN are accessible to any initiator on the<br />
SAN, as shown on the RAs tab on the management console. (See Figure 5–13.)<br />
Figure 5–13. RAs Tab Shows Volumes That Are Not Accessible<br />
• Multipathing software (such as EMC PowerPath Administrator) reports failed paths<br />
to the storage device, as shown in Figure 5–14.<br />
Figure 5–14. Multipathing Software Reports Failed Paths to Storage Device<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Verify that the storage device has not experienced a power outage and that the<br />
device is functioning normally according to all external indicators.<br />
• Verify that the Fibre Channel switch and the storage device indicate an operating<br />
Fibre Channel connection (that is, the relevant LEDs show OK). If the indicators are<br />
not OK, the problem might be a faulty Fibre Channel port (storage, switch, or patch<br />
panel) or a faulty Fibre Channel cable.<br />
• Verify that the initiator can be seen from the switch name server. If not, the problem<br />
could be a Fibre Channel port or cable problem (as in the preceding item). Otherwise,<br />
the problem could be a misconfiguration of the port on the switch (for example, type<br />
or speed could be wrong).<br />
• Verify that the target WWN is included in the relevant zones (that is, hosts and RA).<br />
Verify also that the current zoning configuration is the active configuration. If you use<br />
the default zone, verify that it is set to permit by default.<br />
• Verify that the relevant LUNs still exist on the storage device and are masked to the<br />
proper splitters and RAs.<br />
• Verify that volumes have read and write permissions on the storage system.<br />
• Verify that these volumes are exposed and managed by the proper hosts and that<br />
there are no other hosts on the SAN that use this volume.<br />
Storage Failure on One Site in a Geographic<br />
Clustered Environment<br />
In a geographic clustered environment where MSCS is running, if the storage subsystem<br />
on one site fails, the symptoms and resulting actions depend on whether the quorum<br />
owner resided on the failed storage subsystem.<br />
To understand the two scenarios and to follow the actions for both possibilities, review<br />
Figure 5–15.<br />
Figure 5–15. Storage on Site 1 Fails<br />
Storage Failure on One Site with Quorum Owner on Failed Site<br />
Problem Description<br />
In this case, the cluster quorum owner as well as the quorum resource resides on the<br />
failed storage subsystem.<br />
The quorum and resource automatically fail over to the node that gains control through<br />
MSCS arbitration. This node resides on the site without the storage failure.<br />
The RAs use the last available image. This action results in a loss of data that has yet to<br />
be replicated. The resources cannot fail back to the failed site until the storage<br />
subsystem is restored.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• A node on which the cluster was running might report a delayed write failure or<br />
similar error.<br />
• The quorum reservation is lost, and MSCS stops on the cluster node that owned the<br />
quorum resource. This action triggers a cluster “regroup” process, which allows<br />
other cluster nodes to arbitrate for the quorum device. Figure 5–16 shows typical<br />
listings for the cluster regroup process.<br />
Figure 5–16. Cluster “Regroup” Process<br />
• Cluster nodes located on the failed storage subsystem fail quorum arbitration<br />
because the service cannot provide a reservation on the quorum volume. The<br />
resources fail over to the site without a storage failure. The first cluster node on the<br />
site without the storage failure that successfully completes arbitration of the quorum<br />
device assumes ownership of the cluster.<br />
The following messages illustrate this process.<br />
Cluster Log Entries<br />
INFO Physical Disk : [DiskArb]------- DisksArbitrate -------.<br />
INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Attaching to disk with<br />
signature f6fb216<br />
INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Disk unique id present<br />
trying new attach<br />
INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Retrieving disk number<br />
from ClusDisk registry key<br />
INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Retrieving handle to<br />
PhysicalDrive9<br />
INFO Physical Disk : [DiskArb] DisksOpenResourceFileHandle: Returns success.<br />
INFO Physical Disk : [DiskArb] Arbitration Parameters: ArbAttempts 5,<br />
SleepBeforeRetry 500 ms.<br />
INFO Physical Disk : [DiskArb] Read the partition info to insure the disk is<br />
accessible.<br />
INFO Physical Disk : [DiskArb] Issuing GetPartInfo on signature f6fb216.<br />
INFO Physical Disk : [DiskArb] GetPartInfo completed, status 0.<br />
INFO Physical Disk : [DiskArb] Arbitrate for ownership of the disk by<br />
reading/writing various disk sectors.<br />
INFO Physical Disk : [DiskArb] Successful read (sector 12) [:0]<br />
(0,00000000:00000000).<br />
INFO Physical Disk : [DiskArb] Successful write (sector 11) [USMV-DL580:0]<br />
(0,6ddd5cac:01c6d778).<br />
INFO Physical Disk : [DiskArb] Successful read (sector 12) [:0]<br />
(0,00000000:00000000).<br />
INFO Physical Disk : [DiskArb] Successful write (sector 12) [USMV-DL580:0]<br />
(0,6ddd5cac:01c6d778).<br />
INFO Physical Disk : [DiskArb] Successful read (sector 11) [USMV-DL580:0]<br />
(0,6ddd5cac:01c6d778).<br />
INFO Physical Disk : [DiskArb] Issuing Reserve on signature f6fb216.<br />
INFO Physical Disk : [DiskArb] Reserve completed, status 0.<br />
INFO Physical Disk : [DiskArb] CompletionRoutine starts.<br />
INFO Physical Disk : [DiskArb] Posting request to check reserve progress.<br />
INFO Physical Disk : [DiskArb] ********* IO_PENDING ********** - Request to insure<br />
reserves working is now posted.<br />
WARN Physical Disk : [DiskArb] Assume ownership of the device.<br />
INFO Physical Disk : [DiskArb] Arbitrate returned status 0.<br />
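The arbitration transcript above ends with "Reserve completed, status 0", which marks success. A small helper can classify such transcripts when you are triaging logs from several nodes; it matches only the strings shown in the examples in this section and may need adjusting for other log variants.

```python
def arbitration_succeeded(log_lines):
    """True if a DiskArb transcript shows the reservation completing with status 0."""
    for line in log_lines:
        if "Reserve completed, status 0" in line:
            return True  # the node won arbitration and holds the reservation
        if "arbitration failed" in line.lower():
            return False  # the node could not reserve the quorum device
    return False  # no verdict found in the transcript

transcript = [
    "INFO Physical Disk : [DiskArb] Issuing Reserve on signature f6fb216.",
    "INFO Physical Disk : [DiskArb] Reserve completed, status 0.",
]
print(arbitration_succeeded(transcript))
```

Running this over the per-node cluster logs quickly shows which node assumed ownership and which nodes failed arbitration because they could not reach the quorum volume.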
• In Cluster Administrator, the groups that were online on one node change to the<br />
node that wins arbitration, as shown in Figure 5–17.<br />
Figure 5–17. Cluster Administrator Displays<br />
• Multipathing software, if present, reports errors on the host servers of the site for<br />
which the storage subsystem failed. Figure 5–18 shows errors for failed storage<br />
devices.<br />
Figure 5–18. Multipathing Software Shows Server Errors for Failed Storage<br />
Subsystem<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Verify that all cluster resources failed over to a node on the site for which the<br />
storage subsystem did not fail and that these resources are online. If the cluster is<br />
running and no additional errors are reported, the problem has probably been isolated<br />
to a total site storage failure.<br />
• Log in to the storage subsystem, and verify that all LUNs are present and configured<br />
properly.<br />
• If the storage subsystem appears to be operating, the problem is most likely<br />
because of a failed SAN switch. See “Total SAN Switch Failure on One Site in a<br />
Geographic Clustered Environment” in Section 6.<br />
• Resolve the failure of the storage subsystem before attempting failback. Once the<br />
storage subsystem is working and the RAs and host can access it, a full initialization<br />
is initiated.<br />
Storage Failure on One Site with Quorum Owner on Surviving<br />
Site<br />
Problem Description<br />
In this case, the cluster quorum owner does not reside on the failed storage subsystem,<br />
but other resources do reside on the failed storage subsystem.<br />
The cluster resources fail over to a site without a failed storage subsystem. The RAs use<br />
the last available image. This action results in a loss of data that has yet to be replicated<br />
(if replication is not synchronous). The resources cannot fail back to the failed site until<br />
the storage subsystem is restored.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The cluster marks the data groups containing the physical disk resources as failed.<br />
• Applications dependent on the physical disk resource go offline. Failed resources<br />
attempt to come online on the failed site, but fail. Then the resources fail over to the<br />
site with a valid storage subsystem.<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Verify that multipathing software, if present, reports errors on the host servers at the<br />
site with the suspected failed storage subsystem. (See Figure 5–19.)<br />
• Verify that all cluster resources failed over to site 2 in Cluster Administrator. Entries<br />
similar to the following occur in the cluster log for a host at the site with a failed<br />
storage subsystem (thread ID and timestamp removed).<br />
Cluster Log<br />
Disk reservation lost ..<br />
ERR Physical Disk : [DiskArb] CompletionRoutine: reservation lost! Status 2<br />
Arbitrate for disk ....<br />
INFO Physical Disk : [DiskArb] Arbitration Parameters: ArbAttempts 5,<br />
SleepBeforeRetry 500 ms.<br />
INFO Physical Disk : [DiskArb] Read the partition info to insure the disk is<br />
accessible.<br />
INFO Physical Disk : [DiskArb] Issuing GetPartInfo on signature f6fb211.<br />
ERR Physical Disk : [DiskArb] GetPartInfo completed, status 2.<br />
INFO Physical Disk : [DiskArb] Arbitrate for ownership of the disk by<br />
reading/writing various disk sectors.<br />
ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 2.<br />
INFO Physical Disk : [DiskArb] We are about to break reserve.<br />
INFO Physical Disk : [DiskArb] Issuing BusReset on signature f6fb211.<br />
Give up after 5 re-tries ...<br />
INFO Physical Disk : [DiskArb] We are about to break reserve.<br />
INFO Physical Disk : [DiskArb] Issuing BusReset on signature f6fb211.<br />
INFO Physical Disk : [DiskArb] BusReset completed, status 0.<br />
INFO Physical Disk : [DiskArb] Read the partition info from the disk to insure<br />
disk is accessible.<br />
INFO Physical Disk : [DiskArb] Issuing GetPartInfo on signature f6fb211.<br />
ERR Physical Disk : [DiskArb] GetPartInfo completed, status 2.<br />
ERR Physical Disk : [DiskArb] Failed to write (sector 12), error 2.<br />
ERR Physical Disk : Online, arbitration failed. Error: 2.<br />
INFO Physical Disk : Online, setting ResourceState 4 .<br />
Control goes offline at failed site...<br />
INFO [FM] FmpDoMoveGroup: Entry<br />
INFO [FM] FmpMoveGroup: Entry<br />
INFO [FM] FmpMoveGroup: Moving group 97ac3c3b-6985-44dd-bacd-a26e14966572 to node 4 (4)<br />
INFO [FM] FmpOfflineResource: Disk R: depends on Data1. Shut down first.<br />
INFO Unisys SafeGuard 30m Control : KfResourceOffline: Resource 'Data1' going<br />
offline.<br />
After trying other nodes at site move to remote site ...<br />
INFO [FM] FmpMoveGroup: Take group 97ac3c3b-6985-44dd-bacd-a26e14966572 request to remote<br />
node 4<br />
Move succeeds ...<br />
INFO [FM] FmpMoveGroup: Exit group , status = 0<br />
INFO [FM] FmpDoMoveGroup: Exit, status = 0<br />
INFO [FM] FmpDoMoveGroupOnFailure: FmpDoMoveGroup returns 0<br />
INFO [FM] FmpDoMoveGroupOnFailure Exit.<br />
INFO [GUM] s_GumUpdateNode: dispatching seq 5720 type 0 context 9<br />
INFO [FM] GUM update group 97ac3c3b-6985-44dd-bacd-a26e14966572, state 0<br />
INFO [FM] New owner of Group 97ac3c3b-6985-44dd-bacd-a26e14966572 is 2, state 0, curstate<br />
0.<br />
• Log in to the failed storage subsystem and determine whether the storage reports<br />
failed or missing disks. If the storage subsystem appears to be fine, the problem is<br />
most likely because of a SAN switch failure. See “Total SAN Switch Failure on One<br />
Site in a Geographic Clustered Environment” in Section 6.<br />
• Once the storage for the site that failed is back online, a full sweep is initiated.<br />
Check that the messages “Starting volume sweep” and “Starting full sweep” are<br />
displayed as an Events notice.<br />
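Rather than watching the console, you can check collected notices for the two sweep-start messages programmatically. A minimal sketch (the notice text is taken from the bullet above; the `sweep_started` helper is illustrative, not part of the product):<br />

```python
def sweep_started(notice_lines):
    """Check collected console notices for both sweep-start messages."""
    text = "\n".join(notice_lines)
    return ("Starting volume sweep" in text) and ("Starting full sweep" in text)

notices = [
    "Starting volume sweep",   # per-volume sweep begins
    "Starting full sweep",     # full sweep begins
]
print(sweep_started(notices))  # prints True
```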
6872 5688–002 5–21
Section 6<br />
Solving SAN Connectivity Problems<br />
This section lists symptoms that usually indicate problems with connections to the<br />
storage subsystem. Table 6–1 lists symptoms and possible problems indicated by the<br />
symptom. The problems and their solutions are described in this section. The graphics,<br />
behaviors, and examples in this section are similar to what you observe with your<br />
system but might differ in some details.<br />
In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />
for possible problems. Messages similar to the e-mail notifications might also be<br />
displayed on the management console. If you do not see these messages, they might<br />
have already dropped off the display; review the management console logs to find<br />
messages that have dropped off the display.<br />
Table 6–1. Possible SAN Connectivity Problems<br />
Symptoms: The system pauses the transfer. If the volume is accessible to another RA, a<br />
switchover occurs, and the relevant groups start running on the new RA. The relevant<br />
message appears in the event log. The link to the volume from the disconnected RA or<br />
RAs shows an error. The volume is accessible to the splitters that are attached to it.<br />
Possible problem: Volume not accessible to RAs<br />
Symptoms: The system pauses the transfer for the relevant groups. If the repository<br />
volume is not accessible, the management console shows an error for the splitter. If a<br />
replication volume is not accessible, the splitter connection to that volume shows an<br />
error.<br />
Possible problem: Volume not accessible to <strong>SafeGuard</strong> 30m splitter<br />
Symptoms: The system pauses the transfer for the relevant group or groups. If the<br />
connection with only one of the RAs is lost, the group or groups can restart the transfer<br />
by means of another RA, beginning with a short initialization. The splitter connection to<br />
the relevant RAs shows an error. The relevant message describes the lost connection in<br />
the event log.<br />
Possible problem: RAs not accessible to <strong>SafeGuard</strong> 30m splitter<br />
Symptoms: The management console shows a server down. Messages on the<br />
management console show that the splitter is down and that the node fails over.<br />
Multipathing software (such as EMC PowerPath Administrator) messages report an<br />
error.<br />
Possible problem: Server unable to connect with SAN (see “Server Unable to Connect<br />
with SAN” in Section 9; this problem is not described in this section)<br />
Symptoms: Cluster nodes fail and the cluster regroup process begins. Applications fail<br />
and attempt to restart. Messages regarding failed physical disks are displayed on the<br />
management console. The cluster resources fail over to the remote site.<br />
Possible problem: Total SAN switch failure on one site in a geographic clustered<br />
environment<br />
Volume Not Accessible to RAs<br />
Problem Description<br />
A volume (repository volume, replication volume, or journal) is not accessible to one or<br />
more RAs, but it is accessible to all other relevant initiators (that is, the splitters).<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The system pauses the transfer. If the volume is accessible to another RA, a<br />
switchover occurs, and the relevant group or groups start running on the new RA.<br />
• The management console displays failures similar to those in Figure 6–1.<br />
Figure 6–1. Management Console Showing “Inaccessible Volume” Errors<br />
• Warnings and informational messages similar to those shown in Figure 6–2 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 6–2. Management Console Messages for Inaccessible Volumes<br />
The following table explains the numbered messages shown in Figure 6–2.<br />
1 (Event ID 3012): The RA is unable to access the volume (RA 2, quorum).<br />
2 (Event ID 5049): Splitter write to RA failed.<br />
3 (Event ID 4003): For each consistency group, the surviving site reports a group<br />
consistency problem. The details show a WAN problem.<br />
4 (Event ID 4044): The group is deactivated indefinitely by the system.<br />
5 (Event ID 4003): For each consistency group, a minor problem is reported. The details<br />
show that sides are not linked and also cannot transfer data.<br />
6 (Event ID 4001): For each consistency group, a minor problem is reported. The details<br />
show that sides are not linked and also cannot transfer data.<br />
7 (Event ID 5032): The splitter is splitting to replication volumes.<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
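When scanning collected logs by hand, the event IDs used in this section can be kept in a small lookup. The summaries below are transcribed from the table above; Appendix E remains the authoritative reference, and the `describe` helper is illustrative, not part of the product:<br />

```python
# Event IDs and summaries transcribed from the table above; Appendix E of
# this guide is the authoritative reference for all event IDs.
EVENT_SUMMARIES = {
    3012: "RA is unable to access the volume",
    4001: "Minor group problem: sides not linked and cannot transfer data",
    4003: "Group consistency problem reported",
    4044: "Group deactivated indefinitely by the system",
    5032: "Splitter is splitting to replication volumes",
    5049: "Splitter write to RA failed",
}

def describe(event_id):
    """Summarize an event ID found in a collected management console log."""
    return EVENT_SUMMARIES.get(event_id, "unknown event; see Appendix E")

print(describe(5049))  # prints: Splitter write to RA failed
```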
• If you review the Windows system event log, you can find messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
previous management console images:<br />
System Event Log for USMV-SYDNEY Host (Host on Failure Site)<br />
5/28/2008 9:31:53 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-SYDNEY<br />
Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.<br />
5/28/2008 9:31:53 PM Srv Warning None 2012 N/A USMV-SYDNEY While transmitting or receiving<br />
data, the server encountered a network error. Occasional errors are expected, but large amounts of these<br />
indicate a possible error in your network configuration. The error status code is contained within the<br />
returned data (formatted as Words) and may point you towards the problem.<br />
5/28/2008 9:31:54 PM Ftdisk Warning Disk 57 N/A USMV-CAS100P2 The system failed to<br />
flush data to the transaction log. Corruption may occur.<br />
5/28/2008 9:32:54 PM Service Control Manager Information None 7035 CLUSTERNET\clusadmin USMV-<br />
SYDNEY The Windows Internet Name Service (WINS) service was successfully sent a stop control.<br />
System Event Log for Usmv-x455 Host (Host on Surviving Site)<br />
5/28/2008 9:32:56 PM ClusSvc Warning Node Mgr 1123 N/A USMV-X455<br />
The node lost communication with cluster node 'USMV-SYDNEY' on network '<strong>Public</strong>'.<br />
5/28/2008 9:32:56 PM ClusSvc Warning Node Mgr 1123 N/A USMV-X455<br />
The node lost communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
5/28/2008 9:33:10 PM ClusDisk Error None 1209 N/A USMV-X455<br />
Cluster service is requesting a bus reset for device \Device\ClusDisk0.<br />
5/28/2008 9:33:30 PM ClusSvc Warning Node Mgr 1135 N/A USMV-X455 Cluster node USMV-<br />
SYDNEY was removed from the active server cluster membership. Cluster service may have been<br />
stopped on the node, the node may have failed, or the node may have lost communication with the other<br />
active server cluster nodes.<br />
5/28/2008 9:33:30 PM ClusSvc Information Failover Mgr 1200 N/A USMV-X455<br />
"The Cluster Service is attempting to bring online the Resource Group ""Cluster Group""."<br />
5/28/2008 9:33:34 PM ClusSvc Information Failover Mgr 1201 N/A USMV-X455<br />
"The Cluster Service brought the Resource Group ""Cluster Group"" online."<br />
5/28/2008 9:34:08 PM Service Control Manager Information None 7036 N/A USMV-X455<br />
The Windows Internet Name Service (WINS) service entered the running state.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images:<br />
Cluster Log for USMV-SYDNEY Host (Host on Failure Site)<br />
00000e44.00000380::2008/05/28-21:31:53.841 ERR Physical Disk : [DiskArb]<br />
CompletionRoutine: reservation lost! Status 170 (Error 170: the requested resource is in use)<br />
00000e44.00000380::2008/05/28-21:31:53.841 ERR [RM] LostQuorumResource, cluster service<br />
terminated...<br />
00000e44.00000f0c::2008/05/28-21:31:55.011 ERR Network Name : Unable to<br />
open handle to cluster, status 1753. (Error 1753: there are no more endpoints available from the endpoint<br />
mapper)<br />
00000e44.00000f08::2008/05/28-21:31:55.341 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6. (Error 6: the handle is invalid)<br />
00000e44.00000f0c::2008/05/28-1:35:56.125 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
00000e44.00000f0c::2008/05/28-1:35:56.125 ERR Physical Disk : [DiskArb] Error cleaning<br />
arbitration sector, error 170.<br />
Cluster Log for Usmv-x455 Host (Host on Surviving Site)<br />
0000015c.00000234::2008/05/28-21:31:56.299 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 2<br />
0000015c.00000234::2008/05/28-21:31:56.299 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 1<br />
00000688.00000e10::2008/05/28-1:35:10.712 ERR Physical Disk : [DiskArb] Signature of disk<br />
has changed or failed to find disk with id, old signature 0x98f3f0b new signature 0x98f3f0b, status 2.<br />
(Error 2: The system cannot find the file specified)<br />
0000015c.000007c8::2008/05/28-1:35:31.136 WARN [NM] Interface f409cf69-9c30-48f0-8519-<br />
ad5dd14c3300 is unavailable (node: USMV-SYDNEY, network: Private LAN).<br />
0000015c.000004fc::2008/05/28-1:35:31.136 WARN [NM] Interface 5019923b-d7a1-4886-825f-<br />
207b5938d11e is unavailable (node: USMV-SYDNEY, network: <strong>Public</strong>).<br />
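The cluster log excerpts above share a fixed line prefix: process and thread IDs, a timestamp, and a severity. When triaging a large cluster.log, a short filter for ERR entries can help locate the failure window quickly. This is a sketch against the line format shown above; the `errors` helper is ours, not part of any product:<br />

```python
import re

# Cluster log line format, as in the excerpts above:
#   <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.mmm> <LEVEL> <component and message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]+)\.(?P<tid>[0-9a-f]+)::"
    r"(?P<when>\d{4}/\d{2}/\d{2}-\d{1,2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>ERR|WARN|INFO)\s+(?P<text>.*)$"
)

def errors(lines):
    """Yield (timestamp, text) for every ERR entry in a cluster log."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("level") == "ERR":
            yield m.group("when"), m.group("text")

sample = [
    "00000e44.00000380::2008/05/28-21:31:53.841 ERR Physical Disk : "
    "[DiskArb] CompletionRoutine: reservation lost! Status 170",
    "0000015c.00000234::2008/05/28-21:31:56.299 INFO [ClMsg] Received "
    "interface unreachable event for node 1 network 2",
]
for when, text in errors(sample):
    print(when, text)
```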
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Verify that the physical connection between the inaccessible RAs and the Fibre<br />
Channel switch is healthy.<br />
• Verify that any disconnected RA appears in the name server of the Fibre Channel<br />
switch. If not, the problem could be because of a bad port on the switch, a bad host<br />
bus adaptor (HBA), or a bad cable.<br />
• Verify that any disconnected RA is present in the proper zone and that the current<br />
zoning configuration is enabled.<br />
• Verify that the correct volume is configured (WWN and LUN). To double-check, enter<br />
the Create Volume command in the management console, and verify that the same<br />
volume does not appear on the list of volumes that are available to be “created.”<br />
• If the volume is not accessible to the RAs but is accessible to a splitter, and the<br />
server on which that splitter is installed is clustered using MSCS, Oracle RAC, or any<br />
other software that uses a reservation method, the problem probably occurs<br />
because the server has reserved the volume.<br />
For more information about the clustered environment installation process, see the<br />
Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Planning and Installation <strong>Guide</strong> and the Unisys<br />
<strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator's <strong>Guide</strong>.<br />
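The name-server and zoning checks above reduce to a set-membership question: the disconnected RA and the storage port must share at least one zone in the currently enabled configuration. A sketch of that check, where the zone names and WWPNs are hypothetical:<br />

```python
def shared_enabled_zones(zones, enabled, wwpn_a, wwpn_b):
    """Return the enabled zones that contain both WWPNs.

    zones:   mapping of zone name to the set of member WWPNs
    enabled: set of zone names in the currently enabled configuration
    """
    return [name for name, members in zones.items()
            if name in enabled and wwpn_a in members and wwpn_b in members]

# Hypothetical fabric: one zone for the RA, one for the hosts.
zones = {
    "RA1_storage":  {"10:00:00:00:c9:00:00:01", "50:06:01:60:00:00:00:01"},
    "host_storage": {"10:00:00:00:c9:00:00:99", "50:06:01:60:00:00:00:01"},
}
enabled = {"RA1_storage"}
print(shared_enabled_zones(zones, enabled,
                           "10:00:00:00:c9:00:00:01",
                           "50:06:01:60:00:00:00:01"))
```

An empty result for an RA/storage pair points at the zoning, even when both ports are individually healthy.<br />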
Volume Not Accessible to <strong>SafeGuard</strong> 30m Splitter<br />
Problem Description<br />
A volume (repository volume, replication volume, or journal) is not accessible to one or<br />
more splitters but is accessible to all other relevant initiators (for example, the RAs).<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The system pauses the transfer for the relevant groups.<br />
• If the repository volume is not accessible, the management console shows an error<br />
for the splitter. If a replication volume is not accessible, the splitter connection to<br />
that volume shows an error.<br />
• The management console System Status screen and the Splitter Settings screen<br />
show error indications similar to those in Figure 6–3.<br />
Figure 6–3. Management Console Error Display Screen<br />
• Warnings and informational messages similar to those shown in Figure 6–4 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 6–4. Management Console Messages for Volumes Inaccessible to Splitter<br />
The following table explains the numbered messages shown in Figure 6–4.<br />
1 (Event ID 4008): For each consistency group at the failed site, the transfer is paused<br />
to allow a failover to the surviving site.<br />
2 (Event ID 5030): The splitter write operation failed.<br />
3 (Event ID 4001): For each consistency group, a minor problem is reported. The details<br />
show sides are not linked and cannot transfer data.<br />
4 (Event ID 4005): Negotiating the transfer protocol.<br />
5 (Event ID 4016): Transferring the latest snapshot before pausing the transfer (no data<br />
is lost).<br />
6 (Event ID 4007): Pausing the data transfer.<br />
7 (Event ID 4087): For each consistency group at the failed site, initialization completes.<br />
8 (Event ID 5032): The splitter is splitting to replication volumes at the surviving site.<br />
9 (Event ID 5049): Splitter write to RA failed.<br />
10 (Event ID 4086): For each consistency group at the failed site, the data transfer starts<br />
and then the initialization starts.<br />
11 (Event ID 4104): The group started accepting writes.<br />
12 (Event ID 5015): The splitter is up.<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
• The multipathing software (such as EMC PowerPath) on the server at the failed site<br />
reports disk errors as shown in Figure 6–5.<br />
Figure 6–5. EMC PowerPath Shows Disk Error<br />
• If you review the Windows system event log, you can find messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
previous management console images:<br />
System Event Log for USMV-SYDNEY Host (Host on Failure Site)<br />
5/29/2008 1:35:20 AM EmcpBase Error None 108 N/A USMV-SYDNEY Volume<br />
6006016011321100158233EDE0B23DB11 is unbound.<br />
5/29/2008 1:35:20 AM EmcpBase Error None 100 N/A USMV-SYDNEY Path Bus 5 Tgt 3 Lun 2<br />
to APM00042302162 is dead.<br />
5/29/2008 1:35:20 AM EmcpBase Error None 100 N/A USMV-SYDNEY Path Bus 5 Tgt 0 Lun 2<br />
to APM00042302162 is dead.<br />
5/29/2008 1:35:20 AM EmcpBase Error None 100 N/A USMV-SYDNEY Path Bus 3 Tgt 3 Lun 2<br />
to APM00042302162 is dead.<br />
5/29/2008 1:35:20 AM EmcpBase Error None 100 N/A USMV-SYDNEY Path Bus 3 Tgt 0 Lun 2<br />
to APM00042302162 is dead.<br />
5/29/2008 1:35:20 AM EmcpBase Error None 104 N/A USMV-SYDNEY All paths to<br />
6006016011321100158233EDE0B23DB11 are dead.<br />
5/29/2008 1:35:20 AM Srv Error None 2000 N/A USMV-SYDNEY The server's call to a system<br />
service failed unexpectedly.<br />
5/29/2008 1:36:18 AM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to flush<br />
data to the transaction log. Corruption may occur.<br />
5/29/2008 1:36:18 AM Srv Error None 2000 N/A USMV-SYDNEY The server's call to a system<br />
service failed unexpectedly.<br />
5/29/2008 1:36:18 AM Ntfs Warning None 50 N/A USMV-SYDNEY {Delayed Write Failed} Windows<br />
was unable to save all the data for the file. The data has been lost. This error may be caused by a failure of<br />
your computer hardware or network connection. Please try to save this file elsewhere.<br />
5/29/2008 1:36:18 AM Application Popup Information None 26 N/A USMV-SYDNEY<br />
Application popup: Windows - Delayed Write Failed : Windows was unable to save all the data for the file<br />
S:\$BitMap. The data has been lost. This error may be caused by a failure of your computer hardware or<br />
network connection. Please try to save this file elsewhere.<br />
5/29/2008 1:36:19 AM Service Control Manager Information None 7035 CLUSTERNET\clusadmin<br />
USMV-SYDNEY The Windows Internet Name Service (WINS) service was successfully sent a stop<br />
control.<br />
System Event Log for Usmv-x455 Host (Host on Surviving Site)<br />
5/29/2008 1:35:26 AM ClusSvc Warning Node Mgr 1123 N/A USMV-X455 The node lost<br />
communication with cluster node 'USMV-SYDNEY' on network '<strong>Public</strong>'.<br />
5/29/2008 1:35:26 AM ClusSvc Warning Node Mgr 1123 N/A USMV-X455 The node lost<br />
communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
5/29/2008 1:35:40 AM ClusDisk Error None 1209 N/A USMV-X455 Cluster service is requesting<br />
a bus reset for device \Device\ClusDisk0.<br />
5/29/2008 1:36:06 AM ClusSvc Warning Node Mgr 1135 N/A USMV-X455 Cluster node USMV-<br />
SYDNEY was removed from the active server cluster membership. Cluster service may have been<br />
stopped on the node, the node may have failed, or the node may have lost communication with the other<br />
active server cluster nodes.<br />
5/29/2008 1:36:06 AM ClusSvc Information Failover Mgr 1200 N/A USMV-X455 "The Cluster<br />
Service is attempting to bring online the Resource Group ""Cluster Group""."<br />
5/29/2008 1:36:10 AM ClusSvc Information Failover Mgr 1201 N/A USMV-X455 "The Cluster<br />
Service brought the Resource Group ""Cluster Group"" online."<br />
5/29/2008 1:36:36 AM Service Control Manager Information None 7035<br />
CLUSTERNET\clusadmin USMV-X455 The Windows Internet Name Service (WINS) service was<br />
successfully sent a start control.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images:<br />
Cluster Log for USMV-SYDNEY Host (Host on Failure Site)<br />
00000d68.00000284::2008/05/29-1:35:21.703 ERR Physical Disk : [DiskArb]<br />
CompletionRoutine: reservation lost! Status 21 (Error 21: the device is not ready)<br />
00000d68.00000284::2008/05/29-1:35:22.713 ERR Physical Disk : [DiskArb]<br />
CompletionRoutine: reservation lost! Status 2 (Error 2: the system cannot find the file specified)<br />
00000d68.00000284::2008/05/29-1:35:22.713 ERR [RM] LostQuorumResource, cluster service<br />
terminated...<br />
00000d68.00000e68::2008/05/29-1:35:23.133 ERR Physical Disk : LooksAlive, error checking<br />
device, error 2.<br />
00000d68.00000e68::2008/05/29-1:35:23.133 ERR Physical Disk : IsAlive, error checking<br />
device, error 2.<br />
00000d68.00000e68::2008/05/29-1:35:23.143 ERR Network Name : Name query request<br />
failed, status 3221225860.<br />
00000d68.00000e68::2008/05/29-1:35:23.143 INFO Network Name : Name SYDNEY-<br />
AUCKLAND failed IsAlive/LooksAlive check, error 22. (Error 22: the device does not recognize the<br />
command)<br />
00000d68.00000cd0::2008/05/29-1:35:23.303 ERR Network Name : Unable to<br />
open handle to cluster, status 1753.<br />
00000d68.00000cd0::2008/05/29-21:33:19.245 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 1117. (Error 1117: the request could not be performed because of an I/O device error)<br />
00000d68.00000cd0::2008/05/29-21:33:19.245 ERR Physical Disk : [DiskArb] Error cleaning<br />
arbitration sector, error 1117.<br />
Cluster Log for Usmv-x455 Host (Host on Surviving Site)<br />
0000015c.00000234::2008/05/29-1:35:26.121 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 2<br />
0000015c.00000234::2008/05/29-1:35:26.121 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 1<br />
00000688.00000d08::2008/05/29-1:35:40.523 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00000688.00000d08::2008/05/29-1:35:40.653 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Verify that the physical connection between the disconnected splitter or splitters and<br />
the Fibre Channel switch is healthy.<br />
• Verify that any host on which a disconnected splitter resides appears in the name<br />
server of the Fibre Channel switch. If not, the problem could be because of a bad<br />
port on the switch, a bad HBA, or a bad cable.<br />
• Verify that any host on which a disconnected splitter resides is present in the proper<br />
zone and that the current zoning configuration is enabled.<br />
• If a replication volume is not accessible to the splitter at the source site, but appears<br />
as OK in the management console for that splitter, verify that the splitter is not<br />
functioning at the target site (TSP not enabled). During normal replication, the<br />
system prevents target-site splitters from accessing the replication volumes.<br />
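The first two checks above can be partially automated once the switch's name-server listing has been captured to a file: every expected initiator WWPN should appear in it. A rough sketch, where the WWPNs, the captured listing, and the helper are all illustrative:<br />

```python
def missing_from_name_server(ns_listing, expected_wwpns):
    """Return the expected WWPNs that do not appear in a captured
    name-server listing (one string of switch CLI output)."""
    text = ns_listing.lower()
    return [w for w in expected_wwpns if w.lower() not in text]

# Hypothetical captured listing containing a single logged-in port.
listing = """
N  011000;  3; 10:00:00:00:c9:00:00:99; 20:00:00:00:c9:00:00:99
"""
print(missing_from_name_server(listing,
      ["10:00:00:00:c9:00:00:99", "10:00:00:00:c9:00:00:01"]))
```

Any WWPN reported missing points at a bad port, HBA, or cable rather than at zoning.<br />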
RAs Not Accessible to <strong>SafeGuard</strong> 30m Splitter<br />
Problem Description<br />
One or more RAs on a site are not accessible to the splitter through the Fibre Channel.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The system pauses the transfer for the relevant groups. If the connection with only<br />
one of the RAs is lost, the groups can restart the transfer by means of another RA,<br />
beginning with a short initialization.<br />
• The splitter connection to the relevant RAs shows an error.<br />
• The management console displays error indicators similar to those in Figure 6–6.<br />
Figure 6–6. Management Console Display Shows a Splitter Down<br />
• Warnings and informational messages similar to those shown in Figure 6–7 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 6–7. Management Console Messages for Splitter Inaccessible to RA<br />
The following table explains the numbered messages shown in Figure 6–7.<br />
1 (Event ID 4005): The surviving site negotiates the transfer protocol.<br />
2 (Event ID 4008): For each consistency group at the failed site, the transfer is paused<br />
to allow a failover to the surviving site.<br />
3 (Event ID 5002): The splitter for server USMV-SYDNEY is unable to access the RA.<br />
4 (Event ID 4105): The failed site stops accepting writes to the consistency group.<br />
5 (Event ID 4008): For each consistency group at the failed site, the transfer is paused<br />
to allow a failover to the surviving site.<br />
6 (Event ID 5013): Splitter down problem.<br />
7 (Event ID 4087): The synchronization completed message appears after the splitter is<br />
restored and replication completes.<br />
8 (Event ID 5032): The splitter starts splitting to the replication volumes.<br />
9 (Event ID 4001): A group capabilities problem is reported.<br />
10 (Event ID 5032): The splitter is splitting to replication volumes.<br />
13 (Event ID 5049): The splitter is unable to write to the RAs.<br />
14 (Event ID 4086): The original site starts the synchronization.<br />
15 (Event ID 4104): The consistency group starts replicating.<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
• If you review the Windows system event log, you can find messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
previous management console images:<br />
System Event Log for USMV-SYDNEY Host (Host on Failure Site)<br />
5/29/2008 2:25:20 AM ClusSvc Error Physical Disk Resource 1038 N/A USMV-SYDNEY Reservation<br />
of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.<br />
5/29/2008 2:25:20 AM Service Control Manager Error None 7034 N/A USMV-SYDNEY The Cluster<br />
service terminated unexpectedly.<br />
5/29/2008 2:25:50 AM Srv Warning None 2012 N/A USMV-SYDNEY While transmitting or<br />
receiving data, the server encountered a network error. Occasional errors are expected, but large amounts<br />
of these indicate a possible error in your network configuration. The error status code is contained within<br />
the returned data (formatted as Words) and may point you towards the problem.<br />
5/29/2008 2:25:20 AM Ftdisk Warning Disk 57 N/A USMV-SYDNEY<br />
The system failed to flush data to the transaction log. Corruption may occur.<br />
5/29/2008 2:25:21 AM Ntfs Warning None 50 N/A USMV-SYDNEY {Delayed Write Failed}<br />
Windows was unable to save all the data for the file. The data has been lost. This error may be caused by<br />
a failure of your computer hardware or network connection. Please try to save this file elsewhere.<br />
5/29/2008 2:25:32 AM Ftdisk Warning Disk 57 N/A USMV-SYDNEY<br />
The system failed to flush data to the transaction log. Corruption may occur.<br />
5/29/2008 2:25:32 AM Srv Error None 2000 N/A USMV-SYDNEY<br />
The server's call to a system service failed unexpectedly.<br />
5/29/2008 2:25:32 AM ClusSvc Error IP Address Resource 1077 N/A USMV-SYDNEY<br />
The TCP/IP interface for Cluster IP Address '' has failed.<br />
5/29/2008 2:25:32 AM ClusSvc Error Physical Disk Resource 1036 N/A USMV-SYDNEY<br />
Cluster disk resource '' did not respond to a SCSI maintenance command.<br />
5/29/2008 2:25:32 AM ClusSvc Error Network Name Resource 1215 N/A USMV-SYDNEY Cluster<br />
Network Name SYDNEY-AUCKLAND is no longer registered with its hosting system. The associated<br />
resource name is ''.<br />
System Event Log for Usmv-x455 Host (Host on Surviving Site)<br />
5/29/2008 2:25:23 AM ClusSvc Warning Node Mgr 1123 N/A USMV-X455<br />
The node lost communication with cluster node 'USMV-SYDNEY' on network '<strong>Public</strong>'.<br />
5/29/2008 2:25:23 AM ClusSvc Warning Node Mgr 1123 N/A USMV-X455<br />
The node lost communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
5/29/2008 2:25:37 AM ClusDisk Error None 1209 N/A USMV-X455<br />
Cluster service is requesting a bus reset for device \Device\ClusDisk0.<br />
5/29/2008 2:25:53 AM ClusSvc Warning Node Mgr 1135 N/A USMV-X455 Cluster node USMV-<br />
SYDNEY was removed from the active server cluster membership. Cluster service may have been<br />
stopped on the node, the node may have failed, or the node may have lost communication with the other<br />
active server cluster nodes.<br />
5/29/2008 2:25:53 AM ClusSvc Information Failover Mgr 1200 N/A USMV-X455<br />
"The Cluster Service is attempting to bring online the Resource Group ""Cluster Group""."<br />
5/29/2008 2:25:58 AM ClusSvc Information Failover Mgr 1201 N/A USMV-X455<br />
"The Cluster Service brought the Resource Group ""Cluster Group"" online."<br />
5/28/2008 2:25:35 AM Service Control Manager Information None 7035<br />
CLUSTERNET\clusadmin USMV-X455<br />
The Windows Internet Name Service (WINS) service was successfully sent a start control.<br />
5/29/2008 2:25:37 AM Service Control Manager Information None 7035 NT<br />
AUTHORITY\SYSTEM USMV-X455<br />
The Windows Internet Name Service (WINS) service was successfully sent a continue control.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images:<br />
Cluster Log for USMV-SYDNEY Host (Host on Failure Site)<br />
00000f70.00000d10::2008/05/29-2:25:20.426 ERR Physical Disk : [DiskArb]<br />
CompletionRoutine: reservation lost! Status 31. (Error 31: a device attached to the system is not<br />
functioning)<br />
00000f70.00000d10::2008/05/29-2:25:20.426 ERR [RM] LostQuorumResource, cluster service<br />
terminated...<br />
00000f70.00000e78::2008/05/29-2:25:32.778 ERR Physical Disk : IsAlive, error checking device,<br />
error 995. (Error 995: The I/O operation has been aborted because of either a thread exit or an application<br />
request)<br />
00000f70.00000e78::2008/05/29-2:25:32.778 ERR Physical Disk : LooksAlive, error checking<br />
device, error 31.<br />
00000f70.00000e78::2008/05/29-2:25:32.778 ERR Physical Disk : IsAlive, error checking<br />
device, error 31.<br />
00000f70.00000e78::2008/05/29-2:25:32.778 ERR Network Name : Name query request<br />
failed, status 3221225860.<br />
00000f70.00000b54::2008/05/29-2:25:32.868 ERR Network Name : Unable to open<br />
handle to cluster, status 1753. (Error 1753: there are no more endpoints available from the endpoint<br />
mapper)<br />
00000f70.00000b54::2008/05/29-2:25:33.258 ERR Physical Disk : Terminate, error opening<br />
\Device\Harddisk10\Partition1, error C0000022.<br />
00000f70.00000b54::2008/05/29-2:25:33.528 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170. (Error 170: the requested resource is in use)<br />
00000f70.00000b54::2008/05/29-2:25:33.528 ERR Physical Disk : [DiskArb] Error cleaning<br />
arbitration sector, error 170.<br />
Cluster Log for Usmv-x455 Host (Host on Surviving Site)<br />
0000015c.00000234::2008/05/29-2:25:23.496 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 2<br />
0000015c.00000234::2008/05/29-2:25:23.496 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 1<br />
00000688.00000d08::2008/05/29-2:25:37.898 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00000688.00000d08::2008/05/29-2:25:37.898 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Identify which of the components is the problematic one. A problematic component<br />
is likely to have additional errors or problems:<br />
− A problematic RA might not be accessible to other splitters or might not<br />
recognize certain volumes.<br />
− A problematic splitter might not recognize any RAs or the storage subsystem.<br />
• Connect to the storage switch to verify the status of each connection. Ensure that<br />
each connection is configured correctly.<br />
• If you cannot find any additional problems, the problem is most likely with the<br />
zoning; that is, the splitters are not exposed to the RAs.<br />
• Verify the physical connectivity of the RAs and the servers (those on which the<br />
potentially problematic splitters reside) to the Fibre Channel switch. For each<br />
connection, verify that it is healthy and appears correctly in the name server, zoning,<br />
and so forth.<br />
• Verify that this is not a temporary situation; for instance, if the RAs were rebooting<br />
or recovering from another failure, the splitter might not yet identify them.<br />
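When working through the actions above, it can help to scan the collected cluster logs for the error signatures shown earlier in this section. The following is a rough illustrative sketch (not part of the SafeGuard product); the signature strings are taken from the log excerpts above, and the diagnoses are paraphrases of their documented meanings.

```python
# Hypothetical helper: scan a Windows cluster log for the SAN-failure
# signatures quoted in this section. Error 170 is "the requested resource
# is in use", error 31 is "a device attached to the system is not
# functioning", and status 1753 is "no more endpoints available from the
# endpoint mapper", per the log excerpts above.
SIGNATURES = {
    "reservation lost": "disk or quorum reservation lost (possible SAN failure)",
    "LostQuorumResource": "cluster service terminated after losing the quorum",
    "error 170": "disk arbitration failed; the requested resource is in use",
    "error 31": "a device attached to the system is not functioning",
    "status 1753": "no more endpoints available from the endpoint mapper",
}

def scan_cluster_log(text):
    """Return (line_number, diagnosis) pairs for lines matching a signature."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for marker, diagnosis in SIGNATURES.items():
            if marker in line:
                hits.append((lineno, diagnosis))
    return hits
```

Feeding the scanner the text of a collected cluster log quickly shows which hosts observed reservation or arbitration failures and at what point in the log.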
6–16 6872 5688–002
Total SAN Switch Failure on One Site in a<br />
Geographic Clustered Environment<br />
A total SAN switch failure implies that cluster nodes and RAs have lost access to the<br />
storage device that was connected to the SAN on one site. This failure causes the<br />
cluster nodes to lose their reservation of the physical disks and triggers an MSCS<br />
failover to the remote site. In a geographic clustered environment where MSCS is<br />
running, if the connection to a storage device on one site fails, the symptoms and<br />
resulting actions depend on whether or not the quorum owner resided on the failed<br />
storage device.<br />
To understand the two scenarios and to follow the actions for both possibilities, review<br />
Figure 6–8.<br />
Figure 6–8. SAN Switch Failure on One Site<br />
Solving SAN Connectivity Problems<br />
Cluster Quorum Owner Located on Site with Failed SAN Switch<br />
Problem Description<br />
Symptoms<br />
The following point explains the expected behavior of the MSCS Reservation Manager<br />
when an event of this nature occurs:<br />
• If the cluster quorum owner is located on the site with the failed SAN, the quorum<br />
reservation is lost. This loss causes the cluster nodes to fail and triggers a cluster<br />
“regroup” process. This regroup process allows other cluster nodes participating in<br />
the cluster to arbitrate for the quorum device.<br />
Cluster nodes located on the failed SAN fail quorum arbitration because the failed<br />
SAN is not able to provide a reservation on the quorum volume. The cluster nodes in<br />
the remote location attempt to reserve the quorum device and succeed in arbitrating for<br />
the quorum. The node that owns the quorum device assumes ownership of the<br />
cluster. The cluster owner brings online the data groups that were owned by the<br />
failed site.<br />
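The regroup behavior described above can be modeled in a few lines. This is an illustrative sketch only, not product or MSCS code; the node names and the `san_ok` predicate are hypothetical stand-ins for real quorum-disk reservation attempts.

```python
# Toy model of the MSCS regroup described above: each node in turn tries to
# reserve the quorum volume; nodes on the failed SAN cannot, so a node on
# the surviving site wins arbitration and assumes ownership of the cluster.

def arbitrate(preferred_order, san_ok):
    """preferred_order: node names in arbitration order.
    san_ok(node) -> True if that node can still reach the shared storage.
    Returns the winning node, or None if no node can reserve the quorum."""
    for node in preferred_order:
        if san_ok(node):  # reservation of the quorum volume succeeds
            return node
    return None  # total loss: the cluster service terminates everywhere
```

For example, with the SAN failed at the site of the first node, arbitration falls through to the surviving site's node, which then brings the failed site's data groups online.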
The following symptoms might help you identify this failure:<br />
• All resources fail over to the surviving site (site 2 in this case) and come online<br />
successfully. Cluster nodes fail at the source site. If the consistency groups are<br />
configured asynchronously, this failover results in loss of data. The failover is fully<br />
automated and does not require additional downtime. The RAs cannot replicate<br />
data until the SAN is operational.<br />
• Failures are reported on the server and the management console. Replication is<br />
stopped on all consistency groups.<br />
• The management console displays error indications similar to those in Figure 6–9.<br />
Figure 6–9. Management Console Display with Errors for Failed SAN Switch<br />
• Warnings and informational messages similar to those shown in Figure 6–10 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 6–10. Management Console Messages for Failed SAN Switch<br />
The following table explains the numbered messages shown in Figure 6–10.<br />
Reference No. | Event ID | Description | E-mail<br />
1 | 3012 | The RA is unable to access the volume. | Immediate<br />
2 | 5002 | The RA is unable to access the splitter. | Immediate<br />
3 | 4001 | The surviving site reports a Group Capabilities problem. | Immediate<br />
4 | 4008 | The surviving site pauses the data transfer. | Immediate<br />
5 | 5013 | The original site reports the splitter down status. | Immediate<br />
6 | 4003 | For each consistency group, the surviving site reports a group consistency problem. The details show a WAN problem. | Immediate<br />
7 | 3014 | The RA is unable to access the repository volume. | Immediate<br />
8 | 4044 | The group is deactivated indefinitely by the system. | Immediate<br />
9 | 4007 | The system is pausing data transfer on the surviving site (Quorum - South). | Immediate<br />
10 | 4086 | Synchronization started. | Daily Summary<br />
11 | 4000 | Group capabilities OK. | Daily Summary<br />
12 | 5032 | The splitter starts splitting. | Daily Summary<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
• If you review the Windows system event log, you can find messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
previous management console images:<br />
System Event Log for USMV-SYDNEY Host (Host on Failure Site)<br />
5/29/2008 05:15:31 PM Application Popup Information None 26 N/A USMV-SYDNEY<br />
Application popup: Windows - Delayed Write Failed : Windows was unable to save all the data for the file<br />
Q:\. The data has been lost. This error may be caused by a failure of your computer hardware or network<br />
connection. Please try to save this file elsewhere.<br />
5/29/2008 05:15:31 PM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to<br />
flush data to the transaction log. Corruption may occur.<br />
5/29/2008 05:15:31 PM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to<br />
flush data to the transaction log. Corruption may occur.<br />
System Event Log for USMV-AUCKLAND Host (Host on Surviving Site)<br />
5/29/2008 05:13:33 PM ClusSvc Error Physical Disk Resource 1038 N/A USMV-SYDNEY<br />
Reservation of cluster disk 'Disk Q:' has been lost. Please check your system and disk configuration.<br />
5/29/2008 05:13:33 PM Service Control Manager Error None 7031 N/A USMV-SYDNEY<br />
The Cluster Service terminated unexpectedly. It has done this 2 time(s). The following corrective action<br />
will be taken in 120000 milliseconds: Restart the service.<br />
5/29/2008 05:15:31 PM Application Popup Information None 26 N/A USMV-SYDNEY<br />
Application popup: Windows - Delayed Write Failed : Windows was unable to save all the data for the file<br />
Q:\$Mft. The data has been lost. This error may be caused by a failure of your computer hardware or<br />
network connection. Please try to save this file elsewhere.<br />
5/29/2008 05:15:31 PM Ftdisk Warning Disk 57 N/A USMV-SYDNEY The system failed to<br />
flush data to the transaction log. Corruption may occur.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images:<br />
Cluster Log for USMV-SYDNEY Host (Host on Failure Site)<br />
00001130.00001354::2008/5/29-17:14:33.712 ERR Physical Disk : [DiskArb]<br />
CompletionRoutine: reservation lost! Status 170 (Error 170: the requested resource is in use)<br />
00001130.00001354::2008/5/29-17:14:33.712 ERR [RM] LostQuorumResource, cluster service<br />
terminated...<br />
00001130.00001744::2008/5/29-17:15:31.733 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
00001130.00001744::2008/5/29-17:15:31.733 ERR Physical Disk : [DiskArb] Error cleaning<br />
arbitration sector, error 170.<br />
00001130.00001744::2008/5/29-17:15:31.733 ERR Network Name : Unable to open<br />
handle to cluster, status 1753. (Error 1753: there are no more endpoints available from the endpoint<br />
mapper)<br />
00001130.00000d3c::2008/5/29-17:15:31.733 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6. (Error 6: the handle is invalid)<br />
6872 5688–002 6–21
Solving SAN Connectivity Problems<br />
Cluster Log for USMV-AUCKLAND Host (Host on Surviving Site)<br />
00000668.00000d90::2008/5/29-17:14:35.120 INFO [ClMsg] Received interface unreachable event for<br />
node 2 network 1<br />
00000668.00000d90::2008/5/29-17:14:35.120 INFO [ClMsg] Received interface unreachable event for<br />
node 2 network 2<br />
00000bb8.00000d0c::2008/5/29-17:14:49.706 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00000bb8.00000d0c::2008/5/29-17:14:49.706 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
Actions to Resolve the Problem<br />
To resolve this situation, diagnose the SAN switch failure.<br />
Cluster Quorum Owner Not on Site with Failed SAN Switch<br />
Problem Description<br />
Symptoms<br />
The following points explain the expected behavior of the MSCS Reservation Manager<br />
when an event of this nature occurs:<br />
• If a SAN failure occurs and the cluster nodes do not own the quorum resource, the<br />
state of the cluster services on these nodes is not affected.<br />
• The cluster nodes remain as active cluster members; however, the data groups<br />
containing the <strong>SafeGuard</strong> 30m Control instance and the physical disk resources on<br />
these nodes are marked as failed, and any applications dependent on them are taken<br />
offline. These resources first try to restart, and then eventually fail over to the<br />
surviving site.<br />
The following symptoms might help you identify this failure:<br />
• Applications fail and attempt to restart.<br />
• The data groups containing the <strong>SafeGuard</strong> 30m Control instance and the physical<br />
disk resources on these nodes are marked as failed, and any applications dependent<br />
on them are taken offline. These resources first try to restart, and then eventually fail<br />
over to the surviving site. The cluster nodes remain as active cluster members.<br />
• The management console displays error indications similar to those in Figure 6–9.<br />
• Warnings and informational messages similar to those shown in Figure 6–11 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 6–11. Management Console Messages for Failed SAN Switch with Quorum<br />
Owner on Surviving Site<br />
The following table explains the numbered messages shown in Figure 6–11.<br />
Reference No. | Event ID | Description | E-mail<br />
1 | 5002 | The RA is unable to access the splitter. | Immediate<br />
2 | 3012 | The RA is unable to access the volume (RA 2, Quorum). | Immediate<br />
3 | 4003 | For each consistency group, the surviving site reports a group consistency problem. The details show a WAN problem. | Immediate<br />
4 | 3014 | The RA is unable to access the repository volume (RA 2). | Immediate<br />
5 | 4009 | The system is pausing data transfer on the failure site. | Immediate<br />
6 | 4044 | The group is deactivated indefinitely by the system. | Immediate<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
• If you review the Windows system event log, you can find messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
previous management console images:<br />
System Event Log for USMV-SYDNEY Host (Host on Failure Site)<br />
5/29/2008 5:14:24 PM ClusDisk Error None 1209 N/A USMV-AUCKLAND Cluster service is requesting<br />
a bus reset for device \Device\ClusDisk0.<br />
5/29/2008 5:14:38 PM ClusSvc Warning Node Mgr 1123 N/A USMV-AUCKLAND The node lost<br />
communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
5/29/2008 5:14:39 PM ClusSvc Information Node Mgr 1122 N/A USMV-AUCKLAND The node<br />
(re)established communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
System Event Log for USMV-AUCKLAND Host (Host on Surviving Site)<br />
5/29/2008 5:14:38 PM ClusSvc Warning Node Mgr 1123 N/A USMV-AUCKLAND The node lost<br />
communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
5/29/2008 5:14:39 PM ClusSvc Information Node Mgr 1122 N/A USMV-AUCKLAND The node<br />
(re)established communication with cluster node 'USMV-SYDNEY' on network 'Private LAN'.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images:<br />
Cluster Log for USMV-SYDNEY Host (Host on Failure Site)<br />
00001524.00000360::2008/5/29-17:14:56.750 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00001524.00000360::2008/5/29-17:14:56.750 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
00001524.000017e4::2008/5/29-17:15:22.899 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6.<br />
Cluster Log for USMV-AUCKLAND Host (Host on Surviving Site)<br />
00000bb8.00000c5c::2008/5/29-17:14:14.596 ERR IP Address : WorkerThread:<br />
GetClusterNotify failed with status 6.<br />
00000bb8.0000026c::2008/5/29-17:14:24.188 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170.<br />
00000bb8.0000026c::2008/5/29-17:14:24.188 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
Actions to Resolve the Problem<br />
To resolve this situation, diagnose the SAN switch failure.<br />
Section 7<br />
Solving Network Problems<br />
This section lists symptoms that usually indicate networking problems. Table 7–1 lists<br />
symptoms and possible problems indicated by the symptom. The problems and their<br />
solutions are described in this section. The graphics, behaviors, and examples in this<br />
section are similar to what you observe with your system but might differ in some<br />
details.<br />
In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />
for possible problems. Messages similar to the e-mail messages are also displayed on<br />
the management console. If you do not see the messages, they might have already<br />
dropped off the display; review the management console logs for messages that are no<br />
longer displayed.<br />
Table 7–1. Possible Networking Problems with Symptoms<br />
Symptom: The cluster groups with the failed network connection fail over to the next<br />
preferred node. If only one node is configured at the site with the failure, replication<br />
direction changes and applications run on the backup site. If the NIC is teamed, no<br />
failover occurs and no symptoms are obvious. The networks on the Cluster<br />
Administrator screen show an error. Host system and application event log messages<br />
contain error or warning messages.<br />
Possible problem: <strong>Public</strong> NIC failure on a cluster node in a geographic clustered<br />
environment<br />
Symptom: Clients on site 2 are not able to access resources associated with the IP<br />
resource located on site 1. <strong>Public</strong> communication between the two sites fails, only<br />
allowing local cluster public communication between cluster nodes and local clients.<br />
The networks on the Cluster Administrator screen show an error.<br />
Possible problem: <strong>Public</strong> or client WAN failure in a geographic clustered environment<br />
Table 7–1. Possible Networking Problems with Symptoms (cont.)<br />
Symptom: You cannot access the management console or initiate an SSH session<br />
through PuTTY using the management IP address of the remote site.<br />
Possible problem: Management network failure in a geographic clustered environment<br />
Symptom: The management console log indicates that the WAN data links to the RAs<br />
are down. All consistency groups show the transfer status as “Paused by system.”<br />
Possible problem: Replication network failure in a geographic clustered environment<br />
Symptom: On the management console, all consistency groups show the transfer<br />
status switching between “Paused by system” and “initializing/active.” All groups<br />
appear unstable over the WAN connection.<br />
Possible problem: Temporary WAN failures<br />
Symptom: The networks on the Cluster Administrator screen show an error.<br />
Possible problem: Private cluster network failure in a geographic clustered environment<br />
Symptom: You cannot access the management console using the management IP<br />
address of the remote site. The cluster is no longer accessible from nodes except from<br />
one surviving node. You are unable to reach the DNS, NTP, or mail server. The<br />
management console shows errors for the WAN or for RA data links. The management<br />
console logs show RA communication errors.<br />
Possible problem: Total communication failure in a geographic clustered environment;<br />
see also the port information<br />
<strong>Public</strong> NIC Failure on a Cluster Node in a<br />
Geographic Clustered Environment<br />
Problem Description<br />
If a public network interface card (NIC) of a cluster node failed, the cluster node of the<br />
failed public NIC cannot access clients. The cluster node of the failed NIC can participate<br />
in the cluster as a member because it can communicate over the private cluster<br />
network. Other cluster nodes are not affected by this error.<br />
The MSCS software detects a failed network, and the cluster resources fail over to the<br />
next preferred node. All cluster groups used for replication that contain a virtual IP<br />
address for the failed network connection succeed in failing over to the next preferred<br />
node. However, the Unisys <strong>SafeGuard</strong> 30m Control resources cannot fail back to the<br />
node with a failed public network because they cannot communicate with the site<br />
management IP address of the RAs.<br />
Note: A teamed public network interface does not experience this problem and<br />
therefore is the recommended configuration.<br />
Figure 7–1 illustrates this failure.<br />
Figure 7–1. <strong>Public</strong> NIC Failure of a Cluster Node<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• All cluster groups used for replication that contain a virtual IP address for the failed<br />
network connection fail over to the next preferred node.<br />
• If no other node exists at the same site, replication direction changes and the<br />
applications run at the backup site.<br />
• If you review the host system event log, you can find messages similar to the<br />
following examples:<br />
Windows System Event Log Messages on Host Server<br />
Type: error<br />
Source: ClusSvc<br />
EventID: 1077, 1069<br />
Description: The TCP/IP interface for Cluster IP Address “xxx” has failed.<br />
Type: error<br />
Source: ClusSvc<br />
EventID: 1069<br />
Description: Cluster resource ‘xxx’ in Resource Group ‘xxx’ failed.<br />
Type: error<br />
Source: ClusSvc<br />
EventID: 1127<br />
Description: The interface for cluster node ‘xxx’ on network ‘xxx’ failed. If the condition persists, check<br />
the cabling connecting the node to the network. Next, check for hardware or software errors in the<br />
node's network adapter.<br />
• If you attempt to move a cluster group to the node with the failing public NIC, the<br />
event 2002 message is displayed in the host application event log.<br />
Application Event Log Message on Host Server<br />
Type: warning<br />
Source: 30mControl<br />
Event Category: None<br />
EventID: 2002<br />
Date : 05/30/2008<br />
Time: 11:12:02 AM<br />
User : N/A<br />
Computer: USMV-DL580<br />
Description: Online resource failed. RA CLI command failed because of a network communication error or<br />
invalid IP address.<br />
Action: Verify the network connection between the system and the site management IP Address<br />
specified for the resource. Ping each site management IP Address specified for the specified resource.<br />
Note: The preceding information can also be viewed in the cluster log.<br />
• The management console display and management console logs do not show any<br />
errors.<br />
• When the public NIC fails on a node that does not use teaming, the Cluster<br />
Administrator displays an error indicator similar to Figure 7–2. If the public NIC<br />
interface is teamed, you do not see error messages in the Cluster Administrator.<br />
Figure 7–2. <strong>Public</strong> NIC Error Shown in the Cluster Administrator<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. In the Cluster Administrator, verify that the public interface for all nodes is in an<br />
“Up” state. If multiple nodes at a site show public connections failed in the Cluster<br />
Administrator, physically check the network switch for connection errors.<br />
If the private network also shows errors, physically check the network switch for<br />
connection errors.<br />
2. Inspect the NIC link indicators on the host and, from a client, use the Ping command<br />
to verify the physical IP address of the adapter (not the virtual IP address).<br />
3. Isolate a NIC or cabling issue by moving cables at the network switch and at the NIC.<br />
4. Replace the NIC in the host if necessary. No configuration of the replaced NIC is<br />
necessary.<br />
5. Move the cluster resources back to the original node after the resolution of the<br />
failure.<br />
<strong>Public</strong> or Client WAN Failure in a Geographic<br />
Clustered Environment<br />
Problem Description<br />
When the public or client WAN fails, some clients cannot access virtual IP networks that<br />
are associated with the cluster. The WAN components involved in this failure might<br />
be two switches, possibly on different subnets connected through gateways. This failure<br />
results from connectivity issues. If the failure had resulted from an adapter failure or a<br />
media failure at the adapter, the MSCS cluster would detect it and fail the associated<br />
node. Instead, cluster groups do not fail, and the public LAN shows as unreachable for<br />
this failure.<br />
<strong>Public</strong> communication between the two sites failed, only allowing local cluster public<br />
communication between cluster nodes and local clients. The cluster node state does not<br />
change on either site because all cluster nodes are able to communicate with the private<br />
cluster network.<br />
All resources remain online and no cluster group errors are reported in the Cluster<br />
Administrator. Clients on the remote site cannot access resources associated with the IP<br />
resource located on the local site until the public or client network is again operational.<br />
Depending on the cause of the failure and the network configuration, the <strong>SafeGuard</strong> 30m<br />
Control might fail to move a cluster group because the management network might be<br />
the same physical network as the public network. Whether this failure to move the<br />
group occurs or not depends on how the RAs are physically wired to the network.<br />
Figure 7–3 illustrates this scenario.<br />
Figure 7–3. <strong>Public</strong> or Client WAN Failure<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• Clients on site 2 are not able to access resources associated with the IP resource<br />
located on site 1.<br />
• <strong>Public</strong> communication between the two sites displays as “unreachable,” allowing only<br />
local cluster public communication between cluster nodes and local clients.<br />
• When the public cluster network fails, the Cluster Administrator displays an error<br />
indicator similar to Figure 7–4.<br />
All private network connections show as “unreachable” when the problem is a WAN<br />
issue.<br />
If only two of the connections show as failed (and the nodes are physically located at<br />
the same site), the issue is probably local to the site.<br />
If only one connection failed, the issue is probably a host network adapter.<br />
Figure 7–4. Cluster Administrator Showing <strong>Public</strong> LAN Network Error<br />
• If you review the system event log, messages similar to the following examples are<br />
displayed:<br />
Event Type: Warning<br />
Event Source: ClusSvc<br />
Event Category: Node Mgr<br />
Event ID: 1123<br />
Date : 05/30/2008<br />
Time: 9:49:34 AM<br />
User : N/A<br />
Computer: USMV-WEST2<br />
Description:<br />
The node lost communication with cluster node 'USMV-EAST2' on network '<strong>Public</strong> LAN'.<br />
Event Type: Warning<br />
Event Source: ClusSvc<br />
Event Category: Node Mgr<br />
Event ID: 1126<br />
Date : 05/30/2008<br />
Time: 9:49:36 AM<br />
User : N/A<br />
Computer: USMV-WEST2<br />
Description:<br />
The interface for cluster node 'USMV-WEST2' on network '<strong>Public</strong> LAN' is unreachable by at least one<br />
other cluster node attached to the network. The server cluster was not able to determine the location of<br />
the failure. Look for additional entries in the system event log indicating which other nodes have lost<br />
communication with node USMV-WEST2. If the condition persists, check the cable connecting the node<br />
to the network. Next, check for hardware or software errors in the node's network adapter. Finally,<br />
check for failures in any other network components to which the node is connected such as hubs,<br />
switches, or bridges.<br />
Event Type: Warning<br />
Event Source: ClusSvc<br />
Event Category: Node Mgr<br />
Event ID: 1130<br />
Date : 05/30/2008<br />
Time: 9:49:36 AM<br />
User : N/A<br />
Computer: USMV-WEST2<br />
Description:<br />
Cluster network '<strong>Public</strong> LAN' is down. None of the available nodes can communicate using this<br />
network. If the condition persists, check for failures in any network components to which the nodes are<br />
connected such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the<br />
network. Finally, check for hardware or software errors in the adapters that attach the nodes to the<br />
network.<br />
• A cluster group containing a <strong>SafeGuard</strong> 30m Control resource might fail to move to<br />
another node when the management network has network components common to<br />
the public network. (Refer to “Management Network Failure in a Geographic<br />
Clustered Environment.”)<br />
• Symptoms might include those in “Management Network Failure in a Geographic<br />
Clustered Environment” when these networks are physically the same network.<br />
Refer to this topic if the clients at one site are not able to access the IP resources at<br />
another site.<br />
• The management console logs might display the messages in the following table<br />
when this connection fails and is then restored.<br />
Event ID | Description | E-mail<br />
3023 | For each RA at the site, this console log message is displayed: Error in LAN link to RA. (RA) | Immediate<br />
3022 | When the LAN link is restored, a management console log message is displayed: LAN link to RA restored. (RA) | Daily Summary<br />
Actions to Resolve the Problem<br />
Note: Typically, a network administrator for the site is required to diagnose which<br />
network switch, gateway, or connection is the cause of this failure.<br />
Perform the following actions to isolate and resolve the problem:<br />
1. In the Cluster Administrator, view the network properties of the public and private<br />
network.<br />
The private network should be operational with no failure indications.<br />
The public network should display errors. Refer to the previous symptoms to identify<br />
that this is a WAN issue. If the error is limited to one host, the problem might be a<br />
host network adapter. See “<strong>Public</strong> NIC Failure on a Cluster Node in a Geographic<br />
Clustered Environment.”<br />
2. Check for network problems using a method such as isolating the failure to the<br />
network switch or gateway by pinging from the cluster node to the gateway at each<br />
site.<br />
3. Use the Installation Manager site connectivity IP diagnostic from the RAs to the<br />
gateway at each site by performing the following steps. (For more information, see<br />
Appendix C.)<br />
a. Log on to an RA with user ID as boxmgmt and password as boxmgmt.<br />
b. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.<br />
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.<br />
e. When asked to select a target for the tests, type 5 (Other host) and press<br />
Enter.<br />
f. Enter the IP address for the gateway that you want to test.<br />
g. Repeat steps a through f for each RA.<br />
4. Isolate the site by determining which gateway or network switch failed. Use<br />
standard network methods such as pinging to make the determination.<br />
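Steps 2 and 4 above reduce to pinging the gateway at each site and noting which side fails to answer. The sketch below is a hypothetical illustration of that isolation logic; the site names and gateway addresses are placeholders, not values from the product.

```python
import subprocess

def ping_once(ip):
    """One ICMP echo; Linux-style flags (use `ping -n 1 -w 2000` on Windows)."""
    return subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

def isolate_wan_failure(gateways, probe=ping_once):
    """gateways: {site_name: gateway_ip}. Returns the sites whose gateway
    did not respond, pointing at the failed switch or gateway."""
    return [site for site, ip in gateways.items() if not probe(ip)]
```

Running this from a cluster node at each site (and from the RAs via the Installation Manager diagnostics described above) narrows the failure to one site's gateway or network switch.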
Management Neetwork<br />
Failure in a Geograp phic<br />
Clustered Enviroonment<br />
Problem Description<br />
When the management network fails in a geographic clustered environment, you cannot access the management console for the affected site. The replication environment is not affected. If you try to move a cluster group to the site with the failed management network, the move fails.<br />
Figure 7–5 illustrates this scenario.<br />
Figure 7–5. Management Network Failure<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The indicators for the onboard management network adapter of the RA are not illuminated.<br />
• Network switch port lights show that no link exists with the host adapter.<br />
• You cannot access the management console or initiate an SSH session through<br />
PuTTY using the management IP address of the failed site from the remote site. You can<br />
access the management console from a client local to the site. If you cannot access<br />
the management IP address from either site, see Section 8, “Solving Replication<br />
Appliance (RA) Problems.”<br />
• A cluster move operation to the site with the failed management network might fail.<br />
The event ID 2002 message is displayed in the host application event log.<br />
Application Event Log Message on Host Server<br />
Type : warning<br />
Source : 30mControl<br />
Event Category: None<br />
EventID : 2002<br />
Date : 05/30/2008<br />
Time : 2:46:29 PM<br />
User : N/A<br />
Computer : USMV-SYDNEY<br />
Description : Online resource failed. RA CLI command failed because of a network communication<br />
error or invalid IP address.<br />
Action : Verify the network connection between the system and the site management IP Address<br />
specified for the resource. Ping each site management IP Address mentioned for the specified resource.<br />
Note: The preceding information can also be viewed in the cluster log.<br />
• If the management console was open with the IP address of the failed site, the<br />
message “Connection with RA was lost, please check RA and network settings” is<br />
displayed. The management console display shows “not connected,” and the<br />
components have a question mark “Unknown” status as illustrated in Figure 7–6.<br />
Figure 7–6. Management Console Display: “Not Connected”<br />
• The management console log displays a message for event 3023 as shown in<br />
Figure 7–7.<br />
Figure 7–7. Management Console Message for Event 3023<br />
• The management console log messages might appear as in the following table.<br />
Event ID   Description                                            E-mail Immediate   E-mail Daily Summary<br />
3023       For each RA at the site, this console log message      X<br />
           is displayed: Error in LAN link to RA. (RA )<br />
3022       When the LAN link is restored, a management console    X<br />
           log displays: LAN link to RA restored. (RA )<br />
Actions to Resolve the Problem<br />
Note: Typically, a network administrator for the site is required to diagnose which<br />
network switch, gateway, or connection is the cause of this failure.<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Ping from the cluster node to the RA box management IP address at the same site.<br />
Repeat this action for the other site. If the local connections are working at both<br />
sites, the problem is with the WAN connection such as a network switch or gateway<br />
connection.<br />
2. If one site from step 1 fails, ping from the cluster node to the gateway of that site. If<br />
the ping completes, then proceed to step 3.<br />
3. Use the Installation Manager site connectivity IP diagnostic from the RAs to the<br />
gateway at each site by performing the following steps. (For more information, see<br />
Appendix C.)<br />
a. Log in to an RA as user boxmgmt with the password boxmgmt.<br />
b. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.<br />
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.<br />
e. When asked to select a target for the tests, type 5 (Other host) and press<br />
Enter.<br />
f. Enter the IP address for the gateway that you want to test.<br />
g. Repeat steps a through f for each RA.<br />
4. Isolate the site by determining which gateway failed. Use standard network methods<br />
such as pinging to make the determination.<br />
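The decision logic in steps 1 and 2 can be sketched as a small function. The inputs are assumed to be the boolean results of the pings described above (cluster node to RA box management IP, and cluster node to site gateway); the diagnosis strings are illustrative labels, not product terminology.

```python
# Decision logic from steps 1 and 2, as a sketch. Inputs are dicts mapping
# site name to True (ping succeeded) or False (ping failed).
def classify_mgmt_failure(ra_ping_ok_by_site, gw_ping_ok_by_site):
    """Return a rough diagnosis following the procedure above."""
    failed = [s for s, ok in ra_ping_ok_by_site.items() if not ok]
    if not failed:
        # Local connections work at both sites: suspect the WAN link,
        # such as a network switch or gateway connection (step 1).
        return "WAN connection"
    diagnoses = {}
    for site in failed:
        if gw_ping_ok_by_site.get(site):
            # Gateway answers but the RA management IP does not: the problem
            # lies between the gateway and the management network (step 2).
            diagnoses[site] = "management network at site"
        else:
            diagnoses[site] = "site gateway or switch"
    return diagnoses
```

For example, if the RA management IP at one site fails to answer while that site's gateway still responds, the function points at the management network at that site, which is where step 3's RA-side diagnostics pick up.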
Replication Network Failure in a Geographic Clustered Environment<br />
Problem Description<br />
This type of event occurs when the RA cannot replicate data to the remote site because of a replication network (WAN) failure. Because this error is transparent to MSCS and the cluster nodes, cluster resources and nodes are not affected. Each cluster node continues to run, and data transactions sent to their local cluster disk are completed.<br />
Figure 7–8 illustrates this failure.<br />
Figure 7–8. Replication Network Failure<br />
The RA cannot replicate data while the WAN is down. During this failure, the RA keeps a record of data written to local storage. Once the WAN is restored, the RA updates the replication volumes on the remote site.<br />
During the replication network failure, the RAs prevent the quorum and data resources from failing over to the remote site. This behavior differs from a total communication failure or a total site failure, in which the data groups are allowed to fail over. The quorum group is never allowed to fail over automatically when the RAs cannot communicate over the WAN.<br />
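The journaling behavior described above can be modeled in a few lines. This is an illustration of the concept only, not the RA's actual replication mechanism.

```python
# Simplified model of the behavior described above: while the WAN is down the
# RA records local writes, and on restore it replays them to the remote copy.
# Illustrative only; the RA's real mechanism is not shown in this guide.
class ReplicationLink:
    def __init__(self):
        self.wan_up = True
        self.journal = []          # writes recorded while the WAN is down
        self.remote_volume = {}    # the replica at the remote site

    def write(self, block, data):
        if self.wan_up:
            self.remote_volume[block] = data      # replicated immediately
        else:
            self.journal.append((block, data))    # kept for later resync

    def restore_wan(self):
        self.wan_up = True
        for block, data in self.journal:          # update the remote volumes
            self.remote_volume[block] = data
        self.journal.clear()
```

The longer the outage, the larger the journal, which is why the groups might initiate a long resynchronization once the WAN is restored.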
Symptoms<br />
Notes:<br />
• If the management network has also failed, see “Total Communication Failure in a<br />
Geographic Clustered Environment” later in this section.<br />
• If all RAs at a site have failed, see “Failure of All RAs at One Site” in Section 8.<br />
If the administrator issues a move-group operation from the Cluster Administrator for a<br />
data or quorum group, the cluster accepts failover only to another node within the same<br />
site. Group failover to the remote site is not allowed, and the resource group fails back<br />
to a node on the source site.<br />
Although automatic failover is not allowed, the administrator can perform a manual<br />
failover to the remote site. Performing a manual failover results in a loss of data. The<br />
administrator chooses an available image for the failover.<br />
Important considerations for this type of failure are as follows:<br />
• This type of failure does not have an immediate effect on the cluster service or the<br />
cluster nodes. The quorum group cannot fail over to the remote site and goes back<br />
online at the source site.<br />
• Only local failovers are permitted. Remote failovers require that the administrator<br />
perform the manual failover process.<br />
• The <strong>SafeGuard</strong> 30m Control resource and the data consistency groups cannot fail<br />
over to the remote site while the WAN is down; they go back online at the source<br />
site.<br />
• Only one site has up-to-date data. Replication does not occur until the WAN is<br />
restored.<br />
• If the administrator manually chooses to use remote data instead of the source data,<br />
data loss occurs.<br />
• Once the WAN is restored, normal operation continues; however, the groups might<br />
initiate a long resynchronization.<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows errors similar to the image in Figure 7–9.<br />
This image shows the dialog box displayed after clicking the red Errors in the right<br />
column. The More Info message box is displayed with messages similar to those in<br />
the figure but appropriate for your site. If only one RA is down, see Section 8 for<br />
resolution actions. Notice in the figure that all RA data links at the site are down.<br />
Figure 7–9. Management Console Display: WAN Down<br />
This figure also shows the Groups tab and the messages that the data consistency<br />
groups and the quorum group are “Paused by system.” If the groups are not paused<br />
by the system, a switchover might have occurred. See Section 8 for more<br />
information. If all groups are not paused, see Section 5, “Solving Storage Problems.”<br />
• Warnings and informational messages similar to those shown in Figure 7–10 appear<br />
on the management console when the WAN is down. See the table after the figure<br />
for an explanation of the numbered console messages.<br />
Figure 7–10. Management Console Log Messages: WAN Down<br />
The following table explains the numbers in Figure 7–10. You might also see the<br />
events in the table denoted by an asterisk (*) in the management console log.<br />
Reference No./Legend   Event ID   Description                                      E-mail Immediate   E-mail Daily Summary<br />
*                      3001       The RA is currently experiencing a problem       X<br />
                                  communicating with its cluster. The details<br />
                                  explain that a subsequent event 3000 means<br />
                                  that RA functionality has been restored.<br />
*                      3000       The RA is successfully communicating with its    X<br />
                                  cluster. In this case, the RA communicates by<br />
                                  means of the management link.<br />
1                      4001       For each consistency group on the Auckland       X<br />
                                  and the Sydney sites, the transfer is paused.<br />
2                      4008       For each quorum group on the Auckland and        X<br />
                                  the Sydney sites, the transfer is paused.<br />
*                      4043       For each group on the Auckland and Sydney        X<br />
                                  sites, the “group site is deactivated” message<br />
                                  might appear with the detail showing the<br />
                                  reason for the switchover. The RA attempts to<br />
                                  switch over to resolve the problem.<br />
3                      4001       The event is repeated after the switchover       X<br />
                                  attempt.<br />
• If you review the management console RAs tab, the data link column lists errors for<br />
all RAs, as shown in Figure 7–11. The data link is the replication link between peer<br />
RAs. Notice that the WAN link shows OK because the RAs can still communicate<br />
over the management link. There is no column for the management link.<br />
Figure 7–11. Management Console RAs Tab: All RAs Data Link Down<br />
• If you review the host application event log, no messages appear for this failure<br />
unless a data resource move-group operation is attempted. If this move-group<br />
operation is attempted, then messages similar to the following are listed:<br />
Application event log<br />
Event Type : Warning<br />
Event Source : 30mControl<br />
Event Category: None<br />
Event ID : 1119<br />
Date : 5/30/2008<br />
Time : 3:27:49 PM<br />
User : N/A<br />
Computer : USMV-SYDNEY<br />
Description : Online resource failed.<br />
Cannot complete transfer for auto failover (7).<br />
The following could cause this error:<br />
1. Wan is down.<br />
2. Long resynchronization might be in progress.<br />
The resource might have to be brought online manually.<br />
RA Version: 3.0(g.60)<br />
Resource name: Data1<br />
RA CLI command: UCLI -ssh -l plugin -pw **** -superbatch 172.16.25.50 initiate_failover group=Data1<br />
active_site=Sydney cluster_owner=USMV-SYDNEY<br />
• If you review the system event log, a message similar to the following example is<br />
displayed:<br />
System Event Log<br />
Event Type : Error<br />
Event Source : ClusSvc<br />
Event Category: Failover Mgr<br />
Event ID : 1069<br />
Date : 5/30/2008<br />
Time : 3:27:50 PM<br />
User : N/A<br />
Computer : USMV-SYDNEY<br />
Description : Cluster resource 'Data1' in Resource Group 'Group 0' failed.<br />
Note: Data1 would change to the Quorum drive if the quorum was moved.<br />
• If you review the cluster log, you can see an error if a data or a quorum move-group<br />
operation is attempted. Messages similar to the following are listed:<br />
Cluster Log for the Node to which the Move Was Attempted<br />
Key messages<br />
00000d4c.00000910::2008/05/30-15:27:22.077 INFO Physical Disk : [DiskArb]-------<br />
DisksArbitrate -------.<br />
………………..<br />
00000d4c.00000910::2008/05/30-15:27:35.608 ERR Physical Disk : [DiskArb] Failed to write<br />
(sector 12), error 170.<br />
00000d4c.00000910::2008/05/30-15:27:35.608 INFO Physical Disk : [DiskArb] Arbitrate returned<br />
status 170.<br />
Cluster Log for the Node to which the Data Group Move Was Attempted<br />
00000e60.00000940::2008/05/30-15:53:38.470 INFO Unisys <strong>SafeGuard</strong> 30m Control :<br />
KfResourceTerminate: Resource 'Data1' terminated. AbortOnline=1 CancelConnect=0<br />
terminateProcess=0.<br />
0000099c.00000dd4::2008/05/30-15:53:38.470 INFO [CP] CppResourceNotify for resource Data1<br />
0000099c.00000dd4::2008/05/30-15:53:38.470 INFO [FM] RmTerminateResource: a16fc059-e4d3-4bc8a15a-6440e9b2f976<br />
is now offline<br />
0000099c.00000dd4::2008/05/30-15:53:38.470 WARN [FM] Group failure for group . Create thread to take offline and move<br />
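Spotting this failure in exported logs means scanning for the event IDs shown above (1119 from 30mControl and 1069 from ClusSvc). The sketch below parses the "Field : value" layout of these excerpts; the field spellings are taken from the excerpts and may differ in other export formats.

```python
# A sketch for scanning exported event-log text like the excerpts above for
# the event IDs tied to this failure (1119 from 30mControl, 1069 from ClusSvc).
import re

WAN_FAILURE_EVENTS = {("30mControl", "1119"), ("ClusSvc", "1069")}

def parse_entries(log_text):
    """Yield (source, event_id) pairs from 'Field : value' formatted log text."""
    source = None
    for line in log_text.splitlines():
        m = re.match(r"\s*Event Source\s*:\s*(\S+)", line)
        if m:
            source = m.group(1)
            continue
        m = re.match(r"\s*Event ID\s*:\s*(\d+)", line)
        if m and source:
            yield (source, m.group(1))

def wan_failure_suspected(log_text):
    """True if any entry matches the event IDs associated with this failure."""
    return any(e in WAN_FAILURE_EVENTS for e in parse_entries(log_text))
```

Run against the application and system log exports together, this flags the 1119/1069 pair that appears only when a move-group operation was attempted during the WAN outage.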
Actions to Resolve the Problem<br />
Note: Typically, a network administrator for the site is required to diagnose which<br />
network switch, gateway, or connection is the cause of this failure.<br />
Perform the following actions to isolate and resolve the problem:<br />
1. On the management console, observe that a WAN error occurred for all RAs and that<br />
the data link is in error for all RAs. If that is not the case, see Section 8 for resolution<br />
actions.<br />
2. Use the Installation Manager site connectivity IP diagnostic from the RAs to the<br />
gateway at each site by performing the following steps. (For more information, see<br />
Appendix C.)<br />
a. Log in to an RA as user boxmgmt with the password boxmgmt.<br />
b. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.<br />
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.<br />
e. When asked to select a target for the tests, type 5 (Other host) and press<br />
Enter.<br />
f. Enter the IP address for the gateway that you want to test.<br />
g. Repeat steps a through f for each RA.<br />
3. Isolate the site by determining which network switch or gateway failed. Use<br />
standard network methods such as pinging to make the determination.<br />
4. In some cases, the WAN connection might appear to be down because a firewall is<br />
blocking ports. See “Port Information” later in this section.<br />
5. If all RAs at both sites can connect to the gateway, the problem is related to the link.<br />
In this case, check the connectivity between subnets by pinging between machines<br />
on the same subnet (not RAs) and between a non-RA machine at one site and an RA<br />
at the other site.<br />
6. Verify that no routing problems exist between the sites.<br />
7. Optionally, follow the recovery actions to manually move cluster and data resource<br />
groups to the other site if necessary. This action results in a loss of data. Do not<br />
attempt this manual recovery unless the WAN failure has affected applications.<br />
If you choose to manually move groups, refer to Section 4 for the procedures.<br />
Once you observe on the management console that the WAN error is gone, verify<br />
that the consistency groups are resynchronizing.<br />
If a move-group operation is issued to the other site while the group is<br />
resynchronizing, the command fails with a return code 7 (long resync in progress),<br />
and the group moves back to the original node.<br />
Temporary WAN Failures<br />
Problem Description<br />
All applications are unaffected. The target image is not up-to-date.<br />
Symptoms<br />
On the management console, messages show the transfer between sites switching<br />
between “paused by system” and “initializing/active.” All groups appear<br />
unstable over the WAN connection.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve this problem:<br />
1. If the connection problem is temporary but recurs, check for network problems<br />
such as a high percentage of packet loss caused by bad network connections,<br />
insufficient bandwidth that is causing an overloaded network, and so on.<br />
2. Verify that the bandwidth allocated to this link is reasonable and that no<br />
unreasonable external or internal (consistency group bandwidth policy) limits are<br />
causing an overloaded network.<br />
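For step 1, the loss percentage can be pulled from the summary line of ping's statistics output. The pattern below is an assumption that covers the common Windows and Linux wordings, and the 5 percent threshold is an arbitrary example, not a documented limit.

```python
# Step 1 above suggests checking for a high percentage of packet loss. This
# sketch extracts the loss percentage from a ping statistics summary line;
# the exact wording varies by platform, so the pattern is an assumption.
import re

def packet_loss_percent(ping_output):
    """Return the packet-loss percentage found in ping output, or None."""
    m = re.search(r"(\d+(?:\.\d+)?)%\s*(?:packet\s+)?loss", ping_output, re.IGNORECASE)
    return float(m.group(1)) if m else None

def link_is_problematic(ping_output, threshold=5.0):
    """Flag a lossy link; the default threshold is an illustrative choice."""
    loss = packet_loss_percent(ping_output)
    return loss is not None and loss > threshold
```

This matches both the Linux form ("0% packet loss") and the Windows form ("(0% loss)"), so the same check can run from cluster nodes at either site.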
Private Cluster Network Failure in a Geographic Clustered Environment<br />
Problem Description<br />
When the private cluster network fails, the cluster nodes are able to communicate over the public cluster network if the public cluster network is set for all communication. No cluster resources fail over, and current processing on the cluster nodes continues. Clients do not experience any impact from this failure.<br />
Figure 7–12 illustrates this scenario.<br />
Figure 7–12. Private Cluster Network Failure<br />
Unisys recommends that the public cluster network be set for “All communications” and the private cluster LAN be set for “internal cluster communications only.” You can verify these settings in the “Networks” properties section within Cluster Administrator. See “Checking the Cluster Setup” in Section 4.<br />
If the public cluster network was not set for “All communications” but instead was set for “Client access only,” the following symptoms occur:<br />
• All nodes except the node that owned the quorum stop MSCS. This action is completed to prevent a “split brain” situation.<br />
• All resources move to the surviving node.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• When the private cluster network fails, the Cluster Administrator displays an error<br />
indicator similar to Figure 7–13.<br />
All private network connections show a status of “Unknown” when the problem is a<br />
WAN issue.<br />
If only two of the connections failed (and the nodes are physically located at the<br />
same site), the issue is probably local to the site.<br />
If only one connection failed, the issue is probably a host network adapter.<br />
Figure 7–13. Cluster Administrator Display with Failures<br />
• On the cluster nodes at both sites, the system event log contains entries from the<br />
cluster service similar to the following:<br />
Event Type : Warning<br />
Event Source : ClusSvc<br />
Event Category: Node Mgr<br />
Event ID : 1123<br />
Date : 5/30/2008<br />
Time : 4:03:10 PM<br />
User : N/A<br />
Computer : USMV-SYDNEY<br />
Description:<br />
The node lost communication with cluster node 'USMV-AUCKLAND' on network 'Private'.<br />
Event Type : Warning<br />
Event Source : ClusSvc<br />
Event Category: Node Mgr<br />
Event ID : 1126<br />
Date : 5/30/2008<br />
Time : 4:03:12 PM<br />
User : N/A<br />
Computer : USMV-SYDNEY<br />
Description:<br />
The interface for cluster node 'USMV-AUCKLAND' on network 'Private' is unreachable by at least one<br />
other cluster node attached to the network. The server cluster was not able to determine the location of<br />
the failure. Look for additional entries in the system event log indicating which other nodes have lost<br />
communication with node USMV-AUCKLAND. If the condition persists, check the cable connecting the<br />
node to the network. Then, check for hardware or software errors in the node's network adapter. Finally,<br />
check for failures in any other network components to which the node is connected such as hubs,<br />
switches, or bridges.<br />
Event Type : Warning<br />
Event Source : ClusSvc<br />
Event Category: Node Mgr<br />
Event ID : 1130<br />
Date : 5/30/2008<br />
Time : 4:03:12 PM<br />
User : N/A<br />
Computer : USMV-SYDNEY<br />
Description:<br />
Cluster network 'Private' is down. None of the available nodes can communicate using this network. If<br />
the condition persists, check for failures in any network components to which the nodes are connected<br />
such as hubs, switches, or bridges. Next, check the cables connecting the nodes to the network. Finally,<br />
check for hardware or software errors in the adapters that attach the nodes to the network.<br />
Actions to Resolve the Problem<br />
Note: Typically, a network administrator for the site is required to diagnose which<br />
network switch, gateway, or connection is the cause of this failure.<br />
Perform the following actions to isolate and resolve the problem:<br />
1. In the Cluster Administrator, view the network properties of the public and private<br />
network.<br />
The public network should be operational with no failure indications.<br />
The private network should display errors. Refer to the previous symptoms to<br />
identify that this is a WAN issue. If the error is limited to one host, the problem<br />
might be a host network adapter. See “<strong>Public</strong> NIC Failure on a Cluster Node in a<br />
Geographic Clustered Environment” for action to resolve a host network problem.<br />
2. Check for network problems using methods such as isolating the failure to the<br />
network switch or gateway with the problem.<br />
Total Communication Failure in a Geographic Clustered Environment<br />
Problem Description<br />
A total communication failure implies that the cluster nodes and RAs are no longer able to communicate with each other over the public and private network interfaces.<br />
Figure 7–14 illustrates this failure.<br />
Figure 7–14. Total Communication Failure<br />
When this failure occurs, the cluster nodes on both sites detect that the cluster heartbeat has been broken. After six missed heartbeats, the cluster nodes go into a “regroup” process to determine which node takes ownership of all cluster resources. This process consists of checking network interface states and then arbitrating for the quorum device.<br />
During the network interface detection phase, all nodes perform a network interface check to determine that the node is communicating through at least one network interface dedicated for client access, assuming the network interface is set for “All communications” or “Client access only.” If this process determines that the node is not communicating through any viable network, the cluster node voluntarily stops cluster service and drops out of the quorum arbitration process. The remaining nodes then attempt to arbitrate for the quorum device.<br />
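The regroup sequence just described (interface check, voluntary shutdown, then quorum arbitration) can be sketched as follows. The node dictionary shape and site names are illustrative assumptions, and quorum arbitration is reduced to site ownership as the guide describes.

```python
# Simplified sketch of the regroup decision described above. Each node first
# checks for a viable client-facing interface; nodes with none stop cluster
# service, and among the rest, quorum arbitration succeeds only on the site
# that originally owned the quorum consistency group.
def regroup(nodes, quorum_owner_site):
    """nodes: dict of node name -> {"site": str, "viable_interface": bool}."""
    survivors = []
    for name, info in nodes.items():
        if not info["viable_interface"]:
            continue                      # node voluntarily stops cluster service
        if info["site"] == quorum_owner_site:
            survivors.append(name)        # quorum arbitration succeeds here
    return survivors
```

With one node per site and the quorum owned by the West site, only the West node survives, matching the behavior shown in the symptoms that follow.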
Quorum arbitration succeeds on the site that originally owned the quorum consistency group and fails on the nodes that did not own the quorum consistency group. Cluster service then shuts itself down on the nodes where quorum arbitration fails.<br />
In Microsoft Windows 2000 environments, MSCS does not check for network interface availability during the regroup process and starts the quorum arbitration process immediately after a regroup process is initiated, that is, after six missed heartbeats.<br />
Once the cluster has determined which nodes are allowed to remain active in the cluster, the cluster node attempts to bring online all data groups previously owned by the other cluster nodes. The <strong>SafeGuard</strong> 30m Control resource and its associated dependent resources will come online.<br />
During this total communication failure, replication is “Paused by system.” An extended outage requires a full volume sweep. Refer to Section 4 for more information.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console shows a WAN error; all groups are paused. The other site<br />
shows a status of “Unknown.” Figure 7–15 illustrates one site.<br />
Figure 7–15. Management Console Display Showing WAN Error<br />
• The RAs tab on the management console lists errors as shown in Figure 7–16.<br />
Figure 7–16. RAs Tab for Total Communication Failure<br />
• Warnings and informational messages similar to those shown in Figure 7–17 appear<br />
on the management console. See the table after the figure for an explanation of the<br />
numbered console messages.<br />
Figure 7–17. Management Console Messages for Total Communication Failure<br />
The following table explains the numbered messages in Figure 7–17.<br />
Reference No.   Event ID   Description                                       E-mail Immediate   E-mail Daily Summary<br />
1               4001       For each consistency group, a group               X<br />
                           capabilities minor problem is reported. The<br />
                           details indicate that a WAN problem is<br />
                           suspected on both RAs.<br />
2               4008       For each consistency group on the West and        X<br />
                           the East sites, the transfer is paused. The<br />
                           details indicate that a WAN problem is<br />
                           suspected.<br />
3               3021       For each RA at each site, the following error     X<br />
                           message is reported:<br />
                           Error in WAN link to RA at other site (RA x)<br />
4               1008       The message “User action succeeded” is            X<br />
                           displayed. The details indicate that a failover<br />
                           was initiated. This message appears when the<br />
                           groups are moved by the <strong>SafeGuard</strong> Control<br />
                           resource to the surviving cluster node.<br />
• All cluster resources appear online after successfully failing over to the surviving<br />
node.<br />
• The cluster service stops on all nodes except the surviving node.<br />
• From the surviving node, the host system event log has entries similar to the<br />
following:<br />
Event Type : Warning<br />
Event Source : ClusSvc<br />
Event Category: Node Mgr<br />
Event ID : 1123<br />
Date : 6/1/2008<br />
Time : 12:58:55 PM<br />
User : N/A<br />
Computer : USMV-WEST2<br />
Description:<br />
The node lost communication with cluster node 'USMV-EAST2' on <strong>Public</strong> network.<br />
Event Type : Warning<br />
Event Source : ClusSvc<br />
Event Category: Node Mgr<br />
Event ID : 1123<br />
Date : 6/1/2008<br />
Time : 12:58:55 PM<br />
User : N/A<br />
Computer : USMV-WEST2<br />
Description:<br />
The node lost communication with cluster node 'USMV-EAST2' on Private network.<br />
Event Type : Warning<br />
Event Source : ClusSvc<br />
Event Category: Node Mgr<br />
Event ID : 1135<br />
Date : 6/1/2008<br />
Time : 12:58:16 PM<br />
User : N/A<br />
Computer : USMV-WEST2<br />
Description:<br />
Cluster node USMV-EAST2 was removed from the active server cluster membership. Cluster service may<br />
have been stopped on the node, the node may have failed, or the node may have lost communication<br />
with the other active server cluster nodes.<br />
Event Type : Information<br />
Event Source : ClusSvc<br />
Event Category: Failover Mgr<br />
Event ID : 1200<br />
Date : 6/1/2008<br />
Time : 12:58:21 PM<br />
User : N/A<br />
Computer : USMV-WEST2<br />
Description:<br />
The Cluster Service is attempting to bring online the Resource Group "Group 1".<br />
Event Type : Information<br />
Event Source : ClusSvc<br />
Event Category: Failover Mgr<br />
Event ID : 1201<br />
Date : 6/1/2008<br />
Time : 1:02:54 PM<br />
User : N/A<br />
Computer : USMV-WEST2<br />
Description:<br />
The Cluster Service brought the Resource Group "Group 1" online.<br />
• From the surviving node, the private and public network connections show an<br />
exclamation mark “Unknown” status as shown in Figures 7–18 and 7–19.<br />
Figure 7–18. Cluster Administrator Showing Private Network Down<br />
Figure 7–19. Cluster Administrator Showing <strong>Public</strong> Network Down<br />
Actions to Resolve the Problem<br />
Note: Typically, a network administrator for the site is required to diagnose which<br />
network switch, gateway, or connection is the cause of this failure.<br />
Perform the following actions to isolate and resolve the problem:<br />
1. When you observe on the management console that a WAN error occurred on site 1<br />
and on site 2, call the other site to verify that each management console is available<br />
and shows a WAN down because of the failure. If only one site can access the<br />
management console, the problem is probably not a total WAN failure but rather a<br />
management network failure. In that case, see “Management Network Failure in a<br />
Geographic Clustered Environment.”<br />
2. In the Cluster Administrator, verify that only one node is active in the cluster.<br />
3. View the network properties of the public and private network.<br />
The display should show an “Unknown” status for the private and public network.<br />
4. Check for network problems using methods such as isolating the failure to the<br />
network switch or gateway by pinging from the cluster node to the gateway at each<br />
site.<br />
Port Information<br />
Problem Description<br />
Communications problems might occur because of firewall settings that prevent all<br />
necessary communication.<br />
Symptoms<br />
The following symptoms might help you identify this problem:<br />
• Unable to reach the DNS server.<br />
• Unable to communicate to the NTP server.<br />
• Unable to reach the mail server.<br />
• The RAs tab shows RA data link errors.<br />
• The management console shows errors for the WAN.<br />
• The management console logs show RA communications errors.<br />
Actions to Resolve<br />
Perform the port diagnostics from each of the RAs by following the steps given in<br />
Appendix C.<br />
The following tables provide port information that you can use in troubleshooting the<br />
status of connections.<br />
Table 7–2. Ports for Internet Communication<br />
Port Numbers   Protocol or Protocols               Unisys Product <strong>Support</strong> IP Address<br />
21             FTP                                 192.61.61.78<br />
443            Used for remote maintenance (TCP)   129.225.216.130<br />
The following tables list ports used for communication other than Internet<br />
communication.<br />
Table 7–3. Ports for Management LAN<br />
Communication and Notification<br />
Port Numbers Protocol or Protocols<br />
21 Default FTP port (needed for collecting system<br />
information)<br />
22 Default SSH and communications between RAs<br />
25 Default outgoing mail (SMTP) port, used if e-mail alerts from<br />
the RA are configured<br />
80 Web server for management (TCP)<br />
123 Default NTP port<br />
161 Default SNMP port<br />
443 Secure Web server for management (TCP)<br />
514 Syslog (UDP)<br />
1097 RMI (TCP)<br />
1099 RMI (TCP)<br />
4401 RMI (TCP)<br />
4405 Host-to-RA kutils communications (SQL<br />
commands) and KVSS (TCP)<br />
7777 Automatic host information collection<br />
Solving Network Problems<br />
The ports listed in Table 7–4 are used for both the management LAN and WAN.<br />
Table 7–4. Ports for RA-to-RA Internal Communication<br />
Port Numbers | Protocol or Protocols<br />
23 | telnet<br />
123 | NTP (UDP)<br />
1097 | RMI (TCP)<br />
1099 | RMI (TCP)<br />
4444 | TCP<br />
5001 | TCP (default iperf port for performance measuring between RAs)<br />
5010 | Management server (UDP, TCP)<br />
5020 | Control (UDP, TCP)<br />
5030 | RMI (TCP)<br />
5040 | Replication (UDP, TCP)<br />
5060 | Mpi_perf (TCP)<br />
5080 | Connectivity diagnostics tool<br />
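To check whether a firewall is blocking the listed ports, you can attempt a TCP connection to each one from a host on the management LAN. This is an illustrative sketch, not a supported Unisys tool: the RA address is a placeholder, and UDP-only ports (such as 514 for syslog or the default NTP and SNMP ports) cannot be verified this way.<br />

```python
# Sketch: probe the TCP ports from Tables 7-2 through 7-4 to see
# whether a firewall is blocking them. The RA address is a placeholder.
import socket

RA_ADDRESS = "192.0.2.10"  # placeholder: a static RA management IP

# TCP ports that can be probed with a plain connection attempt.
TCP_PORTS = {
    21: "FTP (system information collection)",
    22: "SSH and RA-to-RA communication",
    25: "Outgoing mail (SMTP)",
    80: "Web server for management",
    443: "Secure Web server for management",
    1097: "RMI", 1099: "RMI", 4401: "RMI",
    4405: "Host-to-RA kutils and KVSS",
    5010: "Management server", 5020: "Control", 5040: "Replication",
}

def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port, desc in sorted(TCP_PORTS.items()):
    state = "open" if port_open(RA_ADDRESS, port) else "blocked or closed"
    print(f"{port:5d}  {desc}: {state}")
```

A port reported as "blocked or closed" from one side but open from the other usually points at an intermediate firewall rule rather than the RA itself.<br />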
Section 8<br />
Solving Replication Appliance (RA)<br />
Problems<br />
This section lists symptoms that usually indicate problems with one or more Unisys<br />
<strong>SafeGuard</strong> 30m replication appliances (RAs). The problems include hardware failures.<br />
The graphics, behaviors, and examples in this section are similar to what you observe<br />
with your system but might differ in some details.<br />
For problems relating to RAs, gather the RA logs and ask the following questions:<br />
• Are any errors displayed on the management console?<br />
• Is the issue constant? Is the issue a one-time occurrence? Does the issue occur at<br />
intervals?<br />
• What are the states of the consistency groups?<br />
• What is the timeframe in which the problem occurred?<br />
• When was the first occurrence of the problem?<br />
• What actions were taken as a result of the problem or issue?<br />
• Were any recent changes made in the replication environment? If so, what?<br />
Table 8–1 lists symptoms and possible causes for the failure of a single RA on one site<br />
with a switchover as a symptom. Table 8–2 lists symptoms and possible causes for the<br />
failure of a single RA on one site without switchover symptoms. Table 8–3 lists<br />
symptoms and other possible problems regarding multiple RA failures. Each problem and<br />
the actions to resolve it are described in this section.<br />
In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />
for possible problems. Also, messages similar to e-mail notifications might be displayed<br />
on the management console. If you do not see the messages, they might have already<br />
dropped off the display. Review the management console logs for messages that have<br />
dropped off the display.<br />
Table 8–1. Possible Problems for Single RA Failure with a Switchover<br />
Symptoms | Possible Problem<br />
The management console shows RA failure. | Single RA failure<br />
Possible Contributing Causes to Single RA Failure with a Switchover<br />
The system frequently pauses transfer for all consistency groups. If you log in to the failed RA as the boxmgmt user, a message is displayed explaining that the reboot regulation limit has been exceeded. The management console shows repeated events that report an RA is up followed by an RA is down. | Reboot regulation failover<br />
The link indicator lights on all host bus adapters (HBAs) are not illuminated. The port indicator lights on the Fibre Channel switch no longer show a link to the RA. Port errors occur or there is no target when running the SAN diagnostics. The management console shows RA failure with details pointing to a problem with the repository volume. | Failure of all SAN Fibre Channel HBAs on one RA<br />
The link indicator lights on the HBA or HBAs are not illuminated. The port indicator lights on the network switch or hub no longer show a link to the RA. | Onboard WAN network adapter failure (or failure of the optional gigabit Fibre Channel WAN network adapter)<br />
Table 8–2. Possible Problems for Single RA Failure Without a Switchover<br />
Symptoms | Possible Problem<br />
The link indicator lights on the onboard management network adapter are not illuminated. | Onboard management network adapter failure<br />
The failure light for the hard disk indicates a failure. An error message that appears during a boot operation indicates failure of one of the internal disks. | Single hard-disk failure<br />
The link indicator lights on the HBA are not illuminated. The port indicator lights on the Fibre Channel switch no longer show a link to the RA. For one of the ports on the relevant RA, errors appear when running the SAN diagnostics. | Port failure of a single SAN Fibre Channel HBA on one RA<br />
Table 8–3. Possible Problems for Multiple RA Failures with Symptoms<br />
Symptoms | Possible Problem<br />
Replication has stopped on all groups. MSCS fails over groups to the other site, or MSCS fails on all nodes. The management console displays a WAN error to the other site. | Failure of all RAs on one site<br />
Replication has stopped on all groups. MSCS fails over groups to the other site, or MSCS fails on all nodes. The management console displays a WAN error to the other site. | All RAs on one site are not attached<br />
Single RA Failures<br />
Problem Description<br />
When an RA fails, a switchover might occur. In some cases, a switchover does not<br />
occur. See “Single RA Failures With Switchover” and “Single RA Failures Without<br />
Switchover.”<br />
Understanding Management Console Access<br />
If the RA that failed had been running site control—that is, the RA owned the virtual<br />
site management IP address—and a switchover occurs, the virtual IP address moves to<br />
the new RA.<br />
If you attempt to connect to the management console using one of the static<br />
management IP addresses of the RAs, a connection error occurs if the RA does not have<br />
site control. Thus, you should use the site management IP address to connect to the<br />
management console.<br />
At least one RA (either RA 1 or RA 2) must be attached to the RA cluster for the<br />
management console to function.<br />
If the RA that failed was running site control and a switchover does not occur (such as<br />
with an onboard management network connection failure), the management console<br />
might not be accessible, and attempts to log in to that RA using PuTTY with the<br />
boxmgmt log-in account fail. When an RA does not have site control, you can always<br />
log in to it using PuTTY and the boxmgmt log-in account.<br />
You cannot determine which RA owns site control unless the management console is<br />
accessible. The RA that owns site control is identified at the bottom of the<br />
management console display.<br />
Another situation in which you cannot log in to the management console is when the<br />
user account has been locked. In this case, follow these steps:<br />
1. Log in interactively using PuTTY with another unlocked user account.<br />
2. Enter unlock_user.<br />
3. Determine whether any users are listed, and follow the messages to unlock the<br />
locked user accounts.<br />
Figure 8–1 illustrates a single RA failure.<br />
Figure 8–1. Single RA Failure<br />
Single RA Failure with Switchover<br />
In this case, a single RA fails, and there is an automatic switchover to a surviving RA on<br />
the same site. Any groups that had been running on the failed RA run on a surviving RA<br />
at the same site.<br />
Each RA handles the replicating activities of the consistency groups for which it is<br />
designated as the preferred RA. The consistency groups that are affected are those that<br />
were configured with the failed RA as the preferred RA. Thus, whenever an RA becomes<br />
inoperable, the handling of the consistency groups for that RA switches over<br />
automatically to the functioning RAs in the same RA cluster.<br />
During the RA switchover process, the server applications do not experience any I/O<br />
failures. In a geographic clustered environment, MSCS is not aware of the RA failure,<br />
and all application and replication operations continue to function normally. However,<br />
performance might be affected because the I/O load on the surviving RAs is now<br />
increased.<br />
Symptoms<br />
Failures of an RA that cause a switchover are as follows:<br />
• RA hardware issues (such as memory, motherboard, and so forth)<br />
• Reboot regulation failover<br />
• Failure of all SAN Fibre Channel HBAs on one RA<br />
• Onboard WAN network adapter failure (or failure of the optional gigabit Fibre Channel<br />
WAN network adapter)<br />
The following symptoms might help you identify this failure:<br />
• The RA does not boot.<br />
From a power-on reset, the BIOS display shows the BIOS information, RAID adapter<br />
utility prompt, logical drives found, and so forth. The display is similar to the<br />
information shown in Figure 8–2.<br />
Figure 8–2. Sample BIOS Display<br />
Once the RA initializes, the log-in screen is displayed.<br />
Note: Because status messages normally scroll on the screen, you might need to<br />
press Enter to see the log-in screen.<br />
• The management console system status shows an RA failure. (See Figure 8–3.)<br />
To display more information about the error, click the red error in the right column.<br />
The More Info dialog box is displayed with a message similar to the following:<br />
RA 1 in West is down<br />
Figure 8–3. Management Console Display Showing RA Error and RAs Tab<br />
• The RAs tab on the management console shows information similar to that in<br />
Figure 8–3, specifically<br />
− The RA status for RA 1 on the West site shows an error.<br />
− The peer RA on the East site (RA 1) shows a data link error.<br />
− Each RA on the East site shows a WAN connection failure.<br />
− The surviving RA at the failed site (West) does not show any errors.<br />
• Warnings and informational messages similar to those shown in Figure 8–4 appear<br />
on the management console when an RA fails and a switchover occurs. See the<br />
table after the figure for an explanation of the numbered console messages. In your<br />
environment, the messages pertain only to the groups configured to use the failed<br />
RA as the preferred RA.<br />
Figure 8–4. Management Console Messages for Single RA Failure with Switchover<br />
The following table explains the numbered messages shown in Figure 8–4.<br />
Reference No. | Event ID | Description | E-mail Immediate | E-mail Daily Summary<br />
1 | 3023 | At the same site, the other RA reports a problem getting to the LAN of the failed RA. | X |<br />
2 | 3008 | The site with the failed RA reports that the RA is probably down. | X |<br />
3 | 2000 | The management console is now running on RA 2. | X |<br />
4 | 4001 | For each consistency group, a minor problem is reported. The details show that the RA is down or not a cluster member. | X |<br />
5 | 4008 | For each consistency group, the transfer is paused at the surviving site to allow a switchover. The details show the reason for the pause as switchover. | X |<br />
Reference No. | Event ID | Description | E-mail Immediate | E-mail Daily Summary<br />
6 | 4041 | For each consistency group at the same site, the groups are activated at the surviving RA. This probably means that a switchover to RA 2 at the failed site was successful. | X |<br />
7 | 5032 | For each consistency group at the failed site, the splitter is again splitting. | X |<br />
8 | 3021 | A WAN link error is reported from each RA at the surviving site regarding the failed RA at the other site. | X |<br />
9 | 4010 | For each consistency group at the failed site, the transfer is started. | X |<br />
10 | 4086 | For each consistency group at the failed site, an initialization is performed. | X |<br />
11 | 4087 | For each consistency group at the failed site, the initialization completes. | X |<br />
12 | 3007 | The failed RA (RA 1) is now restored. | X |<br />
To see the details of the messages listed on the management console display, you must<br />
collect the logs and then review the messages for the time of the failure. Appendix A<br />
explains how to collect the management console logs, and Appendix E lists the event<br />
IDs with explanations.<br />
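When reviewing the collected logs for the time of the failure, a short filter can narrow the entries to the event IDs of interest. This sketch is illustrative only: it assumes a timestamp-prefixed log line format, which may differ from the format of your collected files, so adjust the parsing accordingly.<br />

```python
# Sketch: filter collected management console log entries to those near
# the failure time that carry one of the event IDs of interest.
# The log line format assumed here (ISO timestamp, then an event ID)
# is an assumption; adapt the regex to the actual collected logs.
import re
from datetime import datetime, timedelta

EVENT_IDS = {"3008", "3023", "2000", "4001", "4008"}  # IDs from the table above

def entries_near(path, failure_time, window_minutes=30):
    """Yield log lines whose timestamp falls within the window around
    `failure_time` and whose first event ID is one of EVENT_IDS."""
    window = timedelta(minutes=window_minutes)
    pattern = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?\b(\d{4})\b")
    with open(path) as f:
        for line in f:
            m = pattern.match(line)
            if not m:
                continue  # skip lines that do not look like log entries
            stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            if abs(stamp - failure_time) <= window and m.group(2) in EVENT_IDS:
                yield line.rstrip()
```
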
Actions to Resolve the Problem<br />
The following list summarizes the actions you need to perform to isolate and resolve the<br />
problem:<br />
• Check the LCD display on the front panel of the RA. See “LCD Status Messages” in<br />
Appendix B for more information.<br />
If the LCD display shows an error, run the RA diagnostics. See Appendix B for more<br />
information.<br />
• Check all indicator lights on the rear panel of the RA.<br />
• Review the symptoms and actions in the following topics:<br />
− Reboot Regulation<br />
− Onboard WAN Network Adapter Failure<br />
• If you determine that the failed RA must be replaced, contact the Unisys service<br />
representative for a replacement RA.<br />
After you receive the replacement RA, follow the steps in Appendix D to install and<br />
configure it.<br />
The following procedure provides a detailed description of the actions to perform:<br />
1. Remove the front bezel of the RA and look at the LCD display. During normal<br />
operation, the illuminated message should identify the system.<br />
If the LCD display flashes amber, the system needs attention because of a problem<br />
with power supplies, fans, system temperature, or hard drives.<br />
Figure 8–5 shows the location of the LCD display.<br />
Figure 8–5. LCD Display on Front Panel of RA<br />
If an error message is displayed, check Table B–1. For example, the message E0D76<br />
indicates a drive failure. (Refer to “Single Hard Disk Failure” in this section.)<br />
If the message code is not listed in Table B–1, run the RA diagnostics (see<br />
Appendix B).<br />
2. Check the indicators at the rear of the RA as described in the following steps and<br />
visually verify that all are working correctly.<br />
Figure 8–6 illustrates the rear panel of the RA.<br />
Note: The network connections on the rear panel labeled 1 and 2 in the following<br />
illustration might appear different on your RA. The connection labeled 1 is always the RA<br />
replication network, and the connection labeled 2 is always the RA management<br />
network. Pay special attention to the labeling when checking the network connections.<br />
Figure 8–6. Rear Panel of RA Showing Indicators<br />
• Ping each network connection (management network and replication network), and<br />
visually verify that the LEDs on either side of the cable on the back panel are<br />
illuminated. Figure 8–7 shows the location of these LEDs.<br />
If the LEDs are off, the network is not connected. The green LED is lit if the network<br />
is connected to a valid link partner on the network. The amber LED blinks when<br />
network data is being sent or received.<br />
If the management network LEDs indicate a problem, refer to “Onboard<br />
Management Network Adapter Failure” in this section.<br />
If the replication network LEDs indicate a problem, refer to “Onboard WAN Network<br />
Adapter Failure” in this section.<br />
Figure 8–7. Location of Network LEDs<br />
• Check that the green LEDs for the SAN Fibre Channel HBAs are illuminated as<br />
shown in Figure 8–8.<br />
Figure 8–8. Location of SAN Fibre Channel HBA LEDs<br />
The following table explains the LED patterns and their meanings. If the LEDs<br />
indicate a problem, refer to the two topics for SAN Fibre Channel HBA failures in this<br />
section.<br />
Green LED | Amber LED | Activity<br />
On | On | Power<br />
On | Off | Online<br />
Off | On | Signal acquired<br />
Off | Flashing | Loss of synchronization<br />
Flashing | Flashing | Firmware error<br />
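If it helps to translate field notes programmatically, the LED pattern table above can be expressed as a simple lookup. This sketch is illustrative only; the state strings mirror the table, and the function name is hypothetical.<br />

```python
# Sketch: the SAN Fibre Channel HBA LED pattern table as a lookup, so an
# observed (green, amber) reading can be translated to the adapter state.
HBA_LED_STATES = {
    ("on", "on"): "Power",
    ("on", "off"): "Online",
    ("off", "on"): "Signal acquired",
    ("off", "flashing"): "Loss of synchronization",
    ("flashing", "flashing"): "Firmware error",
}

def hba_state(green: str, amber: str) -> str:
    """Map observed (green LED, amber LED) readings to the adapter state."""
    return HBA_LED_STATES.get((green.lower(), amber.lower()), "Unknown pattern")
```

For example, a reading of green off and amber flashing maps to "Loss of synchronization", which points at a cabling or link problem rather than a dead adapter.<br />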
Reboot Regulation<br />
Problem Description<br />
After frequent, unexplained reboots or restarts of the replication process, the RA<br />
automatically detaches from the RA cluster.<br />
When installing the RAs, you can enable or disable this reboot regulation feature. The<br />
factory default is for the feature to be enabled so that reboot regulation is triggered<br />
whenever a specified number of reboots or failures occur within the specified time<br />
interval.<br />
The two parameters available for the reboot regulation feature are the number of reboots<br />
(including internal failures) and the time interval. The default value for the number of<br />
reboots is 10, and the default value for the time interval is 2 hours.<br />
Only Unisys personnel should change these values. Use the Installation Manager to<br />
change the parameter values or disable the feature. See the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong><br />
Replication Appliance Installation <strong>Guide</strong> for information about using the Installation<br />
Manager tools to make these changes.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• Frequent transfer pauses for all consistency groups that have the same preferred<br />
RA.<br />
• If you log in to the RA as the boxmgmt user, the following message is displayed:<br />
Reboot regulation limit has been exceeded<br />
• Several messages might be displayed on the Logs tab of the management console<br />
as an RA reboots to try to correct a problem. These messages are listed in<br />
Table 8–4.<br />
Table 8–4. Management Console Messages Pertaining to Reboots<br />
Reference No./Legend | Event ID | Description | E-mail Immediate | E-mail Daily Summary<br />
* | 3008 | The RA appears to be down. The RA might attempt to perform a reboot to correct the problem. | X |<br />
* | 3023 | Error in LAN link (as RA reboots). | X |<br />
* | 3021 | Error in WAN link (as RA reboots). | X |<br />
* | 3007 | The RA is up (the reboot completes). | X |<br />
* | 3022 | The LAN link is restored (the reboot has completed). | X |<br />
* | 3020 | The WAN link at the other site is restored (the reboot has completed). | X |<br />
When any of these messages appear multiple times in a short time period, they<br />
might indicate an RA that has continuously rebooted and might have reached the<br />
reboot regulation limit.<br />
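The regulation rule described above (a reboot count within a time interval) can be sketched as a sliding-window counter. This is an illustrative model only, not the RA's actual implementation; the class name is hypothetical, and the defaults follow the text (10 reboots within 2 hours).<br />

```python
# Sketch of the reboot-regulation idea: once the number of reboots
# recorded within the interval reaches the limit, the RA detaches from
# the cluster. Illustrative model only; defaults follow the text above.
from collections import deque
from datetime import datetime, timedelta

class RebootRegulator:
    def __init__(self, max_reboots=10, interval=timedelta(hours=2)):
        self.max_reboots = max_reboots
        self.interval = interval
        self._reboots = deque()

    def record_reboot(self, when: datetime) -> bool:
        """Record a reboot; return True once the number of reboots within
        the interval reaches the limit (i.e., the RA should detach)."""
        self._reboots.append(when)
        # discard reboots that fell out of the sliding window
        while self._reboots and when - self._reboots[0] > self.interval:
            self._reboots.popleft()
        return len(self._reboots) >= self.max_reboots
```
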
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Collect the RA logs before you attempt to resolve the problem. See Appendix A for<br />
information about collecting logs.<br />
2. To determine whether the hardware is faulty, run the RA diagnostics described in<br />
Appendix B.<br />
3. If the problem remains, submit the RA logs to Unisys for analysis.<br />
4. Once the problem is corrected, the RA automatically attaches to the RA cluster after<br />
a power-on reset. If necessary, reattach the RA to the RA cluster manually by<br />
following these steps:<br />
a. Log in as boxmgmt to the RA through an SSH session using PuTTY.<br />
b. At the prompt, type 4 (Cluster operations) and press Enter.<br />
c. At the prompt, type 1 (Attach to cluster) to attach the RA to the cluster.<br />
d. At the prompt, type Q (Quit).<br />
Failure of All SAN Fibre Channel Host Bus Adapters (HBAs)<br />
Problem Description<br />
Symptoms<br />
All SAN Fibre Channel HBAs or adapter ports on the RA fail. This scenario is unlikely<br />
because the RA has redundant ports that are located on different physical adapters. A<br />
SAN connectivity problem is more likely.<br />
Note: The failure of a single redundant path does not show errors on the management<br />
console display. See “Port Failure on a Single SAN Fibre Channel HBA on One RA.”<br />
The following symptoms might help you identify this failure:<br />
• The link indicator lights on all SAN Fibre Channel HBAs are not illuminated. (Refer to<br />
Figure 8–8 for the location of these LEDs.)<br />
• The port indicator lights on the Fibre Channel switch no longer show a link to the RA.<br />
• Port errors occur or no target appears when running the Installation Manager SAN<br />
diagnostics.<br />
• Information on the Volumes tab of the management console is inconsistent or<br />
periodically changing.<br />
• The management console shows failures for RAs, storage, and hosts. (See<br />
Figure 8–9.)<br />
Figure 8–9. Management Console Display: Host Connection with RA Is<br />
Down<br />
If you click the red error indication for RAs in the right column, the message is<br />
RA 2 in East can’t access repository volume<br />
If you click the red error indication for storage in the right column, the following<br />
messages are displayed:<br />
If you click the red error indication in the right column for splitters, the message is<br />
ERROR: USMV-EAST2's connection with RA2 is down<br />
• Warnings and informational messages similar to those shown in Figure 8–10 appear<br />
on the management console when an RA fails with this type of problem. See the<br />
table after the figure for an explanation of the numbered console messages.<br />
Also, refer to Figure 8–4 and the table that explains the messages for information<br />
about an RA failure with a generic switchover.<br />
Refer to Table 8–4 for other messages that might occur whenever an RA reboots to<br />
try to correct the problem.<br />
Figure 8–10. Management Console Messages for Failed RA (All SAN HBAs Fail)<br />
The following table explains the numbered messages shown in Figure 8–10. You<br />
might also see the messages denoted with an asterisk (*) in Table 8–4.<br />
Reference No. | Event ID | Description | E-mail Immediate | E-mail Daily Summary<br />
1 | 3014 | The RA is unable to access the repository volume (RA 2). | X |<br />
2 | 4003 | For each consistency group that had the failed RA as the preferred RA, a group consistency problem is reported. The details show a repository volume problem. | X |<br />
3 | 3012 | The RA is unable to access volumes (all volumes for repository, journal, and data are listed). | X |<br />
4 | 4086 | Initialization started (RA 1, Quorum - West). | X |<br />
5 | 4087 | Initialization complete (RA 1, Quorum - West). The group has completed the switchover. | X |<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Refer to Section 6, “Solving SAN Connectivity Problems,” to determine whether the<br />
problem is described there.<br />
2. If you determine that the SAN Fibre Channel HBA failed and must be replaced,<br />
contact a Unisys service representative for a replacement adapter.<br />
3. Once the replacement adapter is received, perform the following steps to replace the<br />
failed HBA:<br />
a. Open a PuTTY session using the IP address of the RA and log in as<br />
boxmgmt/boxmgmt.<br />
Appendix C provides additional information about the Installation Manager<br />
diagnostics.<br />
b. On the Main menu, type 3 (Diagnostics) and press Enter.<br />
c. On the Diagnostics menu, type 2 (Fibre Channel diagnostics) and press<br />
Enter.<br />
d. On the Fibre Channel Diagnostics menu, type 4 (View Fibre Channel<br />
details) and press Enter.<br />
Information similar to the following is displayed:<br />
>>Site1 Box 1>>3<br />
Port 0<br />
wwn = 50012482001c6fb0<br />
node_wwn = 50012482001c6fb1<br />
Port id = 0x20100<br />
operating mode = point to point<br />
speed = 2 GB<br />
Port 1<br />
----------------------------------<br />
wwn = 50012482001ce3c4<br />
node_wwn = 50012482001ce3c5<br />
Port id = 0x10100<br />
operating mode = point to point<br />
speed = 2 GB<br />
e. Write down the port information.<br />
f. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.<br />
g. On the Diagnostics menu, type B (Back) and press Enter.<br />
h. On the Main Menu, type 4 (Cluster operations) and press Enter.<br />
i. On the Cluster Operations menu, type 2 (Detach from cluster) and press<br />
Enter.<br />
j. Shut down the RA.<br />
k. Replace the failed adapter with the replacement, and then boot the RA.<br />
Note: The replacement adapter does not require any settings to be changed.<br />
l. Repeat steps a through d, and again view the Fibre Channel details to see the<br />
new WWN for the replaced HBA.<br />
m. Using the SAN switch management interface, modify the zoning as needed to<br />
replace the failed WWN with the new WWN.<br />
n. Use the new WWN to configure the storage.<br />
o. On the Fibre Channel Diagnostics menu, type 1 (SAN diagnostics) and<br />
press Enter. (Refer to steps a through c to access the Fibre Channel<br />
Diagnostics menu.)<br />
When you select the SAN diagnostics option, the system conducts automatic<br />
tests that are designed to identify the most common problems encountered in<br />
the configuration of SAN environments.<br />
Once the tests complete, a message is displayed confirming the successful<br />
completion of SAN diagnostics, or a report is displayed that details any critical<br />
configuration problems.<br />
p. Once no problems are reported from the SAN diagnostics, type B (Back) and<br />
press Enter.<br />
q. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.<br />
r. On the Diagnostics menu, type B (Back) and press Enter.<br />
s. On the Main Menu, type 4 (Cluster operations) and press Enter.<br />
t. On the Cluster Operations menu, type 1 (Attach to cluster) and press Enter.<br />
This action reattaches the RA, which automatically reboots and restarts<br />
replication.<br />
Note: The replacement Fibre Channel HBA does not need any configuration changes.<br />
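When comparing WWNs before and after the replacement (steps e and l), a small parser over the saved "View Fibre Channel details" output can help. This is an illustrative sketch; the listing format it assumes matches the sample shown in this procedure, and the function name is hypothetical.<br />

```python
# Sketch: extract (wwn, node_wwn) pairs from the "View Fibre Channel
# details" listing so that pre- and post-replacement values can be
# compared. Assumes the listing format shown earlier in this procedure.
import re

def parse_wwns(listing: str):
    """Return a list of (wwn, node_wwn) pairs, one per port."""
    # "^\s*wwn" will not match "node_wwn", since that token starts with "n"
    wwns = re.findall(r"^\s*wwn\s*=\s*([0-9a-f]+)", listing, re.MULTILINE)
    nodes = re.findall(r"node_wwn\s*=\s*([0-9a-f]+)", listing)
    return list(zip(wwns, nodes))
```

Comparing the parsed pairs from before and after the swap makes it obvious which WWN must be updated in the switch zoning and storage configuration.<br />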
Failure of Onboard WAN Adapter or Failure of Optional Gigabit<br />
Fibre Channel WAN Adapter<br />
Problem Description<br />
Symptoms<br />
The onboard WAN adapter failed. This adapter serves the replication network.<br />
Notes:<br />
• The gigabit Fibre Channel WAN adapter is an optional component found in some<br />
environments. When this board fails, the symptoms are the same as those observed<br />
when the onboard WAN adapter fails. In that case, the indicator lights pertain to the<br />
gigabit Fibre Channel WAN board instead of the onboard capability.<br />
• The actions to resolve the problem are similar once you isolate the board as the<br />
problem. That is, contact a Unisys service representative for a replacement part.<br />
The following symptoms might help you identify this failure:<br />
• Transfer between sites pauses temporarily for all consistency groups for which this<br />
is the preferred RA while an RA switchover occurs.<br />
• Applications continue to run. High loads might occur because of reduced total<br />
throughput capacity.<br />
• The link indicators on the onboard WAN adapter might not be illuminated. (See<br />
Figure 8–6 for the location of the connector for the replication network WAN.<br />
Figure 8–7 illustrates the LEDs.)<br />
• The port lights on the network switch might indicate that there is no link to the<br />
onboard WAN adapter.<br />
• The management console shows a WAN data link failure for RA 1. The More Info<br />
dialog box for this error provides the message: “RA-x WAN data link is down.” (See<br />
Figure 8–11.)<br />
Figure 8–11. Management Console Showing WAN Data Link Failure<br />
• The RAs tab on the management console (Figure 8–11) shows an error for the same<br />
RA at each site, indicating that the connectivity between them has been lost.<br />
• Warnings and informational messages similar to those shown in Figure 8–4 for an<br />
RA failure are displayed for this failure. Refer to the table after Figure 8–4 for<br />
descriptions of the messages. For this failure, the details of event ID 4001 show a<br />
WAN data path problem.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Isolate the problem to the onboard WAN adapter by performing the actions in<br />
“Replication Network Failure in a Geographic Clustered Environment” in<br />
Section 7.<br />
• If you determine that the motherboard must be replaced, contact a Unisys service<br />
representative for a replacement part.<br />
• Contact the Unisys <strong>Support</strong> Center for the appropriate BIOS for the replacement<br />
part.<br />
Note: The replacement motherboard might not have the disk controller set for<br />
RAID1 (mirroring). Check the setting and change it if necessary.<br />
• In rare cases, you might need to obtain a replacement RA from a Unisys service<br />
representative. After you receive the replacement RA, follow the steps in Appendix<br />
D to install and configure it.<br />
Single RA Failures Without a Switchover<br />
Problem Description<br />
Some failures that might occur on an RA do not cause a switchover. These failures are<br />
• Port failure on a single SAN Fibre Channel HBA on one RA<br />
• Onboard management network adapter failure<br />
• Single hard disk failure<br />
Port Failure on a Single SAN Fibre Channel HBA on One RA<br />
Problem Description<br />
Symptoms<br />
One SAN Fibre Channel HBA port on the RA failed.<br />
The following symptoms might help you identify this failure:<br />
• The Logs tab on the management console displays a message for event ID 3030—<br />
Warning RA switched path to storage. (RA , Volumes )—only if the<br />
connection failed during an I/O operation.<br />
• The link indicator lights on the SAN Fibre Channel HBA are not illuminated. (Refer to<br />
Figure 8–8 for the location of these LEDs.)<br />
• The port indicator lights on the Fibre Channel switch no longer show a link to the RA.<br />
• For one port on the relevant RA, errors occur when running the Installation Manager<br />
SAN diagnostics. See Appendix C for information about these diagnostics.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. If you determine that the SAN Fibre Channel HBA failed and must be replaced,<br />
contact a Unisys service representative for a replacement part.<br />
2. Once the replacement adapter is received, perform the following steps to replace<br />
the failed HBA:<br />
a. Open a PuTTY session using the IP address of the RA, and log in as<br />
boxmgmt/boxmgmt.<br />
Appendix C provides additional information about the Installation Manager<br />
diagnostics.<br />
b. On the Main menu, type 3 (Diagnostics) and press Enter.<br />
c. On the Diagnostics menu, type 2 (Fibre Channel diagnostics) and press<br />
Enter.<br />
d. On the Fibre Channel Diagnostics menu, type 4 (View Fibre Channel<br />
details) and press Enter.<br />
Information similar to the following is displayed:<br />
>>Site1 Box 1>>3<br />
Port 0<br />
---------------------------------<br />
wwn = 50012482001c6fb0<br />
node_wwn = 50012482001c6fb1<br />
Port id = 0x20100<br />
operating mode = point to point<br />
speed = 2 GB<br />
Port 1<br />
---------------------------------<br />
wwn = 50012482001ce3c4<br />
node_wwn = 50012482001ce3c5<br />
Port id = 0x10100<br />
operating mode = point to point<br />
speed = 2 GB<br />
e. Write down the port information.<br />
f. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.<br />
g. On the Diagnostics menu, type B (Back) and press Enter.<br />
h. On the Main Menu, type 4 (Cluster operations) and press Enter.<br />
i. On the Cluster Operations menu, type 2 (Detach from cluster) and press<br />
Enter.<br />
j. Shut down the RA.<br />
k. Replace the failed adapter with the replacement, and then boot the RA.<br />
Note: The replacement adapter does not require any settings to be changed.<br />
l. Repeat steps a through d and again view the Fibre Channel details to see the<br />
new WWN for the replaced HBA.<br />
m. Using the management interface of the SAN switch, modify the zoning as<br />
needed to replace the failed WWN with the new WWN.<br />
n. Use the new WWN to configure the storage.<br />
o. On the Fibre Channel Diagnostics menu, type 1 (SAN diagnostics) and<br />
press Enter. (Refer to steps a through c to access the Fibre Channel<br />
Diagnostics menu.)<br />
When you select the SAN diagnostics option, the system conducts automatic<br />
tests that are designed to identify the most common problems encountered in<br />
the configuration of SAN environments.<br />
Once the tests complete, a message is displayed confirming the successful<br />
completion of SAN diagnostics, or a report is displayed that details any critical<br />
configuration problems.<br />
p. Once no problems are reported from the SAN diagnostics, type B (Back) and<br />
press Enter.<br />
q. On the Fibre Channel Diagnostics menu, type B (Back) and press Enter.<br />
r. On the Diagnostics menu, type B (Back) and press Enter.<br />
s. On the Main Menu, type 4 (Cluster operations) and press Enter.<br />
t. On the Cluster Operations menu, type 1 (Attach to cluster) and press Enter.<br />
This action reattaches the RA, which automatically reboots and restarts<br />
replication.<br />
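When several ports must be recorded in steps e and l, the "View Fibre Channel details" output shown in step d can also be parsed programmatically. The following Python sketch is not part of the product; the helper name and parsing rules are assumptions based only on the sample output above:<br />

```python
import re

def parse_fc_details(text):
    """Parse 'View Fibre Channel details' output into a dict per port.

    Returns {port_number: {"wwn": ..., "node_wwn": ...}}.
    (Hypothetical helper; format assumed from the sample output above.)
    """
    ports = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"\s*Port (\d+)", line)
        if m:
            current = int(m.group(1))
            ports[current] = {}
            continue
        m = re.match(r"\s*(wwn|node_wwn)\s*=\s*([0-9a-fA-F]+)", line)
        if m and current is not None:
            ports[current][m.group(1)] = m.group(2).lower()
    return ports

# Example using output similar to step d:
sample = """\
Port 0
wwn = 50012482001c6fb0
node_wwn = 50012482001c6fb1
Port id = 0x20100
operating mode = point to point
speed = 2 GB
Port 1
wwn = 50012482001ce3c4
node_wwn = 50012482001ce3c5
Port id = 0x10100
operating mode = point to point
speed = 2 GB
"""
print(parse_fc_details(sample)[0]["wwn"])  # 50012482001c6fb0
```

Comparing the parsed values before and after the HBA replacement identifies the new WWN needed for the zoning and storage changes in steps m and n.<br />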
Note: The replacement Fibre Channel HBA does not need any configuration changes.<br />
Onboard Management Network Adapter Failure<br />
Problem Description<br />
Symptoms<br />
The onboard management network adapter failed.<br />
The following symptoms might help you identify this failure:<br />
• On the management console, the system status and RA status do not display any<br />
error indications.<br />
• The link indicators on the onboard management network adapter are not illuminated.<br />
(See Figure 8–6 for the location of the connector for the onboard management<br />
network adapter. Figure 8–7 illustrates the LEDs.)<br />
• If RA site control was running on the failed RA, you cannot access the management<br />
console. If the management console was already open, a banner is displayed showing<br />
“not connected.”<br />
• If RA site control was not running on the failed RA, you can access the management<br />
console.<br />
• You cannot determine which RA owns site control unless the management console<br />
is accessible. The RA that owns site control is designated at the bottom of the<br />
display.<br />
• See “Management Network Failure in a Geographic Clustered Environment” in<br />
Section 7 for additional symptoms.<br />
• The Logs tab on the management console might display a message for event ID<br />
3023—Error in LAN link to RA (RA1)—for this failure.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Isolate the problem to the onboard management network adapter by performing the<br />
actions in “Management Network Failure in a Geographic Clustered Environment” in<br />
Section 7.<br />
• If you determine the motherboard must be replaced, contact a Unisys service<br />
representative for a replacement part.<br />
• Contact the Unisys <strong>Support</strong> Center for the appropriate BIOS for the replacement<br />
part.<br />
Note: The replacement motherboard might not have the disk controller set for<br />
RAID1 (mirroring). Check the setting and change it if necessary.<br />
• In rare cases, you might need to obtain a replacement RA from a Unisys service<br />
representative. After you receive the replacement RA, follow the steps in Appendix<br />
D to install and configure it.<br />
Single Hard Disk Failure<br />
Problem Description<br />
Symptoms<br />
One of the mirrored internal hard disks for the RA failed.<br />
The following symptoms might help you identify this failure:<br />
• The failure light for a hard disk indicates a failure. Figure 8–12 illustrates the location<br />
of the LEDs for hard disks in the RA.<br />
Figure 8–12. Location of Hard Drive LEDs<br />
• An error message that appears during boot indicates failure of one of the internal<br />
disks.<br />
• The LCD display on the front panel of the RA indicates a drive failure. This error code<br />
is E0D76 as shown in Figure 8–5.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• If the drive failed, you must replace the hard drive. Contact a Unisys service<br />
representative for a replacement part.<br />
• Install the new drive; resynchronization occurs automatically.<br />
Do not power off or reboot the RA while resynchronization is taking place.<br />
Failure of All RAs at One Site<br />
Problem Description<br />
If all RAs fail on one site, replication stops and the data that are currently changing on the<br />
remote site are marked for synchronization. Once the RAs are restored, synchronization<br />
occurs through a full-sweep operation.<br />
This type of failure is unlikely unless the power source fails.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• Transfer is paused for all consistency groups.<br />
• Depending on the environment and group settings, applications that were running on<br />
the failed site might stop.<br />
• If the quorum resource belonged to a node at the failed site, MSCS might fail.<br />
• The symptoms for this failure are similar to a total site failure and a network failure<br />
on both the management network and WAN. Because the WAN link is functioning,<br />
the difference is that the following are true:<br />
− Neither site can access the management console using the site management IP<br />
address of the site with the failed RAs.<br />
− Both sites can access the management console using the site management IP<br />
address of the site with the functioning RAs.<br />
Communicate with the administrator at the other site to determine whether that site<br />
can access the management console. Both sites should see a display similar to<br />
Figure 8–13.<br />
Figure 8–13. Management Console Showing All RAs Down<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Restore power to the failed RAs.<br />
2. If recovery of applications is needed prior to restoring the RAs, see the recovery<br />
topics in Section 3 for geographic replication environments and in Section 4 for<br />
geographic clustered environments.<br />
All RAs Are Not Attached<br />
Problem Description<br />
If all RAs at a site are not attached, connection to the management console is not<br />
available. Also, you cannot access the RA using a PuTTY session and the site<br />
management IP address. You cannot log into the RA using the RA management IP<br />
address and the admin user account. The RA that runs site control is assigned a virtual IP<br />
address that is the site management IP address. Either RA 1 or RA 2 must be attached<br />
to the cluster to have an RA cluster with site control running.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• You cannot log in to the management console using the site management IP<br />
addresses of the failed sites.<br />
• You cannot initiate an SSH session through PuTTY using the admin account to either<br />
RA management IP address or the site management IP address.<br />
• From the management console of the other site, the WAN appears to be down. (See<br />
Figure 8–11.)<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Ping the RA using the management IP address. If the ping is not successful, refer to<br />
“Management Network Failure in a Geographic Clustered Environment” in<br />
Section 7. If the ping completes successfully, continue with steps 2 through 5.<br />
2. Log in as boxmgmt to each RA management IP address through an SSH session<br />
using PuTTY. (See “Using the SSH Client” in Appendix C for more information.) If<br />
this is not successful, the RA is probably not attached.<br />
3. To verify that the RA is not attached, follow these steps:<br />
a. Log in as boxmgmt to the RA.<br />
b. At the prompt, type 4 (Cluster operations) and press Enter.<br />
Note: If the “reboot regulation limit has been exceeded” message is displayed<br />
when you log in as boxmgmt, see “Reboot Regulation” in this<br />
section.<br />
c. At the prompt, type 2 (Detach from cluster) and press Enter.<br />
Do not type y to confirm the detach. If the RA was not attached, a message is displayed<br />
stating that it is not attached.<br />
Note: Either RA 1 or RA 2 must be attached to have a cluster. RAs 3 through 8<br />
cannot become cluster masters.<br />
4. If the RA is not attached, then type B (Back) and press Enter.<br />
5. At the prompt, type 1 (Attach to cluster) to attach the RA to the cluster.<br />
6. At the prompt, type Q (Quit).<br />
7. Once the RA is attached, log in as admin to the management console and also<br />
initiate an SSH session to the management IP address to ensure that both are<br />
operational.<br />
8. At the management console, click the RAs tab and check that all connections are<br />
working.<br />
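The decision flow in steps 1 through 5 can be summarized as follows. This Python sketch is purely illustrative (the function and its inputs are assumptions; the actual checks are performed manually with ping and PuTTY as described above):<br />

```python
def next_action(ping_ok, boxmgmt_login_ok, attached):
    """Map the outcome of each check in steps 1-3 to the next action.

    ping_ok          - ping of the RA management IP address succeeded
    boxmgmt_login_ok - SSH login as boxmgmt succeeded
    attached         - the Cluster Operations menu reports the RA as attached
    """
    if not ping_ok:
        return "see Management Network Failure in Section 7"
    if not boxmgmt_login_ok:
        return "RA is probably not attached"
    if not attached:
        return "attach the RA to the cluster"
    return "verify management console and SSH access"

print(next_action(True, True, False))  # attach the RA to the cluster
```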
Section 9<br />
Solving Server Problems<br />
This section lists symptoms that usually indicate problems with one or more servers.<br />
The problems listed in this section include hardware failure problems. Table 9–1 lists<br />
symptoms and possible problems indicated by the symptom. The problems and their<br />
solutions are described in this section. The graphics, behaviors, and examples in this<br />
section are similar to what you observe with your system but might differ in some<br />
details.<br />
In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />
for any of the possible problems or causes. Also, messages similar to e-mail notifications<br />
are displayed on the management console. If you do not see the messages, they might<br />
have already dropped off the display. Review the management console logs for<br />
messages that have dropped off the display.<br />
Table 9–1. Possible Server Problems with Symptoms<br />
Symptom: The management console shows a server down. Messages on the<br />
management console show the splitter is down and that the node fails over.<br />
Multipathing software (such as EMC PowerPath Administrator) messages report<br />
errors. (This symptom might occur if the server is unable to connect with the SAN<br />
or if the server HBA fails.)<br />
Possible Problem: Cluster node failure (hardware or software) in a geographic<br />
clustered environment, possibly resulting from<br />
• Windows server reboot<br />
• Unexpected server shutdown because of a bug check<br />
• Server crash or restart<br />
• Server unable to connect with SAN<br />
• Server HBA failure<br />
Symptom: Host logs and RA log timestamps are not synchronized.<br />
Possible Problem: Infrastructure (NTP) server failure<br />
Symptom: Applications are down.<br />
Possible Problem: Server failure (hardware or software) in a geographic replication<br />
environment, possibly resulting from<br />
• Windows server reboot<br />
• Unexpected server shutdown because of a bug check<br />
• Server crash or restart<br />
• Server unable to connect with SAN<br />
• Server HBA failure<br />
Cluster Node Failure (Hardware or Software) in a<br />
Geographic Clustered Environment<br />
Problem Description<br />
MSCS uses several heartbeat mechanisms to detect whether a node is still actively<br />
responding to cluster activities. MSCS assumes a cluster node has failed when the<br />
cluster node no longer responds to heartbeats that are broadcast over the public\private<br />
cluster networks and when a SCSI reservation is lost on the quorum volume.<br />
Figure 9–1 illustrates this failure.<br />
Figure 9–1. Cluster Node Failure<br />
If the server that crashed was the MSCS leader (quorum owner), another cluster node<br />
(the challenger) tries to become leader and arbitrate for the quorum device. Because the<br />
failed server is no longer the quorum device owner in the reservation manager, the<br />
arbitration by the challenger instantly succeeds.<br />
If the challenger node is from the same site as the failed server, arbitration instantly<br />
succeeds, and no failover of the quorum device to the remote site is required.<br />
If the challenger node is from the remote site, the RA reverses the replication direction<br />
of the quorum consistency group. Once failover completes, the challenger arbitration is<br />
completed.<br />
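The arbitration outcome described above depends only on whether the challenger node shares a site with the failed leader. The following Python sketch models that behavior for illustration (the function is an assumption, not part of MSCS or the RA software):<br />

```python
def quorum_arbitration(challenger_site, failed_leader_site):
    """Model the quorum arbitration behavior described above.

    Same-site challenger: arbitration succeeds instantly, no failover.
    Remote-site challenger: the RA must first reverse the replication
    direction of the quorum consistency group (a failover).
    """
    if challenger_site == failed_leader_site:
        return "instant arbitration; no quorum failover"
    return "quorum consistency group fails over before arbitration completes"

print(quorum_arbitration("Site1", "Site2"))
```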
When a nonleader MSCS node fails, the data groups move to the remaining MSCS local<br />
or remote nodes, depending on preferred ownership settings. From the perspective of<br />
the RA, this situation is equivalent to a user-initiated move of the data groups. That is,<br />
the <strong>SafeGuard</strong> 30m Control resource on the node that tries to bring the group online<br />
sends a command to fail over the group to its site. If the group fails over to a cluster<br />
node on the same site, failover occurs instantly. Otherwise, a consistency group failover<br />
is initiated to the remote site. The <strong>SafeGuard</strong> 30m Control resource does not come<br />
online until the consistency group has completed failover.<br />
Possible Subset Scenarios<br />
The symptoms of a server failure vary based on the reasons that the server went down.<br />
Five different scenarios are described as subsets of this type of failure:<br />
• Windows Server Reboot<br />
• Unexpected Server Shutdown Because of a Bug Check<br />
• Server Crash or Restart<br />
• Server Unable to Connect with SAN<br />
• Server HBA Failure<br />
One of the first things to determine in troubleshooting a server failure is whether the<br />
failure was an unexpected event (a “crash”) or an orderly event such as an operator<br />
reboot. When the server crashes, you usually see a “blue screen” and do not have<br />
access to messages. Once the server comes up again, then you can view messages<br />
regarding the reason it crashed. These messages help diagnose the reason for the initial<br />
shutdown or failure.<br />
In an orderly event, the Windows event log is stopped, and you can view events that<br />
point to the reason for the reboot or restart.<br />
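The crash-versus-orderly distinction can be read directly from the Windows System event IDs shown in the examples later in this section (6006 means the Event log service stopped in an orderly way; 6008 reports an unexpected shutdown; 1001 reports a reboot from a bugcheck). The following Python sketch is an illustration only; the helper is an assumption, not a Unisys or Microsoft tool:<br />

```python
def classify_shutdown(event_ids):
    """Classify a server restart from the Windows System event IDs
    logged around it, using the IDs cited in this section:
      6006 - Event log service was stopped (orderly shutdown)
      6008 - previous system shutdown was unexpected
      1001 - Save Dump: rebooted from a bugcheck
    """
    if 6008 in event_ids or 1001 in event_ids:
        return "crash"
    if 6006 in event_ids:
        return "orderly"
    return "undetermined"

print(classify_shutdown([6006, 6009, 6005]))  # orderly
```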
Windows Server Reboot<br />
Problem Description<br />
The consistency groups fail over to another local node or to the other site because a<br />
server fails or goes down. In this scenario, the shutdown is an orderly event and thus<br />
causes the Windows event log service to stop.<br />
Symptoms<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows a server failure similar to that shown in<br />
Figure 9–2.<br />
Figure 9–2. Management Console Display with Server Error<br />
• Warning and informational messages similar to those shown in Figure 9–3 appear on<br />
the management console when a server fails. See the table after the figure for an<br />
explanation of the numbered console messages.<br />
Figure 9–3. Management Console Messages for Server Down<br />
The following table explains the numbered messages shown in Figure 9–3.<br />
Reference No. 1, Event ID 5008: The source site reports that server<br />
USMV-CAS100P2 performed an orderly shutdown.<br />
Reference No. 2, Event ID 4062: The surviving site accesses the latest image of the<br />
consistency group during the failover.<br />
Reference No. 3, Event ID 5032: For each consistency group that moves to a<br />
surviving node, the splitter is again splitting.<br />
Reference No. 4, Event ID 4008: For each consistency group that moves to a<br />
surviving node, the transfer is paused. In the details of this message, the reason for<br />
the pause is given.<br />
Reference No. 5, Event ID 1008: The Unisys <strong>SafeGuard</strong> 30m Control<br />
resource successfully issued an initiate_failover command.<br />
Reference No. 6, Event ID 4086: For each consistency group that moves to a<br />
surviving node, data transfer starts and then a quick initialization starts.<br />
Reference No. 7, Event ID 4087: For each consistency group that moves to a<br />
surviving node, initialization completes.<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
• If you review the system event logs, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images.<br />
System Event Log for Usmv-Cas100p2 Host (Failure Host on Site 1)<br />
6/01/2008 16:19:13 PM EventLog Information None 6006 N/A USMV-WEST2 The Event log<br />
service was stopped.<br />
6/01/2008 16:19:48 PM EventLog Information None 6009 N/A USMV-WEST2 Microsoft (R)<br />
Windows (R) 5.02. 3790 Service Pack 1 Multiprocessor Free.<br />
6/01/2008 16:19:48 PM EventLog Information None 6005 N/A USMV-WEST2 The Event<br />
log service was started.<br />
System Event Log for Usmv-x455 Host (Surviving Host on Site 2)<br />
6/01/2008 16:19:15 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />
communication with cluster node 'USMV-WEST2' on network '<strong>Public</strong>'.<br />
6/01/2008 16:19:15 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />
communication with cluster node 'USMV-WEST2' on network 'Private'.<br />
6/01/2008 16:19:56 PM ClusSvc Warning Node Mgr 1135 N/A USMV-EAST2 Cluster node<br />
USMV-WEST2 was removed from the active server cluster membership. Cluster service may have been<br />
stopped on the node, the node may have failed, or the node may have lost communication with the other<br />
active server cluster nodes.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the failed node owning the quorum used to generate the<br />
previous management console images:<br />
Cluster Log for Usmv-West2 Host (Failure Host on Site 1)<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [GUM] GumUpdateRemoteNode: Failed to get<br />
completion status for async RPC call, status 1115. (Error 1115: A system shutdown is in progress)<br />
0000089c.00000a54::2008/05/25-10:31:42.107 ERR [GUM] GumSendUpdate: Update on node 2 failed<br />
with 1115 when it must succeed<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [GUM] GumpCommFailure 1115 communicating<br />
with node 2<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [NM] Banishing node 1 from active<br />
cluster membership.<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [RGP] Node 1: REGROUP WARNING: reload failed.<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [NM] Halting this node due to membership or<br />
communications error. Halt code = 1.<br />
0000089c.00000a54:: 2008/05/25-10:31:42.107 ERR [CS] Halting this node to prevent an inconsistency<br />
within the cluster. Error status = 5890. (Error 5890: An operation was attempted that is incompatible with<br />
the current membership state of the node)<br />
0000091c.00000fe4:: 2008/05/25-10:31:42.107 ERR [RM] LostQuorumResource, cluster service<br />
terminated...<br />
Cluster Log for Usmv-East2 Host (Surviving Host on Site 2)<br />
00000268.00000c38::2008/05/25-10:31:42.107 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 2<br />
00000268.00000c38::2008/05/25-10:31:42.107 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 1<br />
00000268.00000b70::2008/05/25-10:31:42.107 WARN [NM] Communication was lost with interface<br />
374359a2-5782-4b1d-a863-07f84f8c97d9 (node: USMV-WEST2, network: private)<br />
00000268.00000b70::2008/05/25-10:31:42.107 INFO [NM] Updating local connectivity info for network<br />
afe1f350-f66a-460a-a526-6f58987b911d.<br />
00000268.00000b70::2008/05/25-10:31:42.107 INFO [NM] Started state recalculation timer (2400ms) for<br />
network afe1f350-f66a-460a-a526-6f58987b911d (private)<br />
00000268.00000b70::2008/05/25-10:31:42.107 WARN [NM] Communication was lost with interface<br />
15b9fbe1-c05f-4e90-b937-17fdc27c133e (node: USMV-WEST2, network: public)<br />
00000268.00000b70::2008/05/25-10:31:42.107 INFO [NM] Updating local connectivity info for network<br />
9d905035-8105-4c87-a5bc-ce82e49e764a.<br />
00000268.00000b70::2008/05/25-10:31:42.107 INFO [NM] Started state recalculation timer (2400ms) for<br />
network 9d905035-8105-4c87-a5bc-ce82e49e764a (public)<br />
00000268.000005d0:: 2008/05/25-10:31:39.733 INFO [NM] We own the quorum resource.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Check for event 5008 in the management console logs. If this event is replaced by<br />
event 5013, the host probably crashed. See “Unexpected Server Shutdown Because<br />
of a Bug Check” and “Server Crash or Restart.”<br />
• Review the cluster log and check for the system shutdown message as shown in<br />
the preceding examples. Determine whether the quorum resource moved by<br />
checking the surviving nodes for the message “We own the quorum resource.”<br />
• Review the Windows system event log messages and determine whether or not the<br />
server failure was a crash or an orderly event.<br />
In this case, based on the example messages, the Windows system event log<br />
shows that the system started the reboot or shutdown in an orderly manner at<br />
6:19:13 p.m. (message 6006). Because the event log service was shut down, the<br />
events that follow show that the event log service restarted.<br />
For an orderly event, often an operator shuts down the system for some planned<br />
reason.<br />
• If the event log messages do not point to an orderly event, then review<br />
“Unexpected Server Shutdown Because of a Bug Check” and “Server Crash or<br />
Restart” as possible scenarios that fit the circumstances.<br />
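The first check above keys on two management console events: 5008 reports an orderly server shutdown, while 5013 (splitter down unexpectedly) suggests a crash. As an illustration only (the helper is an assumption, not part of the management console), that check can be sketched as:<br />

```python
def console_shutdown_type(console_event_ids):
    """Interpret management console event IDs per the first check above:
      5008 - server performed an orderly shutdown
      5013 - splitter is down unexpectedly (probable crash)
    """
    if 5013 in console_event_ids:
        return "probable crash"
    if 5008 in console_event_ids:
        return "orderly shutdown"
    return "review collected logs further"

print(console_shutdown_type([5008, 4062, 5032]))  # orderly shutdown
```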
Unexpected Server Shutdown Because of a Bug Check<br />
Problem Description<br />
Symptoms<br />
The consistency groups fail over to another local node or to the other site because a<br />
server fails or shuts down unexpectedly and then reboots after the “blue screen” event.<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows a server failure similar to that shown in<br />
Figure 9–2.<br />
• Warning and informational messages similar to those shown in Figure 9–4 appear on<br />
the management console when a server fails. See the table after the figure for an<br />
explanation of the numbered console messages.<br />
Figure 9–4. Management Console Messages for Server Down for Bug Check<br />
The following table explains the numbered messages shown in Figure 9–4.<br />
Reference No. 1, Event ID 5013: The splitter for the server USMV-WEST2 is down<br />
unexpectedly.<br />
Reference No. 2, Event ID 4008: For each consistency group, the transfer is paused<br />
at the source (down) site. In the details of this message, the reason for the pause is<br />
given.<br />
Reference No. 3, Event ID 5002: The splitter for server USMV-WEST2 is unable to<br />
access the RA unexpectedly.<br />
Reference No. 4, Event ID 4008: For each consistency group, the transfer is paused<br />
at the surviving site to allow a switchover. In the details of this message, the reason<br />
for the pause is given.<br />
Reference No. 5, Event ID 4062: The surviving site accesses the latest image of the<br />
consistency group during the failover.<br />
Reference No. 6, Event ID 5032: For each consistency group at the surviving site,<br />
the splitter is splitting to the replication volumes.<br />
Reference No. 7, Event ID 5002: The RA at the source (down) site cannot access<br />
the splitter for server USMV-WEST2.<br />
Reference No. 8, Event ID 4010: For each consistency group at the source site, the<br />
transfer is started.<br />
Reference No. 9, Event ID 4086: For each consistency group at the source site,<br />
data transfer starts and then initialization starts.<br />
Reference No. 10, Event ID 4087: For each consistency group at the source site,<br />
initialization completes.<br />
To see the details of the messages listed on the management console display, you<br />
must collect the logs and then review the messages for the time of the failure.<br />
Appendix A explains how to collect the management console logs, and Appendix E<br />
lists the event IDs with explanations.<br />
• If you review the Windows system event logs after the system reboots, you can find<br />
messages similar to the following examples that are based on the testing cases<br />
used to generate the previous management console images.<br />
System Log for Usmv-West2 Host (Failure Host on Site 1)<br />
6/01/2008 18:12:42 PM EventLog Error None 6008 N/A USMV-WEST2 The previous system<br />
shutdown at 18:02:42 PM on 6/01/2008 was unexpected.<br />
6/01/2008 18:12:42 PM EventLog Information None 6009 N/A USMV-WEST2 Microsoft (R)<br />
Windows (R) 5.02. 3790 Service Pack 1 Multiprocessor Free.<br />
6/01/2008 18:12:42 PM EventLog Information None 6005 N/A USMV-WEST2 The Event log<br />
service was started.<br />
6/01/2008 18:12:42 PM Save Dump Information None 1001 N/A USMV-WEST2 The<br />
computer has rebooted from a bugcheck. The bugcheck was: 0x0000007e (0xffffffffc0000005,<br />
0xe000015f97c8a664, 0xe000015f9e52be68, 0xe000015f9e52afb0). A dump was saved in:<br />
C:\WINDOWS\MEMORY.DMP.<br />
System Log for Usmv-East2 Host (Surviving Host on Site 2)<br />
6/01/2008 18:02:42 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />
communication with cluster node 'USMV-WEST2' on network '<strong>Public</strong>'.<br />
6/01/2008 18:02:42 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />
communication with cluster node 'USMV-WEST2' on network 'Private'.<br />
6/01/2008 18:02:42 PM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting<br />
a bus reset for device\Device\ClusDisk0.<br />
6/01/2008 18:02:42 PM ClusSvc Warning Node Mgr 1135 N/A USMV-EAST2 Cluster node<br />
USMV-WEST2 was removed from the active server cluster membership. Cluster service may have been<br />
stopped on the node, the node may have failed, or the node may have lost communication with the other<br />
active server cluster nodes.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the previous<br />
management console images:<br />
Cluster Log for Usmv-West2 Host (Failure Host on Site 1)<br />
For this error situation, no entries appear in the cluster log.<br />
Cluster Log for Usmv-East2 Host (Surviving Host on Site 2)<br />
000007e0.00000138::2008/06/01-18:02:42.104 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 2<br />
000007e0.00000138:: 2008/06/01-18:02:42.104 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 1<br />
000007e0.00000124:: 2008/06/01-18:02:42.104 WARN [NM] Communication was lost with interface<br />
5019923b-d7a1-4886-825f-207b5938d11e (node: USMV-WEST2, network: <strong>Public</strong>)<br />
000007e0.00000124:: 2008/06/01-18:02:42.104 WARN [NM] Communication was lost with interface<br />
f409cf69-9c30-48f0-8519-ad5dd14c3300 (node: B)<br />
000001c0.00000664:: 2008/06/01-18:02:42.507 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170. (Error 170: the request resource is in use)<br />
000001c0.00000664:: 2008/06/01-18:02:42.507 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
000001c0.00000664:: 2008/06/01-18:02:42.507 INFO Physical Disk : [DiskArb] We are about to<br />
break reserve.<br />
000007e0.00000a0c:: 2008/06/01-18:02:42.881 INFO [NM] We own the quorum resource.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Review the Windows system event log messages to determine the cause of the<br />
unexpected event.<br />
In this case, based on the four example messages, the Windows system event<br />
log first shows event 6008, in which the system unexpectedly shut down; it was not a<br />
reboot.<br />
Then event 6009 is typically displayed as a reboot message. This event occurs<br />
regardless of the reason for the reboot. The same is true for event 6005.<br />
The Save Dump event 1001 shows that a memory dump was saved. Based on this<br />
message, consult the Microsoft Knowledge Base (http://support.microsoft.com/)<br />
regarding bug checks. Search for bug check 0x0000007e or stop error<br />
0x0000007e, replacing the stop number with the one displayed in the message.<br />
2. Once you have the appropriate Knowledge Base article from the Microsoft site,<br />
follow the recommendations in the article to resolve the issue.<br />
3. If the information from the Knowledge Base article does not resolve the<br />
problem, collect and save the memory dump file and then submit it to the Unisys<br />
Support Center.<br />
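When scripting the search described in step 1, the stop code can be pulled out of the Save Dump message text. The sketch below assumes a typical event 1001 message wording; substitute the actual text from your own Windows system event log.<br />

```python
import re

# Example Save Dump (event 1001) message text. The exact wording here is an
# assumption -- use the message from your own system event log instead.
message = ("The computer has rebooted from a bugcheck. "
           "The bugcheck was: 0x0000007e (0xc0000005, 0x00000000, "
           "0x00000000, 0x00000000). "
           "A dump was saved in: C:\\WINDOWS\\MEMORY.DMP.")

# Stop codes are written as 0x followed by eight hexadecimal digits.
match = re.search(r"0x[0-9a-fA-F]{8}", message)
stop_code = match.group(0) if match else None
print(stop_code)  # -> 0x0000007e
```

The extracted value is the string to search for in the Microsoft Knowledge Base.<br />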
Server Crash or Restart<br />
Problem Description<br />
Symptoms<br />
When the server goes down for whatever reason and then restarts in a geographic<br />
clustered environment, the consistency groups fail over to the other site and then fail<br />
over to the original site once the server is restarted.<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows a server failure similar to that shown in<br />
Figure 9–2.<br />
• Warnings and informational messages similar to those shown in Figure 9–4 appear<br />
on the management console when the server fails. See the table after that figure for<br />
an explanation of the numbered console messages.<br />
• If you review the Windows system event log, you can find messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
management console images for Figures 9–2 and 9–4:<br />
System Log for Usmv-West2 Host (Failure Host on Site 1)<br />
6/01/2008 18:42:39 PM EventLog Error None 6008 N/A USMV-WEST2 The previous system<br />
shutdown at 18:05:55 PM on 6/01/2008 was unexpected.<br />
6/01/2008 18:42:39 PM EventLog Information None 6009 N/A USMV-WEST2 Microsoft (R)<br />
Windows (R) 5.02. 3790 Service Pack 2 Multiprocessor Free.<br />
6/01/2008 18:42:39 PM EventLog Information None 6005 N/A USMV-WEST2 The Event log<br />
service was started.<br />
System Log for Usmv-East2 Host (Surviving Host on Site 2)<br />
6/01/2008 18:05:55 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />
communication with cluster node 'USMV-WEST2' on network 'Public'.<br />
6/01/2008 18:05:55 PM ClusSvc Warning Node Mgr 1123 N/A USMV-EAST2 The node lost<br />
communication with cluster node 'USMV-WEST2' on network 'Private'.<br />
6/01/2008 18:05:55 PM ClusDisk Error None 1209 N/A USMV-EAST2 Cluster service is requesting<br />
a bus reset for device \Device\ClusDisk0.<br />
6/01/2008 18:05:55 PM ClusSvc Warning Node Mgr 1135 N/A USMV-EAST2 Cluster node<br />
USMV-WEST2 was removed from the active server cluster membership. Cluster service may have been<br />
stopped on the node, the node may have failed, or the node may have lost communication with the other<br />
active server cluster nodes.<br />
• If you review the cluster log, you can find messages similar to the following<br />
examples that are based on the testing cases used to generate the management<br />
console images for Figures 9–2 and 9–4:<br />
Cluster Log for Usmv-West2 Host (Failure Host on Site 1)<br />
For this error situation, no entries appear in the cluster log.<br />
Cluster Log for Usmv-East2 Host (Surviving Host on Site 2)<br />
000007e0.00000138::2008/06/01-18:05:55.102 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 2<br />
000007e0.00000138:: 2008/06/01-18:05:55.102 INFO [ClMsg] Received interface unreachable event for<br />
node 1 network 1<br />
000007e0.00000124:: 2008/06/01-18:05:55.102 WARN [NM] Communication was lost with interface<br />
5019923b-d7a1-4886-825f-207b5938d11e (node: USMV-WEST2, network: Public)<br />
000007e0.00000124:: 2008/06/01-18:05:55.102 WARN [NM] Communication was lost with interface<br />
f409cf69-9c30-48f0-8519-ad5dd14c3300 (node: USMV-WEST2, network: Private LAN)<br />
000001c0.00000168:: 2008/06/01-18:05:55.504 ERR Physical Disk : [DiskArb] GetPartInfo<br />
completed, status 170. (Error 170: the requested resource is in use)<br />
000001c0.00000168:: 2008/06/01-18:05:55.504 ERR Physical Disk : [DiskArb] Failed to read<br />
(sector 12), error 170.<br />
000001c0.00000168:: 2008/06/01-18:05:55.504 INFO Physical Disk : [DiskArb] We are about to<br />
break reserve.<br />
000007e0.00000764:: 2008/06/01-18:05:55.079 INFO [NM] We own the quorum resource.<br />
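When scanning large cluster logs for entries like those above, a small parser helps isolate the WARN and ERR lines. This is a sketch; the line layout is inferred from the examples in this guide.<br />

```python
import re

# Cluster log lines have the form:
#   PID.TID:: YYYY/MM/DD-HH:MM:SS.mmm LEVEL [Component] message
LINE = re.compile(
    r"(?P<pid>[0-9a-f]+)\.(?P<tid>[0-9a-f]+)::\s*"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s+(?P<rest>.*)")

def failures(log_text):
    """Return (timestamp, level, message) for WARN and ERR entries only."""
    out = []
    for line in log_text.splitlines():
        m = LINE.match(line.strip())
        if m and m.group("level") in ("WARN", "ERR"):
            out.append((m.group("ts"), m.group("level"), m.group("rest")))
    return out

# Single-line versions of entries shown in this section.
sample = """000007e0.00000138:: 2008/06/01-18:05:55.102 INFO [ClMsg] Received interface unreachable event for node 1 network 2
000007e0.00000124:: 2008/06/01-18:05:55.102 WARN [NM] Communication was lost with interface 5019923b-d7a1-4886-825f-207b5938d11e (node: USMV-WEST2, network: Public)
000001c0.00000168:: 2008/06/01-18:05:55.504 ERR Physical Disk : [DiskArb] Failed to read (sector 12), error 170."""

for entry in failures(sample):
    print(entry)  # prints only the WARN and ERR entries
```
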
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. Run the Microsoft Product Support MPS Report Utility to gather system information.<br />
(See “Using the MPS Report Utility” in Appendix A.)<br />
2. Submit the MPS report to the Unisys Support Center.<br />
Server Unable to Connect with SAN<br />
Problem Description<br />
Symptoms<br />
The server is unable to connect to the SAN.<br />
The following symptoms might help you identify this failure:<br />
• The management console display shows a server failure similar to that shown in<br />
Figure 9–5.<br />
Figure 9–5. Management Console Display Showing LA Site Server Down<br />
To display more information about the error, click More in the right column. A<br />
message similar to the following is displayed:<br />
ERROR: Splitter USMV-WEST2 is down<br />
• Warnings and informational messages similar to those shown in Figure 9–6 appear<br />
on the management console when the server fails. See the table after the figure for<br />
an explanation of the numbered console messages.<br />
Figure 9–6. Management Console Images Showing Messages for Server Unable to<br />
Connect to SAN<br />
The following table explains the numbered messages in Figure 9–6.<br />
Reference No. 1, Event ID 5013: The splitter for the server USMV-WEST2 is down.<br />
Reference No. 2, Event ID 4008: For each consistency group at the failed site, the<br />
transfer is paused to allow a failover to the surviving site.<br />
Reference No. 3, Event ID 4008: For each consistency group, the transfer is paused at<br />
the surviving site to allow a failover. In the details of this message, the reason for the<br />
pause is given.<br />
Reference No. 4, Event ID 5002: The splitter for the server USMV-WEST2 is unable to<br />
access the RA.<br />
Reference No. 5, Event ID 4010: The consistency groups on the original failed site start<br />
data transfer.<br />
Reference No. 6, Event ID 4086: For each consistency group at the failed site, data<br />
transfer starts and then initialization starts.<br />
Reference No. 7, Event ID 4087: For each consistency group at the failed site, data<br />
transfer completes.<br />
• The multipathing software (EMC PowerPath Administrator) flashes a red X on the<br />
right side of the toolbar.<br />
• The PowerPath Administrator Console reports failures similar to those shown in<br />
Figure 9–7.<br />
Figure 9–7. PowerPath Administrator Console Showing Failures<br />
• If you review the server system event log, you can find error messages similar to the<br />
following examples that are based on the testing cases used to generate the<br />
previous management console images.<br />
Type : warning<br />
Source : Ftdisk<br />
EventID : 57<br />
Description : The system failed to flush data to the transaction log. Corruption may occur.<br />
Type : error<br />
Source : Emcpbase<br />
EventID : 100<br />
Description : Path Bus x Tgt y LUN z to APMxxxx is dead<br />
Event 100 appears numerous times, once for each bus, target, and LUN.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
1. At the server, run a tool such as the PowerPath Administrator that might aid in<br />
diagnosing the problem.<br />
2. Log in to the storage software and determine whether problems are reported. If so,<br />
use the information for that software to correct the problems.<br />
Something might have happened to the volume, or the zoning configuration on the<br />
switch might have been changed. Also, a connection issue could exist such as a<br />
fabric switch or storage cable failure.<br />
3. If the problem is not limited to one server, run the Installation Manager Fibre<br />
Channel diagnostics. Appendix C explains how to run the Installation Manager<br />
diagnostics and provides information about the various diagnostic capabilities.<br />
4. If the problem still appears at the host, an adapter with multiple ports might have<br />
failed. Replace the Fibre Channel adapter in the host if the storage, zoning, and<br />
cabling appear correct. Ensure that the storage and zoning are updated to use the<br />
new WWN as necessary. (See “Server HBA Failure” for resolution actions.)<br />
Server HBA Failure<br />
Problem Description<br />
Symptoms<br />
One HBA in the server failed on a host that has multiple paths to storage.<br />
The following symptoms might help you identify this failure:<br />
• The multipathing software (such as EMC PowerPath Administrator) flashes a red X<br />
on the right side of the toolbar.<br />
• The PowerPath Administrator console reports failures similar to those shown in<br />
Figure 9–8.<br />
Figure 9–8. PowerPath Administrator Console Showing Adapter Failure<br />
• If you review the server system event log, you can find error messages similar to the<br />
following example:<br />
Type : error<br />
Source : Emcpbase<br />
EventID : 100<br />
Description:<br />
Path Bus x Tgt y LUN z to APMxxxx is dead<br />
Event 100 appears numerous times, once for each target and LUN.<br />
Actions to Resolve the Problem<br />
To replace an HBA in the server, perform the following steps:<br />
1. Run Emulex HBAnywhere and record the WWNs in use by the server.<br />
2. Shut down the server.<br />
3. Replace the failed HBA and then boot the server.<br />
4. Run Emulex HBAnywhere and record the new WWN.<br />
5. Using the SAN switch management software, modify the zoning as needed to<br />
replace the failed WWN with the new WWN.<br />
6. If manual discovery was used for the storage, update the configuration to use the<br />
new WWN.<br />
Infrastructure (NTP) Server Failure<br />
Problem Description<br />
Symptoms<br />
The replication environment is not affected by an NTP server failure; however, the<br />
timestamps of log entries are affected.<br />
The following symptoms might help you identify the failure:<br />
• When comparing log entries of a failover, the host application log and the<br />
management console entries are not synchronized.<br />
• You are unable to run the synchronization diagnostics as described in the Unisys<br />
SafeGuard Solutions Replication Appliance Installation Guide.<br />
Actions to Resolve the Problem<br />
To resolve an NTP server failure, perform the following steps:<br />
1. Temporarily change the cluster mode for a data consistency group to MSCS<br />
manual (for a group replicating from the source site to the target site).<br />
2. Perform a move-group operation on a cluster group that contains a Unisys SafeGuard<br />
Control resource to a node at the target site.<br />
3. View the management console log for event 1009 as shown in Figure 9–9.<br />
Figure 9–9. Event 1009 Display<br />
4. View the host application event log for event 1115, as follows:<br />
Event Type : Warning<br />
Event Source : 30mControl<br />
Event Category : None<br />
Event ID : 1115<br />
Date : 9/10/2006<br />
Time : 12:09:04 PM<br />
User : N/A<br />
Computer : USMV-EAST2<br />
Description:<br />
Resource name: Data1<br />
Online resource failed.<br />
Group is not a MSCS auto-data group (5).<br />
Action: Verify through the Management Console that the Global cluster mode is set to MSCS auto-data.<br />
Or if doing manual recovery, ensure an image has been selected.<br />
RA CLI command: UCLI -ssh -l plugin -pw **** -superbatch 172.16.7.60 initiate_failover group=Data1<br />
active_site=East cluster_owner=USMV-EAST2<br />
5. Compare the timestamps.<br />
If the time between the timestamps is not within a couple of minutes, the host and<br />
RAs are not synchronized.<br />
6. Use the Installation Manager site connectivity IP diagnostic by performing the<br />
following steps. (For more information, see Appendix C.)<br />
a. Log in to an RA as user boxmgmt with the password boxmgmt.<br />
b. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />
c. On the Diagnostics menu, type 1 (IP diagnostics) and press Enter.<br />
d. On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter.<br />
e. When asked to select a target for the tests, type 5 (Other host) and press<br />
Enter.<br />
f. Enter the IP address for the NTP server that you want to test.<br />
Note: In step e, you must specify 5 (Other host) rather than 4 (NTP Server),<br />
because site 2 does not specify an NTP server in the configuration, and the test<br />
will fail if you use 4 (NTP Server).<br />
7. If the NTP server fails, check that the NTP service on the NTP server is functioning<br />
correctly.<br />
8. Use the Installation Manager port diagnostics IP diagnostic to ensure that no ports<br />
are blocked. (For more information about running port diagnostics, see Appendix C.)<br />
9. Check that the NTP server specified for the host is the same NTP server specified<br />
for the RAs at site 1. (If you want to view the RA configuration settings, use the<br />
Installation Manager Setup View capability. For information about that capability,<br />
refer to the Unisys SafeGuard Solutions Replication Appliance Installation Guide.)<br />
10. Repeat steps 1 through 5, choosing a group that moves from the target site to<br />
the source site.<br />
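The timestamp comparison in step 5 can be sketched in a few lines. This is a sketch; the two-minute tolerance follows the "within a couple of minutes" guidance above, and the example times are illustrative.<br />

```python
from datetime import datetime

def synchronized(host_ts, console_ts, tolerance_minutes=2):
    """Return True if two event timestamps agree to within the tolerance."""
    delta = abs((host_ts - console_ts).total_seconds())
    return delta <= tolerance_minutes * 60

# Host application event log time vs. management console log time
# (illustrative values based on the event 1115 example above).
host_time = datetime(2006, 9, 10, 12, 9, 4)       # 12:09:04 PM
console_time = datetime(2006, 9, 10, 12, 10, 30)  # 12:10:30 PM

print(synchronized(host_time, console_time))  # -> True: within two minutes
```

If the result is False, the host and RAs are not synchronized, and the NTP diagnostics in step 6 apply.<br />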
Server Failure (Hardware or Software) in a<br />
Geographic Replication Environment<br />
Problem Description<br />
When a server goes down in a geographic replication environment, the circumstances<br />
and Windows event log messages are similar to those for the server failure in a<br />
geographic clustered environment. That is, the five subset scenarios previously<br />
presented apply as far as the event log messages and actions to resolve are concerned.<br />
The primary difference is that the main symptom of the server failure in this environment<br />
is that the user applications fail.<br />
Refer to the previous five subset scenarios for more details.<br />
Section 10<br />
Solving Performance Problems<br />
This section lists symptoms that usually indicate performance problems. Table 10–1 lists<br />
symptoms and possible problems indicated by the symptom. The problems and their<br />
solutions are described in this section. This section also includes a general discussion of<br />
high-load events. The graphics, behaviors, and examples in this section are similar to what<br />
you observe with your system but might differ in some details.<br />
The management console provides graphs that you can use to evaluate performance.<br />
For more information, see the Unisys SafeGuard Solutions Replication Appliance<br />
Administrator’s Guide.<br />
In addition to the symptoms listed, you might receive e-mail messages or SNMP traps<br />
for the possible problems. Also, messages similar to e-mail notifications are displayed on<br />
the management console. If you do not see the messages, they might have already<br />
dropped off the display. Review the management console logs for messages that have<br />
dropped off the display.<br />
Table 10–1. Possible Performance Problems with Symptoms<br />
Symptom: The initialization progression indicator (%) in the management interface<br />
progresses significantly slower than expected, or initialization completes after a<br />
significantly longer period of time than expected.<br />
Possible problem: Slow initialization<br />
Symptom: The event log indicates that the disk manager has reported high load<br />
conditions for a specific consistency group or groups. A consistency group or groups<br />
start to initialize; this initialization can occur once or multiple times, depending on the<br />
circumstances.<br />
Possible problem: High load (disk manager)<br />
Symptom: The event log indicates that the distributor has reported high load conditions<br />
for a specific consistency group or groups. A consistency group or groups start to<br />
initialize; this initialization can occur once or multiple times, depending on the<br />
circumstances.<br />
Possible problem: High load (distributor)<br />
Symptom: Applications are offline for a lengthy period during changes in the replication<br />
direction.<br />
Possible problem: Failover time lengthens<br />
Slow Initialization<br />
Problem Description<br />
Symptoms<br />
Initialization of a consistency group or groups takes longer than expected.<br />
Progression of initialization is reported through the management console in percentages.<br />
You might notice that the percentage for a group has not progressed in a long time or<br />
progresses at a slow rate. This progression might or might not be normal depending on<br />
several factors.<br />
For some groups, it might be natural to take a long time to advance to the next<br />
percentage. One percent of 10 TB is much larger than one percent of 100 GB; therefore,<br />
larger groups would take longer to advance in initialization.<br />
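The effect of group size on progress can be checked with simple arithmetic. This is a sketch; the 50 MB/s throughput figure is an illustrative assumption, not a measured or documented rate.<br />

```python
def seconds_per_percent(group_size_gb, throughput_mb_per_s):
    """Time, in seconds, to advance one percentage point of initialization."""
    one_percent_mb = group_size_gb * 1024 / 100  # 1% of the group, in MB
    return one_percent_mb / throughput_mb_per_s

# At an assumed 50 MB/s effective initialization rate:
small = seconds_per_percent(100, 50)         # 100 GB group
large = seconds_per_percent(10 * 1024, 50)   # 10 TB group

print(round(small), "s vs", round(large / 60), "min per percentage point")
```

A 10 TB group takes roughly 100 times longer per percentage point than a 100 GB group, so a slowly moving indicator on a large group is not by itself a problem.<br />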
The following symptoms might help you identify this failure:<br />
• The initialization progression indicator (%) in the management interface progresses<br />
significantly slower than expected.<br />
• Initialization completes after a significantly longer period of time than expected.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Verify the bandwidth of the connection between sites using the Installation Manager<br />
network diagnostic tools to test the WAN speed while there is no traffic over the<br />
WAN. Appendix C explains how to run these diagnostics.<br />
• Use the Installation Manager Fibre Channel diagnostic tools or customer<br />
storage/SAN diagnostic tools to test the performance of the source and target<br />
storage LUNs to ensure that all storage LUNs are capable of handling the observed<br />
load. Appendix C explains how to run the Installation Manager diagnostics.<br />
If storage performance on either site is poor, the replication system could be limited<br />
in its ability to read from the replication volumes on the source site or to write to the<br />
journal volume on the remote site. Poor storage performance reduces the maximum<br />
speed at which the RAs can initialize.<br />
• Verify that no bandwidth limitation exists on the relevant group or groups properties.<br />
• Use the event log to verify that no other events occurred during initialization—for<br />
example, high load conditions, WAN disconnections, or storage disconnections—that<br />
could have caused the initialization to restart.<br />
• Diagnosis of these types of problems is usually specific to the environment. Collect<br />
RA logs and submit a service request to Unisys support if the cause of slow<br />
initialization cannot be determined through the actions given above. See Appendix A<br />
for information about collecting logs.<br />
General Description of High-Load Event<br />
A high-load event reports that, at the time of the event, a bottleneck existed in the<br />
replication process. To keep track of the changes being made during the bottleneck, the<br />
replication goes into “marking mode” and records the location of all changed data on the<br />
source replication volume until the activity causing the bottleneck has subsided.<br />
The three possible points at which a bottleneck might occur are<br />
• Between the host and RA—Disk Manager<br />
Of the three points, this one is the least likely to cause a bottleneck. This type of<br />
bottleneck occurs when the host is writing to the storage device faster than the RA<br />
can handle.<br />
• The WAN<br />
This type of bottleneck occurs when the host is writing to the storage device faster<br />
than the RAs can replicate over the available bandwidth. For example, a host is<br />
writing to the storage device during peak hours at a rate of 60 Mbps. The RAs<br />
compress this data down to 15 Mbps. The available bandwidth is 10 Mbps. Clearly,<br />
during peak hours, the bandwidth is not sufficient to support the write rate;<br />
therefore, during peak hours, a number of high load events occur.<br />
• The remote storage—Distributor<br />
This type of bottleneck occurs when the storage device containing the journal<br />
volume on the remote site cannot keep up with the speed that the data is being<br />
replicated to the remote site. To avoid this situation, configure the journal volume on<br />
the fastest possible LUNs using the fastest RAID and the most disk spindles. Also,<br />
use multiple journal volumes located on different physical disks in the storage array<br />
or use separate disk subsystems in the same consistency group so that the<br />
replication can perform an additional layer of striping. The replication stripes the<br />
images across these multiple journal volumes.<br />
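The WAN example above (60 Mbps of writes compressed to 15 Mbps against 10 Mbps of bandwidth) can be expressed as a quick sufficiency check. This is a sketch using the figures from this section; the 4:1 compression ratio is inferred from the 60-to-15 Mbps example.<br />

```python
def wan_sufficient(write_rate_mbps, compression_ratio, available_mbps):
    """Return (bandwidth required after compression, whether the link keeps up)."""
    required = write_rate_mbps / compression_ratio
    return required, required <= available_mbps

# Peak-hours example from this section: 60 Mbps of writes, compressed 4:1.
required, ok = wan_sufficient(60, 4, 10)
print(required, ok)  # -> 15.0 False: expect high-load events during the peak
```

Whenever the required rate exceeds the available bandwidth, the replication falls into marking mode and high-load events are logged until the peak subsides.<br />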
High-Load (Disk Manager) Condition<br />
Problem Description<br />
Symptoms<br />
The disk manager reports high-load conditions.<br />
The following symptoms might help you identify this failure:<br />
• The event log indicates that the disk manager reported high load conditions for a<br />
specific consistency group or groups (event ID 4019).<br />
• A consistency group or groups start to initialize. This initialization can occur once or<br />
multiple times, depending on the circumstances.<br />
Actions to Resolve<br />
Perform the following actions to isolate and resolve the problem:<br />
• Use the Installation Manager network diagnostic tools to test the WAN speed while<br />
there is no traffic over the WAN. Appendix C explains how to run these diagnostics.<br />
• Analyze the performance data for the consistency groups on the RA to ensure that<br />
the incoming write rate is not outside the limits of the available bandwidth or the<br />
capabilities of the RA.<br />
• High loads can occur naturally during traffic peaks or during periods of high external<br />
activity on the WAN. If the high load events occur infrequently or can be associated<br />
with a temporal peak, consider this behavior as normal.<br />
• Diagnosis of these types of problems is usually specific to the environment. Collect<br />
RA logs and submit a service request to the Unisys Support Center if the high load<br />
events occur frequently and you cannot resolve the problem through the actions<br />
previously listed. See Appendix A for information about collecting logs.<br />
High-Load (Distributor) Condition<br />
Problem Description<br />
Symptoms<br />
The distributor reports high-load conditions.<br />
The following symptoms might help you identify this failure:<br />
• The event log indicates that the distributor reported high load conditions for a<br />
specific consistency group or groups.<br />
• A consistency group or groups start to initialize. This initialization can occur once or<br />
multiple times, depending on the circumstances.<br />
Actions to Resolve the Problem<br />
Perform the following actions to isolate and resolve the problem:<br />
• Use the Installation Manager Fibre Channel diagnostic tools or customer storage or<br />
SAN diagnostic tools to test the performance of the target-site storage LUNs.<br />
Appendix C explains how to run the Installation Manager diagnostics.<br />
• Analyze the WAN performance of the consistency group or groups, and ensure that<br />
loads are not too high for handling by the target-site storage devices.<br />
• High loads can occur naturally during traffic peaks. If the high-load events occur<br />
infrequently or can be associated with a temporal peak, consider this behavior as<br />
normal.<br />
• Diagnosis of these types of problems is usually specific to the environment. Collect<br />
RA logs and submit a service request to the Unisys Support Center if the high-load<br />
events occur frequently and you cannot resolve the problem through the actions<br />
previously listed. See Appendix A for information about collecting logs.<br />
Failover Time Lengthens<br />
Problem Description<br />
Symptoms<br />
Prior to changing the replication direction, the images must be distributed to the<br />
target-site volumes. The applications are not available during this process.<br />
Applications are offline for a lengthy period during changes to the replication direction.<br />
Actions to Resolve the Problem<br />
Refer to the Unisys SafeGuard Solutions Planning and Installation Guide for more<br />
information on pending timeouts.<br />
Appendix A<br />
Collecting and Using Logs<br />
Whenever a failure occurs, you might need to collect and analyze log information to<br />
assist in diagnosing the problem. This appendix presents information on the following<br />
tasks:<br />
• Collecting RA logs<br />
• Collecting server (host) logs<br />
• Analyzing RA log collection files<br />
• Analyzing server (host) logs<br />
• Analyzing intelligent fabric switch logs<br />
Collecting RA Logs<br />
When you collect logs from one RA, you automatically collect logs from all other RAs and<br />
from the servers. Occasionally, you might need to collect logs from the servers (hosts)<br />
manually. Refer to “Collecting Server (Host) Logs” later in this appendix for more<br />
information.<br />
Each time you complete a log collection, the files are saved for a maximum of 7 days.<br />
The length of time the files remain available depends on the size and number of log<br />
collections performed. To ensure that you have the log files that you need, download and<br />
store the files locally. Log files with dates older than 7 days from the current date are<br />
automatically removed.<br />
To collect the RA logs, perform the following procedures:<br />
1. Set the Automatic Host Info Collection option<br />
2. Test FTP connectivity<br />
3. Determine when the failure occurred<br />
4. Convert local time to GMT or UTC<br />
5. Collect logs from the RA<br />
Setting the Automatic Host Info Collection Option<br />
Perform the following steps to set the Automatic Host Info Collection Option:<br />
1. In the Management Console, on the System menu, select System Settings.<br />
The System Settings page appears.<br />
2. Choose the Automatic Host Info Collection option from Miscellaneous<br />
Settings.<br />
For more information, refer to the Unisys SafeGuard Solutions Planning and Installation<br />
Guide.<br />
Testing FTP Connectivity<br />
To test FTP connectivity, perform the following steps on the management PC. The<br />
information you provide depends on whether logs are being collected locally on an FTP<br />
server or sent to an FTP server at the Unisys Product Support site.<br />
1. To initiate an FTP session, type FTP at a command prompt. Press Enter.<br />
2. Type Open. Press Enter.<br />
3. At the To prompt, enter one of the following and then press Enter:<br />
• ftp.ess.unisys.com (the Unisys FTP address)<br />
• Your local FTP server IP address<br />
4. At the User prompt, enter one of the following and then press Enter:<br />
• FTP, if you specified the Unisys FTP address<br />
• Your local FTP user account<br />
5. At the Password prompt, enter one of the following and then press Enter:<br />
• Your Internet e-mail address if you specified the Unisys FTP address<br />
• Your local FTP account password<br />
6. Type bye and press Enter to log out.<br />
Determining When the Failure Occurred<br />
Perform the following steps to determine when the failure occurred:<br />
Note: If you cannot determine the failure time from the RA logs, use the Windows<br />
event logs on each server (host) to determine the failure time.<br />
1. Select the Logs tab from the navigation pane in the Management Console.<br />
A list of events is displayed. Each event entry includes a Level column that indicates<br />
the severity of the event.<br />
If necessary, click View and select Detailed.<br />
2. Scan the Description column to find the event for which you want to gather logs.<br />
3. Select the event and click the Filter Log option.<br />
The Filter Log dialog box appears.<br />
4. Select any option from the scope list (normal, detailed, or advanced) and the<br />
level list (info, warning, or error).<br />
5. Write down the timestamp that is displayed for the event, and note your local<br />
time zone. You must convert the displayed time to GMT, also called Coordinated<br />
Universal Time (UTC). This timestamp is used to calculate the start date and end<br />
time for log collection.<br />
6. Click OK.<br />
Converting Local Time to GMT or UTC<br />
Perform the following steps to convert the time in which the failure occurred to GMT or<br />
UTC. You need the time zone you wrote down in the preceding procedure.<br />
1. In Windows Control Panel, click Date and Time.<br />
2. Select the Time Zone tab.<br />
3. Look in the list for the GMT or UTC offset value corresponding to the time zone you<br />
wrote down in the procedure “Determining When the Failure Occurred.” The offset<br />
value represents the number of hours that the time zone is ahead or behind GMT or<br />
UTC.<br />
4. Add or subtract the GMT or UTC offset value from the local time.<br />
Example<br />
If the time zone is Pacific Standard Time, the GMT or UTC offset value is –8:00. If the<br />
time in which the failure occurred is 13:30, then GMT or UTC is 21:30.<br />
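The manual offset arithmetic above can be reproduced with Python's standard zoneinfo module. This is a sketch: America/Los_Angeles is the IANA name for the Pacific zone, and a January date is used so that the standard-time offset of –8:00 from the worked example applies.<br />

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Failure observed at 13:30 local time, Pacific Standard Time (UTC-8).
local = datetime(2008, 1, 15, 13, 30,
                 tzinfo=ZoneInfo("America/Los_Angeles"))
utc = local.astimezone(timezone.utc)

print(utc.strftime("%H:%M"))  # -> 21:30, matching the worked example
```

Letting the time-zone database do the conversion also handles daylight saving time automatically, which the fixed-offset lookup in Control Panel does not.<br />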
Collecting RA Logs<br />
Use the Installation Manager, which is a centralized collection tool, to collect logs from<br />
all accessible RAs, servers (hosts), and intelligent fabric switches.<br />
Before you begin log collection, determine the failure date and time. If you have SANTap<br />
switches and want to collect information from the switches, know the user name and<br />
password to access the switches.<br />
To collect RA logs, perform the following steps:<br />
1. Start the SSH client by performing the steps in “Using the SSH Client” in<br />
Appendix C. Use the site management IP address; log in with boxmgmt as the login<br />
user name and boxmgmt as the password.<br />
2. On the Main Menu, type 3 (Diagnostics) and press Enter.<br />
3. On the Diagnostics menu, type 4 (Collect system info) and press Enter.<br />
6872 5688–002 A–3
4. When prompted, provide the following information. Press Enter after each item.<br />
(The program displays date and time in GMT/UTC format.)<br />
a. Start date: This date specifies how far back the log collection is to start. Use<br />
the MM/DD/YYYY format. Do not accept the default date; the date should be at<br />
least 2 days earlier than the current date. The collection range must include the date<br />
and time at which the failure occurred.<br />
b. Start time: This time specifies the GMT/UTC in which log collection is to start.<br />
Use the HH:MM:SS format.<br />
c. End date: This date specifies when log collection is to end. Accept the default<br />
date, which is the current date.<br />
d. End time: This time specifies when log collection is to end. Accept the default<br />
time, which is the current time.<br />
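The prompts in step 4 expect GMT/UTC values in MM/DD/YYYY and HH:MM:SS formats, with a start at least 2 days back that still covers the failure time. A rough Python sketch of that window (the function name and the exact policy encoding are assumptions, not product behavior):

```python
from datetime import datetime, timedelta, timezone

def collection_window(failure_utc, days_back=2):
    """Return (start_date, start_time, end_date, end_time) strings in the
    formats the Collect system info prompts expect. The start is the
    earlier of the failure time and `days_back` days before now, so it
    both covers the failure and is at least 2 days back; the end is the
    current GMT/UTC date and time."""
    now = datetime.now(timezone.utc)
    start = min(failure_utc, now - timedelta(days=days_back))
    return (start.strftime("%m/%d/%Y"), start.strftime("%H:%M:%S"),
            now.strftime("%m/%d/%Y"), now.strftime("%H:%M:%S"))
```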
5. Type y to collect information from the other site.<br />
6. Type y or n, and press Enter when asked about sending the results to an FTP<br />
server.<br />
If you choose not to send the results to an FTP server, skip to step 8. The results are<br />
stored at the URL http://&lt;site management IP address&gt;/info/. You can access the<br />
collected results by logging in with webdownload as the login name and<br />
webdownload as the password. (If your system is set for secure Web<br />
transactions, then the URL begins with https://.)<br />
If you choose to send the results to an FTP server and the procedure has been<br />
performed previously, all of the information is filled in. If not, provide the following<br />
information for the management PC:<br />
a. When prompted for the FTP server, type one of the following and then press<br />
Enter.<br />
• The IP address of the Unisys Product Support FTP server, 192.61.61.78, or<br />
ftp.ess.unisys.com<br />
• The IP address of your local FTP server<br />
b. Press Enter to accept the default FTP port number, or type a different port<br />
number if you are using a management PC with a nonstandard port number.<br />
c. Type the local user account when prompted for the FTP user name. Press<br />
Enter.<br />
d. If you are using the Unisys FTP server, type incoming as the folder name of<br />
the FTP location in which to store the collected information. Press Enter.<br />
If you are using a local FTP server, press Enter for none.<br />
e. Type a name for the file on the FTP server in the following format:<br />
&lt;case number&gt;_&lt;company name&gt;.tar<br />
Example: 19557111_Company1.tar<br />
Note: If no name is specified, the name will be similar to the following:<br />
sysInfo-&lt;RA identifiers&gt;-hosts-from-&lt;RA identifiers&gt;-&lt;timestamp&gt;.tar<br />
Example: sysInfo-l1-l2-r1-r2-hosts-from-l1-r1-2006.08.17.16.28.31.tar<br />
f. Type the appropriate password. Press Enter.<br />
7. On the Collection mode menu, type 3 (RAs and hosts) and press Enter.<br />
Note: The “hosts” part of this menu selection (RAs and hosts) collects intelligent<br />
fabric switch information.<br />
8. Type y or n, and press Enter when asked if you have SANTap switches from which<br />
you want to collect information.<br />
If you do not have SANTap switches, go to step 10.<br />
If you want to collect information from SANTap switches, enter the user name and<br />
password to access the switch when prompted.<br />
9. Type n if prompted on whether to perform a full collection, unless otherwise<br />
instructed by a Unisys service representative.<br />
10. Type n when prompted to limit collection time.<br />
The collection program checks connectivity to all RAs and then displays a list of the<br />
available hosts and SANTap switches from which to collect information.<br />
11. Type All and press Enter.<br />
The Installation Manager shows the collection progress and reports that it<br />
successfully collected data. This collection might take several minutes. Once the<br />
data collection completes, a message indicates that the collected information is<br />
available at the FTP server you specified or at the URL (http://&lt;site management IP address&gt;/info/ or https://&lt;site management IP address&gt;/info/).<br />
12. Press Enter.<br />
13. On the Diagnostics dialog box, type Q and press Enter to exit the program.<br />
14. Type Y when prompted to quit and press Enter.<br />
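Step 6e names the upload file after a number and a company, and the collector otherwise builds a default sysInfo name from the RA lists and a timestamp. A hedged sketch of both naming patterns (function names are illustrative; the placeholder fields are inferred from the examples shown in step 6e):

```python
from datetime import datetime

def ftp_file_name(case_number, company):
    """Upload name in the style of the example 19557111_Company1.tar."""
    return f"{case_number}_{company}.tar"

def default_sysinfo_name(ras, from_ras, when):
    """Default name the collector builds when none is given, e.g.
    sysInfo-l1-l2-r1-r2-hosts-from-l1-r1-2006.08.17.16.28.31.tar"""
    stamp = when.strftime("%Y.%m.%d.%H.%M.%S")
    return "sysInfo-{}-hosts-from-{}-{}.tar".format(
        "-".join(ras), "-".join(from_ras), stamp)

print(ftp_file_name("19557111", "Company1"))  # 19557111_Company1.tar
```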
Verifying the Results<br />
• Ensure that “Failed for hosts” has no entries. The success or failure entries might be<br />
listed multiple times.<br />
For the collection to be successful for hosts and intelligent fabric switches, all entries<br />
must indicate “Succeeded for hosts.”<br />
For the collection to be successful for RAs, all entries must indicate “Collected data<br />
from &lt;RA name&gt;.”<br />
• There is a 20-minute timeout on the collection process for RAs. There is a 15-minute<br />
timeout on the collection process for each host.<br />
• If the collection from the remote site failed because of a WAN failure, run the<br />
process locally at the remote site.<br />
• If the connection with an RA is lost while the collection is in process, no<br />
information is collected. Run the process again.<br />
• If you transferred the data by FTP to a management PC, you can transfer the<br />
collected data to the Unisys Product Support Web site at your convenience.<br />
Otherwise, if you are connected to the Unisys Product Support Web site, the<br />
collected data is transferred automatically to this Web site.<br />
• If you use the Web interface, you must download the collected data to the<br />
management PC and then transfer the collected data to the Unisys Product Support<br />
Web site at your convenience.<br />
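The success checks above can be applied mechanically to captured collector output; this is a minimal sketch, not a supported tool:

```python
def collection_succeeded(output_lines):
    """True only if no 'Failed for hosts' entries appear and at least one
    success entry ('Succeeded for hosts' for hosts and intelligent fabric
    switches, 'Collected data from' for RAs) is present."""
    failed = [l for l in output_lines if "Failed for hosts" in l]
    ok = [l for l in output_lines
          if "Succeeded for hosts" in l or "Collected data from" in l]
    return not failed and bool(ok)
```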
Collecting Server (Host) Logs<br />
Use the following utilities to collect log information:<br />
• MPS Report Utility<br />
• Host information collector (HIC) utility<br />
Using the MPS Report Utility<br />
Use the Microsoft MPS Report Utility to collect detailed information about the current<br />
host configuration. You must have administrative rights to run this utility.<br />
Unisys uses the cluster (MSCS) version of this utility if that version is available from<br />
Microsoft. This version of the utility enables you to gather cluster information as well as<br />
the standard Microsoft information. If the server is not clustered, the utility still runs, but<br />
the cluster files in the output are blank.<br />
The average time for the utility to complete is between 5 and 20 minutes. It might take<br />
longer if you run the utility during peak production time.<br />
You can download the MPS Report Utility from the Unisys FTP server at the following<br />
location: (You are not prompted for a username or password.)<br />
ftp://ftp.ntsupport.unisys.com/outbound/MPS-REPORTS/<br />
Select one of the following directories, depending on your operating system<br />
environment:<br />
• 32-BIT<br />
• 64-BIT-IA64<br />
• 64-BIT-X64 (not a clustered version)<br />
Output Files<br />
Individual output files are created by using the following directory structure. Depending<br />
on the MPS Report version, the file name and directory name might vary.<br />
Directory: %systemroot%\MPSReports, typically C:\windows\MPSReports<br />
File name: %COMPUTERNAME%_MPSReports_xxx.CAB<br />
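Under the stated assumption about the directory layout (which, as noted, varies by MPS Report version), the expected output location can be composed as follows; `mps_report_path` is a hypothetical helper:

```python
import ntpath  # Windows-style path handling, usable on any platform

def mps_report_path(systemroot, computername):
    """Expected MPS Report output location, built from %systemroot% and
    %COMPUTERNAME%; the exact file and directory names vary by version."""
    return ntpath.join(systemroot, "MPSReports",
                       f"{computername}_MPSReports_xxx.CAB")
```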
Using the Host Information Collector (HIC) Utility<br />
Note: You can skip this procedure unless directed to complete it by the Unisys support<br />
personnel. Host log collection occurs automatically if the Automatic Host Info Collection<br />
option on the System menu of the management console is selected.<br />
Perform the following steps to collect log information from the hosts:<br />
1. At the command prompt on the host, change to the appropriate directory depending<br />
on your system:<br />
• For 32-bit and Intel Itanium 2-based systems, enter<br />
cd C:\Program Files\KDriver\hic<br />
• For x64 systems, enter<br />
cd C:\Program Files (x86)\KDriver\hic<br />
2. Type one of the following commands:<br />
• host_info_collector –n (noninteractive mode)<br />
• host_info_collector (interactive mode)<br />
If you choose the interactive mode command, provide the following site information:<br />
• Account ID: Click System Settings on the System menu of the<br />
Management Console, and click on Account Settings in the System<br />
Settings dialog box to access this information.<br />
• Account name: The name of the customer who purchased the Unisys SafeGuard<br />
30m solution.<br />
• Contact name: The name of the person responsible for collecting logs.<br />
• Contact mail: The mail account of the person responsible for collecting logs.<br />
Note: Ignore messages about utilities that are not installed.<br />
Verifying the Results<br />
• The process generates a single tar file of the host logs in the gzip format.<br />
• On 32-bit and Intel Itanium 2-based systems, the host logs are located in the<br />
following directory:<br />
C:\Program Files\KDriver\hic<br />
• On 64-bit systems, the host logs are located in the following directory:<br />
C:\Program Files (x86)\KDriver\hic<br />
Analyzing RA Log Collection Files<br />
If you use the Installation Manager RA log collection process, logs are collected from all<br />
accessible RAs and servers (hosts). When the tar file is extracted using this process, the<br />
information is gathered in a file on the FTP server that is, by default, named with the<br />
following format:<br />
sysInfo-&lt;RA identifiers&gt;-hosts-from-&lt;RA identifiers&gt;-&lt;timestamp&gt;.tar<br />
The &lt;timestamp&gt; is in the format yyyy.mm.dd.hh.mm.ss.<br />
An example of such a file name is<br />
sysInfo-lr-l2-r1-r2-hosts-from-l1-r1-2007.09.07.17.37.39.tar<br />
For each RA on which logs were collected, directories are created with the following<br />
formats:<br />
extracted.&lt;RA identifier&gt;.&lt;timestamp&gt;<br />
HLR-&lt;RA identifier&gt;-&lt;timestamp&gt;<br />
The &lt;timestamp&gt; is in the format yyyy.mm.dd.hh.mm.ss.<br />
An example of the name of an extracted directory for the RA is<br />
extracted.l1.2007.06.05.19.25.03 (from left RA 1 on June 5, 2007 at 19:25:03)<br />
In the RA identifier information, the l1 through l8 and r1 through r8 designations refer to RAs at the<br />
left and right sites. That is, site 1 RAs 1 through 8 are designated with l, and site 2 RAs 1<br />
through 8 are designated with r.<br />
If the RA collected a host log, the host information is collected in a directory beginning<br />
with HLR. For example, HLR-r1-2007.06.05.19.25.03 is the directory from right (site 2)<br />
RA1 on June 5, 2007 at 19:25:03.<br />
This directory is described in “Host Log Extraction Directory” later in this appendix.<br />
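Both directory-name formats can be parsed back into site, RA number, and collection time. A sketch, assuming the yyyy.mm.dd.hh.mm.ss timestamp format described above:

```python
import re
from datetime import datetime

DIR_RE = re.compile(
    r"^(?:extracted\.(?P<ra1>[lr]\d)\.|HLR-(?P<ra2>[lr]\d)-)"
    r"(?P<ts>\d{4}(?:\.\d{2}){5})$")

def parse_collection_dir(name):
    """Split an extracted.<RA>.<timestamp> or HLR-<RA>-<timestamp>
    directory name into (site, RA number, collection time)."""
    m = DIR_RE.match(name)
    if not m:
        return None
    ra = m.group("ra1") or m.group("ra2")
    site = 1 if ra[0] == "l" else 2   # l = left (site 1), r = right (site 2)
    when = datetime.strptime(m.group("ts"), "%Y.%m.%d.%H.%M.%S")
    return site, int(ra[1:]), when
```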
RA Log Extraction Directory<br />
Several files and directories are placed inside the extracted directory for the RA:<br />
• parameters: file containing the time frame for the collection<br />
• CLI: file containing the output collected by running CLI commands<br />
• aiw: file containing the internal log of the system, which is used by third-level<br />
support<br />
• aiq: file containing the internal log of the system, which is used by third-level support<br />
• cm_cli: internal file used by third-level support<br />
• init_hl: internal file used by third-level support<br />
• kbox_status: file used by third-level support<br />
• unfinished_init_hl: file used by third-level support<br />
• log: file containing the log of the collection process itself (used only by third-level<br />
support)<br />
• summary: file containing a summary of the main events from the internal logs of the<br />
system, which is used by third-level support<br />
• files: directory containing the original directories from the appliance<br />
• processes: directory containing some internal information from the system such as<br />
network configuration, processes state, and so forth<br />
• tmp: temporary directory<br />
Of the preceding items, you should understand the time frame of the collection from the<br />
parameters file and focus on the CLI file information. To determine whether the logs<br />
were correctly collected, check that the time frame of the collection correlates with the<br />
time of the issue, and verify that logs were collected from all nodes.<br />
Root-Level Files<br />
Several files are saved at the root level of the extracted directory: parameters file, CLI<br />
file, aiw file, aiq file, cm_cli file, init_hl file, kbox_status file, unfinished_init_hl file, log file,<br />
and summary file.<br />
Parameters File<br />
The parameters file contains the parameters given to the log gathering tool. Those<br />
parameters set the time frame for the log collection and are reflected in the parameters<br />
file. The format for the date is yyyy/mm/dd.<br />
The following example illustrates the contents of a parameters file:<br />
only_connectivity=”0”<br />
min=”2007/08/03 16:25:02”<br />
max=”2007/08/04 19:25:02”<br />
withCores=”1”<br />
The value ”0” for only_connectivity in the parameters file is a standard value for logs.<br />
The value “1” for withCores means that core logs (long) were collected for the time<br />
displayed.<br />
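The key="value" layout of the parameters file is simple to read programmatically; this sketch tolerates the curly quotes that sometimes appear in captured copies:

```python
def parse_parameters(text):
    """Parse the parameters file's key="value" lines into a dict,
    stripping straight or curly quotes around the values."""
    params = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            params[key.strip()] = value.strip().strip('"\u201c\u201d')
    return params
```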
CLI File<br />
The CLI file contains the output from executing various CLI commands. The commands<br />
issued to produce the information are saved to the CLI file in the tmp directory. Usually<br />
executing CLI commands in the process of collecting logs produces volumes of output.<br />
The types of information that are contained in the CLI file are as follows:<br />
• Account settings and license<br />
• Alert settings<br />
• Box states<br />
• Consistency groups, settings, and state<br />
• Consistency group statistics<br />
• Site name<br />
• Splitters<br />
• Management console logs for the period collected<br />
• Global accumulators (used by third-level support)<br />
• Various settings and system statistics<br />
• Save_settings command output<br />
• Splitters settings and state<br />
• Volumes settings and state<br />
• Available images<br />
The commands used to collect the output are listed in the runCLI file, described later in<br />
this appendix.<br />
Log File<br />
This file contains a report of the log collection that executed. It shows the start and stop<br />
time for the log.<br />
If there is a problem running CLI commands, information appears at the end of the file<br />
similar to the following:<br />
2007/06/05 19:25:40: info: running CLI commands<br />
2007/06/05 19:25:40: info: retrieving site name<br />
2007/06/05 19:25:40: info: site name is "Tunguska"<br />
2007/06/05 19:25:40: info: retrieving groups<br />
2007/06/05 19:25:40: error: while running CLI commands: when running CLI<br />
get_groups, RC=2<br />
2007/06/05 19:25:40: error: while running CLI commands: errors retrieving<br />
groups. skipping CLI commands.<br />
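Error lines in the collection log follow the timestamped `: error:` pattern shown above, so they can be filtered out for review; a minimal sketch:

```python
def log_errors(log_text):
    """Return the 'error:' lines from a collection log, so CLI failures
    such as a nonzero get_groups return code stand out."""
    return [line for line in log_text.splitlines()
            if ": error:" in line]
```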
Summary File<br />
The summary file is at the root of the extracted directory and contains a summary of the<br />
main events from the internal logs of the system. The format of this file is used by<br />
third-level support. However, you might find a summary of the errors helpful in some cases.<br />
Files Directory<br />
The files directory contains several subdirectories and files in those directories. The<br />
directories are etc, home, collector, rreasons, proc, and var.<br />
etc Directory<br />
This directory contains the rc.local file, which is used by third-level support.<br />
home Directory<br />
The home directory contains the kos directory containing several files and these<br />
subdirectories: cli, connectivity_tool, control, customer_monitor, hlr, install_logs, kbox,<br />
management, monitor, mpi__perf, old_config, replication, rmi, snmp, and utils.<br />
The home directory also contains the collector and rreasons directories.<br />
collector Directory<br />
This directory contains the connectivity_tool subdirectory, which lists results from<br />
connectivity tests to configured IP addresses on the local host loopback and the specific<br />
ports on the IP addresses that require testing for various protocols.<br />
rreasons Directory<br />
This directory contains the rreasons.log file, which lists the reasons for any reboots in<br />
the specified time frame.<br />
This file is used by third-level support but can be helpful in reviewing the reboot reasons,<br />
as shown in the following sample file:<br />
***************************************************************************<br />
***************************************************************************<br />
***************************************************************************<br />
***************************************************************************<br />
***************************************************************************<br />
=== LogLT STARTED HERE - 2007/07/05 22:40:40 ===<br />
***************************************************************************<br />
***************************************************************************<br />
***************************************************************************<br />
***************************************************************************<br />
***************************************************************************<br />
Couldn't open 'logger.ini' file, so assuming default 'all' with level<br />
DEBUG2007/07/05 22:40:40.834 - #2 - 1421 - RebootReasons:<br />
getRebootReasons2007/07/05 22:40:40.834 - #2 - 1421 - rreasons: Reboot Log:<br />
[Mon Apr 16 20:33:00 2007] : kernel watchdog 0 expired (time=66714<br />
lease=1390 last_tick=65233) 0=(1390,65233) 1=(30000,63214) 2=(1400,65233)<br />
Note: In the example, the “kernel watchdog 0 expired” message indicates a typical<br />
reboot that was not a result of an error.<br />
Other Directories<br />
The proc and var directories are also contained within the files directory and are used by<br />
third-level support.<br />
processes Directory<br />
The processes directory contains the InfoCollect, sbin, usr, home, and bin directories and<br />
several subdirectories.<br />
InfoCollect Directory<br />
Under the InfoCollect directory, the SanDiag.sh file contains the SAN diagnostic logs.<br />
The ConnectivityTest.sh file contains connection information. Connection errors in this<br />
log do not indicate an error in the configuration or function.<br />
sbin Directory<br />
This directory contains files with information pertaining to networking.<br />
• Ifconfig file: Lists configuration information as shown in the following example:<br />
eth0 Link encap:Ethernet HWaddr 00:14:22:11:DD:1B<br />
inet addr:10.10.21.51 Bcast:10.255.255.255 Mask:255.255.255.0<br />
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />
RX packets:286265797 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:228318046 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:100<br />
RX bytes:1377792659 (1.2 GiB) TX bytes:2189256742 (2.0 GiB)<br />
Base address:0xecc0 Memory:fe6e0000-fe700000<br />
eth1 Link encap:Ethernet HWaddr 00:14:22:11:DD:1C<br />
inet addr:172.16.21.51 Bcast:172.16.255.255 Mask:255.255.0.0<br />
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />
RX packets:13341097 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:12365085 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:5000<br />
RX bytes:4156827090 (3.8 GiB) TX bytes:4192345752 (3.9 GiB)<br />
Base address:0xdcc0 Memory:fe4e0000-fe500000<br />
lo Link encap:Local Loopback<br />
inet addr:127.0.0.1 Mask:255.0.0.0<br />
UP LOOPBACK RUNNING MTU:16436 Metric:1<br />
RX packets:11289452 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:11289452 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:0<br />
RX bytes:3269809825 (3.0 GiB) TX bytes:3269809825 (3.0 GiB)<br />
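When reviewing the Ifconfig file, the errors and dropped counters are the quickest health signal; a hedged sketch that tallies them per interface (an illustrative parser, not a supported tool):

```python
import re

def interface_error_counts(ifconfig_text):
    """Sum the errors/dropped counters per interface from classic
    ifconfig output; nonzero values are worth a closer look."""
    counts, iface = {}, None
    for line in ifconfig_text.splitlines():
        if line and not line[0].isspace():
            iface = line.split()[0]      # interface name starts the block
        for kind, val in re.findall(r"(errors|dropped):(\d+)", line):
            counts.setdefault(iface, {}).setdefault(kind, 0)
            counts[iface][kind] += int(val)
    return counts
```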
• route file: Lists other pieces of routing information, as shown in the following<br />
example:<br />
Kernel IP routing table<br />
Destination Gateway Genmask Flags Metric Ref Use Iface<br />
10.10.21.0 * 255.255.255.0 U 0 0 0 eth0<br />
172.16.0.0 * 255.255.0.0 U 0 0 0 eth1<br />
usr Directory<br />
The usr directory contains two subdirectories: bin and sbin.<br />
The bin subdirectory contains the kps.pl file.<br />
The following is an example of the kps.pl file for an attached RA:<br />
Uptime: 7:25pm up 18:24, 1 user, load average: 5.99, 4.38, 2.05<br />
Processes:<br />
control_process - UP<br />
control_loop.tcsh - UP<br />
replication - UP<br />
mgmt_loop.tcsh - UP<br />
management_server - UP<br />
cli - down<br />
rmi_loop.tcsh - UP<br />
rmi - UP<br />
monitor_loop.tcsh - UP<br />
load_monitor.pl - UP<br />
runall - down<br />
hlr_kbox - UP<br />
rcm_run_loop.tcsh - UP<br />
customer_monitor.pl - UP<br />
Modules:<br />
st - UP<br />
sll - UP<br />
var_link - UP<br />
kaio_mod-2.4.32-k22 - UP<br />
The following is an example of the kps.pl file for a detached RA:<br />
Uptime: 7:25pm up 18:24, 1 user, load average: 5.99, 4.38, 2.05<br />
Processes:<br />
control_process - down<br />
control_loop.tcsh - down<br />
replication - down<br />
mgmt_loop.tcsh - down<br />
management_server - down<br />
cli - down<br />
rmi_loop.tcsh - down<br />
rmi - down<br />
monitor_loop.tcsh - down<br />
load_monitor.pl - down<br />
runall - down<br />
hlr_kbox - UP<br />
rcm_run_loop.tcsh - down<br />
customer_monitor.pl - down<br />
Modules:<br />
st - UP<br />
sll - UP<br />
var_link - UP<br />
kaio_mod-2.4.32-k22 - UP<br />
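Comparing the two kps.pl samples, cli and runall are down in both, while the remaining processes and modules are UP only on the attached RA. A heuristic sketch built on that observation (an assumption drawn from the samples, not a supported check):

```python
def ra_is_attached(kps_text):
    """Classify an RA as attached when every listed process and module
    is UP, excluding cli and runall, which are down even in the
    attached-RA sample."""
    ignore = {"cli", "runall"}
    state = {}
    for line in kps_text.splitlines():
        if " - " in line:
            name, _, status = line.rpartition(" - ")
            state[name.strip()] = status.strip().upper()
    return all(v == "UP" for k, v in state.items() if k not in ignore)
```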
The sbin subdirectory contains the biosdecode and dmidecode files. The biosdecode file<br />
provides hardware-specific RA BIOS information and the pointers to locations where this<br />
information is stored. The dmidecode file provides handle and other information for<br />
components capable of passing this information to a Desktop Management Interface<br />
(DMI) agent.<br />
home Directory<br />
The home directory contains the kos subdirectory, which contains other subdirectories<br />
that yield the get_users_lock_state.tcsh file. This file contains all the users on the RA.<br />
bin Directory<br />
The bin directory contains the df-h and lspci files. The df-h file contains directory size and<br />
disk size usage statistics for the RA hard disk drive. The lspci file contains PCI bridge bus<br />
numbers, revisions, and OEM identification strings for inbuilt devices in the RA.<br />
tmp Directory<br />
The tmp directory contains the runCLI file listing the commands that generated the CLI<br />
file. It also contains the getGroups file, which is a temporary file to gather the list of<br />
consistency groups.<br />
runCLI File<br />
The following is an example of the runCLI file saved in the tmp directory that shows the<br />
CLI commands executed:<br />
• get_logs from=&lt;start time&gt; to=&lt;end time&gt; –n<br />
The time and date are specified as day, month, year as follows:<br />
get_logs from="22:03 03/08/2007" to="17:03 04/08/2007" –n<br />
• config_io_throttling –n<br />
• config_multipath_monitoring –n<br />
• get_account_settings –n<br />
• get_alert_settings –n<br />
• get_box_states –n<br />
• get_global_policy –n<br />
• get_groups –n<br />
• get_groups_sets –n<br />
• get_group_settings –n<br />
• get_group_state –n<br />
• get_group_statistics –n<br />
• get_id_names –n<br />
• get_initiator_bindings –n<br />
• get_pairs –n<br />
• get_raw_stats –n<br />
• get_snmp_settings –n<br />
• get_syslog_settings –n<br />
• get_system_status –n<br />
• get_system_settings –n<br />
• get_system_statistics –n<br />
• get_tweak_params –n<br />
• get_version –n<br />
• get_virtual_targets –n<br />
• save_settings –n<br />
• get_splitter_settings site="&lt;site name&gt;"<br />
• get_splitter_states site="&lt;site name&gt;"<br />
• get_san_splitter_view site="&lt;site name&gt;"<br />
• get_san_volumes site="&lt;site name&gt;"<br />
• get_santap_view site="&lt;site name&gt;"<br />
• get_volume_settings site="&lt;site name&gt;"<br />
• get_volume_state site="&lt;site name&gt;"<br />
• get_images group="&lt;group name&gt;" (This command is repeated for each group.)<br />
getGroups File<br />
This internal file is used to generate the runCLI file.<br />
Host Log Extraction Directory<br />
When the RA collects a host log, the host information is collected in a directory named<br />
with the HLR-&lt;RA identifier&gt;-&lt;timestamp&gt; format.<br />
Such a directory contains a tar.gz file for servers with a name similar in format to the<br />
following:<br />
HLR-r1_USMVEAST2_1157647546524147.tar.gz<br />
When you extract a tar.gz file, you can choose to decompress the ZIP file<br />
(to_transfer.tar) to a temp folder and open it, or you can choose to extract the files to a<br />
directory.<br />
When the file is for intelligent fabric switches, the file name does not have the .gz<br />
extension.<br />
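The presence or absence of the .gz extension distinguishes server host logs from intelligent fabric switch files; a trivial sketch:

```python
def collected_from(file_name):
    """Server host logs arrive as .tar.gz; intelligent fabric switch
    files have the same naming but lack the .gz extension."""
    if file_name.endswith(".tar.gz"):
        return "server"
    if file_name.endswith(".tar"):
        return "switch"
    return "unknown"
```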
Analyzing Server (Host) Logs<br />
The output file from host collection is named<br />
Unisys_host_info_&lt;host name&gt;_&lt;timestamp&gt;.tar.gz<br />
This file contains a folder named “collected_items,” which contains the following files<br />
and directories:<br />
• Cluster_log: a folder containing the cluster.log file generated by MSCS<br />
• Hic_logs: a folder containing logs used by third-level support<br />
• Host_logs: a folder containing logs used by third-level support<br />
• Msinfo32: information from the Msinfo32.exe file<br />
• Registry.dump: the registry dump for this server<br />
• Tweak: the internal RA parameters on this server<br />
• Watchdog log: log created by the KDriverWatchDog service<br />
• Commands: a file containing output from commands executed on this server,<br />
including<br />
− A view of the LUNs recognized by this server<br />
− Some internal RA structures<br />
− Output from the dumpcfg.exe file<br />
− Windows event logs for system, security, and applications<br />
Analyzing Intelligent Fabric Switch Logs<br />
The output file from collecting information from intelligent fabric switches is named with<br />
the following format:<br />
HLR-&lt;RA identifier&gt;_&lt;switch type&gt;_&lt;identifier&gt;.tar<br />
The following name is an example of this format:<br />
HLR-l1_CISCO_232c000dec1a7a02.tar<br />
Once you extract the .tar file, some files are listed with formats similar to the following:<br />
CVT_.tar_AT__M3_tech<br />
CVT_.tar_AT__M3_isapi_tech<br />
CVT_.tar_AT__M3_santap_tech<br />
Appendix B<br />
Running Replication Appliance (RA)<br />
Diagnostics<br />
This appendix<br />
• Explains how to clear the system event log (SEL).<br />
• Describes how to run hardware diagnostics for the RA.<br />
• Lists the LCD status messages shown on the RA.<br />
Clearing the System Event Log (SEL)<br />
Before you run the RA diagnostics, you need to clear the SEL to prevent errors from<br />
being generated during the diagnostics run.<br />
1. Insert the bootable Replication Appliance (RA) Diagnostic CD-ROM in the CD/DVD<br />
drive.<br />
2. Press Ctrl+Alt+Delete to reboot the RA.<br />
The RA boots from the diagnostic CD and displays the event log menu.<br />
3. Select Show all system event log records using the arrow keys, then press<br />
Enter.<br />
This action results in an SEL summary and indicates whether the SEL contains<br />
errors. If there are errors, an error description is given.<br />
Note: You cannot scroll up or down in this screen.<br />
A clear SEL without errors has “IPMI SEL contains 1 records” displayed in the<br />
summary. Anything greater than one record indicates that errors are present.<br />
Note: The preceding step did not clear the SEL; ignore the statement “Log area<br />
Reset/Cleared.”<br />
4. Press any key to return to the main boot menu.<br />
5. Select Clear System Event Log using the arrow keys, and press Enter to ensure<br />
that the SEL is cleared of all error entries.<br />
Note: Depending on whether there are error entries, this clearing action could take<br />
up to 1 minute to complete.<br />
6. Press any key again to return to the main boot menu.<br />
7. Select Show all system event log records using the arrow keys and press<br />
Enter. Confirm that “IPMI SEL contains 1 records” is shown.<br />
8. Press any key to return to the main boot menu.<br />
Note: If you accidentally press Escape and leave the main boot menu, a Diag<br />
prompt is displayed. Type menu to return to the main boot menu.<br />
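The SEL summary from steps 3 and 7 can be checked mechanically: exactly one record means the log is clear, and anything greater indicates errors. A sketch (the wording matched here is taken from the summary quoted above):

```python
import re

def sel_is_clear(summary_line):
    """A clear SEL reports exactly one record ('IPMI SEL contains 1
    records'); a higher count indicates error entries are present."""
    m = re.search(r"IPMI SEL contains (\d+) records", summary_line)
    return m is not None and int(m.group(1)) == 1
```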
Running Hardware Diagnostics<br />
Running the hardware diagnostics for the RA includes completing the Custom Test and<br />
Express Test diagnostics.<br />
Follow these steps to run the hardware diagnostics for the RA:<br />
1. At the main boot menu, use the arrow keys to select Run Diags …; then press<br />
Enter.<br />
2. On the Customer Diagnostic Menu, press 2 to select Run ddgui graphics-based<br />
diagnostic.<br />
The system diagnostic files begin loading and a message is displayed giving<br />
information about the software and showing “initializing…”<br />
Once the diagnostics are loaded and ready to be executed, the Main Menu is<br />
displayed.<br />
Custom Test<br />
1. On the Main Menu, select Custom Test using the arrow keys; then press Enter.<br />
The Custom Test dialog box is displayed.<br />
2. Expand the PCI Devices folder to view the PCI devices installed in the system<br />
including those devices that are “on-board.”<br />
3. Select the PCI Devices folder; then press Enter.<br />
This action causes each PCI device to be interrogated in turn and a message is<br />
displayed for each one. Verify that the correct number of QLogic adapters is shown.<br />
4. Press OK after each message is displayed until all PCI devices have been recognized<br />
and passed. The message “All tests passed.” is displayed.<br />
Note: If any devices fail this test, investigate and rectify the problem; then clear the<br />
SEL as explained in “Clearing the System Event Log (SEL).”<br />
5. Close the Custom Test dialog box and return to the Main Menu.<br />
Express Test<br />
1. On the Main Menu, select Express Test using the arrow keys; then press Enter.<br />
A warning is displayed advising that media must be installed on all drives or else<br />
some tests might fail.<br />
2. If a diskette drive is installed in the system, insert a blank, formatted diskette and<br />
then click OK to start the test. If no diskette drive is installed, just click OK.<br />
During testing, a status screen is displayed.<br />
If the diagnostic test run is successful, the message “All tests passed.” appears.<br />
Notes:<br />
• During the video portion of the testing, the screen typically flickers and goes<br />
blank.<br />
• If any errors occur, investigate and resolve the problem, and then rerun the<br />
diagnostic tests. Before you rerun the tests, be sure to clear the SEL as<br />
explained in “Clearing the System Event Log (SEL).”<br />
3. Click OK to exit the diagnostic tests.<br />
The Main Menu is then displayed.<br />
4. Select Exit using the arrow keys; then press Enter.<br />
The following message is displayed:<br />
Displaying the end of test result.log ddgui.txt. Strike a Key when ready.<br />
5. Press any key to display the diagnostic test summary screen.<br />
6. Verify that no errors are listed. Scroll up and down to see the different portions of<br />
the output.<br />
Note: If any errors are listed, investigate and resolve the problem; then rerun the<br />
diagnostic tests. Before you rerun the tests, be sure to clear the SEL as explained in<br />
“Clearing the System Event Log (SEL).”<br />
7. Press Escape to return to the original Customer Diagnostic Menu.<br />
8. Press 4 to quit and return to the main boot menu.<br />
9. Select Exit; then press Enter.<br />
10. Remove all media from the diskette and CD/DVD drives.<br />
LCD Status Messages<br />
The LCD on the RA displays status messages. Table B–1 lists the LCD status messages<br />
that can occur and the probable cause for each message. The LCD messages refer to<br />
events recorded in the SEL.<br />
Note: For information about corrective actions for the messages listed in Table B–1,<br />
refer to the documentation supplied with the system.<br />
Table B–1. LCD Status Messages<br />
Line 1 Message Line 2 Message Cause<br />
SYSTEM ID SYSTEM NAME The system ID is a unique name, 5 characters or less,<br />
defined by the user.<br />
The system name is a unique name, 16 characters or<br />
less, defined by the user.<br />
The system ID and name display under the following<br />
conditions:<br />
• The system is powered on.<br />
• The power is off and active POST errors are<br />
displayed.<br />
E000 OVRFLW CHECK LOG LCD overflow message. A maximum of three error<br />
messages can display sequentially on the LCD. The<br />
fourth message is displayed as the standard overflow<br />
message.<br />
E0119 TEMP AMBIENT Ambient system temperature is out of the acceptable<br />
range.<br />
E0119 TEMP BP The backplane board is out of the acceptable temperature<br />
range.<br />
E0119 TEMP CPU n The specified microprocessor is out of the acceptable<br />
temperature range.<br />
E0119 TEMP SYSTEM The system board is out of the acceptable temperature<br />
range.<br />
E0212 VOLT 3.3 The system power supply is out of the acceptable voltage<br />
range; the power supply is faulty or improperly installed.<br />
E0212 VOLT 5 The system power supply is out of the acceptable voltage<br />
range; the power supply is faulty or improperly installed.<br />
E0212 VOLT 12 The system power supply is out of the acceptable voltage<br />
range; the power supply is faulty or improperly installed.<br />
E0212 VOLT BATT Faulty battery; faulty system board.<br />
E0212 VOLT BP 12 The backplane board is out of the acceptable voltage<br />
range.<br />
E0212 VOLT BP 3.3 The backplane board is out of the acceptable voltage<br />
range.<br />
E0212 VOLT BP 5 The backplane board is out of the acceptable voltage<br />
range.<br />
E0212 VOLT CPU VRM The microprocessor voltage regulator module (VRM)<br />
voltage is out of the acceptable range. The<br />
microprocessor VRM is faulty or improperly installed. The<br />
system board is faulty.<br />
E0212 VOLT NIC 1.8V Integrated NIC voltage is out of the acceptable range; the<br />
power supply is faulty or improperly installed. The system<br />
board is faulty.<br />
E0212 VOLT NIC 2.5V Integrated NIC voltage is out of the acceptable range. The<br />
power supply is faulty or improperly installed. The system<br />
board is faulty.<br />
E0212 VOLT PLANAR REG The system board is out of the acceptable voltage range.<br />
The system board is faulty.<br />
E0276 CPU VRM n The specified microprocessor VRM is faulty,<br />
unsupported, improperly installed, or missing.<br />
E0276 MISMATCH VRM n The specified microprocessor VRM is faulty,<br />
unsupported, improperly installed, or missing.<br />
E0280 MISSING VRM n The specified microprocessor VRM is faulty,<br />
unsupported, improperly installed, or missing.<br />
E0319 PCI OVER CURRENT The expansion card is faulty or improperly installed.<br />
E0412 RPM FAN n The specified cooling fan is faulty, improperly installed, or<br />
missing.<br />
E0780 MISSING CPU 1 Microprocessor is not installed in socket PROC_1.<br />
E07F0 CPU IERR The microprocessor is faulty or improperly installed.<br />
E07F1 TEMP CPU n HOT The specified microprocessor is out of the acceptable<br />
temperature range and has halted operation.<br />
E07F4 POST CACHE The microprocessor is faulty or improperly installed.<br />
E07F4 POST CPU REG The microprocessor is faulty or improperly installed.<br />
E07FA TEMP CPU n THERM The specified microprocessor is out of the acceptable<br />
temperature range and is operating at a reduced speed or<br />
frequency.<br />
E0876 POWER PS n No power is available from the specified power supply.<br />
The specified power supply is improperly installed or<br />
faulty.<br />
E0880 INSUFFICIENT PS Insufficient power is being supplied to the system. The<br />
power supplies are improperly installed, faulty, or<br />
missing.<br />
E0CB2 MEM SPARE ROW The correctable errors threshold was met in a memory<br />
bank; the errors were remapped to the spare row.<br />
E0CF1 MBE DIMM Bank n The memory modules installed in the specified bank are<br />
not the same type and size. The memory module or<br />
modules are faulty.<br />
E0CF1 POST MEM 64K A parity failure occurred in the first 64 KB of main<br />
memory.<br />
E0CF1 POST NO MEMORY The main-memory refresh verification failed.<br />
E0CF5 LOG DISABLE SBE Multiple single-bit errors occurred on a single memory<br />
module.<br />
E0D76 DRIVE FAIL A hard drive or RAID controller is faulty or improperly<br />
installed.<br />
E0F04 POST DMA INIT Direct memory access (DMA) initialization failed. DMA<br />
page register write/read operation failed.<br />
E0F04 POST MEM RFSH The main-memory refresh verification failed.<br />
E0F04 POST SHADOW BIOS-shadowing failed.<br />
E0F04 POST SHD TEST The shutdown test failed.<br />
E0F0B POST ROM CHKSUM The expansion card is faulty or improperly installed.<br />
E0F0C VID MATCH CPU n The specified microprocessor is faulty, unsupported,<br />
improperly installed, or missing.<br />
E10F3 LOG DISABLE BIOS The BIOS disabled logging errors.<br />
E13F2 IO CHANNEL CHECK The expansion card is faulty or improperly installed. The<br />
system board is faulty.<br />
E13F4 PCI PARITY<br />
E13F5 PCI SYSTEM<br />
E13F8 CPU BUS INIT The microprocessor or system board is faulty or<br />
improperly installed.<br />
E13F8 CPU MCKERR Machine check error. The microprocessor or system<br />
board is faulty or improperly installed.<br />
E13F8 HOST TO PCI BUS<br />
E13F8 MEM CONTROLLER A memory module or the system board is faulty or<br />
improperly installed.<br />
E20F1 OS HANG The operating system watchdog timer has timed out.<br />
EFFF1 POST ERROR A BIOS error occurred.<br />
EFFF2 BP ERROR The backplane board is faulty or improperly installed.<br />
Appendix C<br />
Running Installation Manager<br />
Diagnostics<br />
To determine the causes of various problems and to perform many procedures, you<br />
must access the Installation Manager functions and diagnostics capabilities.<br />
Using the SSH Client<br />
Throughout the procedures in this guide you might need to use the secure shell (SSH)<br />
client. Perform the following steps whenever you are asked to use the SSH client or to<br />
open a PuTTY session:<br />
1. From Windows Explorer, double-click the PuTTY.exe file.<br />
2. When prompted, enter the applicable IP address.<br />
3. Select SSH for the protocol and keep the default port settings (port 22).<br />
4. Click Open.<br />
5. If prompted by a PuTTY security dialog box, click Yes.<br />
6. When prompted to log in, type the identified user name and then press Enter.<br />
7. When prompted for a password, type the identified password and then press Enter.<br />
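Before launching PuTTY, it can save time to confirm that the RA is actually listening on TCP port 22. The following Python sketch is illustrative only (it is not part of the SafeGuard tooling), and the RA address shown in the usage comment is a placeholder:

```python
# Illustrative pre-check (not part of the SafeGuard tooling): verify that the
# RA is reachable on TCP port 22 before opening the PuTTY/SSH session.
import socket

def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# Usage (10.10.17.61 is a placeholder RA address):
#   ssh_port_open("10.10.17.61")  ->  True when SSH is reachable
```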
Running Diagnostics<br />
When you open the PuTTY session and log in as boxmgmt/boxmgmt, the Main Menu of<br />
Installation Manager is displayed. This menu offers the following six choices: Install,<br />
Setup, Diagnostics, Cluster Operations, Reboot/Shutdown, and Quit.<br />
For more information about these capabilities, see the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong><br />
Replication Appliance Installation <strong>Guide</strong>.<br />
To access the various diagnostic capabilities of Installation Manager, perform the<br />
following steps:<br />
1. Open a PuTTY session using the IP address of the RA, and log in as<br />
boxmgmt/boxmgmt.<br />
The Main Menu is displayed, as follows:<br />
** Main Menu **<br />
[1] Install<br />
[2] Setup<br />
[3] Diagnostics<br />
[4] Cluster Operations<br />
[5] Reboot / Shutdown<br />
[Q] Quit<br />
2. Type 3 (Diagnostics) and press Enter.<br />
The Diagnostics menu is displayed as follows:<br />
** Diagnostics **<br />
[1] IP diagnostics<br />
[2] Fibre Channel diagnostics<br />
[3] Synchronization diagnostics<br />
[4] Collect system info<br />
[B] Back<br />
[Q] Quit<br />
The four diagnostics capabilities are explained in the following topics.<br />
IP Diagnostics<br />
Use the IP diagnostics when you need to check port connectivity, view IP addresses,<br />
test throughput, and review other related information.<br />
On the Diagnostics menu, type 1 (IP diagnostics) and press Enter to access the IP<br />
Diagnostics menu as shown:<br />
** IP Diagnostics **<br />
[1] Site connectivity tests<br />
[2] View IP details<br />
[3] View routing table<br />
[4] Test throughput<br />
[5] Port diagnostics<br />
[6] System connectivity<br />
[B] Back<br />
[Q] Quit<br />
Site Connectivity Tests<br />
On the IP Diagnostics menu, type 1 (Site connectivity tests) and press Enter to<br />
access the Site connectivity tests menu.<br />
Note: You must apply settings to the RA before you can test options 1 through 4 in the<br />
following list.<br />
The options to test are as follows:<br />
** Select the target to which to test connectivity: **<br />
[1] Gateway<br />
[2] Primary DNS server<br />
[3] Secondary DNS server<br />
[4] NTP Server<br />
[5] Other host<br />
[B] Back<br />
[Q] Quit<br />
Tests for options 1 through 4 return a result of success or failure.<br />
For option 5, you must specify the target IP address that you want to test. The test<br />
returns the relative success of 0 through 100 percent over both the management and<br />
WAN interfaces.<br />
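The 0 through 100 percent figure can be thought of as the fraction of probes that succeed on a given interface. A minimal sketch of that computation follows; it is illustrative only, since the actual probe mechanism is internal to Installation Manager:

```python
# Illustrative sketch: derive a 0-100% "relative success" figure from a
# series of individual connectivity probes on one interface.
def success_percent(results):
    """results: list of booleans, one per probe attempt."""
    if not results:
        return 0
    return round(100 * sum(results) / len(results))

# Example: 8 of 10 probes answered on the WAN interface.
print(success_percent([True] * 8 + [False] * 2))  # prints 80
```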
View IP Details<br />
On the IP Diagnostics menu, type 2 (View IP details) and press Enter to run an<br />
ifconfig process. The displayed results of the process are similar to the following:<br />
eth0 Link encap:Ethernet HWaddr 00:0F:1F:6A:03:E7<br />
inet addr:10.10.17.61 Bcast:10.10.17.255 Mask:255.255.255.0<br />
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />
RX packets:12751337 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:13628048 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:1000<br />
RX bytes:1084700432 (1034.4 Mb) TX bytes:2661155798 (2537.8 Mb)<br />
Base address:0xecc0 Memory:fe6e0000-fe700000<br />
eth1 Link encap:Ethernet HWaddr 00:0F:1F:6A:03:E8<br />
inet addr:172.16.17.61 Bcast:172.16.255.255 Mask:255.255.0.0<br />
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />
RX packets:10519453 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:10244866 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:5000<br />
RX bytes:2846677622 (2714.8 Mb) TX bytes:2702094827 (2576.9 Mb)<br />
Base address:0xdcc0 Memory:fe4e0000-fe500000<br />
eth1:1 Link encap:Ethernet HWaddr 00:0F:1F:6A:03:E8<br />
inet addr:172.16.17.60 Bcast:172.16.255.255 Mask:255.255.0.0<br />
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />
Base address:0xdcc0 Memory:fe4e0000-fe500000<br />
lo Link encap:Local Loopback<br />
inet addr:127.0.0.1 Mask:255.0.0.0<br />
UP LOOPBACK RUNNING MTU:16436 Metric:1<br />
RX packets:3853904 errors:0 dropped:0 overruns:0 frame:0<br />
TX packets:3853904 errors:0 dropped:0 overruns:0 carrier:0<br />
collisions:0 txqueuelen:0<br />
RX bytes:3312865098 (3159.3 Mb) TX bytes:3312865098 (3159.3 Mb)<br />
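When reviewing output like the listing above, the items most worth checking are the inet addr of each interface and the RX/TX error counters. The following hypothetical helper (its name and structure are illustrative, not part of the product) pulls those fields out of ifconfig-style text:

```python
import re

# Hypothetical helper: extract each interface's IP address and total RX/TX
# error count from ifconfig-style output.
def parse_ifconfig(text):
    interfaces = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"^(\S+)\s+Link encap", line)
        if m:  # a new interface stanza starts in column 1
            current = m.group(1)
            interfaces[current] = {"addr": None, "errors": 0}
            continue
        if current is None:
            continue
        m = re.search(r"inet addr:(\S+)", line)
        if m:
            interfaces[current]["addr"] = m.group(1)
        for m in re.finditer(r"errors:(\d+)", line):
            interfaces[current]["errors"] += int(m.group(1))
    return interfaces
```

A healthy interface shows errors totalling 0, as in the sample output above.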
View Routing Table<br />
On the IP Diagnostics menu, type 3 (View routing table) and press Enter to display<br />
the routing table.<br />
Test Throughput<br />
On the IP Diagnostics menu, type 4 (Test throughput) and press Enter to use iperf to<br />
test throughput to another RA.<br />
Once you select this option, Installation Manager guides you through the following<br />
dialog. The bold text shows sample entries.<br />
Note: The Fibre Channel interface option appears only if the Installation Manager<br />
diagnostic capability was preconfigured to run on Fibre Channel. In that case, the<br />
option appears as [2] in the menu list.<br />
Enter the IP address to which to test throughput:<br />
>>192.168.1.86<br />
Select the interface from which to test throughput:<br />
** Interface **<br />
[1] Management interface<br />
[2] Fibre Channel Interface<br />
[3] WAN interface<br />
>>3<br />
Enter the desired number of concurrent streams:<br />
>>2<br />
Enter the test duration (seconds):<br />
>>10<br />
If the test is successful, the system responds with a standard iperf output that<br />
resembles the following:<br />
Checking connectivity to 10.10.17.51<br />
Connection to 10.10.17.51 established.<br />
Client Connecting to 10.10.17.51, TCP port 5001<br />
Binding to local address 10.10.17.61<br />
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)<br />
[ 6] local 10.10.17.61 port 35222 connected with 10.10.17.51 port 5001<br />
[ 5] local 10.10.17.61 port 35222 connected with 10.10.17.51 port 5001<br />
[ ID] Interval Transfer Bandwidth<br />
[ 5] 0.0-10.6 sec 59.1 Mbytes 46.9 Mbits/sec<br />
[ 6] 0.0-10.6 sec 59.1 Mbytes 46.9 Mbits/sec<br />
[SUM] 0.0-10.6 sec 118 Mbytes 93.9 Mbits/sec<br />
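The figure of interest in this output is usually the aggregate bandwidth on the [SUM] line. A hypothetical helper (illustrative only) that extracts it from iperf output of the form shown above:

```python
import re

# Hypothetical helper: extract the aggregate bandwidth (in Mbits/sec) from
# the [SUM] line of iperf output.
def sum_bandwidth_mbits(iperf_output):
    m = re.search(r"\[SUM\].*?([\d.]+)\s*Mbits/sec", iperf_output)
    return float(m.group(1)) if m else None

line = "[SUM]  0.0-10.6 sec  118 Mbytes  93.9 Mbits/sec"
print(sum_bandwidth_mbits(line))  # prints 93.9
```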
Port Diagnostics<br />
On the IP Diagnostics menu, type 5 (Port diagnostics) and press Enter to check that<br />
none of the ports used by the RAs are blocked (for example, by a firewall). You must test<br />
each RA individually—that is, designate each RA, in turn, to be the server.<br />
Once you select the option, Installation Manager guides you through one of the following<br />
dialogs, depending on whether you designate the RA to be the server or the client. In the<br />
dialogs, sample entries are bold.<br />
For the server, the dialog is as follows:<br />
In which mode do you want to run ports diagnostics?<br />
** **<br />
[1] Server<br />
[2] Client<br />
>>1<br />
Note: Before you select the server designation for the RA, detach the RA that you<br />
intend to specify as the server.<br />
After you specify the RA that you want to test as the server, move to the RA from which<br />
you wish to run the port diagnostics tests. Designate that RA as a client, as noted in the<br />
following dialog:<br />
** **<br />
[1] Server<br />
[2] Client<br />
>>2<br />
Did you already designate another RA to be the server (y/n)<br />
>>y<br />
Enter the IP address to test:<br />
>>10.10.17.51<br />
If the test is successful, the system responds with output that resembles the following:<br />
Port No. TCP Connection<br />
5030 OK<br />
5040 OK<br />
4401 OK<br />
1099 OK<br />
5060 Blocked<br />
4405 OK<br />
5001 OK<br />
5010 OK<br />
5020 OK<br />
Correct the problem on any port that returns a Blocked response.<br />
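When many ports are tested, it can help to reduce the report to just the failures. The following hypothetical helper (illustrative, not part of the product) scans port-diagnostics output like the table above and lists the ports that came back Blocked:

```python
# Hypothetical helper: return the port numbers that report "Blocked" in
# port-diagnostics output of the form "PORT  STATUS" per line.
def blocked_ports(report):
    blocked = []
    for line in report.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1] == "Blocked":
            blocked.append(int(parts[0]))
    return blocked

report = """Port No. TCP Connection
5030 OK
5060 Blocked
4405 OK"""
print(blocked_ports(report))  # prints [5060]
```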
System Connectivity<br />
Use the system connectivity options to test connections and generate reports on<br />
connections between RAs anywhere in the system. You can perform the tests during<br />
installation and during normal operation. The tests performed to verify connections are<br />
as follows:<br />
• Ping<br />
• TCP (to ports and IP addresses, to the specific processes of the RA, and using SSH)<br />
• UDP (general and to RA processes)<br />
• RA internal protocols<br />
On the IP Diagnostics menu, type 6 (System connectivity) and press Enter to access<br />
the System Connectivity menu as follows:<br />
** System Connectivity **<br />
[1] System connectivity test<br />
[2] Advanced connectivity test<br />
[3] Show all results from last connectivity check<br />
[B] Back<br />
[Q] Quit<br />
When you select System connectivity test and Full mesh network check, the<br />
test reports errors in communications from any RA to any other RA in the system.<br />
When you select System connectivity test and Check from local RA to all<br />
other boxes, the test reports errors from the local RA to any other RA in the system.<br />
When you select Advanced connectivity test, the test reports on the connection<br />
from an IP address that you specified on the local appliance to an IP address and port<br />
that you specified on an RA anywhere in the system. Use this option to diagnose a<br />
problem specific to a local IP address or port.<br />
When you select Show all results from last connectivity check, the test reports<br />
all results from the previous tests—not only the errors, but also the tests that completed<br />
successfully.<br />
You might receive one of the messages shown in Table C–1 from the connectivity test<br />
tool.<br />
Table C–1. Messages from the Connectivity Testing Tool<br />
Message Meaning<br />
Machine is down. There is no communication with the RA.<br />
Perform the following steps to determine<br />
the problem:<br />
• Verify that the firewall permits pinging<br />
the RA, that is, using an ICMP echo.<br />
• Check that the RA is connected and<br />
operating.<br />
• Check that the required ports are<br />
open. (Refer to Section 7, “Solving<br />
Networking Problems,” for tables with<br />
the port information.)<br />
is down. The host connection exists but the RA is<br />
not responding.<br />
Perform the following steps to determine<br />
the problem:<br />
• Check that the required ports are<br />
open. (Refer to Section 7, “Solving<br />
Networking Problems” for tables with<br />
the port information.)<br />
• Verify that the RA is attached to an RA<br />
cluster.<br />
Connection to link: protocol: FAILED. No connection is available to the host through the protocol.<br />
Link () FAILED. The connection that was checked has failed.<br />
All OK. The connection is working.<br />
To discover which port is involved in the error or failure, run the test again and select<br />
Show all results from last connectivity check. The port on which each failure<br />
occurred is shown.<br />
Fibre Channel Diagnostics<br />
Use the Fibre Channel diagnostics when you need to check SAN connections, review<br />
port settings, see details of the Fibre Channel, determine Fibre Channel targets and<br />
LUNs, and perform I/O operations to a LUN.<br />
On the Diagnostics menu, type 2 (Fibre Channel diagnostics) and press Enter to<br />
access the Fibre Channel Diagnostics menu as follows:<br />
** Fibre Channel Diagnostics **<br />
[1] Run SAN diagnostics<br />
[2] View Fibre Channel details<br />
[3] Detect Fibre Channel targets<br />
[4] Detect Fibre Channel LUNs<br />
[5] Detect Fibre Channel SCSI-3 reserved LUNs<br />
[6] Perform I/O to a LUN<br />
[B] Back<br />
[Q] Quit<br />
Run SAN Diagnostics<br />
On the Fibre Channel Diagnostics menu, type 1 (Run SAN diagnostics) and press<br />
Enter to run the SAN diagnostics.<br />
When you select this option, the system conducts a series of automatic tests to identify<br />
the most common problems encountered in the configuration of SAN environments,<br />
such as the following:<br />
• Storage inaccessible within a site<br />
• Delays with writes or reads to disk<br />
• Disk not accessible in the network<br />
• Configuration issues<br />
Once the tests complete, a message is displayed confirming the successful completion<br />
of SAN diagnostics, or a report is displayed that provides additional details.<br />
Results similar to the following are displayed for a successful diagnostics run of port 0:<br />
0 errors:<br />
0 warnings:<br />
Total=0<br />
Sample results follow for a diagnostics run that returns errors:<br />
ConfigB_Site2 Box2>>1<br />
>>Running SAN diagnostics. This may take a few moments...<br />
results of SAN diagnostics are<br />
3 errors:<br />
1. Found device with no guid : wwn=5006016b1060090d lun=0 port=0 vendor=DGC<br />
product=LUNZ<br />
2. Found device with no guid : wwn=500601631060090d lun=0 port=0 vendor=DGC<br />
product=LUNZ<br />
3. Found device with no guid : wwn=5006016b1060090d lun=0 port=1 vendor=DGC<br />
product=LUNZ<br />
9 warnings:<br />
1. device wwn=500601631060090d lun=8<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,125,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
2. device wwn=500601631060090d lun=7<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,127,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
3. device wwn=500601631060090d lun=6<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,129,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
4. device wwn=500601631060090d lun=5<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,131,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
5. device wwn=500601631060090d lun=4<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,133,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
6. device wwn=500601631060090d lun=3<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,135,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
7. device wwn=500601631060090d lun=2<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,137,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
8. device wwn=500601631060090d lun=1<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,139,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
9. device wwn=500601631060090d lun=0<br />
guid=(storage=CLARION,buffer=Vector(96,6,1,96,155,195,14,0,141,87,93,152,230,2<br />
29,218,17)) found in port 1 and not in port 0<br />
Total=12<br />
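The report always ends with a Total line that should equal the error count plus the warning count (3 + 9 = 12 above). A hypothetical helper (illustrative only) that reads those counts from a report and checks that they agree:

```python
import re

# Hypothetical helper: read the error/warning counts and the Total line from
# a SAN diagnostics report and confirm they are consistent.
def san_summary(report):
    errors = warnings = total = 0
    m = re.search(r"(\d+)\s+errors:", report)
    if m:
        errors = int(m.group(1))
    m = re.search(r"(\d+)\s+warnings:", report)
    if m:
        warnings = int(m.group(1))
    m = re.search(r"Total=(\d+)", report)
    if m:
        total = int(m.group(1))
    return {"errors": errors, "warnings": warnings,
            "consistent": errors + warnings == total}

print(san_summary("3 errors:\n...\n9 warnings:\n...\nTotal=12"))
# prints {'errors': 3, 'warnings': 9, 'consistent': True}
```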
View the Fibre Channel Details<br />
On the Fibre Channel Diagnostics menu, type 2 (View Fibre Channel details) and<br />
press Enter to show the current Fibre Channel details.<br />
The operation mode is identified automatically according to the SAN switch<br />
configuration. Usually the RA is configured for point-to-point mode unless the SAN<br />
switch port is hard-set to loop (L_Port) mode.<br />
Note: You can use the View Fibre Channel details capability to obtain information about<br />
WWNs that is needed for zoning.<br />
You can check the status for the following on the Fibre Channel Diagnostics menu:<br />
• Speed<br />
• Operating mode<br />
• Node WWN<br />
• Changes made<br />
• Connection issues<br />
• Additions of new HBAs<br />
Sample results showing Fibre Channel details for port 0 and port 1 follow:<br />
ConfigB_Site2 Box2>>2<br />
>> Port 0<br />
------------------------------------<br />
wwn = 5001248200875c81<br />
node_wwn = 5001248200875c80<br />
port id = 0x20100<br />
operating mode = point to point<br />
speed = 2 GB<br />
Port 1<br />
------------------------------------<br />
wwn = 5001248201a75c81<br />
node_wwn = 5001248201a75c80<br />
port id = 0x20500<br />
operating mode = point to point<br />
speed = 2 GB<br />
If all cables are disconnected, the operating mode results for all ports are disconnected.<br />
If only one cable is disconnected, then the operating mode for the affected port is<br />
disconnected, as shown in the following sample results:<br />
ConfigB_Site2 Box2>>2<br />
>> Port 0<br />
------------------------------------<br />
wwn = 5001248200875c81<br />
node_wwn = 5001248200875c80<br />
port id = 0x20100<br />
operating mode = point to point<br />
speed = 2 GB<br />
Port 1<br />
------------------------------------<br />
wwn = 5001248201a75c81<br />
node_wwn = 5001248201a75c80<br />
port id = 0x0<br />
operating mode = disconnected<br />
speed = 2 GB<br />
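A quick way to spot a pulled or failed cable in this output is to look for ports whose operating mode is disconnected. The following hypothetical helper (illustrative, not part of the product) does that scan over "View Fibre Channel details" output like the listing above:

```python
import re

# Hypothetical helper: list port numbers whose operating mode reads
# "disconnected" in "View Fibre Channel details" output.
def disconnected_ports(details):
    down, port = [], None
    for line in details.splitlines():
        m = re.search(r"Port (\d+)", line)   # "Port 0", ">> Port 1", ...
        if m:
            port = int(m.group(1))
        elif "operating mode" in line and "disconnected" in line:
            down.append(port)
    return down

sample = """>> Port 0
operating mode = point to point
Port 1
port id = 0x0
operating mode = disconnected"""
print(disconnected_ports(sample))  # prints [1]
```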
Detect Fibre Channel Targets<br />
On the Fibre Channel Diagnostics menu, type 3 (Detect Fibre Channel targets) and<br />
press Enter to see a list of the targets that are accessible to the RA through ports A<br />
and B.<br />
Some of the reasons to use this capability are as follows:<br />
• Zoning issues<br />
• Failure to detect a host<br />
• SAN connection issues<br />
• Need for WWN or storage details of each RA<br />
The following sample results provide port WWN, node WWN, and port information:<br />
ConfigB_Site2 Box2>>3<br />
>><br />
Port 0<br />
Port WWN Node WWN Port ID<br />
----------------------------------------------------<br />
1) 0x500601631060090d 0x500601609060090d 0x20000<br />
2) 0x5006016b1060090d 0x500601609060090d 0x20400<br />
Port 1<br />
Port WWN Node WWN Port ID<br />
----------------------------------------------------<br />
1) 0x500601631060090d 0x500601609060090d 0x20000<br />
2) 0x5006016b1060090d 0x500601609060090d 0x20400<br />
Detect Fibre Channel LUNs<br />
On the Fibre Channel Diagnostics menu, type 4 (Detect Fibre Channel LUNs) and<br />
press Enter to see a list of all volumes on the SAN that are visible to the RA.<br />
Use this capability to detect<br />
• Issues with volume access<br />
• LUN repository details<br />
• Additions of volumes<br />
The following sample results show the types of information returned; long lines wrap<br />
around in the display:<br />
ConfigB_Site2 Box2>>4<br />
>>This operation may take a few minutes...<br />
Size Vendor Product Serial Number Vendor Specific UID<br />
Port WWN LUN CGs Site ID<br />
================================================================================<br />
1. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 127<br />
CLARION: 60,06,01,60,9b,c3,0e,00,8d,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 0 2<br />
2. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 125<br />
CLARION: 60,06,01,60,9b,c3,0e,00,8b,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 1 2<br />
3. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 123<br />
CLARION: 60,06,01,60,9b,c3,0e,00,89,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 2 2<br />
4. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 121<br />
CLARION: 60,06,01,60,9b,c3,0e,00,87,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 3 2<br />
5. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 119<br />
CLARION: 60,06,01,60,9b,c3,0e,00,85,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 4 2<br />
6. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 117<br />
CLARION: 60,06,01,60,9b,c3,0e,00,83,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 5 2<br />
7. 1.00GB DGC RAID 5 APM00031800182 LUN ID: 115<br />
CLARION: 60,06,01,60,9b,c3,0e,00,81,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 6 0<br />
8. 4.00GB DGC RAID 5 APM00031800182 LUN ID: 113<br />
CLARION: 60,06,01,60,9b,c3,0e,00,7f,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 7 2<br />
9. 62.00GB DGC RAID 5 APM00031800182 LUN ID: 111<br />
CLARION: 60,06,01,60,9b,c3,0e,00,7d,57,5d,98,e6,e5,da,11:0<br />
1 500601631060090d 8 40<br />
10. N/A DGC LUNZ APM00031800182 -<br />
N/A<br />
0 500601631060090d 0 N/A<br />
11. N/A DGC LUNZ APM00031800182 -<br />
N/A<br />
0 5006016b1060090d 0 N/A<br />
12. N/A DGC LUNZ APM00031800182 -<br />
N/A<br />
1 5006016b1060090d 0 N/A<br />
Detect Fibre Channel SCSI-3 Reserved LUNs<br />
On the Fibre Channel Diagnostics menu, type 5 (Detect Fibre Channel SCSI-3<br />
reserved LUNs) and press Enter to list all LUNs that have SCSI-3 reservations. The<br />
information returned includes the WWN, LUN number, port number, and reservation<br />
type.<br />
Perform I/O to a LUN<br />
On the Fibre Channel Diagnostics menu, type 6 (Perform I/O to a LUN) and press<br />
Enter to initiate a dialog that guides you through performing an I/O operation to a LUN.<br />
Note: The write operation destroys any existing data on the LUN. Use the write<br />
operation only when you are installing at the site.<br />
The following example for a read operation shows sample responses in bold type.<br />
SYDNEY Box1>>6<br />
>>This operation may take a few minutes...<br />
Size Vendor Product Serial Number Vendor Specific UID<br />
Port WWN Ctrl LUN<br />
============================================================================<br />
1. 9.00GB DGC RAID 5 APM00024400378 LUN 29 Sydney<br />
JouCLARION: 60,06,01,60,db,e3,0e,00,d1,3a,e0,54,cd,b6,db,11:0<br />
0 500601601060009a SP-A 0<br />
0 500601681060009a SP-B 0<br />
1 500601601060009a SP-A 0<br />
1 500601681060009a SP-B 0<br />
.<br />
.<br />
.<br />
10. 10.00GB DGC RAID 5 APM00024400378 LUN 36 Sydney<br />
JouCLARION: 60,06,01,60,db,e3,0e,00,da,3a,e0,54,cd,b6,db,11:0<br />
0 500601601060009a SP-A 10<br />
0 500601681060009a SP-B 10<br />
1 500601601060009a SP-A 10<br />
1 500601681060009a SP-B 10<br />
Select: 6<br />
Select operation to perform:<br />
** Operation To Perform **<br />
[1] Read<br />
[2] Write<br />
SYDNEY Box1>>1<br />
>><br />
Enter the desired transaction size:<br />
SYDNEY Box1>>10485760<br />
Do you want to read the whole LUN? (y/n)<br />
>>y<br />
1 buffers in<br />
1 buffers out<br />
total time : 0.395567 seconds<br />
2.65082e+07 bytes/sec<br />
25.2802 MB/sec<br />
2.52802 IO/sec<br />
CRC = 4126172682534249172<br />
I/O succeeded.<br />
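The throughput figures in this output follow directly from the transaction size and elapsed time; note that MB here means 2^20 bytes. A short check of the arithmetic, using the values from the read example above:

```python
# Reproducing the read example's throughput figures: one 10,485,760-byte
# transaction completed in 0.395567 seconds.
size_bytes = 10485760   # transaction size entered in the dialog
buffers = 1             # "1 buffers in / 1 buffers out"
seconds = 0.395567      # "total time"

bytes_per_sec = size_bytes * buffers / seconds   # ~2.65082e+07 bytes/sec
mb_per_sec = bytes_per_sec / 1048576             # ~25.2802 MB/sec (MB = 2**20 bytes)
io_per_sec = buffers / seconds                   # ~2.52802 IO/sec
```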
The following example for a write operation shows sample responses in bold type.<br />
SYDNEY Box1>>6<br />
>>This operation may take a few minutes...<br />
Size Vendor Product Serial Number Vendor Specific UID<br />
Port WWN Ctrl LUN<br />
============================================================================<br />
1. 9.00GB DGC RAID 5 APM00024400378 LUN 29 Sydney<br />
JouCLARION: 60,06,01,60,db,e3,0e,00,d1,3a,e0,54,cd,b6,db,11:0<br />
0 500601601060009a SP-A 0<br />
0 500601681060009a SP-B 0<br />
1 500601601060009a SP-A 0<br />
1 500601681060009a SP-B 0<br />
.<br />
.<br />
.<br />
10. 10.00GB DGC RAID 5 APM00024400378 LUN 36 Sydney<br />
JouCLARION: 60,06,01,60,db,e3,0e,00,da,3a,e0,54,cd,b6,db,11:0<br />
0 500601601060009a SP-A 10<br />
0 500601681060009a SP-B 10<br />
1 500601601060009a SP-A 10<br />
1 500601681060009a SP-B 10<br />
============================================================================<br />
Select: 10<br />
Select operation to perform:<br />
** Operation To Perform **<br />
[1] Read<br />
[2] Write<br />
SYDNEY Box1>>2<br />
>><br />
Enter the desired transaction size:<br />
SYDNEY Box1>>10485760<br />
Enter the number of transactions to perform:<br />
SYDNEY Box1>>100<br />
Enter the number of blocks to skip:<br />
SYDNEY Box1>>16<br />
100 buffers in<br />
100 buffers out<br />
total time : 40.7502 seconds<br />
2.57318e+07 bytes/sec<br />
24.5398 MB/sec<br />
2.45398 IO/sec<br />
CRC = 3829111553924479115<br />
I/O succeeded.<br />
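The summary figures in these examples follow directly from the transaction size, transaction count, and elapsed time. The following minimal Python sketch (illustrative only, not part of the product) reproduces the arithmetic behind the write example's summary lines:<br />

```python
# Reproduce the diagnostic's summary lines for the write example:
# 100 transactions of 10,485,760 bytes in 40.7502 seconds.
size = 10_485_760        # transaction size in bytes, as entered above
buffers = 100            # number of transactions
elapsed = 40.7502        # total time in seconds, from the sample output

bytes_per_sec = size * buffers / elapsed
print(f"{bytes_per_sec:.5e} bytes/sec")           # 2.57318e+07 bytes/sec
print(f"{bytes_per_sec / 1_048_576:.4f} MB/sec")  # 24.5398 MB/sec
print(f"{buffers / elapsed:.5f} IO/sec")          # 2.45398 IO/sec
```

Note that MB/sec here uses binary megabytes (1 MB = 1,048,576 bytes), which matches the sample output.<br />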
Synchronization Diagnostics<br />
On the Diagnostics menu, type 3 (Synchronization diagnostics) and press Enter to<br />
verify that an RA is synchronized.<br />
Note: The RA must be attached to run the synchronization diagnostics. Reattaching the<br />
RA causes the RA to reboot.<br />
The results displayed are similar to the following example:<br />
remote refid st t when poll reach delay offset jitter<br />
=============================================================================<br />
*10.10.0.1 192.116.202.203 3 u 438 1024 377 0.337 12.971 6.241<br />
+11 10.10.0.1 2 u 484 1024 376 0.090 -4.530 0.023<br />
LOCAL(0) LOCAL(0) 13 1 2 64 377 0.000 0.000 0.004<br />
The columns in the previous output are defined as follows:<br />
• remote—host names or addresses of the servers and peers used for synchronization<br />
• refid—current source of synchronization<br />
• st—stratum<br />
• t—type (u=unicast, m=multicast, l=local, – =do not know)<br />
• when—time since the peer was last heard, in seconds<br />
• poll—poll interval, in seconds<br />
• reach—status of the reachability register in octal format<br />
• delay—latest delay in milliseconds<br />
• offset—latest offset in milliseconds<br />
• jitter—latest jitter in milliseconds<br />
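The reach value is easiest to read as a bit pattern. The following illustrative Python snippet (not part of the product) decodes the octal register from the sample output; each of its 8 bits records whether one of the last eight polls received a reply:<br />

```python
# Decode the octal "reach" register shown in the sample output above.
reach = int("377", 8)          # 377 octal = 255 decimal
print(f"{reach:08b}")          # 11111111 -> all of the last 8 polls succeeded

partial = int("376", 8)        # a register value with one missed poll
print(f"{partial:08b}")        # 11111110
```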
The symbol at the left margin indicates the synchronization status of each peer. The<br />
currently selected peer is marked with an asterisk (*); additional peers designated as<br />
acceptable for synchronization are marked with a plus sign (+). Peers marked with * and<br />
+ are included in the weighted average computation to set the local clock. Data<br />
produced by peers marked with other symbols is discarded. The LOCAL(0) entry<br />
represents the values obtained from the internal clock on the local machine.<br />
Collect System Info<br />
On the Diagnostics menu, type 4 (Collect system info) and press Enter to collect<br />
system information for later processing and analysis. You specify where to place the<br />
information collected. In some cases, you might need to transfer it to a vendor for<br />
technical support. You are prompted to provide the following information:<br />
• The time frame for log collection<br />
• Whether to collect information from the remote site<br />
• FTP details if you choose to send the results to an FTP server<br />
• Which logs to collect<br />
• Whether you have SANTap switches from which you want to collect information<br />
Note: The dialog asks whether you want full collection. If you choose full collection,<br />
additional technical information is supplied, but the time required for the collection<br />
process is lengthened. Unless specifically instructed by a Unisys service representative,<br />
do not choose full collection.<br />
The following dialog provides sample responses in bold type for collecting system<br />
information:<br />
>>GMT right now is 11/24/2005 14:45:43<br />
Enter the start date:<br />
>>11/22/2005<br />
Enter the start time:<br />
>>12:00:00<br />
Enter the end date:<br />
>>11/24/2005<br />
Enter the end time:<br />
>>14:45:43<br />
Note: The start and end times are used only for collection of the system<br />
logs. Logs from hosts are collected in their entirety.<br />
Do you want to collect system information from the other site also? (y/n)<br />
>>y<br />
Do you want to send results to an ftp server? (y/n)<br />
>>y<br />
Enter the name of the ftp server to which you want to transfer the<br />
collected system information:<br />
>>ftp.ess.unisys.com<br />
Enter the port number to which to connect on the FTP server:<br />
>>21<br />
Enter the FTP user name:<br />
>>MY_USERNAME<br />
Enter the location on the FTP server in which you want to put the collected<br />
system information:<br />
>>incoming<br />
Enter the file on the FTP server in which you want to put the collected<br />
system information:<br />
>>19557111_company.tar<br />
Enter the FTP password:<br />
>>*******<br />
Select the logs you want to collect:<br />
** Collection mode **<br />
[1] Collect logs from RAs only<br />
[2] Collect logs from hosts only<br />
[3] Collect logs from RAs and hosts<br />
>>3<br />
Do you have SANTap switches from which you want to collect information?<br />
>>n<br />
Do you want to perform full collection? (y/n)<br />
>>n<br />
Do you want to limit collection time? (y/n)<br />
>>n<br />
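The start and end values entered above define the collection window for the system logs. As an illustration (assuming the MM/DD/YYYY format shown in the dialog), the window length can be checked with the standard library:<br />

```python
# Compute the length of the log-collection window entered in the dialog.
from datetime import datetime

fmt = "%m/%d/%Y %H:%M:%S"
start = datetime.strptime("11/22/2005 12:00:00", fmt)
end = datetime.strptime("11/24/2005 14:45:43", fmt)
print(end - start)   # 2 days, 2:45:43
```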
Once you complete the information-entry dialog, Installation Manager checks<br />
connectivity and displays a list of accessible hosts for which the feature is enabled. (See<br />
the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong> for more<br />
information.) You must indicate the hosts for which you want to collect logs. You can<br />
select one or more individual hosts or enter NONE or ALL.<br />
Once you specify the hosts, Installation Manager returns system information and logs for<br />
all accessible RAs, including the remote RAs, if so instructed. This software also returns<br />
a success or failure status report for each RA from which it has been instructed to collect<br />
information.<br />
Installation Manager also collects logs for the selected hosts and reports on the success<br />
or failure of each collection. The timeout on the collection process is 20 minutes.<br />
If you requested that the collected information be stored on an FTP server, the<br />
system reports that it is transferring the collected information to the specified FTP<br />
location. When the transfer completes, you are prompted to press Enter to continue.<br />
You can also open or download the stored files using your browser. Log in as<br />
webdownload/webdownload, and access the files at one of these URLs:<br />
• For nonsecured servers: http:///info/<br />
• For secured servers: https:///info/<br />
The following error conditions apply:<br />
• If the connection with an RA is lost while information collection is in progress, no<br />
information is collected.<br />
You can run the process again. If the collection from the remote site failed because<br />
of a WAN failure, run the process locally at the remote site.<br />
• If simultaneous information collection is occurring from the same RA, only the<br />
collector that established the first connection can succeed.<br />
• FTP failure results in failure of the entire process.<br />
If this process fails to collect the desired host information, you can alternatively generate<br />
host information collection directly for individual hosts. Use the Host Information<br />
Collector (HIC) utility as described in Appendix A. Also, the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong><br />
Administrator’s <strong>Guide</strong> provides additional information about the HIC utility.<br />
Appendix D<br />
Replacing a Replication Appliance<br />
(RA)<br />
To replace an RA at a site, you must perform the following tasks as described in this<br />
appendix:<br />
• Save configuration settings.<br />
• Record the group properties and save the Global cluster mode settings.<br />
• Modify the Preferred RA setting.<br />
• Detach the failed RA.<br />
• Remove the Fibre Channel adapter cards.<br />
• Install and configure the replacement RA.<br />
• Verify the RA installation.<br />
• Restore group properties.<br />
• Ensure the existing RA can switch over to the new RA.<br />
Note: During this process, be sure that the direction of all consistency groups is from<br />
the site without the failed RA to the site with the failed RA. You might need to move<br />
groups.<br />
Saving the Configuration Settings<br />
Before you replace an RA, Unisys recommends that you save the current environment<br />
settings to a file. The saved file is a script that contains CLI commands for all groups,<br />
volumes, and replication pairs needed to re-create the environment. The file is used for<br />
backup purposes only.<br />
1. From a command prompt on the management PC, enter the following command to<br />
change to the directory where the plink.exe file is located:<br />
cd putty<br />
2. Update the following command with your site management IP address and<br />
administrator (admin) password, and then enter the command:<br />
plink -ssh site management IP address -l admin -pw admin password<br />
save_settings > sitexandsitey.txt<br />
Note: If a message is displayed asking whether you want to add a cached registry<br />
key, type y and press Enter. The file is automatically saved to the management PC<br />
in the same directory from which the command was issued.<br />
If you need to restore the settings saved in the previous procedure, update the following<br />
command with your site management IP address and administrator (admin) password,<br />
and then enter the command:<br />
plink -ssh site management IP address -l admin -pw admin password -m<br />
sitexandsitey.txt<br />
Recording Policy Properties and Saving Settings<br />
Before you begin the RA replacement procedure, be sure to record the policy properties<br />
and save the Global cluster mode settings.<br />
Perform the following steps for each consistency group to record policy properties and<br />
save settings:<br />
1. Select the Policy tab.<br />
2. Write down and save the current preferred RA settings and Global cluster mode<br />
parameter for each consistency group. Use this record to restore these values after<br />
you replace the RA.<br />
3. Click OK.<br />
4. Repeat steps 1 through 3 for all the other groups.<br />
Modifying the Preferred RA Setting<br />
For each consistency group, record the Preferred RA and Global cluster mode settings<br />
so that they can be restored at the end of this procedure.<br />
Perform the following steps to change all consistency groups that were running on the<br />
failed RA to a surviving RA:<br />
1. Select the Policy tab.<br />
2. Change the Preferred RA setting to a surviving RA number for all consistency<br />
groups that had the Preferred RA value set to the failed RA. Perform steps 2a<br />
through 2e for each group.<br />
a. If the Global cluster mode parameter is set to one of the following options,<br />
skip this step and continue with step 2d:<br />
• None<br />
• Manual (shared quorum)<br />
• Manual<br />
b. Change the Global cluster mode parameter to<br />
• Manual (shared quorum) (if using MSCS with a shared quorum)<br />
• Manual (if using MSCS with a majority node set)<br />
c. Click Apply.<br />
d. Change the Preferred RA setting, and then click Apply.<br />
e. Change the Global cluster mode parameter to the original setting.<br />
f. Click Apply.<br />
3. Select the Consistency Group and click the Status tab to verify that all groups<br />
are running on the new RA number.<br />
Review the current status of the preferred RA under the components pane.<br />
4. Detach the failed RA. If you can log on to the RA, detach it by performing the<br />
following steps. Otherwise, continue with “Removing Fibre Channel Adapter Cards.”<br />
a. Use the PuTTY utility to connect to the box management IP address for the RA that<br />
is being replaced.<br />
b. Type boxmgmt when prompted to log in, and then type the appropriate<br />
password if it has changed from the default password boxmgmt.<br />
The Main Menu is displayed.<br />
c. Type 4 (Cluster operations) and press Enter.<br />
d. Type 2 (Detach from cluster) to detach the RA from the cluster, and then press<br />
Enter.<br />
e. Type y when prompted to detach and press Enter.<br />
f. Type B (Back) and press Enter to return to the Main Menu.<br />
g. Type quit and close the PuTTY window.<br />
Removing Fibre Channel Adapter Cards<br />
Perform the following to remove the RA and Fibre Channel host bus adapters (HBAs):<br />
1. Power off the failed RA.<br />
2. Physically disconnect and remove the failed RA from the rack.<br />
3. Physically remove the Fibre Channel HBAs from the failed RA and insert them into<br />
the replacement RA.<br />
Note: If you cannot use the cards from the existing RA, refer to “Failure of All SAN<br />
Fibre Channel Host Bus Adapters (HBAs)” in Section 8 for information about<br />
replacing a failed HBA.<br />
Installing and Configuring the Replacement RA<br />
To install and configure the replacement RA, you must complete several tasks, as follows:<br />
• Complete the procedure in “Cable and Apply Power to New RA.”<br />
• Complete the procedure in “Connecting and Accessing the RA.”<br />
• Complete the procedure in “Configuring the RA.”<br />
• Complete the procedures in “Verifying the RA Installation.”<br />
Cable and Apply Power to the New RA<br />
1. Insert the new RA into the rack and apply power.<br />
2. Insert the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> RA Setup Disk CD-ROM into the CD/DVD<br />
drive of the RA. Ensure that this disk is the same version that is running in the other<br />
RAs.<br />
3. Power off and then power on the RA.<br />
4. As the RA boots, check the BIOS level as displayed in the Unisys banner and note<br />
the level displayed. At the end of the replacement procedure, you can compare the<br />
existing RA BIOS level with the new RA BIOS level. The RA BIOS might need to be<br />
updated.<br />
Connecting and Accessing the RA<br />
1. Power on the appropriate RA.<br />
2. Connect an Ethernet cable between the management PC used for installation and<br />
the WAN Ethernet segment to which the RA is connected.<br />
If you connect the management PC directly to the RA, use a crossover cable.<br />
3. Assign the following IP address and subnet mask to the management PC:<br />
10.77.77.50 (IP address)<br />
255.255.255.0 (subnet mask)<br />
4. Access the RA by using the SSH client. (See Appendix C.) Use the 10.77.77.77 IP<br />
address, which has a subnet mask of 255.255.255.0.<br />
5. Log in with the boxmgmt user name and the boxmgmt password.<br />
6. Provide the following information for the layout of the RA installation:<br />
a. When prompted about the number of sites in the environment<br />
• Type 2 to install in a geographic replication environment or a geographic<br />
clustered environment.<br />
• Type 1 to install in a continuous data protection environment.<br />
b. Type the number of RAs at the site, and press Enter.<br />
The Main Menu appears.<br />
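The addressing in steps 3 and 4 works because both addresses fall in the same /24 network. The following quick check is purely illustrative (standard library only; the addresses are the ones listed above):<br />

```python
# Confirm the management PC and the RA's default address share the
# 255.255.255.0 (/24) subnet used during installation.
import ipaddress

net = ipaddress.ip_network("10.77.77.0/24")
pc = ipaddress.ip_address("10.77.77.50")   # management PC, from step 3
ra = ipaddress.ip_address("10.77.77.77")   # RA default address, from step 4
print(pc in net and ra in net)             # True
```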
Checking Storage-to-RA Access<br />
Verify that all LUNs are accessible by using the Main Menu of the Installation<br />
Manager and performing the following steps. If the LUNs are not accessible, check<br />
your switch configuration and zoning:<br />
1. Type 3 (Diagnostics).<br />
2. Type 2 (Fibre Channel diagnostics).<br />
3. Type 4 (Detect Fibre Channel LUNs).<br />
After a few minutes, a list of detected LUNs appears.<br />
4. Press the spacebar until all expected LUNs appear.<br />
5. Type B (Back).<br />
6. Type B again.<br />
The Main Menu appears.<br />
7. If you do not see all Fibre Channel LUNs in step 4, correct the environment and<br />
repeat steps 1 through 6.<br />
Enabling PCI-X Slot Functionality<br />
If your system is configured with a gigabit (Gb) WAN, which is used for the optical WAN<br />
connection, perform the following steps on the Main Menu of the replacement RA:<br />
1. Type 2 (Setup).<br />
2. Type 8 (Advanced option).<br />
3. Type 12 (Enable/disable additional remote interface).<br />
4. Type yes when prompted whether to enable the additional remote interface.<br />
5. Type B twice to return to the Main Menu.<br />
Configuring the RA<br />
1. On the Main Menu, type 1 (Installation).<br />
2. Type 2 (Get Setup information from an installed RA). Press Enter.<br />
The Get Settings Wizard menu appears with Get Settings from Installed<br />
RA selected.<br />
3. Press Enter.<br />
4. Type 1 (Management interface) to view the settings from the installed RA.<br />
5. Type y when prompted to configure a temporary IP address.<br />
6. Type the IP address.<br />
7. Type the IP subnet mask and then press Enter.<br />
8. Type y or n, depending on your environment, when prompted to configure a<br />
gateway.<br />
9. Type the box management IP address of Site 1 RA 1 to import the settings from that<br />
RA.<br />
10. Type y to import the settings.<br />
11. Press Enter to continue when a message states that the configuration was<br />
successfully imported.<br />
The Get Settings Wizard menu appears with Apply selected.<br />
12. Perform the following steps to apply the configuration to the RA:<br />
a. Press Enter to continue.<br />
The complete list of settings is displayed. These settings are the same as the<br />
ones for Site 1 RA 1.<br />
b. Type y to apply these settings.<br />
c. Type 1 or 2 when prompted for a site number, depending on the site on which<br />
the RA is located.<br />
d. Type the RA number when prompted.<br />
A confirmation message appears when the settings are applied successfully.<br />
e. Press Enter.<br />
The Get Settings Wizard menu appears with Proceed to the Complete<br />
Installation Wizard selected.<br />
f. Press Enter to continue.<br />
The Complete Installation Wizard menu appears with Configure<br />
repository volume selected.<br />
13. Configure the repository volume by completing the following steps:<br />
a. Press Enter.<br />
b. Type 2 (Select a previously formatted repository volume).<br />
c. Select the number of the repository volume corresponding to the group of<br />
displayed volumes, and press Enter.<br />
d. Press Enter again.<br />
The Complete Installation Wizard menu appears with Attach to cluster<br />
selected.<br />
14. Attach the RA to the RA cluster by completing the following steps:<br />
a. Press Enter.<br />
b. Type y at the prompt to attach to the cluster.<br />
The RA reboots.<br />
c. Close the PuTTY session if necessary.<br />
Verifying the RA Installation<br />
To verify that the RA is correctly installed, you must<br />
• Verify the WAN bandwidth<br />
• Verify the clock synchronization<br />
Verifying WAN Bandwidth<br />
Use the following procedure to verify the actual versus the expected WAN bandwidth.<br />
Note: Correct any problems and rerun the verification.<br />
1. Open an SSH session to the box management IP address for the replacement RA.<br />
2. Type boxmgmt when prompted to log in, and then type the appropriate password<br />
if it has changed from the default password boxmgmt.<br />
The Main Menu is displayed.<br />
3. Type 3 (Diagnostics) and press Enter.<br />
The Diagnostics menu appears.<br />
4. Type 1 (IP diagnostics) and press Enter.<br />
The IP Diagnostics menu appears.<br />
5. Type 4 (Test throughput) and press Enter.<br />
6. Type the WAN IP address of the peer RA; for example, site 2 RA 1 is the peer for<br />
site 1 RA 1.<br />
7. Type 2 (WAN interface).<br />
8. At the prompt, type 20 to change the default value for the desired number of<br />
concurrent streams.<br />
9. At the prompt for the test duration, type 60 to change the default value.<br />
A message is displayed that the connection was established.<br />
10. After 60 seconds, make sure that the following information is displayed on the<br />
screen. Ignore any TCP Window Size warnings.<br />
• IP connection for every stream<br />
• Interval, Transfer, and Bandwidth for every stream<br />
• Expected bandwidth in the [SUM] display at the bottom of the screen<br />
11. On the IP Diagnostics menu, type Q (Quit), and then type y.<br />
Verifying Clock Synchronization<br />
The timing of all Unisys <strong>SafeGuard</strong> 30m activities across all RAs in an installation must<br />
be synchronized against a single clock (for example, on the network time protocol [NTP]<br />
server). Consequently, you need to synchronize the replacement RA.<br />
For the procedure to verify RA synchronization, see the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong><br />
Replication Appliance Installation <strong>Guide</strong>.<br />
Restoring Group Properties<br />
Perform the following steps on the Management Console for each group that needs<br />
to have the Preferred RA setting restored to an RA other than RA 1. At this point, all<br />
Preferred RA settings are set to RA 1.<br />
1. Select the Policy tab for the consistency group.<br />
2. On the General Settings section, change the Preferred RA setting to the<br />
original setting, and then click Apply.<br />
3. Change the Global cluster mode under Advanced to the original setting if it<br />
was changed earlier.<br />
4. Click Apply.<br />
Ensuring the Existing RA Can Switch Over to the<br />
New RA<br />
Once the new RA is part of the configuration, verify that the management console does<br />
not display any errors. Shut down any other RA at the site to ensure that the newly<br />
replaced RA can successfully complete the switchover. As the existing RA reboots,<br />
check the BIOS level as displayed in the Unisys banner and note it.<br />
Compare the BIOS level noted for the existing (rebooting) RA with the BIOS level you<br />
noted for the replacement RA. If the BIOS levels do not match, contact the Unisys<br />
<strong>Support</strong> Center to obtain the correct BIOS.<br />
Appendix E<br />
Understanding Events<br />
Event Log<br />
Event Topics<br />
Various events generate entries to the Unisys <strong>SafeGuard</strong> 30m solution system log.<br />
These events are predefined in the system according to topic, level of severity, and<br />
scope. The Unisys <strong>SafeGuard</strong> 30m solution supports proactive notification of an event—<br />
either by sending e-mail messages or by generating system log events that are logged<br />
by a management application.<br />
The system records log entries in response to a wide range of predefined events. Each<br />
event carries an event ID. For manageability, the system divides the events into general<br />
and advanced types. In most cases, you can monitor system behavior effectively by<br />
viewing the general events only. For troubleshooting a problem, technical support<br />
personnel might want to review the advanced log events.<br />
Event topics correspond to the components where the events occur, including<br />
• Management (management console and CLI)<br />
• Site<br />
• RA<br />
• Consistency group<br />
• Splitter<br />
A single event can generate multiple log entries.<br />
Event Levels<br />
Event Scope<br />
The levels of severity for events are defined as follows (in ascending order):<br />
• Info<br />
These messages are informative in nature, usually referring to changes in the<br />
configuration or normal system state.<br />
• Warning<br />
These messages indicate a warning, usually referring to a transient state or to an<br />
abnormal condition that does not degrade system performance.<br />
• Error<br />
These messages indicate an important event that is likely to disrupt normal system<br />
behavior, performance, or both.<br />
A single change in the system—for example, an error over a communications line—can<br />
affect a wide range of system components and cause the system to generate a large<br />
number of log events. Many of these events contain highly technical information that is<br />
intended for use by Unisys service representatives. When all of the events are displayed,<br />
you might find it difficult to identify the particular events in which you are interested.<br />
You can use the scope to manage the type and quantity of events that are displayed in<br />
the log. An event belongs to one of the following scopes:<br />
• Normal<br />
Events with a Normal scope result when the system analyzes a wide range of<br />
system data to generate a single event that explains the root cause for an entire set<br />
of Detailed and Advanced events. Usually, these events are sufficient for effective<br />
monitoring of system behavior.<br />
• Detailed<br />
Events with a Detailed scope include all events for all components that are<br />
generated for users and that are not included among the events that have a Normal<br />
scope. The display of Detailed events includes Normal events also.<br />
• Advanced<br />
Events with an Advanced scope contain technical information. In some cases, such<br />
as troubleshooting a problem, a Unisys service representative might need to retrieve<br />
information from the Advanced log events.<br />
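The three scopes nest: a Detailed display includes Normal events, and an Advanced display includes both. The following minimal Python model (illustrative only, not the product's API) captures that filtering rule:<br />

```python
# Model scope filtering: each display scope shows events at that scope
# or any narrower one (Normal < Detailed < Advanced).
ORDER = {"Normal": 0, "Detailed": 1, "Advanced": 2}

def displayed(events, scope):
    return [e["id"] for e in events if ORDER[e["scope"]] <= ORDER[scope]]

events = [
    {"id": 1000, "scope": "Normal"},    # a root-cause event
    {"id": 1009, "scope": "Detailed"},
    {"id": 9999, "scope": "Advanced"},  # hypothetical technical event
]
print(displayed(events, "Normal"))    # [1000]
print(displayed(events, "Detailed"))  # [1000, 1009]
print(displayed(events, "Advanced"))  # [1000, 1009, 9999]
```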
Displaying the Event Log<br />
You can display the event log either from the Management Console or by using the CLI.<br />
To display event logs, select Logs in the navigation pane; the most recent events in the<br />
event log are displayed. For more information about a particular event log, double-click<br />
the event log. The Log Event Properties dialog box displays details of the individual<br />
event.<br />
You can sort the log events according to any of the columns (that is, level, scope, time,<br />
site, ID, and topic) in ascending or descending order.<br />
Perform the following steps to display advanced logs:<br />
1. Click the Filter log tool bar option in the event pane.<br />
The Filter Log dialog box appears.<br />
2. Change the scope to Advanced.<br />
3. Click OK.<br />
For more information about using the management console, see the Unisys <strong>SafeGuard</strong><br />
<strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong>.<br />
To display the event log from the CLI, run the get_logs command and specify values for<br />
each of the parameters. Specify the parameters carefully to avoid displaying unnecessary<br />
log information. You can use the terse display parameter to show more or less<br />
information for the displayed events as desired.<br />
For information about the CLI, see the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Replication Appliance<br />
Command Line Interface (CLI) Reference <strong>Guide</strong>.<br />
Using the Event Log for <strong>Troubleshooting</strong><br />
The event log provides information that can be useful in determining the cause or nature<br />
of problems that might arise during operation.<br />
The “group capabilities” events provide an important tool for understanding the behavior<br />
of a consistency group. Each group capabilities event—such as group capabilities OK,<br />
group capabilities minor problem, or group capabilities<br />
problem—provides a high-level description of a current group situation with regard to<br />
each of the RAs and identifies the RA that is currently handling the group.<br />
The information reported for each RA includes the following:<br />
• RA status: Indicates whether an RA is currently a member of the RA cluster (that is,<br />
alive) or not a member (that is, dead).<br />
• Marking status: yes or no.<br />
• Transfer status: yes, no, no data loss (that is, flushing), or yes unstable (that is, the<br />
RA cannot be initialized if closed or detached).<br />
• Journal capability: yes (that is, distributing, logged access, and so forth), no, or static<br />
(that is, access to an image is enabled but access to a different image is not enabled,<br />
cannot distribute, and cannot support image access)<br />
• Preferred: yes or no.<br />
In addition, the event log reports the RA on which the group is actually running and the<br />
status of the link between the sites.<br />
A group capabilities event is generated whenever there is a change in the capabilities of<br />
a group on any RA. The message reports on any limitations to the capabilities of the<br />
group and provides reasons for these limitations.<br />
Tracking logged events can explain changes in a group state (for example, the reason<br />
replication was paused, the reason the group switched to another RA, and so forth).<br />
The group capabilities events might offer reasons that particular actions are not<br />
performed. For example, if you want to know the reason the group transfer was paused,<br />
you can check the event log for the “pause replication” action. If, however, you want to<br />
know the reason a group transfer did not start, you might check the most recent group<br />
capabilities event.<br />
The level of a group capabilities event can be INFO, WARNING, or ERROR, depending<br />
on the severity of the reported situation. These levels correspond to the OK, minor<br />
problem, and problem bookmarks that follow group capabilities in the message<br />
descriptions.<br />
List of Events<br />
The list of events is presented in tabular format with the following given for each event:<br />
• Event ID<br />
• Topic (for example, Management, Site, RA, Splitter, Group)<br />
• Level (for example, Info, Warning, Error)<br />
• Description<br />
• Scope<br />
• Time<br />
• Site<br />
List of Normal Events<br />
Event<br />
ID<br />
Understanding Events<br />
Normal events include both root-cause events (a single description for an event that can<br />
generate multiple events) and other selected basic events. Some Normal events do not<br />
have a topic or trigger. Table E–1 lists Normal events with their descriptions.<br />
Topic<br />
Table E–1. Normal Events<br />
Level<br />
Description<br />
1000 Management Info User logged in. (User<br />
)<br />
1001 Management Warning Log in failed. (User<br />
)<br />
1003 Management Warning Failed to generate SNMP<br />
trap. (Trap contents)<br />
1004 Management Warning Failed to send e-mail alert<br />
to specified address.<br />
(Address , Event summary<br />
)<br />
1005 Management Warning Failed to update file. (File<br />
<br />
1006 Management Info Settings changed. (User<br />
, Settings<br />
)<br />
1007 Management Warning Settings change failed.<br />
(User , Settings<br />
, Reason<br />
)<br />
1008 Management Info User action succeeded.<br />
(User , Action<br />
)<br />
Trigger<br />
User log-in action<br />
User failed to log in<br />
The system failed to<br />
send SNMP trap.<br />
The system failed to<br />
send an e-mail alert.<br />
The system failed to<br />
update the local<br />
configuration file<br />
(passwords, SSH<br />
keys, system log<br />
configuration, and<br />
SNMP configuration).<br />
The user changed<br />
settings.<br />
The system failed to<br />
change settings.<br />
The user performed<br />
one of these actions:<br />
bookmark_image,<br />
clear_markers,<br />
set_markers,<br />
undo_logged_<br />
writes, set_num_<br />
of_streams.<br />
6872 5688–002 E–5
Understanding Events<br />
1009 | Management | Warning | User action failed. (User , Action , Reason ) | One of these actions failed: bookmark_image, clear_markers, set_markers, undo_logged_writes, set_num_of_streams.
1011 | Management | Error | Grace period expired. You must install an activation code to activate your license. | Grace period expired.
1014 | Management | Info | User bookmarked an image. (Group , Snapshot ) | The user bookmarked an image.
1015 | Management | Warning | RA-to-storage multipathing problem. (RA , Volume ) | One or more paths between the RA and the volume are not available.
1016 | Management | Warning Off | RA-to-storage multipathing problem fixed. (RA , Volume ) | All paths between the RA and the volume are available.
1017 | Management | Warning | RA-to-splitter multipathing problem. (RA , Splitter ) | One or more paths between the RA and the splitter are not available.
1018 | Management | Warning Off | RA-to-splitter multipathing problem fixed. (RA , Splitter ) | All paths between the RA and the splitter are available.
1019 | Management | Warning | User action succeeded. (Markers cleared. Group ) (Replication set attached as clean. Group ) | The user cleared markers or attached a replication set as clean.
3001 | RA | Warning | RA is no longer a cluster member. (RA ) | An RA is disconnected from site control.
3005 | RA | Error | Settings conflict between sites. (Reason ) | A settings conflict between the sites was discovered.
3006 | RA | Error Off | Settings conflict between sites resolved by user. (Using Site settings) | A settings conflict between the sites was resolved by the user.
3030 | RA | Warning | RA switched path to storage. (RA , Volume ) | A storage path change was initiated by the RA.
4056 | Group | Warning | No image was found in the journal to match the query. (Group ) | No image was found in the journal to match the query.
4090 | Group | Warning | Target-side log is 90 percent full. When the log is full, writing by hosts at the target side is disabled. (Group ) | The target-side log is 90 percent full.
4106 | Group | Warning | Capacity reached; cannot write additional markers for this group to . Starting full sweep. (Group ) | The disk space for the markers was filled for the group.
4117 | Group | Warning | Virtual access buffer is 90 percent full. When the buffer is full, writing by hosts at the target side is disabled. (Group ) | The usage of the virtual access buffer has reached 90 percent.
5008 | Splitter | Warning | Host shut down. (Host , Splitter ) | The host was shut down or restarted.
5010 | Splitter | Warning | Splitter stopped; depending on policy, writing by host might be disabled for some groups, and a full sweep might be required for other groups. (Splitter ) | The user stopped the splitter after removing volumes; volumes are disconnected.
5011 | Splitter | Warning | Splitter stopped; full sweep is required. (Splitter ) | The user stopped the splitter after removing volumes; volumes are disconnected.
5012 | Splitter | Warning | The splitter stopped; write operations to replication volumes are disabled. (Splitter ) | The splitter stopped; host access to all volumes is disabled.
10000 | — | Info | Changes are occurring in the system. Analysis in progress. | —
10001 | — | Info | System changes have occurred. The system is now stable. | —
10002 | — | Info | The system activity has not stabilized; issuing an intermediate report. | —
10101 | — | Error | The cause of the system activity is unclear. To obtain more information, filter the events log using the Detailed scope. | —
10102 | — | Info | Site control recorded internal changes that do not affect system operation. | —
10202 | — | Info | Settings have changed. | —
10203 | — | Info | The RA cluster is down. | —
10204 | — | Error | One or more RAs are disconnected from the RA cluster. | —
10205 | — | Error | A communications problem occurred in an internal process. | —
10206 | — | Info | An internal process was restarted. | —
10207 | — | Error | An internal process was restarted. | —
10210 | — | Error | Initialization is experiencing high-load conditions. | —
10211 | — | Error | A temporary problem occurred in the Fibre Channel link between the splitters and the RAs. | —
10212 | — | Error Off | The temporary problem that occurred in the Fibre Channel link between the splitters and the RAs is resolved. | —
10501 | — | Info | Synchronization completed. | —
10502 | — | Info | Access to the target-side image is enabled. | —
10503 | — | Error | The system is transferring the latest snapshot before pausing transfer (no data loss). | —
10504 | — | Info | The journal was cleared. | —
10505 | — | Info | The system completed undoing writes to the target-side log. | —
10506 | — | Info | The roll to the physical images is complete. Logged access to the physical image is now available. | —
10507 | — | Info | Because of system changes, the journal was temporarily out of service. The journal is now available. | —
10508 | — | Info | All data were flushed from the local-side RA; automatic failover proceeds. | —
10509 | — | Info | The initial long resynchronization has completed. | —
10510 | — | Info | Following a paused transfer, the system is now cleared to restart transfer. | —
10511 | — | Info | The system finished recovering the replication backlog. | —
12001 | — | Error | The splitter is down. | —
12002 | — | Error | An error occurred in all WAN links to the other site. The other site is possibly down. | —
12003 | — | Error | An error occurred in the WAN link to the RA at the other site. | —
12004 | — | Error | An error occurred in the data link over the WAN. All RAs are unable to transfer replicated data to the other site. | —
12005 | — | Error | An error occurred in the data link over the WAN. The RA is unable to transfer replicated data to the other site. | —
12006 | — | Error | The RA is disconnected from the RA cluster. | —
12007 | — | Error | All RAs are disconnected from the RA cluster. | —
12008 | — | Error | The RA is down. | —
12009 | — | Error | The group entered high load. | —
12010 | — | Error | A journal error occurred. Full sweep is to be performed after the error is corrected. | —
12011 | — | Error | The target-side log or virtual buffer is full. Writing by hosts at the target side is disabled. | —
12012 | — | Error | The system cannot enable virtual access to the image. | —
12013 | — | Error | The system cannot enable access to a specified image. | —
12014 | — | Error | The Fibre Channel link between all RAs and all splitters and storage is down. | —
12016 | — | Error | The Fibre Channel link between all RAs and all storage is down. | —
12022 | — | Error | The Fibre Channel link between the RA and splitters or storage volumes (or both) is down. | —
12023 | — | Error | The Fibre Channel link between the RA and all splitters and storage is down. | —
12024 | — | Error | The Fibre Channel link between the RA and all splitters is down. | —
12025 | — | Error | The Fibre Channel link between the RA and all storage is down. | —
12026 | — | Error | An error occurred in the WAN link to the RA at the other site. | —
12027 | — | Error | All replication volumes attached to the consistency group (or groups) are not accessible. | —
12029 | — | Error | The Fibre Channel link between all RAs and one or more volumes is down. | —
12033 | — | Error | The repository volume is not accessible; data might be lost. | —
12034 | — | Error | Writes to storage occurred without corresponding writes to the RA. | —
12035 | — | Error | An error occurred in the WAN link to the RA cluster at the other site. | —
12036 | — | Error | A renegotiation of the transfer protocol is requested. | —
12037 | — | Error | All volumes attached to the consistency group (or groups) are not accessible. | —
12038 | — | Error | All journal volumes attached to the consistency group (or groups) are not accessible. | —
12039 | — | Error | A long resynchronization started. | —
12040 | — | Error | The system detected bad sectors in a volume. | —
12041 | — | Error | The splitter is up. | —
12042 | — | Error | All WAN links to the other site are restored. | —
12043 | — | Error | The WAN link to the RA at the other site is restored. | —
12044 | — | Error | Problem with an IP link between RAs (in at least one direction). | —
12045 | — | Error | Problem with all IP links between RAs. | —
12046 | — | Error | Problem with IP links between RAs. | —
12047 | — | Error | RA network interface card (NIC) problem. | —
14001 | — | Error Off | The splitter is up. | —
14002 | — | Error Off | All WAN links to the other site are restored. | —
14003 | — | Error Off | The WAN link to the RA at the other site is restored. | —
14004 | — | Error Off | The data link over the WAN is restored. All RAs can transfer replicated data to the other site. | —
14005 | — | Error Off | The data link over the WAN is restored. The RA can transfer replicated data to the other site. | —
14006 | — | Error Off | The connection of the RA to the RA cluster is restored. | —
14007 | — | Error Off | The connection of all RAs to the RA cluster is restored. | —
14008 | — | Error Off | The RA is up. | —
14009 | — | Error Off | The group exited high load. The initialization completed. | —
14010 | — | Error Off | The journal error was corrected. A full sweep operation is required. | —
14011 | — | Error Off | The target-side log or virtual buffer is no longer full. | —
14012 | — | Error Off | Virtual access to an image is enabled. | —
14013 | — | Error Off | The system is no longer trying to access a diluted image. | —
14014 | — | Error Off | The Fibre Channel link between all RAs and all splitters and storage is restored. | —
14016 | — | Error Off | The Fibre Channel link between all RAs and all storage is restored. | —
14022 | — | Error Off | The Fibre Channel link that was down between the RA and splitters or storage volumes (or both) is restored. | —
14023 | — | Error Off | The Fibre Channel link between the RA and all splitters and storage is restored. | —
14024 | — | Error Off | The Fibre Channel link between the RA and all splitters is restored. | —
14025 | — | Error Off | The Fibre Channel link between the RA and all storage is restored. | —
14026 | — | Error Off | The WAN link to the RA at the other site is restored. | —
14027 | — | Error Off | Access to all volumes attached to the consistency group (or groups) is restored. | —
14029 | — | Error Off | The Fibre Channel link between all RAs and one or more volumes is restored. | —
14033 | — | Error Off | Access to the repository volume is restored. | —
14034 | — | Error Off | Replication consistency in writes to storage is restored. | —
14035 | — | Error Off | The WAN link to the RA at the other site is restored. | —
14036 | — | Error Off | The renegotiation of the transfer protocol is complete. | —
14037 | — | Error Off | Access to all replication volumes attached to the consistency group (or groups) is restored. | —
14038 | — | Error Off | Access to all journal volumes attached to the consistency group (or groups) is restored. | —
14039 | — | Info | The long resynchronization has completed. | —
14040 | — | Error Off | The system detected a correction of bad sectors in the volume. | —
14041 | — | Error Off | The system detected that the volume is no longer read-only. | —
14042 | — | Error Off | A synchronization is in progress to restore any failed writes in the group. | —
14043 | — | Error Off | A synchronization is in progress to restore any failed writes. | —
14044 | — | Error Off | Problem with an IP link between RAs (in at least one direction) corrected. | —
14045 | — | Error Off | All IP links between RAs restored. | —
14046 | — | Error Off | IP link between RAs restored. | —
14047 | — | Error Off | RA network interface card (NIC) problem corrected. | —
16000 | — | Error | Transient root cause. | —
16001 | — | Error | The splitter was down. The problem is corrected. | —
16002 | — | Error | An error occurred in all WAN links to the other site. The problem is corrected. | —
16003 | — | Error | An error occurred in the WAN link to the RA at the other site. The problem is corrected. | —
16004 | — | Error | An error occurred in the data link over the WAN. All RAs were unable to transfer replicated data to the other site. The problem is corrected. | —
16005 | — | Error | An error occurred in the data link over the WAN. The RA was unable to transfer replicated data to the other site. The problem is corrected. | —
16006 | — | Error | The RA was disconnected from the RA cluster. The connection is restored. | —
16007 | — | Error | All RAs were disconnected from the RA cluster. The problem is corrected. | —
16008 | — | Error | The RA was down. The problem is corrected. | —
16009 | — | Error | The group entered high load. The problem is corrected. | —
16010 | — | Error | A journal error occurred. The problem is corrected. A full sweep is required. | —
16011 | — | Error | The target-side log or virtual buffer was full. Writing by the hosts at the target side was disabled. The problem is corrected. | —
16012 | — | Error | The system could not enable virtual access to the image. The problem is corrected. | —
16013 | — | Error | The system could not enable access to the specified image. The problem is corrected. | —
16014 | — | Error | The Fibre Channel link between all RAs and all splitters and storage was down. The problem is corrected. | —
16016 | — | Error | The Fibre Channel link between all RAs and all storage was down. The problem is corrected. | —
16022 | — | Error | The Fibre Channel link between the RA and splitters or storage volumes (or both) was down. The problem is corrected. | —
16023 | — | Error | The Fibre Channel link between the RA and all splitters and storage was down. The problem is corrected. | —
16024 | — | Error | The Fibre Channel link between the RA and all splitters was down. The problem is corrected. | —
16025 | — | Error | The Fibre Channel link between the RA and all storage was down. The problem is corrected. | —
16026 | — | Error | An error occurred in the WAN link to the RA at the other site. The problem is corrected. | —
16027 | — | Error | All volumes attached to the consistency group (or groups) were not accessible. The problem is corrected. | —
16029 | — | Error | The Fibre Channel link between all RAs and one or more volumes was down. The problem is corrected. | —
16033 | — | Error | The repository volume was not accessible. The problem is corrected. | —
16034 | — | Error Off | Writes to storage occurred without corresponding writes to the RA. The problem is corrected. | —
16035 | — | Error | An error occurred in the WAN link to the RA at the other site. The problem is corrected. | —
16036 | — | Error | The renegotiation of the transfer protocol was requested and has been completed. | —
16037 | — | Error | All replication volumes attached to the consistency group (or groups) were not accessible. The problem is corrected. | —
16038 | — | Error | All journal volumes attached to the consistency group (or groups) were not accessible. The problem is corrected. | —
16039 | — | Info | The system ran a long resynchronization. | —
16040 | — | Error | The system detected bad sectors in the volume. The problem is corrected. | —
16041 | — | Error | The system detected that the volume was read-only. The problem is corrected. | —
16042 | — | Error | The splitter write operation might have failed while the group was transferring data. | —
16043 | — | Error | The splitter write operations might have failed. | —
16044 | — | Error | There was a problem with an IP link between RAs (in at least one direction). | —
16045 | — | Error | There was a problem with all IP links between RAs. The problem has been corrected. | —
16046 | — | Error | There was a problem with an IP link between RAs. The problem has been corrected. | —
16047 | — | Error | There was an RA network interface card (NIC) problem. The problem has been corrected. | —
18001 | — | Error Off | The splitter was temporarily up but is down again. | —
18002 | — | Error Off | All WAN links to the other site were temporarily restored, but the problem has returned. | —
18003 | — | Error Off | The WAN link to the RA at the other site was temporarily restored, but the problem has returned. | —
18004 | — | Error Off | The data link over the WAN was temporarily restored, but the problem has returned. All RAs are unable to transfer replicated data to the other site. | —
18005 | — | Error Off | The data link over the WAN was temporarily restored, but the problem has returned. The RA is currently unable to transfer replicated data to the other site. | —
18006 | — | Error Off | The connection of the RA to the RA cluster was temporarily restored, but the problem has returned. | —
18007 | — | Error Off | All RAs were temporarily restored to the RA cluster, but the problem has returned. | —
18008 | — | Error Off | The RA was temporarily up, but is down again. | —
18009 | — | Error Off | The group temporarily exited high load, but the problem has returned. | —
18010 | — | Error Off | The journal error was temporarily corrected, but the problem has returned. | —
18011 | — | Error Off | The target-side log or virtual buffer was temporarily no longer full, and write operations by the hosts at the target side were re-enabled. However, the problem has returned. | —
18012 | — | Error Off | Virtual access to the image was temporarily enabled, but the problem has returned. | —
18013 | — | Error Off | Access to an image was temporarily enabled, but the problem has returned. | —
18014 | — | Error Off | The Fibre Channel link between all RAs and all splitters and storage was temporarily restored, but the problem has returned. | —
18016 | — | Error Off | The Fibre Channel link between all splitters and all storage was temporarily restored, but the problem has returned. | —
18022 | — | Error Off | The Fibre Channel link that was down between the RA and splitters or storage volumes (or both) was temporarily restored, but the problem has returned. | —
18023 | — | Error Off | The Fibre Channel link between the RA and all storage was temporarily restored, but the problem has returned. | —
18024 | — | Error Off | The Fibre Channel link between the RA and all splitters was temporarily restored, but the problem has returned. | —
18025 | — | Error Off | The Fibre Channel link between the RA and all storage was temporarily restored, but the problem has returned. | —
18026 | — | Error | The WAN link to the RA at the other site was temporarily restored, but the problem has returned. | —
18027 | — | Error Off | Access to all journal volumes attached to the consistency group (or groups) was temporarily restored, but the problem has returned. | —
18029 | — | Error Off | The Fibre Channel link between all RAs and one or more volumes was temporarily restored, but the problem has returned. | —
18033 | — | Error Off | Access to the repository volume was temporarily restored, but the problem has returned. | —
18034 | — | Error Off | Replication consistency in write operations to storage and to RAs was temporarily restored, but the problem has returned. | —
18035 | — | Error Off | The WAN link to the RA at the other site was temporarily restored, but the problem has returned. | —
18036 | — | Error Off | The negotiation of the transfer protocol was completed but is again requested. | —
18037 | — | Error Off | Access to all volumes attached to the consistency group (or groups) was temporarily restored, but the problem has returned. | —
18038 | — | Error Off | Access to all replication volumes attached to the consistency group (or groups) was temporarily restored, but the problem has returned. | —
18039 | — | Info | The long resynchronization completed but has now restarted. | —
18040 | — | Error Off | The user marked the volume as OK, but the bad-sectors problem persists. | —
18041 | — | Error Off | The user marked the volume as OK, but the read-only problem persists. | —
18042 | — | Error Off | The synchronization restored any failed write operations in the group, but the problem has returned. | —
18043 | — | Error Off | An internal problem has occurred. | —
18044 | — | Error Off | Problem with an IP link between RAs (in at least one direction) was corrected, but the problem has returned. | —
18045 | — | Error Off | Problem with all IP links between RAs (in at least one direction) was corrected, but the problem has returned. | —
18046 | — | Error Off | Problem with an IP link between RAs was corrected, but the problem has returned. | —
18047 | — | Error Off | RA network interface card (NIC) problem was corrected, but the problem has returned. | —

List of Detailed Events

Detailed events are component-level events that are generated for users but do not appear in the Normal scope. Table E–2 lists these events and their descriptions.
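The Normal/Detailed scope distinction can be sketched in code. This is a hypothetical illustration, not product behavior: it assumes, per event 10101's advice to "filter the events log using the Detailed scope" for more information, that the Detailed scope shows Detailed events in addition to Normal ones. The `EVENT_SCOPES` mapping and `visible_in` function are invented names; the IDs come from Tables E–1 and E–2.

```python
# Hypothetical sketch: tag each event ID with the narrowest scope it
# appears in, then filter a log view by scope.
NORMAL, DETAILED = "normal", "detailed"

# (event_id -> scope) pairs drawn from Tables E-1 and E-2.
EVENT_SCOPES = {
    1000: NORMAL,    # User logged in.
    1002: DETAILED,  # User logged out.
    3005: NORMAL,    # Settings conflict between sites.
    3008: DETAILED,  # RA appears to be down.
}

def visible_in(scope):
    """Event IDs shown when the log is filtered to the given scope.

    Assumes the Detailed scope is a superset that also shows Normal
    events, matching the guide's advice to widen the filter for more
    information."""
    if scope == DETAILED:
        return sorted(EVENT_SCOPES)
    return sorted(i for i, s in EVENT_SCOPES.items() if s == NORMAL)
```

Under this assumption, the Normal view shows only IDs 1000 and 3005 from the sample, while the Detailed view shows all four.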
Table E–2. Detailed Events

Event ID | Topic | Level | Description | Trigger
1002 | Management | Info | User logged out. (User ) | The user logged out of the system.
1010 | Management | Warning | Grace period expires in 1 day. You must install an activation code to activate your Unisys SafeGuard solution license. | The grace period expires in 1 day.
1012 | Management | Warning | License expires in 1 day. You must obtain a new Unisys SafeGuard 30m solution license. | The Unisys SafeGuard 30m solution license expires in 1 day.
1013 | Management | Error | License expired. You must obtain a new Unisys SafeGuard 30m solution license. | The Unisys SafeGuard 30m solution license expired.
2000 | Site | Info | Site management running on . | Site control is open; the RA has become the cluster leader.
3000 | RA | Info | RA has become a cluster member. (RA ) | The RA is connected to site control.
3002 | RA | Warning | Site management switched over to this RA. (RA , Reason ) | Leadership is transferred from one RA to another RA.
3007 | RA | Warning Off | RA is up. (RA ) | The RA that was previously down came up.
3008 | RA | Warning | RA appears to be down. (RA ) | An RA suspects that the other RA is down.
3011 | RA | Info | RA access to a volume or volumes restored. (RA , Volume , Volume Type ) | Volumes that were inaccessible became accessible.
3012 | RA | Warning | RA unable to access a volume or volumes. (RA , Volume , Volume Type ) | Volumes ceased to be accessible to the RA.
3013 | RA | Warning Off | RA access to restored. (RA , Volume ) | The repository volume that was inaccessible became accessible.
3014 | RA | Warning | RA unable to access . (RA , Volume ) | The repository volume became inaccessible to a single RA.
3020 | RA | Warning Off | WAN connection to an RA at other site is restored. (RA at other site: ) | The RA regained the WAN connection to an RA at the other site.
3021 | RA | Warning | Error in WAN connection to an RA at other site. (RA at other site: ) | The RA lost the WAN connection to an RA at the other site.
3022 | RA | Warning Off | LAN connection to RA restored. (RA ) | The RA regained the LAN connection to an RA at the local site.
3023 | RA | Warning | Error in LAN connection to an RA. (RA ) | The RA lost the LAN connection to an RA at the local site, without losing the connection through the repository volume.
4000 | Group | Info | Group capabilities OK. (Group ) | Capabilities are full and previous capabilities are unknown.
4001 | Group | Warning | Group capabilities minor problem. (Group ) | Capabilities are either temporarily not full on the RA on which the group is currently running, or indefinitely not full on the RA on which the group is not running.
4003 | Group | Error | Group capabilities problem. (Group ) | Capabilities are indefinitely not full on the RA on which the group is running.
4007 | Group | Info | Pausing data transfer. (Group , Reason: ) | The user stopped the transfer.
4008 | Group | Warning | Pausing data transfer. (Group , Reason: ) | The system temporarily stopped the transfer.
4009 | Group | Error | Pausing data transfer. (Group , Reason: ) | The system stopped the transfer indefinitely.
4010 | Group | Info | Starting data transfer. (Group ) | The user requested a start transfer.
4015 | Group | Info | Transferring latest snapshot before pausing transfer (no data loss). (Group ) | In a total storage disaster, the system flushed the buffer before stopping replication.
4016 | Group | Warning | Transferring latest snapshot before pausing transfer (no data loss). (Group ) | In a total storage disaster, the system flushed the buffer before stopping replication.
4017 | Group | Error | Transferring latest snapshot before pausing transfer (no data loss). (Group ) | In a total storage disaster, the system flushed the buffer before stopping replication.
4018 | Group | Warning | Transfer of latest snapshot from source is complete (no data loss). (Group ) | In a total storage disaster, the last snapshot from the source site is available at the target site.
4019 | Group | Warning | Group in high load; transfer is to be paused temporarily. (Group ) | The disk manager has a high load.<br />
4020 | Group | Warning Off | Group is no longer in high load. (Group ) | The disk manager no longer has a high load.<br />
4021 | Group | Error | Journal full—initialization paused. To complete initialization, enlarge the journal or allow long resynchronization. (Group ) | In initialization, the journal is full and a long resynchronization is not allowed.<br />
4022 | Group | Error Off | Initialization resumed. (Group ) | End of an initialization situation in which the journal is full and a long resynchronization was not allowed.<br />
4023 | Group | Error | Journal full—transfer paused. To restart the transfer, first disable access to image. (Group ) | Access to the image is enabled and the journal is full.<br />
4024 | Group | Error Off | Transfer restarted. (Group ) | End of a situation in which access to the image is enabled and the journal is full.<br />
4025 | Group | Warning | Group in high load—initialization to be restarted. (Group ) | The group has a high load; initialization is to be restarted.<br />
4026 | Group | Warning Off | Group no longer in high load. (Group ) | The group no longer has a high load.<br />
4027 | Group | Error | Group in high load—the journal is full. The roll to physical image is paused, and transfer is paused. (Group ) | No space remains to which to write during roll.<br />
4028 | Group | Error Off | Group no longer in high load. (Group ) | Journal capacity was added, or image access was disabled.<br />
4040 | Group | Error | Journal error—full sweep to be performed. (Group ) | A journal volume error occurred.<br />
4041 | Group | Info | Group activated. (Group , RA ) | The group is replication-ready; that is, replication could take place if other factors are acceptable, such as RAs, network, and storage access.<br />
4042 | Group | Info | Group deactivated. (Group , RA ) | A user action deactivated the group.<br />
4043 | Group | Warning | Group deactivated. (Group , RA ) | The system temporarily deactivated the group.<br />
4044 | Group | Error | Group deactivated. (Group , RA ) | The system deactivated the group indefinitely.<br />
4051 | Group | Info | Disabling access to image—resuming distribution. (Group ) | The user disabled access to an image (that is, distribution is resumed).<br />
4054 | Group | Error | Enabling access to image. (Group ) | The system enabled access to an image indefinitely.<br />
4057 | Group | Warning | Specified image was removed from the journal. Try a later image. (Group ) | The specified image was removed from the journal (that is, FIFO).<br />
4062 | Group | Info | Access enabled to latest image. (Group , Failover site ) | Access was enabled to the latest image during automatic failover.<br />
4063 | Group | Warning | Access enabled to latest image. (Group , Failover site ) | Access was enabled to the latest image during automatic failover.<br />
4064 | Group | Error | Access enabled to latest image. (Group , Failover site ) | Access was enabled to the latest image during automatic failover.<br />
4080 | Group | Warning | Current lag exceeds maximum lag. (Group , Lag , Maximum lag ) | The group lag exceeds the maximum lag (when not regulating an application).<br />
4081 | Group | Warning Off | Current lag within policy. (Group , Lag , Maximum lag ) | The group lag drops from above the maximum lag to below 90 percent of the maximum.<br />
4082 | Group | Warning | Starting full sweep. (Group ) | Group markers were set.<br />
4083 | Group | Warning | Starting volume sweep. (Group , Pair ) | Volume markers were set.<br />
4084 | Group | Info | Markers cleared. (Group ) | Group markers were cleared.<br />
4085 | Group | Warning | Unable to clear markers. (Group ) | An attempt to clear the group markers failed.<br />
4086 | Group | Info | Initialization started. (Group ) | Initialization started.<br />
4087 | Group | Info | Initialization completed. (Group ) | Initialization completed.<br />
4091 | Group | Error | Target-side log is full; write operations by the hosts at the target side are disabled. (Group , Site ) | The target-side log is full.<br />
4095 | Group | Info | Writing target-side log to storage; writes to log cannot be undone. (Group ) | Started marking to retain write operations in the target-side log.<br />
4097 | Group | Warning | Maximum journal lag exceeded. Distribution in fast-forward—older images removed from journal. (Group ) | Fast-forward action started (causing a loss of snapshots taken before, as the maximum journal lag was exceeded).<br />
4098 | Group | Warning Off | Maximum journal lag within limit. Distribution normal—rollback information retained. (Group ) | Five minutes have passed since the fast-forward action stopped.<br />
4099 | Group | Info | Initializing in long resynchronization mode. (Group ) | The system started a long resynchronization.<br />
4110 | Group | Info | Enabling virtual access to image. (Group ) | The user initiated enabling virtual access to an image.<br />
4111 | Group | Info | Virtual access to image enabled. (Group ) | The user enabled virtual access to an image.<br />
4112 | Group | Info | Rolling to physical image. (Group ) | Rolling to the image (in the background) while virtual access to the image is enabled.<br />
4113 | Group | Info | Roll to physical image stopped. (Group ) | Rolling to the image (in the background, while virtual access to the image is enabled) is stopped.<br />
4114 | Group | Info | Roll to physical image complete—logged access to physical image is now enabled. (Group ) | The system completed the roll to the physical image.<br />
4115 | Group | Error | Unable to enable access to virtual image because of partition table error. (The partition table on at least one of the volumes in group has been modified since logged access was last enabled to a physical image. To enable access to a virtual image, first enable logged access to a physical image.) | An attempt to pause on a virtual image is unsuccessful because of a change in the partition table of a volume or volumes in the group.<br />
4116 | Group | Error | Virtual access buffer is full—writing by hosts at the target side is disabled. (Group ) | An attempt to write to the virtual image is unsuccessful because the virtual access buffer usage is 100 percent.<br />
4118 | Group | Error | Cannot enable virtual access to an image. (Group ) | An attempt to enable virtual access to the image is unsuccessful because of insufficient memory.<br />
4119 | Group | Error | Initiator issued an out-of-bounds I/O operation. Contact technical support. (Initiator , Group , Volume ) | A configuration problem exists.<br />
4120 | Group | Warning | Journal usage (with logged access enabled) now exceeds this threshold. (Group , ) | Journal usage (with logged access enabled) has passed a specified threshold.<br />
4121 | Group | Error | Unable to gain permissions to write to replica. | RAs are unable to write to replication or journal volumes because they do not have proper permissions.<br />
4122 | Group | — | Trying to regain permissions to write to replica. | The user has indicated that the permissions problem has been corrected.<br />
4123 | Group | Error | Unable to access volumes—bad sectors encountered. | RAs are unable to write to replication or journal volumes due to bad sectors on the storage.<br />
4124 | Group | Error Off | Trying to access volumes that previously had bad sectors. | The user has indicated that the bad sectors problem has been corrected.<br />
5000 | Splitter | Info | Splitter or splitters are attached to a volume. (Splitter , Volume ) | The user attached a splitter to a volume.<br />
5001 | Splitter | Info | Splitter or splitters are detached from a volume. (Splitter , Volume ) | The user detached a splitter from a volume.<br />
5002 | Splitter | Error | RA is unable to access splitter. (Splitter , RA ) | The RA is unable to access a splitter.<br />
5003 | Splitter | Error Off | RA access to splitter is restored. (Splitter , RA ) | The RA can access a splitter that was previously inaccessible.<br />
5004 | Splitter | Error | Splitter is unable to access a replication volume or volumes. (Splitter , Volume ) | The splitter cannot access a volume.<br />
5005 | Splitter | Error Off | Splitter access to replication volume or volumes is restored. (Splitter , Volume ) | The splitter can access a volume that was previously inaccessible.<br />
5006 | OBSOLETE<br />
5007 | OBSOLETE<br />
5013 | Splitter | Error | Splitter is down. (Splitter ) | Connection to the splitter was lost with no warning; the splitter crashed or the connection is down.<br />
5015 | Splitter | Error Off | Splitter is up. (Splitter ) | Connection to the splitter was regained after a splitter crash.<br />
5016 | Splitter | Warning | Splitter has restarted. (Splitter ) | The boot timestamp of the splitter has changed.<br />
5030 | Splitter | Error | Splitter write failed. (Splitter , Group ) | The splitter write operation to the RA was successful; the write operation to the storage device was not successful.<br />
5031 | Splitter | Warning | Splitter is not splitting to replication volumes; volume sweeps are required. (Host , Volumes , Groups ) | The splitter is not splitting to the replication volumes.<br />
5032 | Splitter | Info | Splitter is splitting to replication volumes. (Host , Volumes , Groups ) | The splitter started splitting to the replication volumes.<br />
5035 | Splitter | Info | Writes to replication volumes are disabled. (Splitter , Volumes , Groups ) | Write operations to the replication volumes are disabled.<br />
5036 | Splitter | Warning | Writes to replication volumes are disabled. (Host , Volumes , Groups ) | Write operations to the replication volumes are disabled.<br />
5037 | Splitter | Error | Writes to replication volumes are disabled. (Splitter , Volumes , Groups ) | Write operations to the replication volumes are disabled.<br />
5038 | Splitter | Info | Splitter delaying writes. (Splitter , Volumes , Groups ) | —<br />
5039 | Splitter | Warning | Splitter delaying writes. (Splitter , Volumes , Groups ) | —<br />
5040 | Splitter | Error | Splitter delaying writes. (Splitter , Volumes , Groups ) | —<br />
5041 | Splitter | Info | Splitter is not splitting to replication volumes. (Splitter , Volumes , Groups ) | The splitter is not splitting to the replication volumes because of a user decision.<br />
5042 | Splitter | Warning | Splitter is not splitting to replication volumes. (Splitter , Volumes , Groups ) | The splitter is not splitting to the replication volumes.<br />
5043 | Splitter | Error | Splitter not splitting to replication volumes. (Splitter , Volumes , Groups ) | The splitter is not splitting to the replication volumes because of a system action.<br />
5045 | Splitter | Warning | Simultaneous problems reported in splitter and RA. Full-sweep resynchronization is required after restarting data transfer. | The marking backlog on the splitter was lost as a result of concurrent disasters to the splitter and the RA.<br />
5046 | Splitter | Warning | Transient error—reissuing splitter write. | —<br />
Appendix F<br />
Configuring and Using SNMP Traps<br />
The RA in the Unisys <strong>SafeGuard</strong> 30m solution is SNMP capable—that is, the solution<br />
supports monitoring and problem notification using the standard Simple Network<br />
Management Protocol (SNMP), including support for SNMPv3. The solution supports<br />
various SNMP queries to the agent and can be configured so that events generate<br />
SNMP traps, which are sent to designated servers.<br />
Software Monitoring<br />
To configure SNMP traps for monitoring, see the Unisys <strong>SafeGuard</strong> 30m Solution<br />
Planning and Installation <strong>Guide</strong>.<br />
You cannot query the RA software management information base (MIB); however, the RA SNMP agent includes MIB-II support, so you can query the MIB-II. Also see “Hardware<br />
Monitoring.” For more information on MIB-II, see the document at<br />
http://www.faqs.org/rfcs/rfc1213.html<br />
All of the management console log events listed in Appendix E generate SNMP traps,<br />
depending on the minimum severity set in the trap configuration.<br />
The Unisys MIB OID is 1.3.6.1.4.1.21658.<br />
The trap identifiers for Unisys traps are as follows:<br />
1: Info<br />
2: Warning<br />
3: Error<br />
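As a rough sketch (the OID suffix layout here is an assumption for illustration, not taken from the Unisys MIB), the enterprise OID and the three trap identifiers above can be combined like this:<br />

```python
# Sketch: building full trap OIDs from the Unisys enterprise OID and
# the trap identifiers listed above (1: Info, 2: Warning, 3: Error).
# The ".<trap-id>" suffix layout is an illustrative assumption.

UNISYS_ENTERPRISE_OID = "1.3.6.1.4.1.21658"

TRAP_IDS = {"Info": 1, "Warning": 2, "Error": 3}

def trap_oid(severity: str) -> str:
    """Return the full OID for a trap of the given severity."""
    if severity not in TRAP_IDS:
        raise ValueError(f"unknown severity: {severity!r}")
    return f"{UNISYS_ENTERPRISE_OID}.{TRAP_IDS[severity]}"

print(trap_oid("Error"))  # 1.3.6.1.4.1.21658.3
```

An SNMP browser would show these OIDs as the trap identity under the Unisys enterprise subtree.<br />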
The Unisys trap variables and their possible values are defined in Table F–1.<br />
Table F–1. Trap Variables and Values<br />
Variable | OID | Description | Value<br />
dateAndTime | 3.1.1.1 | Date and time that the trap was sent | —<br />
eventID | 3.1.1.2 | Unique event identifier (See values in “List of Events” in Appendix E.) | —<br />
siteName | 3.1.1.3 | Name of site where event occurred | —<br />
eventLevel | 3.1.1.4 | See values | 1: info; 2: warning; 3: warning off; 4: error; 5: error off<br />
eventTopic | 3.1.1.5 | See values | 1: site; 2: K-Box; 3: group; 4: splitter; 5: management<br />
hostName | 3.1.1.6 | Name of host | —<br />
kboxName | 3.1.1.7 | Name of RA | —<br />
volumeName | 3.1.1.8 | Name of volume | —<br />
groupName | 3.1.1.9 | Name of group | —<br />
eventSummary | 3.1.1.10 | Short description of event | —<br />
eventDescription | 3.1.1.11 | More detailed description of event | —<br />
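A trap receiver can map the variable OIDs from Table F–1 back to named fields. The sketch below assumes the varbinds arrive as (OID string, value) pairs carrying the enterprise prefix; that input shape is an assumption for illustration, not a documented interface:<br />

```python
# Sketch: decoding trap varbinds into the named fields of Table F-1.
# The (oid, value) pair input format is an illustrative assumption.

ENTERPRISE = "1.3.6.1.4.1.21658"

# Variable OID suffixes as listed in Table F-1.
VARIABLES = {
    "3.1.1.1": "dateAndTime",
    "3.1.1.2": "eventID",
    "3.1.1.3": "siteName",
    "3.1.1.4": "eventLevel",
    "3.1.1.5": "eventTopic",
    "3.1.1.6": "hostName",
    "3.1.1.7": "kboxName",
    "3.1.1.8": "volumeName",
    "3.1.1.9": "groupName",
    "3.1.1.10": "eventSummary",
    "3.1.1.11": "eventDescription",
}

def decode_varbinds(varbinds):
    """Map (oid, value) pairs onto the variable names in Table F-1."""
    decoded = {}
    prefix = ENTERPRISE + "."
    for oid, value in varbinds:
        if oid.startswith(prefix):
            name = VARIABLES.get(oid[len(prefix):])
            if name:
                decoded[name] = value
    return decoded

sample = [
    ("1.3.6.1.4.1.21658.3.1.1.2", "4003"),
    ("1.3.6.1.4.1.21658.3.1.1.4", "4"),  # 4: error
]
print(decode_varbinds(sample))
```

OIDs outside the Unisys enterprise subtree are simply ignored, which keeps the decoder safe to run against mixed trap traffic.<br />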
Configuring and Using SNMP Traps<br />
SNMP Monitoring and Trap Configuration<br />
To configure SNMP traps, see the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Planning and Installation<br />
<strong>Guide</strong>.<br />
On the management console, use the SNMP Settings menu (in the System menu) to<br />
manage the SNMP capabilities. Through that menu, you can enable and disable the<br />
agent or the SNMP traps feature, modify the configuration for SNMP traps, and add or<br />
remove SNMP users.<br />
In addition, the RA provides several CLI commands for SNMP, as follows:<br />
• The enable_snmp command to enable the SNMP agent<br />
• The disable_snmp command to disable the SNMP agent<br />
• The set_snmp_community command to define a community of users (for SNMPv1)<br />
• The add_snmp_user command to add SNMP users (for SNMPv3)<br />
• The remove_snmp_user command to remove SNMP users (for SNMPv3)<br />
• The get_snmp_settings command to display whether the agent is currently set to be<br />
enabled, the current configuration for SNMP traps, and the list of registered SNMP<br />
users<br />
• The config_snmp_traps command to configure the SNMP traps feature so that<br />
events generate traps. Before you enable the feature, you must designate the IP<br />
address or DNS name for a host at one or more sites to receive the SNMP traps.<br />
Note: You can designate a DNS name for a host only in installations for which a<br />
DNS has been configured.<br />
• The test_snmp_trap command to send a test SNMP trap<br />
When the SNMP agent is enabled, SNMP users can submit queries to retrieve various<br />
types of information about the RA.<br />
You can also designate the minimum severity for which an event should generate an<br />
SNMP trap (that is, info, warning, or error in order from less severe to more severe with<br />
error as the initial default). Once the SNMP traps feature is enabled, the system sends<br />
an SNMP trap to the designated host whenever an event of sufficient severity occurs.<br />
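The minimum-severity rule described above can be expressed as a simple ordered comparison; this sketch only illustrates the filtering logic, not the RA's actual implementation:<br />

```python
# Sketch: minimum-severity filtering for SNMP traps, as described
# above (info < warning < error; error is the initial default).

SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2}

def should_send_trap(event_level: str, minimum: str = "error") -> bool:
    """Return True if an event of event_level should generate a trap."""
    return SEVERITY_ORDER[event_level] >= SEVERITY_ORDER[minimum]

print(should_send_trap("warning"))          # False with the default "error"
print(should_send_trap("warning", "info"))  # True
```

With the default threshold of error, only error-level events generate traps; lowering the threshold to info makes every event generate one.<br />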
Installing MIB Files on an SNMP Browser<br />
Install the RA MIB file (\MIBS\mib.txt on the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Splitter Install<br />
Disk CD-ROM) on an SNMP browser. Follow the instructions for your browser to load<br />
the MIB file.<br />
Resolving SNMP Issues<br />
For SNMP issues, first determine whether the issue is an SNMP trap or an SNMP<br />
monitoring issue by performing the procedure for verifying SNMP traps in the Unisys<br />
<strong>SafeGuard</strong> <strong>Solutions</strong> Planning and Installation <strong>Guide</strong>.<br />
If you do not receive traps, perform the steps in “Monitoring Issues” and then in “Trap<br />
Issues.”<br />
Monitoring Issues<br />
1. Ping the RA management IP address from the management server that has the<br />
SNMP browser.<br />
2. Ensure that the community name used in the RA configuration matches the one<br />
configured on the management server running the SNMP browser (SNMP versions 1<br />
and 2). Use public as the community name.<br />
3. Ensure that the user and password used in the RA configuration match those on the<br />
management server running the SNMP browser (SNMP version 3).<br />
Trap Issues<br />
1. Ensure that the trap destination is on the same network as the management<br />
network and that a firewall has not blocked SNMP traffic.<br />
2. Ensure that the same version of SNMP is configured in the management software<br />
that receives traps.<br />
Appendix G<br />
Using the Unisys <strong>SafeGuard</strong> 30m<br />
Collector<br />
The Unisys <strong>SafeGuard</strong> 30m Collector utility enables you to easily collect information<br />
about the environment so that you can solve problems. An enterprise solution requires<br />
many logs, and gathering the log information can be time-intensive. Often the person<br />
who collects the information is not familiar with all the interfaces to the hardware. The<br />
Collector solves these problems. An experienced installer configures log collection one<br />
time, and then other personnel can use a “one-button” approach to log collection.<br />
You can use this utility to create custom scripts to complete tasks tailored to your<br />
environment. You choose which CLI commands to include in the custom scripts so that<br />
you build the capabilities you need. Refer to the Unisys <strong>SafeGuard</strong> <strong>Solutions</strong> Introduction<br />
to Replication Appliance Command Line Interface (CLI) for more information about CLI<br />
commands.<br />
The Collector gathers configuration information from RAs, storage subsystems, and<br />
switches. No information is collected from the servers in the environment.<br />
Installing the <strong>SafeGuard</strong> 30m Collector<br />
This utility offers two modes: Collector and View. You determine the available modes<br />
when you install the program. If you install the Collector and specify Collector mode,<br />
both modes are enabled. If you install the Collector and specify View mode, the Collector<br />
mode functions are disabled. The View mode is primarily used by support personnel at<br />
the Unisys <strong>Support</strong> Center.<br />
If you are installing the Collector at a customer installation, be sure to install the utility on<br />
PCs at both sites.<br />
The utility requires .NET Framework 2.0 and J# redistributable, which are on the Unisys<br />
<strong>SafeGuard</strong> 30m Solution Control Install Disk CD-ROM in the Redistributable folder.<br />
The directories under this folder are dotNet Framework 2.0 and JSharp.<br />
Notes:<br />
• The readme file on that CD-ROM contains the same information as this appendix.<br />
• If you installed a previous version of the Collector, uninstall this utility and remove<br />
the folder and all of the files in the folder before you begin this installation.<br />
Perform the following steps to install the Collector:<br />
1. Insert the CD-ROM in the CD/DVD drive, and start the file Unisys <strong>SafeGuard</strong> 30m<br />
Collector.msi.<br />
2. On the Installation Wizard welcome screen, click Next.<br />
3. On the Customer Information screen, type the user name and organization, and<br />
click Next.<br />
4. On the Destination Folder screen, select a destination folder and click Next.<br />
Note: If you are using the Windows Vista operating system, install the Collector<br />
into a separate directory named C:\Unisys\30m\Collector.<br />
5. On the Select Options: screen, select Collector mode –install at site or<br />
select View mode –install at support center, and then click Next.<br />
6. On the Ready to Install the Program screen, click Install.<br />
The Installation wizard begins installing the files, and the Installing Unisys<br />
<strong>SafeGuard</strong> 30m Collector screen is displayed to indicate the status of the<br />
installation.<br />
After the files are installed, the Installation Wizard Completed screen is<br />
displayed.<br />
7. Click Finish.<br />
Before You Begin the Configuration<br />
Before you begin configuring the Collector, be sure you have the following information:<br />
• IP addresses<br />
− SAN switches<br />
− Network switches<br />
− RA site management<br />
• Log-in names<br />
− SAN switches<br />
− Network switches<br />
− RA (for custom scripts only)<br />
• Passwords<br />
− SAN switches<br />
− Network switches<br />
− RA (for custom scripts only)<br />
• EMC Navisphere CLI<br />
− Storage<br />
• Autologon configuration<br />
− SAN switches (Consult your SAN switch documentation for the autologon<br />
configuration.)<br />
If you are using a Cisco SAN switch, enable the SSH server before you begin the<br />
configuration. See “Configuring RA, Storage, and SAN Switch Component Types Using<br />
Built-Ins” in this appendix.<br />
Handling the Security Breach Warning<br />
If you previously installed the Collector and have uninstalled the utility and all the files,<br />
when you begin configuring RAs or adding RAs, you might get this message:<br />
WARNING – POTENTIAL SECURITY BREACH!<br />
If you receive this message, complete these steps:<br />
1. Delete the IP address for the RA.<br />
2. Use the following plink command:<br />
C:\>plink -l admin -pw admin get_version<br />
Messages about the host key and a new key are displayed.<br />
3. Type Y in response to the message “Update cached key?”<br />
Once you have updated the cached key, complete the steps in “Configuring RAs” to<br />
discover the IP addresses for the RAs.<br />
Using Collector Mode<br />
Installing the utility in Collector mode enables all the capabilities to gather log information<br />
using scripts and also enables View mode.<br />
Getting Started<br />
To access the Collector, follow these steps:<br />
1. On the Start menu, point to Programs, then click Unisys, then click <strong>SafeGuard</strong><br />
30m Collector; and click <strong>SafeGuard</strong> 30m Collector.<br />
2. Select the Components.ssc file on the Open Unisys <strong>SafeGuard</strong> 30m Collector<br />
File dialog box.<br />
The Unisys <strong>SafeGuard</strong> 30m Collector program window is displayed with two panes<br />
open.<br />
Configuring RAs<br />
To collect data, specify the site management IP address of either of the RA clusters for a<br />
site. The “built-in” scripts are a preconfigured set of CLI commands that facilitate easy<br />
data collection.<br />
The other site management IP address is automatically discovered when you specify<br />
either of the RA site management addresses.<br />
To configure the RA, perform these steps:<br />
1. Start the Collector.<br />
2. If needed, expand the Components tree in the left pane.<br />
3. Select BI Built-In (under RA), right-click, and click Copy Built-In (Discover RA).<br />
4. On the Script dialog box, type the RA site management IP address in the IP<br />
Address field and click Save.<br />
If you have multiple <strong>SafeGuard</strong> solutions, repeat steps 3 and 4 for each set of RA<br />
clusters.<br />
After you enter the IP address, the Collector window is updated with the folder of each<br />
site management IP address appearing below the RA folder. Each IP folder contains the<br />
built-in scripts that are enabled.<br />
The following sample window shows the IP address folders listed in the left pane. In this<br />
figure, two <strong>SafeGuard</strong> solutions are configured—the set of IP addresses (172.16.17.50<br />
and 172.16.17.60) for the two RA clusters in solution 1 and the IP address 172.16.7.50<br />
for the continuous data protection (CDP) solution, which always has only one RA cluster.<br />
Adding Customer Information<br />
Add information about the Unisys service representative, customer, and architect so that<br />
the Unisys <strong>Support</strong> Center can contact the site easily. To add the information, perform<br />
the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
1. On the File menu, click Properties.<br />
2. On the Properties dialog box, select the appropriate tab: Customer, Architect,<br />
or CIR.<br />
3. Type in the information for each field on each tab. (For instance, type text in the<br />
Name, Office, Mobile, E-mail, and Additional Info fields for the CIR tab.)<br />
The Architect tab provides an Installed Date field. Use the Additional Info field for any<br />
other information that the Unisys <strong>Support</strong> Center might need, such as a support<br />
request number.<br />
4. Click OK.<br />
Running All Scripts<br />
To collect data from all enabled scripts in a <strong>SafeGuard</strong> <strong>Solutions</strong> Components (SSC) file,<br />
perform these steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
1. Select Components.<br />
2. Right-click, and click Run, or click the Run button.<br />
Note: The status bar shows the progress of script executions and the amount of data<br />
collected.<br />
Compressing an SSC File to Send to the <strong>Support</strong> Center<br />
Once you run the utility to collect information, you can compress the SSC file to send to<br />
the Unisys <strong>Support</strong> Center.<br />
Note: A Collector components file has the .ssc suffix. Once an SSC file is compressed,<br />
the corresponding <strong>SafeGuard</strong> <strong>Solutions</strong> Data (SSD) file has the .ssd suffix.<br />
On the Unisys <strong>SafeGuard</strong> 30m Collector program window, follow these steps to<br />
compress an SSC file:<br />
1. Click Compress SSC on the File menu.<br />
Once the file is compressed, the file name and path are displayed at the top in the<br />
right pane of the window. The data is exported to the file named Components.ssd in<br />
the directory C:\Program Files\Unisys\30m\Collector\Data.<br />
Note: For the Windows Vista operating system, the SSD file resides in the<br />
directory where the Collector is installed. A typical location for this file is<br />
C:\Unisys\30m\Collector\Components.ssd.<br />
2. Send the SSD file to the Unisys <strong>Support</strong> Center at<br />
Safeguard30msupport@unisys.com.<br />
Duplicating the Installation on Another PC<br />
To duplicate the installation of the Collector at a different PC (for example, on the second<br />
site), perform these steps:<br />
1. Copy the SSD file from the PC with the installed Collector to the second PC, placing<br />
it in the C:\Program Files\Unisys\30m\Collector\Data directory.<br />
2. Start the Collector.<br />
3. Click Cancel on the Open Unisys <strong>SafeGuard</strong> 30m Collector File dialog box.<br />
The Unisys <strong>SafeGuard</strong> 30m Collector program window is displayed.<br />
Note: Once an SSD file is extracted, you can select the .ssc file.<br />
4. On the File menu, select Uncompress SSD.<br />
5. On the Open <strong>SafeGuard</strong> 30m Data File dialog box, select from the list of<br />
available files the SSD file that you wish to uncompress.<br />
If a message appears asking about overwriting the SSC file, click Yes.<br />
6. Ensure that all scripts run from this PC by selecting each component type and<br />
running the scripts for each component.<br />
Understanding Operations in Collector Mode<br />
The Components.ssc file contains the configuration information. If you make changes to<br />
the Components.ssc file—such as adding, deleting, editing, enabling, and disabling<br />
scripts—these changes are automatically saved. You can also make these changes to a<br />
saved SSC file except that you cannot delete scripts from a saved SSC file. You must<br />
open the Components.ssc file to delete scripts.<br />
Understanding and Saving SSC Files<br />
Because you can enable and disable scripts in any SSC file, you can create saved SSC<br />
files for specific uses. If you want to run a subset of the available scripts, save the<br />
Components.ssc file as a new SSC file with a unique name. You can then enable or<br />
disable scripts in the saved SSC file. The saved SSC file is always updated from the<br />
Components.ssc file for information such as the available scripts and the details within<br />
each script. In addition, all changes that are made to any SSC file are updated in the<br />
Components.ssc file. Only scripts that were enabled in the saved SSC file are enabled<br />
when updated from a Components.ssc file.<br />
For example, you could save an SSC file with all RAs except one disabled. You might<br />
name it “radisabled.ssc”. If you have the radisabled.ssc file open and add a new script to<br />
it, the script is automatically added to the Components.ssc file.<br />
Whenever the Components.ssc file is updated with a new script, that script is<br />
automatically added to any saved SSC files.<br />
If you add a new RA to the configuration, the Components.ssc file and any existing<br />
saved SSC files are updated with the component, and its scripts are disabled.<br />
If you make deletions to the Components.ssc file, the deletions are automatically<br />
removed from any saved SSC files.<br />
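The propagation rules above can be summarized in a small model. The sketch below is illustrative only: the real Components.ssc file format is not documented here, and all class and method names are invented for the example.

```python
# Illustrative model of the SSC propagation rules (hypothetical names; the
# actual Components.ssc file format is not shown in this guide).

class SscFile:
    def __init__(self, name):
        self.name = name
        self.enabled = {}  # script name -> enabled flag

class SscCatalog:
    """Keeps Components.ssc and all saved SSC files in sync."""

    def __init__(self):
        self.components = SscFile("Components.ssc")
        self.saved = []

    def save_as(self, name):
        # Saving creates a copy whose enabled flags can then diverge.
        copy = SscFile(name)
        copy.enabled = dict(self.components.enabled)
        self.saved.append(copy)
        return copy

    def add_script(self, script):
        # A script added to any SSC file also appears in Components.ssc and
        # in every saved file; in saved files it starts out disabled.
        self.components.enabled[script] = True
        for f in self.saved:
            f.enabled.setdefault(script, False)

    def delete_script(self, script):
        # Deletions to Components.ssc are removed from all saved files.
        self.components.enabled.pop(script, None)
        for f in self.saved:
            f.enabled.pop(script, None)
```

For example, after saving radisabled.ssc and then adding a script to any open SSC file, the new script is present but disabled in radisabled.ssc, matching the behavior described above.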
Sample Scenario<br />
If you want to collect data at one site only or if you want to view the data from one site,<br />
you can create a new saved SSC file for each site. Follow these steps to create the<br />
saved SSC files.<br />
1. Add any desired scripts to the Components.ssc file.<br />
2. Open an SSC file.<br />
3. Click Save As on the File menu, and enter a unique name for the file.<br />
4. Enable and disable scripts as desired.<br />
For example, you might disable one site. To do so, follow these steps:<br />
a. Select the IP address of a component (for example, the Site 1 RA cluster<br />
management IP).<br />
b. Right-click and click Disable.<br />
Repeat steps 2 through 4 to create additional customized files.<br />
Opening an SSC File<br />
On the Unisys <strong>SafeGuard</strong> 30m Collector program window, perform the following steps<br />
to open an SSC file:<br />
1. Click Open on the File menu.<br />
2. Select an SSC file and click Open.<br />
Configuring RA, Storage, and SAN Switch Component Types Using<br />
Built-In Scripts<br />
The built-in scripts are preconfigured; they contain CLI commands for RAs, navicli<br />
commands for Clariion storage, and CLI commands for switches that facilitate easy data<br />
collection. It takes about 4 minutes for the built-in scripts for one RA to run and about 2<br />
minutes for the built-in scripts for a SAN switch to run.<br />
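Those per-component figures make it easy to estimate a full collection run. A quick sketch; the four- and two-minute values are the approximations quoted above, and actual times vary:

```python
# Rough run-time estimate for the built-in scripts, using the approximate
# per-component figures quoted above (about 4 minutes per RA, about
# 2 minutes per SAN switch).

RA_MINUTES = 4
SAN_SWITCH_MINUTES = 2

def estimated_collection_minutes(num_ras, num_san_switches):
    return num_ras * RA_MINUTES + num_san_switches * SAN_SWITCH_MINUTES
```

For example, four RAs plus two SAN switches would need roughly estimated_collection_minutes(4, 2), or about 20 minutes.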
After you configure built-in scripts, the left pane is updated with the IP addresses below<br />
the component type. Each IP folder contains the built-in scripts that are enabled.<br />
See the previous sample window with the IP address folders listed in the left pane. In<br />
that figure, two <strong>SafeGuard</strong> solutions are configured—the set of IP addresses<br />
(172.16.17.50 and 172.16.17.60) for the two RA clusters and the IP address 172.16.7.50<br />
for the continuous data protection (CDP) setup, which always has only one RA cluster.<br />
On the Unisys <strong>SafeGuard</strong> 30m Collector program window, follow these steps to use<br />
built-in scripts to configure RA, Storage, and SAN Switch component types:<br />
1. Expand a component type—RA, Storage, or SAN Switch—and select BI Built-In.<br />
2. Right-click and click Copy Built-In.<br />
3. On the Script dialog box, complete the available fields and click Save.<br />
Note: You can select one script instead of all scripts by selecting a script name instead<br />
of selecting BI Built-In.<br />
For the RA Component Type<br />
To collect data, specify the site management IP address of either of the RA clusters for a<br />
site. The other site management IP address is automatically discovered when you<br />
specify either of the RA site management addresses.<br />
If you have multiple <strong>SafeGuard</strong> solutions, repeat the three previous steps for each set of<br />
RA clusters.<br />
For the Storage Component Type<br />
Clariion is the only storage component with built-in scripts available.<br />
For the SAN Switch Component Type<br />
Before configuring a Cisco SAN switch, enter config mode on the switch and type ssh<br />
server enable. To determine the state of the SSH server, type show ssh server<br />
when not in config mode. Refer to the Cisco MDS 9020 Fabric Switch Configuration<br />
<strong>Guide</strong> and Command Reference for more information about switch commands.<br />
If you run the tech-support command under SAN Switch from the Collector, the data<br />
capture might take a long time. You can follow the progress in the status bar of the<br />
window.<br />
If you run commands for a Brocade switch and receive the following message, the<br />
Brocade switch is downlevel and does not support the SSH protocol:<br />
rbash: switchShow: command not found<br />
Upgrade the switch software to a later version that supports the SSH protocol.<br />
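One quick way to confirm whether a switch accepts SSH at all, before running Collector scripts against it, is to read the server's identification string: per RFC 4253, an SSH server announces itself with a line beginning "SSH-". The probe below is a generic sketch, not a Collector feature; the host name and port are whatever applies to your switch.

```python
import socket

def read_ssh_banner(host, port=22, timeout=5.0):
    """Return the server's identification string, or None if unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return sock.recv(256).decode("ascii", errors="replace").strip()
    except OSError:
        return None

def supports_ssh(banner):
    # Per RFC 4253, an SSH server identifies itself as "SSH-<version>-...".
    return banner is not None and banner.startswith("SSH-")
```

A downlevel switch that does not speak SSH returns no such banner, so supports_ssh reports False.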
Enabling Scripts<br />
You can interactively enable all the scripts in any SSC file, the scripts for one component<br />
in the SSC file, or a single script. To enable a disabled script, you must open the SSC file<br />
containing the script. Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m<br />
Collector program window.<br />
Enable All Scripts<br />
1. Select Components.<br />
2. Right-click and click Enable.<br />
Enabled scripts are shown in green.<br />
Enable Scripts for One Component<br />
1. Select the IP address of the component.<br />
2. Right-click and click Enable.<br />
Enabled scripts are shown in green.<br />
Enable a Single Script<br />
1. Select the script name.<br />
2. Right-click and click Enable.<br />
The enabled script is shown in green.<br />
Disabling Scripts<br />
You can interactively disable all the scripts in any SSC file, the scripts for one component<br />
in the SSC file, or a single script. Perform the following steps on the Unisys<br />
<strong>SafeGuard</strong> 30m Collector program window.<br />
Disable All Scripts<br />
1. Select Components.<br />
2. Right-click and click Disable.<br />
Disabled scripts are shown in red.<br />
Disable Scripts for One Component<br />
1. Select the IP address of the component.<br />
2. Right-click and click Disable.<br />
Disabled scripts are shown in red.<br />
Disable a Single Script<br />
1. Select the script name.<br />
2. Right-click and click Disable.<br />
The disabled script is shown in red.<br />
Running Scripts<br />
You can interactively run all the scripts in any SSC file; the scripts for one component<br />
type such as RA, Storage, SAN Switch, or Other; the scripts for one component in the<br />
SSC file; or a single script.<br />
Note: You can use the Run button on the Collector toolbar or the Run command in the<br />
following procedures.<br />
Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
Run All Scripts<br />
1. Select Components.<br />
2. Right-click and click Run.<br />
Run Scripts for One Component Type<br />
1. Select a component type—RA, Storage, SAN Switch, or Other.<br />
2. Right-click and click Run.<br />
The status of the executing scripts is displayed in the right pane. The status bar<br />
shows the component type that is running, the IP address, the script name, and<br />
instructions for halting script execution. A progress bar indicates that the Collector is<br />
running the script and shows the amount of data being captured by the script. Once<br />
script execution completes, the status bar shows the last script run.<br />
Run Scripts for One Component<br />
1. Select either the IP address or custom-named component.<br />
2. Right-click and click Run.<br />
The status of the executing scripts is displayed in the right pane. The status bar<br />
shows the component type that is running, the IP address, the script name, and<br />
instructions for halting script execution. A progress bar indicates that the Collector is<br />
running the script and shows the amount of data being captured by the script. Once<br />
script execution completes, the status bar shows the last script run.<br />
Run a Single Script<br />
1. Select a script name.<br />
2. Right-click and click Run.<br />
The status of the executing scripts is displayed in the right pane. The status bar<br />
shows the component type that is running, the IP address, the script name, and<br />
instructions for halting script execution. A progress bar indicates that the Collector is<br />
running the script and shows the amount of data being captured by the script. Once<br />
script execution completes, the status bar shows the last script run.<br />
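Behind each Run command, the Collector executes the enabled script commands and captures their output. The following is a minimal stand-in for that behavior using the standard subprocess module; the function name and timeout value are invented for the sketch and are not part of the Collector.

```python
import subprocess

def run_collection_command(command, timeout_seconds=600):
    """Run one script command and capture its output, roughly as the
    Collector does when you click Run (illustrative, not the real code)."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True,
            timeout=timeout_seconds,
        )
        return result.returncode, result.stdout
    except subprocess.TimeoutExpired:
        # A hung command is abandoned, much like stopping a script
        # from the Collector toolbar.
        return None, ""
```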
Stopping Script Execution<br />
To stop a script while it is executing, click Stop on the Collector toolbar. All scripts that<br />
have been stopped are marked with a green X. The status of the stopped script is<br />
displayed in the right pane.<br />
Deleting Scripts<br />
You can interactively delete scripts only in the Components.ssc file. Perform the<br />
following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
Delete Scripts for One Component<br />
1. Select the IP address or custom-named component.<br />
2. Right-click and click Delete.<br />
Delete a Single Script<br />
1. Expand an IP address or a custom-named component; then select a script name.<br />
2. Right-click and click Delete.<br />
Adding Scripts for RA, Storage, and SAN Switch Component Types<br />
You can interactively add custom scripts to any SSC file by copying an existing script or<br />
by specifying a new script. Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m<br />
Collector program window.<br />
Add New Script for a Component Type<br />
1. Select a component type—RA, Storage, or SAN Switch.<br />
2. Right-click and click New.<br />
3. Complete the script form.<br />
4. Click Save.<br />
Add a New Script Based on an Existing Custom Script<br />
1. Select a script name.<br />
2. Right-click and click New.<br />
3. Complete the form. Change the script name and the command.<br />
4. Click Save.<br />
Adding Scripts for the Other Component Type<br />
Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
1. Select the component type Other.<br />
2. Right-click and click New.<br />
3. On the Select Program dialog box, navigate to the appropriate directory and<br />
choose the file to run. Then click Open.<br />
4. On the Script dialog box, type a component name in the Component field.<br />
5. Type a unique name for the script in the Script Name field.<br />
6. Review the selected file name that is displayed in the Command field. Modify the<br />
file name as necessary.<br />
The following example illustrates using a custom component (adding a new script as<br />
shown in the previous procedure) to mount and unmount drives.<br />
Note: In this example, the Collector must be installed on a server that has either the<br />
kutils utility (installed with the splitter) or the stand-alone kutils utility installed.<br />
C:\batch_File\mount_r.bat<br />
REM This batch file, when run, remounts the specified drive<br />
@echo on<br />
cd "c:\program files\kdriver\kutils"<br />
kutils.exe umount r:<br />
kutils.exe mount r:<br />
echo Finished<br />
C:\batch_File\unmount_r.bat<br />
REM This batch file, when run, flushes the file system and unmounts the specified drive<br />
cd "c:\program files\kdriver\kutils"<br />
kutils.exe flushFS r:<br />
kutils.exe umount r:<br />
Scheduling an SSC File<br />
Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
1. Click Schedule on the menu bar.<br />
2. On the Schedule Unisys <strong>SafeGuard</strong> 30m Collector File dialog box, enter the<br />
information required for each field as follows:<br />
a. Type the password.<br />
b. Type the date and start time.<br />
c. Select a Perform task option, which determines how often the schedule runs.<br />
d. Enter the end date if shown. (You do not need an end date for a Perform task of<br />
Once.)<br />
3. Click Select.<br />
4. On the Select Unisys <strong>SafeGuard</strong> 30m Collector dialog box, select the<br />
appropriate SSC file for which you wish to run the schedule, and then click Open.<br />
The Schedule Unisys <strong>SafeGuard</strong> 30m Collector File dialog box is again<br />
displayed. The Collector opens the selected SSC file as the current SSC file.<br />
5. Click Add.<br />
6. Click Exit.<br />
Note: You can create one schedule for an SSC file. To create additional schedules,<br />
create additional SSC files with the desired scripts enabled. The resultant scheduled data<br />
is appended to any current data (if available). For example, if you run the Collector using<br />
Windows Scheduler three times, three outputs are displayed in the right pane one after<br />
another with the timestamps for each.<br />
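The append behavior described in the note can be pictured as follows. The file layout in this sketch is an illustration only, not the actual SSC/SSD data format.

```python
from datetime import datetime
from pathlib import Path

def append_run_output(data_file, script_name, output):
    """Append one run's output under a timestamp header, so repeated
    scheduled runs accumulate one after another (illustrative format)."""
    stamp = datetime.now().strftime("%Y/%m/%d-%H:%M:%S")
    with Path(data_file).open("a", encoding="utf-8") as f:
        f.write(f"=== {script_name} {stamp} ===\n{output}\n")
```

Running the same scheduled collection three times would therefore leave three timestamped outputs in the file, one after another, as the note describes.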
Querying a Scheduled SSC File<br />
Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
1. Click Schedule from the menu bar.<br />
2. On the Schedule Unisys <strong>SafeGuard</strong> 30m Collector File dialog box, click<br />
Query.<br />
3. On the Tasks window, select the task name that is the same as the scheduled SSC<br />
file.<br />
4. Right-click and click Properties.<br />
5. View the details of the scheduled task in the window; then click OK to close the<br />
task Properties window.<br />
6. Close the Tasks window and then select the Schedule Unisys <strong>SafeGuard</strong> 30m<br />
Collector window.<br />
7. Click Exit.<br />
Note: For the Microsoft Vista operating system, if you want to see the scheduled task<br />
after scheduling a task, click Query on the Schedule Unisys <strong>SafeGuard</strong> 30m<br />
Collector File dialog box. The Microsoft Management Console (MMC) window is<br />
displayed. Press F5 to refresh the view and see the scheduled task.<br />
Deleting a Scheduled SSC File<br />
Perform the following steps on the Unisys <strong>SafeGuard</strong> 30m Collector program window.<br />
1. Select Schedule from the menu bar.<br />
2. On the Schedule Unisys <strong>SafeGuard</strong> 30m Collector File dialog box, click<br />
Query.<br />
3. On the Tasks window, select the task name that is the same as the scheduled SSC<br />
file.<br />
4. Right-click and click Delete.<br />
5. Close the Tasks window and then select the Schedule Unisys <strong>SafeGuard</strong> 30m<br />
Collector window.<br />
6. Click Exit.<br />
Using View Mode<br />
If you installed the Collector in View mode, the support personnel at the Unisys <strong>Support</strong><br />
Center can use View mode to view the information. To access the Collector, follow<br />
these steps:<br />
1. Start the Collector.<br />
2. On the Open Unisys <strong>SafeGuard</strong> 30m Collector File dialog box, click Cancel.<br />
The Unisys <strong>SafeGuard</strong> 30m Collector program window is displayed.<br />
Note: Once an SSD file is extracted, you can select the .ssc file.<br />
3. On the File menu, click Uncompress SSD.<br />
4. On the Open <strong>SafeGuard</strong> 30m Data File dialog box, select from the list of<br />
available files the SSD file that you wish to uncompress.<br />
5. In View mode, expand the components tree and then expand a component type:<br />
RA, Storage, SAN Switch, or Other.<br />
6. Click a script name from those displayed to view the data collected from that script.<br />
The data is displayed in the right pane.<br />
The following figure displays a sample of View mode with data displayed in the right<br />
pane.<br />
7. On the File menu, click Exit.<br />
Appendix H<br />
Using kutils<br />
Usage<br />
The server-based kutils utility enables you to manage host splitters across all platforms.<br />
This utility is installed automatically when you install the Unisys <strong>SafeGuard</strong> 30m splitter<br />
on a host machine. When the splitting function is performed by an intelligent fabric<br />
switch, you can install a stand-alone version of the kutils utility separately on host<br />
machines.<br />
For details on the syntax and use of the kutils commands, see the Unisys <strong>SafeGuard</strong><br />
<strong>Solutions</strong> Replication Appliance Administrator’s <strong>Guide</strong>.<br />
A kutils command is always introduced with the kutils string. If you enter the string<br />
independently—that is, without any parameters—the kutils utility returns usage notes,<br />
as follows:<br />
C:\program files\kdriver\kutils>kutils<br />
Usage: kutils <br />
Path Designations<br />
You can designate the path to a device in the following ways:<br />
• Device path example<br />
“SCSI\DISK&VEN_KASHYA&PROD_MASTER_RIGHT&REV_0001\5&133EF78A&0&000”<br />
• Storage path example<br />
“SCSI#DISK&VEN_KASHYA&PROD_MASTER_RIGHT&REV_0001#5&133EF78A&0&000#{53<br />
f56307-b6bf-11d0-94f2-00a0c91efb8b}”<br />
• Volume path example<br />
“\\?\Volume{33b4a391-26af-11d9-b57b-505054503030}”<br />
Each command notes the particular designation to use. In addition, some commands,<br />
such as showDevices and showFS, return the symbolic link for a device. The symbolic<br />
link generally provides additional information about the characteristics of the specific<br />
devices.<br />
The following are examples of symbolic links:<br />
“\Device\0000005c”<br />
“\Device\EmcPower\Power2”<br />
“\Device\Scsi\q123001Port2Path0Target0Lun2”<br />
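The three designations and the symbolic links each have a recognizable shape, so they can be told apart mechanically. The classifier below is a heuristic built only from the examples above; it is not an official kutils parser.

```python
import re

def classify_designation(path):
    """Heuristically classify a kutils path designation by its shape
    (based only on the examples shown above)."""
    if path.startswith("\\\\?\\Volume{"):
        return "volume path"
    if path.startswith("\\Device\\"):
        return "symbolic link"
    # Storage paths use '#' separators and end with a GUID in braces.
    if path.startswith("SCSI#") and re.search(r"#\{[0-9a-fA-F-]+\}$", path):
        return "storage path"
    if path.startswith("SCSI\\"):
        return "device path"
    return "unknown"
```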
Command Summary<br />
The kutils utility offers the following commands:<br />
• disable: Removes host access to the specified device or volume (Windows only).<br />
• enable: Restores host access to a specified device or volume (Windows only).<br />
• flushFS: Initiates an operating system flush of the file system (Windows only).<br />
• manage_auto_host_info_collection: Indicates whether the automatic host<br />
information collection is enabled or disabled, or enables or disables automatic host<br />
information collection.<br />
• mount: Mounts a file system (Windows only).<br />
• rescan: Scans storage for all existing disks (Windows only).<br />
• showDevices: Presents a list of physical devices to which the host has access,<br />
providing (as available) the device path, storage path, and symbolic link for each<br />
device (Windows only).<br />
• showFS: Presents the drive designation and, as available, the device path, storage<br />
path, and symbolic link for each mounted physical device (Windows only).<br />
• show_vol_info: Presents information on the specified volume, including the Unisys<br />
<strong>SafeGuard</strong> 30m solution name (if “created” in Unisys <strong>SafeGuard</strong> <strong>Solutions</strong>), size, and<br />
storage path.<br />
• show_vols: Presents information on all volumes to which the host has access,<br />
including the Unisys <strong>SafeGuard</strong> 30m solution name (if “created” in Unisys<br />
<strong>SafeGuard</strong> <strong>Solutions</strong>), size, and storage path.<br />
• sqlRestore: Restores an image previously created by the sqlSnap command<br />
(Windows only).<br />
• sqlSnap: Creates a VDI-based SQL Server image (Windows only).<br />
• start: Resumes the splitting of write operations.<br />
• stop: Discontinues the splitting of write operations to an RA (that is, places the host<br />
splitter in pass-through mode in which data is written to storage only).<br />
• umount: Unmounts the file system (Windows only).<br />
Appendix I<br />
Analyzing Cluster Logs<br />
Samples of cluster log messages for problems and situations are listed throughout this<br />
guide. You can search on text strings from cluster log messages to find specific<br />
references.<br />
The information gathered in cluster logs is critical in determining the cause of a given<br />
cluster problem. Without the diagnostic information from the cluster logs, you might find<br />
it difficult to determine the root cause of a cluster problem.<br />
This appendix provides information to help you use the cluster log as a diagnostic tool.<br />
Introduction to Cluster Logs<br />
The cluster log is a text log file updated by the Microsoft Cluster Service (MSCS) and its<br />
associated cluster resource. The cluster log contains diagnostic messages about cluster<br />
events that occur on an individual cluster member or node. This file provides more<br />
detailed information than the cluster events written in the system event log.<br />
A cluster log reports activity for one node. All member nodes in a cluster perform as a<br />
single unit. Therefore, when a problem occurs, it is important to gather log information<br />
from all member nodes in the cluster. This information gathering is typically done using<br />
the Microsoft MPS Report Utility. Gather the information immediately after a problem<br />
occurs to ensure cluster log data is not overwritten.<br />
By default, the cluster log name and location are as follows:<br />
• C:\Winnt\Cluster\cluster.log<br />
Note: For Windows 2003, the cluster.log file is located in the following path:<br />
C:\WINDOWS\Cluster<br />
• Captured with MPS Report Utility: _Cluster.log<br />
Creating the Cluster Log<br />
In Windows 2000 Advanced Server and Windows 2000 Datacenter Server, by default,<br />
cluster logging is enabled on all nodes. You can define the characteristics and behavior of<br />
the cluster log with system environment variables.<br />
To access the system environment variables, perform the following actions:<br />
1. In Control Panel, double-click System.<br />
2. Select the Advanced tab.<br />
3. Click Environment Variables.<br />
You can find additional information regarding the system environment variables in<br />
Microsoft Knowledge Base article 168801, “How to Turn On Cluster Logging in Microsoft<br />
Cluster Server” at this URL:<br />
http://support.microsoft.com/default.aspx?scid=kb;en-us;168801<br />
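A small sketch of the "default still in effect" rule: read each variable from the environment, falling back to the defaults from Table I–1 when it is not set. Values are kept as strings, as environment variables are; the function name is invented for the example.

```python
import os

# Defaults from Table I–1; a variable that is not present in the
# environment still has its default value in effect.
CLUSTER_LOG_DEFAULTS = {
    "ClusterLog": r"%SystemRoot%\Cluster\Cluster.log",
    "ClusterLogSize": "8",       # MB; 64 MB is the recommended setting
    "ClusterLogLevel": "2",      # errors and warnings
    "ClusterLogOverwrite": "0",  # disabled
}

def effective_cluster_settings(env=None):
    """Return the cluster log settings actually in effect."""
    env = os.environ if env is None else env
    return {name: env.get(name, default)
            for name, default in CLUSTER_LOG_DEFAULTS.items()}
```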
The default cluster settings are listed in Table I–1. Some parameters might not be listed<br />
when viewing the system environment variables. If a variable is not listed, its default<br />
value is still in effect.<br />
Table I–1. System Environment Variables Related to Clustering<br />
Variable Name Default Setting Comment<br />
ClusterLog %SystemRoot%<br />
\Cluster\Cluster.log<br />
Determines the location and name<br />
of cluster log file.<br />
ClusterLogSize 8 MB Determines the size of the cluster<br />
log. The default size is usually not<br />
large enough to retain history on<br />
enterprise systems. The<br />
recommended setting is 64 MB.<br />
ClusterLogLevel 2 Sets the level of detail for log<br />
entries, as follows:<br />
0 = No logging<br />
1 = Errors only<br />
2 = Errors and Warnings<br />
3 = Everything that occurs<br />
Used only with the /debug<br />
parameter on MSCS startup.<br />
Review Microsoft Knowledge Base<br />
article 258078 for more information<br />
about using the /debug parameter.<br />
ClusterLogOverwrite 0 Determines whether a new cluster<br />
log is to be created when MSCS<br />
starts, as follows:<br />
0 = Disabled<br />
1 = Enabled<br />
Note: By default, the<br />
ClusterLogOverwrite setting is<br />
disabled. Unisys recommends that<br />
this setting remain disabled. When<br />
this setting is enabled, all cluster<br />
log history is lost if MSCS is<br />
restarted twice in succession.<br />
Understanding the Cluster Log Layout<br />
Figure I–1 illustrates the layout of the cluster log: each entry begins with a process ID,<br />
a thread ID, the date, and the GMT time. The paragraphs following the figure explain<br />
the various parts of the layout.<br />
Figure I–1. Layout of the Cluster Log<br />
The process ID is the process number assigned by the operating system to a service or<br />
application.<br />
The thread ID is a thread of a particular process. A process typically has multiple threads<br />
listed. Within a large cluster log, it is particularly useful to search by thread ID to find the<br />
messages related to the same thread.<br />
The date listed is the date of the entry. You can use this date to match the date of the<br />
problem in the system event log.<br />
The time entered in the Windows 2000 cluster log is always in Greenwich Mean Time<br />
(GMT). The format of the entry is HH:MM:SS.SSS. The SS.SSS entry represents<br />
seconds carried out to thousandths of a second. There can be multiple .SSS entries<br />
for the same thousandth of a second; therefore, more than 999 cluster log entries can<br />
exist for any given second.<br />
Cluster Module<br />
Table I–2 lists the various modules of MSCS. These module names are logged within<br />
square brackets in the cluster log.<br />
Table I–2. Modules of MSCS<br />
API API <strong>Support</strong><br />
ClMsg Cluster messaging<br />
ClNet Cluster network engine<br />
CP Checkpoint Manager<br />
CS Cluster service<br />
DM Database Manager<br />
EP Event Processor<br />
FM Failover Manager<br />
GUM Global Update Manager<br />
INIT Initialization<br />
JOIN Join<br />
LM Log Manager<br />
MM Membership Manager<br />
NM Node Manager<br />
OM Object Manager<br />
RGP Regroup<br />
RM Resource Monitor<br />
For additional descriptions of the cluster components, refer to the Windows 2000 Server<br />
Resource Kit at this URL:<br />
http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/default.mspx?mfr=true<br />
In the left navigation pane, click Windows 2000 Server Resource Kit, click<br />
Distributed Systems <strong>Guide</strong>, then Enterprise Technologies, and then<br />
Interpreting the Cluster Log.<br />
Click the following link for Windows 2003 to refer to the Windows 2003 Server Resource<br />
Kit:<br />
http://www.microsoft.com/windowsserver2003/techinfo/reskit/tools/default.mspx<br />
Click the following link to interpret the cluster logs:<br />
http://technet2.microsoft.com/windowsserver/en/library/16eb134d-584e-46d9-9bf4-6836698cd26a1033.mspx?mfr=true<br />
Sample Cluster Log<br />
The sample cluster log that follows illustrates the component names in brackets.<br />
00000848.00000ba0::2008/05/05-16:11:31.000 [RGP] Node 1: REGROUP INFO:<br />
regroup engine requested immediate shutdown.<br />
00000848.00000ba0::2008/05/05-16:11:31.000 [NM] Prompt shutdown is requested<br />
by a membership engine<br />
00000adc.00000acc::2008/05/05-16:11:31.234 [RM] Going away, Status = 1,<br />
Shutdown = 0.<br />
Cluster Operation<br />
The cluster operation is the task currently being performed by the cluster. Each cluster<br />
module (listed in Table I–2) can perform hundreds of operations, such as forming a<br />
cluster, joining a cluster, checkpointing, moving a group manually, and moving a group<br />
because of a failure.<br />
Posting Information to the Cluster Log<br />
The cluster log file is organized by date and time. Process threads of MSCS and<br />
resources post entries in an intermixed fashion. As the threads are performing various<br />
cluster functions, they constantly post entries to the cluster log in an interspersed<br />
manner.<br />
The following sample cluster log shows various disks in the process of coming online.<br />
The entries are not logically grouped by disk; rather, the entries are logged as each<br />
thread posts its unique information. In each entry, the second hexadecimal field (after<br />
the period) is the thread ID.<br />
Sample Cluster Log<br />
00000444.00000600::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb] Issuing GetSectorSize on signature 9a042144.<br />
00000444.000005e0::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb]Successful read (sector 12) [:0] (0,00000000:00000000).<br />
00000444.00000608::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb]DisksOpenResourceFileHandle: CreateFile successful.<br />
00000444.00000600::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb] GetSectorSize completed, status 0.<br />
00000444.00000608::2008/11/18-18:23:48.307 Physical Disk :<br />
DiskArbitration must be called before DisksOnline.<br />
00000444.00000600::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb] ArbitrationInfo.SectorSize is 512<br />
00000444.00000608::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb] Arbitration Parameters (1 9999).<br />
00000444.00000600::2008/11/18-18:23:48.307 Physical Disk :<br />
[DiskArb] Issuing GetPartInfo on signature 9a042144.<br />
Because the cluster performs many operations simultaneously, the log entries pertaining<br />
to a particular thread are interwoven with those of the other cluster<br />
operations. Depending on the number of cluster groups and resources, reading a cluster<br />
log can become difficult.<br />
Tip: To follow a particular operation, search by the thread ID. For instance, to follow<br />
online events for Physical Disk V, perform these steps using the preceding sample<br />
cluster log:<br />
1. Anchor the cursor in the desired area.<br />
2. Search up or down for thread 00000600.<br />
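That manual search can also be scripted. The sketch below splits each entry into its fields, using a pattern inferred from the samples in this appendix, and keeps only the lines for one thread (wrapped continuation lines are not re-joined in this sketch).

```python
import re

# Entry shape inferred from the samples above:
# <process>.<thread>::<yyyy/mm/dd>-<hh:mm:ss.sss> <message>
ENTRY = re.compile(
    r"^(?P<process>[0-9a-f]{8})\.(?P<thread>[0-9a-f]{8})"
    r"::(?P<date>\d{4}/\d{2}/\d{2})-(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s*"
    r"(?P<message>.*)$"
)

def entries_for_thread(log_text, thread_id):
    """Return the parsed entries posted by one thread."""
    hits = []
    for line in log_text.splitlines():
        m = ENTRY.match(line)
        if m and m.group("thread") == thread_id:
            hits.append(m.groupdict())
    return hits
```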
Diagnosing a Problem Using Cluster Logs<br />
The following topics provide you with useful information for diagnosing problems using<br />
cluster logs:<br />
• Gathering Materials<br />
• Opening the Cluster Log<br />
• Converting GMT/UTC to Local Time<br />
• Converting Cluster Log GUIDs to Text Resource Names<br />
• Understanding State Codes<br />
• Understanding Persistent State<br />
• Understanding Error and Status Codes<br />
Gathering Materials<br />
You need to gather the following pieces of information, tools, and files to use with the<br />
cluster logs to diagnose problems:<br />
• Information<br />
− Date and time of problem occurrence<br />
− Server time zone<br />
• Tools<br />
− Notepad or Wordpad text viewer<br />
− Net Helpmsg command-line tool<br />
This command-line tool is embedded in Windows. The command syntax is Net Helpmsg nnn, where nnn is the error or status code.<br />
• Output from the MPS Report Utility from all cluster nodes<br />
• Files from the MPS Report Utility run<br />
− Cluster log (Mandatory)<br />
The file name is servername_Cluster.log, where servername is the name of the cluster node.<br />
− System event log (Mandatory)<br />
The file name is servername_Event_Log_System.txt.<br />
− .nfo system information file for installed adapters and driver versions (Reference)<br />
The file name is servername_Msinfo.nfo.<br />
− Cluster registry hive for cross-referencing information used in the cluster log<br />
(Reference)<br />
The file name is servername_Cluster_Registry.hiv.<br />
− Cluster configuration file for a basic listing of cluster nodes, groups, resources,<br />
and dependencies (available in MPS Report Utility version 7.2 or later)<br />
The file name is servername_Cluster_mps_Information.txt.<br />
Opening the Cluster Log<br />
Use a text editor to view the cluster log file in the MPS Report Utility. Notepad or<br />
Wordpad works well. Notepad allows text searches up or down the document. Wordpad<br />
allows text searches only down the document.<br />
Note: Do not open the cluster.log file on a production cluster. Logging stops while the<br />
file is open. Instead, copy the cluster.log file first and then open the copy to read the file.<br />
The cluster log is on the local system in the directory Winnt/Cluster/Cluster.log.<br />
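The copy-then-read rule in the note above can be scripted. This Python sketch works on illustrative temporary paths; on a cluster node the source would be the Cluster.log file in the location named above:<br />

```python
import os
import shutil
import tempfile

def snapshot_cluster_log(src, dest_dir):
    """Copy the live cluster log aside so only the copy is ever opened
    (opening the original on a production cluster stops logging)."""
    dest = os.path.join(dest_dir, "cluster.log.copy")
    shutil.copyfile(src, dest)
    return dest

# Illustrative stand-in for the real Winnt/Cluster/Cluster.log path.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "cluster.log")
with open(src, "w") as f:
    f.write("00000444.00000600::sample entry\n")

copy = snapshot_cluster_log(src, workdir)
with open(copy) as f:
    print(f.read(), end="")  # 00000444.00000600::sample entry
```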
Converting GMT/UTC to Local Time<br />
The time posted in the cluster log is given as GMT/UTC. You must convert GMT/UTC to<br />
local time to cross-reference cluster log entries with system and application event<br />
log entries.<br />
You can find the local time zone in the .nfo file in MPS Reports under the system summary.<br />
You can also use the Web site www.worldtimeserver.com to find the accurate local time for a<br />
given city, the GMT/UTC time, and the difference between the two in hours.<br />
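For a handful of entries, the conversion can also be done programmatically. This Python sketch parses the cluster log timestamp format and applies the server's zone offset; the -5 hour offset below is an illustrative value, not something taken from this guide:<br />

```python
from datetime import datetime, timedelta

def log_time_to_local(stamp, utc_offset_hours):
    """Convert a cluster log timestamp (posted in GMT/UTC) to local time.

    stamp uses the cluster log format, e.g. "2008/11/18-18:23:48.307".
    utc_offset_hours is the server's zone offset; -5 (U.S. Eastern
    Standard Time) is used here only as an example.
    """
    utc = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f")
    return utc + timedelta(hours=utc_offset_hours)

local = log_time_to_local("2008/11/18-18:23:48.307", -5)
print(local.strftime("%Y/%m/%d-%H:%M:%S.%f")[:-3])  # 2008/11/18-13:23:48.307
```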
Converting Cluster Log GUIDs to Text Resource Names<br />
A globally unique identifier (GUID) is a 32-character hexadecimal string used to identify a<br />
unique entity in the cluster. A unique entity can be a node name, group name, resource<br />
name, or cluster name.<br />
The GUID format is nnnnnnnn-nnnn-nnnn-nnnn-nnnnnnnnnnnn.<br />
The following are examples of GUIDs in the cluster log:<br />
000007d0.00000808::2008/04/23-21:48:23.105 [FM] FmpHandleResourceTransition: resource<br />
Name = ae775058-af20-4ba2-a911-af138b1f65bd old state=130 new state=3<br />
000007d0.00000808::2008/04/23-21:48:23.448 [FM] FmpRmOfflineResource: RMOffline() for<br />
6060dc33-5737-4277-b2f2-9cc45629ef0 returned error 997<br />
000007d0.00001970::2008/05/02-21:41:58.846 [FM] OnlineResource: e65bc275-66d1-41ff-<br />
8a4e-89ad6643838b depends on 758bb9bb-7d1f-4148-a994-684dd4f8c969. Bring online<br />
first.<br />
000007d0.0000081::2008/05/04-17:21:06.888 [FM] New owner of Group b072608c-b7f3-48b0-<br />
83f8-7c922c14e709 is 2, state 0, curstate 1.<br />
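A regular expression for the GUID format above can pull every GUID out of a log entry. This Python sketch is illustrative and uses one of the sample entries above; it assumes lowercase hexadecimal, as in the examples:<br />

```python
import re

# The GUID format described above: 8-4-4-4-12 hexadecimal digits.
GUID = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b")

entry = ("000007d0.00001970::2008/05/02-21:41:58.846 [FM] OnlineResource: "
         "e65bc275-66d1-41ff-8a4e-89ad6643838b depends on "
         "758bb9bb-7d1f-4148-a994-684dd4f8c969. Bring online first.")

print(GUID.findall(entry))
# ['e65bc275-66d1-41ff-8a4e-89ad6643838b', '758bb9bb-7d1f-4148-a994-684dd4f8c969']
```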
Mapping a Text Name to a GUID<br />
The two methods for mapping a text name to a GUID are<br />
• Automatic mapping<br />
• Reviewing the cluster registry hive<br />
Automatic Mapping<br />
The simplest method of mapping a text name to a GUID is the automatic mapping<br />
performed by some versions of the MPS Report tool. However, most versions of the<br />
MPS Report tool do not perform this automatic function.<br />
For those versions with the automatic mapping feature, you can find the information in<br />
the cluster configuration file (servername_Cluster_Mps_Information.txt). The<br />
following listing shows this mapping:<br />
f9f0b528-b674-40fb-9770-c65e17a2a387 = SQL Network Name<br />
f0dd1852-acc8-4921-b33a-a77dd5cdcfee = SQL Server Fulltext (SQL1)<br />
f0aca2c4-049f-4255-9332-92a69cc07326 = MSDTC<br />
eff360f3-d987-4a020-8f3c-4118056a50b2 = MSDTC IP Address<br />
e74769f8-67e1-43b2-9bec-93171c31d182 = SQL IP Address 1<br />
e09f61cf-8ebf-4cd1-9ae3-58ed4d2b0fbc = Disk K:<br />
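A listing in this form is easy to turn into a lookup table. The following Python sketch parses three of the lines shown above into a GUID-to-name dictionary:<br />

```python
# Three lines from the configuration-file listing shown above.
listing = """\
f9f0b528-b674-40fb-9770-c65e17a2a387 = SQL Network Name
f0aca2c4-049f-4255-9332-92a69cc07326 = MSDTC
e09f61cf-8ebf-4cd1-9ae3-58ed4d2b0fbc = Disk K:"""

# Build a GUID -> resource-name lookup table.
names = {}
for row in listing.splitlines():
    guid, _, name = row.partition(" = ")
    names[guid.strip()] = name.strip()

print(names["f0aca2c4-049f-4255-9332-92a69cc07326"])  # MSDTC
```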
Reviewing the Cluster Registry Hive<br />
The second method of mapping a text name to a GUID is more complex and involves<br />
opening the cluster registry hive from the MPS Report tool and then reviewing the<br />
contents.<br />
Follow these steps to open and review the cluster registry hive:<br />
1. Start the Registry Editor (Regedt32.exe).<br />
2. Click the HKEY_LOCAL_MACHINE hive.<br />
3. Click the HKEY_LOCAL_MACHINE root folder.<br />
4. Click Load Hive on the Registry menu.<br />
5. Select the servername_Cluster_Registry.hiv file; then press Ctrl-C.<br />
6. Select Open.<br />
7. Press Ctrl-V to obtain the key name.<br />
8. Expand the cluster hive and review the GUIDS, which are located in the subkeys<br />
Groups, Resources, Networks, and NetworkInterfaces, as shown in Figure I–2.<br />
Figure I–2. Expanded Cluster Hive (in Windows 2000 Server)<br />
Scroll through the GUIDs until you find the one that matches the GUID from the<br />
cluster log. You can also open each key until you find the matching GUID.<br />
Tip: Under each GUID is a TYPE field. This field identifies a resource type such as<br />
physical disk, IP address, network name, generic application, generic service, and so<br />
forth. You can use this field to find a specific resource type and then map it to the GUID.<br />
Understanding State Codes<br />
MSCS uses state codes to determine the status of a cluster component. The state varies<br />
depending on the type of cluster component: nodes, groups, resources,<br />
networks, and network interfaces. Some state codes are posted in the cluster<br />
log using the numeric code and others using the actual value for the code.<br />
Examples of State Codes in the Cluster Log<br />
The following example entries show state codes for the resource, group, network<br />
interface, node, and network types of cluster component:<br />
• Resource<br />
In this example, the resource is changing states from online pending (129) to online<br />
(2).<br />
00000850.00000888::2008/05/05-17:37:29.125 [FM] FmpHandleResource<br />
Transition: Resource Name = 87e55402-87cb-4354-95e7-6dd864b79039 old state =<br />
129 new state=2<br />
• Group<br />
In this example, the group state is set to offline (1).<br />
00000898.000008a0::2008/05/05-06:25:55:062 [FM] Setting group 1951e272-6271-<br />
4ea3-b0f9-cd767537f245 owner to node 2, state 1<br />
• Network interface<br />
This example provides the actual value of the state code, not the numeric code.<br />
00000898.00000598:2008/05/05-06:28:40;921 [ClMsg] Received interface<br />
unreachable event for node 2 network 1<br />
• Node<br />
This example provides the actual value of the state code, not the numeric code.<br />
00000898.0000060c::2008/05/05-06:28:45:953 [EP] Node down event received<br />
00000898.000008a8:2008/05/05-06:28:45:953 [Gum] Nodes down: 0002. Locker=1,<br />
Locking=1<br />
• Network<br />
This example provides the actual value of the state code, not the numeric code.<br />
00000898.000008a4::2008/05/05-06:25:53:703 [NM] Processing local interface<br />
up event for network 0433c4e2-a577-4325-9ebd-a9d3d2b9b81f.<br />
State Codes<br />
Table I–3 lists the state codes from the Windows 2000 Resource Kit for nodes.<br />
Table I–3. Node State Codes<br />
State Code State<br />
–1 ClusterNodeStateUnknown<br />
0 ClusterNodeUp<br />
1 ClusterNodeDown<br />
2 ClusterNodePaused<br />
3 ClusterNodeJoining<br />
Table I–4 lists the state codes from the Windows 2000 Resource Kit for groups.<br />
Table I–4. Group State Codes<br />
State Code State<br />
–1 ClusterGroupStateUnknown<br />
0 ClusterGroupOnline<br />
1 ClusterGroupOffline<br />
2 ClusterGroupFailed<br />
3 ClusterGroupPartialOnline<br />
Table I–5 lists the state codes from the Windows 2000 Resource Kit for resources.<br />
Table I–5. Resource State Codes<br />
State Code State<br />
–1 ClusterResourceStateUnknown<br />
0 ClusterResourceInherited<br />
1 ClusterResourceInitializing<br />
2 ClusterResourceOnline<br />
3 ClusterResourceOffline<br />
4 ClusterResourceFailed<br />
128 ClusterResourcePending<br />
129 ClusterResourceOnlinePending<br />
130 ClusterResourceOfflinePending<br />
Table I–6 lists the state codes from the Windows 2000 Resource Kit for network<br />
interfaces.<br />
Table I–6. Network Interface State Codes<br />
State Code State<br />
–1 ClusterNetInterfaceStateUnknown<br />
0 ClusterNetInterfaceUnavailable<br />
1 ClusterNetInterfaceFailed<br />
2 ClusterNetInterfaceUnreachable<br />
3 ClusterNetInterfaceUp<br />
Table I–7 lists the state codes from the Windows 2000 Resource Kit for networks.<br />
Table I–7. Network State Codes<br />
State Code State<br />
–1 ClusterNetworkStateUnknown<br />
0 ClusterNetworkUnavailable<br />
1 ClusterNetworkDown<br />
2 ClusterNetworkPartitioned<br />
3 ClusterNetworkUp<br />
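The tables above can be encoded as a lookup table to decode transitions mechanically. This Python sketch uses the resource state codes from Table I–5 and the sample [FM] transition entry shown earlier; the helper name is illustrative:<br />

```python
import re

# Resource state codes from Table I-5.
RESOURCE_STATES = {
    -1: "ClusterResourceStateUnknown",
    0: "ClusterResourceInherited",
    1: "ClusterResourceInitializing",
    2: "ClusterResourceOnline",
    3: "ClusterResourceOffline",
    4: "ClusterResourceFailed",
    128: "ClusterResourcePending",
    129: "ClusterResourceOnlinePending",
    130: "ClusterResourceOfflinePending",
}

def decode_transition(entry):
    """Decode the 'old state = 129 new state=2' fields of an [FM] entry."""
    m = re.search(r"old state\s*=\s*(\d+)\s+new state\s*=\s*(\d+)", entry)
    old, new = (int(x) for x in m.groups())
    return RESOURCE_STATES[old], RESOURCE_STATES[new]

entry = ("[FM] FmpHandleResourceTransition: Resource Name = "
         "87e55402-87cb-4354-95e7-6dd864b79039 old state = 129 new state=2")
print(decode_transition(entry))
# ('ClusterResourceOnlinePending', 'ClusterResourceOnline')
```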
Understanding Persistent State<br />
Persistent state is not a state code, but rather a key in the cluster registry hive for groups<br />
and resources. The persistent state key reflects the current state of a resource or group.<br />
This key is not a permanent value; it changes value when a group or resource changes<br />
states.<br />
You can change the value of the persistent state key, which can be useful for<br />
troubleshooting or managing the cluster. For example, you can change the value before a<br />
manual failover or shutdown to prevent a particular group or resource from starting<br />
automatically.<br />
The value for the persistent state can be 0 (disabled or offline) or 1 (enabled or online).<br />
The default value is 1.<br />
If the value for persistent state is 0, the group or resource remains in an offline state<br />
until it is manually brought online.<br />
The following is an example cluster log reference to persistent state:<br />
000008bc.00000908::2008/05/12-23:45:36.687 [FM] FmpPropagateGroupState:<br />
Group 1951e272-6271-4ea3-b0f9-cd767537f245 state = 3, persistent state = 1<br />
For more information about persistent state, view Microsoft Knowledge Base article<br />
259243, “How to Set the Startup Value for a Resource on a Clustered Server” at this<br />
URL:<br />
http://support.microsoft.com/default.aspx?scid=kb;en-us;259243<br />
Understanding Error and Status Codes<br />
You can easily interpret error and status codes that occur in cluster log entries by issuing<br />
the following command from the command line, where nnn is the error or status code:<br />
Net Helpmsg nnn<br />
This command returns a line of explanatory text that corresponds to the number.<br />
Examples<br />
• For the error code value of 5 as shown in the following example, the Net Helpmsg<br />
command returns “Access is denied.”<br />
00000898.000008f0:2008/30-16:03:31.979 [DM] DmpCheckpointTimerCb -Failed to<br />
reset log, error=5<br />
• For the status code value of 997 as shown in the following example, the Net<br />
Helpmsg command returns “Overlapped I/O operation is in progress.” This status<br />
code is also known as “I/O pending.”<br />
00000898.00000a8c::2008/05/05-06:38:14.187 [FM] FmpOnlineResource: Returning<br />
Resource 87e55402-87cb-4354-95e7-6dd864b79039, state 129, status 997<br />
• For the status code value of 170 as shown in the following example, the Net<br />
Helpmsg command returns “The requested resource is in use.”<br />
000009a4.000009c4::2008/05/15-07:28:42.303 Physical Disk :[DiskArb]<br />
CompletionRoutine, status 170<br />
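When the log is being read away from a Windows machine, the messages that Net Helpmsg returns for the codes in these examples can be kept in a small table. This Python sketch is illustrative and covers only the three codes discussed above:<br />

```python
import re

# Messages that Net Helpmsg returns for the three codes discussed above.
HELPMSG = {
    5: "Access is denied.",
    170: "The requested resource is in use.",
    997: "Overlapped I/O operation is in progress.",
}

def explain(entry):
    """Look up the trailing error/status number of a cluster log entry."""
    m = re.search(r"(?:error|status)\s*=?\s*(\d+)\s*$", entry)
    return HELPMSG.get(int(m.group(1)), "unknown code") if m else "no code found"

print(explain("[DM] DmpCheckpointTimerCb -Failed to reset log, error=5"))
# Access is denied.
```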
Index<br />
A<br />
accessing an image, 3-2<br />
analyzing<br />
intelligent fabric switch logs, A-16<br />
RA log collection files, A-8<br />
server (host) logs, A-16<br />
B<br />
bandwidth, verifying, D-7<br />
bin directory, A-14<br />
C<br />
changes for this release, 1-2<br />
clearing the system event log (SEL), B-1<br />
ClearPath MCP<br />
bringing data consistency group online, 3-5<br />
manual failover, 3-5<br />
recovery tasks, 3-5<br />
CLI file, A-10<br />
clock synchronization, verifying, D-8<br />
cluster failure, recovering, 4-19<br />
cluster log<br />
cluster registry hive, I-9<br />
definition, I-1<br />
error and status codes, I-15<br />
GUID format, I-8<br />
GUIDs, I-8<br />
layout, I-3<br />
mapping GUID to text name, I-8<br />
name and location, I-1<br />
opening, I-7<br />
overview, 2-9<br />
persistent state, I-14<br />
state codes, I-10, I-12<br />
cluster registry hive, I-9<br />
cluster service modules, I-4<br />
cluster settings<br />
system environment variables, I-2<br />
cluster setup, checking, 4-1<br />
collecting host logs<br />
using host information collector (HIC)<br />
utility, A-7<br />
using MPS utility, A-6<br />
collecting RA logs, A-1, A-3<br />
Collector (See Unisys <strong>SafeGuard</strong> 30m<br />
Collector)<br />
collector directory, A-11<br />
configuration settings, saving, D-2<br />
configuring additional RAs, D-4<br />
configuring the replacement RA, D-6<br />
connecting, accessing the replacement<br />
RA, D-4<br />
connectivity testing tool messages, C-8<br />
converting local time to GMT or UTC, A-3<br />
D<br />
data consistency group<br />
bringing online, 3-3, 4-9<br />
bringing online for ClearPath MCP, 3-5<br />
manual failover, 3-2, 4-8<br />
manual failover for ClearPath MCP, 3-5<br />
recovery tasks, 3-2, 3-5, 4-7<br />
recovery tasks for ClearPath MCP, 3-5<br />
taking offline, 4-7, 5-9<br />
data flow, overview, 2-3<br />
detaching the failed RA, D-3<br />
determining when the failure occurred, A-2<br />
diagnostics<br />
Installation Manager, C-1<br />
RA hardware, B-2<br />
directory<br />
bin, A-14<br />
collector, A-11<br />
etc, A-11<br />
files, A-11<br />
home, A-11, A-14<br />
host log extraction, A-15<br />
InfoCollect, A-12<br />
processes, A-12<br />
rreasons, A-11
sbin, A-12<br />
tmp, A-14<br />
usr, A-13<br />
E<br />
e-mail notifications<br />
configuring a diagnostic e-mail<br />
notification, 2-8<br />
overview, 2-8<br />
enabling PCI-X slot functionality, D-5<br />
environment settings, restoring, D-2<br />
etc directory, A-11<br />
event log, E-1<br />
displaying, E-3<br />
event levels, E-2<br />
event scope, E-2<br />
event topics, E-1<br />
list of Detailed events, E-22<br />
list of Normal events, E-5<br />
overview, 2-7<br />
using for troubleshooting, E-3<br />
events<br />
event log, E-1<br />
understanding, E-1<br />
events that cause journal distribution, 2-10<br />
F<br />
Fabric Splitter, 2-4<br />
Fibre Channel diagnostics<br />
detecting Fibre Channel LUNs, C-13<br />
detecting Fibre Channel Scsi3 Reserved<br />
LUNs, C-15<br />
detecting Fibre Channel targets, C-12<br />
performing I/O to LUN, C-15<br />
running SAN diagnostics, C-9<br />
viewing Fibre Channel details, C-11<br />
Fibre Channel HBA LEDs<br />
location, 8-12<br />
files directory, A-11<br />
full-sweep initialization, 4-4<br />
G<br />
geographic clustered environment<br />
basic configuration diagram, 2-2<br />
definition, 2-1<br />
overview, 2-2<br />
recovery from total failure of one site, 4-19<br />
geographic replication environment, 2-1<br />
definition, 2-1<br />
server failure, 9-20<br />
total storage loss, 5-13<br />
GMT<br />
converting local time to, A-3<br />
example of local time conversion, A-3<br />
group initialization effects on move-group<br />
operation, 4-3<br />
H<br />
HIC (See host information collector (HIC)<br />
utility)<br />
high load<br />
disk manager reports, 10-4<br />
general description, 10-3<br />
home directory, A-11, A-14<br />
host information collector (HIC) utility<br />
overview, 2-9<br />
using, A-7<br />
host logs collection<br />
using host information collector (HIC)<br />
utility, A-7<br />
using MPS utility, A-6<br />
I<br />
InfoCollect directory, A-12<br />
initialization<br />
from marking mode, 4-5<br />
full sweep, 4-4<br />
long resynchronization, 4-4<br />
initiate_failover command, 4-6<br />
Installation Manager<br />
diagnostics, 2-9<br />
Diagnostics menu, 8-17, 8-21, C-2<br />
steps to run, C-2<br />
Installation Manager diagnostics<br />
collect system info, C-18<br />
Fibre Channel diagnostics, C-9<br />
IP diagnostics, C-2<br />
synchronization diagnostics, C-17<br />
installing and configuring the replacement<br />
RA, D-4<br />
IP diagnostics<br />
port diagnostics, C-5<br />
site connectivity tests, C-3<br />
system connectivity, C-6, C-7
test throughput, C-4<br />
view IP details, C-3<br />
view routing table, C-4<br />
K<br />
kutils<br />
command summary, H-2<br />
overview, 2-10<br />
path designations, H-1<br />
string, H-1<br />
using, H-1<br />
L<br />
Local Replication by CDP, 2-5<br />
log extraction directory<br />
host, A-15<br />
RA, A-9<br />
log file, A-10<br />
long resynchronization, 4-4<br />
M<br />
management console<br />
locked user, 8-4<br />
RA attached to cluster, 8-4<br />
understanding access, 8-4<br />
manual failover<br />
data consistency group, 3-2, 4-8<br />
performing, 4-7<br />
performing with data consistency group<br />
(older image), 4-8<br />
quorum consistency groups, 4-14, 4-23<br />
manual failover for ClearPath MCP<br />
data consistency group, 3-5<br />
manual failover of volumes and data<br />
consistency groups<br />
accessing an image, 3-2<br />
marking mode, initializing from, 4-5<br />
MIB<br />
OID Unisys, F-1<br />
RA file, F-3<br />
MIB II, F-1<br />
Microsoft Cluster Service, 2-1<br />
modifying the Preferred RA setting, D-3<br />
move group operation, initialization<br />
effects, 4-3<br />
MPS utility, A-6<br />
MSCS (See Microsoft Cluster Service)<br />
MSCS properties, checking, 4-1<br />
N<br />
network bindings<br />
checking, 4-2<br />
cluster specific, 4-3<br />
host network specific, 4-2<br />
network LEDs<br />
location, 8-11<br />
networking problem<br />
cluster node public NIC failure (geographic<br />
clustered environment), 7-3<br />
management network failure (geographic<br />
clustered environment), 7-11<br />
port information, 7-32<br />
private cluster network failure (geographic<br />
clustered environment), 7-22<br />
public or client WAN failure (geographic<br />
clustered environment), 7-6<br />
replication network failure (geographic<br />
clustered environment), 7-15<br />
temporary WAN failures, 7-21<br />
total communication failure (geographic<br />
clustered environment), 7-26<br />
new for this release, 1-2<br />
P<br />
parameters file, A-9<br />
performance problem<br />
failover time lengthens, 10-5<br />
high load<br />
disk manager, 10-4<br />
distributer, 10-5<br />
slow initialization, 10-2<br />
persistent state key, I-14<br />
port information, 7-32<br />
processes directory, A-12<br />
Q<br />
quorum consistency group<br />
manual failover, 4-14, 4-23
R<br />
RA problem<br />
all RAs at one site fail, 8-25<br />
all RAs not attached, 8-27<br />
all SAN Fibre Channel HBAs fail, 8-14<br />
onboard management network adapter<br />
fails, 8-23<br />
onboard WAN network adapter fails, 8-19<br />
optional Gigabit Fibre Channel WAN<br />
network adapter fails, 8-19<br />
reboot regulation failover, 8-12<br />
single hard disk fails, 8-24<br />
single RA failure, 8-4<br />
single RA failures with switchover, 8-5<br />
single RA failures without switchover, 8-21<br />
single SAN Fibre Channel HBA on one RA<br />
fails, 8-21<br />
rear panel indicators, 8-11<br />
recording group properties and saving<br />
settings, D-2<br />
recovery<br />
all RAs fail on site, 4-11<br />
from site failure, 4-19<br />
from total failure of one site in geographic<br />
clustered environment, 4-19<br />
site 1 failure with quorum owner located<br />
on site 2, 4-25<br />
site 1 failure with quorum resource owned<br />
by site 1, 4-19<br />
using older image, 4-7<br />
recovery tasks<br />
data consistency group, 3-2, 4-7<br />
data consistency group for ClearPath<br />
MCP, 3-5<br />
reformatting the repository volume, 5-8<br />
removing Fibre Channel host bus<br />
adapters, D-4<br />
replacing an RA, D-1<br />
replication appliance (RA)<br />
analyzing logs from, A-8<br />
collecting logs from, A-1<br />
connecting, accessing, D-4<br />
diagnostics, B-2<br />
LCD status messages, B-4<br />
replacing, D-1<br />
replication, reversing direction, 4-10, 4-15<br />
repository volume<br />
not accessible, 5-6<br />
reformatting, 5-8<br />
restoring environment settings, D-2<br />
restoring failover settings, 4-24<br />
restoring group properties, D-8<br />
resynchronization, long, 4-4<br />
rreasons directory, A-11<br />
runCLI file, A-14<br />
S<br />
<strong>SafeGuard</strong> 30m Control<br />
behavior during move group, 4-5<br />
SAN connectivity problem<br />
RAs not accessible to splitter, 6-12<br />
total SAN switch failure (geographic<br />
clustered environment), 6-17<br />
volume not accessible to RAs, 6-3<br />
volume not accessible to splitter, 6-7<br />
saving configuration settings, D-2<br />
sbin directory, A-12<br />
server problem<br />
cluster node failure (geographic clustered<br />
environment), 9-2<br />
infrastructure (NTP) server fails, 9-18<br />
server crash or restart, 9-12<br />
server failure (geographic replication<br />
environment), 9-20<br />
server HBA fails, 9-17<br />
server unable to connect with SAN, 9-14<br />
unexpected server shutdown because of a<br />
bug check, 9-8<br />
Windows server reboot, 9-3<br />
SNMP traps<br />
configuring and using, F-1<br />
MIB, F-1<br />
resolving issues, F-4<br />
variables and values, F-2<br />
SSH client, using, C-1<br />
state codes, I-10, I-12<br />
storage problem<br />
journal volume not accessible, 5-11<br />
repository volume not accessible, 5-6<br />
storage failure on one site (geographic<br />
clustered environment), 5-16<br />
total storage loss (geographic replicated<br />
environment), 5-13<br />
user or replication volume not<br />
accessible, 5-4<br />
storage-to-RA access, checking, D-5<br />
summary file, A-11<br />
system event log (SEL), clearing, B-1<br />
system status<br />
using CLI commands, 2-8
using the management console, 2-7<br />
T<br />
tar file, A-15<br />
testing FTP connectivity, A-2<br />
tmp directory, A-14<br />
troubleshooting<br />
general procedures, 2-11<br />
recovering from site failure, 4-19<br />
U<br />
Unisys <strong>SafeGuard</strong> 30m Collector, G-1<br />
Collector mode, G-4<br />
adding customer information, G-5<br />
adding scripts, G-12<br />
automatic discovery of RAs, G-4<br />
compressing an SSC file, G-6<br />
configuring component types using<br />
built-ins scripts, G-8<br />
configuring RAs, G-4<br />
configuring SAN switches, G-9<br />
deleting a scheduled SSC file, G-14<br />
deleting scripts, G-12<br />
disabling scripts, G-10<br />
duplicating installation on another<br />
PC, G-6<br />
enabling scripts, G-10<br />
opening an SSC file, G-8<br />
querying a scheduled SSC file, G-14<br />
running all scripts, G-6<br />
running scripts, G-11<br />
scheduling an SSC file, G-13<br />
stopping script execution, G-11<br />
installing, G-1<br />
prior to configuring, G-2<br />
security breach warning, G-3<br />
View mode, G-15<br />
Unisys <strong>SafeGuard</strong> 30m solution<br />
definition, 2-1<br />
unmounting volumes<br />
at production site, 3-4<br />
at remote site, 3-3<br />
unmounting volumes at source site, 3-4<br />
user types, preconfigured for RAs, 2-8<br />
using the SSH client, C-1<br />
using this guide, 1-3<br />
usr directory, A-13<br />
UTC<br />
converting local time to, A-3<br />
example of local time conversion, A-3<br />
V<br />
verify_failover command, 4-6<br />
verifying clock synchronization, D-8<br />
verifying the replacement RA installation, D-7<br />
volumes<br />
unmounting at source site, 3-4<br />
W<br />
WAN bandwidth, verifying, D-7<br />
webdownload/webdownload, 2-8, C-20
© 2008 Unisys Corporation.<br />
All rights reserved.<br />