OpenVMS Cluster Systems - OpenVMS Systems - HP
OpenVMS Cluster Systems

Order Number: AA–PV5WF–TK

June 2002

This manual describes procedures and guidelines for configuring and managing OpenVMS Cluster systems. Except where noted, the procedures and guidelines apply equally to VAX and Alpha computers. This manual also includes information for providing high availability, building-block growth, and unified system management across coupled systems.

Revision/Update Information: This manual supersedes OpenVMS Cluster Systems, OpenVMS Alpha Version 7.3 and OpenVMS VAX Version 7.3.

Software Version: OpenVMS Alpha Version 7.3–1; OpenVMS VAX Version 7.3

Compaq Computer Corporation
Houston, Texas
© 2002 Compaq Information Technologies Group, L.P.

Compaq, the Compaq logo, Alpha, OpenVMS, Tru64, VAX, VMS, and the DIGITAL logo are trademarks of Compaq Information Technologies Group, L.P. in the U.S. and/or other countries. Motif, OSF/1, and UNIX are trademarks of The Open Group in the U.S. and/or other countries. All other product names mentioned herein may be trademarks of their respective companies.

Confidential computer software. Valid license from Compaq required for possession, use, or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor’s standard commercial license.

Compaq shall not be liable for technical or editorial errors or omissions contained herein. The information in this document is provided "as is" without warranty of any kind and is subject to change without notice. The warranties for Compaq products are set forth in the express limited warranty statements accompanying such products. Nothing herein should be construed as constituting an additional warranty.

The Compaq OpenVMS documentation set is available on CD-ROM.

This document was prepared using DECdocument, Version 3.3-1b.
ZK4477
Contents

Preface ............................................................ xix

1 Introduction to OpenVMS Cluster System Management
1.1 Overview .................................................. 1–1
1.1.1 Uses .................................................. 1–2
1.1.2 Benefits ............................................... 1–2
1.2 Hardware Components ....................................... 1–2
1.2.1 Computers ............................................. 1–3
1.2.2 Physical Interconnects ................................... 1–3
1.2.3 OpenVMS Galaxy SMCI .................................. 1–3
1.2.4 Storage Devices ........................................ 1–3
1.3 Software Components ....................................... 1–4
1.3.1 OpenVMS Cluster Software Functions ...................... 1–4
1.4 Communications ........................................... 1–5
1.4.1 System Communications .................................. 1–5
1.4.2 Application Communications .............................. 1–7
1.4.3 Cluster Alias ........................................... 1–7
1.5 System Management ........................................ 1–7
1.5.1 Ease of Management ..................................... 1–7
1.5.2 Tools and Utilities from Compaq ........................... 1–7
1.5.3 System Management Tools from OpenVMS Partners ........... 1–11
1.5.4 Other Configuration Aids ................................. 1–12
2 OpenVMS Cluster Concepts
2.1 OpenVMS Cluster System Architecture ......................... 2–1
2.1.1 Port Layer ............................................. 2–2
2.1.2 SCS Layer ............................................. 2–3
2.1.3 System Applications (SYSAPs) Layer ....................... 2–4
2.1.4 Other Layered Components ............................... 2–4
2.2 OpenVMS Cluster Software Functions .......................... 2–4
2.2.1 Functions .............................................. 2–4
2.3 Ensuring the Integrity of Cluster Membership ................... 2–5
2.3.1 Connection Manager ..................................... 2–5
2.3.2 Cluster Partitioning ..................................... 2–5
2.3.3 Quorum Algorithm ...................................... 2–5
2.3.4 System Parameters ...................................... 2–6
2.3.5 Calculating Cluster Votes ................................. 2–6
2.3.6 Example ............................................... 2–7
2.3.7 Quorum Disk ........................................... 2–7
2.3.8 Quorum Disk Watcher ................................... 2–8
2.3.9 Rules for Specifying Quorum .............................. 2–8
2.4 State Transitions ........................................... 2–9
2.4.1 Adding a Member ....................................... 2–9
2.4.2 Losing a Member ....................................... 2–9
2.5 OpenVMS Cluster Membership ................................ 2–11
2.5.1 Cluster Group Number ................................... 2–12
2.5.2 Cluster Password ....................................... 2–12
2.5.3 Location ............................................... 2–12
2.5.4 Example ............................................... 2–12
2.6 Synchronizing Cluster Functions by the Distributed Lock Manager ... 2–13
2.6.1 Distributed Lock Manager Functions ....................... 2–13
2.6.2 System Management of the Lock Manager ................... 2–14
2.6.3 Large-Scale Locking Applications .......................... 2–14
2.7 Resource Sharing ........................................... 2–14
2.7.1 Distributed File System .................................. 2–15
2.7.2 RMS and Distributed Lock Manager ........................ 2–15
2.8 Disk Availability ........................................... 2–15
2.8.1 MSCP Server ........................................... 2–15
2.8.2 Device Serving ......................................... 2–15
2.8.3 Enabling the MSCP Server ................................ 2–16
2.9 Tape Availability ........................................... 2–16
2.9.1 TMSCP Server ......................................... 2–16
2.9.2 Enabling the TMSCP Server .............................. 2–16
2.10 Queue Availability .......................................... 2–16
2.10.1 Controlling Queues ...................................... 2–17
3 OpenVMS Cluster Interconnect Configurations
3.1 Overview .................................................. 3–1
3.2 OpenVMS Cluster Systems Interconnected by CI .................. 3–1
3.2.1 Design ................................................ 3–2
3.2.2 Example ............................................... 3–2
3.2.3 Star Couplers .......................................... 3–3
3.3 OpenVMS Cluster Systems Interconnected by DSSI ............... 3–3
3.3.1 Design ................................................ 3–3
3.3.2 Availability ............................................ 3–3
3.3.3 Guidelines ............................................. 3–3
3.3.4 Example ............................................... 3–4
3.4 OpenVMS Cluster Systems Interconnected by LANs ............... 3–4
3.4.1 Design ................................................ 3–4
3.4.2 Cluster Group Numbers and Cluster Passwords ............... 3–4
3.4.3 Servers ................................................ 3–5
3.4.4 Satellites .............................................. 3–5
3.4.5 Satellite Booting ........................................ 3–5
3.4.6 Examples .............................................. 3–6
3.4.7 LAN Bridge Failover Process .............................. 3–9
3.5 OpenVMS Cluster Systems Interconnected by MEMORY CHANNEL ... 3–10
3.5.1 Design ................................................ 3–10
3.5.2 Examples .............................................. 3–10
3.6 Multihost SCSI OpenVMS Cluster Systems ...................... 3–12
3.6.1 Design ................................................ 3–12
3.6.2 Examples .............................................. 3–12
3.7 Multihost Fibre Channel OpenVMS Cluster Systems ............... 3–13
3.7.1 Design ................................................ 3–13
3.7.2 Examples .............................................. 3–14
4 The OpenVMS Cluster Operating Environment
4.1 Preparing the Operating Environment .......................... 4–1
4.2 Installing the OpenVMS Operating System ...................... 4–1
4.2.1 System Disks ........................................... 4–1
4.2.2 Where to Install ........................................ 4–2
4.2.3 Information Required .................................... 4–2
4.3 Installing Software Licenses .................................. 4–5
4.3.1 Guidelines ............................................. 4–6
4.4 Installing Layered Products .................................. 4–6
4.4.1 Procedure .............................................. 4–6
4.5 Configuring and Starting a Satellite Booting Service .............. 4–7
4.5.1 Configuring and Starting the LANCP Utility ................. 4–9
4.5.2 Booting Satellite Nodes with LANCP ....................... 4–9
4.5.3 Data Files Used by LANCP ............................... 4–9
4.5.4 Using LAN MOP Services in New Installations ............... 4–9
4.5.5 Using LAN MOP Services in Existing Installations ............ 4–10
4.5.6 Configuring DECnet ..................................... 4–13
4.5.7 Starting DECnet ........................................ 4–15
4.5.8 What is the Cluster Alias? ................................ 4–15
4.5.9 Enabling Alias Operations ................................ 4–16
5 Preparing a Shared Environment<br />
5.1 Shareable Resources ........................................ 5–1
5.1.1 Local Resources ........................................ 5–2
5.1.2 Sample Configuration .................................... 5–2
5.2 Common-Environment and Multiple-Environment Clusters ......... 5–2
5.3 Directory Structure on Common System Disks ................... 5–3
5.3.1 Directory Roots ........................................ 5–4
5.3.2 Directory Structure Example .............................. 5–4
5.3.3 Search Order ........................................... 5–5
5.4 Clusterwide Logical Names ................................... 5–6
5.4.1 Default Clusterwide Logical Name Tables ................... 5–6
5.4.2 Translation Order ....................................... 5–7
5.4.3 Creating Clusterwide Logical Name Tables .................. 5–8
5.4.4 Alias Collisions Involving Clusterwide Logical Name Tables .... 5–8
5.4.5 Creating Clusterwide Logical Names ....................... 5–9
5.4.6 Management Guidelines .................................. 5–10
5.4.7 Using Clusterwide Logical Names in Applications ............. 5–11
5.4.7.1 Clusterwide Attributes for $TRNLNM System Service ....... 5–11
5.4.7.2 Clusterwide Attribute for $GETSYI System Service ......... 5–11
5.4.7.3 Creating Clusterwide Tables with the $CRELNT System Service ... 5–11
5.5 Defining and Accessing Clusterwide Logical Names ............... 5–12
5.5.1 Defining Clusterwide Logical Names in SYSTARTUP_VMS.COM ... 5–12
5.5.2 Defining Certain Logical Names in SYLOGICALS.COM ........ 5–12
5.5.3 Using Conditional Definitions for Startup Command Procedures ... 5–13
5.6 Coordinating Startup Command Procedures ..................... 5–13
5.6.1 OpenVMS Startup Procedures ............................. 5–14
5.6.2 Building Startup Procedures .............................. 5–14
5.6.3 Combining Existing Procedures ............................ 5–15
5.6.4 Using Multiple Startup Procedures ......................... 5–15
5.7 Providing OpenVMS Cluster System Security .................... 5–16
5.7.1 Security Checks ........................................ 5–16
v
5.8 Files Relevant to OpenVMS Cluster Security .................... 5–17
5.9 Network Security ........................................... 5–22
5.9.1 Mechanisms ............................................ 5–22
5.10 Coordinating System Files .................................... 5–22
5.10.1 Procedure .............................................. 5–22
5.10.2 Network Database Files .................................. 5–23
5.11 System Time on the Cluster .................................. 5–24
5.11.1 Setting System Time ..................................... 5–24
6 Cluster Storage Devices
6.1 Data File Sharing ........................................... 6–1
6.1.1 Access Methods ......................................... 6–1
6.1.2 Examples .............................................. 6–2
6.1.3 Specifying a Preferred Path ............................... 6–5
6.2 Naming OpenVMS Cluster Storage Devices ...................... 6–6
6.2.1 Allocation Classes ....................................... 6–6
6.2.2 Specifying Node Allocation Classes ......................... 6–7
6.2.2.1 Assigning Node Allocation Class Values on Computers ....... 6–9
6.2.2.2 Assigning Node Allocation Class Values on HSC Subsystems ... 6–9
6.2.2.3 Assigning Node Allocation Class Values on HSJ Subsystems ... 6–10
6.2.2.4 Assigning Node Allocation Class Values on HSD Subsystems ... 6–10
6.2.2.5 Assigning Node Allocation Class Values on DSSI ISEs ....... 6–10
6.2.2.6 Node Allocation Class Example With a DSA Disk and Tape ... 6–11
6.2.2.7 Node Allocation Class Example With Mixed Interconnects .... 6–12
6.2.2.8 Node Allocation Classes and VAX 6000 Tapes .............. 6–13
6.2.2.9 Node Allocation Classes and RAID Array 210 and 230 Devices ... 6–13
6.2.3 Reasons for Using Port Allocation Classes ................... 6–14
6.2.3.1 Constraint of the SCSI Controller Letter in Device Names ... 6–15
6.2.3.2 Constraints Removed by Port Allocation Classes ............ 6–15
6.2.4 Specifying Port Allocation Classes ......................... 6–17
6.2.4.1 Port Allocation Classes for Devices Attached to a Multi-Host Interconnect ... 6–17
6.2.4.2 Port Allocation Class 0 for Devices Attached to a Single-Host Interconnect ... 6–17
6.2.4.3 Port Allocation Class -1 ................................ 6–18
6.2.4.4 How to Implement Port Allocation Classes ................. 6–18
6.2.4.5 Clusterwide Reboot Requirements for SCSI Interconnects .... 6–19
6.3 MSCP and TMSCP Served Disks and Tapes ..................... 6–20
6.3.1 Enabling Servers ........................................ 6–20
6.3.1.1 Serving the System Disk ............................... 6–22
6.3.1.2 Setting the MSCP and TMSCP System Parameters ......... 6–22
6.4 MSCP I/O Load Balancing ................................... 6–22
6.4.1 Load Capacity .......................................... 6–23
6.4.2 Increasing the Load Capacity When FDDI is Used ............ 6–23
6.4.3 Available Serving Capacity ............................... 6–23
6.4.4 Static Load Balancing .................................... 6–23
6.4.5 Dynamic Load Balancing (VAX Only) ....................... 6–23
6.4.6 Overriding MSCP I/O Load Balancing for Special Purposes ..... 6–24
6.5 Managing Cluster Disks With the Mount Utility .................. 6–24
6.5.1 Mounting Cluster Disks .................................. 6–24
6.5.2 Examples of Mounting Shared Disks ....................... 6–25
6.5.3 Mounting Cluster Disks With Command Procedures ........... 6–25
6.5.4 Disk Rebuild Operation .................................. 6–26
6.5.5 Rebuilding Cluster Disks ................................. 6–26
6.5.6 Rebuilding System Disks ................................. 6–26
6.6 Shadowing Disks Across an OpenVMS Cluster ................... 6–27
6.6.1 Purpose ............................................... 6–27
6.6.2 Shadow Sets ........................................... 6–27
6.6.3 I/O Capabilities ......................................... 6–28
6.6.4 Supported Devices ...................................... 6–28
6.6.5 Shadow Set Limits ...................................... 6–29
6.6.6 Distributing Shadowed Disks .............................. 6–29
7 Setting Up and Managing Cluster Queues
7.1 Introduction ............................................... 7–1
7.2 Controlling Queue Availability ................................ 7–1
7.3 Starting a Queue Manager and Creating the Queue Database ....... 7–2
7.4 Starting Additional Queue Managers ........................... 7–3
7.4.1 Command Format ....................................... 7–3
7.4.2 Database Files .......................................... 7–3
7.5 Stopping the Queuing System ................................. 7–4
7.6 Moving Queue Database Files ................................. 7–4
7.6.1 Location Guidelines ..................................... 7–4
7.7 Setting Up Print Queues ..................................... 7–4
7.7.1 Creating a Queue ....................................... 7–5
7.7.2 Command Format ....................................... 7–5
7.7.3 Ensuring Queue Availability ............................... 7–6
7.7.4 Examples .............................................. 7–6
7.8 Setting Up Clusterwide Generic Print Queues .................... 7–7
7.8.1 Sample Configuration .................................... 7–7
7.8.2 Command Example ...................................... 7–8
7.9 Setting Up Execution Batch Queues ............................ 7–8
7.9.1 Before You Begin ....................................... 7–9
7.9.2 Batch Command Format .................................. 7–10
7.9.3 Autostart Command Format ............................... 7–10
7.9.4 Examples .............................................. 7–10
7.10 Setting Up Clusterwide Generic Batch Queues ................... 7–10
7.10.1 Sample Configuration .................................... 7–11
7.11 Starting Local Batch Queues .................................. 7–11
7.11.1 Startup Command Procedure .............................. 7–12
7.12 Using a Common Command Procedure ......................... 7–12
7.12.1 Command Procedure ..................................... 7–12
7.12.2 Examples .............................................. 7–13
7.12.3 Example ............................................... 7–15
7.13 Disabling Autostart During Shutdown .......................... 7–16
7.13.1 Options ................................................ 7–17
8 Configuring an OpenVMS Cluster System
8.1 Overview of the Cluster Configuration Procedures ................ 8–1
8.1.1 Before Configuring the System ............................. 8–3
8.1.2 Data Requested by the Cluster Configuration Procedures ....... 8–5
8.1.3 Invoking the Procedure ................................... 8–8
8.2 Adding Computers .......................................... 8–9
8.2.1 Controlling Conversational Bootstrap Operations .............. 8–10
8.2.2 Common AUTOGEN Parameter Files ....................... 8–11
8.2.3 Examples .............................................. 8–11
8.2.4 Adding a Quorum Disk ................................... 8–16
8.3 Removing Computers ........................................ 8–17
8.3.1 Example ............................................... 8–18
8.3.2 Removing a Quorum Disk ................................ 8–19
8.4 Changing Computer Characteristics ............................ 8–20
8.4.1 Preparation ............................................ 8–21
8.4.2 Examples .............................................. 8–24
8.5 Creating a Duplicate System Disk ............................. 8–31
8.5.1 Preparation ............................................ 8–32
8.5.2 Example ............................................... 8–32
8.6 Postconfiguration Tasks ...................................... 8–33
8.6.1 Updating Parameter Files ................................ 8–35
8.6.2 Shutting Down the Cluster ................................ 8–36
8.6.3 Shutting Down a Single Node ............................. 8–37
8.6.4 Updating Network Data .................................. 8–37
8.6.5 Altering Satellite Local Disk Labels ........................ 8–38
8.6.6 Changing Allocation Class Values .......................... 8–38
8.6.7 Rebooting .............................................. 8–39
8.6.8 Rebooting Satellites Configured with OpenVMS on a Local Disk ... 8–39
8.7 Running AUTOGEN with Feedback ............................ 8–40
8.7.1 Advantages ............................................ 8–40
8.7.2 Initial Values ........................................... 8–41
8.7.3 Obtaining Reasonable Feedback ............................ 8–41
8.7.4 Creating a Command File to Run AUTOGEN ................ 8–42
9 Building Large OpenVMS Cluster Systems
9.1 Setting Up the Cluster ....................................... 9–1
9.2 General Booting Considerations ............................... 9–2
9.2.1 Concurrent Booting ...................................... 9–2
9.2.2 Minimizing Boot Time .................................... 9–3
9.3 Booting Satellites ........................................... 9–4
9.4 Configuring and Booting Satellite Nodes ........................ 9–4
9.4.1 Booting from a Single LAN Adapter ........................ 9–5
9.4.2 Changing the Default Boot Adapter ........................ 9–6
9.4.3 Booting from Multiple LAN Adapters (Alpha Only) ............ 9–6
9.4.4 Enabling Satellites to Use Alternate LAN Adapters for Booting ... 9–7
9.4.5 Configuring MOP Service ................................. 9–9
9.4.6 Controlling Satellite Booting .............................. 9–10
9.5 System-Disk Throughput ..................................... 9–13
9.5.1 Avoiding Disk Rebuilds ................................... 9–14
9.5.2 Offloading Work ........................................ 9–14
9.5.3 Configuring Multiple System Disks ......................... 9–15
9.6 Conserving System Disk Space ................................ 9–17
9.6.1 Techniques ............................................. 9–17
9.7 Adjusting System Parameters ................................. 9–17
9.7.1 The SCSBUFFCNT Parameter (VAX Only) ................... 9–17
9.7.2 The SCSRESPCNT Parameter ............................. 9–18
9.7.3 The CLUSTER_CREDITS Parameter ....................... 9–18
9.8 Minimize Network Instability ................................. 9–19
9.9 DECnet Cluster Alias ....................................... 9–20
10 Maintaining an OpenVMS Cluster System
10.1 Backing Up Data and Files ................................... 10–1
10.2 Updating the OpenVMS Operating System ...................... 10–2
10.2.1 Rolling Upgrades ........................................ 10–3
10.3 LAN Network Failure Analysis ................................ 10–3
10.4 Recording Configuration Data ................................. 10–3
10.4.1 Record Information ...................................... 10–4
10.4.2 Satellite Network Data ................................... 10–4
10.5 Cross-Architecture Satellite Booting ........................... 10–5
10.5.1 Sample Configurations ................................... 10–5
10.5.2 Usage Notes ............................................ 10–7
10.5.3 Configuring DECnet ..................................... 10–8
10.6 Controlling OPCOM Messages ................................ 10–9
10.6.1 Overriding OPCOM Defaults .............................. 10–10
10.6.2 Example ............................................... 10–10
10.7 Shutting Down a Cluster ..................................... 10–10
10.7.1 The NONE Option ....................................... 10–11
10.7.2 The REMOVE_NODE Option .............................. 10–11
10.7.3 The CLUSTER_SHUTDOWN Option ....................... 10–11
10.7.4 The REBOOT_CHECK Option ............................. 10–12
10.7.5 The SAVE_FEEDBACK Option ............................ 10–12
10.8 Dump Files ................................................ 10–12
10.8.1 Controlling Size and Creation ............................. 10–12
10.8.2 Sharing Dump Files ..................................... 10–13
10.9 Maintaining the Integrity of OpenVMS Cluster Membership ........ 10–14
10.9.1 Cluster Group Data ...................................... 10–15
10.9.2 Example ............................................... 10–15
10.10 Adjusting Maximum Packet Size for LAN Configurations .......... 10–16
10.10.1 System Parameter Settings for LANs ....................... 10–16
10.10.2 How to Use NISCS_MAX_PKTSZ .......................... 10–16
10.10.3 Editing Parameter Files .................................. 10–17
10.11 Determining Process Quotas .................................. 10–17
10.11.1 Quota Values ........................................... 10–17
10.11.2 PQL Parameters ........................................ 10–17
10.11.3 Examples .............................................. 10–18
10.12 Restoring Cluster Quorum ................................... 10–19
10.12.1 Restoring Votes ......................................... 10–19
10.12.2 Reducing Cluster Quorum Value ........................... 10–19
10.13 Cluster Performance ........................................ 10–20
10.13.1 Using the SHOW Commands .............................. 10–20
10.13.2 Using the Monitor Utility ................................. 10–21
10.13.3 Using Compaq Availability Manager and DECamds ........... 10–22
10.13.4 Monitoring LAN Activity ................................. 10–23
A Cluster System Parameters
A.1 Values for Alpha and VAX Computers .......................... A–1
B Building Common Files<br />
B.1 Building a Common SYSUAF.DAT File ......................... B–1
B.2 Merging RIGHTSLIST.DAT Files .............................. B–3
C Cluster Troubleshooting
C.1 Diagnosing Computer Failures ................................ C–1
C.1.1 Preliminary Checklist .................................... C–1
C.1.2 Sequence of Booting Events ............................... C–1
C.2 Computer on the CI Fails to Boot ............................. C–3
C.3 Satellite Fails to Boot ....................................... C–4
C.3.1 Displaying Connection Messages ........................... C–5
C.3.2 General OpenVMS Cluster Satellite-Boot Troubleshooting ...... C–6
C.3.3 MOP Server Troubleshooting .............................. C–9
C.3.4 Disk Server Troubleshooting .............................. C–10
C.3.5 Satellite Booting Troubleshooting .......................... C–10
C.3.6 Alpha Booting Messages (Alpha Only) ...................... C–11
C.4 Computer Fails to Join the Cluster ............................ C–13
C.4.1 Verifying OpenVMS Cluster Software Load .................. C–13
C.4.2 Verifying Boot Disk and Root ............................. C–13
C.4.3 Verifying SCSNODE and SCSSYSTEMID Parameters ........ C–14
C.4.4 Verifying Cluster Security Information ..................... C–14
C.5 Startup Procedures Fail to Complete .......................... C–14
C.6 Diagnosing LAN Component Failures .......................... C–15
C.7 Diagnosing Cluster Hangs ................................... C–15
C.7.1 Cluster Quorum is Lost .................................. C–15
C.7.2 Inaccessible Cluster Resource ............................. C–16
C.8 Diagnosing CLUEXIT Bugchecks .............................. C–16
C.8.1 Conditions Causing Bugchecks ............................ C–16
C.9 Port Communications ....................................... C–17
C.9.1 Port Polling ............................................ C–17
C.9.2 LAN Communications .................................... C–18
C.9.3 System Communications Services (SCS) Connections .......... C–18
C.10 Diagnosing Port Failures ..................................... C–18
C.10.1 Hierarchy of Communication Paths ......................... C–18
C.10.2 Where Failures Occur .................................... C–19
C.10.3 Verifying CI Port Functions ............................... C–19
C.10.4 Verifying Virtual Circuits ................................. C–20
C.10.5 Verifying CI Cable Connections ............................ C–20
C.10.6 Diagnosing CI Cabling Problems ........................... C–21
C.10.7 Repairing CI Cables ..................................... C–23
C.10.8 Verifying LAN Connections ............................... C–24
C.11 Analyzing Error-Log Entries for Port Devices ................... C–24
C.11.1 Examine the Error Log ................................... C–24
C.11.2 Formats ................................................ C–25
C.11.3 CI Device-Attention Entries ............................... C–26
C.11.4 Error Recovery ........................................... C–27<br />
C.11.5 LAN Device-Attention Entries . . . ............................ C–27<br />
C.11.6 Logged Message Entries .................................... C–29<br />
C.11.7 Error-Log Entry Descriptions ................................ C–31<br />
C.12 OPA0 Error-Message Logging and Broadcasting . ................... C–37<br />
C.12.1 OPA0 Error Messages . .................................... C–38<br />
C.12.2 CI Port Recovery ......................................... C–40
D Sample Programs for LAN Control<br />
D.1 Purpose of Programs . ........................................ D–1<br />
D.2 Starting the NISCA Protocol . . . ................................ D–1<br />
D.2.1 Start the Protocol . ........................................ D–2<br />
D.3 Stopping the NISCA Protocol . . . ................................ D–2<br />
D.3.1 Stop the Protocol . ........................................ D–3<br />
D.3.2 Verify Successful Execution . ................................ D–3<br />
D.4 Analyzing Network Failures . . . ................................ D–3<br />
D.4.1 Failure Analysis . . ........................................ D–3<br />
D.4.2 How the LAVC$FAILURE_ANALYSIS Program Works ............ D–4<br />
D.5 Using the Network Failure Analysis Program ...................... D–4<br />
D.5.1 Create a Network Diagram . ................................ D–5<br />
D.5.2 Edit the Source File ....................................... D–7<br />
D.5.3 Assemble and Link the Program ............................. D–9<br />
D.5.4 Modify Startup Files ...................................... D–10<br />
D.5.5 Execute the Program ...................................... D–10<br />
D.5.6 Modify MODPARAMS.DAT . ................................ D–10<br />
D.5.7 Test the Program . ........................................ D–10<br />
D.5.8 Display Suspect Components ................................ D–10<br />
E Subroutines for LAN Control<br />
E.1 Introduction ................................................ E–1<br />
E.1.1 Purpose of the Subroutines . ................................ E–1<br />
E.2 Starting the NISCA Protocol . . . ................................ E–1<br />
E.2.1 Status . ................................................ E–2<br />
E.2.2 Error Messages . . ........................................ E–2<br />
E.3 Stopping the NISCA Protocol . . . ................................ E–3<br />
E.3.1 Status . ................................................ E–3<br />
E.3.2 Error Messages . . ........................................ E–4<br />
E.4 Creating a Representation of a Network Component . . ............... E–5<br />
E.4.1 Status . ................................................ E–6<br />
E.4.2 Error Messages . . ........................................ E–6<br />
E.5 Creating a Network Component List ............................. E–7<br />
E.5.1 Status . ................................................ E–8<br />
E.5.2 Error Messages . . ........................................ E–8<br />
E.6 Starting Network Component Failure Analysis ..................... E–9<br />
E.6.1 Status . ................................................ E–9<br />
E.6.2 Error Messages . . ........................................ E–9<br />
E.7 Stopping Network Component Failure Analysis ..................... E–10<br />
E.7.1 Status . ................................................ E–10<br />
E.7.2 Error Messages . . ........................................ E–10<br />
F Troubleshooting the NISCA Protocol<br />
F.1 How NISCA Fits into the SCA . . ................................ F–1<br />
F.1.1 SCA Protocols . . . ........................................ F–1<br />
F.1.2 Paths Used for Communication .............................. F–4<br />
F.1.3 PEDRIVER ............................................. F–4<br />
F.2 Addressing LAN Communication Problems . ....................... F–5<br />
F.2.1 Symptoms .............................................. F–5<br />
F.2.2 Traffic Control . . . ........................................ F–5<br />
F.2.3 Excessive Packet Losses on LAN Paths . ....................... F–5<br />
F.2.4 Preliminary Network Diagnosis .............................. F–6<br />
F.2.5 Tracing Intermittent Errors ................................. F–6<br />
F.2.6 Checking System Parameters . . . ............................ F–7<br />
F.2.7 Channel Timeouts ........................................ F–8<br />
F.3 Using SDA to Monitor LAN Communications . . . ................... F–9<br />
F.3.1 Isolating Problem Areas .................................... F–9<br />
F.3.2 SDA Command SHOW PORT . . . ............................ F–9<br />
F.3.3 Monitoring Virtual Circuits ................................. F–10<br />
F.3.4 Monitoring PEDRIVER Buses . . . ............................ F–13<br />
F.3.5 Monitoring LAN Adapters .................................. F–14<br />
F.4 Troubleshooting NISCA Communications ......................... F–16<br />
F.4.1 Areas of Trouble .......................................... F–16<br />
F.5 Channel Formation .......................................... F–16<br />
F.5.1 How Channels Are Formed ................................. F–16<br />
F.5.2 Techniques for Troubleshooting . . ............................ F–17<br />
F.6 Retransmission Problems . . .................................... F–17<br />
F.6.1 Why Retransmissions Occur ................................ F–18<br />
F.6.2 Techniques for Troubleshooting . . ............................ F–19<br />
F.7 Understanding NISCA Datagrams . . . ............................ F–19<br />
F.7.1 Packet Format ........................................... F–19<br />
F.7.2 LAN Headers ............................................ F–20<br />
F.7.3 Ethernet Header ......................................... F–20<br />
F.7.4 FDDI Header ............................................ F–20<br />
F.7.5 Datagram Exchange (DX) Header ............................ F–21<br />
F.7.6 Channel Control (CC) Header . . . ............................ F–22<br />
F.7.7 Transport (TR) Header . .................................... F–23<br />
F.8 Using a LAN Protocol Analysis Program .......................... F–25<br />
F.8.1 Single or Multiple LAN Segments ............................ F–25<br />
F.8.2 Multiple LAN Segments .................................... F–26<br />
F.9 Data Isolation Techniques . .................................... F–26<br />
F.9.1 All <strong>OpenVMS</strong> <strong>Cluster</strong> Traffic ................................ F–26<br />
F.9.2 Specific <strong>OpenVMS</strong> <strong>Cluster</strong> Traffic ............................ F–26<br />
F.9.3 Virtual Circuit (Node-to-Node) Traffic ......................... F–27<br />
F.9.4 Channel (LAN Adapter–to–LAN Adapter) Traffic ................ F–27<br />
F.9.5 Channel Control Traffic .................................... F–27<br />
F.9.6 Transport Data ........................................... F–27<br />
F.10 Setting Up an <strong>HP</strong> 4972A LAN Protocol Analyzer ................... F–28<br />
F.10.1 Analyzing Channel Formation Problems ....................... F–28<br />
F.10.2 Analyzing Retransmission Problems .......................... F–28<br />
F.11 Filters .................................................... F–30<br />
F.11.1 Capturing All LAN Retransmissions for a Specific <strong>OpenVMS</strong><br />
<strong>Cluster</strong> ................................................. F–31<br />
F.11.2 Capturing All LAN Packets for a Specific <strong>OpenVMS</strong> <strong>Cluster</strong> ........ F–31<br />
F.11.3 Setting Up the Distributed Enable Filter ....................... F–31<br />
F.11.4 Setting Up the Distributed Trigger Filter . . . ................... F–32<br />
F.12 Messages .................................................. F–32<br />
F.12.1 Distributed Enable Message ................................ F–32<br />
F.12.2 Distributed Trigger Message ................................ F–33<br />
F.13 Programs That Capture Retransmission Errors . . ................... F–33<br />
F.13.1 Starter Program .......................................... F–33<br />
F.13.2 Partner Program ......................................... F–34<br />
F.13.3 Scribe Program .......................................... F–34
G NISCA Transport Protocol Channel Selection and Congestion<br />
Control<br />
G.1 NISCA Transmit Channel Selection .............................. G–1<br />
G.1.1 Multiple-Channel Load Distribution on <strong>OpenVMS</strong> Version 7.3 (Alpha<br />
and VAX) or Later ........................................ G–1<br />
G.1.1.1 Equivalent Channel Set Selection . . ....................... G–1<br />
G.1.1.2 Local and Remote LAN Adapter Load Distribution ........... G–2<br />
G.1.2 Preferred Channel (<strong>OpenVMS</strong> Version 7.2 and Earlier) ........... G–2<br />
G.2 NISCA Congestion Control ..................................... G–3<br />
G.2.1 Congestion Caused by Retransmission . . ....................... G–4<br />
G.2.1.1 <strong>OpenVMS</strong> VAX Version 6.0 or <strong>OpenVMS</strong> AXP Version 1.5, or<br />
Later ............................................... G–4<br />
G.2.1.2 VMS Version 5.5 or Earlier .............................. G–5<br />
G.2.2 HELLO Multicast Datagrams ............................... G–5<br />
Index<br />
Examples<br />
7–1 Sample Commands for Creating <strong>OpenVMS</strong> <strong>Cluster</strong> Queues . ....... 7–13<br />
7–2 Common Procedure to Start <strong>OpenVMS</strong> <strong>Cluster</strong> Queues ............ 7–15<br />
8–1 Sample Interactive CLUSTER_CONFIG.COM Session to Add a<br />
Computer as a Boot Server . ................................ 8–12<br />
8–2 Sample Interactive CLUSTER_CONFIG.COM Session to Add a<br />
Computer Running DECnet–Plus ............................ 8–13<br />
8–3 Sample Interactive CLUSTER_CONFIG.COM Session to Add a VAX<br />
Satellite with Local Page and Swap Files ...................... 8–15<br />
8–4 Sample Interactive CLUSTER_CONFIG.COM Session to Remove a<br />
Satellite with Local Page and Swap Files ...................... 8–18<br />
8–5 Sample Interactive CLUSTER_CONFIG.COM Session to Enable the<br />
Local Computer as a Disk Server ............................. 8–24<br />
8–6 Sample Interactive CLUSTER_CONFIG.COM Session to Change the<br />
Local Computer’s ALLOCLASS Value . . ....................... 8–25<br />
8–7 Sample Interactive CLUSTER_CONFIG.COM Session to Enable the<br />
Local Computer as a Boot Server ............................. 8–26<br />
8–8 Sample Interactive CLUSTER_CONFIG.COM Session to Change a<br />
Satellite’s Hardware Address ................................ 8–27<br />
8–9 Sample Interactive CLUSTER_CONFIG.COM Session to Enable the<br />
Local Computer as a Tape Server ............................ 8–28<br />
8–10 Sample Interactive CLUSTER_CONFIG.COM Session to Change the<br />
Local Computer’s TAPE_ALLOCLASS Value .................... 8–30<br />
8–11 Sample Interactive CLUSTER_CONFIG.COM Session to Convert a<br />
Standalone Computer to a <strong>Cluster</strong> Boot Server . . . ............... 8–31<br />
8–12 Sample Interactive CLUSTER_CONFIG.COM CREATE Session ..... 8–32<br />
10–1 Sample NETNODE_UPDATE.COM File ....................... 10–5<br />
10–2 Defining an Alpha Satellite in a VAX Boot Node . . ............... 10–8<br />
10–3 Defining a VAX Satellite in an Alpha Boot Node . . ............... 10–9<br />
10–4 Sample SYSMAN Session to Change the <strong>Cluster</strong> Password . . ....... 10–15<br />
C–1 Crossed Cables: Configuration 1 ............................. C–22<br />
C–2 Crossed Cables: Configuration 2 ............................. C–22<br />
C–3 Crossed Cables: Configuration 3 . ............................ C–22<br />
C–4 Crossed Cables: Configuration 4 . ............................ C–23<br />
C–5 Crossed Cables: Configuration 5 . ............................ C–23<br />
C–6 CI Device-Attention Entries ................................. C–26<br />
C–7 LAN Device-Attention Entry ................................ C–27<br />
C–8 CI Port Logged-Message Entry . . ............................ C–29<br />
D–1 Portion of LAVC$FAILURE_ANALYSIS.MAR to Edit . . ........... D–7<br />
F–1 SDA Command SHOW PORT Display ......................... F–10<br />
F–2 SDA Command SHOW PORT/VC Display . . . ................... F–10<br />
F–3 SDA Command SHOW PORT/BUS Display . . ................... F–13<br />
F–4 SDA Command SHOW LAN/COUNTERS Display ................ F–14<br />
Figures<br />
1–1 <strong>OpenVMS</strong> <strong>Cluster</strong> System Communications . ................... 1–6<br />
1–2 Single-Point <strong>OpenVMS</strong> <strong>Cluster</strong> System Management . . ........... 1–8<br />
2–1 <strong>OpenVMS</strong> <strong>Cluster</strong> System Architecture ........................ 2–2<br />
3–1 <strong>OpenVMS</strong> <strong>Cluster</strong> Configuration Based on CI ................... 3–2<br />
3–2 DSSI <strong>OpenVMS</strong> <strong>Cluster</strong> Configuration ........................ 3–4<br />
3–3 LAN <strong>OpenVMS</strong> <strong>Cluster</strong> System with Single Server Node and System<br />
Disk ................................................... 3–7<br />
3–4 LAN and Fibre Channel <strong>OpenVMS</strong> <strong>Cluster</strong> System: Sample<br />
Configuration ............................................ 3–8<br />
3–5 FDDI in Conjunction with Ethernet in an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
System ................................................. 3–9<br />
3–6 Two-Node MEMORY CHANNEL <strong>OpenVMS</strong> <strong>Cluster</strong> Configuration . . . 3–11<br />
3–7 Three-Node MEMORY CHANNEL <strong>OpenVMS</strong> <strong>Cluster</strong><br />
Configuration ............................................ 3–11<br />
3–8 Three-Node <strong>OpenVMS</strong> <strong>Cluster</strong> Configuration Using a Shared SCSI<br />
Interconnect . ............................................ 3–13<br />
3–9 Four-Node <strong>OpenVMS</strong> <strong>Cluster</strong> Configuration Using a Fibre Channel<br />
Interconnect . ............................................ 3–14<br />
5–1 Resource Sharing in Mixed-Architecture <strong>Cluster</strong> System ........... 5–2<br />
5–2 Directory Structure on a Common System Disk .................. 5–4<br />
5–3 File Search Order on Common System Disk . ................... 5–5<br />
5–4 Translation Order Specified by LNM$FILE_DEV ................ 5–8<br />
6–1 Dual-Ported Disks ........................................ 6–3<br />
6–2 Dual-Pathed Disks ........................................ 6–4<br />
6–3 Configuration with <strong>Cluster</strong>-Accessible Devices ................... 6–5<br />
6–4 Disk and Tape Dual Pathed Between HSC Controllers . ........... 6–8<br />
6–5 Disk and Tape Dual Pathed Between Computers ................ 6–11<br />
6–6 Device Names in a Mixed-Interconnect <strong>Cluster</strong> .................. 6–12<br />
6–7 SCSI Device Names Using a Node Allocation Class . . . ........... 6–15<br />
6–8 Device Names Using Port Allocation Classes ................... 6–16<br />
6–9 Shadow Set With Three Members ............................ 6–28<br />
6–10 Shadow Sets Accessed Through the MSCP Server ................ 6–29<br />
7–1 Sample Printer Configuration . . . ............................ 7–5<br />
7–2 Print Queue Configuration .................................. 7–7
7–3 <strong>Cluster</strong>wide Generic Print Queue Configuration . . ............... 7–8<br />
7–4 Sample Batch Queue Configuration ........................... 7–9<br />
7–5 <strong>Cluster</strong>wide Generic Batch Queue Configuration . . ............... 7–11<br />
10–1 VAX Nodes Boot Alpha Satellites ............................. 10–6<br />
10–2 Alpha and VAX Nodes Boot Alpha and VAX Satellites ............. 10–7<br />
C–1 Correctly Connected Two-Computer CI <strong>Cluster</strong> . . . ............... C–21<br />
C–2 Crossed CI Cable Pair ..................................... C–21<br />
F–1 Protocols in the SCA Architecture ............................ F–2<br />
F–2 Channel-Formation Handshake .............................. F–17<br />
F–3 Lost Messages Cause Retransmissions . ....................... F–18<br />
F–4 Lost ACKs Cause Retransmissions ........................... F–19<br />
F–5 NISCA Headers . . ........................................ F–20<br />
F–6 Ethernet Header . ........................................ F–20<br />
F–7 FDDI Header ............................................ F–21<br />
F–8 DX Header .............................................. F–21<br />
F–9 CC Header .............................................. F–22<br />
F–10 TR Header .............................................. F–24<br />
Tables<br />
1–1 System Management Tools . ................................ 1–9<br />
1–2 System Management Products from <strong>OpenVMS</strong> Partners ........... 1–12<br />
2–1 Communications Services . . . ................................ 2–3<br />
2–2 Transitions Caused by Adding a <strong>Cluster</strong> Member . ............... 2–9<br />
2–3 Transitions Caused by Loss of a <strong>Cluster</strong> Member . ............... 2–10<br />
3–1 Satellite Booting Process . . . ................................ 3–6<br />
4–1 Information Required to Perform an Installation . . ............... 4–3<br />
4–2 Installing Layered Products on a Common System Disk ........... 4–7<br />
4–3 Procedure for Configuring the DECnet Network . . ............... 4–13<br />
5–1 Default <strong>Cluster</strong>wide Logical Name Tables and Logical Names ...... 5–7<br />
5–2 Alias Collisions and Outcomes .............................. 5–9<br />
5–3 Security Files ............................................ 5–18<br />
5–4 Procedure for Coordinating Files ............................. 5–23<br />
6–1 Device Access Methods ..................................... 6–2<br />
6–2 Changing a DSSI Allocation Class Value ...................... 6–10<br />
6–3 Ensuring Unique Tape Access Paths . . . ....................... 6–13<br />
6–4 Examples of Device Names with Port Allocation Classes 1-32767 .... 6–17<br />
6–5 Examples of Device Names With Port Allocation Class 0 ........... 6–18<br />
6–6 MSCP_LOAD and TMSCP_LOAD Parameter Settings ............ 6–20<br />
6–7 MSCP_SERVE_ALL and TMSCP_SERVE_ALL Parameter<br />
Settings ................................................ 6–21<br />
8–1 Summary of <strong>Cluster</strong> Configuration Functions ................... 8–2<br />
8–2 Preconfiguration Tasks ..................................... 8–3<br />
8–3 Data Requested by CLUSTER_CONFIG_LAN.COM and<br />
CLUSTER_CONFIG.COM . . ................................ 8–5<br />
8–4 Preparing to Add Computers to an <strong>OpenVMS</strong> <strong>Cluster</strong> ............. 8–9<br />
8–5 Preparing to Add a Quorum Disk Watcher ...................... 8–17<br />
xv
xvi<br />
8–6 Preparing to Remove Computers from an <strong>OpenVMS</strong> <strong>Cluster</strong> ....... 8–18<br />
8–7 Preparing to Remove a Quorum Disk Watcher ................... 8–19<br />
8–8 CHANGE Options of the <strong>Cluster</strong> Configuration Procedure ......... 8–20<br />
8–9 Tasks Involved in Changing <strong>OpenVMS</strong> <strong>Cluster</strong> Configurations . . . . . . 8–21<br />
8–10 Actions Required to Reconfigure a <strong>Cluster</strong> . . . ................... 8–33<br />
9–1 Sample System Disk I/O Activity and Boot Time for a Single VAX<br />
Satellite ................................................ 9–3<br />
9–2 Sample System Disk I/O Activity and Boot Times for Multiple VAX<br />
Satellites . . . ............................................ 9–3<br />
9–3 Checklist for Satellite Booting . . . ............................ 9–4<br />
9–4 Procedure for Defining a Pseudonode Using DECnet MOP Services . . 9–7<br />
9–5 Procedure for Defining a Pseudonode Using LANCP MOP Services . . 9–7<br />
9–6 Procedure for Creating Different DECnet Node Databases ......... 9–8<br />
9–7 Procedure for Creating Different LANCP Node Databases ......... 9–8<br />
9–8 Controlling Satellite Booting ................................ 9–10<br />
9–9 Techniques to Minimize Network Problems . . ................... 9–19<br />
10–1 Backup Methods ......................................... 10–1<br />
10–2 Upgrading the <strong>OpenVMS</strong> Operating System . ................... 10–2<br />
10–3 OPCOM System Logical Names . . ............................ 10–10<br />
10–4 AUTOGEN Dump-File Symbols . . ............................ 10–12<br />
10–5 Common SYSUAF.DAT Scenarios and Probable Results ........... 10–18<br />
10–6 Reducing the Value of <strong>Cluster</strong> Quorum ........................ 10–20<br />
A–1 Adjustable <strong>Cluster</strong> System Parameters ........................ A–1<br />
A–2 <strong>Cluster</strong> System Parameters Reserved for <strong>OpenVMS</strong> Use Only . . . . . . A–12<br />
B–1 Building a Common SYSUAF.DAT File ........................ B–1<br />
C–1 Sequence of Booting Events ................................. C–2<br />
C–2 Alpha Booting Messages (Alpha Only) ......................... C–11<br />
C–3 Port Failures ............................................ C–19<br />
C–4 How to Verify Virtual Circuit States .......................... C–20<br />
C–5 Informational and Other Error-Log Entries . . ................... C–25<br />
C–6 Port Messages for All Devices . . . ............................ C–31<br />
C–7 Port Messages for LAN Devices . . ............................ C–36<br />
C–8 OPA0 Messages .......................................... C–38<br />
D–1 Procedure for Using the LAVC$FAILURE_ANALYSIS.MAR<br />
Program ................................................ D–4<br />
D–2 Creating a Physical Description of the Network ................. D–5<br />
E–1 Subroutines for LAN Control ................................ E–1<br />
E–2 SYS$LAVC_START_BUS Status . ............................ E–2<br />
E–3 SYS$LAVC_STOP_BUS Status . . ............................ E–4<br />
E–4 SYS$LAVC_DEFINE_NET_COMPONENT Parameters . ........... E–5<br />
E–5 SYS$LAVC_DEFINE_NET_PATH Parameters ................... E–7<br />
E–6 SYS$LAVC_DEFINE_NET_PATH Status ....................... E–8<br />
F–1 SCA Protocol Layers . . .................................... F–2<br />
F–2 Communication Paths . .................................... F–4<br />
F–3 System Parameters for Timing . . ............................ F–7<br />
F–4 Channel Timeout Detection ................................. F–8<br />
F–5 SHOW PORT/VC Display ................................... F–11
F–6 Channel Formation ....................................... F–16<br />
F–7 Fields in the Ethernet Header ............................... F–20<br />
F–8 Fields in the FDDI Header . ................................ F–21<br />
F–9 Fields in the DX Header . . . ................................ F–22<br />
F–10 Fields in the CC Header . . . ................................ F–23<br />
F–11 Fields in the TR Header .................................... F–24<br />
F–12 Tracing Datagrams ....................................... F–28<br />
F–13 Capturing Retransmissions on the LAN ....................... F–31<br />
F–14 Capturing All LAN Packets (LAVc_all) . ....................... F–31<br />
F–15 Setting Up a Distributed Enable Filter (Distrib_Enable) ........... F–31<br />
F–16 Setting Up the Distributed Trigger Filter (Distrib_Trigger) . . ....... F–32<br />
F–17 Setting Up the Distributed Enable Message (Distrib_Enable) ....... F–32<br />
F–18 Setting Up the Distributed Trigger Message (Distrib_Trigger) ....... F–33<br />
G–1 Conditions that Create HELLO Datagram Congestion ............ G–5<br />
Preface<br />
Introduction<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> describes system management for <strong>OpenVMS</strong> <strong>Cluster</strong><br />
systems. Although the <strong>OpenVMS</strong> <strong>Cluster</strong> software for VAX and Alpha computers<br />
is separately purchased, licensed, and installed, the difference between the two<br />
architectures lies mainly in the hardware used. Essentially, system management<br />
for VAX and Alpha computers in an <strong>OpenVMS</strong> <strong>Cluster</strong> is identical. Exceptions<br />
are pointed out.<br />
Who Should Use This Manual<br />
This document is intended for anyone responsible for setting up and managing<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> systems. To use the document as a guide to cluster<br />
management, you must have a thorough understanding of system management<br />
concepts and procedures, as described in the <strong>OpenVMS</strong> System Manager’s<br />
Manual.<br />
How This Manual Is Organized<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> contains ten chapters and seven appendixes.<br />
Chapter 1 introduces <strong>OpenVMS</strong> <strong>Cluster</strong> systems.<br />
Chapter 2 presents the software concepts integral to maintaining <strong>OpenVMS</strong><br />
<strong>Cluster</strong> membership and integrity.<br />
Chapter 3 describes various <strong>OpenVMS</strong> <strong>Cluster</strong> configurations and the ways they<br />
are interconnected.<br />
Chapter 4 explains how to set up an <strong>OpenVMS</strong> <strong>Cluster</strong> system and coordinate<br />
system files.<br />
Chapter 5 explains how to set up an environment in which resources can be<br />
shared across nodes in the <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
Chapter 6 discusses disk and tape management concepts and procedures and how<br />
to use Volume Shadowing for <strong>OpenVMS</strong> to prevent data unavailability.<br />
Chapter 7 discusses queue management concepts and procedures.<br />
Chapter 8 explains how to build an <strong>OpenVMS</strong> <strong>Cluster</strong> system once the necessary<br />
preparations are made, and how to reconfigure and maintain the cluster.<br />
Chapter 9 provides guidelines for configuring and building large <strong>OpenVMS</strong><br />
<strong>Cluster</strong> systems, booting satellite nodes, and cross-architecture booting.<br />
Chapter 10 describes ongoing <strong>OpenVMS</strong> <strong>Cluster</strong> system maintenance.<br />
Appendix A lists and defines <strong>OpenVMS</strong> <strong>Cluster</strong> system parameters.<br />
Appendix B provides guidelines for building a cluster common user authorization<br />
file.<br />
Appendix C provides troubleshooting information.<br />
Appendix D presents three sample programs for LAN control and explains how to<br />
use the Local Area <strong>OpenVMS</strong> <strong>Cluster</strong> Network Failure Analysis Program.<br />
Appendix E describes the subroutine package used with local area <strong>OpenVMS</strong><br />
<strong>Cluster</strong> sample programs.<br />
Appendix F provides techniques for troubleshooting network problems related to<br />
the NISCA transport protocol.<br />
Appendix G describes how the interactions of workload distribution and network<br />
topology affect <strong>OpenVMS</strong> <strong>Cluster</strong> system performance, and discusses transmit<br />
channel selection by PEDRIVER.<br />
Associated Documents<br />
This document is not a one-volume reference manual. The utilities and commands<br />
are described in detail in the <strong>OpenVMS</strong> System Manager’s Manual, the <strong>OpenVMS</strong><br />
System Management Utilities Reference Manual, and the <strong>OpenVMS</strong> DCL<br />
Dictionary.<br />
For additional information on the topics covered in this manual, refer to the<br />
following documents:<br />
• Guidelines for <strong>OpenVMS</strong> <strong>Cluster</strong> Configurations<br />
• <strong>OpenVMS</strong> Alpha Partitioning and Galaxy Guide<br />
• Guide to <strong>OpenVMS</strong> File Applications<br />
• <strong>OpenVMS</strong> Guide to System Security<br />
• <strong>OpenVMS</strong> Alpha System Dump Analyzer Utility Manual<br />
• VMS System Dump Analyzer Utility Manual<br />
• <strong>OpenVMS</strong> I/O User’s Reference Manual<br />
• <strong>OpenVMS</strong> License Management Utility Manual<br />
• <strong>OpenVMS</strong> System Management Utilities Reference Manual<br />
• <strong>OpenVMS</strong> System Manager’s Manual<br />
• A Comparison of System Management on <strong>OpenVMS</strong> AXP and <strong>OpenVMS</strong> VAX<sup>1</sup><br />
• <strong>OpenVMS</strong> System Services Reference Manual<br />
• Volume Shadowing for <strong>OpenVMS</strong><br />
• <strong>OpenVMS</strong> <strong>Cluster</strong> Software Software Product Description (SPD 29.78.xx)<br />
• DECnet for <strong>OpenVMS</strong> Network Management Utilities<br />
• DECnet for <strong>OpenVMS</strong> Networking Manual<br />
• The DECnet–Plus (formerly known as DECnet/OSI) documentation set<br />
• The TCP/IP Services for <strong>OpenVMS</strong> documentation set<br />
For additional information about Compaq <strong>OpenVMS</strong> products and services, access<br />
the Compaq website at the following location:<br />
http://www.openvms.compaq.com/<br />
1 This manual has been archived but is available on the <strong>OpenVMS</strong> Documentation<br />
CD–ROM.
Reader’s Comments<br />
Compaq welcomes your comments on this manual. Please send comments to<br />
either of the following addresses:<br />
Internet openvmsdoc@compaq.com<br />
Mail Compaq Computer Corporation<br />
OSSG Documentation Group, ZKO3-4/U08<br />
110 Spit Brook Rd.<br />
Nashua, NH 03062-2698<br />
How To Order Additional Documentation<br />
Visit the following World Wide Web address for information about how to order<br />
additional documentation:<br />
http://www.openvms.compaq.com/<br />
Conventions<br />
The following conventions are used in this manual:<br />
Return In examples, a key name enclosed in a box indicates that<br />
you press a key on the keyboard. (In text, a key name is not<br />
enclosed in a box.)<br />
In the HTML version of this document, this convention appears<br />
as brackets, rather than a box.<br />
. . . A horizontal ellipsis in examples indicates one of the following<br />
possibilities:<br />
• Additional optional arguments in a statement have been<br />
omitted.<br />
• The preceding item or items can be repeated one or more<br />
times.<br />
• Additional parameters, values, or other information can be<br />
entered.<br />
.<br />
.<br />
.<br />
A vertical ellipsis indicates the omission of items from a code<br />
example or command format; the items are omitted because<br />
they are not important to the topic being discussed.<br />
( ) In command format descriptions, parentheses indicate that you<br />
must enclose choices in parentheses if you specify more than<br />
one.<br />
[ ] In command format descriptions, brackets indicate optional<br />
choices. You can choose one or more items or no items.<br />
Do not type the brackets on the command line. However,<br />
you must include the brackets in the syntax for <strong>OpenVMS</strong><br />
directory specifications and for a substring specification in an<br />
assignment statement.<br />
| In command format descriptions, vertical bars separate choices<br />
within brackets or braces. Within brackets, the choices are<br />
optional; within braces, at least one choice is required. Do not<br />
type the vertical bars on the command line.<br />
{ } In command format descriptions, braces indicate required<br />
choices; you must choose at least one of the items listed. Do<br />
not type the braces on the command line.<br />
xxi
xxii<br />
bold text This typeface represents the introduction of a new term. It<br />
also represents the name of an argument, an attribute, or a<br />
reason.<br />
italic text Italic text indicates important information, complete titles<br />
of manuals, or variables. Variables include information that<br />
varies in system output (Internal error number), in command<br />
lines (/PRODUCER=name), and in command parameters in<br />
text (where dd represents the predefined code for the device<br />
type).<br />
UPPERCASE TEXT Uppercase text indicates a command, the name of a routine,<br />
the name of a file, or the abbreviation for a system privilege.<br />
Monospace text Monospace type indicates code examples and interactive screen<br />
displays.<br />
In the C programming language, monospace type in text<br />
identifies the following elements: keywords, the names<br />
of independently compiled external functions and files,<br />
syntax summaries, and references to variables or identifiers<br />
introduced in an example.<br />
- A hyphen at the end of a command format description,<br />
command line, or code line indicates that the command or<br />
statement continues on the following line.<br />
numbers All numbers in text are assumed to be decimal unless<br />
otherwise noted. Nondecimal radixes—binary, octal, or<br />
hexadecimal—are explicitly indicated.
1<br />
Introduction to <strong>OpenVMS</strong> <strong>Cluster</strong> System<br />
Management<br />
1.1 Overview<br />
‘‘<strong>Cluster</strong>’’ technology was pioneered by Digital Equipment Corporation in 1983<br />
with the VAXcluster system. The VAXcluster system was built using multiple<br />
standard VAX computing systems and the VMS operating system. The initial<br />
VAXcluster system offered the power and manageability of a centralized system<br />
and the flexibility of many physically distributed computing systems.<br />
Through the years, the technology has evolved into <strong>OpenVMS</strong> <strong>Cluster</strong> systems,<br />
which support both the <strong>OpenVMS</strong> Alpha and the <strong>OpenVMS</strong> VAX operating<br />
systems and hardware, as well as a multitude of additional features and options.<br />
When Compaq Computer Corporation acquired Digital Equipment Corporation<br />
in 1998, it acquired the most advanced cluster technology available. Compaq<br />
continues to enhance and expand <strong>OpenVMS</strong> <strong>Cluster</strong> capabilities.<br />
An <strong>OpenVMS</strong> <strong>Cluster</strong> system is a highly integrated organization of <strong>OpenVMS</strong><br />
software, Alpha and VAX computers, and storage devices that operate as a single<br />
system. The <strong>OpenVMS</strong> <strong>Cluster</strong> acts as a single virtual system, even though it<br />
is made up of many distributed systems. As members of an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
system, Alpha and VAX computers can share processing resources, data storage,<br />
and queues under a single security and management domain, yet they can boot<br />
or shut down independently.<br />
The distance between the computers in an <strong>OpenVMS</strong> <strong>Cluster</strong> system depends on<br />
the interconnects that you use. The computers can be located in one computer<br />
lab, on two floors of a building, between buildings on a campus, or on two<br />
different sites hundreds of miles apart.<br />
An <strong>OpenVMS</strong> <strong>Cluster</strong> system with computers located on two different sites is<br />
known as a multiple-site <strong>OpenVMS</strong> <strong>Cluster</strong> system. A multiple-site <strong>OpenVMS</strong><br />
<strong>Cluster</strong> forms the basis of a disaster tolerant <strong>OpenVMS</strong> <strong>Cluster</strong> system. For more<br />
information about multiple site clusters, refer to Guidelines for <strong>OpenVMS</strong> <strong>Cluster</strong><br />
Configurations.<br />
Disaster Tolerant <strong>Cluster</strong> Services for <strong>OpenVMS</strong> is a Compaq Services system<br />
management and software package for configuring and managing <strong>OpenVMS</strong><br />
disaster tolerant clusters. For more information about Disaster Tolerant <strong>Cluster</strong><br />
Services for <strong>OpenVMS</strong>, contact your Compaq Services representative.<br />
You can also visit:<br />
http://www.compaq.co.uk/globalservices/continuity/dis_clus.stm<br />
1.1.1 Uses<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> systems are an ideal environment for developing<br />
high-availability applications, such as transaction processing systems, servers for<br />
network client/server applications, and data-sharing applications.<br />
1.1.2 Benefits<br />
Computers in an <strong>OpenVMS</strong> <strong>Cluster</strong> system interact to form a cooperative,<br />
distributed operating system and derive a number of benefits, as shown in the<br />
following table.<br />
Benefit Description<br />
Resource sharing <strong>OpenVMS</strong> <strong>Cluster</strong> software automatically synchronizes and load balances batch and print queues, storage devices, and other resources among all cluster members.<br />
Flexibility Application programmers do not have to change their application code, and users do not have to know anything about the <strong>OpenVMS</strong> <strong>Cluster</strong> environment to take advantage of common resources.<br />
High availability System designers can configure redundant hardware components to create highly available systems that eliminate or withstand single points of failure.<br />
Nonstop processing The <strong>OpenVMS</strong> operating system, which runs on each node in an <strong>OpenVMS</strong> <strong>Cluster</strong>, facilitates dynamic adjustments to changes in the configuration.<br />
Scalability Organizations can dynamically expand computing and storage resources as business needs grow or change without shutting down the system or applications running on the system.<br />
Performance An <strong>OpenVMS</strong> <strong>Cluster</strong> system can provide high performance.<br />
Management Rather than repeating the same system management operation on multiple <strong>OpenVMS</strong> systems, management tasks can be performed concurrently for one or more nodes.<br />
Security Computers in an <strong>OpenVMS</strong> <strong>Cluster</strong> share a single security database that can be accessed by all nodes in a cluster.<br />
Load balancing Distributes work across cluster members based on the current load of each member.<br />
1.2 Hardware Components<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system configurations consist of hardware components from<br />
the following general groups:<br />
• Computers<br />
• Interconnects<br />
• Storage devices<br />
References: Detailed <strong>OpenVMS</strong> <strong>Cluster</strong> configuration guidelines can be found<br />
in the <strong>OpenVMS</strong> <strong>Cluster</strong> Software Software Product Description (SPD) and in<br />
Guidelines for <strong>OpenVMS</strong> <strong>Cluster</strong> Configurations.<br />
1.2.1 Computers<br />
Up to 96 computers, ranging from desktop to mainframe systems, can be<br />
members of an <strong>OpenVMS</strong> <strong>Cluster</strong> system. Active members that run the<br />
<strong>OpenVMS</strong> Alpha or <strong>OpenVMS</strong> VAX operating system and participate fully in<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> negotiations can include:<br />
• Alpha computers or workstations<br />
• VAX computers or workstations or MicroVAX computers<br />
1.2.2 Physical Interconnects<br />
An interconnect is a physical path that connects computers to other computers<br />
and to storage subsystems. <strong>OpenVMS</strong> <strong>Cluster</strong> systems support a variety of<br />
interconnects (also referred to as buses) so that members can communicate using<br />
the most appropriate and effective method possible:<br />
• LANs<br />
Ethernet (10/100, Gigabit)<br />
Asynchronous transfer mode (ATM)<br />
Fiber Distributed Data Interface (FDDI)<br />
• CI<br />
• Digital Storage <strong>Systems</strong> Interconnect (DSSI)<br />
• MEMORY CHANNEL (node to node only)<br />
• Small Computer <strong>Systems</strong> Interface (SCSI) (storage only)<br />
• Fibre Channel (FC) (storage only)<br />
1.2.3 <strong>OpenVMS</strong> Galaxy SMCI<br />
In addition to the physical interconnects listed in Section 1.2.2, another type of<br />
interconnect, a shared memory CI (SMCI) for <strong>OpenVMS</strong> Galaxy instances, is<br />
available. SMCI supports cluster communications between Galaxy instances.<br />
For more information about SMCI and Galaxy configurations, see the <strong>OpenVMS</strong><br />
Alpha Partitioning and Galaxy Guide.<br />
1.2.4 Storage Devices<br />
A shared storage device is a disk or tape that is accessed by multiple computers<br />
in the cluster. Nodes access remote disks and tapes by means of the MSCP and<br />
TMSCP server software (described in Section 1.3.1).<br />
<strong>Systems</strong> within an <strong>OpenVMS</strong> <strong>Cluster</strong> support a wide range of storage devices:<br />
• Disks and disk drives, including:<br />
Digital Storage Architecture (DSA) disks<br />
RF series integrated storage elements (ISEs)<br />
Small Computer <strong>Systems</strong> Interface (SCSI) devices<br />
Solid state disks<br />
• Tapes and tape drives<br />
• Controllers and I/O servers, including the following:<br />
Controller Interconnect<br />
HSC CI<br />
HSJ CI<br />
HSD DSSI<br />
HSZ SCSI<br />
HSG FC<br />
In addition, the K.scsi HSC controller allows the connection of the<br />
StorageWorks arrays with SCSI devices on the HSC storage subsystems.<br />
Note: HSC, HSJ, HSD, and HSZ controllers support many combinations<br />
of SDIs (standard disk interfaces) and STIs (standard tape interfaces) that<br />
connect disks and tapes.<br />
1.3 Software Components<br />
The <strong>OpenVMS</strong> operating system, which runs on each node in the <strong>OpenVMS</strong><br />
<strong>Cluster</strong>, includes several software components that facilitate resource sharing and<br />
dynamic adjustments to changes in the underlying hardware configuration.<br />
If one computer becomes unavailable, the <strong>OpenVMS</strong> <strong>Cluster</strong> system continues<br />
operating because <strong>OpenVMS</strong> is still running on the remaining computers.<br />
1.3.1 <strong>OpenVMS</strong> <strong>Cluster</strong> Software Functions<br />
The following table describes the software components and their main function.<br />
Component Facilitates Function<br />
Connection manager Member integrity Coordinates participation of computers in the cluster and maintains cluster integrity when computers join or leave the cluster.<br />
Distributed lock manager Resource synchronization Synchronizes operations of the distributed file system, job controller, device allocation, and other cluster facilities. If an <strong>OpenVMS</strong> <strong>Cluster</strong> computer shuts down, all locks that it holds are released so processing can continue on the remaining computers.<br />
Distributed file system Resource sharing Allows all computers to share access to mass storage and file records, regardless of the type of storage device (DSA, RF, SCSI, and solid state subsystem) or its location.<br />
Distributed job controller Queuing Makes generic and execution queues available across the cluster.<br />
MSCP server Disk serving Implements the proprietary mass storage control protocol in order to make disks available to all nodes that do not have direct access to those disks.<br />
TMSCP server Tape serving Implements the proprietary tape mass storage control protocol in order to make tape drives available to all nodes that do not have direct access to those tape drives.<br />
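MSCP and TMSCP serving are controlled through system parameters. The following SYSGEN session is a representative sketch only; the parameter values shown (for example, 2 to serve locally attached disks) are illustrative and should be checked against the system parameter documentation before use:<br />

```
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SET MSCP_LOAD 1        ! Load the MSCP server at boot time
SYSGEN> SET MSCP_SERVE_ALL 2   ! Serve locally attached disks to other nodes
SYSGEN> SET TMSCP_LOAD 1       ! Load the TMSCP tape server at boot time
SYSGEN> WRITE CURRENT
SYSGEN> EXIT
```

The new values take effect at the next reboot.<br />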
1.4 Communications<br />
The System Communications Architecture (SCA) defines the communications<br />
mechanisms that allow nodes in an <strong>OpenVMS</strong> <strong>Cluster</strong> system to cooperate. It<br />
governs the sharing of data between resources at the nodes and binds together<br />
System Applications (SYSAPs) that run on different Alpha and VAX computers.<br />
SCA consists of the following hierarchy of components:<br />
Communications Software Function<br />
System applications (SYSAPs) Consist of clusterwide applications (for example, disk and tape class drivers, connection manager, and MSCP server) that use SCS software for interprocessor communication.<br />
System Communications Services (SCS) Provides basic connection management and communication services, implemented as a logical path, between system applications (SYSAPs) on nodes in an <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
Port drivers Control the communication paths between local and remote ports.<br />
Physical interconnects Consist of ports or adapters for CI, DSSI, Ethernet (10/100 and Gigabit), ATM, FDDI, and MEMORY CHANNEL interconnects.<br />
1.4.1 System Communications<br />
Figure 1–1 shows the interrelationships between the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
components.<br />
Figure 1–1 <strong>OpenVMS</strong> <strong>Cluster</strong> System Communications<br />
(Figure: two nodes, each containing processes, SYSAPs, SCS, a port driver, and a port device; the nodes are linked by a connection and a virtual circuit carried over the communications mechanism. The upper layers are software; the port devices and interconnect are hardware. ZK−6795A−GE)<br />
In Figure 1–1, processes in different nodes exchange information with each other:<br />
• Processes can call the $QIO system service and other system services directly<br />
from a program or indirectly using other mechanisms such as <strong>OpenVMS</strong><br />
Record Management Services (RMS). The $QIO system service initiates all<br />
I/O requests.<br />
• A SYSAP on one <strong>OpenVMS</strong> <strong>Cluster</strong> node must communicate with a SYSAP<br />
on another node. For example, a connection manager on one node must<br />
communicate with the connection manager on another node, or a disk class<br />
driver on one node must communicate with the MSCP server on another<br />
node.<br />
• The following SYSAPs use SCS for cluster communication:<br />
Disk and tape class drivers<br />
MSCP server<br />
TMSCP server<br />
DECnet class driver<br />
Connection manager<br />
• SCS routines provide services to format and transfer SYSAP messages to a<br />
port driver for delivery over a specific interconnect.<br />
• Communications go through the port drivers to <strong>OpenVMS</strong> <strong>Cluster</strong> computers<br />
and storage controllers. The port driver manages a logical path, called a<br />
virtual circuit, between each pair of ports in an <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
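You can observe these connections and virtual circuits with the Show Cluster utility. The following invocation is representative; the full set of display classes is described in the utility's documentation:<br />

```
$ SHOW CLUSTER/CONTINUOUS
Command> ADD CIRCUITS           ! Display virtual circuit information
Command> ADD CONNECTIONS        ! Display SCS connections between SYSAPs
```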
1.4.2 Application Communications<br />
Applications running on <strong>OpenVMS</strong> <strong>Cluster</strong> systems use DECnet or<br />
TCP/IP (Transmission Control Protocol/Internet Protocol) for application<br />
communication. The DECnet and TCP/IP communication services allow processes<br />
to locate or start remote servers and then exchange messages.<br />
Note that generic references to DECnet in this document mean either DECnet for<br />
<strong>OpenVMS</strong> or DECnet–Plus (formerly known as DECnet/OSI) software.<br />
1.4.3 <strong>Cluster</strong> Alias<br />
A DECnet feature known as a cluster alias provides a collective name for the<br />
nodes in an <strong>OpenVMS</strong> <strong>Cluster</strong> system. Application software can connect to a<br />
node in the <strong>OpenVMS</strong> <strong>Cluster</strong> using the cluster alias name rather than a specific<br />
node name. This frees the application from keeping track of individual nodes in<br />
the <strong>OpenVMS</strong> <strong>Cluster</strong> system and results in design simplification, configuration<br />
flexibility, and application availability.<br />
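For DECnet for <strong>OpenVMS</strong> (Phase IV), the alias is defined with the NCP utility. The following sketch uses an illustrative node address and alias name; substitute values appropriate to your network:<br />

```
$ RUN SYS$SYSTEM:NCP
NCP> DEFINE NODE 2.100 NAME CLUALS      ! Alias address and name (example values)
NCP> DEFINE EXECUTOR ALIAS NODE CLUALS  ! Let this node participate in the alias
NCP> EXIT
```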
1.5 System Management<br />
The <strong>OpenVMS</strong> <strong>Cluster</strong> system manager must manage multiple users and<br />
resources for maximum productivity and efficiency while maintaining the<br />
necessary security.<br />
1.5.1 Ease of Management<br />
An <strong>OpenVMS</strong> <strong>Cluster</strong> system is easily managed because the multiple members,<br />
hardware, and software are designed to cooperate as a single system:<br />
• Smaller configurations usually include only one system disk (or two for an<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> configuration with both <strong>OpenVMS</strong> VAX and <strong>OpenVMS</strong><br />
Alpha operating systems), regardless of the number or location of computers<br />
in the configuration.<br />
• Software needs to be installed only once for each operating system (VAX and<br />
Alpha), and is accessible by every user and node of the <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
• Users need to be added once to have access to the resources of the entire<br />
<strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
• Several system management utilities and commands facilitate cluster<br />
management.<br />
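For example, a single command issued on one member can make a disk volume available clusterwide; the device name and volume label below are illustrative:<br />

```
$ MOUNT/CLUSTER $1$DGA100: DATA_DISK
```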
Figure 1–2 illustrates centralized system management.<br />
1.5.2 Tools and Utilities from Compaq<br />
The <strong>OpenVMS</strong> operating system supports a number of utilities and tools to assist<br />
you with the management of the distributed resources in <strong>OpenVMS</strong> <strong>Cluster</strong><br />
configurations. Proper management is essential to ensure the availability and<br />
performance of <strong>OpenVMS</strong> <strong>Cluster</strong> configurations.<br />
Figure 1–2 Single-Point <strong>OpenVMS</strong> <strong>Cluster</strong> System Management<br />
(Figure: a single management point encompassing performance management, capacity planning, operations management, security, and storage management. ZK−7008A−GE)<br />
<strong>OpenVMS</strong> and its partners offer a wide selection of tools to meet diverse system<br />
management needs. Table 1–1 describes the Compaq products available for<br />
cluster management and indicates whether each is supplied with the operating<br />
system or is an optional product. Table 1–2 describes some of the most important<br />
utilities and tools produced by <strong>OpenVMS</strong> partners for managing an <strong>OpenVMS</strong><br />
<strong>Cluster</strong> configuration.<br />
Table 1–1 System Management Tools<br />
Tool Supplied or Optional Function<br />
Accounting<br />
VMS Accounting Supplied Tracks how resources are being used.<br />
Configuration and capacity planning<br />
LMF (License Management Facility) Supplied Helps the system manager to determine which software products are licensed and installed on a standalone system and on each of the computers in an <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
Graphical Configuration Manager (GCM) Supplied A portable client/server application that gives you a way to view and control the configuration of partitioned AlphaServer systems running <strong>OpenVMS</strong>.<br />
Galaxy Configuration Utility (GCU) Supplied A DECwindows Motif application that allows system managers to configure and manage an <strong>OpenVMS</strong> Galaxy system from a single workstation window.<br />
SYSGEN (System Generation) utility Supplied Allows you to tailor your system for a specific hardware and software configuration. Use SYSGEN to modify system parameters, load device drivers, and create additional page and swap files.<br />
CLUSTER_CONFIG.COM Supplied Automates the configuration or reconfiguration of an <strong>OpenVMS</strong> <strong>Cluster</strong> system and assumes the use of DECnet.<br />
CLUSTER_CONFIG_LAN.COM Supplied Automates configuration or reconfiguration of an <strong>OpenVMS</strong> <strong>Cluster</strong> system without the use of DECnet.<br />
Compaq Management Agents for <strong>OpenVMS</strong> Supplied Consists of a web server for system management with management agents that allow you to look at devices on your <strong>OpenVMS</strong> systems.<br />
Compaq Insight Manager XE Supplied with every Compaq NT server Centralizes system management in one system to reduce cost, improve operational efficiency and effectiveness, and minimize system down time. You can use Compaq Insight Manager XE on an NT server to monitor every system in an <strong>OpenVMS</strong> <strong>Cluster</strong> system. In a configuration of heterogeneous Compaq systems, you can use Compaq Insight Manager XE on an NT server to monitor all systems.<br />
Event and fault tolerance<br />
OPCOM message routing Supplied Provides event notification.<br />
Operations management<br />
<strong>Cluster</strong>wide process services Supplied Allows <strong>OpenVMS</strong> system management commands, such as SHOW USERS, SHOW SYSTEM, and STOP/ID=, to operate clusterwide.<br />
Availability Manager Supplied From either an <strong>OpenVMS</strong> Alpha or a Windows node, enables you to monitor one or more <strong>OpenVMS</strong> nodes on an extended local area network (LAN). Availability Manager collects system and process data from multiple <strong>OpenVMS</strong> nodes simultaneously, then analyzes the data, and displays the output using a native Java GUI.<br />
Table 1–1 (Cont.) System Management Tools<br />
Tool Supplied or Optional Function<br />
Operations management<br />
DECamds Supplied Collects and analyzes data from multiple nodes simultaneously, directing all output to a centralized DECwindows display. The analysis detects resource availability problems and suggests corrective actions.<br />
SCACP (<strong>Systems</strong> Communications Architecture Control Program) Supplied Enables you to monitor and manage switched LAN paths.<br />
DFS (Distributed File Service) Optional Allows disks to be served across a LAN or WAN.<br />
DNS (Distributed Name Service) Optional Configures certain network nodes as name servers that associate objects with network names.<br />
LATCP (Local Area Transport Control Program) Supplied Provides the function to control and obtain information from the LAT port driver.<br />
LANCP (LAN Control Program) Supplied Allows the system manager to configure and control the LAN software on <strong>OpenVMS</strong> systems.<br />
NCP (Network Control Protocol) utility Optional Allows the system manager to supply and access information about the DECnet for <strong>OpenVMS</strong> (Phase IV) network from a configuration database.<br />
NCL (Network Control Language) utility Optional Allows the system manager to supply and access information about the DECnet–Plus network from a configuration database.<br />
<strong>OpenVMS</strong> Management Station Supplied Enables system managers to set up and manage accounts and print queues across multiple <strong>OpenVMS</strong> <strong>Cluster</strong> systems and <strong>OpenVMS</strong> nodes. <strong>OpenVMS</strong> Management Station is a Microsoft Windows and Windows NT based management tool.<br />
POLYCENTER Software Installation Utility (PCSI) Supplied Provides rapid installations of software products.<br />
Queue Manager Supplied Uses <strong>OpenVMS</strong> <strong>Cluster</strong> generic and execution queues to feed node-specific queues across the cluster.<br />
Show <strong>Cluster</strong> utility Supplied Monitors activity and performance in an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration, then collects and sends information about that activity to a terminal or other output device.<br />
SDA (System Dump Analyzer) Supplied Allows you to inspect the contents of memory as saved in the dump taken at crash time or as it exists in a running system. You can use SDA interactively or in batch mode.<br />
SYSMAN (System Management utility) Supplied Enables device and processor control commands to take effect across an <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
VMSINSTAL Supplied Provides software installations.<br />
Table 1–1 (Cont.) System Management Tools<br />
Tool Supplied or Optional Function<br />
Performance<br />
AUTOGEN utility Supplied Optimizes system parameter settings based on usage.<br />
Monitor utility Supplied Provides basic performance data.<br />
Security<br />
Authorize utility Supplied Modifies user account profiles.<br />
SET ACL command Supplied Sets complex protection on many system objects.<br />
SET AUDIT command Supplied Facilitates tracking of sensitive system objects.<br />
Storage management<br />
Backup utility Supplied Allows <strong>OpenVMS</strong> <strong>Cluster</strong> system managers to create backup copies of files and directories from storage media and then restore them. This utility can be used on one node to back up data stored on disks throughout the <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
Mount utility Supplied Enables a disk or tape volume for processing by one computer, a subset of <strong>OpenVMS</strong> <strong>Cluster</strong> computers, or all <strong>OpenVMS</strong> <strong>Cluster</strong> computers.<br />
Volume Shadowing for <strong>OpenVMS</strong> Optional Replicates disk data across multiple disks to help <strong>OpenVMS</strong> <strong>Cluster</strong> systems survive disk failures.<br />
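As one illustration of clusterwide operation, SYSMAN can direct a single command to every member. This representative session assumes suitable privileges on each node:<br />

```
$ RUN SYS$SYSTEM:SYSMAN
SYSMAN> SET ENVIRONMENT/CLUSTER    ! Target all nodes in the cluster
SYSMAN> DO SHOW TIME               ! Execute the DCL command on every node
SYSMAN> EXIT
```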
1.5.3 System Management Tools from <strong>OpenVMS</strong> Partners<br />
<strong>OpenVMS</strong> Partners offer a wide selection of tools to meet diverse system<br />
management needs, as shown in Table 1–2. The types of tools are described<br />
in the following list:<br />
• Schedule managers<br />
Enable specific actions to be triggered at determined times, including<br />
repetitive and periodic activities, such as nightly backups.<br />
• Event managers<br />
Monitor a system and report occurrences and events that may require an<br />
action or that may indicate a critical or alarming situation, such as low<br />
memory or an attempted security break-in.<br />
• Console managers<br />
Enable a remote connection to and emulation of a system console so that<br />
system messages can be displayed and commands can be issued.<br />
• Performance managers<br />
Monitor system performance by collecting and analyzing data to allow proper<br />
tailoring and configuration of system resources. Performance managers may<br />
also collect historical data for capacity planning.<br />
Table 1–2 System Management Products from <strong>OpenVMS</strong> Partners<br />
Business Partner Product Type or Function<br />
BMC Perform and Predict Performance and capacity manager<br />
Patrol for <strong>OpenVMS</strong> Event manager<br />
ControlM Console manager<br />
Computer Associates AdviseIT Performance manager<br />
CommandIT Console manager<br />
ScheduleIT Schedule manager<br />
WatchIT Event manager<br />
Unicenter TNG Package of various products<br />
Fortel ViewPoint Performance manager<br />
Global Maintech VCC Console manager<br />
Heroix RoboMon Event manager<br />
RoboCentral Console manager<br />
ISE Schedule Schedule manager<br />
Ki NETWORKS CLIM Console manager<br />
ORSYP Dollar Universe Schedule manager<br />
RAXCO Perfect Cache Storage performance<br />
Perfect Disk Storage management<br />
TECsys Development Inc. Console Works Console manager<br />
For current information about <strong>OpenVMS</strong> Partners and<br />
the tools they provide, visit the following web site:<br />
http://www.openvms.compaq.com/openvms/system_management.html<br />
1.5.4 Other Configuration Aids<br />
In addition to these utilities and partner products, several commands are<br />
available that allow the system manager to set parameters on HSC, HSJ, HSD,<br />
HSZ, HSG, and RF subsystems to help configure the system. See the appropriate<br />
hardware documentation for more information.<br />
2<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> Concepts<br />
To help you understand the design and implementation of an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
system, this chapter describes its basic architecture.<br />
2.1 <strong>OpenVMS</strong> <strong>Cluster</strong> System Architecture<br />
Figure 2–1 illustrates the protocol layers within the <strong>OpenVMS</strong> <strong>Cluster</strong> system<br />
architecture, ranging from the communications mechanisms at the base of the<br />
figure to the users of the system at the top of the figure. These protocol layers<br />
include:<br />
• Ports<br />
• System Communications Services (SCS)<br />
• System Applications (SYSAPs)<br />
• Other layered components<br />
Figure 2–1 <strong>OpenVMS</strong> <strong>Cluster</strong> System Architecture<br />
[Figure 2–1 is a diagram. From top to bottom it shows: user applications; clusterwide services, including RMS, the lock manager, the file system, the connection manager, the message service, and I/O services; system applications such as the MSCP server, the TMSCP server, and the disk and tape class drivers (DUdriver and TUdriver for DSA disks and tapes, DKdriver and MKdriver for SCSI disks and tapes); System Communications Services (SYS$SCS); port drivers, including MCdriver and PMdriver for MEMORY CHANNEL, PNdriver for CI, PIdriver for DSSI, PEdriver for LANs, PG*driver and FG*driver for Fibre Channel, PK*driver for SCSI, E*driver for Ethernet, and F*driver for FDDI; and the physical interconnects: star coupler, Ethernet (10/100, Gigabit), FDDI, ATM switch, FC switch, and SCSI (FWD SE).]<br />
2.1.1 Port Layer<br />
This lowest level of the architecture provides connections, in the form of<br />
communication ports and physical paths, between devices. The port layer<br />
can contain any of the following interconnects:<br />
• LANs<br />
ATM<br />
Ethernet (10/100 and Gigabit Ethernet)<br />
FDDI<br />
• CI<br />
• DSSI<br />
• MEMORY CHANNEL<br />
• SCSI<br />
• Fibre Channel<br />
Each interconnect is accessed by a port (also referred to as an adapter) that<br />
connects to the processor node. For example, the Fibre Channel interconnect is<br />
accessed by way of a Fibre Channel port.<br />
2.1.2 SCS Layer<br />
The SCS layer provides basic connection management and communications<br />
services in the form of datagrams, messages, and block transfers over each logical<br />
path. Table 2–1 describes these services.<br />
Table 2–1 Communications Services<br />
Datagrams (information units of less than one packet). Delivery of datagrams is not guaranteed; datagrams<br />
can be lost, duplicated, or delivered out of order. Used for status and information messages whose loss is not<br />
critical, and by applications that have their own reliability protocols (such as DECnet).<br />
Messages (information units of less than one packet). Messages are guaranteed to be delivered and to arrive<br />
in order; virtual circuit sequence numbers are used on the individual packets. Used for disk read and write<br />
requests.<br />
Block data transfers (any contiguous data in a process virtual address space; there is no size limit except<br />
that imposed by the physical memory constraints of the host system). Delivery of block data is guaranteed.<br />
The sending and receiving ports and the port emulators cooperate in breaking the transfer into data packets<br />
and ensuring that all packets are correctly transmitted, received, and placed in the appropriate destination<br />
buffer. Block data transfers differ from messages in the size of the transfer. Used by disk subsystems and<br />
disk servers to move data associated with disk read and write requests.<br />
The SCS layer is implemented as a combination of hardware and software, or<br />
software only, depending upon the type of port. SCS manages connections in an<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> and multiplexes messages between system applications over<br />
a common transport called a virtual circuit. A virtual circuit exists between<br />
each pair of SCS ports and a set of SCS connections that are multiplexed on that<br />
virtual circuit.<br />
2.1.3 System Applications (SYSAPs) Layer<br />
The next higher layer in the <strong>OpenVMS</strong> <strong>Cluster</strong> architecture consists of the<br />
SYSAPs layer. This layer consists of multiple system applications that provide,<br />
for example, access to disks and tapes and cluster membership control. SYSAPs<br />
can include:<br />
• Connection manager<br />
• MSCP server<br />
• TMSCP server<br />
• Disk and tape class drivers<br />
These components are described in detail later in this chapter.<br />
2.1.4 Other Layered Components<br />
A wide range of <strong>OpenVMS</strong> components layer on top of the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
system architecture, including:<br />
• Volume Shadowing for <strong>OpenVMS</strong><br />
• Distributed lock manager<br />
• Process control services<br />
• Distributed file system<br />
• Record Management Services (RMS)<br />
• Distributed job controller<br />
These components, except for volume shadowing, are described in detail later in<br />
this chapter. Volume Shadowing for <strong>OpenVMS</strong> is described in Section 6.6.<br />
2.2 <strong>OpenVMS</strong> <strong>Cluster</strong> Software Functions<br />
The <strong>OpenVMS</strong> <strong>Cluster</strong> software components that implement <strong>OpenVMS</strong> <strong>Cluster</strong><br />
communication and resource-sharing functions always run on every computer in<br />
the <strong>OpenVMS</strong> <strong>Cluster</strong>. If one computer fails, the <strong>OpenVMS</strong> <strong>Cluster</strong> system<br />
continues operating, because the components still run on the remaining<br />
computers.<br />
2.2.1 Functions<br />
The following table summarizes the <strong>OpenVMS</strong> <strong>Cluster</strong> communication and<br />
resource-sharing functions and the components that perform them.<br />
Function Performed By<br />
Ensure that <strong>OpenVMS</strong> <strong>Cluster</strong> computers communicate with one another<br />
to enforce the rules of cluster membership: Connection manager<br />
Synchronize functions performed by other <strong>OpenVMS</strong> <strong>Cluster</strong> components,<br />
<strong>OpenVMS</strong> products, and other software components: Distributed lock manager<br />
Share disks and files: Distributed file system<br />
Make disks available to nodes that do not have direct access: MSCP server<br />
Make tapes available to nodes that do not have direct access: TMSCP server<br />
Make queues available: Distributed job controller<br />
2.3 Ensuring the Integrity of <strong>Cluster</strong> Membership<br />
The connection manager ensures that computers in an <strong>OpenVMS</strong> <strong>Cluster</strong> system<br />
communicate with one another to enforce the rules of cluster membership.<br />
Computers in an <strong>OpenVMS</strong> <strong>Cluster</strong> system share various data and system<br />
resources, such as access to disks and files. To achieve the coordination that is<br />
necessary to maintain resource integrity, the computers must maintain a clear<br />
record of cluster membership.<br />
2.3.1 Connection Manager<br />
The connection manager creates an <strong>OpenVMS</strong> <strong>Cluster</strong> when the first computer is<br />
booted and reconfigures the cluster when computers join or leave it during cluster<br />
state transitions. The overall responsibilities of the connection manager are to:<br />
• Prevent partitioning (see Section 2.3.2).<br />
• Track which nodes in the <strong>OpenVMS</strong> <strong>Cluster</strong> system are active and which are<br />
not.<br />
• Deliver messages to remote nodes.<br />
• Remove nodes.<br />
• Provide a highly available message service in which other software<br />
components, such as the distributed lock manager, can synchronize access to<br />
shared resources.<br />
2.3.2 <strong>Cluster</strong> Partitioning<br />
A primary purpose of the connection manager is to prevent cluster partitioning,<br />
a condition in which nodes in an existing <strong>OpenVMS</strong> <strong>Cluster</strong> configuration divide<br />
into two or more independent clusters.<br />
<strong>Cluster</strong> partitioning can result in data file corruption because the distributed lock<br />
manager cannot coordinate access to shared resources for multiple <strong>OpenVMS</strong><br />
<strong>Cluster</strong> systems. The connection manager prevents cluster partitioning using a<br />
quorum algorithm.<br />
2.3.3 Quorum Algorithm<br />
The quorum algorithm is a mathematical method for determining if a majority of<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> members exist so resources can be shared across an <strong>OpenVMS</strong><br />
<strong>Cluster</strong> system. Quorum is the number of votes that must be present for the<br />
cluster to function. Quorum is a dynamic value calculated by the connection<br />
manager to prevent cluster partitioning. The connection manager allows<br />
processing to occur only if a majority of the <strong>OpenVMS</strong> <strong>Cluster</strong> members are<br />
functioning.<br />
2.3.4 System Parameters<br />
Two system parameters, VOTES and EXPECTED_VOTES, are key to the<br />
computations performed by the quorum algorithm. The following table describes<br />
these parameters.<br />
Parameter Description<br />
VOTES Specifies a fixed number of votes that a computer contributes toward quorum.<br />
The system manager can set the VOTES parameter on each computer or allow<br />
the operating system to set it to the following default values:<br />
• For satellite nodes, the default value is 0.<br />
• For all other computers, the default value is 1.<br />
Each Alpha or VAX computer with a nonzero value for the VOTES system<br />
parameter is considered a voting member.<br />
EXPECTED_VOTES Specifies the sum of all VOTES held by <strong>OpenVMS</strong> <strong>Cluster</strong> members. The initial<br />
value is used to derive an estimate of the correct quorum value for the cluster.<br />
The system manager must set this parameter on each active Alpha or VAX<br />
computer, including satellites, in the cluster.<br />
2.3.5 Calculating <strong>Cluster</strong> Votes<br />
The quorum algorithm operates as follows:<br />
Step Action<br />
1 When nodes in the <strong>OpenVMS</strong> <strong>Cluster</strong> boot, the connection manager uses the largest value<br />
for EXPECTED_VOTES of all systems present to derive an estimated quorum value<br />
according to the following formula:<br />
Estimated quorum = (EXPECTED_VOTES + 2)/2 | Rounded down<br />
2 During a state transition (whenever a node enters or leaves the cluster or when a quorum<br />
disk is recognized), the connection manager dynamically computes the cluster quorum<br />
value to be the maximum of the following:<br />
• The current cluster quorum value (calculated during the last cluster transition).<br />
• Estimated quorum, as described in step 1.<br />
• The value calculated from the following formula, where the VOTES system parameter<br />
is the total votes held by all cluster members: QUORUM = (VOTES + 2)/2 |<br />
Rounded down<br />
Note: Quorum disks are discussed in Section 2.3.7.<br />
3 The connection manager compares the cluster votes value to the cluster quorum value and<br />
determines what action to take based on the following conditions:<br />
WHEN... THEN...<br />
The total number of cluster votes is equal to at least the quorum value: The <strong>OpenVMS</strong><br />
<strong>Cluster</strong> system continues running.<br />
The current number of cluster votes drops below the quorum value (because of computers<br />
leaving the cluster): The remaining <strong>OpenVMS</strong> <strong>Cluster</strong> members suspend<br />
all process activity and all I/O operations to cluster-accessible disks and tapes until<br />
sufficient votes are added (that is, enough computers have joined the <strong>OpenVMS</strong><br />
<strong>Cluster</strong>) to bring the total number of votes to a value greater than or equal to quorum.<br />
Note: When a node leaves the <strong>OpenVMS</strong> <strong>Cluster</strong> system, the connection<br />
manager does not decrease the cluster quorum value. In fact, the connection<br />
manager never decreases the cluster quorum value; it only increases it, unless<br />
the REMOVE NODE option was selected during shutdown. However, system<br />
managers can decrease the value according to the instructions in Section 10.12.2.<br />
2.3.6 Example<br />
Consider a cluster consisting of three computers, each computer having its<br />
VOTES parameter set to 1 and its EXPECTED_VOTES parameter set to 3. The<br />
connection manager dynamically computes the cluster quorum value to be 2<br />
(that is, (3 + 2)/2). In this example, any two of the three computers constitute a<br />
quorum and can run in the absence of the third computer. No single computer<br />
can constitute a quorum by itself. Therefore, there is no way the three <strong>OpenVMS</strong><br />
<strong>Cluster</strong> computers can be partitioned and run as two independent clusters.<br />
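The quorum computation in this example can be reproduced with a short sketch (illustrative only; the function and variable names are invented for this sketch and are not part of <strong>OpenVMS</strong>):<br />

```python
def estimated_quorum(expected_votes: int) -> int:
    # Estimated quorum = (EXPECTED_VOTES + 2)/2, rounded down.
    return (expected_votes + 2) // 2

def cluster_quorum(current_quorum: int, expected_votes: int, total_votes: int) -> int:
    # On each state transition the connection manager takes the maximum of:
    #   - the current cluster quorum value (from the last transition),
    #   - the estimate derived from the largest EXPECTED_VOTES seen,
    #   - (total VOTES held by all members + 2)/2, rounded down.
    return max(current_quorum,
               estimated_quorum(expected_votes),
               (total_votes + 2) // 2)

# Three computers, each with VOTES = 1 and EXPECTED_VOTES = 3:
quorum = cluster_quorum(current_quorum=0, expected_votes=3, total_votes=3)
print(quorum)                       # 2

# Two surviving members (2 votes) meet quorum; a lone member (1 vote) does not:
print(2 >= quorum, 1 >= quorum)     # True False
```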
2.3.7 Quorum Disk<br />
A cluster system manager can designate a disk a quorum disk. The quorum<br />
disk acts as a virtual cluster member whose purpose is to add one vote to the total<br />
cluster votes. By establishing a quorum disk, you can increase the availability<br />
of a two-node cluster; such configurations can maintain quorum in the event of<br />
failure of either the quorum disk or one node, and continue operating.<br />
Note: Setting up a quorum disk is recommended only for <strong>OpenVMS</strong> <strong>Cluster</strong><br />
configurations with two nodes. A quorum disk is neither necessary nor<br />
recommended for configurations with more than two nodes.<br />
For example, assume an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration with many satellites<br />
(that have no votes) and two nonsatellite systems (each having one vote) that<br />
downline load the satellites. Quorum is calculated as follows:<br />
(EXPECTED_VOTES + 2)/2 = (2 + 2)/2 = 2<br />
Because there is no quorum disk, if either nonsatellite system departs from the<br />
cluster, only one vote remains and cluster quorum is lost. Activity will be blocked<br />
throughout the cluster until quorum is restored.<br />
However, if the configuration includes a quorum disk (adding one vote to the total<br />
cluster votes), and the EXPECTED_VOTES parameter is set to 3 on each node,<br />
then quorum will still be 2 even if one of the nodes leaves the cluster. Quorum is<br />
calculated as follows:<br />
(EXPECTED_VOTES + 2)/2 = (3 + 2)/2 = 2<br />
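The two calculations above can be checked with a brief sketch (illustrative only; the helper function is not part of <strong>OpenVMS</strong>):<br />

```python
def quorum(expected_votes: int) -> int:
    # (EXPECTED_VOTES + 2)/2, rounded down.
    return (expected_votes + 2) // 2

# Two voting nodes, no quorum disk: EXPECTED_VOTES = 2, quorum = 2.
# If either node leaves, the single remaining vote loses quorum:
q = quorum(2)
print(q, 1 >= q)          # 2 False

# Two voting nodes plus a quorum disk vote: EXPECTED_VOTES = 3, quorum is still 2.
# If one node leaves, the surviving node plus the quorum disk hold 2 votes:
q = quorum(3)
print(q, 1 + 1 >= q)      # 2 True
```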
Rules: Each <strong>OpenVMS</strong> <strong>Cluster</strong> system can include only one quorum disk. At<br />
least one computer must have a direct (not served) connection to the quorum<br />
disk:<br />
• Any computers that have a direct, active connection to the quorum disk or<br />
that have the potential for a direct connection should be enabled as quorum<br />
disk watchers.<br />
• Computers that cannot access the disk directly must rely on the quorum disk<br />
watchers for information about the status of votes contributed by the quorum<br />
disk.<br />
Reference: For more information about enabling a quorum disk, see<br />
Section 8.2.4. Section 8.3.2 describes removing a quorum disk.<br />
2.3.8 Quorum Disk Watcher<br />
To enable a computer as a quorum disk watcher, use one of the following methods:<br />
Method Perform These Steps<br />
Run the CLUSTER_CONFIG.COM procedure (described in Chapter 8): Invoke the procedure and:<br />
1. Select the CHANGE option.<br />
2. From the CHANGE menu, select the item labeled ‘‘Enable a quorum disk on the local computer’’.<br />
3. At the prompt, supply the quorum disk device name.<br />
The procedure uses the information you provide to update the values of<br />
the DISK_QUORUM and QDSKVOTES system parameters.<br />
Respond YES when the <strong>OpenVMS</strong> installation procedure asks whether the cluster will<br />
contain a quorum disk (described in Chapter 4): During the installation procedure:<br />
1. Answer Y when the procedure asks whether the cluster will contain a quorum disk.<br />
2. At the prompt, supply the quorum disk device name.<br />
The procedure uses the information you provide to update the values of<br />
the DISK_QUORUM and QDSKVOTES system parameters.<br />
Edit the MODPARAMS or AGEN$ files (described in Chapter 8): Edit the following parameters:<br />
• DISK_QUORUM: Specify the quorum disk name, in ASCII, as a<br />
value for the DISK_QUORUM system parameter.<br />
• QDSKVOTES: Set an appropriate value for the QDSKVOTES<br />
parameter. This parameter specifies the number of votes contributed<br />
to the cluster votes total by a quorum disk. The number of votes<br />
contributed by the quorum disk is equal to the smallest value of the<br />
QDSKVOTES parameter on any quorum disk watcher.<br />
Hint: If only one quorum disk watcher has direct access to the quorum disk, then<br />
remove the disk and give its votes to the node.<br />
2.3.9 Rules for Specifying Quorum<br />
For the quorum disk’s votes to be counted in the total cluster votes, the following<br />
conditions must be met:<br />
• On all computers capable of becoming watchers, you must specify the<br />
same physical device name as a value for the DISK_QUORUM system<br />
parameter. The remaining computers (which must have a blank value for<br />
DISK_QUORUM) recognize the name specified by the first quorum disk<br />
watcher with which they communicate.<br />
• At least one quorum disk watcher must have a direct, active connection to the<br />
quorum disk.<br />
• The disk must contain a valid format file named QUORUM.DAT in the<br />
master file directory. The QUORUM.DAT file is created automatically after a<br />
system specifying a quorum disk has booted into the cluster for the first time.<br />
This file is used on subsequent reboots.<br />
Note: The file is not created if the system parameter STARTUP_P1 is set to<br />
MIN.<br />
• To permit recovery from failure conditions, the quorum disk must be mounted<br />
by all disk watchers.<br />
• The <strong>OpenVMS</strong> <strong>Cluster</strong> can include only one quorum disk.<br />
• The quorum disk cannot be a member of a shadow set.<br />
Hint: By increasing the quorum disk’s votes to one less than the total votes from<br />
both systems (and by increasing the value of the EXPECTED_VOTES system<br />
parameter by the same amount), you can boot and run the cluster with only one<br />
node.<br />
2.4 State Transitions<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> state transitions occur when a computer joins or leaves an<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system and when the cluster recognizes a quorum disk state<br />
change. The connection manager controls these events to ensure the preservation<br />
of data integrity throughout the cluster.<br />
A state transition’s duration and effect on users (applications) are determined by<br />
the reason for the transition, the configuration, and the applications in use.<br />
2.4.1 Adding a Member<br />
Every transition goes through one or more phases, depending on whether its<br />
cause is the addition of a new <strong>OpenVMS</strong> <strong>Cluster</strong> member or the failure of a<br />
current member.<br />
Table 2–2 describes the phases of a transition caused by the addition of a new<br />
member.<br />
Table 2–2 Transitions Caused by Adding a <strong>Cluster</strong> Member<br />
Phase Description<br />
New member detection<br />
Early in its boot sequence, a computer seeking membership in an <strong>OpenVMS</strong><br />
<strong>Cluster</strong> system sends messages to current members asking to join the cluster.<br />
The first cluster member that receives the membership request acts as the<br />
new computer’s advocate and proposes reconfiguring the cluster to include the<br />
computer in the cluster. While the new computer is booting, no applications<br />
are affected.<br />
Note: The connection manager will not allow a computer to join the <strong>OpenVMS</strong><br />
<strong>Cluster</strong> system if the node’s value for EXPECTED_VOTES would raise the<br />
quorum value above the number of votes present, which would cause the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> to suspend activity.<br />
Reconfiguration During a configuration change due to a computer being added to an<br />
<strong>OpenVMS</strong> <strong>Cluster</strong>, all current <strong>OpenVMS</strong> <strong>Cluster</strong> members must establish<br />
communications with the new computer. Once communications are established,<br />
the new computer is admitted to the cluster. In some cases, the lock database<br />
is rebuilt.<br />
2.4.2 Losing a Member<br />
Table 2–3 describes the phases of a transition caused by the failure of a current<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> member.<br />
Table 2–3 Transitions Caused by Loss of a <strong>Cluster</strong> Member<br />
Cause Description<br />
Failure detection The duration of this phase depends on the cause of the failure and on how the failure is detected.<br />
During normal cluster operation, messages sent from one computer to another are acknowledged<br />
when received.<br />
IF... THEN...<br />
A message is not acknowledged within a period determined by <strong>OpenVMS</strong> <strong>Cluster</strong><br />
communications software: The repair attempt phase begins.<br />
A cluster member is shut down or fails: The operating system causes datagrams to be sent from<br />
the computer shutting down to the other members.<br />
These datagrams state the computer’s intention to sever<br />
communications and to stop sharing resources. The failure<br />
detection and repair attempt phases are bypassed, and the<br />
reconfiguration phase begins immediately.<br />
Repair attempt If the virtual circuit to an <strong>OpenVMS</strong> <strong>Cluster</strong> member is broken, attempts are made to repair<br />
the path. Repair attempts continue for an interval specified by the PAPOLLINTERVAL system<br />
parameter. (System managers can adjust the value of this parameter to suit local conditions.)<br />
Thereafter, the path is considered irrevocably broken, and steps must be taken to reconfigure the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system so that all computers can once again communicate with each other and<br />
so that computers that cannot communicate are removed from the <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
Reconfiguration If a cluster member is shut down or fails, the cluster must be reconfigured. One of the remaining<br />
computers acts as coordinator and exchanges messages with all other cluster members to<br />
determine an optimal cluster configuration with the most members and the most votes. This<br />
phase, during which all user (application) activity is blocked, usually lasts less than 3 seconds,<br />
although the actual time depends on the configuration.<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system recovery: Recovery includes the following stages, some of which can take place in parallel:<br />
Stage Action<br />
I/O completion When a computer is removed from the cluster, <strong>OpenVMS</strong> <strong>Cluster</strong> software<br />
ensures that all I/O operations that are started prior to the transition complete<br />
before I/O operations that are generated after the transition. This stage<br />
usually has little or no effect on applications.<br />
Lock database rebuild: Because the lock database is distributed among all members, some portion of<br />
the database might need rebuilding. A rebuild is performed as follows:<br />
WHEN a computer leaves the <strong>OpenVMS</strong> <strong>Cluster</strong>, THEN a rebuild is always performed.<br />
WHEN a computer is added to the <strong>OpenVMS</strong> <strong>Cluster</strong>, THEN a rebuild is performed when<br />
the LOCKDIRWT system parameter is greater than 1.<br />
Caution: Setting the LOCKDIRWT system parameter to different values on<br />
the same model or type of computer can cause the distributed lock manager<br />
to use the computer with the higher value. This could cause undue resource<br />
usage on that computer.<br />
Disk mount verification: This stage occurs only when the failure of a voting member causes quorum to<br />
be lost. To protect data integrity, all I/O activity is blocked until quorum is<br />
regained. Mount verification is the mechanism used to block I/O during this<br />
phase.<br />
Quorum disk votes validation: If, when a computer is removed, the remaining members can determine that<br />
it has shut down or failed, the votes contributed by the quorum disk are<br />
included without delay in quorum calculations that are performed by the<br />
remaining members. However, if the quorum watcher cannot determine that<br />
the computer has shut down or failed (for example, if a console halt, power<br />
failure, or communications failure has occurred), the votes are not included<br />
for a period (in seconds) equal to four times the value of the QDSKINTERVAL<br />
system parameter. This period is sufficient to determine that the failed<br />
computer is no longer using the quorum disk.<br />
Disk rebuild If the transition is the result of a computer rebooting after a failure, the disks<br />
are marked as improperly dismounted.<br />
Reference: See Sections 6.5.5 and 6.5.6 for information about rebuilding<br />
disks.<br />
When you assess the effect of a state transition on application users, consider that the application<br />
recovery phase includes activities such as replaying a journal file, cleaning up recovery units, and<br />
users logging in again.<br />
2.5 <strong>OpenVMS</strong> <strong>Cluster</strong> Membership<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> systems based on LAN use a cluster group number and a<br />
cluster password to allow multiple independent <strong>OpenVMS</strong> <strong>Cluster</strong> systems to<br />
coexist on the same extended LAN and to prevent accidental access to a cluster<br />
by unauthorized computers.<br />
2.5.1 <strong>Cluster</strong> Group Number<br />
The cluster group number uniquely identifies each <strong>OpenVMS</strong> <strong>Cluster</strong> system<br />
on a LAN. This number must be from 1 to 4095 or from 61440 to 65535.<br />
Rule: If you plan to have more than one <strong>OpenVMS</strong> <strong>Cluster</strong> system on a LAN,<br />
you must coordinate the assignment of cluster group numbers among system<br />
managers.<br />
Note: <strong>OpenVMS</strong> <strong>Cluster</strong> systems operating on CI and DSSI do not use cluster<br />
group numbers and passwords.<br />
2.5.2 <strong>Cluster</strong> Password<br />
The cluster password prevents an unauthorized computer using the cluster<br />
group number from joining the cluster. The password must be from 1 to 31<br />
alphanumeric characters in length, including dollar signs ( $ ) and underscores<br />
(_).<br />
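The rules for the cluster group number and cluster password can be summarized in a short validation sketch (the function names are invented for illustration; <strong>OpenVMS</strong> itself enforces these rules when you supply the values):<br />

```python
def valid_group_number(n: int) -> bool:
    # The cluster group number must be from 1 to 4095 or from 61440 to 65535.
    return 1 <= n <= 4095 or 61440 <= n <= 65535

def valid_cluster_password(pw: str) -> bool:
    # 1 to 31 characters: letters, digits, dollar signs, and underscores.
    allowed = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "abcdefghijklmnopqrstuvwxyz"
                  "0123456789$_")
    return 1 <= len(pw) <= 31 and all(c in allowed for c in pw)

print(valid_group_number(4095), valid_group_number(5000))        # True False
print(valid_cluster_password("CLUSTER_01$"),
      valid_cluster_password(""))                                # True False
```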
2.5.3 Location<br />
The cluster group number and cluster password are maintained in the cluster<br />
authorization file, SYS$COMMON:[SYSEXE]CLUSTER_AUTHORIZE.DAT. This<br />
file is created during installation of the operating system if you indicate that you<br />
want to set up a cluster that utilizes the LAN. The installation procedure then<br />
prompts you for the cluster group number and password.<br />
Note: If you convert an <strong>OpenVMS</strong> <strong>Cluster</strong> that uses only the CI<br />
or DSSI interconnect to one that includes a LAN interconnect, the<br />
SYS$COMMON:[SYSEXE]CLUSTER_AUTHORIZE.DAT file is created when<br />
you execute the CLUSTER_CONFIG.COM command procedure, as described in<br />
Chapter 8.<br />
Reference: For information about <strong>OpenVMS</strong> <strong>Cluster</strong> group data in the<br />
CLUSTER_AUTHORIZE.DAT file, see Sections 8.4 and 10.9.<br />
2.5.4 Example<br />
If the nodes in the <strong>OpenVMS</strong> <strong>Cluster</strong> do not all have the same cluster password, an<br />
error report similar to the following is logged in the error log file.<br />
V A X / V M S SYSTEM ERROR REPORT COMPILED 30-JAN-1994 15:38:03 PAGE 19.<br />
******************************* ENTRY 161. *******************************<br />
ERROR SEQUENCE 24. LOGGED ON: SID 12000003<br />
DATE/TIME 30-JAN-1994 15:35:47.94 SYS_TYPE 04010002<br />
SYSTEM UPTIME: 5 DAYS 03:46:21<br />
SCS NODE: DAISIE VAX/VMS V6.0<br />
DEVICE ATTENTION KA46 CPU FW REV# 3. CONSOLE FW REV# 0.1<br />
NI-SCS SUB-SYSTEM, DAISIE$PEA0:<br />
INVALID CLUSTER PASSWORD RECEIVED<br />
STATUS 00000000 00000000<br />
DATALINK UNIT 0001<br />
DATALINK NAME 41534503 00000000 00000000 00000000 DATALINK NAME = ESA1:<br />
REMOTE NODE 554C4306 00203132 00000000 00000000 REMOTE NODE = CLU21<br />
REMOTE ADDR 000400AA FC15 ETHERNET ADDR = AA-00-04-00-15-FC<br />
LOCAL ADDR 000400AA 4D34 ETHERNET ADDR = AA-00-04-00-34-4D<br />
ERROR CNT 0001 1. ERROR OCCURRENCES THIS ENTRY<br />
UCB$W_ERRCNT 0003 3. ERRORS THIS UNIT<br />
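The hexadecimal dump fields in this entry are little-endian byte dumps, which is how the error formatter derives the translated values shown alongside them. For example, the REMOTE ADDR longword/word pair 000400AA FC15 decodes to the Ethernet address AA-00-04-00-15-FC. The decoding can be sketched as follows (Python, purely illustrative):

```python
def decode_lan_address(longword, word):
    """Decode an error-log address dump (one longword plus one word,
    both little-endian) into a printable Ethernet address."""
    raw = longword.to_bytes(4, "little") + word.to_bytes(2, "little")
    return "-".join(f"{b:02X}" for b in raw)

# REMOTE ADDR 000400AA FC15 from the sample entry above
print(decode_lan_address(0x000400AA, 0xFC15))  # AA-00-04-00-15-FC
```

The REMOTE NODE field decodes the same way: the bytes of 554C4306 00203132, read in little-endian order, form a counted ASCII string naming node CLU21.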
2.6 Synchronizing <strong>Cluster</strong> Functions by the Distributed Lock<br />
Manager<br />
The distributed lock manager is an <strong>OpenVMS</strong> feature for synchronizing<br />
functions required by the distributed file system, the distributed job controller,<br />
device allocation, user-written <strong>OpenVMS</strong> <strong>Cluster</strong> applications, and other<br />
<strong>OpenVMS</strong> products and software components.<br />
The distributed lock manager uses the connection manager and SCS to<br />
communicate information between <strong>OpenVMS</strong> <strong>Cluster</strong> computers.<br />
2.6.1 Distributed Lock Manager Functions<br />
The functions of the distributed lock manager include the following:<br />
• Synchronizes access to shared clusterwide resources, including:<br />
Devices<br />
Files<br />
Records in files<br />
Any user-defined resources, such as databases and memory<br />
Each resource is managed clusterwide by an <strong>OpenVMS</strong> <strong>Cluster</strong> computer.<br />
• Implements the $ENQ and $DEQ system services to provide clusterwide<br />
synchronization of access to resources by allowing the locking and unlocking<br />
of resource names.<br />
Reference: For detailed information about system services, refer to the<br />
<strong>OpenVMS</strong> System Services Reference Manual.<br />
• Queues process requests for access to a locked resource. This queuing<br />
mechanism allows processes to be put into a wait state until a particular<br />
resource is available. As a result, cooperating processes can synchronize their<br />
access to shared objects, such as files and records.<br />
• Releases all locks that an <strong>OpenVMS</strong> <strong>Cluster</strong> computer holds if the computer<br />
fails. This mechanism allows processing to continue on the remaining<br />
computers.<br />
• Supports clusterwide deadlock detection.<br />
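The queuing and failure-release behaviors in the list above can be modeled in a few lines. The sketch below is a deliberately simplified, hypothetical model (one exclusive holder per resource, FIFO waiters); the real distributed lock manager supports multiple lock modes, lock conversions, and deadlock detection:

```python
from collections import deque

class ToyLockManager:
    """Illustrative model (not the OpenVMS implementation) of two
    distributed lock manager behaviors: queuing requests for a busy
    resource, and releasing all locks held by a failed node."""
    def __init__(self):
        self.holder = {}    # resource name -> (node, process)
        self.waiters = {}   # resource name -> FIFO of (node, process)

    def enq(self, resource, node, process):
        """Model of $ENQ: grant the lock or queue the requester."""
        if resource not in self.holder:
            self.holder[resource] = (node, process)
            return "granted"
        self.waiters.setdefault(resource, deque()).append((node, process))
        return "waiting"

    def deq(self, resource):
        """Model of $DEQ: release the lock and wake the next waiter."""
        q = self.waiters.get(resource)
        if q:
            self.holder[resource] = q.popleft()
        else:
            self.holder.pop(resource, None)

    def node_failed(self, node):
        """Release every lock held by a failed node so processing
        can continue on the remaining computers."""
        for resource, (n, _) in list(self.holder.items()):
            if n == node:
                self.deq(resource)
```

A process whose request returns "waiting" corresponds to a process placed in a wait state until the resource becomes available.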
2.6.2 System Management of the Lock Manager<br />
The lock manager is fully automated and usually requires no explicit system<br />
management. However, the LOCKDIRWT system parameter can be used to<br />
adjust how control of lock resource trees is distributed across the cluster.<br />
The node that controls a lock resource tree is called the resource master. Each<br />
resource tree may be mastered by a different node.<br />
For most configurations, large computers and boot nodes perform optimally when<br />
LOCKDIRWT is set to 1 and satellite nodes have LOCKDIRWT set to 0. These<br />
values are set automatically by the CLUSTER_CONFIG.COM procedure.<br />
In some circumstances, you may want to change the values of the LOCKDIRWT<br />
across the cluster to control which nodes master resource trees. The following list<br />
describes how the value of the LOCKDIRWT system parameter affects resource<br />
tree mastership:<br />
• If multiple nodes have locks on a resource tree, the tree is mastered by the<br />
node with the highest value for LOCKDIRWT, regardless of actual locking<br />
rates.<br />
• If multiple nodes with the same LOCKDIRWT value have locks on a resource,<br />
the tree is mastered by the node with the highest locking rate on that tree.<br />
• Note that if only one node has locks on a resource tree, it becomes the master<br />
of the tree, regardless of the LOCKDIRWT value.<br />
Thus, using varying values for the LOCKDIRWT system parameter, you can<br />
implement a resource tree mastering policy that is priority based. Using equal<br />
values for the LOCKDIRWT system parameter, you can implement a resource<br />
tree mastering policy that is activity based. If necessary, a combination of<br />
priority-based and activity-based remastering can be used.<br />
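The mastership rules above amount to a simple selection function. The following sketch encodes them (Python, illustrative only; the data layout is invented):

```python
def choose_resource_master(lockers):
    """Pick the master of a resource tree from the nodes holding
    locks on it, following the LOCKDIRWT rules described above.
    Each entry is (node_name, lockdirwt, locking_rate)."""
    if len(lockers) == 1:
        # A sole locker masters the tree regardless of LOCKDIRWT.
        return lockers[0][0]
    # Highest LOCKDIRWT wins; ties go to the highest locking rate.
    return max(lockers, key=lambda e: (e[1], e[2]))[0]
```

With unequal weights the policy is priority based (the satellite's high activity is ignored); with equal weights it becomes activity based.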
2.6.3 Large-Scale Locking Applications<br />
The Enqueue process limit (ENQLM), which is set in the SYSUAF.DAT file and<br />
which controls the number of locks that a process can own, can be adjusted to<br />
meet the demands of large-scale databases and other server applications.<br />
Prior to <strong>OpenVMS</strong> Version 7.1, the limit was 32767. This limit was removed<br />
to enable the efficient operation of large-scale databases and other server<br />
applications. A process can now own up to 16,776,959 locks, the architectural<br />
maximum. By setting ENQLM in SYSUAF.DAT to 32767 (using the Authorize<br />
utility), the lock limit is automatically extended to the maximum of 16,776,959<br />
locks. A process created by $CREPRC inherits this extended quota if its creating<br />
process has a SYSUAF ENQLM quota of 32767.<br />
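The quota extension described above reduces to a one-line rule, sketched here for illustration (the constant is the architectural maximum quoted in the text; the function name is invented):

```python
ARCHITECTURAL_MAX = 16_776_959   # maximum locks a process can own

def effective_enqlm(sysuaf_enqlm):
    """A SYSUAF ENQLM of 32767 is treated as 'no practical limit'
    and extended to the architectural maximum; smaller values
    apply as written."""
    return ARCHITECTURAL_MAX if sysuaf_enqlm == 32767 else sysuaf_enqlm
```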
Reference: See the <strong>OpenVMS</strong> Programming Concepts Manual for additional<br />
information about the distributed lock manager and resource trees. See the<br />
<strong>OpenVMS</strong> System Manager’s Manual for more information about Enqueue<br />
Quota.<br />
2.7 Resource Sharing<br />
Resource sharing in an <strong>OpenVMS</strong> <strong>Cluster</strong> system is enabled by the distributed<br />
file system, RMS, and the distributed lock manager.<br />
2.7.1 Distributed File System<br />
The <strong>OpenVMS</strong> <strong>Cluster</strong> distributed file system allows all computers to share<br />
mass storage and files. The distributed file system provides the same access<br />
to disks, tapes, and files across the <strong>OpenVMS</strong> <strong>Cluster</strong> that is provided on a<br />
standalone computer.<br />
2.7.2 RMS and Distributed Lock Manager<br />
The distributed file system and <strong>OpenVMS</strong> Record Management Services (RMS)<br />
use the distributed lock manager to coordinate clusterwide file access. RMS files<br />
can be shared to the record level.<br />
Any disk or tape can be made available to the entire <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
The storage devices can be:<br />
• Connected to an HSC, HSJ, HSD, HSG, HSZ, DSSI, or SCSI subsystem<br />
• A local device that is served to the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
All cluster-accessible devices appear as if they are connected to every computer.<br />
2.8 Disk Availability<br />
Locally connected disks can be served across an <strong>OpenVMS</strong> <strong>Cluster</strong> by the MSCP<br />
server.<br />
2.8.1 MSCP Server<br />
The MSCP server makes locally connected disks, including the following,<br />
available across the cluster:<br />
• DSA disks local to <strong>OpenVMS</strong> <strong>Cluster</strong> members using SDI<br />
• HSC and HSJ disks in an <strong>OpenVMS</strong> <strong>Cluster</strong> using mixed interconnects<br />
• ISE and HSD disks in an <strong>OpenVMS</strong> <strong>Cluster</strong> using mixed interconnects<br />
• SCSI and HSZ disks<br />
• FC and HSG disks<br />
• Disks on boot servers and disk servers located anywhere in the <strong>OpenVMS</strong><br />
<strong>Cluster</strong><br />
In conjunction with the disk class driver (DUDRIVER), the MSCP server<br />
implements the storage server portion of the MSCP protocol on a computer,<br />
allowing the computer to function as a storage controller. The MSCP protocol<br />
defines conventions for the format and timing of messages sent and received for<br />
certain families of mass storage controllers and devices designed by Compaq. The<br />
MSCP server decodes and services MSCP I/O requests sent by remote cluster<br />
nodes.<br />
Note: The MSCP server is not used by a computer to access files on locally<br />
connected disks.<br />
2.8.2 Device Serving<br />
Once a device is set up to be served:<br />
• Any cluster member can submit I/O requests to it.<br />
• The local computer can decode and service MSCP I/O requests sent by remote<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> computers.<br />
2.8.3 Enabling the MSCP Server<br />
The MSCP server is controlled by the MSCP_LOAD and MSCP_SERVE_ALL<br />
system parameters. The values of these parameters are set initially by answers<br />
to questions asked during the <strong>OpenVMS</strong> installation procedure (described in<br />
Section 8.4), or during the CLUSTER_CONFIG.COM procedure (described in<br />
Chapter 8).<br />
The default values for these parameters are as follows:<br />
• The MSCP server is not loaded on satellites.<br />
• The MSCP server is loaded on boot server and disk server nodes.<br />
Reference: See Section 6.3 for more information about setting system<br />
parameters for MSCP serving.<br />
2.9 Tape Availability<br />
Locally connected tapes can be served across an <strong>OpenVMS</strong> <strong>Cluster</strong> by the TMSCP<br />
server.<br />
2.9.1 TMSCP Server<br />
The TMSCP server makes locally connected tapes, including the following,<br />
available across the cluster:<br />
• HSC and HSJ tapes<br />
• ISE and HSD tapes<br />
• SCSI tapes<br />
The TMSCP server implements the TMSCP protocol, which is used to<br />
communicate with a controller for TMSCP tapes. In conjunction with the tape<br />
class driver (TUDRIVER), the TMSCP protocol is implemented on a processor,<br />
allowing the processor to function as a storage controller.<br />
The processor submits I/O requests to locally accessed tapes, and accepts the I/O<br />
requests from any node in the cluster. In this way, the TMSCP server makes<br />
locally connected tapes available to all nodes in the cluster. The TMSCP server<br />
can also make HSC tapes and DSSI ISE tapes accessible to <strong>OpenVMS</strong> <strong>Cluster</strong><br />
satellites.<br />
2.9.2 Enabling the TMSCP Server<br />
The TMSCP server is controlled by the TMSCP_LOAD system parameter. The<br />
value of this parameter is set initially by answers to questions asked during<br />
the <strong>OpenVMS</strong> installation procedure (described in Section 4.2.3) or during the<br />
CLUSTER_CONFIG.COM procedure (described in Section 8.4). By default, the<br />
setting of the TMSCP_LOAD parameter does not load the TMSCP server and<br />
does not serve any tapes.<br />
2.10 Queue Availability<br />
The distributed job controller makes queues available across the cluster in<br />
order to achieve the following:<br />
• Permit users on any <strong>OpenVMS</strong> <strong>Cluster</strong> computer to submit batch and print jobs<br />
to queues that execute on any computer in the <strong>OpenVMS</strong> <strong>Cluster</strong>. Users can<br />
submit jobs to any queue in the cluster, provided that the necessary mass storage<br />
volumes and peripheral devices are accessible to the computer on which the job<br />
executes.<br />
• Distribute the batch and print processing work load over <strong>OpenVMS</strong> <strong>Cluster</strong><br />
nodes. System managers can set up generic batch and print queues that distribute<br />
processing work loads among computers. The distributed job controller directs<br />
batch and print jobs either to the execution queue with the lowest ratio of<br />
jobs-to-queue limit or to the next available printer.<br />
The job controller uses the distributed lock manager to signal other computers<br />
in the <strong>OpenVMS</strong> <strong>Cluster</strong> to examine the batch and print queue jobs to be<br />
processed.<br />
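The selection rule for generic batch queues (lowest ratio of current jobs to job limit) can be sketched as follows (Python, hypothetical names; job limits are assumed to be greater than zero):

```python
def pick_execution_queue(queues):
    """Choose the target execution queue for a generic batch queue:
    the one with the lowest ratio of current jobs to its job limit.
    Each entry is (queue_name, current_jobs, job_limit)."""
    return min(queues, key=lambda q: q[1] / q[2])[0]
```

For example, a queue running 1 of 4 allowed jobs (ratio 0.25) is preferred over one running 4 of 8 (ratio 0.5), even though the latter has more free slots.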
2.10.1 Controlling Queues<br />
To control queues, you use one or several queue managers to maintain a<br />
clusterwide queue database that stores information about queues and jobs.<br />
Reference: For detailed information about setting up <strong>OpenVMS</strong> <strong>Cluster</strong> queues,<br />
see Chapter 7.<br />
3<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> Interconnect Configurations<br />
3.1 Overview<br />
This chapter provides an overview of various types of <strong>OpenVMS</strong> <strong>Cluster</strong><br />
configurations and the ways they are interconnected.<br />
References: For definitive information about supported <strong>OpenVMS</strong> <strong>Cluster</strong><br />
configurations, refer to:<br />
• <strong>OpenVMS</strong> <strong>Cluster</strong> Software Software Product Description (SPD 29.78.xx)<br />
• Guidelines for <strong>OpenVMS</strong> <strong>Cluster</strong> Configurations<br />
All Alpha and VAX nodes in any type of <strong>OpenVMS</strong> <strong>Cluster</strong> must have direct<br />
connections to all other nodes. Sites can choose to use one or more of the<br />
following interconnects:<br />
• LANs<br />
ATM<br />
Ethernet (10/100 and Gigabit Ethernet)<br />
FDDI<br />
• CI<br />
• DSSI<br />
• MEMORY CHANNEL<br />
• SCSI (requires a second interconnect for node-to-node [SCS] communications)<br />
• Fibre Channel (requires a second interconnect for node-to-node [SCS]<br />
communications)<br />
Processing needs and available hardware resources determine how individual<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> systems are configured. The configuration discussions in this<br />
chapter are based on these physical interconnects.<br />
3.2 <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> Interconnected by CI<br />
The CI was the first interconnect used for <strong>OpenVMS</strong> <strong>Cluster</strong> communications.<br />
The CI supports the exchange of information among VAX and Alpha nodes, and<br />
HSC and HSJ nodes at the rate of 70 megabits per second on two paths.<br />
3.2.1 Design<br />
The CI is designed for access to storage and for reliable host-to-host<br />
communication. CI is a high-performance, highly available way to connect<br />
Alpha and VAX nodes to disk and tape storage devices and to each other. An<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system based on the CI for cluster communications uses<br />
star couplers as common connection points for computers, and HSC and HSJ<br />
subsystems.<br />
3.2.2 Example<br />
Figure 3–1 shows how the CI components are typically configured.<br />
Figure 3–1 <strong>OpenVMS</strong> <strong>Cluster</strong> Configuration Based on CI<br />
[Figure: VAX and Alpha nodes connected by CI lines to a star coupler; HSJ subsystems on the CI serve SCSI storage, including a VAX system disk, an Alpha system disk, and disks containing shared environment files.]<br />
Note: If you want to add workstations to a CI <strong>OpenVMS</strong> <strong>Cluster</strong> system, you<br />
must utilize an additional type of interconnect, such as Ethernet or FDDI, in the<br />
configuration. Workstations are typically configured as satellites in an <strong>OpenVMS</strong><br />
<strong>Cluster</strong> system (see Section 3.4.4).<br />
Reference: For instructions on adding satellites to an existing CI <strong>OpenVMS</strong><br />
<strong>Cluster</strong> system, refer to Section 8.2.<br />
3.2.3 Star Couplers<br />
What appears to be a single point of failure in the CI configuration in Figure 3–1<br />
is the star coupler that connects all the CI lines. In reality, the star coupler is<br />
not a single point of failure because there are actually two star couplers in every<br />
cabinet.<br />
Star couplers are also immune to power failures because they contain no powered<br />
components but are constructed as sets of high-frequency pulse transformers.<br />
Because they do no processing or buffering, star couplers also are not I/O<br />
throughput bottlenecks. They operate at the full-rated speed of the CI cables.<br />
However, in configurations with very heavy I/O loads, the aggregate demand can<br />
exceed the bandwidth of a single CI, in which case multiple star couplers may be required.<br />
3.3 <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> Interconnected by DSSI<br />
The DIGITAL Storage <strong>Systems</strong> Interconnect (DSSI) is a medium-bandwidth<br />
interconnect that Alpha and VAX nodes can use to access disk and tape<br />
peripherals. Each peripheral is an integrated storage element (ISE) that<br />
contains its own controller and its own MSCP server that works in parallel with<br />
the other ISEs on the DSSI.<br />
3.3.1 Design<br />
Although the DSSI is designed primarily to access disk and tape storage, it has<br />
proven an excellent way to connect small numbers of nodes using the <strong>OpenVMS</strong><br />
<strong>Cluster</strong> protocols. Each DSSI port connects to a single DSSI bus. As in the case<br />
of the CI, several DSSI ports can be connected to a node to provide redundant<br />
paths between nodes. However, unlike the CI, which is dual pathed, a single DSSI<br />
bus does not itself provide redundant paths.<br />
3.3.2 Availability<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> configurations using ISE devices and the DSSI bus offer high<br />
availability, flexibility, growth potential, and ease of system management.<br />
DSSI nodes in an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration can access a common system<br />
disk and all data disks directly on a DSSI bus and serve them to satellites.<br />
Satellites (and users connected through terminal servers) can access any disk<br />
through any node designated as a boot server. If one of the boot servers fails,<br />
applications on satellites continue to run because disk access fails over to the<br />
other server. Although applications running on nonintelligent devices, such as<br />
terminal servers, are interrupted, users of terminals can log in again and restart<br />
their jobs.<br />
3.3.3 Guidelines<br />
Generic configuration guidelines for DSSI <strong>OpenVMS</strong> <strong>Cluster</strong> systems are as<br />
follows:<br />
• Currently, a total of four Alpha and/or VAX nodes can be connected to a<br />
common DSSI bus.<br />
• Multiple DSSI buses can operate in an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration, thus<br />
dramatically increasing the amount of storage that can be configured into the<br />
system.<br />
References: Some restrictions apply to the type of CPUs and DSSI I/O adapters<br />
that can reside on the same DSSI bus. Consult your service representative or see<br />
the <strong>OpenVMS</strong> <strong>Cluster</strong> Software Software Product Description (SPD) for complete<br />
and up-to-date configuration details about DSSI <strong>OpenVMS</strong> <strong>Cluster</strong> systems.<br />
3.3.4 Example<br />
Figure 3–2 shows a typical DSSI configuration.<br />
Figure 3–2 DSSI <strong>OpenVMS</strong> <strong>Cluster</strong> Configuration<br />
[Figure: three Alpha or VAX nodes sharing a common DSSI bus with RF disks.]<br />
3.4 <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> Interconnected by LANs<br />
The Ethernet (10/100 and Gigabit), FDDI, and ATM interconnects are industry-standard<br />
local area networks (LANs) that are generally shared by a wide variety<br />
of network consumers. When <strong>OpenVMS</strong> <strong>Cluster</strong> systems are based on LAN,<br />
cluster communications are carried out by a port driver (PEDRIVER) that<br />
emulates CI port functions.<br />
3.4.1 Design<br />
The <strong>OpenVMS</strong> <strong>Cluster</strong> software is designed to use the Ethernet, ATM, and FDDI<br />
ports and interconnects simultaneously with the DECnet, TCP/IP, and SCS<br />
protocols. This is accomplished by allowing LAN data link software to control the<br />
hardware port. This software provides a multiplexing function so that the cluster<br />
protocols are simply another user of a shared hardware resource. See Figure 2–1<br />
for an illustration of this concept.<br />
3.4.2 <strong>Cluster</strong> Group Numbers and <strong>Cluster</strong> Passwords<br />
A single LAN can support multiple LAN-based <strong>OpenVMS</strong> <strong>Cluster</strong> systems. Each<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> is identified and secured by a unique cluster group number<br />
and a cluster password. Chapter 2 describes cluster group numbers and cluster<br />
passwords in detail.<br />
3.4.3 Servers<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> computers interconnected by a LAN are generally configured<br />
as either servers or satellites. The following table describes servers.<br />
MOP servers: Downline load the <strong>OpenVMS</strong> boot driver to satellites by means of the<br />
Maintenance Operations Protocol (MOP).<br />
Disk servers: Use MSCP server software to make their locally connected disks and<br />
any CI or DSSI connected disks available to satellites over the LAN.<br />
Tape servers: Use TMSCP server software to make their locally connected tapes and<br />
any CI or DSSI connected tapes available to satellite nodes over the<br />
LAN.<br />
Boot servers: A combination of a MOP server and a disk server that serves one or<br />
more Alpha or VAX system disks. Boot and disk servers make user<br />
and application data disks available across the cluster. These servers<br />
should be the most powerful computers in the <strong>OpenVMS</strong> <strong>Cluster</strong> and<br />
should use the highest-bandwidth LAN adapters in the cluster. Boot<br />
servers must always run the MSCP server software.<br />
3.4.4 Satellites<br />
Satellites are computers without a local system disk. Generally, satellites are<br />
consumers of cluster resources, although they can also provide facilities for disk<br />
serving, tape serving, and batch processing. If satellites are equipped with local<br />
disks, they can enhance performance by using such local disks for paging and<br />
swapping.<br />
Satellites are booted remotely from a boot server (or from a MOP server and a<br />
disk server) serving the system disk. Section 3.4.5 describes MOP and disk server<br />
functions during satellite booting.<br />
Note: An Alpha system disk can be mounted as a data disk on a VAX computer<br />
and, with proper MOP setup, can be used to boot Alpha satellites. Similarly, a<br />
VAX system disk can be mounted on an Alpha computer and, with the proper<br />
MOP setup, can be used to boot VAX satellites.<br />
Reference: Cross-architecture booting is described in Section 10.5.<br />
3.4.5 Satellite Booting<br />
When a satellite requests an operating system load, a MOP server for the<br />
appropriate <strong>OpenVMS</strong> Alpha or <strong>OpenVMS</strong> VAX operating system sends a<br />
bootstrap image to the satellite that allows the satellite to load the rest of the<br />
operating system from a disk server and join the cluster. The sequence of actions<br />
during booting is described in Table 3–1.<br />
Table 3–1 Satellite Booting Process<br />
Step 1: Satellite requests MOP service. This is the original boot request that a<br />
satellite sends out across the network. Any node in the <strong>OpenVMS</strong> <strong>Cluster</strong> that<br />
has MOP service enabled and has the LAN address of the particular satellite node<br />
in its database can become the MOP server for the satellite.<br />
Step 2: MOP server loads the Alpha or VAX system. The MOP server responds to<br />
an Alpha satellite boot request by downline loading the SYS$SYSTEM:APB.EXE<br />
program along with the required parameters; it responds to a VAX satellite<br />
boot request by downline loading the SYS$SHARE:NISCS_LOAD.EXE program<br />
along with the required parameters. For both Alpha and VAX computers, these<br />
parameters include the system disk name and the root number of the satellite.<br />
Step 3: Satellite finds additional parameters located on the system disk and<br />
root. The satellite finds <strong>OpenVMS</strong> <strong>Cluster</strong> system parameters, such as<br />
SCSSYSTEMID, SCSNODE, and NISCS_CONV_BOOT. The satellite also finds<br />
the cluster group code and password.<br />
Step 4: Satellite executes the load program. The program establishes an SCS<br />
connection to a disk server for the satellite system disk and loads the<br />
SYSBOOT.EXE program.<br />
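The booting sequence can be condensed into a short summary function. This is an illustrative sketch, not part of any OpenVMS procedure; only the loaded image names come from the table:

```python
def satellite_boot_sequence(arch):
    """Summarize the satellite downline-load sequence for one
    architecture ('Alpha' or 'VAX')."""
    loader = {"Alpha": "SYS$SYSTEM:APB.EXE",
              "VAX":   "SYS$SHARE:NISCS_LOAD.EXE"}[arch]
    return [
        "satellite broadcasts a MOP boot request",
        f"MOP server downline loads {loader} plus required parameters",
        "satellite reads system parameters and the cluster group "
        "code and password from its root on the system disk",
        "load program opens an SCS connection to a disk server "
        "and loads SYSBOOT.EXE",
    ]
```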
3.4.6 Examples<br />
Figure 3–3 shows an <strong>OpenVMS</strong> <strong>Cluster</strong> system based on a LAN interconnect with<br />
a single Alpha server node and a single Alpha system disk.<br />
Note: To include VAX satellites in this configuration, configure a VAX system<br />
disk on the Alpha server node following the instructions in Section 10.5.<br />
Figure 3–3 LAN <strong>OpenVMS</strong> <strong>Cluster</strong> System with Single Server Node and System Disk<br />
[Figure: one Alpha server node with its system disk and shared disks, connected by the LAN to Alpha satellites, some of which have local disks.]<br />
In Figure 3–3, the server node (and its system disk) is a single point of failure.<br />
If the server node fails, the satellite nodes cannot access any of the shared disks,<br />
including the system disk. Note that some of the satellite nodes have locally<br />
connected disks. If you convert one or more of these into system disks, satellite<br />
nodes can boot from their own local system disk.<br />
Figure 3–4 shows an example of an <strong>OpenVMS</strong> <strong>Cluster</strong> system that uses LAN and<br />
Fibre Channel interconnects.<br />
Figure 3–4 LAN and Fibre Channel <strong>OpenVMS</strong> <strong>Cluster</strong> System: Sample<br />
Configuration<br />
[Figure: four Alpha nodes (A, B, C, and D) on a LAN switch; the nodes connect through two FC switches to HSG controllers serving shadow sets A and B.]<br />
The LAN connects nodes A and B with nodes C and D into a single <strong>OpenVMS</strong><br />
<strong>Cluster</strong> system.<br />
In Figure 3–4, Volume Shadowing for <strong>OpenVMS</strong> is used to maintain key data<br />
storage devices in identical states (shadow sets A and B). Any data on the<br />
shadowed disks written at one site will also be written at the other site. However,<br />
the benefits of high data availability must be weighed against the performance<br />
overhead required to use the MSCP server to serve the shadow set over the<br />
cluster interconnect.<br />
Figure 3–5 illustrates how FDDI can be configured with Ethernet from the<br />
bridges to the server CPU nodes. This configuration can increase overall<br />
throughput. <strong>OpenVMS</strong> <strong>Cluster</strong> systems that have heavily utilized Ethernet<br />
segments can replace the Ethernet backbone with a faster LAN to alleviate the<br />
performance bottleneck that can be caused by the Ethernet.<br />
Figure 3–5 FDDI in Conjunction with Ethernet in an <strong>OpenVMS</strong> <strong>Cluster</strong> System<br />
[Figure: satellite nodes on 10/100 Ethernet segments connect through Ethernet-to-FDDI bridges to an FDDI wiring concentrator; Alpha and VAX nodes on the FDDI ring reach storage through CI connections to HSJ controllers and a star coupler.]<br />
Comments:<br />
• Each satellite LAN segment could be in a different building or town because<br />
of the longer distances allowed by FDDI.<br />
• The longer distances provided by FDDI permit you to create new <strong>OpenVMS</strong><br />
<strong>Cluster</strong> systems in your computing environment. The large nodes on the right<br />
could have replaced server nodes that previously existed on the individual<br />
Ethernet segments.<br />
• The VAX and Alpha computers have CI connections for storage. Currently, no<br />
storage controllers connect directly to FDDI. CPU nodes connected to FDDI<br />
must have local storage or access to storage over another interconnect.<br />
If an <strong>OpenVMS</strong> <strong>Cluster</strong> system has more than one FDDI-connected node,<br />
then those CPU nodes will probably use CI or DSSI connections for storage.<br />
The VAX and Alpha computers, connected by CI in Figure 3–5, are considered<br />
a lobe of the <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
3.4.7 LAN Bridge Failover Process<br />
The following table describes how the bridge parameter settings can affect the<br />
failover process.<br />
Option: Decreasing the LISTEN_TIME value allows the bridge to detect topology changes more quickly.<br />
Comments: If you reduce the LISTEN_TIME parameter value, you should also decrease the value for the HELLO_INTERVAL bridge parameter according to the bridge-specific guidelines. Note, however, that decreasing the HELLO_INTERVAL value causes an increase in network traffic.<br />
Option: Decreasing the FORWARDING_DELAY value can cause the bridge to forward packets unnecessarily to the other LAN segment.<br />
Comments: Unnecessary forwarding can temporarily cause more traffic on both LAN segments until the bridge software determines which LAN address is on each side of the bridge.<br />
Note: If you change a parameter on one LAN bridge, you should change that<br />
parameter on all bridges to ensure that selection of a new root bridge does not<br />
change the value of the parameter. The actual parameter value the bridge uses is<br />
the value specified by the root bridge.<br />
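The guideline above, reduce HELLO_INTERVAL whenever you reduce LISTEN_TIME, can be sketched as a proportional scaling. This is purely an illustration: the actual relationship between the two parameters is defined by the bridge-specific guidelines, and the function name is hypothetical.

```python
def adjusted_hello_interval(old_listen, new_listen, old_hello):
    """Scale HELLO_INTERVAL down in proportion to a LISTEN_TIME reduction.

    Illustrative only: real bridges document their own required ratio
    between LISTEN_TIME and HELLO_INTERVAL.
    """
    if new_listen >= old_listen:
        return old_hello  # no reduction, keep the current interval
    # Smaller interval means faster detection but more hello traffic.
    return old_hello * new_listen / old_listen

# Halving LISTEN_TIME from 15 s to 7.5 s halves a 3 s HELLO_INTERVAL.
print(adjusted_hello_interval(15, 7.5, 3))  # 1.5
```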
3.5 <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> Interconnected by MEMORY<br />
CHANNEL<br />
MEMORY CHANNEL is a high-performance cluster interconnect technology for<br />
PCI-based Alpha systems. With the benefits of very low latency, high bandwidth,<br />
and direct memory access, MEMORY CHANNEL complements and extends the<br />
ability of <strong>OpenVMS</strong> <strong>Cluster</strong>s to work as a single, virtual system. MEMORY<br />
CHANNEL is used for node-to-node cluster communications only. You use it<br />
in combination with another interconnect, such as Fibre Channel, SCSI, CI, or<br />
DSSI, that is dedicated to storage traffic.<br />
3.5.1 Design<br />
A node requires the following three hardware components to support a MEMORY<br />
CHANNEL connection:<br />
• PCI-to-MEMORY CHANNEL adapter<br />
• Link cable (3 m [10 ft] long)<br />
• Port in a MEMORY CHANNEL hub (except for a two-node configuration in<br />
which the cable connects just two PCI adapters)<br />
3.5.2 Examples<br />
Figure 3–6 shows a two-node MEMORY CHANNEL cluster with shared access to<br />
Fibre Channel storage and a LAN interconnect for failover.<br />
Figure 3–6 Two-Node MEMORY CHANNEL <strong>OpenVMS</strong> <strong>Cluster</strong> Configuration<br />
[Figure: two CPUs connected to each other by a MEMORY CHANNEL (MC) link and by a LAN; both access HSG storage over Fibre Channel (FC).]<br />
A three-node MEMORY CHANNEL cluster connected by a MEMORY CHANNEL<br />
hub and also by a LAN interconnect is shown in Figure 3–7. The three nodes<br />
share access to the Fibre Channel storage. The LAN interconnect enables failover<br />
if the MEMORY CHANNEL interconnect fails.<br />
Figure 3–7 Three-Node MEMORY CHANNEL <strong>OpenVMS</strong> <strong>Cluster</strong> Configuration<br />
[Figure: three CPUs connected by a LAN and by a MEMORY CHANNEL (MC) hub; all three access HSG storage over Fibre Channel (FC).]<br />
3.6 Multihost SCSI <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong><br />
<strong>OpenVMS</strong> <strong>Cluster</strong> systems support the Small Computer <strong>Systems</strong> Interface (SCSI)<br />
as a storage interconnect. A SCSI interconnect, also called a SCSI bus, is an<br />
industry-standard interconnect that supports one or more computers, peripheral<br />
devices, and interconnecting components.<br />
Beginning with <strong>OpenVMS</strong> Alpha Version 6.2, multiple Alpha computers<br />
can simultaneously access SCSI disks over a SCSI interconnect. Another<br />
interconnect, for example, a local area network, is required for host-to-host<br />
<strong>OpenVMS</strong> cluster communications.<br />
3.6.1 Design<br />
Beginning with <strong>OpenVMS</strong> Alpha Version 6.2-1H3, <strong>OpenVMS</strong> Alpha supports<br />
up to three nodes on a shared SCSI bus as the storage interconnect. A quorum<br />
disk can be used on the SCSI bus to improve the availability of two-node<br />
configurations. Host-based RAID (including host-based shadowing) and the<br />
MSCP server are supported for shared SCSI storage devices.<br />
With the introduction of the DWZZH-05 SCSI hub, four nodes can be supported in a SCSI multihost <strong>OpenVMS</strong> <strong>Cluster</strong> system. To support four nodes, the hub’s fair arbitration feature must be enabled.<br />
For a complete description of these configurations, see Guidelines for <strong>OpenVMS</strong><br />
<strong>Cluster</strong> Configurations.<br />
3.6.2 Examples<br />
Figure 3–8 shows an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration that uses a SCSI<br />
interconnect for shared access to SCSI devices. Note that another interconnect, a<br />
LAN in this example, is used for host-to-host communications.<br />
Figure 3–8 Three-Node <strong>OpenVMS</strong> <strong>Cluster</strong> Configuration Using a Shared SCSI<br />
Interconnect<br />
[Figure: three nodes connected by a LAN for host-to-host communication; the server nodes share a terminated SCSI bus that provides the shared SCSI storage.]<br />
3.7 Multihost Fibre Channel <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong><br />
<strong>OpenVMS</strong> <strong>Cluster</strong> systems support the Fibre Channel (FC) interconnect as a storage interconnect.<br />
Fibre Channel is an ANSI standard network and storage interconnect that offers<br />
many advantages over other interconnects, including high-speed transmission and<br />
long interconnect distances. A second interconnect is required for node-to-node<br />
communications.<br />
3.7.1 Design<br />
<strong>OpenVMS</strong> Alpha supports the Fibre Channel SAN configurations described in<br />
the latest Compaq StorageWorks Heterogeneous Open SAN Design Reference<br />
Guide and in the Data Replication Manager (DRM) user documentation. This<br />
configuration support includes multiswitch Fibre Channel fabrics, up to 500<br />
meters of multimode fiber, and up to 100 kilometers of single-mode fiber. In<br />
addition, DRM configurations provide long-distance intersite links (ISLs) through<br />
the use of the Open <strong>Systems</strong> Gateway and wave division multiplexors. <strong>OpenVMS</strong><br />
supports sharing of the fabric and the HSG storage with non-<strong>OpenVMS</strong> systems.<br />
<strong>OpenVMS</strong> provides support for the number of hosts, switches, and storage<br />
controllers specified in the StorageWorks documentation. In general, the number<br />
of hosts and storage controllers is limited only by the number of available fabric<br />
connections.<br />
Host-based RAID (including host-based shadowing) and the MSCP server are<br />
supported for shared Fibre Channel storage devices. Multipath support is<br />
available for these configurations.<br />
For a complete description of these configurations, see Guidelines for <strong>OpenVMS</strong><br />
<strong>Cluster</strong> Configurations.<br />
3.7.2 Examples<br />
Figure 3–9 shows a multihost configuration with two independent Fibre Channel<br />
interconnects connecting the hosts to the storage subsystems. Note that another<br />
interconnect is used for node-to-node communications.<br />
Figure 3–9 Four-Node <strong>OpenVMS</strong> <strong>Cluster</strong> Configuration Using a Fibre Channel<br />
Interconnect<br />
[Figure: four Alpha hosts, each with two KGPSA adapters, connect to two independent Fibre Channel fabrics (Fibre Channel A and Fibre Channel B); the fabrics reach two disk storage subsystems through HSG80 controller pairs and a tape storage subsystem through an MDR with SCSI tape. A separate interconnect carries cluster communications.]<br />
4<br />
The <strong>OpenVMS</strong> <strong>Cluster</strong> Operating Environment<br />
This chapter describes how to prepare the <strong>OpenVMS</strong> <strong>Cluster</strong> operating<br />
environment.<br />
4.1 Preparing the Operating Environment<br />
To prepare the cluster operating environment, you perform a number of tasks on the first <strong>OpenVMS</strong> <strong>Cluster</strong> node before configuring other computers into the cluster. The following table describes these tasks.<br />
Task Section<br />
1. Check all hardware connections to computer, interconnects, and devices. (Described in the appropriate hardware documentation.)<br />
2. Verify that all microcode and hardware is set to the correct revision levels. (Contact your support representative.)<br />
3. Install the <strong>OpenVMS</strong> operating system. (Section 4.2)<br />
4. Install all software licenses, including <strong>OpenVMS</strong> <strong>Cluster</strong> licenses. (Section 4.3)<br />
5. Install layered products. (Section 4.4)<br />
6. Configure and start LANCP or DECnet for satellite booting. (Section 4.5)<br />
4.2 Installing the <strong>OpenVMS</strong> Operating System<br />
Only one <strong>OpenVMS</strong> operating system version can exist on a system disk.<br />
Therefore, when installing or upgrading the <strong>OpenVMS</strong> operating systems:<br />
• Install the <strong>OpenVMS</strong> Alpha operating system on each Alpha system disk<br />
• Install the <strong>OpenVMS</strong> VAX operating system on each VAX system disk<br />
4.2.1 System Disks<br />
A system disk is one of the few resources that cannot be shared between Alpha<br />
and VAX systems. However, an Alpha system disk can be mounted as a data<br />
disk on a VAX computer and, with MOP configured appropriately, can be used to<br />
boot Alpha satellites. Similarly, a VAX system disk can be mounted on an Alpha<br />
computer and, with the appropriate MOP configuration, can be used to boot VAX<br />
satellites.<br />
Reference: Cross-architecture booting is described in Section 10.5.<br />
Once booted, Alpha and VAX processors can share access to data on any disk in<br />
the <strong>OpenVMS</strong> <strong>Cluster</strong>, including system disks. For example, an Alpha system<br />
can mount a VAX system disk as a data disk and a VAX system can mount an<br />
Alpha system disk as a data disk.<br />
Note: An <strong>OpenVMS</strong> <strong>Cluster</strong> running both implementations of DECnet requires<br />
a system disk for DECnet for <strong>OpenVMS</strong> (Phase IV) and another system disk<br />
for DECnet–Plus (Phase V). For more information, see the DECnet–Plus<br />
documentation.<br />
4.2.2 Where to Install<br />
You may want to set up common system disks according to these guidelines:<br />
IF you want the cluster to have... THEN perform the installation or upgrade...<br />
One common system disk for all computer members: Once, on the cluster common system disk.<br />
A combination of one or more common system disks and one or more local (individual) system disks: Either once for each system disk, or once on a common system disk followed by running the CLUSTER_CONFIG.COM procedure to create duplicate system disks (thus enabling systems to have their own local system disk).<br />
Note: If your cluster includes multiple common system disks, you must later coordinate system files<br />
to define the cluster operating environment, as described in Chapter 5.<br />
Reference: See Section 8.5 for information about creating a duplicate system disk.<br />
Example: If your <strong>OpenVMS</strong> <strong>Cluster</strong> consists of 10 computers, 4 of which boot<br />
from a common Alpha system disk, 2 of which boot from a second common Alpha<br />
system disk, 2 of which boot from a common VAX system disk, and 2 of which<br />
boot from their own local system disk, you need to perform an installation five<br />
times.<br />
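The installation count in this example follows directly from the number of distinct system disks, since each disk needs exactly one installation no matter how many computers boot from it. A quick sketch (the disk labels are hypothetical):

```python
# Hypothetical labels for the 10-computer example above: one installation
# per distinct system disk, regardless of how many computers share it.
boot_disks = (
    ["ALPHA_COMMON_1"] * 4 +   # 4 computers boot from one common Alpha disk
    ["ALPHA_COMMON_2"] * 2 +   # 2 boot from a second common Alpha disk
    ["VAX_COMMON"] * 2 +       # 2 boot from a common VAX disk
    ["LOCAL_A", "LOCAL_B"]     # 2 boot from their own local system disks
)
installations = len(set(boot_disks))
print(installations)  # 5
```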
4.2.3 Information Required<br />
Table 4–1 lists the questions that the <strong>OpenVMS</strong> operating system installation procedure asks and describes how certain system parameters are affected by the responses you provide. Two of the prompts vary, depending on whether the node is running DECnet. The table also provides an example of an installation procedure taking place on a node named JUPITR.<br />
Important: Be sure you determine answers to the questions before you begin the<br />
installation.<br />
Note about versions: Refer to the appropriate <strong>OpenVMS</strong> Release Notes document for the required version numbers of hardware and firmware. When mixing versions of the operating system in an <strong>OpenVMS</strong> <strong>Cluster</strong>, check the release notes for information about compatibility.<br />
Reference: Refer to the appropriate <strong>OpenVMS</strong> upgrade and installation manual<br />
for complete installation instructions.<br />
Table 4–1 Information Required to Perform an Installation<br />
Prompt: Will this node be a cluster member (Y/N)? (Parameter: VAXCLUSTER)<br />
Response: WHEN you respond N and CI and DSSI hardware is not present, THEN the VAXCLUSTER parameter is set to 0 (the node will not participate in the <strong>OpenVMS</strong> <strong>Cluster</strong>).<br />
WHEN you respond N and CI and DSSI hardware is present, THEN the VAXCLUSTER parameter is set to 1 (the node will automatically participate in the <strong>OpenVMS</strong> <strong>Cluster</strong> in the presence of CI or DSSI hardware).<br />
WHEN you respond Y, THEN the VAXCLUSTER parameter is set to 2 (the node will participate in the <strong>OpenVMS</strong> <strong>Cluster</strong>).<br />
Prompt: What is the node’s DECnet node name? (Parameter: SCSNODE)<br />
Response: If the node is running DECnet, this prompt, the following prompt, and the SCSSYSTEMID prompt are displayed. Enter the DECnet node name or the DECnet–Plus node synonym (for example, JUPITR). If a node synonym is not defined, SCSNODE can be any name from 1 to 6 alphanumeric characters in length. The name cannot include dollar signs ( $ ) or underscores ( _ ).<br />
Prompt: What is the node’s DECnet node address? (Parameter: SCSSYSTEMID)<br />
Response: Enter the DECnet node address (for example, a valid address might be 2.211). If an address has not been assigned, enter 0 now and enter a valid address when you start DECnet (discussed later in this chapter). For DECnet–Plus, this question is asked when nodes are configured with a Phase IV compatible address. If a Phase IV compatible address is not configured, then the SCSSYSTEMID system parameter can be set to any value.<br />
Prompt: What is the node’s SCS node name? (Parameter: SCSNODE)<br />
Response: If the node is not running DECnet, this prompt and the following prompt are displayed in place of the two previous prompts. Enter a name of 1 to 6 alphanumeric characters that uniquely names this node. At least 1 character must be a letter. The name cannot include dollar signs ( $ ) or underscores ( _ ).<br />
Prompt: What is the node’s SCSSYSTEMID number? (Parameter: SCSSYSTEMID)<br />
Response: This number must be unique within this cluster. SCSSYSTEMID is the low-order 32 bits of the 48-bit system identification number. If the node is running DECnet for <strong>OpenVMS</strong>, calculate the value from the DECnet address using the following formula:<br />
SCSSYSTEMID = (DECnet-area-number * 1024) + (DECnet-node-number)<br />
Example: If the DECnet address is 2.211, calculate the value as follows: SCSSYSTEMID = (2 * 1024) + 211 = 2259<br />
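The SCSNODE naming rules and the SCSSYSTEMID formula above can be checked with a short sketch. These helper functions are hypothetical illustrations, not part of any OpenVMS utility:

```python
def valid_scsnode(name, running_decnet=True):
    """Check the SCSNODE rules: 1 to 6 alphanumeric characters, no '$'
    or '_'; without DECnet, at least one character must be a letter."""
    if not (1 <= len(name) <= 6 and name.isalnum()):
        return False  # isalnum() already rejects '$' and '_'
    if not running_decnet and not any(c.isalpha() for c in name):
        return False
    return True

def scssystemid(area, node):
    """SCSSYSTEMID = (DECnet-area-number * 1024) + (DECnet-node-number)."""
    return area * 1024 + node

print(scssystemid(2, 211))      # 2259, per the DECnet address 2.211 example
print(valid_scsnode("JUPITR"))  # True
```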
Table 4–1 (Cont.) Information Required to Perform an Installation<br />
Prompt: Will the Ethernet be used for cluster communications (Y/N)?1 (Parameter: NISCS_LOAD_PEA0)<br />
Response: IF you respond N, THEN the NISCS_LOAD_PEA0 parameter is set to 0 (PEDRIVER is not loaded2; cluster communications does not use Ethernet or FDDI).<br />
IF you respond Y, THEN the NISCS_LOAD_PEA0 parameter is set to 1 (loads PEDRIVER to enable cluster communications over Ethernet or FDDI).<br />
Prompt: Enter this cluster’s group number: (Parameter: not applicable)<br />
Response: Enter a number in the range of 1 to 4095 or 61440 to 65535 (see Section 2.5). This value is stored in the CLUSTER_AUTHORIZE.DAT file in the SYS$COMMON:[SYSEXE] directory.<br />
Prompt: Enter this cluster’s password: (Parameter: not applicable)<br />
Response: Enter the cluster password. The password must be from 1 to 31 alphanumeric characters in length and can include dollar signs ( $ ) and underscores ( _ ) (see Section 2.5). This value is stored in scrambled form in the CLUSTER_AUTHORIZE.DAT file in the SYS$COMMON:[SYSEXE] directory.<br />
Prompt: Reenter this cluster’s password for verification: (Parameter: not applicable)<br />
Response: Reenter the password.<br />
Prompt: Will JUPITR be a disk server (Y/N)? (Parameter: MSCP_LOAD)<br />
Response: IF you respond N, THEN the MSCP_LOAD parameter is set to 0 (the MSCP server will not be loaded; this is the correct setting for configurations in which all <strong>OpenVMS</strong> <strong>Cluster</strong> nodes can directly access all shared storage and do not require LAN failover).<br />
IF you respond Y, THEN the MSCP_LOAD parameter is set to 1 (loads the MSCP server with attributes specified by the MSCP_SERVE_ALL parameter, using the default CPU load capacity).<br />
Prompt: Will JUPITR serve HSC or RF disks (Y/N)? (Parameter: MSCP_SERVE_ALL)<br />
Response: IF you respond Y, THEN the MSCP_SERVE_ALL parameter is set to 1 (serves all available disks).<br />
IF you respond N, THEN the MSCP_SERVE_ALL parameter is set to 2 (serves only locally connected, not HSC, HSJ, or RF, disks).<br />
1All references to the Ethernet are also applicable to FDDI.<br />
2PEDRIVER is the LAN port emulator driver that implements the NISCA protocol and controls communications between local and remote LAN ports.<br />
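The group-number ranges and cluster-password rules above amount to two simple checks. This sketch encodes them for illustration only; the function names are hypothetical and no such helpers exist in OpenVMS itself:

```python
def valid_group_number(n):
    """Cluster group numbers must fall in 1-4095 or 61440-65535."""
    return 1 <= n <= 4095 or 61440 <= n <= 65535

def valid_cluster_password(pw):
    """1 to 31 characters: alphanumerics, dollar signs, and underscores."""
    allowed = set("$_")
    return 1 <= len(pw) <= 31 and all(c.isalnum() or c in allowed for c in pw)

print(valid_group_number(4096))           # False: inside the reserved gap
print(valid_cluster_password("SYS_42$"))  # True
```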
Table 4–1 (Cont.) Information Required to Perform an Installation<br />
Prompt: Enter a value for JUPITR’s ALLOCLASS parameter:3 (Parameter: ALLOCLASS)<br />
Response: The value is dependent on the system configuration:<br />
• If the system will serve RF disks, assign a nonzero value to the allocation class. Reference: See Section 6.2.2.5 to assign DSSI allocation classes.<br />
• If the system will serve HSC disks, enter the allocation class value of the HSC. Reference: See Section 6.2.2.2 to assign HSC allocation classes.<br />
• If the system will serve HSJ disks, enter the allocation class value of the HSJ. Reference: For complete information about the HSJ console commands, refer to the HSJ hardware documentation. See Section 6.2.2.3 to assign HSJ allocation classes.<br />
• If the system will serve HSD disks, enter the allocation class value of the HSD. Reference: See Section 6.2.2.4 to assign HSD allocation classes.<br />
• If the system is connected to a dual-pathed disk, enter a value from 1 to 255 that will be used on both storage controllers.<br />
• If the system is connected to a shared SCSI bus (it shares storage on that bus with another system) and if it does not use port allocation classes for naming the SCSI disks, enter a value from 1 to 255. This value must be used by all the systems and disks connected to the SCSI bus. Reference: For complete information about port allocation classes, see Section 6.2.1.<br />
• If the system will use Volume Shadowing for <strong>OpenVMS</strong>, enter a value from 1 to 255. Reference: For more information, see Volume Shadowing for <strong>OpenVMS</strong>.<br />
• If none of the above are true, enter 0 (zero).<br />
Prompt: Does this cluster contain a quorum disk [N]? (Parameter: DISK_QUORUM)<br />
Response: Enter Y or N, depending on your configuration. If you enter Y, the procedure prompts for the name of the quorum disk; enter the device name of the quorum disk. (Quorum disks are discussed in Chapter 2.)<br />
3Refer to Section 6.2 for complete information about device naming conventions.<br />
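The ALLOCLASS selection rules above amount to a decision procedure. This sketch encodes them in simplified form for illustration; real configurations may combine several cases, and the function and dictionary keys are hypothetical:

```python
def alloclass_hint(config):
    """Suggest an ALLOCLASS value per the rules above (simplified sketch).

    `config` is a dict describing the system; this is an illustrative
    reduction of the table, not an OpenVMS tool.
    """
    if any(config.get(k) for k in ("serves_hsc", "serves_hsj", "serves_hsd")):
        return config["controller_alloclass"]   # match the controller's value
    if config.get("serves_rf"):
        return config.get("chosen_nonzero", 1)  # any nonzero value
    if any(config.get(k) for k in ("dual_pathed", "shared_scsi", "shadowing")):
        return config.get("chosen_nonzero", 1)  # 1-255, coordinated with peers
    return 0                                    # none of the cases apply

print(alloclass_hint({"serves_hsj": True, "controller_alloclass": 7}))  # 7
print(alloclass_hint({}))                                               # 0
```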
4.3 Installing Software Licenses<br />
While rebooting at the end of the installation procedure, the system displays messages warning that you must install the operating system software license and the <strong>OpenVMS</strong> <strong>Cluster</strong> software license.<br />
the <strong>OpenVMS</strong> License Management Facility (LMF). License units for clustered<br />
systems are allocated on an unlimited system-use basis.<br />
4.3.1 Guidelines<br />
Be sure to install all <strong>OpenVMS</strong> <strong>Cluster</strong> licenses and all licenses for layered<br />
products and DECnet as soon as the system is available. Procedures for installing<br />
licenses are described in the release notes distributed with the software kit and<br />
in the <strong>OpenVMS</strong> License Management Utility Manual. Additional licensing<br />
information is described in the respective SPDs.<br />
Use the following guidelines when you install software licenses:<br />
• Install an <strong>OpenVMS</strong> <strong>Cluster</strong> Software for Alpha license for each Alpha<br />
processor in the <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
• Install an <strong>OpenVMS</strong> <strong>Cluster</strong> Software for VAX license for each VAX processor<br />
in an <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
• Install or upgrade licenses for layered products that will run on all nodes in<br />
an <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
• <strong>OpenVMS</strong> Product Authorization Keys (PAKs) that have the Alpha option<br />
can be loaded and used only on Alpha processors. However, PAKs can be<br />
located in a license database (LDB) that is shared by both Alpha and VAX<br />
processors.<br />
• Do not load Availability PAKs for VAX systems (Availability PAKs that do not<br />
include the Alpha option) on Alpha systems.<br />
• PAK types such as Activity PAKs (also known as concurrent or n-user PAKs)<br />
and Personal Use PAKs (identified by the RESERVE_UNITS option) work on<br />
both VAX and Alpha systems.<br />
• Compaq recommends that you perform licensing tasks using an Alpha LMF.<br />
4.4 Installing Layered Products<br />
Install layered products before other nodes are added to the <strong>OpenVMS</strong> <strong>Cluster</strong>; the software then becomes automatically available to new members when they are added to the <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
Note: For clusters with multiple system disks (VAX, Alpha, or both) you must<br />
perform a separate installation for each system disk.<br />
4.4.1 Procedure<br />
Table 4–2 describes the actions you take to install layered products on a common<br />
system disk.<br />
Table 4–2 Installing Layered Products on a Common System Disk<br />
Phase Action<br />
Before installation: Perform one or more of the following steps, as necessary for your system.<br />
1. Check each node’s system parameters and modify the values, if necessary. Refer to the layered-product installation guide or release notes for information about adjusting system parameter values.<br />
2. If necessary, disable logins on each node that boots from the disk using the DCL command SET<br />
LOGINS/INTERACTIVE=0. Send a broadcast message to notify users about the installation.<br />
Installation: Refer to the appropriate layered-product documentation for product-specific installation information. Perform the installation once for each system disk.<br />
After installation: Perform one or more of the following steps, as necessary for your system.<br />
1. If necessary, create product-specific files in the SYS$SPECIFIC directory on each node. (The<br />
installation utility describes whether or not you need to create a directory in SYS$SPECIFIC.)<br />
When creating files and directories, be careful to specify exactly where you want the file to be<br />
located:<br />
• Use SYS$SPECIFIC or SYS$COMMON instead of SYS$SYSROOT.<br />
• Use SYS$SPECIFIC:[SYSEXE] or SYS$COMMON:[SYSEXE] instead of SYS$SYSTEM.<br />
Reference: Section 5.3 describes directory structures in more detail.<br />
2. Modify files in SYS$SPECIFIC if the installation procedure tells you to do so. Modify files on<br />
each node that boots from this system disk.<br />
3. Reboot each node to ensure that:<br />
• The node is set up to run the layered product correctly.<br />
• The node is running the latest version of the layered product.<br />
4. Manually run the installation verification procedure (IVP) if you did not run it during the<br />
layered product installation. Run the IVP from at least one node in the <strong>OpenVMS</strong> <strong>Cluster</strong>, but<br />
preferably from all nodes that boot from this system disk.<br />
4.5 Configuring and Starting a Satellite Booting Service<br />
After you have installed the operating system and the required licenses on the<br />
first <strong>OpenVMS</strong> <strong>Cluster</strong> computer, you can configure and start a satellite booting<br />
service. You can use the LANCP utility, DECnet software, or both.<br />
Compaq recommends LANCP for booting <strong>OpenVMS</strong> <strong>Cluster</strong> satellites. LANCP<br />
has shipped with the <strong>OpenVMS</strong> operating system since Version 6.2. It provides<br />
a general-purpose MOP booting service that can be used for booting satellites<br />
into an <strong>OpenVMS</strong> <strong>Cluster</strong>. (LANCP can service all types of MOP downline load<br />
requests, including those from terminal servers, LAN resident printers, and X<br />
terminals, and can be used to customize your LAN environment.)<br />
DECnet provides a MOP booting service for booting <strong>OpenVMS</strong> <strong>Cluster</strong> satellites,<br />
as well as other local and wide area network services, including task-to-task<br />
communications for applications.<br />
Note<br />
If you plan to use LANCP in place of DECnet, and you also plan to<br />
move from DECnet Phase IV to DECnet–Plus, Compaq recommends the<br />
following order:<br />
1. Replace DECnet with LANCP for satellite booting (MOP downline<br />
load service) using LAN$POPULATE.COM.<br />
2. Migrate from DECnet Phase IV to DECnet-Plus.<br />
There are two cluster configuration command procedures, CLUSTER_CONFIG_<br />
LAN.COM and CLUSTER_CONFIG.COM. CLUSTER_CONFIG_LAN.COM uses<br />
LANCP to provide MOP services to boot satellites; CLUSTER_CONFIG.COM<br />
uses DECnet for the same purpose.<br />
Before choosing LANCP, DECnet, or both, consider the following factors:<br />
• Applications you will be running on your cluster<br />
DECnet task-to-task communications is a method commonly used for<br />
communication between programs that run on different nodes in a cluster or<br />
a network. If you are running a program with that dependency, you need to<br />
run DECnet. If you are not running any programs with that dependency, you<br />
do not need to run DECnet.<br />
• Limiting applications that require DECnet to certain nodes in your cluster<br />
If you are running applications that require DECnet task-to-task<br />
communications, you can run those applications on a subset of the nodes<br />
in your cluster and restrict DECnet usage to those nodes. You can use<br />
LANCP software on the remaining nodes and use a different network, such as<br />
DIGITAL TCP/IP Services for <strong>OpenVMS</strong>, for other network services.<br />
• Managing two types of software for the same purpose<br />
If you are already using DECnet for booting satellites, you may not want to<br />
introduce another type of software for that purpose. Introducing any new<br />
software requires time to learn and manage it.<br />
• LANCP MOP services can coexist with DECnet MOP services in an <strong>OpenVMS</strong><br />
<strong>Cluster</strong> in the following ways:<br />
Running on different systems<br />
For example, DECnet MOP service is enabled on some of the systems on<br />
the LAN and LAN MOP is enabled on other systems.<br />
Running on different LAN devices on the same system<br />
For example, DECnet MOP service is enabled on a subset of the available<br />
LAN devices on the system and LAN MOP is enabled on the remainder.<br />
Running on the same LAN device on the same system but targeting a<br />
different set of nodes for service<br />
For example, both DECnet MOP and LAN MOP are enabled but LAN<br />
MOP has limited the nodes to which it will respond. This allows DECnet<br />
MOP to respond to the remaining nodes.<br />
Instructions for configuring both LANCP and DECnet are provided in this<br />
section.<br />
4.5.1 Configuring and Starting the LANCP Utility<br />
You can use the LAN Control Program (LANCP) utility to configure a local area<br />
network (LAN). You can also use the LANCP utility, in place of DECnet or in<br />
addition to DECnet, to provide support for booting satellites in an <strong>OpenVMS</strong><br />
<strong>Cluster</strong> and for servicing all types of MOP downline load requests, including<br />
those from terminal servers, LAN resident printers, and X terminals.<br />
Reference: For more information about using the LANCP utility to configure<br />
a LAN, see the <strong>OpenVMS</strong> System Manager’s Manual, Volume 2: Tuning,<br />
Monitoring, and Complex <strong>Systems</strong> and the <strong>OpenVMS</strong> System Management<br />
Utilities Reference Manual: A–L.<br />
4.5.2 Booting Satellite Nodes with LANCP<br />
The LANCP utility provides a general-purpose MOP booting service that can<br />
be used for booting satellites into an <strong>OpenVMS</strong> <strong>Cluster</strong>. It can also be used to<br />
service all types of MOP downline load requests, including those from terminal<br />
servers, LAN resident printers, and X terminals. To use LANCP for this purpose,<br />
all <strong>OpenVMS</strong> <strong>Cluster</strong> nodes must be running <strong>OpenVMS</strong> Version 6.2 or higher.<br />
The CLUSTER_CONFIG_LAN.COM cluster configuration command procedure<br />
uses LANCP in place of DECnet to provide MOP services to boot satellites.<br />
Note: If you plan to use LANCP in place of DECnet, and you also plan to move<br />
from DECnet for <strong>OpenVMS</strong> (Phase IV) to DECnet–Plus, Compaq recommends the<br />
following order:<br />
1. Replace DECnet with LANCP for satellite booting (MOP downline load<br />
service), using LAN$POPULATE.COM.<br />
2. Migrate from DECnet for <strong>OpenVMS</strong> to DECnet–Plus.<br />
4.5.3 Data Files Used by LANCP<br />
LANCP uses the following data files:<br />
• SYS$SYSTEM:LAN$DEVICE_DATABASE.DAT<br />
This file maintains information about devices on the local node. By default,<br />
the file is created in SYS$SPECIFIC:[SYSEXE], and the system looks for the<br />
file in that location. However, you can modify the file name or location for this<br />
file by redefining the systemwide logical name LAN$DEVICE_DATABASE.<br />
• SYS$SYSTEM:LAN$NODE_DATABASE.DAT<br />
This file contains information about the nodes for which LANCP will supply<br />
boot service. This file should be shared among all nodes in the <strong>OpenVMS</strong><br />
<strong>Cluster</strong>, including both Alpha and VAX systems. By default, the file is created<br />
in SYS$COMMON:[SYSEXE], and the system looks for the file in that<br />
location. However, you can modify the file name or location for this file by<br />
redefining the systemwide logical name LAN$NODE_DATABASE.<br />
4.5.4 Using LAN MOP Services in New Installations<br />
To use LAN MOP services for satellite booting in new installations, follow these<br />
steps:<br />
1. Add the startup command for LANCP.<br />
You should start up LANCP as part of your system startup procedure. To do<br />
this, remove the comment from the line in SYS$MANAGER:SYSTARTUP_<br />
VMS.COM that runs the LAN$STARTUP command procedure. If your<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system will have more than one system disk, see<br />
Section 4.5.3 for a description of logicals that can be defined for locating<br />
LANCP configuration files.<br />
$ @SYS$STARTUP:LAN$STARTUP<br />
You should now either reboot the system or invoke the preceding command<br />
procedure from the system manager’s account to start LANCP.<br />
2. Follow the steps in Chapter 8 for configuring an <strong>OpenVMS</strong> <strong>Cluster</strong> system<br />
and adding satellites. Use the CLUSTER_CONFIG_LAN.COM command<br />
procedure instead of CLUSTER_CONFIG.COM. If you invoke CLUSTER_<br />
CONFIG.COM, it gives you the option to switch to running CLUSTER_<br />
CONFIG_LAN.COM if the LANCP process has been started.<br />
4.5.5 Using LAN MOP Services in Existing Installations<br />
To migrate from DECnet MOP services to LAN MOP services for satellite booting,<br />
follow these steps:<br />
1. Redefine the LANCP database logical names.<br />
This step is optional. If you want to move the data files used by LANCP,<br />
LAN$DEVICE_DATABASE and LAN$NODE_DATABASE, off the system<br />
disk, redefine their systemwide logical names. Add the definitions to the<br />
system startup files.<br />
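One possible form for these definitions is sketched below; the device and directory ($1$DKA100:[LANCP]) are illustrative only, so substitute the location you have chosen at your site:<br />

```
$ ! Illustrative locations; use the device and directory at your site
$ DEFINE/SYSTEM/EXE LAN$DEVICE_DATABASE -
        $1$DKA100:[LANCP]LAN$DEVICE_DATABASE.DAT
$ DEFINE/SYSTEM/EXE LAN$NODE_DATABASE -
        $1$DKA100:[LANCP]LAN$NODE_DATABASE.DAT
```

Place these definitions in your site-specific startup files so that they are established before LANCP is started.<br />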
2. Use LANCP to create the LAN$DEVICE_DATABASE<br />
The permanent LAN$DEVICE_DATABASE is created when you issue the<br />
first LANCP DEVICE command. To create the database and get a list of<br />
available devices, enter the following commands:<br />
$ MCR LANCP<br />
LANCP> LIST DEVICE /MOPDLL<br />
%LANCP-I-FNFDEV, File not found, LAN$DEVICE_DATABASE<br />
%LANACP-I-CREATDEV, Created LAN$DEVICE_DATABASE file<br />
Device Listing, permanent database:<br />
--- MOP Downline Load Service Characteristics ---<br />
Device   State      Access Mode   Client               Data Size<br />
------   -----      -----------   ------               ---------<br />
ESA0     Disabled   NoExclusive   NoKnownClientsOnly   246 bytes<br />
FCA0     Disabled   NoExclusive   NoKnownClientsOnly   246 bytes<br />
3. Use LANCP to enable LAN devices for MOP booting.<br />
By default, the LAN devices have MOP booting capability disabled.<br />
Determine the LAN devices for which you want to enable MOP booting.<br />
Then use the DEFINE command in the LANCP utility to enable these devices<br />
to service MOP boot requests in the permanent database, as shown in the<br />
following example:<br />
LANCP> DEFINE DEVICE ESA0:/MOP=ENABLE<br />
4. Run LAN$POPULATE.COM (found in SYS$EXAMPLES) to obtain MOP<br />
booting information and to produce LAN$DEFINE and LAN$DECNET_MOP_<br />
CLEANUP, which are site specific.<br />
LAN$POPULATE extracts all MOP booting information from a DECnet Phase<br />
IV NETNODE_REMOTE.DAT file or from the output of the DECnet–Plus<br />
NCL command SHOW MOP CLIENT * ALL.<br />
For DECnet Phase IV sites, the LAN$POPULATE procedure scans all<br />
DECnet areas (1–63) by default. If you MOP boot systems from only a single<br />
or a few DECnet areas, you can cause the LAN$POPULATE procedure to<br />
operate on a single area at a time by providing the area number as the P1<br />
parameter to the procedure, as shown in the following example (including<br />
log):<br />
$ @SYS$EXAMPLES:LAN$POPULATE 15<br />
LAN$POPULATE - V1.0<br />
Do you want help (Y/N) :<br />
LAN$DEFINE.COM has been successfully created.<br />
To apply the node definitions to the LANCP permanent database,<br />
invoke the created LAN$DEFINE.COM command procedure.<br />
Compaq recommends that you review LAN$DEFINE.COM and remove any<br />
obsolete entries prior to executing this command procedure.<br />
A total of 2 MOP definitions were entered into LAN$DEFINE.COM<br />
5. Run LAN$DEFINE.COM to populate LAN$NODE_DATABASE.<br />
LAN$DEFINE populates the LANCP downline loading information<br />
into the LAN node database, SYS$COMMON:[SYSEXE]LAN$NODE_<br />
DATABASE.DAT file. Compaq recommends that you review<br />
LAN$DEFINE.COM and remove any obsolete entries before executing<br />
it.<br />
In the following sequence, the LAN$DEFINE.COM procedure that was just<br />
created is displayed on the screen and then executed:<br />
$ TYPE LAN$DEFINE.COM<br />
$ !<br />
$ ! This file was generated by LAN$POPULATE.COM on 16-DEC-1996 09:20:31<br />
$ ! on node CLU21.<br />
$ !<br />
$ ! Only DECnet Area 15 was scanned.<br />
$ !<br />
$ MCR LANCP<br />
Define Node PORK /Address=08-00-2B-39-82-85 /File=APB.EXE -<br />
/Root=$21$DKA300: /Boot_type=Alpha_Satellite<br />
Define Node JYPIG /Address=08-00-2B-A2-1F-81 /File=APB.EXE -<br />
/Root=$21$DKA300: /Boot_type=Alpha_Satellite<br />
EXIT<br />
$ @LAN$DEFINE<br />
%LANCP-I-FNFNOD, File not found, LAN$NODE_DATABASE<br />
-LANCP-I-CREATNOD, Created LAN$NODE_DATABASE file<br />
$<br />
The following example shows a LAN$DEFINE.COM command procedure<br />
that was generated by LAN$POPULATE for migration from DECnet–Plus to<br />
LANCP.<br />
$ ! LAN$DEFINE.COM - LAN MOP Client Setup<br />
$ !<br />
$ ! This file was generated by LAN$POPULATE.COM at 8-DEC-1996 14:28:43.31<br />
$ ! on node BIGBOX.<br />
$ !<br />
$ SET NOON<br />
$ WRITE SYS$OUTPUT "Setting up MOP DLL clients in LANCP..."<br />
$ MCR LANCP<br />
SET NODE SLIDER /ADDRESS=08-00-2B-12-D8-72 /ROOT=BIGBOX$DKB0: -<br />
/BOOT_TYPE=VAX_satellite /FILE=NISCS_LOAD.EXE<br />
DEFINE NODE SLIDER /ADDRESS=08-00-2B-12-D8-72 /ROOT=BIGBOX$DKB0: -<br />
/BOOT_TYPE=VAX_satellite /FILE=NISCS_LOAD.EXE<br />
EXIT<br />
$ !<br />
$ WRITE SYS$OUTPUT "DECnet Phase V to LAN MOPDLL client migration complete!"<br />
$ EXIT<br />
6. Run LAN$DECNET_MOP_CLEANUP.COM.<br />
You can use LAN$DECNET_MOP_CLEANUP.COM to remove the clients’<br />
MOP downline loading information from the DECnet database. Compaq<br />
recommends that you review LAN$DECNET_MOP_CLEANUP.COM and<br />
remove any obsolete entries before executing it.<br />
The following example shows a LAN$DECNET_MOP_CLEANUP.COM<br />
command procedure that was generated by LAN$POPULATE for migration<br />
from DECnet–Plus to LANCP.<br />
Note: When migrating from DECnet–Plus, additional cleanup is necessary.<br />
You must edit your NCL scripts (*.NCL) manually.<br />
$ ! LAN$DECNET_MOP_CLEANUP.COM - DECnet MOP Client Cleanup<br />
$ !<br />
$ ! This file was generated by LAN$POPULATE.COM at 8-DEC-1995 14:28:43.47<br />
$ ! on node BIGBOX.<br />
$ !<br />
$ SET NOON<br />
$ WRITE SYS$OUTPUT "Removing MOP DLL clients from DECnet database..."<br />
$ MCR NCL<br />
DELETE NODE 0 MOP CLIENT SLIDER<br />
EXIT<br />
$ !<br />
$ WRITE SYS$OUTPUT "DECnet Phase V MOPDLL client cleanup complete!"<br />
$ EXIT<br />
7. Start LANCP.<br />
To start LANCP, execute the startup command procedure as follows:<br />
$ @SYS$STARTUP:LAN$STARTUP<br />
%RUN-S-PROC_ID, identification of created process is 2920009B<br />
$<br />
You should start up LANCP for all boot nodes as part of your system startup<br />
procedure. To do this, include the following line in your site-specific startup<br />
file (SYS$MANAGER:SYSTARTUP_VMS.COM):<br />
$ @SYS$STARTUP:LAN$STARTUP<br />
If you have defined logicals for either LAN$DEVICE_DATABASE or<br />
LAN$NODE_DATABASE, be sure that these are defined in your startup<br />
files prior to starting up LANCP.<br />
8. Disable DECnet MOP booting.<br />
If you use LANCP for satellite booting, you may no longer need DECnet to<br />
handle MOP requests. If this is the case for your site, you can turn off this<br />
capability with the appropriate NCP command (DECnet for <strong>OpenVMS</strong>) or<br />
NCL commands (DECnet–Plus).<br />
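With DECnet for <strong>OpenVMS</strong>, commands similar to the following disable MOP service on a circuit in the permanent database. The circuit name SVA-0 is illustrative; use the NCP command SHOW KNOWN CIRCUITS to identify the circuit names at your site:<br />

```
$ RUN SYS$SYSTEM:NCP
NCP> DEFINE CIRCUIT SVA-0 SERVICE DISABLED
NCP> EXIT
```

Because DEFINE modifies the permanent database, the change takes effect the next time the network is started.<br />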
For more information about the LANCP utility, see the <strong>OpenVMS</strong> System<br />
Manager’s Manual and the <strong>OpenVMS</strong> System Management Utilities Reference<br />
Manual.<br />
4.5.6 Configuring DECnet<br />
The process of configuring the DECnet network typically entails several<br />
operations, as shown in Table 4–3. An <strong>OpenVMS</strong> <strong>Cluster</strong> running both<br />
implementations of DECnet requires a system disk for DECnet for <strong>OpenVMS</strong><br />
(Phase IV) and another system disk for DECnet–Plus (Phase V).<br />
Note: DECnet for <strong>OpenVMS</strong> implements Phase IV of Digital Network<br />
Architecture (DNA). DECnet–Plus implements Phase V of DNA. The following<br />
discussions are specific to the DECnet for <strong>OpenVMS</strong> product.<br />
Reference: Refer to the DECnet–Plus documentation for equivalent DECnet–<br />
Plus configuration information.<br />
Table 4–3 Procedure for Configuring the DECnet Network<br />
Step Action<br />
1 Log in as system manager and execute the NETCONFIG.COM command procedure as shown. Enter information<br />
about your node when prompted. Note that DECnet–Plus nodes execute the NET$CONFIGURE.COM command<br />
procedure.<br />
Reference: See the DECnet for <strong>OpenVMS</strong> or the DECnet–Plus documentation, as appropriate, for examples of<br />
these procedures.<br />
2 When a node uses multiple LAN adapter connections to the same LAN and also uses DECnet for communications,<br />
you must disable DECnet use of all but one of the LAN devices.<br />
To do this, remove all but one of the lines and circuits associated with the adapters connected to the same LAN<br />
or extended LAN from the DECnet configuration database after the NETCONFIG.COM procedure is run.<br />
For example, issue the following commands to invoke NCP and disable DECnet use of the LAN device XQB0:<br />
$ RUN SYS$SYSTEM:NCP<br />
NCP> PURGE CIRCUIT QNA-1 ALL<br />
NCP> DEFINE CIRCUIT QNA-1 STA OFF<br />
NCP> EXIT<br />
References:<br />
See Guidelines for <strong>OpenVMS</strong> <strong>Cluster</strong> Configurations for more information about distributing connections to LAN<br />
segments in <strong>OpenVMS</strong> <strong>Cluster</strong> configurations.<br />
See the DECnet–Plus documentation for information about removing routing circuits associated with all but<br />
one LAN adapter. (Note that the LAN adapter issue is not a problem if the DECnet–Plus node uses extended<br />
addressing and does not have any Phase IV compatible addressing in use on any of the routing circuits.)<br />
3 Make remote node data available clusterwide. NETCONFIG.COM creates in the SYS$SPECIFIC:[SYSEXE]<br />
directory the permanent remote-node database file NETNODE_REMOTE.DAT, in which remote-node data<br />
is maintained. To make this data available throughout the <strong>OpenVMS</strong> <strong>Cluster</strong>, you move the file to the<br />
SYS$COMMON:[SYSEXE] directory.<br />
Example: Enter the following commands to make DECnet information available clusterwide:<br />
$ RENAME SYS$SPECIFIC:[SYSEXE]NETNODE_REMOTE.DAT SYS$COMMON:[SYSEXE]NETNODE_REMOTE.DAT<br />
If your configuration includes multiple system disks, you can set up a common NETNODE_REMOTE.DAT file<br />
automatically by using the following command in SYLOGICALS.COM:<br />
$ DEFINE/SYSTEM/EXE NETNODE_REMOTE ddcu:[directory]NETNODE_REMOTE.DAT<br />
Notes: Compaq recommends that you set up a common NETOBJECT.DAT file clusterwide in the same manner.<br />
DECdns is used by DECnet–Plus nodes to manage node data (the namespace). For DECnet–Plus, Session<br />
Control Applications replace objects.<br />
4 Designate and enable router nodes to support the use of a cluster alias. At least one node participating in a<br />
cluster alias must be configured as a level 1 router.<br />
On VAX systems, you can designate a computer as a router node when you execute NETCONFIG.COM (as<br />
shown in step 1).<br />
On Alpha systems, you might need to enable level 1 routing manually because the NETCONFIG.COM procedure<br />
does not prompt you with the routing question.<br />
Depending on whether the configuration includes all Alpha nodes or a combination of VAX and Alpha nodes,<br />
follow these instructions:<br />
• If the cluster consists of Alpha nodes only, you must enable level 1 routing manually (see the example below)<br />
on one of the Alpha nodes.<br />
• If the cluster consists of both Alpha and VAX nodes, you do not need to enable level 1 routing on an Alpha<br />
node if one of the VAX nodes is already a routing node. In that case, you also do not need to enable the DECnet<br />
extended function license (DVNETEXT) on an Alpha node.<br />
Example: On Alpha systems, if you need to enable level 1 routing on an Alpha node, invoke the NCP utility to<br />
do so. For example:<br />
$ RUN SYS$SYSTEM:NCP<br />
NCP> DEFINE EXECUTOR TYPE ROUTING IV<br />
On Alpha systems, level 1 routing is supported to enable cluster alias operations only.<br />
5 Optionally, define a cluster alias. If you want to define a cluster alias, invoke the NCP utility to do so. The<br />
information you specify using these commands is entered in the DECnet permanent executor database and takes<br />
effect when you start the network.<br />
Example: The following NCP commands establish SOLAR as an alias:<br />
$ RUN SYS$SYSTEM:NCP<br />
NCP> DEFINE NODE 2.1 NAME SOLAR<br />
NCP> DEFINE EXECUTOR ALIAS NODE SOLAR<br />
NCP> EXIT<br />
$<br />
Reference: Section 4.5.8 describes the cluster alias. Section 4.5.9 describes how to enable alias operations<br />
for other computers. See the DECnet–Plus documentation for information about setting up a cluster alias on<br />
DECnet–Plus nodes.<br />
Note: DECnet for <strong>OpenVMS</strong> nodes and DECnet–Plus nodes cannot share a cluster alias.<br />
4.5.7 Starting DECnet<br />
If you are using DECnet–Plus, a separate step is not required to start the<br />
network. DECnet–Plus starts automatically on the next reboot after the node has<br />
been configured using the NET$CONFIGURE.COM procedure.<br />
If you are using DECnet for <strong>OpenVMS</strong>, at the system prompt, enter the following<br />
command to start the network:<br />
$ @SYS$MANAGER:STARTNET.COM<br />
To ensure that the network is started each time an <strong>OpenVMS</strong> <strong>Cluster</strong> computer<br />
boots, add that command line to the appropriate startup command file or files.<br />
(Startup command files are discussed in Section 5.6.)<br />
4.5.8 What is the <strong>Cluster</strong> Alias?<br />
The cluster alias acts as a single network node identifier for an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
system. When enabled, the cluster alias makes all the <strong>OpenVMS</strong> <strong>Cluster</strong> nodes<br />
appear to be one node from the point of view of the rest of the network.<br />
Computers in the cluster can use the alias for communications with other<br />
computers in a DECnet network. For example, networked applications that use<br />
the services of an <strong>OpenVMS</strong> <strong>Cluster</strong> should use an alias name. Doing so ensures<br />
that the remote access will be successful when at least one <strong>OpenVMS</strong> <strong>Cluster</strong><br />
member is available to process the client program’s requests.<br />
Rules:<br />
• DECnet for <strong>OpenVMS</strong> (Phase IV) allows a maximum of 64 <strong>OpenVMS</strong> <strong>Cluster</strong><br />
computers to participate in a cluster alias. If your cluster includes more than<br />
64 computers, you must determine which 64 should participate in the alias<br />
and then define the alias on those computers.<br />
At least one of the <strong>OpenVMS</strong> <strong>Cluster</strong> nodes that uses the alias node identifier<br />
must have level 1 routing enabled.<br />
On Alpha nodes, routing between multiple circuits is not supported; level 1 routing is supported<br />
only to enable cluster alias operations. The DVNETEXT PAK must be used to enable this limited<br />
function.<br />
On VAX nodes, full level 1 routing support is available.<br />
On both Alpha and VAX systems, all cluster nodes sharing the same alias<br />
node address must be in the same area.<br />
• DECnet–Plus allows a maximum of 96 <strong>OpenVMS</strong> <strong>Cluster</strong> computers to<br />
participate in the cluster alias.<br />
DECnet–Plus does not require that a cluster member be a routing node, but<br />
an adjacent Phase V router is required to use a cluster alias for DECnet–Plus<br />
systems.<br />
• A single cluster alias can include nodes running either DECnet for <strong>OpenVMS</strong><br />
or DECnet–Plus, but not both.<br />
4.5.9 Enabling Alias Operations<br />
If you have defined a cluster alias and have enabled routing as shown in<br />
Section 4.5.6, you can enable alias operations for other computers after the<br />
computers are up and running in the cluster. To enable such operations (that<br />
is, to allow a computer to accept incoming connect requests directed toward the<br />
alias), follow these steps:<br />
1. Log in as system manager and invoke the SYSMAN utility. For example:<br />
$ RUN SYS$SYSTEM:SYSMAN<br />
SYSMAN><br />
2. At the SYSMAN> prompt, enter the following commands:<br />
SYSMAN> SET ENVIRONMENT/CLUSTER<br />
%SYSMAN-I-ENV, current command environment:<br />
<strong>Cluster</strong>wide on local cluster<br />
Username SYSTEM will be used on nonlocal nodes<br />
SYSMAN> SET PROFILE/PRIVILEGES=(OPER,SYSPRV)<br />
SYSMAN> DO MCR NCP SET EXECUTOR STATE OFF<br />
%SYSMAN-I-OUTPUT, command execution on node X...<br />
...<br />
SYSMAN> DO MCR NCP DEFINE EXECUTOR ALIAS INCOMING ENABLED<br />
%SYSMAN-I-OUTPUT, command execution on node X...<br />
...<br />
SYSMAN> DO @SYS$MANAGER:STARTNET.COM<br />
%SYSMAN-I-OUTPUT, command execution on node X...<br />
...<br />
Note: Compaq does not recommend enabling alias operations for satellite nodes.<br />
Reference: For more details about DECnet for <strong>OpenVMS</strong> networking and<br />
cluster alias, see the DECnet for <strong>OpenVMS</strong> Networking Manual and DECnet<br />
for <strong>OpenVMS</strong> Network Management Utilities. For equivalent information about<br />
DECnet–Plus, see the DECnet–Plus documentation.<br />
5<br />
Preparing a Shared Environment<br />
In any <strong>OpenVMS</strong> <strong>Cluster</strong> environment, it is best to share resources as much as<br />
possible. Resource sharing facilitates workload balancing because work can be<br />
distributed across the cluster.<br />
5.1 Shareable Resources<br />
Most, but not all, resources can be shared across nodes in an <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
The following table describes resources that can be shared.<br />
System disks: All members of the same architecture1 can share a single system disk, each member can have its<br />
own system disk, or members can use a combination of both methods.<br />
Data disks: All members can share any data disks. For local disks, access is limited to the local node unless you<br />
explicitly set up the disks to be cluster accessible by means of the MSCP server.<br />
Tape drives: All members can share tape drives. (Note that this does not imply that all members can have<br />
simultaneous access.) For local tape drives, access is limited to the local node unless you explicitly set up the<br />
tapes to be cluster accessible by means of the TMSCP server. Only DSA tapes can be served to all <strong>OpenVMS</strong><br />
<strong>Cluster</strong> members.<br />
Batch and print queues: Users can submit batch jobs to any queue in the <strong>OpenVMS</strong> <strong>Cluster</strong>, regardless<br />
of the processor on which the job will actually execute. Generic queues can balance the load among the available<br />
processors.<br />
Applications: Most applications work in an <strong>OpenVMS</strong> <strong>Cluster</strong> just as they do on a single<br />
system. Application designers can also create applications that run simultaneously on multiple <strong>OpenVMS</strong><br />
<strong>Cluster</strong> nodes, which share data in a file.<br />
User authorization files: All nodes can use either a common user authorization file (UAF) for the same access on<br />
all systems or multiple UAFs to enable node-specific quotas. If a common UAF is used, all user passwords,<br />
directories, limits, quotas, and privileges are the same on all systems.<br />
1 Data on system disks can be shared between Alpha and VAX processors. However, VAX nodes cannot<br />
boot from an Alpha system disk, and Alpha nodes cannot boot from a VAX system disk.<br />
Preparing a Shared Environment 5–1
5.1.1 Local Resources<br />
The following table lists resources that are accessible only to the local node.<br />
Memory: Each <strong>OpenVMS</strong> <strong>Cluster</strong> member maintains its own memory.<br />
User processes: When a user process is created on an <strong>OpenVMS</strong> <strong>Cluster</strong> member, the process must<br />
complete on that computer, using local memory.<br />
Printers: A printer that does not accept input through queues is used only by the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
member to which it is attached. A printer that accepts input through queues is accessible by any <strong>OpenVMS</strong><br />
<strong>Cluster</strong> member.<br />
5.1.2 Sample Configuration<br />
Figure 5–1 shows an example of an <strong>OpenVMS</strong> <strong>Cluster</strong> system that has both an<br />
Alpha system disk and a VAX system disk, and a dual-ported disk that is set up<br />
so the environmental files can be shared between the Alpha and VAX systems.<br />
Figure 5–1 Resource Sharing in Mixed-Architecture <strong>Cluster</strong> System<br />
[Figure: Alpha and VAX nodes connect through CI star couplers to HSJ controllers serving an Alpha system<br />
disk, a VAX system disk, and a dual-ported disk containing the shared environment files.]<br />
5.2 Common-Environment and Multiple-Environment <strong>Cluster</strong>s<br />
Depending on your processing needs, you can prepare either an environment in<br />
which all environmental files are shared clusterwide or an environment in which<br />
some files are shared clusterwide while others are accessible only by certain<br />
computers.<br />
The following table describes the characteristics of common-environment and multiple-environment<br />
clusters.<br />
Common-environment cluster: The operating environment is identical on all nodes in the <strong>OpenVMS</strong><br />
<strong>Cluster</strong>. The environment is set up so that:<br />
• All nodes run the same programs, applications, and utilities.<br />
• All users have the same type of user accounts, and the same logical names are defined.<br />
• All users can have common access to storage devices and queues. (Note that access is subject to how access<br />
control list [ACL] protection is set up for each user.)<br />
• All users can log in to any node in the configuration and work in the same environment as all other users.<br />
Advantage: Easier to manage because you use a common version of each system file.<br />
Multiple-environment cluster: The operating environment can vary from node to node. An individual processor<br />
or a subset of processors is set up to:<br />
• Provide multiple access according to the type of tasks users perform and the resources they use.<br />
• Share a set of resources that are not available on other nodes.<br />
• Perform specialized functions using restricted resources while other processors perform general timesharing<br />
work.<br />
• Allow users to work in environments that are specific to the node where they are logged in.<br />
Advantage: Effective when you want to share some data among computers but you also want certain computers<br />
to serve specialized needs.<br />
5.3 Directory Structure on Common System Disks<br />
The installation or upgrade procedure for your operating system generates a<br />
common system disk, on which most operating system and optional product<br />
files are stored in a common root directory.<br />
5.3.1 Directory Roots<br />
The system disk directory structure is the same on both Alpha and VAX systems.<br />
Whether the system disk is for Alpha or VAX, the entire directory structure—<br />
that is, the common root plus each computer’s local root—is stored on the<br />
same disk. After the installation or upgrade completes, you use the CLUSTER_<br />
CONFIG.COM command procedure described in Chapter 8 to create a local root<br />
for each new computer to use when booting into the cluster.<br />
In addition to the usual system directories, each local root contains a<br />
[SYSn.SYSCOMMON] directory that is a directory alias for [VMS$COMMON],<br />
the cluster common root directory in which cluster common files actually reside.<br />
When you add a computer to the cluster, CLUSTER_CONFIG.COM defines the<br />
common root directory alias.<br />
5.3.2 Directory Structure Example<br />
Figure 5–2 illustrates the directory structure set up for computers JUPITR and<br />
SATURN, which are run from a common system disk. The disk’s master file<br />
directory (MFD) contains the local roots (SYS0 for JUPITR, SYS1 for SATURN)<br />
and the cluster common root directory, [VMS$COMMON].<br />
Figure 5–2 Directory Structure on a Common System Disk<br />
[000000] (Master File Directory)<br />
    [SYS0] (JUPITR's root)<br />
        [SYSCBI] [SYSERR] [SYSEXE] [SYSHLP] [SYSLIB] [SYSMAINT]<br />
        [SYSMGR] [SYSMSG] [SYSTEST] [SYSUPD] [SYSCOMMON]<br />
    [VMS$COMMON] (cluster common root)<br />
        [SYSCBI] [SYSERR] [SYSEXE] [SYSHLP] [SYSLIB] [SYSMAINT]<br />
        [SYSMGR] [SYSMSG] [SYSTEST] [SYSUPD]<br />
    [SYS1] (SATURN's root)<br />
        [SYSCBI] [SYSERR] [SYSEXE] [SYSHLP] [SYSLIB] [SYSMAINT]<br />
        [SYSMGR] [SYSMSG] [SYSTEST] [SYSUPD] [SYSCOMMON]<br />
(In each local root, [SYSCOMMON] is a directory alias for [VMS$COMMON].)<br />
5.3.3 Search Order<br />
The logical name SYS$SYSROOT is defined as a search list that points first to<br />
a local root (SYS$SYSDEVICE:[SYS0.SYSEXE]) and then to the common root<br />
(SYS$COMMON:[SYSEXE]). Thus, the logical names for the system directories<br />
(SYS$SYSTEM, SYS$LIBRARY, SYS$MANAGER, and so forth) point to two<br />
directories.<br />
Figure 5–3 shows how directories on a common system disk are searched when<br />
the logical name SYS$SYSTEM is used in file specifications.<br />
Figure 5–3 File Search Order on Common System Disk<br />
SYS$SYSTEM:file<br />
  -> SYS$SYSROOT:[SYSEXE]file<br />
    1. SYS$SPECIFIC:[SYSEXE]file<br />
         [SYS0.SYSEXE]file (for JUPITR)<br />
         [SYS1.SYSEXE]file (for SATURN)<br />
    2. SYS$COMMON:[SYSEXE]file<br />
         [SYS0.SYSCOMMON.SYSEXE]file (for JUPITR)<br />
         [SYS1.SYSCOMMON.SYSEXE]file (for SATURN)<br />
Important: Keep this search order in mind when you manipulate system files<br />
on a common system disk. Computer-specific files must always reside and be<br />
updated in the appropriate computer’s system subdirectory.<br />
Examples<br />
1. MODPARAMS.DAT must reside in SYS$SPECIFIC:[SYSEXE], which is<br />
[SYS0.SYSEXE] on JUPITR, and in [SYS1.SYSEXE] on SATURN. Thus, to<br />
create a new MODPARAMS.DAT file for JUPITR when logged in on JUPITR,<br />
enter the following command:<br />
$ EDIT SYS$SPECIFIC:[SYSEXE]MODPARAMS.DAT<br />
Once the file is created, you can use the following command to modify it when<br />
logged on to JUPITR:<br />
$ EDIT SYS$SYSTEM:MODPARAMS.DAT<br />
Note that if a MODPARAMS.DAT file does not exist in JUPITR’s<br />
SYS$SPECIFIC:[SYSEXE] directory when you enter this command, but<br />
there is a MODPARAMS.DAT file in the directory SYS$COMMON:[SYSEXE],<br />
the command edits the MODPARAMS.DAT file in the common directory. If<br />
there is no MODPARAMS.DAT file in either directory, the command creates<br />
the file in JUPITR’s SYS$SPECIFIC:[SYSEXE] directory.<br />
2. To modify JUPITR’s MODPARAMS.DAT when logged in on any other<br />
computer that boots from the same common system disk, enter the following<br />
command:<br />
$ EDIT SYS$SYSDEVICE:[SYS0.SYSEXE]MODPARAMS.DAT<br />
3. To modify records in the cluster common system authorization file in a cluster<br />
with a single, cluster-common system disk, enter the following commands on<br />
any computer:<br />
$ SET DEFAULT SYS$COMMON:[SYSEXE]<br />
$ RUN SYS$SYSTEM:AUTHORIZE<br />
4. To modify records in a computer-specific system authorization file when logged<br />
in to another computer that boots from the same cluster common system disk,<br />
you must set your default directory to the specific computer. For example, if<br />
you have set up a computer-specific system authorization file (SYSUAF.DAT)<br />
for computer JUPITR, you must set your default directory to JUPITR’s<br />
computer-specific [SYSEXE] directory before invoking AUTHORIZE, as<br />
follows:<br />
$ SET DEFAULT SYS$SYSDEVICE:[SYS0.SYSEXE]<br />
$ RUN SYS$SYSTEM:AUTHORIZE<br />
5.4 <strong>Cluster</strong>wide Logical Names<br />
<strong>Cluster</strong>wide logical names were introduced in <strong>OpenVMS</strong> Version 7.2 for both<br />
<strong>OpenVMS</strong> Alpha and <strong>OpenVMS</strong> VAX. <strong>Cluster</strong>wide logical names extend the<br />
convenience and ease-of-use features of shareable logical names to <strong>OpenVMS</strong><br />
<strong>Cluster</strong> systems.<br />
Existing applications can take advantage of clusterwide logical names without<br />
any changes to the application code. Only a minor modification to the logical<br />
name tables referenced by the application (directly or indirectly) is required.<br />
New logical names are local by default. <strong>Cluster</strong>wide is an attribute of a logical<br />
name table. In order for a new logical name to be clusterwide, it must be created<br />
in a clusterwide logical name table.<br />
Some of the most important features of clusterwide logical names are:<br />
• When a new node joins the cluster, it automatically receives the current set of<br />
clusterwide logical names.<br />
• When a clusterwide logical name or name table is created, modified, or<br />
deleted, the change is automatically propagated to every other node in the<br />
cluster running <strong>OpenVMS</strong> Version 7.2 or later. Modifications include security<br />
profile changes to a clusterwide table.<br />
• Translations are done locally so there is minimal performance degradation for<br />
clusterwide name translations.<br />
• Because LNM$CLUSTER_TABLE and LNM$SYSCLUSTER_TABLE exist<br />
on all systems running <strong>OpenVMS</strong> Version 7.2 or later, the programs and<br />
command procedures that use clusterwide logical names can be developed,<br />
tested, and run on nonclustered systems.<br />
5.4.1 Default <strong>Cluster</strong>wide Logical Name Tables<br />
To support clusterwide logical names, the operating system creates two<br />
clusterwide logical name tables and their logical names at system startup, as<br />
shown in Table 5–1. These logical name tables and logical names are in addition<br />
to the ones supplied for the process, job, group, and system logical name tables.<br />
The names of the clusterwide logical name tables are contained in the system<br />
logical name directory, LNM$SYSTEM_DIRECTORY.<br />
Table 5–1 Default <strong>Cluster</strong>wide Logical Name Tables and Logical Names<br />
Name Purpose<br />
LNM$SYSCLUSTER_TABLE The default table for clusterwide system logical names.<br />
It is empty when shipped. This table is provided<br />
for system managers who want to use clusterwide<br />
logical names to customize their environments. The<br />
names in this table are available to anyone translating<br />
a logical name using SHOW LOGICAL/SYSTEM, or specifying<br />
a table name of LNM$SYSTEM, LNM$DCL_LOGICAL (DCL's<br />
default table search list), or LNM$FILE_DEV (system and<br />
RMS default).<br />
LNM$SYSCLUSTER The logical name for LNM$SYSCLUSTER_TABLE.<br />
It is provided for convenience in referencing<br />
LNM$SYSCLUSTER_TABLE. It is consistent in<br />
format with LNM$SYSTEM_TABLE and its logical<br />
name, LNM$SYSTEM.<br />
LNM$CLUSTER_TABLE The parent table for all clusterwide logical name<br />
tables, including LNM$SYSCLUSTER_TABLE. When<br />
you create a new table using LNM$CLUSTER_TABLE<br />
as the parent table, the new table will be available<br />
clusterwide.<br />
LNM$CLUSTER The logical name for LNM$CLUSTER_TABLE.<br />
It is provided for convenience in referencing<br />
LNM$CLUSTER_TABLE.<br />
5.4.2 Translation Order<br />
The definition of LNM$SYSTEM has been expanded to include<br />
LNM$SYSCLUSTER. When a system logical name is translated, the search<br />
order is LNM$SYSTEM_TABLE, LNM$SYSCLUSTER_TABLE. Because the<br />
definitions for the system default table names, LNM$FILE_DEV and<br />
LNM$DCL_LOGICAL, include LNM$SYSTEM, translations using those<br />
default tables include definitions in LNM$SYSCLUSTER.<br />
The current precedence order for resolving logical names is preserved.<br />
<strong>Cluster</strong>wide logical names that are translated against LNM$FILE_DEV are<br />
resolved last, after system logical names. The precedence order, from first to last,<br />
is process -> job -> group -> system -> cluster, as shown in Figure 5–4.<br />
Figure 5–4 Translation Order Specified by LNM$FILE_DEV<br />
LNM$FILE_DEV searches the following names and tables, in order:<br />
  1. LNM$PROCESS (process-private)<br />
       -> LNM$PROCESS_TABLE (process table; process-private)<br />
  2. LNM$JOB (process-private)<br />
       -> LNM$JOB_803B9020 (job table; shareable)<br />
  3. LNM$GROUP (process-private)<br />
       -> LNM$GROUP_000200 (group table; shareable)<br />
  4. LNM$SYSTEM (process-private)<br />
       -> LNM$SYSTEM_TABLE (system table; shareable)<br />
       -> LNM$SYSCLUSTER (shareable)<br />
            -> LNM$SYSCLUSTER_TABLE (cluster table; clusterwide shareable)<br />
(The job and group table names shown are examples for one particular job and UIC group.)<br />
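The precedence order can be demonstrated by defining the same name at two levels. In the following sketch, the logical name WORK_DISK and the device names are hypothetical:<br />
$ DEFINE/TABLE=LNM$CLUSTER_TABLE WORK_DISK $1$DKA100:  ! clusterwide definition<br />
$ DEFINE WORK_DISK $1$DKA200:                          ! process-table definition<br />
$ SHOW LOGICAL WORK_DISK<br />
Because the process table is searched before the cluster table, the translation returns the process definition. Deleting the process definition with DEASSIGN WORK_DISK exposes the clusterwide value.<br />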
5.4.3 Creating <strong>Cluster</strong>wide Logical Name Tables<br />
You might want to create additional clusterwide logical name tables for the<br />
following purposes:<br />
• For a multiprocess clusterwide application to use<br />
• For members of a UIC group to share<br />
To create a clusterwide logical name table, you must have create (C) access to<br />
the parent table and write (W) access to LNM$SYSTEM_DIRECTORY, or the<br />
SYSPRV (system) privilege.<br />
A shareable logical name table has UIC-based protection. Each class of user<br />
(system (S), owner (O), group (G), and world (W)) can be granted four types of<br />
access: read (R), write (W), create (C), or delete (D).<br />
You can create additional clusterwide logical name tables in the same way that<br />
you can create additional process, job, and group logical name tables—with the<br />
CREATE/NAME_TABLE command or with the $CRELNT system service. When<br />
creating a clusterwide logical name table, you must specify the /PARENT_TABLE<br />
qualifier and provide a value for the qualifier that is a clusterwide table name.<br />
Any existing clusterwide table used as the parent table will make the new table<br />
clusterwide.<br />
The following example shows how to create a clusterwide logical name table:<br />
$ CREATE/NAME_TABLE/PARENT_TABLE=LNM$CLUSTER_TABLE -<br />
_$ new-clusterwide-logical-name-table<br />
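Because a shareable logical name table has UIC-based protection, you can also specify the protection explicitly when you create the table. The table name and protection values in this sketch are illustrative only:<br />
$ CREATE/NAME_TABLE/PARENT_TABLE=LNM$CLUSTER_TABLE -<br />
_$ /PROTECTION=(S:RWCD,O:RWCD,G:R,W:R) PAYROLL_TABLE<br />
With this protection, only system and owner users can create or delete names in the table; group and world users can only read them.<br />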
5.4.4 Alias Collisions Involving <strong>Cluster</strong>wide Logical Name Tables<br />
Alias collisions involving clusterwide logical name tables are treated differently<br />
from alias collisions of other types of logical name tables. Table 5–2 describes the<br />
types of collisions and their outcomes.<br />
Table 5–2 Alias Collisions and Outcomes<br />
Collision type: Creating a local table with the same name and access mode as an existing clusterwide table.<br />
Outcome: The new local table is not created. The condition value SS$_NORMAL is returned, which means that the service completed successfully but the logical name table already exists. The existing clusterwide table and its names on all nodes remain in effect.<br />
Collision type: Creating a clusterwide table with the same name and access mode as an existing local table.<br />
Outcome: The new clusterwide table is created. The condition value SS$_LNMCREATED is returned, which means that the logical name table was created. The local table and its names are deleted. If the clusterwide table was created with the DCL command DEFINE, a message is displayed:<br />
DCL-I-TABSUPER, previous table table_name has been superseded<br />
If the clusterwide table was created with the $CRELNT system service, $CRELNT returns the condition value SS$_SUPERSEDE.<br />
Collision type: Creating a clusterwide table with the same name and access mode as an existing clusterwide table.<br />
Outcome: The new clusterwide table is not created. The condition value SS$_NORMAL is returned, which means that the service completed successfully but the logical name table already exists. The existing table and all its names remain in effect, regardless of the setting of the $CRELNT system service's CREATE-IF attribute. This prevents surprise implicit deletions of existing table names from other nodes.<br />
5.4.5 Creating <strong>Cluster</strong>wide Logical Names<br />
To create a clusterwide logical name, you must have write (W) access to the<br />
table in which the logical name is to be entered, or SYSNAM privilege if you<br />
are creating clusterwide logical names only in LNM$SYSCLUSTER. Unless<br />
you specify an access mode (user, supervisor, and so on), the access mode of the<br />
logical name you create defaults to the access mode from which the name was<br />
created. If you created the name with a DCL command, the access mode defaults<br />
to supervisor mode. If you created the name with a program, the access mode<br />
typically defaults to user mode.<br />
When you create a clusterwide logical name, you must include the name of a<br />
clusterwide logical name table in the definition of the logical name. You can<br />
create clusterwide logical names by using DCL commands or with the $CRELNM<br />
system service.<br />
The following example shows how to create a clusterwide logical name in the<br />
default clusterwide logical name table, LNM$CLUSTER_TABLE, using the<br />
DEFINE command:<br />
$ DEFINE/TABLE=LNM$CLUSTER_TABLE logical-name equivalence-string<br />
To create clusterwide logical names that will reside in a clusterwide logical<br />
name table you created, you define the new clusterwide logical name with the<br />
DEFINE command, specifying your new clusterwide table’s name with the<br />
/TABLE qualifier, as shown in the following example:<br />
$ DEFINE/TABLE=new-clusterwide-logical-name-table logical-name -<br />
_$ equivalence-string<br />
Note<br />
If you attempt to create a new clusterwide logical name with the same<br />
access mode and identical equivalence names and attributes as an<br />
existing clusterwide logical name, the existing name is not deleted, and<br />
no messages are sent to remote nodes. This behavior differs from similar<br />
attempts for other types of logical names, which delete the existing name<br />
and create the new one. For clusterwide logical names, this difference is<br />
a performance enhancement.<br />
The condition value SS$_NORMAL is returned, which means that<br />
the service completed successfully but the new logical name was not created.<br />
5.4.6 Management Guidelines<br />
When using clusterwide logical names, observe the following guidelines:<br />
1. Do not use certain logical names clusterwide.<br />
The following logical names are not valid for clusterwide use:<br />
• Mailbox names, because mailbox devices are local to a node.<br />
• SYS$NODE and SYS$NODE_FULLNAME, because they must be in<br />
LNM$SYSTEM_TABLE and are node specific.<br />
• LMF$LICENSE_TABLE.<br />
2. Do not redefine LNM$SYSTEM.<br />
LNM$SYSTEM is now defined as LNM$SYSTEM_TABLE,<br />
LNM$SYSCLUSTER_TABLE. Do not reverse the order of these two<br />
tables. If you do, then any names created using the /SYSTEM qualifier or in<br />
LNM$SYSTEM would go in LNM$SYSCLUSTER_TABLE and be clusterwide.<br />
Various system failures would result. For example, the MOUNT/SYSTEM<br />
command would attempt to create a clusterwide logical name for a mounted<br />
volume, which would result in an error.<br />
3. Keep LNM$SYSTEM contents in LNM$SYSTEM.<br />
Do not merge the logical names in LNM$SYSTEM into LNM$SYSCLUSTER.<br />
Many system logical names in LNM$SYSTEM contain system roots and<br />
either node-specific devices, or node-specific directories, or both.<br />
4. Adopt naming conventions for logical names used at your site.<br />
To avoid confusion and name conflicts, develop one naming convention for<br />
system-specific logical names and another for clusterwide logical names.<br />
5. Avoid using the dollar sign ($) in your own site’s logical names, because<br />
<strong>OpenVMS</strong> software uses it in its names.<br />
6. Be aware that clusterwide logical name operations will stall when the<br />
clusterwide logical name database is not consistent.<br />
This can occur during system initialization when the system’s clusterwide<br />
logical name database is not completely initialized. It can also occur when the<br />
cluster server process has not finished updating the clusterwide logical name<br />
database, or during resynchronization after nodes enter or leave the cluster.<br />
As soon as consistency is reestablished, the processing of clusterwide logical<br />
name operations resumes.<br />
5.4.7 Using <strong>Cluster</strong>wide Logical Names in Applications<br />
The $TRNLNM system service and the $GETSYI system service provide<br />
attributes that are specific to clusterwide logical names. This section describes<br />
those attributes. It also describes the use of $CRELNT as it pertains to<br />
creating a clusterwide table. For more information about using logical names in<br />
applications, refer to the <strong>OpenVMS</strong> Programming Concepts Manual.<br />
5.4.7.1 <strong>Cluster</strong>wide Attributes for $TRNLNM System Service<br />
Two clusterwide attributes are available in the $TRNLNM system service:<br />
• LNM$V_CLUSTERWIDE<br />
• LNM$M_INTERLOCKED<br />
LNM$V_CLUSTERWIDE is an output attribute to be returned in the itemlist<br />
if you asked for the LNM$_ATTRIBUTES item for a logical name that is<br />
clusterwide.<br />
LNM$M_INTERLOCKED is an attr argument bit that can be set to ensure<br />
that any clusterwide logical name modifications in progress are completed before<br />
the name is translated. LNM$M_INTERLOCKED is not set by default. If your<br />
application requires translation using the most recent definition of a clusterwide<br />
logical name, use this attribute to ensure that the translation is stalled until all<br />
pending modifications have been made.<br />
On a single system, when one process modifies the shareable part of the logical<br />
name database, the change is visible immediately to other processes on that node.<br />
Moreover, while the modification is in progress, no other process can translate or<br />
modify shareable logical names.<br />
In contrast, when one process modifies the clusterwide logical name database,<br />
the change is visible immediately on that node, but it takes a short time for the<br />
change to be propagated to other nodes. By default, translations of clusterwide<br />
logical names are not stalled. Therefore, it is possible for processes on different<br />
nodes to translate a logical name and get different equivalence names when<br />
modifications are in progress.<br />
The use of LNM$M_INTERLOCKED guarantees that your application will<br />
receive the most recent definition of a clusterwide logical name.<br />
5.4.7.2 <strong>Cluster</strong>wide Attribute for $GETSYI System Service<br />
The clusterwide attribute, SYI$_CWLOGICALS, has been added to the $GETSYI<br />
system service. When you specify SYI$_CWLOGICALS, $GETSYI returns the<br />
value 1 if the clusterwide logical name database has been initialized on the CPU,<br />
or the value 0 if it has not been initialized. Because this number is a Boolean<br />
value (1 or 0), the buffer length field in the item descriptor should specify 1 (byte).<br />
On a nonclustered system, the value of SYI$_CWLOGICALS is always 0.<br />
5.4.7.3 Creating <strong>Cluster</strong>wide Tables with the $CRELNT System Service<br />
When creating a clusterwide table, the $CRELNT requester must supply a table<br />
name. <strong>OpenVMS</strong> does not supply a default name for clusterwide tables because<br />
the use of default names enables a process without the SYSPRV privilege to<br />
create a shareable table.<br />
5.5 Defining and Accessing <strong>Cluster</strong>wide Logical Names<br />
Initializing the clusterwide logical name database on a booting node requires<br />
sending a message to another node and having its CLUSTER_SERVER process<br />
reply with one or more messages containing a description of the database. The<br />
CLUSTER_SERVER process on the booting node requests system services to<br />
create the equivalent names and tables. How long this initialization takes varies<br />
with conditions such as the size of the clusterwide logical name database, the<br />
speed of the cluster interconnect, and the responsiveness of the CLUSTER_<br />
SERVER process on the responding node.<br />
Until a booting node’s copy of the clusterwide logical name database is consistent<br />
with the logical name databases of the rest of the cluster, any attempt on<br />
the booting node to create or delete clusterwide names or tables is stalled<br />
transparently. Because translations are not stalled by default, any attempt<br />
to translate a clusterwide name before the database is consistent may fail<br />
or succeed, depending on timing. To stall a translation until the database is<br />
consistent, specify the F$TRNLNM CASE argument as INTERLOCKED.<br />
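For example, a translation that waits until the database is consistent can be requested from DCL as follows; the logical name SITE_APPS is hypothetical:<br />
$ result = F$TRNLNM("SITE_APPS",,,,"INTERLOCKED")<br />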
5.5.1 Defining <strong>Cluster</strong>wide Logical Names in SYSTARTUP_VMS.COM<br />
In general, system managers edit the SYLOGICALS.COM command procedure<br />
to define site-specific logical names that take effect at system startup. However,<br />
Compaq recommends that, if possible, clusterwide logical names be defined in the<br />
SYSTARTUP_VMS.COM command procedure instead, with the exception of those<br />
logical names discussed in Section 5.5.2. The reason for defining clusterwide<br />
logical names in SYSTARTUP_VMS.COM is that SYSTARTUP_VMS.COM is run<br />
at a much later stage in the booting process than SYLOGICALS.COM.<br />
<strong>OpenVMS</strong> startup is single streamed and synchronous except for actions taken<br />
by created processes, such as the CLUSTER_SERVER process. Although the<br />
CLUSTER_SERVER process is created very early in startup, it is possible that<br />
when SYLOGICALS.COM is executed, the booting node’s copy of the clusterwide<br />
logical name database has not been fully initialized. In such a case, a clusterwide<br />
definition in SYLOGICALS.COM would stall startup and increase the time it<br />
takes for the system to become operational.<br />
<strong>OpenVMS</strong> will ensure that the clusterwide database has been initialized before<br />
SYSTARTUP_VMS.COM is executed.<br />
5.5.2 Defining Certain Logical Names in SYLOGICALS.COM<br />
To be effective, certain logical names, such as LMF$LICENSE, NET$PROXY,<br />
and VMS$OBJECTS, must be defined earlier in startup than when<br />
SYSTARTUP_VMS.COM is invoked. Most such names are defined in SYLOGICALS.COM, with<br />
the exception of VMS$OBJECTS, which is defined in SYSECURITY.COM, and<br />
any names defined in SYCONFIG.COM.<br />
Although Compaq recommends defining clusterwide logical names in<br />
SYSTARTUP_VMS.COM, to define these names to be clusterwide, you must<br />
do so in SYLOGICALS.COM or SYSECURITY.COM. Note that doing this may<br />
increase startup time.<br />
Alternatively, you can take the traditional approach and define these names as<br />
systemwide logical names with the same definition on every node.<br />
5.5.3 Using Conditional Definitions for Startup Command Procedures<br />
For clusterwide definitions in any startup command procedure that is common to<br />
all cluster nodes, Compaq recommends that you use a conditional definition. For<br />
example:<br />
$ IF F$TRNLNM("CLUSTER_APPS") .EQS. "" THEN -<br />
_$ DEFINE/TABLE=LNM$SYSCLUSTER/EXEC CLUSTER_APPS -<br />
_$ $1$DKA500:[COMMON_APPS]<br />
A conditional definition can prevent unpleasant surprises. For example, suppose<br />
a system manager redefines a name that is also defined in SYSTARTUP_<br />
VMS.COM but does not edit SYSTARTUP_VMS.COM because the new definition<br />
is temporary. If a new node joins the cluster, the new node initially<br />
receives the new definition. However, when the new node executes SYSTARTUP_<br />
VMS.COM, it causes all the nodes in the cluster, including itself, to revert to<br />
the original value.<br />
If you include a conditional definition in SYLOGICALS.COM or<br />
SYSECURITY.COM, specify the F$TRNLNM CASE argument as INTERLOCKED<br />
to ensure that clusterwide logical names have been fully initialized before the<br />
translation completes. An example of a conditional definition with the argument<br />
specified follows:<br />
$ IF F$TRNLNM("CLUSTER_APPS",,,,"INTERLOCKED") .EQS. "" THEN -<br />
_$ DEFINE/TABLE=LNM$SYSCLUSTER/EXEC CLUSTER_APPS -<br />
_$ $1$DKA500:[COMMON_APPS]<br />
Note<br />
F$GETSYI("CWLOGICALS") always returns a value of FALSE on a<br />
nonclustered system. Procedures that are designed to run in both clustered<br />
and nonclustered environments should first determine whether they are<br />
in a cluster and, if so, then determine whether clusterwide logical names<br />
are initialized.<br />
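Such a procedure might combine the two checks as in the following sketch; the structure and the DEFINE command shown are illustrative only:<br />
$ IF F$GETSYI("CLUSTER_MEMBER") .EQS. "TRUE"<br />
$ THEN<br />
$     ! In a cluster; act only if the clusterwide database is initialized<br />
$     IF F$GETSYI("CWLOGICALS") .EQS. "TRUE" THEN -<br />
_$        DEFINE/TABLE=LNM$SYSCLUSTER/EXEC CLUSTER_APPS $1$DKA500:[COMMON_APPS]<br />
$ ENDIF<br />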
5.6 Coordinating Startup Command Procedures<br />
Immediately after a computer boots, it runs the site-independent command<br />
procedure SYS$SYSTEM:STARTUP.COM to start up the system and control the<br />
sequence of startup events. The STARTUP.COM procedure calls a number of<br />
other startup command procedures that perform cluster-specific and node-specific<br />
tasks.<br />
The following sections describe how, by setting up appropriate cluster-specific<br />
startup command procedures and other system files, you can prepare the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> operating environment on the first installed computer before<br />
adding other computers to the cluster.<br />
Reference: See also the <strong>OpenVMS</strong> System Manager’s Manual for more<br />
information about startup command procedures.<br />
5.6.1 <strong>OpenVMS</strong> Startup Procedures<br />
Several startup command procedures are distributed as part of the <strong>OpenVMS</strong><br />
operating system. The SYS$SYSTEM:STARTUP.COM command procedure<br />
executes immediately after <strong>OpenVMS</strong> is booted and invokes the site-specific<br />
startup command procedures described in the following table.<br />
Each of the following procedures is invoked by SYS$SYSTEM:STARTUP.COM.<br />
Procedure Name and Function:<br />
SYS$MANAGER:SYPAGSWPFILES.COM<br />
A file to which you add commands to install page and swap files<br />
(other than the primary page and swap files that are installed<br />
automatically).<br />
SYS$MANAGER:SYCONFIG.COM<br />
Connects special devices and loads device I/O drivers.<br />
SYS$MANAGER:SYSECURITY.COM<br />
Defines the location of the security audit and archive files before<br />
it starts the security audit server.<br />
SYS$MANAGER:SYLOGICALS.COM<br />
Creates systemwide logical names, and defines system<br />
components as executive-mode logical names. (Clusterwide<br />
logical names should be defined in SYSTARTUP_VMS.COM.)<br />
Cluster common disks can be mounted at the end of this<br />
procedure.<br />
SYS$MANAGER:SYSTARTUP_VMS.COM<br />
Performs many of the following startup and login functions:<br />
• Mounts all volumes except the system disk.<br />
• Sets device characteristics.<br />
• Defines clusterwide logical names.<br />
• Initializes and starts batch and print queues.<br />
• Installs known images.<br />
• Starts layered products.<br />
• Starts the DECnet software.<br />
• Analyzes the most recent system failure.<br />
• Purges old operator log files.<br />
• Starts the LAT network (if used).<br />
• Defines the maximum number of interactive users.<br />
• Announces that the system is up and running.<br />
• Allows users to log in.<br />
The directory SYS$COMMON:[SYSMGR] contains a template file for each<br />
command procedure that you can edit. Use the command procedure templates (in<br />
SYS$COMMON:[SYSMGR]*.TEMPLATE) as examples for customization of your<br />
system’s startup and login characteristics.<br />
5.6.2 Building Startup Procedures<br />
The first step in preparing an <strong>OpenVMS</strong> <strong>Cluster</strong> shared environment is to build a<br />
SYSTARTUP_VMS command procedure. Each computer executes the procedure<br />
at startup time to define the operating environment.<br />
Prepare the SYSTARTUP_VMS.COM procedure as follows:<br />
Step Action<br />
1 In each computer’s SYS$SPECIFIC:[SYSMGR] directory, edit the SYSTARTUP_<br />
VMS.TEMPLATE file to set up a SYSTARTUP_VMS.COM procedure that:<br />
• Performs computer-specific startup functions such as the following:<br />
Setting up dual-ported and local disks<br />
Loading device drivers<br />
Setting up local terminals and terminal server access<br />
• Invoking the common startup procedure (described next).<br />
2 Build a common command procedure that includes startup commands that you want to be<br />
common to all computers. The common procedure might contain commands that:<br />
• Install images<br />
• Define logical names<br />
• Set up queues<br />
• Set up and mount physically accessible mass storage devices<br />
• Perform any other common startup functions<br />
Note: You might choose to build these commands into individual command procedures that<br />
are invoked from the common procedure. For example, the MSCPMOUNT.COM file in the<br />
SYS$EXAMPLES directory is a sample common command procedure that contains commands<br />
typically used to mount cluster disks. The example includes comments explaining each phase<br />
of the procedure.<br />
3 Place the common procedure in the SYS$COMMON:[SYSMGR] directory on a common system<br />
disk or other cluster-accessible disk.<br />
Important: The common procedure is usually located in the SYS$COMMON:[SYSMGR]<br />
directory on a common system disk but can reside on any disk, provided that the disk is<br />
cluster accessible and is mounted when the procedure is invoked. If you create a copy of the<br />
common procedure for each computer, you must remember to update each copy whenever you<br />
make changes.<br />
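For illustration only, a minimal computer-specific SYSTARTUP_VMS.COM that follows these steps might look like the following sketch; the common procedure name CLUSTER_COMMON.COM and the device name DUA1 are hypothetical placeholders for your site’s own names:<br />
$! SYS$SPECIFIC:[SYSMGR]SYSTARTUP_VMS.COM -- computer-specific startup (sketch)<br />
$ SET NOON ! do not abort startup on a command error<br />
$ MOUNT/SYSTEM DUA1: LOCALDATA ! hypothetical local or dual-ported disk<br />
$! Invoke the common startup procedure on the cluster-accessible disk<br />
$ @SYS$COMMON:[SYSMGR]CLUSTER_COMMON.COM<br />
$ EXIT<br />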
5.6.3 Combining Existing Procedures<br />
To build startup procedures for an <strong>OpenVMS</strong> <strong>Cluster</strong> system in which existing<br />
computers are to be combined, you should compare both the computer-specific<br />
SYSTARTUP_VMS and the common startup command procedures on each<br />
computer and make any adjustments required. For example, you can compare<br />
the procedures from each computer and include commands that define the same<br />
logical names in your common SYSTARTUP_VMS command procedure.<br />
After you have chosen which commands to make common, you can build the<br />
common procedures on one of the <strong>OpenVMS</strong> <strong>Cluster</strong> computers.<br />
5.6.4 Using Multiple Startup Procedures<br />
To define a multiple-environment cluster, you set up computer-specific versions<br />
of one or more system files. For example, if you want to give users larger<br />
working set quotas on URANUS, you would create a computer-specific version<br />
of SYSUAF.DAT and place that file in URANUS’s SYS$SPECIFIC:[SYSEXE]<br />
directory. That directory can be located in URANUS’s root on a common system<br />
disk or on an individual system disk that you have set up on URANUS.<br />
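For example, assuming the common authorization file lives in SYS$COMMON:[SYSEXE], you might seed the computer-specific copy on URANUS and then adjust quotas with the Authorize utility; the user name and quota values below are illustrative only:<br />
$ COPY SYS$COMMON:[SYSEXE]SYSUAF.DAT SYS$SPECIFIC:[SYSEXE]SYSUAF.DAT<br />
$ SET DEFAULT SYS$SPECIFIC:[SYSEXE] ! Authorize opens SYSUAF.DAT in the default directory<br />
$ RUN SYS$SYSTEM:AUTHORIZE<br />
UAF> MODIFY SMITH/WSQUOTA=2048/WSEXTENT=8192<br />
UAF> EXIT<br />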
Follow these steps to build SYSTARTUP and SYLOGIN command files for a<br />
multiple-environment <strong>OpenVMS</strong> <strong>Cluster</strong>:<br />
Step Action<br />
1 Include in SYSTARTUP_VMS.COM elements that you want to remain unique to a computer,<br />
such as commands to define computer-specific logical names and symbols.<br />
2 Place these files in the SYS$SPECIFIC root on each computer.<br />
Example: Consider a three-member cluster consisting of computers JUPITR,<br />
SATURN, and PLUTO. The timesharing environments on JUPITR and SATURN<br />
are the same. However, PLUTO runs applications for a specific user group.<br />
In this cluster, you would create a common SYSTARTUP_VMS command<br />
procedure for JUPITR and SATURN that defines identical environments on<br />
these computers. But the command procedure for PLUTO would be different; it<br />
would include commands to define PLUTO’s special application environment.<br />
5.7 Providing <strong>OpenVMS</strong> <strong>Cluster</strong> System Security<br />
The <strong>OpenVMS</strong> security subsystem ensures that all authorization information<br />
and object security profiles are consistent across all nodes in the cluster.<br />
The <strong>OpenVMS</strong> VAX and <strong>OpenVMS</strong> Alpha operating systems do not support<br />
multiple security domains because the operating system cannot enforce the level<br />
of separation needed to support different security domains on separate cluster<br />
members.<br />
5.7.1 Security Checks<br />
In an <strong>OpenVMS</strong> <strong>Cluster</strong> system, individual nodes use a common set of<br />
authorizations to mediate access control that, in effect, ensures that a security<br />
check results in the same answer from any node in the cluster. The following list<br />
outlines how the <strong>OpenVMS</strong> operating system provides a basic level of protection:<br />
• Authorized users can have processes executing on any <strong>OpenVMS</strong> <strong>Cluster</strong><br />
member.<br />
• A process, acting on behalf of an authorized individual, requests access to a<br />
cluster object.<br />
• A coordinating node determines the outcome by comparing its copy of the<br />
common authorization database with the security profile for the object being<br />
accessed.<br />
The <strong>OpenVMS</strong> operating system provides the same strategy for the protection of<br />
files and queues, and further incorporates all other cluster-visible objects, such as<br />
devices, volumes, and lock resource domains.<br />
Starting with <strong>OpenVMS</strong> Version 7.3, the operating system provides clusterwide<br />
intrusion detection, which extends protection against attacks of all types<br />
throughout the cluster. The intrusion data and information from each system is<br />
integrated to protect the cluster as a whole. Prior to Version 7.3, each system was<br />
protected individually.<br />
The SECURITY_POLICY system parameter controls whether a local or a<br />
clusterwide intrusion database is maintained for each system. The default setting<br />
is for a clusterwide database, which contains all unauthorized attempts and the<br />
state of any intrusion events for all cluster members that are using this setting.<br />
<strong>Cluster</strong> members using the clusterwide intrusion database are made aware if a<br />
cluster member is under attack or has any intrusion events recorded. Events<br />
recorded on one system can cause another system in the cluster to take restrictive<br />
action. (For example, the person attempting to log in is monitored more closely<br />
and limited to a certain number of login retries within a limited period of time.<br />
Once a person exceeds either the retry or time limitation, he or she cannot log<br />
in.)<br />
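As a hedged illustration, the following DCL commands show how you might inspect the current SECURITY_POLICY setting and the recorded intrusion events; the terminal name in the DELETE command is hypothetical:<br />
$ RUN SYS$SYSTEM:SYSGEN<br />
SYSGEN> SHOW SECURITY_POLICY ! display the current parameter value<br />
SYSGEN> EXIT<br />
$ SHOW INTRUSION ! list recorded suspects and intruders<br />
$ DELETE/INTRUSION_RECORD "TTA2:" ! clear a record so logins succeed again<br />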
Actions of the cluster manager in setting up an <strong>OpenVMS</strong> <strong>Cluster</strong> system can<br />
affect the security operations of the system. You can facilitate <strong>OpenVMS</strong> <strong>Cluster</strong><br />
security management using the suggestions discussed in the following sections.<br />
The easiest way to ensure a single security domain is to maintain a single copy of<br />
each of the following files on one or more disks that are accessible from anywhere<br />
in the <strong>OpenVMS</strong> <strong>Cluster</strong> system. When a cluster is configured with multiple<br />
system disks, you can use system logical names (as shown in Section 5.10) to<br />
ensure that only a single copy of each file exists.<br />
The <strong>OpenVMS</strong> security domain is controlled by the data in the following files:<br />
SYS$MANAGER:VMS$AUDIT_SERVER.DAT<br />
SYS$SYSTEM:NETOBJECT.DAT<br />
SYS$SYSTEM:NETPROXY.DAT<br />
TCPIP$PROXY.DAT<br />
SYS$SYSTEM:QMAN$MASTER.DAT<br />
SYS$SYSTEM:RIGHTSLIST.DAT<br />
SYS$SYSTEM:SYSALF.DAT<br />
SYS$SYSTEM:SYSUAF.DAT<br />
SYS$SYSTEM:SYSUAFALT.DAT<br />
SYS$SYSTEM:VMS$OBJECTS.DAT<br />
SYS$SYSTEM:VMS$PASSWORD_HISTORY.DATA<br />
SYS$SYSTEM:VMSMAIL_PROFILE.DATA<br />
SYS$LIBRARY:VMS$PASSWORD_DICTIONARY.DATA<br />
SYS$LIBRARY:VMS$PASSWORD_POLICY.EXE<br />
Note: Using shared files is not the only way of achieving a single security<br />
domain. You may need to use multiple copies of one or more of these files on<br />
different nodes in a cluster. For example, on Alpha nodes you may choose<br />
to deploy system-specific user authorization files (SYSUAFs) to allow for<br />
different memory management working-set quotas among different nodes. Such<br />
configurations are fully supported as long as the security information available to<br />
each node in the cluster is identical.<br />
5.8 Files Relevant to <strong>OpenVMS</strong> <strong>Cluster</strong> Security<br />
Table 5–3 describes the security-relevant portions of the files that must be<br />
common across all cluster members to ensure that a single security domain<br />
exists.<br />
Notes:<br />
• Some of these files are created only on request and may not exist in all<br />
configurations.<br />
• A file can be absent on one node only if it is absent on all nodes.<br />
• As soon as a required file is created on one node, it must be created or<br />
commonly referenced on all remaining cluster nodes.<br />
Table 5–3 Security Files<br />
The following table describes designations for the files in Table 5–3.<br />
Table Keyword Meaning<br />
Required The file contains some data that must be kept common across all cluster<br />
members to ensure that a single security environment exists.<br />
Recommended The file contains data that should be kept common at the discretion of the site<br />
security administrator or system manager. Nonetheless, Digital recommends<br />
that you synchronize the recommended files.<br />
File Name Contains<br />
VMS$AUDIT_SERVER.DAT [recommended]<br />
Information related to security auditing. Among the information contained is the list of enabled security auditing events and the destination of the system security audit journal file. When more than one copy of this file exists, all copies should be updated after any SET AUDIT command.<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system managers should ensure that the name assigned to the security audit journal file resolves to the following location:<br />
SYS$COMMON:[SYSMGR]SECURITY.AUDIT$JOURNAL<br />
Rule: If you need to relocate the audit journal file somewhere other than the system disk (or if you have multiple system disks), you should redirect the audit journal uniformly across all nodes in the cluster. Use the command SET AUDIT/JOURNAL=SECURITY/DESTINATION=file-name, specifying a file name that resolves to the same file throughout the cluster.<br />
Changes are automatically made in the audit server database, SYS$MANAGER:VMS$AUDIT_SERVER.DAT. This database also identifies which events are enabled and how to monitor the audit system’s use of resources, and restores audit system settings each time the system is rebooted.<br />
Caution: Failure to synchronize multiple copies of this file properly may result in partitioned auditing domains.<br />
Reference: For more information, see the <strong>OpenVMS</strong> Guide to System Security.<br />
NETOBJECT.DAT [required]<br />
The DECnet object database. Among the information contained in this file is the list of known DECnet server accounts and passwords. When more than one copy of this file exists, all copies must be updated after every use of the NCP commands SET OBJECT or DEFINE OBJECT.<br />
Caution: Failure to synchronize multiple copies of this file properly may result in unexplained network login failures and unauthorized network access. For instructions on maintaining a single copy, refer to Section 5.10.1.<br />
Reference: Refer to the DECnet–Plus documentation for equivalent NCL command information.<br />
NETPROXY.DAT and NET$PROXY.DAT [required]<br />
The network proxy database. It is maintained by the <strong>OpenVMS</strong> Authorize utility. When more than one copy of this file exists, all copies must be updated after any UAF proxy command.<br />
Note: The NET$PROXY.DAT and NETPROXY.DAT files are equivalent; NET$PROXY is for DECnet–Plus implementations and NETPROXY.DAT is for DECnet for <strong>OpenVMS</strong> implementations.<br />
Caution: Failure to synchronize multiple copies of this file properly may result in unexplained network login failures and unauthorized network access. For instructions on maintaining a single copy, refer to Section 5.10.1.<br />
Reference: Appendix B discusses how to consolidate several NETPROXY.DAT and RIGHTSLIST.DAT files.<br />
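For instance, the UAF proxy commands that trigger this update requirement look like the following sketch; the node and user names are hypothetical:<br />
$ RUN SYS$SYSTEM:AUTHORIZE<br />
UAF> ADD/PROXY REMNOD::JONES JONES/DEFAULT ! allow remote JONES in as local JONES<br />
UAF> SHOW/PROXY *::* ! verify the proxy database contents<br />
UAF> EXIT<br />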
TCPIP$PROXY.DAT This database provides <strong>OpenVMS</strong> identities for remote NFS clients and UNIX-style<br />
identifiers for local NFS client users; provides proxy accounts for remote processes.<br />
For more information about this file, see the Compaq TCP/IP Services for <strong>OpenVMS</strong><br />
Management manual.<br />
QMAN$MASTER.DAT [required]<br />
The master queue manager database. This file contains the security information for all shared batch and print queues.<br />
Rule: If two or more nodes are to participate in a shared queuing system, a single copy of this file must be maintained on a shared disk. For instructions on maintaining a single copy, refer to Section 5.10.1.<br />
RIGHTSLIST.DAT [required]<br />
The rights identifier database. It is maintained by the <strong>OpenVMS</strong> Authorize utility and by various rights identifier system services. When more than one copy of this file exists, all copies must be updated after any change to any identifier or holder records.<br />
Caution: Failure to synchronize multiple copies of this file properly may result in unauthorized system access and unauthorized access to protected objects. For instructions on maintaining a single copy, refer to Section 5.10.1.<br />
Reference: Appendix B discusses how to consolidate several NETPROXY.DAT and RIGHTSLIST.DAT files.<br />
SYSALF.DAT [required]<br />
The system Autologin facility database. It is maintained by the <strong>OpenVMS</strong> SYSMAN utility. When more than one copy of this file exists, all copies must be updated after any SYSMAN ALF command.<br />
Note: This file may not exist in all configurations.<br />
Caution: Failure to synchronize multiple copies of this file properly may result in unexplained login failures and unauthorized system access. For instructions on maintaining a single copy, refer to Section 5.10.1.<br />
SYSUAF.DAT [required]<br />
The system user authorization file. It is maintained by the <strong>OpenVMS</strong> Authorize utility and is modifiable via the $SETUAI system service. When more than one copy of this file exists, you must ensure that the SYSUAF and associated $SETUAI item codes are synchronized for each user record. The following table shows the fields in SYSUAF and their associated $SETUAI item codes.<br />
Internal Field Name $SETUAI Item Code<br />
UAF$R_DEF_CLASS UAI$_DEF_CLASS<br />
UAF$Q_DEF_PRIV UAI$_DEF_PRIV<br />
UAF$B_DIALUP_ACCESS_P UAI$_DIALUP_ACCESS_P<br />
UAF$B_DIALUP_ACCESS_S UAI$_DIALUP_ACCESS_S<br />
UAF$B_ENCRYPT UAI$_ENCRYPT<br />
UAF$B_ENCRYPT2 UAI$_ENCRYPT2<br />
UAF$Q_EXPIRATION UAI$_EXPIRATION<br />
UAF$L_FLAGS UAI$_FLAGS<br />
UAF$B_LOCAL_ACCESS_P UAI$_LOCAL_ACCESS_P<br />
UAF$B_LOCAL_ACCESS_S UAI$_LOCAL_ACCESS_S<br />
UAF$B_NETWORK_ACCESS_P UAI$_NETWORK_ACCESS_P<br />
UAF$B_NETWORK_ACCESS_S UAI$_NETWORK_ACCESS_S<br />
UAF$B_PRIME_DAYS UAI$_PRIMEDAYS<br />
UAF$Q_PRIV UAI$_PRIV<br />
UAF$Q_PWD UAI$_PWD<br />
UAF$Q_PWD2 UAI$_PWD2<br />
UAF$Q_PWD_DATE UAI$_PWD_DATE<br />
UAF$Q_PWD2_DATE UAI$_PWD2_DATE<br />
UAF$B_PWD_LENGTH UAI$_PWD_LENGTH<br />
UAF$Q_PWD_LIFETIME UAI$_PWD_LIFETIME<br />
UAF$B_REMOTE_ACCESS_P UAI$_REMOTE_ACCESS_P<br />
UAF$B_REMOTE_ACCESS_S UAI$_REMOTE_ACCESS_S<br />
UAF$R_MAX_CLASS UAI$_MAX_CLASS<br />
UAF$R_MIN_CLASS UAI$_MIN_CLASS<br />
UAF$W_SALT UAI$_SALT<br />
UAF$L_UIC Not applicable<br />
Caution: Failure to synchronize multiple copies of the SYSUAF files properly may<br />
result in unexplained login failures and unauthorized system access. For instructions<br />
on maintaining a single copy, refer to Section 5.10.1.<br />
Reference: Appendix B discusses creation and management of the various elements<br />
of an <strong>OpenVMS</strong> <strong>Cluster</strong> common SYSUAF.DAT authorization database.<br />
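When separate copies must nonetheless be kept, a change made on one node has to be repeated against every other copy. One hedged way to do that is to point the SYSUAF logical name at the other copy before running Authorize; the device, directory, user name, and flag below are hypothetical:<br />
$ DEFINE/USER SYSUAF NODE2$DKA100:[SYSEXE]SYSUAF.DAT<br />
$ RUN SYS$SYSTEM:AUTHORIZE<br />
UAF> MODIFY SMITH/FLAGS=DISUSER ! repeat the change made to the first copy<br />
UAF> EXIT<br />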
SYSUAFALT.DAT [required]<br />
The system alternate user authorization file. This file serves as a backup to SYSUAF.DAT and is enabled via the SYSUAFALT system parameter. When more than one copy of this file exists, all copies must be updated after any change to any authorization records in this file.<br />
Note: This file may not exist in all configurations.<br />
Caution: Failure to synchronize multiple copies of this file properly may result in unexplained login failures and unauthorized system access.<br />
†VMS$OBJECTS.DAT [required]<br />
On VAX systems, this file is located in SYS$COMMON:[SYSEXE] and contains the clusterwide object database. Among the information contained in this file are the security profiles for all clusterwide objects. When more than one copy of this file exists, all copies must be updated after any change to the security profile of a clusterwide object or after new clusterwide objects are created. <strong>Cluster</strong>wide objects include disks, tapes, and resource domains.<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system managers should ensure that the security object database is present on each node in the <strong>OpenVMS</strong> <strong>Cluster</strong> by specifying a file name that resolves to the same file throughout the cluster, not to a file that is unique to each node.<br />
The database is updated whenever characteristics are modified, and the information is distributed so that all nodes participating in the cluster share a common view of the objects. The security database is created and maintained by the audit server process.<br />
Rule: If you relocate the database, be sure the logical name VMS$OBJECTS resolves to the same file for all nodes in a common-environment cluster. To reestablish the logical name after each system boot, define the logical in SYSECURITY.COM.<br />
Caution: Failure to synchronize multiple copies of this file properly may result in unauthorized access to protected objects.<br />
VMS$PASSWORD_HISTORY.DATA [recommended]<br />
The system password history database. It is maintained by the system password change facility. When more than one copy of this file exists, all copies should be updated after any password change.<br />
Caution: Failure to synchronize multiple copies of this file properly may result in a violation of the system password policy.<br />
VMSMAIL_PROFILE.DATA [recommended]<br />
The system mail database. This file is maintained by the <strong>OpenVMS</strong> Mail utility and contains mail profiles for all system users. Among the information contained in this file is the list of all mail forwarding addresses in use on the system. When more than one copy of this file exists, all copies should be updated after any changes to mail forwarding.<br />
Caution: Failure to synchronize multiple copies of this file properly may result in unauthorized disclosure of information.<br />
VMS$PASSWORD_DICTIONARY.DATA [recommended]<br />
The system password dictionary. The system password dictionary is a list of English language words and phrases that are not legal for use as account passwords. When more than one copy of this file exists, all copies should be updated after any site-specific additions.<br />
Caution: Failure to synchronize multiple copies of this file properly may result in a violation of the system password policy.<br />
†VAX specific<br />
VMS$PASSWORD_POLICY.EXE [recommended]<br />
Any site-specific password filters. It is created and installed by the site-security administrator or system manager. When more than one copy of this file exists, all copies should be identical.<br />
Caution: Failure to synchronize multiple copies of this file properly may result in a violation of the system password policy.<br />
Note: System managers can create this file as an image to enforce their local password policy. This is an architecture-specific image file that cannot be shared between VAX and Alpha computers.<br />
5.9 Network Security<br />
Network security should promote interoperability and uniform security<br />
approaches throughout networks. The following list shows three major areas of<br />
network security:<br />
• User authentication<br />
• <strong>OpenVMS</strong> <strong>Cluster</strong> membership management<br />
• Using a security audit log file<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system managers should also ensure consistency in the use of<br />
DECnet software for intracluster communication.<br />
5.9.1 Mechanisms<br />
Depending on the level of network security required, you might also want<br />
to consider how other security mechanisms, such as protocol encryption and<br />
decryption, can promote additional security protection across the cluster.<br />
Reference: See the <strong>OpenVMS</strong> Guide to System Security.<br />
5.10 Coordinating System Files<br />
Follow these guidelines to coordinate system files:<br />
IF you are setting up... THEN follow the procedures in...<br />
A common-environment <strong>OpenVMS</strong> <strong>Cluster</strong> that consists of newly installed systems: <strong>OpenVMS</strong> System Manager’s Manual to build these files. Because the files on new operating systems are empty except for the Digital-supplied accounts, very little coordination is necessary.<br />
An <strong>OpenVMS</strong> <strong>Cluster</strong> that will combine one or more computers that have been running with computer-specific files: Appendix B to create common copies of the files from the computer-specific files.<br />
5.10.1 Procedure<br />
In a common-environment cluster with one common system disk,<br />
you use a common copy of each system file and place the files in the<br />
SYS$COMMON:[SYSEXE] directory on the common system disk or on a disk<br />
that is mounted by all cluster nodes. No further action is required.<br />
To prepare a common user environment for an <strong>OpenVMS</strong> <strong>Cluster</strong> system that<br />
includes more than one common VAX system disk or more than one common<br />
Alpha system disk, you must coordinate the system files on those disks.<br />
Rules: The following rules apply to the procedures described in Table 5–4:<br />
• Disks holding common resources must be mounted early in the system startup<br />
procedure, such as in the SYLOGICALS.COM procedure.<br />
• You must ensure that the disks are mounted with each <strong>OpenVMS</strong> <strong>Cluster</strong><br />
reboot.<br />
Table 5–4 Procedure for Coordinating Files<br />
Step Action<br />
1 Decide where to locate the SYSUAF.DAT and NETPROXY.DAT files. In a cluster with<br />
multiple system disks, system management is much easier if the common system files are<br />
located on a single disk that is not a system disk.<br />
2 Copy SYS$SYSTEM:SYSUAF.DAT and SYS$SYSTEM:NETPROXY.DAT to a location other<br />
than the system disk.<br />
3 Copy SYS$SYSTEM:RIGHTSLIST.DAT and SYS$SYSTEM:VMSMAIL_PROFILE.DATA to the<br />
same directory in which SYSUAF.DAT and NETPROXY.DAT reside.<br />
4 Edit the file SYS$COMMON:[SYSMGR]SYLOGICALS.COM on each system disk and define<br />
logical names that specify the location of the cluster common files.<br />
Example: If the files will be located on $1$DJA16, define logical names as follows:<br />
$ DEFINE/SYSTEM/EXEC SYSUAF -<br />
$1$DJA16:[VMS$COMMON.SYSEXE]SYSUAF.DAT<br />
$ DEFINE/SYSTEM/EXEC NETPROXY -<br />
$1$DJA16:[VMS$COMMON.SYSEXE]NETPROXY.DAT<br />
$ DEFINE/SYSTEM/EXEC RIGHTSLIST -<br />
$1$DJA16:[VMS$COMMON.SYSEXE]RIGHTSLIST.DAT<br />
$ DEFINE/SYSTEM/EXEC VMSMAIL_PROFILE -<br />
$1$DJA16:[VMS$COMMON.SYSEXE]VMSMAIL_PROFILE.DATA<br />
$ DEFINE/SYSTEM/EXEC NETNODE_REMOTE -<br />
$1$DJA16:[VMS$COMMON.SYSEXE]NETNODE_REMOTE.DAT<br />
$ DEFINE/SYSTEM/EXEC NETNODE_UPDATE -<br />
$1$DJA16:[VMS$COMMON.SYSMGR]NETNODE_UPDATE.COM<br />
$ DEFINE/SYSTEM/EXEC QMAN$MASTER -<br />
$1$DJA16:[VMS$COMMON.SYSEXE]<br />
5 To ensure that the system disks are mounted correctly with each reboot, follow these steps:<br />
1. Copy the SYS$EXAMPLES:CLU_MOUNT_DISK.COM file to the<br />
[VMS$COMMON.SYSMGR] directory, and edit it for your configuration.<br />
2. Edit SYLOGICALS.COM and include commands to mount, with the appropriate volume<br />
label, the system disk containing the shared files.<br />
Example: If the system disk is $1$DJA16, include the following command:<br />
$ @SYS$SYSDEVICE:[VMS$COMMON.SYSMGR]CLU_MOUNT_DISK.COM $1$DJA16: volume-label<br />
6 When you are ready to start the queuing system, be sure you have moved the queue and<br />
journal files to a cluster-available disk. Any cluster common disk is a good choice if the disk<br />
has sufficient space.<br />
Enter the following command:<br />
$ START/QUEUE/MANAGER $1$DJA16:[VMS$COMMON.SYSEXE]<br />
5.10.2 Network Database Files<br />
In <strong>OpenVMS</strong> <strong>Cluster</strong> systems on the LAN and in mixed-interconnect clusters,<br />
you must also coordinate the SYS$MANAGER:NETNODE_UPDATE.COM file,<br />
which is a file that contains all essential network configuration data for satellites.<br />
NETNODE_UPDATE.COM is updated each time you add or remove a satellite<br />
or change its Ethernet or FDDI hardware address. This file is discussed more<br />
thoroughly in Section 10.4.2.<br />
In <strong>OpenVMS</strong> <strong>Cluster</strong> systems configured with DECnet for <strong>OpenVMS</strong> software,<br />
you must also coordinate NETNODE_REMOTE.DAT, which is the remote node<br />
network database.<br />
5.11 System Time on the <strong>Cluster</strong><br />
When a computer joins the cluster, the cluster attempts to set the joining<br />
computer’s system time to the current time on the cluster. Although it is likely<br />
that the system time will be similar on each cluster computer, there is no<br />
assurance that the time will be set. Also, no attempt is made to ensure that the<br />
system times remain similar throughout the cluster. (For example, there is no<br />
protection against different computers having different clock rates.)<br />
An <strong>OpenVMS</strong> <strong>Cluster</strong> system spanning multiple time zones must use a single,<br />
clusterwide common time on all nodes. Use of a common time ensures timestamp<br />
consistency (for example, between applications, file-system instances) across the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> members.<br />
5.11.1 Setting System Time<br />
Use the SYSMAN command CONFIGURATION SET TIME to set the time across<br />
the cluster. This command issues warnings if the time on all nodes cannot be<br />
set within certain limits. Refer to the <strong>OpenVMS</strong> System Manager’s Manual for<br />
information about the SET TIME command.<br />
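For example, a hedged sketch of setting a common time of day on all members from a single node; the time value is illustrative:<br />
$ RUN SYS$SYSTEM:SYSMAN<br />
SYSMAN> SET ENVIRONMENT/CLUSTER<br />
SYSMAN> CONFIGURATION SET TIME 14:00:00<br />
SYSMAN> EXIT<br />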
6<br />
<strong>Cluster</strong> Storage Devices<br />
One of the most important features of <strong>OpenVMS</strong> <strong>Cluster</strong> systems is the ability to<br />
provide access to devices and files across multiple systems.<br />
In a traditional computing environment, a single system is directly attached to<br />
its storage subsystems. Even though the system may be networked with other<br />
systems, when the system is shut down, no other system on the network has<br />
access to its disks or any other devices attached to the system.<br />
In an <strong>OpenVMS</strong> <strong>Cluster</strong> system, disks and tapes can be made accessible to one or<br />
more members. So, if one computer shuts down, the remaining computers still<br />
have access to the devices.<br />
6.1 Data File Sharing<br />
6.1.1 Access Methods<br />
<strong>Cluster</strong>-accessible devices play a key role in <strong>OpenVMS</strong> <strong>Cluster</strong>s because, when<br />
you place data files or applications on a cluster-accessible device, computers can<br />
share a single copy of each common file. Data sharing is possible between VAX<br />
computers, between Alpha computers, and between VAX and Alpha computers.<br />
In addition, multiple systems (VAX and Alpha) can write to a shared disk file<br />
simultaneously. It is this ability that allows multiple systems in an <strong>OpenVMS</strong><br />
<strong>Cluster</strong> to share a single system disk; multiple systems can boot from the same<br />
system disk and share operating system files and utilities to save disk space and<br />
simplify system management.<br />
Note: Tapes do not allow multiple systems to access a tape file simultaneously.<br />
Depending on your business needs, you may want to restrict access to a particular<br />
device to the users on the computer that is directly connected (local) to the<br />
device. Alternatively, you may decide to set up a disk or tape as a served device<br />
so that any user on any <strong>OpenVMS</strong> <strong>Cluster</strong> computer can allocate and use it.<br />
Table 6–1 describes the various access methods.<br />
Table 6–1 Device Access Methods<br />
Method: Local<br />
Device Access: Restricted to the computer that is directly connected to the device.<br />
Comments: Can be set up to be served to other systems.<br />
Illustrated in: Figure 6–3<br />
Method: Dual ported<br />
Device Access: Using either of two physical ports, each of which can be connected to separate controllers. A dual-ported disk can survive the failure of a single controller by failing over to the other controller.<br />
Comments: As long as one of the controllers is available, the device is accessible by all systems in the cluster.<br />
Illustrated in: Figure 6–1<br />
Method: Shared<br />
Device Access: Through a shared interconnect to multiple systems.<br />
Comments: Can be set up to be served to systems that are not on the shared interconnect.<br />
Illustrated in: Figure 6–2<br />
Method: Served<br />
Device Access: Through a computer that has the MSCP or TMSCP server software loaded.<br />
Comments: MSCP and TMSCP serving are discussed in Section 6.3.<br />
Illustrated in: Figures 6–2 and 6–3<br />
Method: Dual pathed<br />
Device Access: Possible through more than one path.<br />
Comments: If one path fails, the device is accessed over the other path. Requires the use of allocation classes (described in Section 6.2.1) to provide a unique, path-independent name.<br />
Illustrated in: Figure 6–2<br />
Note: The path to an individual disk may appear to be local from some nodes and served from others.<br />
6.1.2 Examples<br />
When storage subsystems are connected directly to a specific system, the<br />
availability of the subsystem is lower due to the reliance on the host system.<br />
To increase the availability of these configurations, <strong>OpenVMS</strong> <strong>Cluster</strong> systems<br />
support dual porting, dual pathing, and MSCP and TMSCP serving.<br />
Figure 6–1 shows a dual-ported configuration, in which the disks have<br />
independent connections to two separate computers. As long as one of the<br />
computers is available, the disk is accessible by the other systems in the cluster.
Figure 6–1 Dual-Ported Disks<br />
[Figure: Three CPUs on a network; two of the CPUs connect directly to dual-ported, shadowed disks, so CPU failure does not prohibit access to data.]<br />
Note: Disks can also be shadowed using Volume Shadowing for <strong>OpenVMS</strong>. The<br />
automatic recovery from system failure provided by dual porting and shadowing<br />
is transparent to users and does not require any operator intervention.<br />
Figure 6–2 shows a dual-pathed SCSI and LAN configuration. The disk<br />
devices, accessible through a shared SCSI interconnect, are MSCP served to the<br />
client nodes on the LAN.<br />
Rule: A dual-pathed DSA disk cannot be used as a system disk for a directly<br />
connected CPU. Because a device can be on line to one controller at a time, only<br />
one of the server nodes can use its local connection to the device. The second<br />
server node accesses the device through the MSCP server (or the TMSCP server). If the<br />
computer that is currently serving the device fails, the other computer detects<br />
the failure and fails the device over to its local connection. The device thereby<br />
remains available to the cluster.<br />
Figure 6–2 Dual-Pathed Disks<br />
[Figure: Two server-node CPUs share a SCSI interconnect to the disks and serve them over the LAN to client nodes.]<br />
Dual-pathed disks or tapes can be failed over between two computers that serve<br />
the devices to the cluster, provided that:<br />
• The same device controller letter is generated and the same allocation class<br />
is specified on each computer, with the result that the device has the same<br />
name on both systems. (Section 6.2.1 describes allocation classes.)<br />
• Both computers are running the MSCP server for disks, the TMSCP server<br />
for tapes, or both.<br />
Caution: Failure to observe these requirements can endanger data integrity.<br />
You can set up HSC or HSJ storage devices to be dual ported between two storage<br />
subsystems, as shown in Figure 6–3.
Figure 6–3 Configuration with <strong>Cluster</strong>-Accessible Devices<br />
[Figure: A VAX and an Alpha are joined by CI through a star coupler to two HSJ subsystems that dual port a set of disks, including a quorum disk, over SCSI. Each computer also MSCP serves its local devices, including a local disk, to the cluster.]<br />
By design, HSC and HSJ disks and tapes are directly accessible by all <strong>OpenVMS</strong><br />
<strong>Cluster</strong> nodes that are connected to the same star coupler. Therefore, if the<br />
devices are dual ported, they are automatically dual pathed. Computers<br />
connected by CI can access a dual-ported HSC or HSJ device by way of a<br />
path through either subsystem connected to the device. If one subsystem fails,<br />
access fails over to the other subsystem.<br />
Note: To control the path that is taken during failover, you can specify a<br />
preferred path to force access to disks over a specific path. Section 6.1.3 describes<br />
the preferred-path capability.<br />
6.1.3 Specifying a Preferred Path<br />
The operating system supports specifying a preferred path for DSA disks,<br />
including RA series disks and disks that are accessed through the MSCP server.<br />
(This function is not available for tapes.) If a preferred path is specified for a<br />
disk, the MSCP disk class drivers use that path:<br />
• For the first attempt to locate the disk and bring it on line with the DCL<br />
command MOUNT<br />
• For failover of an already mounted disk<br />
In addition, you can initiate failover of a mounted disk to force the disk to the<br />
preferred path or to use load-balancing information for disks accessed by MSCP<br />
servers.<br />
You can specify the preferred path by using the SET PREFERRED_PATH DCL<br />
command or by using the $QIO function (IO$_SETPRFPATH), with the P1<br />
parameter containing the address of a counted ASCII string (.ASCIC). This string<br />
is the node name of the HSC or HSJ, or of the <strong>OpenVMS</strong> system that is to be the<br />
preferred path.<br />
Rule: The node name must match an existing node running the MSCP server<br />
that is known to the local node.<br />
Reference: For more information about the use of the SET PREFERRED_PATH<br />
DCL command, refer to the <strong>OpenVMS</strong> DCL Dictionary: N–Z.<br />
For more information about the use of the IO$_SETPRFPATH function, refer to<br />
the <strong>OpenVMS</strong> I/O User’s Reference Manual.<br />
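As a hedged illustration, using names that appear elsewhere in this chapter (HSC subsystem VOYGR1 and disk $1$DUA17) and assuming the /HOST and /FORCE qualifiers documented in the <strong>OpenVMS</strong> DCL Dictionary, a preferred path might be set as follows:<br />

```
$ ! Prefer the path through HSC subsystem VOYGR1 (names are illustrative)
$ SET PREFERRED_PATH $1$DUA17: /HOST=VOYGR1
$ ! Force an already mounted disk over to the preferred path
$ SET PREFERRED_PATH $1$DUA17: /HOST=VOYGR1 /FORCE
```

Verify the exact qualifier names against the DCL Dictionary for your <strong>OpenVMS</strong> version.<br />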
6.2 Naming <strong>OpenVMS</strong> <strong>Cluster</strong> Storage Devices<br />
In the <strong>OpenVMS</strong> operating system, a device name takes the form of ddcu, where:<br />
• dd represents the predefined code for the device type<br />
• c represents the predefined controller designation<br />
• u represents the unit number<br />
For CI or DSSI devices, the controller designation is always the letter A, and the<br />
unit number is selected by the system manager.<br />
For SCSI devices, the controller letter is assigned by <strong>OpenVMS</strong>, based on the<br />
system configuration. The unit number is determined by the SCSI bus ID and<br />
the logical unit number (LUN) of the device.<br />
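For example, the SCSI unit number is formed as (SCSI bus ID x 100) + LUN, a pattern visible in the device names of Figure 6–7:<br />

```
DKB300   ! SCSI disk on controller B: SCSI ID 3, LUN 0 (3 x 100 + 0 = 300)
```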
Because device names must be unique in an <strong>OpenVMS</strong> <strong>Cluster</strong>, and because<br />
every cluster member must use the same name for the same device, <strong>OpenVMS</strong><br />
adds a prefix to the device name, as follows:<br />
• If a device is attached to a single computer, the device name is extended to<br />
include the name of that computer:<br />
node$ddcu<br />
where node represents the SCS node name of the system on which the device<br />
resides.<br />
• If a device is attached to multiple computers, the node name part of the<br />
device name is replaced by a dollar sign and a number (called a node or port<br />
allocation class, depending on usage), as follows:<br />
$allocation-class$ddcu<br />
6.2.1 Allocation Classes<br />
The purpose of allocation classes is to provide unique and unchanging device<br />
names. The device name is used by the <strong>OpenVMS</strong> <strong>Cluster</strong> distributed lock<br />
manager in conjunction with <strong>OpenVMS</strong> facilities (such as RMS and the XQP) to<br />
uniquely identify shared devices, files, and data.<br />
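For example, combining the two name forms described above (the pairing of names here is illustrative):<br />

```
JUPITR$DUA17   ! a disk attached to a single computer, node JUPITR
$1$DUA17       ! a multipath disk named with node allocation class 1
```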
Allocation classes are required in <strong>OpenVMS</strong> <strong>Cluster</strong> configurations where storage<br />
devices are accessible through multiple paths. Without the use of allocation<br />
classes, device names that relied on node names would change as access paths to<br />
the devices change.<br />
Prior to <strong>OpenVMS</strong> Version 7.1, only one type of allocation class existed, which was<br />
node based. It was named allocation class. <strong>OpenVMS</strong> Version 7.1 introduced a<br />
second type, port allocation class, which is specific to a single interconnect and<br />
is assigned to all devices attached to that interconnect. Port allocation classes<br />
were originally designed for naming SCSI devices. Their use has been expanded
to include additional device types: floppy disks, PCI RAID controller disks, and<br />
IDE disks.<br />
The use of port allocation classes is optional. They are designed to solve the<br />
device-naming and configuration conflicts that can occur in certain configurations,<br />
as described in Section 6.2.3.<br />
To differentiate between the earlier node-based allocation class and the newer<br />
port allocation class, the term node allocation class was assigned to the earlier<br />
type.<br />
Prior to <strong>OpenVMS</strong> Version 7.2, all nodes with direct access to the same<br />
multipathed device were required to use the same nonzero value for the node<br />
allocation class. <strong>OpenVMS</strong> Version 7.2 introduced the MSCP_SERVE_ALL<br />
system parameter, which can be set to serve all disks or to exclude those whose<br />
node allocation class differs.<br />
Note<br />
If SCSI devices are connected to multiple hosts and if port allocation<br />
classes are not used, then all nodes with direct access to the same<br />
multipathed devices must use the same nonzero node allocation class.<br />
Multipathed MSCP controllers also have an allocation class parameter, which is<br />
set to match that of the connected nodes. (If the allocation class does not match,<br />
the devices attached to the nodes cannot be served.)<br />
6.2.2 Specifying Node Allocation Classes<br />
A node allocation class can be assigned to computers, HSC or HSJ controllers,<br />
and DSSI ISEs. The node allocation class is a numeric value from 1 to 255 that<br />
is assigned by the system manager.<br />
The default node allocation class value is 0. A node allocation class value of 0<br />
is appropriate only when serving a local, single-pathed disk. If a node allocation<br />
class of 0 is assigned, served devices are named using the node-name$device-name<br />
syntax, that is, the device name prefix reverts to the node name.<br />
The following rules apply to specifying node allocation class values:<br />
1. When serving satellites, the same nonzero node allocation class value must be<br />
assigned to the serving computers and controllers.<br />
2. All cluster-accessible devices on computers with a nonzero node allocation<br />
class value must have unique names throughout the cluster. For example, if<br />
two computers have the same node allocation class value, it is invalid for both<br />
computers to have a local disk named DJA0 or a tape named MUA0. This<br />
also applies to HSC and HSJ subsystems.<br />
System managers provide node allocation classes separately for disks and tapes.<br />
The node allocation class for disks and the node allocation class for tapes can be<br />
different.<br />
The node allocation class names are constructed as follows:<br />
$disk-allocation-class$device-name<br />
$tape-allocation-class$device-name<br />
Caution: Failure to set node allocation class values and device unit numbers<br />
correctly can endanger data integrity and cause locking conflicts that suspend<br />
normal cluster operations. Figure 6–4 shows an example of how cluster device<br />
names are specified in a CI configuration.<br />
Figure 6–4 Disk and Tape Dual Pathed Between HSC Controllers<br />
[Figure: Computers JUPITR and SATURN connect by CI through a star coupler to HSC subsystems VOYGR1 and VOYGR2, each configured with SET ALLOCATE DISK 1 and SET ALLOCATE TAPE 1. The dual-pathed disk and tape are named $1$DJA17 and $1$MUB6.]<br />
In this configuration:<br />
• The disk device name ($1$DJA17) and tape device name ($1$MUB6) are<br />
derived using the node allocation class of the two controllers.<br />
• Node allocation classes are not required for the two computers.<br />
• JUPITR and SATURN can access the disk or tape through either VOYGR1 or<br />
VOYGR2.<br />
• Note that the disk and tape allocation classes do not need to be the same.<br />
If one controller with node allocation class 1 is not available, users can gain<br />
access to a device specified by that node allocation class through the other<br />
controller.<br />
Figure 6–6 builds on Figure 6–4 by including satellite nodes that access devices<br />
$1$DUA17 and $1$MUA12 through the JUPITR and NEPTUN computers. In<br />
this configuration, the computers JUPITR and NEPTUN require node allocation<br />
classes so that the satellite nodes are able to use consistent device names<br />
regardless of the access path to the devices.<br />
Note: System management is usually simplified by using the same node<br />
allocation class value for all servers, HSC and HSJ subsystems, and DSSI<br />
ISEs; you can arbitrarily choose a number between 1 and 255. Note, however,<br />
that to change a node allocation class value, you must shut down and reboot the<br />
entire cluster (described in Section 8.6). If you use a common node allocation<br />
class for computers and controllers, ensure that all devices have unique unit<br />
numbers.
6.2.2.1 Assigning Node Allocation Class Values on Computers<br />
There are two ways to assign a node allocation class: by using CLUSTER_<br />
CONFIG.COM or CLUSTER_CONFIG_LAN.COM, which are described in<br />
Section 8.4, or by using AUTOGEN, as shown in the following table.<br />
Step Action<br />
1 Edit the root directory [SYSn.SYSEXE]MODPARAMS.DAT on each node that boots from<br />
the system disk. The following example shows a MODPARAMS.DAT file. The entries are<br />
hypothetical and should be regarded as examples, not as suggestions for specific parameter<br />
settings.<br />
!<br />
! Site-specific AUTOGEN data file. In an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
! where a common system disk is being used, this file<br />
! should reside in SYS$SPECIFIC:[SYSEXE], not a common<br />
! system directory.<br />
!<br />
! Add modifications that you want to make to AUTOGEN’s<br />
! hardware configuration data, system parameter<br />
! calculations, and page, swap, and dump file sizes<br />
! to the bottom of this file.<br />
SCSNODE="NODE01"<br />
SCSSYSTEMID=99999<br />
NISCS_LOAD_PEA0=1<br />
VAXCLUSTER=2<br />
MSCP_LOAD=1<br />
MSCP_SERVE_ALL=1<br />
ALLOCLASS=1<br />
TAPE_ALLOCLASS=1<br />
2 Invoke AUTOGEN to set the system parameter values:<br />
$ @SYS$UPDATE:AUTOGEN start-phase end-phase<br />
3 Shut down and reboot the entire cluster in order for the new values to take effect.<br />
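As a concrete instance of step 2 (verify the phase names against the AUTOGEN documentation for your version), the following runs every phase from data collection through an automatic reboot:<br />

```
$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT
```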
6.2.2.2 Assigning Node Allocation Class Values on HSC Subsystems<br />
Assign or change node allocation class values on HSC subsystems while the<br />
cluster is shut down. To assign a node allocation class to disks for an HSC<br />
subsystem, specify the value using the HSC console command in the following<br />
format:<br />
SET ALLOCATE DISK allocation-class-value<br />
To assign a node allocation class for tapes, enter a SET ALLOCATE TAPE<br />
command in the following format:<br />
SET ALLOCATE TAPE tape-allocation-class-value<br />
For example, to change the value of a node allocation class for disks to 1, set the<br />
HSC internal door switch to the Enable position and enter a command sequence<br />
like the following at the appropriate HSC consoles:<br />
Ctrl/C<br />
HSC> RUN SETSHO<br />
SETSHO> SET ALLOCATE DISK 1<br />
SETSHO> EXIT<br />
SETSHO-Q Rebooting HSC; Y to continue, Ctrl/Y to abort:? Y<br />
Restore the HSC internal door-switch setting.<br />
Reference: For complete information about the HSC console commands, refer to<br />
the HSC hardware documentation.<br />
6.2.2.3 Assigning Node Allocation Class Values on HSJ Subsystems<br />
To assign a node allocation class value for disks for an HSJ subsystem, enter a<br />
SET CONTROLLER MSCP_ALLOCATION_CLASS command in the following<br />
format:<br />
SET CONTROLLER MSCP_ALLOCATION_CLASS = allocation-class-value<br />
To assign a node allocation class value for tapes, enter a SET CONTROLLER<br />
TMSCP_ALLOCATION_CLASS command in the following format:<br />
SET CONTROLLER TMSCP_ALLOCATION_CLASS = allocation-class-value<br />
For example, to assign or change the node allocation class value for disks to 254<br />
on an HSJ subsystem, use the following command at the HSJ console prompt<br />
(PTMAN>):<br />
PTMAN> SET CONTROLLER MSCP_ALLOCATION_CLASS = 254<br />
6.2.2.4 Assigning Node Allocation Class Values on HSD Subsystems<br />
To assign or change allocation class values on any HSD subsystem, use the<br />
following commands:<br />
$ MC SYSMAN IO CONNECT FYA0:/NOADAP/DRIVER=SYS$FYDRIVER<br />
$ SET HOST/DUP/SERVER=MSCP$DUP/TASK=DIRECT node-name<br />
$ SET HOST/DUP/SERVER=MSCP$DUP/TASK=PARAMS node-name<br />
PARAMS> SET FORCEUNI 0<br />
PARAMS> SET ALLCLASS 143<br />
PARAMS> SET UNITNUM 900<br />
PARAMS> WRITE<br />
Changes require controller initialization, ok? [Y/(N)] Y<br />
PARAMS> EXIT<br />
$<br />
6.2.2.5 Assigning Node Allocation Class Values on DSSI ISEs<br />
To assign or change node allocation class values on any DSSI ISE, the commands<br />
you use differ depending on the operating system (Alpha or VAX).<br />
For example, to change the allocation class value to 1 for a DSSI ISE TRACER,<br />
follow the steps in Table 6–2.<br />
Table 6–2 Changing a DSSI Allocation Class Value<br />
Step Task<br />
1 Log into the SYSTEM account on the computer connected to the hardware device TRACER<br />
and load its driver as follows:<br />
• If the computer is an Alpha system, then enter the following command at the DCL<br />
prompt:<br />
$ MC SYSMAN IO CONN FYA0:/NOADAP/DRIVER=SYS$FYDRIVER<br />
• If the computer is a VAX system, then enter the following command at the DCL<br />
prompt:<br />
$ MCR SYSGEN CONN FYA0:/NOADAP/DRIVER=FYDRIVER<br />
2 At the DCL prompt, enter the SHOW DEVICE FY command to verify the presence and<br />
status of the FY device, as follows:<br />
$ SHOW DEVICE FY<br />
Device Name    Device Status    Error Count<br />
FYA0:          offline          0<br />
3 At the DCL prompt, enter the following command sequence to set the allocation class value<br />
to 1:<br />
$ SET HOST/DUP/SERVER=MSCP$DUP/TASK=PARAMS node-name<br />
PARAMS> SET ALLCLASS 1<br />
PARAMS> WRITE<br />
Changes require controller initialization, ok? [Y/(N)] Y<br />
Initializing...<br />
%HSCPAD-S-REMPGMEND, Remote program terminated--message number 3.<br />
%PAxx, Port has closed virtual circuit - remote node TRACER<br />
%HSCPAD-S-END, control returned to node node-name<br />
$<br />
4 Reboot the entire cluster in order for the new value to take effect.<br />
6.2.2.6 Node Allocation Class Example With a DSA Disk and Tape<br />
Figure 6–5 shows a DSA disk and tape that are dual pathed between two<br />
computers.<br />
Figure 6–5 Disk and Tape Dual Pathed Between Computers<br />
[Figure: Computers URANUS and NEPTUN, each with ALLOCLASS=1 and TAPE_ALLOCLASS=1, dual path the disk $1$DJA8 and tape $1$MUB6; the node-based names URANUS$DJA8, NEPTUN$DJA8, URANUS$MUB6, and NEPTUN$MUB6 are superseded by the allocation-class names. Satellites ARIEL and OBERON connect over the Ethernet.]<br />
In this configuration:<br />
• URANUS and NEPTUN access the disk either locally or through the other<br />
computer’s MSCP server.<br />
• When satellites ARIEL and OBERON access $1$DJA8, a path is made<br />
through either URANUS or NEPTUN.<br />
• If, for example, the node URANUS has been shut down, the satellites can<br />
access the devices through NEPTUN. When URANUS reboots, access is<br />
available through either URANUS or NEPTUN.<br />
6.2.2.7 Node Allocation Class Example With Mixed Interconnects<br />
Figure 6–6 shows how device names are typically specified in a mixed-interconnect<br />
cluster. This figure also shows how relevant system parameter<br />
values are set for each CI computer.<br />
Figure 6–6 Device Names in a Mixed-Interconnect <strong>Cluster</strong><br />
[Figure: Satellites EUROPA (with local disk EUROPA$DUA0), GANYMD, RHEA, TITAN, ARIEL, and OBERON connect over the Ethernet to the CI computers JUPITR, SATURN, URANUS, and NEPTUN, which provide access to devices such as $1$DUA1, $1$MUA8, $1$DJA8, $1$DUA17, $1$MUA12, $1$DUA3, and $1$MUA9. The CI computers connect through a star coupler to subsystems VOYGR1 and VOYGR2, each configured with SET ALLOCATE DISK 1 and SET ALLOCATE TAPE 1. System parameters shown for the CI computers include ALLOCLASS=1, MSCP_LOAD=1, MSCP_SERVE_ALL (set to 1 or 2), TMSCP_LOAD=1, and TAPE_ALLOCLASS=1.]<br />
In this configuration:<br />
• A disk and a tape are dual pathed to the HSC or HSJ subsystems named<br />
VOYGR1 and VOYGR2; these subsystems are connected to JUPITR,<br />
SATURN, URANUS and NEPTUN through the star coupler.<br />
• The MSCP and TMSCP servers are loaded on JUPITR and NEPTUN<br />
(MSCP_LOAD = 1, TMSCP_LOAD = 1) and the ALLOCLASS and TAPE_<br />
ALLOCLASS parameters are set to the same value (1) on these computers<br />
and on both HSC or HSJ subsystems.<br />
Note: For optimal availability, two or more CI connected computers should serve<br />
HSC or HSJ devices to the cluster.
6.2.2.8 Node Allocation Classes and VAX 6000 Tapes<br />
You must ensure that any tape drive is identified by a unique name that includes<br />
a tape allocation class so that naming conflicts do not occur.<br />
Avoiding Duplicate Names<br />
Duplicate names are more probable with VAX 6000 computers because TK console<br />
tape drives (located in the VAX 6000 cabinet) are usually named either MUA6<br />
or MUB6. Thus, when you configure a VAXcluster system with more than one<br />
VAX 6000 computer, multiple TK console tape drives are likely to have the same<br />
name.<br />
Specifying a Tape Allocation Class<br />
To ensure that the TK console tape drives have names that are unique across the<br />
cluster, specify a tape allocation class as a numeric value from 0 to 255,<br />
followed by the device name, as follows:<br />
$tape-allocation-class$device-name<br />
Example:<br />
Assume that $1$MUA6, $1$MUB6, and $2$MUA6 are all unique device names. The<br />
first two have the same tape allocation class but have different controller letters<br />
(A and B, respectively). The third device has a different tape allocation class than<br />
the first two.<br />
Ensuring a Unique Access Path<br />
Consider the methods described in Table 6–3 to ensure a unique access path to<br />
VAX 6000 TK console tape drives.<br />
Table 6–3 Ensuring Unique Tape Access Paths<br />
Method: Set the TK console tape unit number to a unique value on each VAX 6000 system.<br />
Description: The unit number of the TK console drives is controlled by the BI bus unit number plug of the TBK70 controller in the VAX 6000 BI backplane. A Compaq services technician should change the unit number so that it is unique from all other controller cards in the BI backplane. The unit numbers available are in the range of 0 to 15 (the default value is 6).<br />
Comments: For VAXcluster systems in which tapes must be TMSCP served across the cluster, the tape controller letter and unit number of these tape drives must be unique clusterwide and must conform to the cluster device-naming conventions. If controller letters and unit numbers are unique clusterwide, the TAPE_ALLOCLASS system parameter can be set to the same value on multiple VAX 6000 systems.<br />
Method: For VAXcluster systems configured with two or more VAX 6000 computers, set up the console tapes with different controller letters. If your VAXcluster configuration contains only two VAX 6000 computers, contact a Compaq services technician to move the TBK70 controller card to another BI backplane within the same VAX computer.<br />
Comments: Moving the controller card changes the controller letter of the tape drive without changing the unit number (for example, MUA6 becomes MUB6). Note: The tape drives can have the same unit number.<br />
6.2.2.9 Node Allocation Classes and RAID Array 210 and 230 Devices<br />
If you have RAID devices connected to StorageWorks RAID Array 210 or 230<br />
subsystems, you might experience device-naming problems when running in a<br />
cluster environment if nonzero node allocation classes are used. In this case, the<br />
RAID devices will be named $n$DRcu, where n is the (nonzero) node allocation<br />
class, c is the controller letter, and u is the unit number.<br />
If multiple nodes in the cluster have the same (nonzero) node allocation class<br />
and these same nodes have RAID controllers, then RAID devices that are distinct<br />
might be given the same name (for example, $1$DRA0). This problem can lead to<br />
data corruption.<br />
To prevent such problems, use the DR_UNIT_BASE system parameter, which<br />
causes the DR devices to be numbered sequentially, starting with the DR_UNIT_<br />
BASE value that you specify. For example, if the node allocation class is 1, the<br />
controller letter is A, and you set DR_UNIT_BASE on one cluster member to<br />
10, the first device name generated by the RAID controller will be $1$DRA10,<br />
followed by $1$DRA11, $1$DRA12, and so forth.<br />
To ensure unique DR device names, set the DR_UNIT_BASE number on each<br />
cluster member so that the resulting device numbers do not overlap. For<br />
example, you can set DR_UNIT_BASE on three cluster members to 10, 20,<br />
and 30 respectively. As long as each cluster member has 10 or fewer devices, the<br />
DR device numbers will be unique.<br />
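Following the MODPARAMS.DAT style shown in Section 6.2.2.1, the parameter can be set on each member; the values below are illustrative:<br />

```
! In MODPARAMS.DAT on the first cluster member:
DR_UNIT_BASE=10
! Use 20 and 30 on the second and third members, then run
! AUTOGEN and reboot so the new values take effect.
```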
6.2.3 Reasons for Using Port Allocation Classes<br />
When the node allocation class is nonzero, it becomes the device name prefix for<br />
all attached devices, whether the devices are on a shared interconnect or not. To<br />
ensure unique names within a cluster, it is necessary for the ddcu part of the disk<br />
device name (for example, DKB0) to be unique within an allocation class, even if<br />
the device is on a private bus.<br />
This constraint is relatively easy to overcome for DIGITAL Storage Architecture<br />
(DSA) devices, because a system manager can select from a large unit number<br />
space to ensure uniqueness. The constraint is more difficult to manage for other<br />
device types, such as SCSI devices whose controller letter and unit number are<br />
determined by the hardware configuration.<br />
For example, in the configuration shown in Figure 6–7, each system has a private<br />
SCSI bus with adapter letter A. To obtain unique names, the unit numbers must<br />
be different. This constrains the configuration to a maximum of 8 devices on the<br />
two buses (or 16 if wide addressing can be used on one or more of the buses). This<br />
can result in empty StorageWorks drive bays and in a reduction of the system’s<br />
maximum storage capacity.<br />
Figure 6–7 SCSI Device Names Using a Node Allocation Class<br />
[Figure: Two hosts, each with node allocation class 4, share a SCSI bus through adapters named PKB (SCSI IDs 7 and 6) carrying disks $4$DKB0, $4$DKB100, and $4$DKB200 (SCSI IDs 0, 1, and 2). Each host also has a private SCSI bus on adapter PKA carrying $4$DKA200 and $4$DKA300 (SCSI IDs 2 and 3).]<br />
6.2.3.1 Constraint of the SCSI Controller Letter in Device Names<br />
The SCSI device name is determined in part by the SCSI controller through<br />
which the device is accessed (for example, B in DKBn). Therefore, to ensure that<br />
each node uses the same name for each device, all SCSI controllers attached to<br />
a shared SCSI bus must have the same <strong>OpenVMS</strong> device name. In Figure 6–7,<br />
each host is attached to the shared SCSI bus by controller PKB.<br />
This requirement can make configuring a shared SCSI bus difficult, because a<br />
system manager has little or no control over the assignment of SCSI controller<br />
device names. It is particularly difficult to match controller letters on different<br />
system types when one or more of the systems have:<br />
• Built-in SCSI controllers that are not supported in SCSI clusters<br />
• Long internal cables that make some controllers inappropriate for SCSI<br />
clusters<br />
6.2.3.2 Constraints Removed by Port Allocation Classes<br />
The port allocation class feature has two major benefits:<br />
• A system manager can specify an allocation class value that is specific to a<br />
port rather than nodewide.<br />
• When a port has a nonzero port allocation class, the controller letter in the<br />
device name that is accessed through that port is always the letter A.<br />
Using port allocation classes for naming SCSI, IDE, floppy disk, and PCI<br />
RAID controller devices removes the configuration constraints described in<br />
Section 6.2.2.9, in Section 6.2.3, and in Section 6.2.3.1. You do not need to<br />
use the DR_UNIT_BASE system parameter recommended in Section 6.2.2.9.<br />
Furthermore, each bus can be given its own unique allocation class value, so<br />
the ddcu part of the disk device name (for example, DKB0) does not need to be<br />
unique across buses. Moreover, controllers with different device names can be<br />
attached to the same bus, because the disk device names no longer depend on the<br />
controller letter.<br />
Figure 6–8 shows the same configuration as Figure 6–7, with two additions:<br />
a host named CHUCK and an additional disk attached to the lower left SCSI<br />
bus. Port allocation classes are used in the device names in this figure. A port<br />
allocation class of 116 is used for the SCSI interconnect that is shared, and<br />
port allocation class 0 is used for the SCSI interconnects that are not shared.<br />
By using port allocation classes in this configuration, you can do what was not<br />
allowed previously:<br />
• Attach an adapter with a name (PKA) that differs from the name of the other<br />
adapters (PKB) attached to the shared SCSI interconnect, as long as that port<br />
has the same port allocation class (116 in this example).<br />
• Use two disks with the same controller name and number (DKA300) because<br />
each disk is attached to a SCSI interconnect that is not shared.<br />
Figure 6–8 Device Names Using Port Allocation Classes<br />
(Figure 6–8 is not reproduced here. It shows three hosts: ABLE, BAKER, and<br />
CHUCK. The shared SCSI bus has port allocation class 116, and the disks on<br />
it are named $116$DKA0, $116$DKA100, and $116$DKA200. CHUCK attaches<br />
to the shared bus through an adapter named PKA, while ABLE and BAKER<br />
attach through adapters named PKB. The SCSI buses that are not shared use<br />
port allocation class 0; the disks on those buses are named ABLE$DKA200,<br />
ABLE$DKA300, and BAKER$DKA300.)<br />
6.2.4 Specifying Port Allocation Classes<br />
A port allocation class is a designation for all ports attached to a single<br />
interconnect. It replaces the node allocation class in the device name.<br />
The three types of port allocation classes are:<br />
• Port allocation classes of 1 to 32767 for devices attached to a multihost<br />
interconnect or a single-host interconnect, if desired<br />
• Port allocation class 0 for devices attached to a single-host interconnect<br />
• Port allocation class -1 when no port allocation class is in effect<br />
Each type has its own naming rules.<br />
6.2.4.1 Port Allocation Classes for Devices Attached to a Multi-Host Interconnect<br />
The following rules pertain to port allocation classes for devices attached to a<br />
multihost interconnect:<br />
1. The valid range of port allocation classes is 1 through 32767.<br />
2. When using port allocation classes, the controller letter in the device name is<br />
always A, regardless of the actual controller letter. The $GETDVI item code<br />
DVI$_DISPLAY_DEVNAM displays the actual port name.<br />
Note that it is now more important to use fully specified names (for example,<br />
$101$DKA100 or ABLE$DKA100) rather than abbreviated names (such as<br />
DK100), because a system can have multiple DKA100 disks.<br />
3. Each port allocation class must be unique within a cluster.<br />
4. A port allocation class cannot duplicate the value of another node’s tape or<br />
disk node allocation class.<br />
5. Each node for which MSCP serves a device should have the same nonzero<br />
allocation class value.<br />
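The actual port name behind a device name of this form can also be examined<br />
from DCL with the F$GETDVI lexical function, the lexical counterpart of the<br />
$GETDVI item code mentioned in rule 2. The following is a sketch; it assumes<br />
a device named $101$DKA100 exists and that your version of F$GETDVI<br />
accepts the DISPLAY_DEVNAM item:<br />
$ WRITE SYS$OUTPUT F$GETDVI("$101$DKA100","DISPLAY_DEVNAM")<br />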
Examples of device names that use this type of port allocation class are shown in<br />
Table 6–4.<br />
Table 6–4 Examples of Device Names with Port Allocation Classes 1-32767<br />
$101$DKA0: The port allocation class is 101; DK represents the disk device<br />
category, A is the controller name, and 0 is the unit number.<br />
$147$DKA0: The port allocation class is 147; DK represents the disk device<br />
category, A is the controller name, and 0 is the unit number.<br />
6.2.4.2 Port Allocation Class 0 for Devices Attached to a Single-Host Interconnect<br />
The following rules pertain to port allocation class 0 for devices attached to a<br />
single-host interconnect:<br />
1. Port allocation class 0 does not become part of the device name. Instead, the<br />
name of the node to which the device is attached becomes the first part of the<br />
device name.<br />
2. The controller letter in the device name remains the designation of the<br />
controller to which the device is attached. (It is not changed to A as it is for<br />
port allocation classes greater than zero.)<br />
Examples of device names that use port allocation class 0 are shown in<br />
Table 6–5.<br />
Table 6–5 Examples of Device Names With Port Allocation Class 0<br />
ABLE$DKD100: ABLE is the name of the node to which the device is attached.<br />
D is the designation of the controller to which it is attached, not A as it is for<br />
nonzero port allocation classes. The unit number of this device is 100. The port<br />
allocation class of $0$ is not included in the device name.<br />
BAKER$DKC200: BAKER is the name of the node to which the device is<br />
attached, C is the designation of the controller to which it is attached, and 200<br />
is the unit number. The port allocation class of $0$ is not included in the device<br />
name.<br />
6.2.4.3 Port Allocation Class -1<br />
The designation of port allocation class -1 means that a port allocation class is<br />
not being used. Instead, a node allocation class is used. The controller letter<br />
remains its predefined designation. (It is assigned by <strong>OpenVMS</strong>, based on the<br />
system configuration. It is not affected by a node allocation class.)<br />
6.2.4.4 How to Implement Port Allocation Classes<br />
Port allocation classes were introduced in <strong>OpenVMS</strong> Alpha Version 7.1, with<br />
support in <strong>OpenVMS</strong> VAX as well: VAX computers can serve disks connected<br />
to Alpha systems that use port allocation classes in their names.<br />
To implement port allocation classes, you must do the following:<br />
• Enable the use of port allocation classes.<br />
• Assign one or more port allocation classes.<br />
• At a minimum, reboot the nodes on the shared SCSI bus.<br />
Enabling the Use of Port Allocation Classes<br />
To enable the use of port allocation classes, you must set a new SYSGEN<br />
parameter DEVICE_NAMING to 1. The default setting for this parameter is<br />
zero. In addition, the SCSSYSTEMIDH system parameter must be set to zero.<br />
Check to make sure that it is.<br />
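For example, you might verify and set these parameters in a conversational<br />
SYSGEN session. The dialog below is a sketch; to make the change permanent<br />
across AUTOGEN runs, also record it in MODPARAMS.DAT:<br />
$ RUN SYS$SYSTEM:SYSGEN<br />
SYSGEN> USE CURRENT<br />
SYSGEN> SHOW SCSSYSTEMIDH<br />
SYSGEN> SET DEVICE_NAMING 1<br />
SYSGEN> WRITE CURRENT<br />
SYSGEN> EXIT<br />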
Assigning Port Allocation Classes<br />
You can assign one or more port allocation classes with the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
configuration procedure, CLUSTER_CONFIG.COM (or CLUSTER_CONFIG_<br />
LAN.COM).<br />
If it is not possible to use CLUSTER_CONFIG.COM or CLUSTER_CONFIG_<br />
LAN.COM to assign port allocation classes (for example, if you are booting a<br />
private system disk into an existing cluster), you can use the new SYSBOOT<br />
SET/CLASS command.<br />
The following example shows how to use the new SYSBOOT SET/CLASS<br />
command to assign an existing port allocation class of 152 to port PKB.<br />
SYSBOOT> SET/CLASS PKB 152<br />
The SYSINIT process ensures that this new name is used in successive boots.
To deassign a port allocation class, enter the port name without a class number.<br />
For example:<br />
SYSBOOT> SET/CLASS PKB<br />
The mapping of ports to allocation classes is stored in<br />
SYS$SYSTEM:SYS$DEVICES.DAT, a standard text file. You use the CLUSTER_<br />
CONFIG.COM (or CLUSTER_CONFIG_LAN.COM) command procedure or, in<br />
special cases, SYSBOOT to change SYS$DEVICES.DAT.<br />
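As an illustration only (the exact layout of SYS$DEVICES.DAT on your system<br />
may differ), an entry assigning port allocation class 152 to port PKB on a node<br />
named ABLE might resemble:<br />
[Port ABLE$PKB]<br />
Allocation Class = 152<br />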
6.2.4.5 <strong>Cluster</strong>wide Reboot Requirements for SCSI Interconnects<br />
Changing a device’s allocation class changes the device name. A clusterwide<br />
reboot ensures that all nodes see the device under its new name, which in turn<br />
means that the normal device and file locks remain consistent.<br />
Rebooting an entire cluster when a device name changes is not mandatory. You<br />
may be able to reboot only the nodes that share the SCSI bus, as described in the<br />
following steps. The conditions under which you can do this and the results that<br />
follow are also described.<br />
1. Dismount the devices whose names have changed from all nodes.<br />
This is not always possible. In particular, you cannot dismount a disk on<br />
nodes where it is the system disk. If the disk is not dismounted, a subsequent<br />
attempt to mount the same disk using the new device name will fail with the<br />
following error:<br />
%MOUNT-F-VOLALRMNT, another volume of same label already mounted<br />
Therefore, you must reboot any node that cannot dismount the disk.<br />
2. Reboot all nodes connected to the SCSI bus.<br />
Before you reboot any of these nodes, make sure the disks on the SCSI bus<br />
are dismounted on the nodes not rebooting.<br />
Note<br />
<strong>OpenVMS</strong> ensures that a node cannot boot if booting would result in a<br />
SCSI bus whose device naming differs from that used by another node already<br />
accessing the same bus. (This check is independent of the dismount check in<br />
step 1.)<br />
After the nodes that are connected to the SCSI bus reboot, the device exists<br />
with its new name.<br />
3. Mount the devices systemwide or clusterwide.<br />
If no other node has the disk mounted under the old name, you can mount the<br />
disk systemwide or clusterwide using its new name. The new device name<br />
will be seen on all nodes running compatible software, and these nodes can<br />
also mount the disk and access it normally.<br />
Nodes that have not rebooted still see the old device name as well as the new<br />
device name. However, the old device name cannot be used; the device, when<br />
accessed by the old name, is off line. The old name persists until the node<br />
reboots.<br />
6.3 MSCP and TMSCP Served Disks and Tapes<br />
The MSCP server and the TMSCP server make locally connected disks and tapes<br />
available to all cluster members. Locally connected disks and tapes are not<br />
automatically cluster accessible. Access to these devices is restricted to the local<br />
computer unless you explicitly set them up as cluster accessible using the MSCP<br />
server for disks or the TMSCP server for tapes.<br />
6.3.1 Enabling Servers<br />
To make a disk or tape accessible to all <strong>OpenVMS</strong> <strong>Cluster</strong> computers, the MSCP<br />
or TMSCP server must be:<br />
• Loaded on the local computer, as described in Table 6–6<br />
• Made functional by setting the MSCP and TMSCP system parameters, as<br />
described in Table 6–7<br />
Table 6–6 MSCP_LOAD and TMSCP_LOAD Parameter Settings<br />
MSCP_LOAD<br />
0: Do not load the MSCP server. This is the default.<br />
1: Load the MSCP server with attributes specified by the MSCP_SERVE_ALL<br />
parameter, using the default CPU load capacity.<br />
>1: Load the MSCP server with attributes specified by the MSCP_SERVE_ALL<br />
parameter. Use the MSCP_LOAD value as the CPU load capacity.<br />
TMSCP_LOAD<br />
0: Do not load the TMSCP server and do not serve any tapes (default value).<br />
1: Load the TMSCP server and serve all available tapes, including all local tapes<br />
and all multihost tapes with a matching TAPE_ALLOCLASS value.<br />
Table 6–7 summarizes the system parameter values you can specify for MSCP_<br />
SERVE_ALL and TMSCP_SERVE_ALL to configure the MSCP and TMSCP<br />
servers. Initial values are determined by your responses when you execute<br />
the installation or upgrade procedure or when you execute the CLUSTER_<br />
CONFIG.COM command procedure described in Chapter 8 to set up your<br />
configuration.<br />
Starting with <strong>OpenVMS</strong> Version 7.2, the serving types are implemented as a bit<br />
mask. To specify the type of serving your system will perform, locate the type<br />
you want in Table 6–7 and specify its value. For some systems, you may want<br />
to specify two serving types, such as serving the system disk and serving locally<br />
attached disks. To specify such a combination, add the values of each type, and<br />
specify the sum.<br />
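For example, to serve the system disk (bit 2, value 4) together with locally<br />
attached disks (bit 1, value 2), specify the sum of the two values, 6, typically in<br />
the computer's MODPARAMS.DAT file (a sketch, with serving enabled):<br />
MSCP_LOAD = 1<br />
MSCP_SERVE_ALL = 6<br />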
Note<br />
In a mixed-version cluster that includes any systems running <strong>OpenVMS</strong><br />
Version 7.1-x or earlier, serving all available disks is restricted to serving<br />
all disks whose allocation class matches the system’s node allocation class<br />
(pre-Version 7.2 meaning). To specify this type of serving, use the value 9<br />
(which sets bit 0 and bit 3).
Table 6–7 MSCP_SERVE_ALL and TMSCP_SERVE_ALL Parameter Settings<br />
MSCP_SERVE_ALL<br />
Bit 0 (value 1 when set): Serve all available disks (locally attached and those<br />
connected to HSx and DSSI controllers). Disks with allocation classes that<br />
differ from the system’s allocation class (set by the ALLOCLASS parameter)<br />
are also served if bit 3 is not set.<br />
Bit 1 (value 2 when set): Serve locally attached (non-HSx and non-DSSI) disks.<br />
The server does not monitor its I/O traffic and does not participate in load<br />
balancing.<br />
Bit 2 (value 4 when set): Serve the system disk. This is the default setting. This<br />
setting is important when other nodes in the cluster rely on this system being<br />
able to serve its system disk. This setting prevents obscure contention problems<br />
that can occur when a system attempts to complete I/O to a remote system disk<br />
whose system has failed.<br />
Bit 3 (value 8 when set): Restrict the serving specified by bit 0. All disks<br />
except those with allocation classes that differ from the system’s allocation<br />
class (set by the ALLOCLASS parameter) are served.<br />
This is pre-Version 7.2 behavior. If your cluster includes systems running<br />
<strong>OpenVMS</strong> Version 7.1-x or earlier, and you want to serve all available disks,<br />
you must specify 9, the result of setting this bit and bit 0.<br />
TMSCP_SERVE_ALL<br />
Bit 0 (value 1 when set): Serve all available tapes (locally attached and those<br />
connected to HSx and DSSI controllers). Tapes with allocation classes that<br />
differ from the system’s allocation class (set by the ALLOCLASS parameter)<br />
are also served if bit 3 is not set.<br />
Bit 1 (value 2 when set): Serve locally attached (non-HSx and non-DSSI) tapes.<br />
Bit 3 (value 8 when set): Restrict the serving specified by bit 0. Serve all tapes<br />
except those with allocation classes that differ from the system’s allocation<br />
class (set by the ALLOCLASS parameter).<br />
This is pre-Version 7.2 behavior. If your cluster includes systems running<br />
<strong>OpenVMS</strong> Version 7.1-x or earlier, and you want to serve all available tapes,<br />
you must specify 9, the result of setting this bit and bit 0.<br />
Although the serving types are now implemented as a bit mask, the values of 0,<br />
1, and 2, specified by bit 0 and bit 1, retain their original meanings. These values<br />
are shown in the following table:<br />
Value Description<br />
0 Do not serve any disks (tapes). This is the default.<br />
1 Serve all available disks (tapes).<br />
2 Serve only locally attached (non-HSx and non-DSSI) disks (tapes).<br />
6.3.1.1 Serving the System Disk<br />
Setting bit 2 to serve the system disk is important when other nodes in the<br />
cluster rely on this system being able to serve its system disk. This setting<br />
prevents obscure contention problems that can occur when a system attempts to<br />
complete I/O to a remote system disk whose system has failed.<br />
The following sequence of events describes how a contention problem can occur if<br />
serving the system disk is disabled (that is, if bit 2 is not set):<br />
• The MSCP_SERVE_ALL setting is changed to disable serving when the<br />
system reboots.<br />
• The serving system crashes.<br />
• The client system that was executing I/O to the serving system’s system disk<br />
is holding locks on resources of that system disk.<br />
• The client system starts mount verification.<br />
• The serving system attempts to boot but cannot because of the locks held on<br />
its system disk by the client system.<br />
• The client’s mount verification process times out after a period of time set by<br />
the MVTIMEOUT system parameter, and the client system releases the locks.<br />
The time period could be several hours.<br />
• The serving system is able to reboot.<br />
6.3.1.2 Setting the MSCP and TMSCP System Parameters<br />
Use either of the following methods to set these system parameters:<br />
• Specify appropriate values for these parameters in a computer’s<br />
MODPARAMS.DAT file and then run AUTOGEN.<br />
• Run the CLUSTER_CONFIG.COM or the CLUSTER_CONFIG_LAN.COM<br />
procedure, as appropriate, and choose the CHANGE option to perform these<br />
operations for disks and tapes.<br />
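The first method can be sketched as follows. Add lines such as these (the values<br />
are illustrative) to the computer's MODPARAMS.DAT file:<br />
MSCP_LOAD = 1<br />
MSCP_SERVE_ALL = 4<br />
TMSCP_LOAD = 1<br />
TMSCP_SERVE_ALL = 1<br />
Then run AUTOGEN, for example:<br />
$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT<br />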
With either method, the served devices become accessible when the serving<br />
computer reboots. Further, the servers automatically serve any suitable device<br />
that is added to the system later. For example, if new drives are attached to an<br />
HSC subsystem, the devices are dynamically configured.<br />
Note: The SCSI retention command modifier is not supported by the TMSCP<br />
server. Retention operations should be performed from the node serving the tape.<br />
6.4 MSCP I/O Load Balancing<br />
MSCP I/O load balancing offers the following advantages:<br />
• Faster I/O response<br />
• Balanced work load among the members of an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
Two types of MSCP I/O load balancing are provided by <strong>OpenVMS</strong> <strong>Cluster</strong><br />
software: static and dynamic. Static load balancing occurs on both VAX and<br />
Alpha systems; dynamic load balancing occurs only on VAX systems. Both types<br />
of load balancing are based on the load capacity ratings of the server systems.
6.4.1 Load Capacity<br />
The load capacity ratings for the VAX and Alpha systems are predetermined<br />
by Compaq. These ratings are used in the calculation of the available serving<br />
capacity for MSCP static and dynamic load balancing. You can override these<br />
default settings by specifying a different load capacity with the MSCP_LOAD<br />
parameter.<br />
Note that the MSCP server load-capacity values (either the default value or the<br />
value you specify with MSCP_LOAD) are estimates used by the load-balancing<br />
feature. They cannot change the actual MSCP serving capacity of a system.<br />
A system’s MSCP serving capacity depends on many factors including its power,<br />
the performance of its LAN adapter, and the impact of other processing loads.<br />
The available serving capacity, which is calculated by each MSCP server as<br />
described in Section 6.4.3, is used solely to bias the selection process when a<br />
client system (for example, a satellite) chooses which server system to use when<br />
accessing a served disk.<br />
6.4.2 Increasing the Load Capacity When FDDI is Used<br />
When FDDI is used instead of Ethernet, the throughput is far greater. To take<br />
advantage of this greater throughput, Compaq recommends that you change the<br />
server’s load-capacity default setting with the MSCP_LOAD parameter. Start<br />
with a multiplier of four. For example, the load-capacity rating of any Alpha<br />
system connected by FDDI to a disk can be set to 1360 I/Os per second (4 x 340).<br />
Depending on your configuration and the software you are running, you may<br />
want to increase or decrease this value.<br />
6.4.3 Available Serving Capacity<br />
The load-capacity ratings are used by each MSCP server to calculate its available<br />
serving capacity.<br />
The available serving capacity is calculated in the following way:<br />
1. Each MSCP server counts the read and write requests sent to it and<br />
periodically converts this value to requests per second.<br />
2. Each MSCP server subtracts its requests per second from its load capacity to<br />
compute its available serving capacity.<br />
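For example (the numbers are illustrative), a server whose load capacity is 340<br />
requests per second and that is currently handling 100 requests per second<br />
reports an available serving capacity of 340 - 100 = 240 requests per second.<br />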
6.4.4 Static Load Balancing<br />
MSCP servers periodically send their available serving capacities to the MSCP<br />
class driver (DUDRIVER). When a disk is mounted or one fails over, DUDRIVER<br />
assigns the server with the highest available serving capacity to it. (TMSCP<br />
servers do not perform this monitoring function.) This initial assignment is called<br />
static load balancing.<br />
6.4.5 Dynamic Load Balancing (VAX Only)<br />
Dynamic load balancing occurs only on VAX systems. MSCP server activity is<br />
checked every 5 seconds. If activity to any server is excessive, the serving load<br />
automatically shifts to other servers in the cluster.<br />
6.4.6 Overriding MSCP I/O Load Balancing for Special Purposes<br />
In some configurations, you may want to designate one or more systems in your<br />
cluster as the primary I/O servers and restrict I/O traffic on other systems. You<br />
can accomplish these goals by overriding the default load-capacity ratings used by<br />
the MSCP server. For example, if your cluster consists of two Alpha systems and<br />
one VAX 6000-400 system and you want to reduce the MSCP served I/O traffic<br />
to the VAX, you can assign a low MSCP_LOAD value, such as 50, to the VAX.<br />
Because the two Alpha systems each start with a load-capacity rating of 340 and<br />
the VAX now starts with a load-capacity rating of 50, the MSCP served satellites<br />
will direct most of the I/O traffic to the Alpha systems.<br />
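In this example, you could make the setting permanent by adding the following<br />
line to the VAX system's MODPARAMS.DAT file and then running AUTOGEN<br />
(per Table 6–6, an MSCP_LOAD value greater than 1 loads the server and is<br />
used as its CPU load capacity):<br />
MSCP_LOAD = 50<br />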
6.5 Managing <strong>Cluster</strong> Disks With the Mount Utility<br />
For locally connected disks to be accessible to other nodes in the cluster, the<br />
MSCP server software must be loaded on the computer to which the disks are<br />
connected (see Section 6.3.1). Further, each disk must be mounted with the<br />
Mount utility, using the appropriate qualifier: /CLUSTER, /SYSTEM, or /GROUP.<br />
Mounting multiple disks can be automated with command procedures; a sample<br />
command procedure, MSCPMOUNT.COM, is provided in the SYS$EXAMPLES<br />
directory on your system.<br />
The Mount utility also provides other qualifiers that determine whether a<br />
disk is automatically rebuilt during a remount operation. Different rebuilding<br />
techniques are recommended for data and system disks.<br />
This section describes how to use the Mount utility for these purposes.<br />
6.5.1 Mounting <strong>Cluster</strong> Disks<br />
To mount disks that are to be shared among all computers, specify the MOUNT<br />
command as shown in the following table.<br />
At system startup:<br />
IF the disk is attached to a single system and is to be made available to all<br />
other nodes in the cluster, THEN use MOUNT/CLUSTER device-name on the<br />
computer to which the disk is to be mounted. The disk is mounted on every<br />
computer that is active in the cluster at the time the command executes. First,<br />
the disk is mounted locally. Then, if the mount operation succeeds, the disk is<br />
mounted on other nodes in the cluster.<br />
IF the computer has no disks directly attached to it, THEN use<br />
MOUNT/SYSTEM device-name on the computer for each disk the computer<br />
needs to access. The disks can be attached to a single system or can be shared<br />
disks that are accessed by an HSx controller. Then, if the mount operation<br />
succeeds, the disk is mounted on the computer joining the cluster.<br />
When the system is running:<br />
IF you want to add a disk, THEN use MOUNT/CLUSTER device-name on the<br />
computer to which the disk is to be mounted. The disk is mounted on every<br />
computer that is active in the cluster at the time the command executes. First,<br />
the disk is mounted locally. Then, if the mount operation succeeds, the disk is<br />
mounted on other nodes in the cluster.<br />
To ensure disks are mounted whenever possible, regardless of the sequence that<br />
systems in the cluster boot (or shut down), startup command procedures should
use MOUNT/CLUSTER and MOUNT/SYSTEM as described in the preceding<br />
table.<br />
Note: Only system or group disks can be mounted across the cluster or on a<br />
subset of the cluster members. If you specify MOUNT/CLUSTER without the<br />
/SYSTEM or /GROUP qualifier, /SYSTEM is assumed. Also note that each cluster<br />
disk mounted with the /SYSTEM or /GROUP qualifier must have a unique<br />
volume label.<br />
6.5.2 Examples of Mounting Shared Disks<br />
Suppose you want all the computers in a three-member cluster to share a disk<br />
named COMPANYDOCS. To share the disk, one of the three computers can<br />
mount COMPANYDOCS using the MOUNT/CLUSTER command, as follows:<br />
$ MOUNT/CLUSTER/NOASSIST $1$DUA4: COMPANYDOCS<br />
If you want just two of the three computers to share the disk, those two<br />
computers must both mount the disk with the same MOUNT command, as<br />
follows:<br />
$ MOUNT/SYSTEM/NOASSIST $1$DUA4: COMPANYDOCS<br />
To mount the disk at startup time, include the MOUNT command either in a<br />
common command procedure that is invoked at startup time or in the<br />
computer-specific startup command file.<br />
Note: The /NOASSIST qualifier is used in command procedures that are designed<br />
to make several attempts to mount disks. The disks may be temporarily offline or<br />
otherwise not available for mounting. If, after several attempts, the disk cannot<br />
be mounted, the procedure continues. The /ASSIST qualifier, which is the default,<br />
causes a command procedure to stop and query the operator if a disk cannot be<br />
mounted immediately.<br />
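A minimal sketch of such a retry loop in DCL (the device name, volume label,<br />
retry count, and wait interval are all illustrative):<br />
$ SET NOON<br />
$ COUNT = 0<br />
$RETRY:<br />
$ MOUNT/SYSTEM/NOASSIST $1$DUA4: COMPANYDOCS<br />
$ IF $STATUS THEN GOTO DONE<br />
$ COUNT = COUNT + 1<br />
$ IF COUNT .LT. 5 THEN WAIT 00:00:30<br />
$ IF COUNT .LT. 5 THEN GOTO RETRY<br />
$DONE:<br />
$ EXIT<br />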
6.5.3 Mounting <strong>Cluster</strong> Disks With Command Procedures<br />
To configure cluster disks, you can create command procedures to mount them.<br />
You may want to include commands that mount cluster disks in a separate<br />
command procedure file that is invoked by a site-specific SYSTARTUP procedure.<br />
Depending on your cluster environment, you can set up your command procedure<br />
in either of the following ways:<br />
• As a separate file specific to each computer in the cluster by making copies of<br />
the common procedure and storing them as separate files<br />
• As a common computer-independent file on a shared disk<br />
With either method, each computer can invoke the common procedure from the<br />
site-specific SYSTARTUP procedure.<br />
Example: The MSCPMOUNT.COM file in the SYS$EXAMPLES directory on<br />
your system is a sample command procedure that contains commands typically<br />
used to mount cluster disks. The example includes comments explaining each<br />
phase of the procedure.<br />
6.5.4 Disk Rebuild Operation<br />
To minimize disk I/O operations (and thus improve performance) when files are<br />
created or extended, the <strong>OpenVMS</strong> file system maintains a cache of preallocated<br />
file headers and disk blocks.<br />
If a disk is dismounted improperly—for example, if a system fails or is<br />
removed from a cluster without running SYS$SYSTEM:SHUTDOWN.COM—<br />
this preallocated space becomes temporarily unavailable. When the disk is<br />
remounted, MOUNT scans the disk to recover the space. This is called a disk<br />
rebuild operation.<br />
6.5.5 Rebuilding <strong>Cluster</strong> Disks<br />
On a nonclustered computer, the MOUNT scan operation for recovering<br />
preallocated space merely prolongs the boot process. In an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
system, however, this operation can degrade response time for all user processes<br />
in the cluster. While the scan is in progress on a particular disk, most activity on<br />
that disk is blocked.<br />
Note: User processes that attempt to read or write to files on the disk can<br />
experience delays of several minutes or longer, especially if the disk contains a<br />
large number of files or has many users.<br />
Because the rebuild operation can delay access to disks during the startup of any<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> computer, Compaq recommends that procedures for mounting<br />
cluster disks use the /NOREBUILD qualifier. When MOUNT/NOREBUILD<br />
is specified, disks are not scanned to recover lost space, and users experience<br />
minimal delays while computers are mounting disks.<br />
Reference: Section 6.5.6 provides information about rebuilding system disks.<br />
Section 9.5.1 provides more information about disk rebuilds and system-disk<br />
throughput techniques.<br />
6.5.6 Rebuilding System Disks<br />
Rebuilding system disks is especially critical because most system activity<br />
requires access to a system disk. When a system disk rebuild is in progress, very<br />
little activity is possible on any computer that uses that disk.<br />
Unlike other disks, the system disk is automatically mounted early in the boot<br />
sequence. If a rebuild is necessary, and if the value of the system parameter<br />
ACP_REBLDSYSD is 1, the system disk is rebuilt during the boot sequence. (The<br />
default setting of 1 for the ACP_REBLDSYSD system parameter specifies that<br />
the system disk should be rebuilt.) Exceptions are as follows:<br />
• ACP_REBLDSYSD should be set to 0 on satellites. This setting prevents<br />
satellites from rebuilding a system disk when it is mounted early in the boot<br />
sequence and eliminates delays caused by such a rebuild when satellites join<br />
the cluster.<br />
• ACP_REBLDSYSD should be set to the default value of 1 on boot servers,<br />
and procedures that mount disks on the boot servers should use the /REBUILD<br />
qualifier. While these measures can make boot server rebooting more<br />
noticeable, they ensure that system disk space is available after an unexpected<br />
shutdown.<br />
Once the cluster is up and running, system managers can submit a batch<br />
procedure that executes SET VOLUME/REBUILD commands to recover lost<br />
disk space. Such procedures can run at a time when users would not be
inconvenienced by the blocked access to disks (for example, between midnight and<br />
6 a.m. each day). Because the SET VOLUME/REBUILD command determines<br />
whether a rebuild is needed, the procedures can execute the command for each<br />
disk that is usually mounted.<br />
Suggestion: The procedures run more quickly and cause less delay in disk<br />
access if they are executed on:<br />
• Powerful computers<br />
• Computers that have direct access to the volume to be rebuilt<br />
Moreover, several such procedures, each of which rebuilds a different set of disks,<br />
can be executed simultaneously.<br />
Caution: If either or both of the following conditions are true when mounting<br />
disks, it is essential to run a procedure with SET VOLUME/REBUILD commands<br />
on a regular basis to rebuild the disks:<br />
• Disks are mounted with the MOUNT/NOREBUILD command.<br />
• The ACP_REBLDSYSD system parameter is set to 0.<br />
Failure to rebuild disk volumes can result in a loss of free space and in<br />
subsequent failures of applications to create or extend files.<br />
6.6 Shadowing Disks Across an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
Volume shadowing (sometimes referred to as disk mirroring) achieves high data<br />
availability by duplicating data on multiple disks. If one disk fails, the remaining<br />
disk or disks can continue to service application and user I/O requests.<br />
6.6.1 Purpose<br />
Volume Shadowing for <strong>OpenVMS</strong> software provides data availability across the<br />
full range of <strong>OpenVMS</strong> configurations—from single nodes to large <strong>OpenVMS</strong><br />
<strong>Cluster</strong> systems—so you can provide data availability where you need it most.<br />
Volume Shadowing for <strong>OpenVMS</strong> software is an implementation of RAID 1<br />
(redundant arrays of independent disks) technology. Volume Shadowing<br />
for <strong>OpenVMS</strong> prevents a disk device failure from interrupting system and<br />
application operations. By duplicating data on multiple disks, volume shadowing<br />
transparently prevents your storage subsystems from becoming a single point of<br />
failure because of media deterioration, communication path failure, or controller<br />
or device failure.<br />
6.6.2 Shadow Sets<br />
You can mount one, two, or three compatible disk volumes to form a shadow<br />
set, as shown in Figure 6–9. Each disk in the shadow set is known as a shadow<br />
set member. Volume Shadowing for <strong>OpenVMS</strong> logically binds the shadow set<br />
devices together and represents them as a single virtual device called a virtual<br />
unit. This means that the multiple members of the shadow set, represented<br />
by the virtual unit, appear to operating systems and users as a single, highly<br />
available disk.<br />
Figure 6–9 Shadow Set With Three Members<br />
[Figure 6–9 shows a shadow set in which a single virtual unit represents three shadow set members: Physical Device #1, Physical Device #2, and Physical Device #3.]<br />
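For example, commands similar to the following create and mount a three-member shadow set represented by the virtual unit DSA1: (the member device names and volume label are illustrative):<br />
$ MOUNT/SYSTEM DSA1: /SHADOW=($1$DUA10:,$1$DUA11:,$1$DUA12:) APPDATA<br />
Thereafter, applications access the shadow set through DSA1: rather than through the individual member devices.<br />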
6.6.3 I/O Capabilities<br />
Applications and users read and write data to and from a shadow set using the<br />
same commands and program language syntax and semantics that are used for<br />
nonshadowed I/O operations. System managers manage and monitor shadow sets<br />
using the same commands and utilities they use for nonshadowed disks. The only<br />
difference is that access is through the virtual unit, not to individual devices.<br />
Reference: Volume Shadowing for <strong>OpenVMS</strong> describes the shadowing product<br />
capabilities in detail.<br />
6.6.4 Supported Devices<br />
For a single workstation or a large data center, valid shadowing configurations<br />
include:<br />
• All MSCP-compliant DSA drives<br />
• All DSSI devices<br />
• All StorageWorks SCSI disks and controllers, and some third-party SCSI<br />
devices that implement READL (read long) and WRITEL (write long)<br />
commands and use the SCSI disk driver (DKDRIVER)<br />
Restriction: SCSI disks that do not support READL and WRITEL are<br />
restricted because these disks do not support the shadowing data repair (disk<br />
bad-block errors) capability. Thus, using unsupported SCSI disks can cause<br />
members to be removed from the shadow set.<br />
You can shadow data disks and system disks. Thus, a system disk need not be<br />
a single point of failure for any system that boots from that disk. System disk<br />
shadowing becomes especially important for <strong>OpenVMS</strong> <strong>Cluster</strong> systems that use<br />
a common system disk from which multiple computers boot.<br />
Volume Shadowing for <strong>OpenVMS</strong> does not support the shadowing of quorum<br />
disks. This is because volume shadowing makes use of the <strong>OpenVMS</strong> distributed<br />
lock manager, and the quorum disk must be utilized before locking is enabled.<br />
There are no restrictions on the location of shadow set members beyond the valid<br />
disk configurations defined in the Volume Shadowing for <strong>OpenVMS</strong> Software<br />
Product Description (SPD 27.29.xx).<br />
6.6.5 Shadow Set Limits<br />
You can mount a maximum of 500 shadow sets (each having one, two, or three<br />
members) in a standalone or <strong>OpenVMS</strong> <strong>Cluster</strong> system. The number of shadow<br />
sets supported is independent of controller and device types. The shadow sets can<br />
be mounted as public or private volumes.<br />
For any changes to these limits, consult the Volume Shadowing for <strong>OpenVMS</strong><br />
Software Product Description (SPD 27.29.xx).<br />
6.6.6 Distributing Shadowed Disks<br />
The controller-independent design of shadowing allows you to manage shadow<br />
sets regardless of their controller connection or location in the <strong>OpenVMS</strong><br />
<strong>Cluster</strong> system and helps provide improved data availability and very flexible<br />
configurations.<br />
For clusterwide shadowing, members can be located anywhere in an <strong>OpenVMS</strong><br />
<strong>Cluster</strong> system and served by MSCP servers across any supported <strong>OpenVMS</strong><br />
<strong>Cluster</strong> interconnect, including the CI, Ethernet, DSSI, and FDDI. For example,<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> systems using FDDI can be up to 40 kilometers apart, which<br />
further increases the availability and disaster tolerance of a system.<br />
Figure 6–10 shows how shadow set member units are on line to local controllers<br />
located on different nodes. In the figure, a disk volume is local to each of the<br />
nodes ATABOY and ATAGRL. The MSCP server provides access to the shadow<br />
set members over the Ethernet. Even though the disk volumes are local to<br />
different nodes, the disks are members of the same shadow set. A member unit<br />
that is local to one node can be accessed by the remote node over the MSCP<br />
server.<br />
Figure 6–10 Shadow Sets Accessed Through the MSCP Server<br />
[Figure 6–10 shows the nodes ATABOY and ATAGRL connected by the Ethernet. The disk $20$DUA1, local to ATABOY, and the disk $10$DUA1, local to ATAGRL, are both members of the shadow set represented by the virtual unit DSA1:.]<br />
For shadow sets that are mounted on an <strong>OpenVMS</strong> <strong>Cluster</strong> system, mounting or<br />
dismounting a shadow set on one node in the cluster does not affect applications<br />
or user functions executing on other nodes in the system. For example, you can<br />
dismount the virtual unit from one node in an <strong>OpenVMS</strong> <strong>Cluster</strong> system and<br />
leave the shadow set operational on the remaining nodes on which it is mounted.<br />
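For example, the following command dismounts the virtual unit on the node where it is entered, while the shadow set remains operational on the other nodes that have it mounted (DSA1: is illustrative):<br />
$ DISMOUNT DSA1:<br />
Adding the /CLUSTER qualifier would instead dismount the shadow set throughout the cluster.<br />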
Other shadowing notes:<br />
• If an individual disk volume is already mounted as a member of an active<br />
shadow set, the disk volume cannot be mounted as a standalone disk on<br />
another node.<br />
• System disks can be shadowed. All nodes booting from shadowed system<br />
disks must:<br />
Have a Volume Shadowing for <strong>OpenVMS</strong> license.<br />
Specify the same physical member of the system disk shadow set as the<br />
boot device.<br />
Set shadowing system parameters to enable shadowing and specify the<br />
system disk virtual unit number.<br />
Mount additional physical members into the system disk shadow set early<br />
in the SYSTARTUP_VMS.COM command procedure.<br />
Mount the disks to be used in the shadow set.
7<br />
Setting Up and Managing <strong>Cluster</strong> Queues<br />
7.1 Introduction<br />
This chapter discusses queuing topics specific to <strong>OpenVMS</strong> <strong>Cluster</strong> systems.<br />
Because queues in an <strong>OpenVMS</strong> <strong>Cluster</strong> system are established and controlled<br />
with the same commands used to manage queues on a standalone computer, the<br />
discussions in this chapter assume some knowledge of queue management on a<br />
standalone system, as described in the <strong>OpenVMS</strong> System Manager’s Manual.<br />
Note: See the <strong>OpenVMS</strong> System Manager’s Manual for information about<br />
queuing compatibility.<br />
Users can submit jobs to any queue in the <strong>OpenVMS</strong> <strong>Cluster</strong> system, regardless<br />
of the processor on which the job will actually execute. Generic queues can<br />
balance the work load among the available processors.<br />
The system manager can use one or several queue managers to manage batch<br />
and print queues for an entire <strong>OpenVMS</strong> <strong>Cluster</strong> system. Although a single<br />
queue manager is sufficient for most systems, multiple queue managers can be<br />
useful for distributing the batch and print work load across nodes in the cluster.<br />
Note: <strong>OpenVMS</strong> <strong>Cluster</strong> systems that include both VAX and Alpha computers<br />
must use the queue manager described in this chapter.<br />
7.2 Controlling Queue Availability<br />
Once the batch and print queue characteristics are set up, the system manager<br />
can rely on the distributed queue manager to make queues available across the<br />
cluster.<br />
The distributed queue manager prevents the queuing system from being affected<br />
when a node enters or leaves the cluster during cluster transitions. The following<br />
table describes how the distributed queue manager works.<br />
WHEN the node on which the queue manager is running leaves the <strong>OpenVMS</strong> <strong>Cluster</strong> system, THEN the queue manager automatically fails over to another node. This failover occurs transparently to users on the system.<br />
WHEN nodes are added to the cluster, THEN the queue manager automatically serves the new nodes. The system manager does not need to enter a command explicitly to start queuing on the new nodes.<br />
WHEN the <strong>OpenVMS</strong> <strong>Cluster</strong> system reboots, THEN the queuing system automatically restarts by default, so you do not have to include commands in your startup command procedure for queuing. The operating system automatically restores the queuing system with the parameters defined in the queue database, because the characteristics you define when you start the queuing system are stored in that database.<br />
To control queues, the queue manager maintains a clusterwide queue database<br />
that stores information about queues and jobs. Whether you use one or several<br />
queue managers, only one queue database is shared across the cluster. Keeping<br />
the information for all processes in one database allows jobs submitted from any<br />
computer to execute on any queue (provided that the necessary mass storage<br />
devices are accessible).<br />
7.3 Starting a Queue Manager and Creating the Queue Database<br />
You start up a queue manager using the START/QUEUE/MANAGER command as<br />
you would on a standalone computer. However, in an <strong>OpenVMS</strong> <strong>Cluster</strong> system,<br />
you can also provide a failover list and a unique name for the queue manager.<br />
The /NEW_VERSION qualifier creates a new queue database.<br />
The following command example shows how to start a queue manager:<br />
$ START/QUEUE/MANAGER/NEW_VERSION/ON=(GEM,STONE,*)<br />
The following table explains the components of this sample command.<br />
Command Function<br />
START/QUEUE/MANAGER Creates a single, clusterwide queue manager named SYS$QUEUE_MANAGER.<br />
/NEW_VERSION Creates a new queue database in SYS$COMMON:[SYSEXE] that consists of the following three files:<br />
• QMAN$MASTER.DAT (master file)<br />
• SYS$QUEUE_MANAGER.QMAN$QUEUES (queue file)<br />
• SYS$QUEUE_MANAGER.QMAN$JOURNAL (journal file)<br />
Rule: Use the /NEW_VERSION qualifier only on the first invocation of the queue manager or if you want to create a new queue database.<br />
/ON=(node-list) [optional] Specifies an ordered list of nodes that can claim the queue manager if the node running the queue manager should exit the cluster. In the example:<br />
• The queue manager process starts on node GEM.<br />
• If the queue manager is running on node GEM and GEM leaves the cluster, the queue manager fails over to node STONE.<br />
• The asterisk wildcard ( * ) is specified as the last node in the node list to indicate that any remaining, unlisted nodes can start the queue manager in any order.<br />
Rules: Complete node names are required; you cannot specify the asterisk wildcard character as part of a node name. If you want to exclude certain nodes from being eligible to run the queue manager, do not use the asterisk wildcard character in the node list.<br />
/NAME_OF_MANAGER [optional] Allows you to assign a unique name to the queue manager. Unique queue manager names are necessary if you run multiple queue managers. Using the /NAME_OF_MANAGER qualifier causes queue and journal files to be created using the queue manager name instead of the default name SYS$QUEUE_MANAGER. For example, adding the /NAME_OF_MANAGER=PRINT_MANAGER qualifier creates these files:<br />
QMAN$MASTER.DAT<br />
PRINT_MANAGER.QMAN$QUEUES<br />
PRINT_MANAGER.QMAN$JOURNAL<br />
Rules for <strong>OpenVMS</strong> <strong>Cluster</strong> systems with multiple system disks:<br />
• Specify the locations of both the master file and the queue and journal files for systems that do<br />
not boot from the system disk where the files are located.<br />
Reference: If you want to locate the queue database files on other devices or directories, refer to<br />
the <strong>OpenVMS</strong> System Manager’s Manual for instructions.<br />
• Specify a device and directory that is accessible across the <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
• Define the device and directory identically in the SYS$COMMON:SYLOGICALS.COM startup<br />
command procedure on every node.<br />
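For example, if the queue database resides on a shared disk, each node's SYLOGICALS.COM might point the QMAN$MASTER logical name at the master file location, and the queue manager can be started with the queue and journal file directory given as a parameter. The device and directory shown here are illustrative:<br />
$ DEFINE/SYSTEM/EXECUTIVE_MODE QMAN$MASTER $1$DUA5:[QUEUES]<br />
$ START/QUEUE/MANAGER/NEW_VERSION/ON=(GEM,STONE,*) $1$DUA5:[QUEUES]<br />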
7.4 Starting Additional Queue Managers<br />
Running multiple queue managers balances the work load by distributing batch<br />
and print jobs across the cluster. For example, you might create separate queue<br />
managers for batch and print queues in clusters with CPU or memory shortages.<br />
This allows the batch queue manager to run on one node while the print queue<br />
manager runs on a different node.<br />
7.4.1 Command Format<br />
To start additional queue managers, include the /ADD and /NAME_OF_<br />
MANAGER qualifiers on the START/QUEUE/MANAGER command. Do not<br />
specify the /NEW_VERSION qualifier. For example:<br />
$ START/QUEUE/MANAGER/ADD/NAME_OF_MANAGER=BATCH_MANAGER<br />
7.4.2 Database Files<br />
Multiple queue managers share one QMAN$MASTER.DAT master file, but an<br />
additional queue file and journal file is created for each queue manager. The<br />
additional files are named in the following format, respectively:<br />
• name_of_manager.QMAN$QUEUES<br />
• name_of_manager.QMAN$JOURNAL<br />
By default, the queue database and its files are located in<br />
SYS$COMMON:[SYSEXE]. If you want to relocate the queue database files,<br />
refer to the instructions in Section 7.6.<br />
7.5 Stopping the Queuing System<br />
When you enter the STOP/QUEUE/MANAGER/CLUSTER command, the queue<br />
manager remains stopped, and requests for queuing are denied until you<br />
enter the START/QUEUE/MANAGER command (without the /NEW_VERSION<br />
qualifier).<br />
The following command shows how to stop a queue manager named PRINT_<br />
MANAGER:<br />
$ STOP/QUEUE/MANAGER/CLUSTER/NAME_OF_MANAGER=PRINT_MANAGER<br />
Rule: You must include the /CLUSTER qualifier on the command line whether<br />
or not the queue manager is running on an <strong>OpenVMS</strong> <strong>Cluster</strong> system. If you<br />
omit the /CLUSTER qualifier, the command stops all queues on the default node<br />
without stopping the queue manager. (This has the same effect as entering the<br />
STOP/QUEUE/ON_NODE command.)<br />
7.6 Moving Queue Database Files<br />
The files in the queue database can be relocated from the default location of<br />
SYS$COMMON:[SYSEXE] to any disk that is mounted clusterwide or that is<br />
accessible to the computers participating in the clusterwide queue scheme. For<br />
example, you can enhance system performance by locating the database on a<br />
shared disk that has a low level of activity.<br />
7.6.1 Location Guidelines<br />
The master file QMAN$MASTER can be in a location separate from the queue<br />
and journal files, but the queue and journal files must be kept together in the<br />
same directory. The queue and journal files for one queue manager can be<br />
separate from those of other queue managers.<br />
The directory you specify must be available to all nodes in the cluster. If the<br />
directory specification is a concealed logical name, it must be defined identically<br />
in the SYS$COMMON:SYLOGICALS.COM startup command procedure on every<br />
node in the cluster.<br />
Reference: The <strong>OpenVMS</strong> System Manager’s Manual contains complete<br />
information about creating or relocating the queue database files. See also<br />
Section 7.12 for a sample common procedure that sets up an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
batch and print system.<br />
7.7 Setting Up Print Queues<br />
To establish print queues, you must determine the type of queue configuration<br />
that best suits your <strong>OpenVMS</strong> <strong>Cluster</strong> system. You have several alternatives that<br />
depend on the number and type of print devices you have on each computer and<br />
on how you want print jobs to be processed. For example, you need to decide:<br />
• Which print queues you want to establish on each computer<br />
• Whether to set up any clusterwide generic queues to distribute print job<br />
processing across the cluster<br />
• Whether to set up autostart queues for availability or improved startup time<br />
Figure 7–1 Sample Printer Configuration<br />
[Figure 7–1 shows the computers JUPITR, SATURN, and URANUS connected through CI interconnects to a star coupler, with printers LPA0:, LPB0:, and LPC0: attached to the individual computers.]<br />
Once you determine the appropriate strategy for your cluster, you can create your<br />
queues. Figure 7–1 shows the printer configuration for a cluster consisting of the<br />
active computers JUPITR, SATURN, and URANUS.<br />
7.7.1 Creating a Queue<br />
You set up <strong>OpenVMS</strong> <strong>Cluster</strong> print queues using the same method that you would<br />
use for a standalone computer. However, in an <strong>OpenVMS</strong> <strong>Cluster</strong> system, you<br />
must provide a unique name for each queue you create.<br />
7.7.2 Command Format<br />
You create and name a print queue by specifying the INITIALIZE/QUEUE<br />
command at the DCL prompt in the following format:<br />
INITIALIZE/QUEUE/ON=node-name::device[/START][/NAME_OF_MANAGER=name-of-manager] queue-name<br />
Qualifier Description<br />
/ON Specifies the computer and printer to which the queue is<br />
assigned. If you specify the /START qualifier, the queue<br />
is started.<br />
/NAME_OF_MANAGER If you are running multiple queue managers, you should<br />
also specify the queue manager with this qualifier.<br />
7.7.3 Ensuring Queue Availability<br />
You can also use the autostart feature to simplify startup and ensure high<br />
availability of execution queues in an <strong>OpenVMS</strong> <strong>Cluster</strong>. If the node on which the<br />
autostart queue is running leaves the <strong>OpenVMS</strong> <strong>Cluster</strong>, the queue automatically<br />
fails over to the next available node on which autostart is enabled. Autostart<br />
is particularly useful on LAT queues. Because LAT printers are usually shared<br />
among users of multiple systems or in <strong>OpenVMS</strong> <strong>Cluster</strong> systems, many users<br />
are affected if a LAT queue is unavailable.<br />
Format for creating autostart queues:<br />
Create an autostart queue with a list of nodes on which the queue can run by<br />
specifying the DCL command INITIALIZE/QUEUE in the following format:<br />
INITIALIZE/QUEUE/AUTOSTART_ON=(node-name::device:,node-name::device:, . . . ) queue-name<br />
When you use the /AUTOSTART_ON qualifier, you must initially activate<br />
the queue for autostart, either by specifying the /START qualifier with the<br />
INITIALIZE /QUEUE command or by entering a START/QUEUE command.<br />
However, the queue cannot begin processing jobs until the ENABLE AUTOSTART<br />
/QUEUES command is entered for a node on which the queue can run.<br />
Rules: Generic queues cannot be autostart queues. Note that you cannot specify<br />
both /ON and /AUTOSTART_ON.<br />
Reference: Refer to Section 7.13 for information about setting the time at which<br />
autostart is disabled.<br />
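For example, commands similar to the following create and activate an autostart queue that can run on either of two nodes and then enable autostart on the first of them (the node, device, and queue names are illustrative):<br />
$ INITIALIZE/QUEUE/START/AUTOSTART_ON=(JUPITR::LTA101:,SATURN::LTA101:) LAT_PRINT<br />
$ ENABLE AUTOSTART/QUEUES/ON_NODE=JUPITR<br />
If JUPITR later leaves the cluster, the queue fails over to SATURN, provided autostart has also been enabled there.<br />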
7.7.4 Examples<br />
The following commands make the local print queue assignments for JUPITR<br />
shown in Figure 7–2 and start the queues:<br />
$ INITIALIZE/QUEUE/ON=JUPITR::LPA0/START/NAME_OF_MANAGER=PRINT_MANAGER JUPITR_LPA0<br />
$ INITIALIZE/QUEUE/ON=JUPITR::LPB0/START/NAME_OF_MANAGER=PRINT_MANAGER JUPITR_LPB0<br />
Figure 7–2 Print Queue Configuration<br />
[Figure 7–2 shows the queues JUPITR_LPA0 and JUPITR_LPB0 assigned to the printers LPA0: and LPB0: on JUPITR, which is connected to SATURN and URANUS through CI interconnects and a star coupler.]<br />
7.8 Setting Up <strong>Cluster</strong>wide Generic Print Queues<br />
The clusterwide queue database enables you to establish generic queues that<br />
function throughout the cluster. Jobs queued to clusterwide generic queues are<br />
placed in any assigned print queue that is available, regardless of its location<br />
in the cluster. However, the file queued for printing must be accessible to the<br />
computer to which the printer is connected.<br />
7.8.1 Sample Configuration<br />
Figure 7–3 illustrates a clusterwide generic print queue in which the queues<br />
for all LPA0 printers in the cluster are assigned to a clusterwide generic queue<br />
named SYS$PRINT.<br />
A clusterwide generic print queue needs to be initialized and started only<br />
once. The most efficient way to start your queues is to create a common<br />
command procedure that is executed by each <strong>OpenVMS</strong> <strong>Cluster</strong> computer<br />
(see Section 7.12.3).<br />
Figure 7–3 <strong>Cluster</strong>wide Generic Print Queue Configuration<br />
[Figure 7–3 shows the generic queue (G) SYS$PRINT distributing jobs to the execution queues (E) JUPITR_LPA0, SATURN_LPA0, and URANUS_LPA0; the execution queues JUPITR_LPB0 and SATURN_LPB0 serve their local printers. The computers are connected through CI interconnects and a star coupler.]<br />
The following command initializes and starts the clusterwide generic queue<br />
SYS$PRINT:<br />
$ INITIALIZE/QUEUE/GENERIC=(JUPITR_LPA0,SATURN_LPA0,URANUS_LPA0)/START SYS$PRINT<br />
Jobs queued to SYS$PRINT are placed in whichever assigned print queue is<br />
available. Thus, in this example, a print job from JUPITR that is queued to<br />
SYS$PRINT can be queued to JUPITR_LPA0, SATURN_LPA0, or URANUS_<br />
LPA0.<br />
7.9 Setting Up Execution Batch Queues<br />
Generally, you set up execution batch queues on each <strong>OpenVMS</strong> <strong>Cluster</strong> computer<br />
using the same procedures you use for a standalone computer. For more detailed<br />
information about how to do this, see the <strong>OpenVMS</strong> System Manager’s Manual.<br />
7.9.1 Before You Begin<br />
Before you establish batch queues, you should decide which type of queue<br />
configuration best suits your cluster. As system manager, you are responsible for<br />
setting up batch queues to maintain efficient batch job processing on the cluster.<br />
For example, you should do the following:<br />
• Determine what type of processing will be performed on each computer.<br />
• Set up local batch queues that conform to these processing needs.<br />
• Decide whether to set up any clusterwide generic queues that will distribute<br />
batch job processing across the cluster.<br />
• Decide whether to use autostart queues for startup simplicity.<br />
Once you determine the strategy that best suits your needs, you can create a<br />
command procedure to set up your queues. Figure 7–4 shows a batch queue<br />
configuration for a cluster consisting of computers JUPITR, SATURN, and<br />
URANUS.<br />
Figure 7–4 Sample Batch Queue Configuration<br />
[Figure 7–4 shows the batch queues JUPITR_BATCH, SATURN_BATCH, and URANUS_BATCH and the print queues JUPITR_PRINT and SATURN_PRINT on the computers JUPITR, SATURN, and URANUS, which are connected through CI interconnects and a star coupler.]<br />
7.9.2 Batch Command Format<br />
You create a batch queue with a unique name by specifying the DCL command<br />
INITIALIZE/QUEUE/BATCH in the following format:<br />
INITIALIZE/QUEUE/BATCH/ON=node::[/START][/NAME_OF_MANAGER=name-of-manager] queue-name<br />
Qualifier Description<br />
/ON Specifies the computer on which the batch queue runs.<br />
/START Starts the queue.<br />
/NAME_OF_MANAGER Specifies the name of the queue manager if you are running<br />
multiple queue managers.<br />
7.9.3 Autostart Command Format<br />
You can initialize and start an autostart batch queue by specifying the DCL<br />
command INITIALIZE/QUEUE/BATCH. Use the following command format:<br />
INITIALIZE/QUEUE/BATCH/AUTOSTART_ON=node:: queue-name<br />
When you use the /AUTOSTART_ON qualifier, you must initially activate<br />
the queue for autostart, either by specifying the /START qualifier with the<br />
INITIALIZE/QUEUE command or by entering a START/QUEUE command.<br />
However, the queue cannot begin processing jobs until the ENABLE AUTOSTART<br />
/QUEUES command is entered for a node on which the queue can run.<br />
Rule: Generic queues cannot be autostart queues. Note that you cannot specify<br />
both /ON and /AUTOSTART_ON.<br />
7.9.4 Examples<br />
The following commands make the local batch queue assignments for JUPITR,<br />
SATURN, and URANUS shown in Figure 7–4:<br />
$ INITIALIZE/QUEUE/BATCH/ON=JUPITR::/START/NAME_OF_MANAGER=BATCH_QUEUE JUPITR_BATCH<br />
$ INITIALIZE/QUEUE/BATCH/ON=SATURN::/START/NAME_OF_MANAGER=BATCH_QUEUE SATURN_BATCH<br />
$ INITIALIZE/QUEUE/BATCH/ON=URANUS::/START/NAME_OF_MANAGER=BATCH_QUEUE URANUS_BATCH<br />
Because batch jobs on each <strong>OpenVMS</strong> <strong>Cluster</strong> computer are queued to<br />
SYS$BATCH by default, you should consider defining a logical name to establish<br />
this queue as a clusterwide generic batch queue that distributes batch job<br />
processing throughout the cluster (see Example 7–2). Note, however, that you<br />
should do this only if you have a common-environment cluster.<br />
7.10 Setting Up <strong>Cluster</strong>wide Generic Batch Queues<br />
In an <strong>OpenVMS</strong> <strong>Cluster</strong> system, you can distribute batch processing among<br />
computers to balance the use of processing resources. You can achieve this<br />
workload distribution by assigning local batch queues to one or more clusterwide<br />
generic batch queues. These generic batch queues control batch processing across<br />
the cluster by placing batch jobs in assigned batch queues that are available. You<br />
can create a clusterwide generic batch queue as shown in Example 7–2.<br />
A clusterwide generic batch queue needs to be initialized and started only<br />
once. The most efficient way to perform these operations is to create a common<br />
command procedure that is executed by each <strong>OpenVMS</strong> <strong>Cluster</strong> computer (see<br />
Example 7–2).<br />
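Modeled on the generic print queue command in Section 7.8.2, a command similar to the following initializes and starts a clusterwide generic batch queue (the assigned queue names match those in Figure 7–5; see Example 7–2 for the complete procedure):<br />
$ INITIALIZE/QUEUE/BATCH/GENERIC=(JUPITR_BATCH,SATURN_BATCH,URANUS_BATCH)/START SYS$BATCH<br />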
7.10.1 Sample Configuration<br />
In Figure 7–5, batch queues from each <strong>OpenVMS</strong> <strong>Cluster</strong> computer are assigned<br />
to a clusterwide generic batch queue named SYS$BATCH. Users can submit a<br />
job to a specific queue (for example, JUPITR_BATCH or SATURN_BATCH), or, if<br />
they have no special preference, they can submit it by default to the clusterwide<br />
generic queue SYS$BATCH. The generic queue in turn places the job in an<br />
available assigned queue in the cluster.<br />
If more than one assigned queue is available, the operating system selects the<br />
queue that minimizes the ratio (executing jobs/job limit) for all assigned queues.<br />
Figure 7–5 <strong>Cluster</strong>wide Generic Batch Queue Configuration<br />
[Figure 7–5 shows the generic queue SYS$BATCH distributing jobs to the batch queues JUPITR_BATCH, SATURN_BATCH, and URANUS_BATCH; the print queues JUPITR_PRINT, SATURN_PRINT, and URANUS_PRINT also appear. The computers are connected through CI interconnects and a star coupler.]<br />
7.11 Starting Local Batch Queues<br />
Normally, you use local batch execution queues during startup to run batch jobs<br />
to start layered products. For this reason, these queues must be started before<br />
the ENABLE AUTOSTART command is executed, as shown in the command<br />
procedure in Example 7–1.<br />
7.11.1 Startup Command Procedure<br />
Start the local batch execution queue in each node’s startup command procedure<br />
SYSTARTUP_VMS.COM. If you use a common startup command procedure, add<br />
commands similar to the following to your procedure:<br />
$ SUBMIT/PRIORITY=255/NOIDENT/NOLOG/QUEUE=node_BATCH LAYERED_PRODUCT.COM<br />
$ START/QUEUE node_BATCH<br />
$ DEFINE/SYSTEM/EXECUTIVE SYS$BATCH node_BATCH<br />
Submitting the startup command procedure LAYERED_PRODUCT.COM as a<br />
high-priority batch job before the queue starts ensures that the job is executed<br />
immediately, regardless of the job limit on the queue. If the queue is started<br />
before the command procedure is submitted, the queue might reach its job limit<br />
by scheduling user batch jobs, and the startup job would have to wait.<br />
7.12 Using a Common Command Procedure<br />
Once you have created queues, you must start them to begin processing batch<br />
and print jobs. In addition, you must make sure the queues are started each time<br />
the system reboots, by enabling autostart for autostart queues or by entering<br />
START/QUEUE commands for nonautostart queues. To do so, create a command<br />
procedure containing the necessary commands.<br />
7.12.1 Command Procedure<br />
You can create a common command procedure named, for example,<br />
QSTARTUP.COM, and store it on a shared disk. With this method, each node<br />
can share the same copy of the common QSTARTUP.COM procedure. Each node<br />
invokes the common QSTARTUP.COM procedure from the common version of<br />
SYSTARTUP. You can also include the commands to start queues in the common<br />
SYSTARTUP file instead of in a separate QSTARTUP.COM file.<br />
7.12.2 Examples<br />
Example 7–1 shows commands used to create <strong>OpenVMS</strong> <strong>Cluster</strong> queues.<br />
Example 7–1 Sample Commands for Creating <strong>OpenVMS</strong> <strong>Cluster</strong> Queues<br />
!<br />
$ DEFINE/FORM LN_FORM 10 /WIDTH=80 /STOCK=DEFAULT /TRUNCATE<br />
$ DEFINE/CHARACTERISTIC 2ND_FLOOR 2<br />
   .<br />
   .<br />
   .<br />
"<br />
$ INITIALIZE/QUEUE/AUTOSTART_ON=(JUPITR::LPA0:)/START JUPITR_PRINT<br />
$ INITIALIZE/QUEUE/AUTOSTART_ON=(SATURN::LPA0:)/START SATURN_PRINT<br />
$ INITIALIZE/QUEUE/AUTOSTART_ON=(URANUS::LPA0:)/START URANUS_PRINT<br />
   .<br />
   .<br />
   .<br />
#<br />
$ INITIALIZE/QUEUE/BATCH/START/ON=JUPITR:: JUPITR_BATCH<br />
$ INITIALIZE/QUEUE/BATCH/START/ON=SATURN:: SATURN_BATCH<br />
$ INITIALIZE/QUEUE/BATCH/START/ON=URANUS:: URANUS_BATCH<br />
   .<br />
   .<br />
   .<br />
$<br />
$ INITIALIZE/QUEUE/START -<br />
_$ /AUTOSTART_ON=(JUPITR::LTA1:,SATURN::LTA1:,URANUS::LTA1:) -<br />
_$ /PROCESSOR=LATSYM /FORM_MOUNTED=LN_FORM -<br />
_$ /RETAIN=ERROR /DEFAULT=(NOBURST,FLAG=ONE,NOTRAILER) -<br />
_$ /RECORD_BLOCKING LN03$PRINT<br />
$ INITIALIZE/QUEUE/START -<br />
_$ /AUTOSTART_ON=(JUPITR::LTA2:,SATURN::LTA2:,URANUS::LTA2:) -<br />
_$ /PROCESSOR=LATSYM /RETAIN=ERROR -<br />
_$ /DEFAULT=(NOBURST,FLAG=ONE,NOTRAILER) /RECORD_BLOCKING -<br />
_$ /CHARACTERISTIC=2ND_FLOOR LA210$PRINT<br />
%<br />
$ ENABLE AUTOSTART/QUEUES/ON=SATURN<br />
$ ENABLE AUTOSTART/QUEUES/ON=JUPITR<br />
$ ENABLE AUTOSTART/QUEUES/ON=URANUS<br />
&<br />
$ INITIALIZE/QUEUE/START SYS$PRINT -<br />
_$ /GENERIC=(JUPITR_PRINT,SATURN_PRINT,URANUS_PRINT)<br />
'<br />
$ INITIALIZE/QUEUE/BATCH/START SYS$BATCH -<br />
_$ /GENERIC=(JUPITR_BATCH,SATURN_BATCH,URANUS_BATCH)<br />
Following are descriptions of each command or group of commands in<br />
Example 7–1.<br />
Command Description<br />
! Define all printer forms and characteristics.<br />
" Initialize local print queues. In the example, these queues are autostart<br />
queues and are started automatically when the node executes the ENABLE<br />
AUTOSTART/QUEUES command. Although the /START qualifier is specified to<br />
activate the autostart queues, they do not begin processing jobs until autostart is<br />
enabled.<br />
To enable autostart each time the system reboots, add the ENABLE<br />
AUTOSTART/QUEUES command to your queue startup command procedure,<br />
as shown in Example 7–2.<br />
# Initialize and start local batch queues on all nodes, including satellite nodes. In<br />
this example, the local batch queues are not autostart queues.<br />
$ Initialize queues for remote LAT printers. In the example, these queues are<br />
autostart queues and are set up to run on one of three nodes. The queues are<br />
started on the first of those three nodes to execute the ENABLE AUTOSTART<br />
command.<br />
You must establish the logical devices LTA1 and LTA2 in the LAT startup<br />
command procedure LAT$SYSTARTUP.COM on each node on which the<br />
autostart queue can run. For more information, see the description of editing<br />
LAT$SYSTARTUP.COM in the <strong>OpenVMS</strong> System Manager’s Manual.<br />
Although the /START qualifier is specified to activate these autostart queues, they<br />
will not begin processing jobs until autostart is enabled.<br />
% Enable autostart to start the autostart queues automatically. In the example,<br />
autostart is enabled on node SATURN first, so the queue manager starts the<br />
autostart queues that are set up to run on one of several nodes.<br />
& Initialize and start the generic output queue SYS$PRINT. This is a nonautostart<br />
queue (generic queues cannot be autostart queues). However, generic queues are<br />
not stopped automatically when a system is shut down, so you do not need to<br />
restart the queue each time a node reboots.<br />
’ Initialize and start the generic batch queue SYS$BATCH. Because this is a generic<br />
queue, it is not stopped when the node shuts down. Therefore, you do not need to<br />
restart the queue each time a node reboots.<br />
7.12.3 Example<br />
Example 7–2 illustrates the use of a common QSTARTUP command procedure on<br />
a shared disk.<br />
Example 7–2 Common Procedure to Start <strong>OpenVMS</strong> <strong>Cluster</strong> Queues<br />
$!<br />
$! QSTARTUP.COM -- Common procedure to set up cluster queues<br />
$!<br />
$!<br />
!<br />
$ NODE = F$GETSYI("NODENAME")<br />
$!<br />
$! Determine the node-specific subroutine<br />
$!<br />
$ IF (NODE .NES. "JUPITR") .AND. (NODE .NES. "SATURN") .AND. (NODE .NES. "URANUS")<br />
$ THEN<br />
$ GOSUB SATELLITE_STARTUP<br />
$ ELSE<br />
"<br />
$!<br />
$! Configure remote LAT devices.<br />
$!<br />
$ SET TERMINAL LTA1: /PERM /DEVICE=LN03 /WIDTH=255 /PAGE=60 -<br />
/LOWERCASE /NOBROAD<br />
$ SET TERMINAL LTA2: /PERM /DEVICE=LA210 /WIDTH=255 /PAGE=66 -<br />
/NOBROAD<br />
$ SET DEVICE LTA1: /SPOOLED=(LN03$PRINT,SYS$SYSDEVICE:)<br />
$ SET DEVICE LTA2: /SPOOLED=(LA210$PRINT,SYS$SYSDEVICE:)<br />
#<br />
$ START/QUEUE/BATCH 'NODE'_BATCH<br />
$ GOSUB 'NODE'_STARTUP<br />
$ ENDIF<br />
$ GOTO ENDING<br />
$!<br />
$! Node-specific subroutines start here<br />
$!<br />
$<br />
$ SATELLITE_STARTUP:<br />
$!<br />
$! Start a batch queue for satellites.<br />
$!<br />
$ START/QUEUE/BATCH 'NODE'_BATCH<br />
$ RETURN<br />
$!<br />
%<br />
$JUPITR_STARTUP:<br />
$!<br />
$! Node-specific startup for JUPITR::<br />
$! Setup local devices and start nonautostart queues here<br />
$!<br />
$ SET PRINTER/PAGE=66 LPA0:<br />
$ RETURN<br />
$!<br />
$SATURN_STARTUP:<br />
$!<br />
$! Node-specific startup for SATURN::<br />
$! Setup local devices and start nonautostart queues here<br />
$! ...<br />
$ RETURN<br />
$!<br />
$URANUS_STARTUP:<br />
$!<br />
$! Node-specific startup for URANUS::<br />
$! Setup local devices and start nonautostart queues here<br />
$! ...<br />
$ RETURN<br />
$!<br />
$ENDING:<br />
&<br />
$! Enable autostart to start all autostart queues<br />
$!<br />
$ ENABLE AUTOSTART/QUEUES<br />
$ EXIT<br />
Following are descriptions of each phase of the common QSTARTUP.COM<br />
command procedure in Example 7–2.<br />
Command Description<br />
! Determine the name of the node executing the procedure.<br />
" On all large nodes, set up remote devices connected by the LAT. The queues<br />
for these devices are autostart queues and are started automatically when<br />
the ENABLE AUTOSTART/QUEUES command is executed at the end of this<br />
procedure.<br />
In the example, these autostart queues were set up to run on one of three<br />
nodes. The queues start when the first of those nodes executes the ENABLE<br />
AUTOSTART/QUEUES command. The queue remains running as long as one of<br />
those nodes is running and has autostart enabled.<br />
# On large nodes, start the local batch queue. In the example, the local batch queues<br />
are nonautostart queues and must be started explicitly with START/QUEUE<br />
commands.<br />
$ On satellite nodes, start the local batch queue.<br />
% Each node executes its own subroutine. On node JUPITR, set up the line printer<br />
device LPA0:. The queue for this device is an autostart queue and is started<br />
automatically when the ENABLE AUTOSTART/QUEUES command is executed.<br />
& Enable autostart to start all autostart queues.<br />
7.13 Disabling Autostart During Shutdown<br />
By default, the shutdown procedure disables autostart at the beginning of the<br />
shutdown sequence. Disabling autostart allows autostart queues with failover<br />
lists to fail over to another node; it also prevents any autostart queue running<br />
on another node in the cluster from failing over to the node being shut down.<br />
7.13.1 Options<br />
You can change the time at which autostart is disabled in the shutdown sequence<br />
in one of two ways:<br />
Option Description<br />
1 Define the logical name SHUTDOWN$DISABLE_AUTOSTART as follows:<br />
$ DEFINE/SYSTEM/EXECUTIVE SHUTDOWN$DISABLE_AUTOSTART number-of-minutes<br />
Set the value of number-of-minutes to the number of minutes before shutdown when<br />
autostart is to be disabled. You can add this logical name definition to SYLOGICALS.COM.<br />
The value of number-of-minutes is the default value for the node. If this number is greater<br />
than the number of minutes specified for the entire shutdown sequence, autostart is disabled<br />
at the beginning of the sequence.<br />
2 Specify the DISABLE_AUTOSTART number-of-minutes option during the shutdown<br />
procedure. (The value you specify for number-of-minutes overrides the value specified<br />
for the SHUTDOWN$DISABLE_AUTOSTART logical name.)<br />
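The timing rule for the two options can be sketched as follows. This is an illustrative Python sketch only, not the actual shutdown procedure's logic; the function name and parameters are hypothetical:

```python
# Sketch: the DISABLE_AUTOSTART shutdown option overrides the
# SHUTDOWN$DISABLE_AUTOSTART logical name, and a value greater than the
# total shutdown time disables autostart at the start of the sequence.

def minutes_before_shutdown(total_shutdown_minutes,
                            logical_name_value=None,
                            option_value=None):
    # The option specified during shutdown takes precedence over the logical.
    value = option_value if option_value is not None else (logical_name_value or 0)
    # A value larger than the whole shutdown interval means "disable at
    # the beginning of the sequence".
    return min(value, total_shutdown_minutes)

print(minutes_before_shutdown(10, logical_name_value=4))                  # 4
print(minutes_before_shutdown(10, logical_name_value=4, option_value=15)) # 10
```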
Reference: See the <strong>OpenVMS</strong> System Manager’s Manual for more information<br />
about changing the time at which autostart is disabled during the shutdown<br />
sequence.<br />
8<br />
Configuring an <strong>OpenVMS</strong> <strong>Cluster</strong> System<br />
This chapter provides an overview of the cluster configuration command<br />
procedures and describes the preconfiguration tasks required before running<br />
either command procedure. Then it describes each major function of the<br />
command procedures and the postconfiguration tasks, including running<br />
AUTOGEN.COM.<br />
8.1 Overview of the <strong>Cluster</strong> Configuration Procedures<br />
Two similar command procedures are provided for configuring and reconfiguring<br />
an <strong>OpenVMS</strong> <strong>Cluster</strong> system: CLUSTER_CONFIG_LAN.COM and CLUSTER_<br />
CONFIG.COM. The choice depends on whether you use the LANCP utility or<br />
DECnet for satellite booting in your cluster. CLUSTER_CONFIG_LAN.COM<br />
provides satellite booting services with the LANCP utility; CLUSTER_<br />
CONFIG.COM provides satellite booting services with DECnet. See Section 4.5 for<br />
the factors to consider when choosing a satellite booting service.<br />
These configuration procedures automate most of the tasks required to configure<br />
an <strong>OpenVMS</strong> <strong>Cluster</strong> system. When you invoke CLUSTER_CONFIG_LAN.COM<br />
or CLUSTER_CONFIG.COM, the following configuration options are displayed:<br />
• Add a computer to the cluster<br />
• Remove a computer from the cluster<br />
• Change a computer’s characteristics<br />
• Create a duplicate system disk<br />
• Make a directory structure for a new root on a system disk<br />
• Delete a root from a system disk<br />
By selecting the appropriate option, you can configure the cluster easily and<br />
reliably without invoking any <strong>OpenVMS</strong> utilities directly. Table 8–1 summarizes<br />
the functions that the configuration procedures perform for each configuration<br />
option.<br />
The phrase cluster configuration command procedure, when used in<br />
this chapter, refers to both CLUSTER_CONFIG_LAN.COM and CLUSTER_<br />
CONFIG.COM. The questions of the two configuration procedures are identical<br />
except where they pertain to LANCP and DECnet.<br />
Note: For help on any question in these command procedures, type a question<br />
mark (?) at the question.<br />
Table 8–1 Summary of <strong>Cluster</strong> Configuration Functions<br />
Option Functions Performed<br />
ADD Enables a node as a cluster member:<br />
• Establishes the new computer’s root directory on a cluster common system disk<br />
and generates the computer’s system parameter files (ALPHAVMSSYS.PAR for<br />
Alpha systems or VAXVMSSYS.PAR for VAX systems), and MODPARAMS.DAT<br />
in its SYS$SPECIFIC:[SYSEXE] directory.<br />
• Generates the new computer’s page and swap files (PAGEFILE.SYS and<br />
SWAPFILE.SYS).<br />
• Sets up a cluster quorum disk (optional).<br />
• Sets disk allocation class values, or port allocation class values (Alpha only), or<br />
both, with the ALLOCLASS parameter for the new computer, if the computer is<br />
being added as a disk server. If the computer is being added as a tape server,<br />
sets a tape allocation class value with the TAPE_ALLOCLASS parameter.<br />
Note: ALLOCLASS must be set to a value greater than zero if you are<br />
configuring an Alpha computer on a shared SCSI bus and you are not using<br />
a port allocation class.<br />
• Generates an initial (temporary) startup procedure for the new computer. This<br />
initial procedure:<br />
Runs NETCONFIG.COM to configure the network.<br />
Runs AUTOGEN to set appropriate system parameter values for the<br />
computer.<br />
Reboots the computer with normal startup procedures.<br />
• If the new computer is a satellite node, the configuration procedure updates:<br />
Network databases for the computer on which the configuration procedure<br />
is executed to add the new computer.<br />
SYS$MANAGER:NETNODE_UPDATE.COM command procedure on the<br />
local computer (as described in Section 10.4.2).<br />
REMOVE Disables a node as a cluster member:<br />
• Deletes another computer’s root directory and its contents from the local<br />
computer’s system disk. If the computer being removed is a satellite, the<br />
cluster configuration command procedure updates SYS$MANAGER:NETNODE_<br />
UPDATE.COM on the local computer.<br />
• Updates the permanent and volatile remote node network databases on the local<br />
computer.<br />
• Removes the quorum disk.<br />
CHANGE Displays the CHANGE menu and prompts for appropriate information to:<br />
• Enable or disable the local computer as a disk server<br />
• Enable or disable the local computer as a boot server<br />
• Enable or disable the Ethernet or FDDI LAN for cluster communications on the<br />
local computer<br />
• Enable or disable a quorum disk on the local computer<br />
• Change a satellite’s Ethernet or FDDI hardware address<br />
• Enable or disable the local computer as a tape server<br />
• Change the local computer’s ALLOCLASS or TAPE_ALLOCLASS value<br />
• Change the local computer’s shared SCSI port allocation class value<br />
• Enable or disable MEMORY CHANNEL for node-to-node cluster communications<br />
on the local computer<br />
CREATE Duplicates the local computer’s system disk and removes all system roots from the<br />
new disk.<br />
MAKE Creates a directory structure for a new root on a system disk.<br />
DELETE Deletes a root from a system disk.<br />
8.1.1 Before Configuring the System<br />
Before invoking either the CLUSTER_CONFIG_LAN.COM or the CLUSTER_<br />
CONFIG.COM procedure to configure an <strong>OpenVMS</strong> <strong>Cluster</strong> system, perform the<br />
tasks described in Table 8–2.<br />
Table 8–2 Preconfiguration Tasks<br />
Task Procedure<br />
Determine whether the computer<br />
uses DECdtm.<br />
When you add a computer to or remove a computer from a cluster that uses DECdtm<br />
services, you must perform several tasks to ensure the integrity of your data.<br />
Reference: See the chapter about DECdtm services in the <strong>OpenVMS</strong> System<br />
Manager’s Manual for step-by-step instructions on setting up DECdtm in an<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
If you are not sure whether your cluster uses DECdtm services, enter this command<br />
sequence:<br />
$ SET PROCESS /PRIVILEGES=SYSPRV<br />
$ RUN SYS$SYSTEM:LMCP<br />
LMCP> SHOW LOG<br />
If your cluster does not use DECdtm services, the SHOW LOG command will display<br />
a ‘‘file not found’’ error message. If your cluster uses DECdtm services, it displays a<br />
list of the files that DECdtm uses to store information about transactions.<br />
Ensure the network software<br />
providing the satellite booting<br />
service is up and running and all<br />
computers are connected to the<br />
LAN.<br />
For nodes that will use the LANCP utility for satellite booting, run the LANCP<br />
utility and enter the LANCP command LIST DEVICE/MOPDLL to display a list of<br />
LAN devices on the system:<br />
$ RUN SYS$SYSTEM:LANCP<br />
LANCP> LIST DEVICE/MOPDLL<br />
For nodes running DECnet for <strong>OpenVMS</strong>, enter the DCL command SHOW<br />
NETWORK to determine whether the network is up and running:<br />
$ SHOW NETWORK<br />
VAX/VMS Network status for local node 63.452 VIVID on 5-NOV-1994<br />
This is a nonrouting node, and does not have any network information.<br />
The designated router for VIVID is node 63.1021 SATURN.<br />
This example shows that the node VIVID is running DECnet for <strong>OpenVMS</strong>. If<br />
DECnet has not been started, the message ‘‘SHOW-I-NONET, Network Unavailable’’<br />
is displayed.<br />
For nodes running DECnet–Plus, refer to DECnet for <strong>OpenVMS</strong> Network<br />
Management Utilities for information about determining whether the DECnet–Plus<br />
network is up and running.<br />
Select MOP and disk servers. Every <strong>OpenVMS</strong> <strong>Cluster</strong> configured with satellite nodes must include at least<br />
one Maintenance Operations Protocol (MOP) and disk server. When possible,<br />
select multiple computers as MOP and disk servers. Multiple servers give better<br />
availability, and they distribute the work load across more LAN adapters.<br />
Follow these guidelines when selecting MOP and disk servers:<br />
• Ensure that MOP servers have direct access to the system disk.<br />
• Ensure that disk servers have direct access to the storage that they are serving.<br />
• Choose the most powerful computers in the cluster. Low-powered computers can become overloaded when serving many busy satellites or when many satellites boot simultaneously. Note, however, that two or more moderately powered servers may provide better performance than a single high-powered server.<br />
• If you have several computers of roughly comparable power, it is reasonable to use them all as boot servers. This arrangement gives optimal load balancing. In addition, if one computer fails or is shut down, others remain available to serve satellites.<br />
• After compute power, the most important factor in selecting a server is the speed of its LAN adapter. Servers should be equipped with the highest-bandwidth LAN adapters in the cluster.<br />
Make sure you are logged in to a privileged account. Log in to a privileged account.<br />
Rules: If you are adding a satellite, you must be logged into the system manager’s account on a boot server. Note that the process privileges SYSPRV, OPER, CMKRNL, BYPASS, and NETMBX are required, because the procedure performs privileged system operations.<br />
Coordinate cluster common files. If your configuration has two or more system disks, follow the instructions in<br />
Chapter 5 to coordinate the cluster common files.<br />
Optionally, disable broadcast<br />
messages to your terminal.<br />
While adding and removing computers, many such messages are<br />
generated. To disable the messages, you can enter the DCL command<br />
REPLY/DISABLE=(NETWORK, CLUSTER). See also Section 10.6 for more<br />
information about controlling OPCOM messages.<br />
Predetermine answers to the<br />
questions asked by the cluster<br />
configuration procedure.<br />
Table 8–3 describes the data requested by the cluster configuration command<br />
procedures.<br />
8.1.2 Data Requested by the <strong>Cluster</strong> Configuration Procedures<br />
The following table describes the questions asked by the cluster configuration<br />
command procedures and describes how you might answer them. The table is<br />
supplied here so that you can determine answers to the questions before you<br />
invoke the procedure.<br />
Because many of the questions are configuration specific, Table 8–3 lists the<br />
questions according to configuration type, and not in the order they are asked.<br />
Table 8–3 Data Requested by CLUSTER_CONFIG_LAN.COM and CLUSTER_CONFIG.COM<br />
Information Required How to Specify or Obtain<br />
For all configurations<br />
Device name of cluster system disk on which root directories will be created: Press Return to accept the default device name, which is the translation of the SYS$SYSDEVICE: logical name, or specify a logical name that points to the common system disk.<br />
Computer’s root directory name on cluster system disk: Press Return to accept the procedure-supplied default, or specify a name in the form SYSx:<br />
• For computers with direct access to the system disk, x is a hexadecimal digit in the range of 1 through 9 or A through D (for example, SYS1 or SYSA).<br />
• For satellites, x must be in the range of 10 through FFFF.<br />
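The two SYSx ranges can be sketched as a small classifier. This is an illustrative Python sketch only, not part of the configuration procedure; the function name is hypothetical:

```python
# Sketch of the SYSx root-name ranges: SYS1-SYS9 and SYSA-SYSD are for
# computers with direct access to the system disk; SYS10-SYSFFFF are
# for satellites.

def root_kind(root):
    if not root.startswith("SYS"):
        return "invalid"
    x = root[3:]
    try:
        value = int(x, 16)          # x is interpreted as hexadecimal
    except ValueError:
        return "invalid"
    if len(x) == 1 and (1 <= value <= 9 or 0xA <= value <= 0xD):
        return "direct-access"
    if len(x) >= 2 and 0x10 <= value <= 0xFFFF:
        return "satellite"
    return "invalid"

print(root_kind("SYSA"))   # direct-access
print(root_kind("SYS1C"))  # satellite
print(root_kind("SYSE"))   # invalid (E is outside 1-9 and A-D)
```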
Workstation windowing system System manager specifies. Workstation software must be installed before<br />
workstation satellites are added. If it is not, the procedure indicates that fact.<br />
Location and sizes of page and swap files: This information is requested only when you add a computer to the cluster. Press Return to accept the default size and location. (The default sizes displayed in brackets by the procedure are minimum values. The default location is the device name of the cluster system disk.)<br />
If your configuration includes satellite nodes, you may realize a performance improvement by locating satellite page and swap files on a satellite’s local disk, if such a disk is available. The potential for performance improvement depends on the configuration of your <strong>OpenVMS</strong> <strong>Cluster</strong> system disk and network.<br />
To set up page and swap files on a satellite’s local disk, the cluster configuration procedure creates a command procedure called SATELLITE_PAGE.COM in the satellite’s [SYSn.SYSEXE] directory on the boot server’s system disk. The SATELLITE_PAGE.COM procedure performs the following functions:<br />
• Mounts the satellite’s local disk with a volume label that is unique in the cluster, in the format node-name_SCSSYSTEMID.<br />
Reference: Refer to Section 8.6.5 for information about altering the volume label.<br />
• Installs the page and swap files on the satellite’s local disk.<br />
Note: For page and swap disks that are shadowed, you must edit the MOUNT and INIT commands in SATELLITE_PAGE.COM to the appropriate syntax for mounting any specialized ‘‘local’’ disks (that is, host-based shadowing disks (DSxxx), host-based RAID disks (DPxxxx), or DECram disks (MDAxxxx)) on the newly added node. CLUSTER_CONFIG(_LAN).COM does not create the MOUNT and INIT commands required for SHADOW, RAID, or DECram disks.<br />
Note: To relocate the satellite’s page and swap files (for example, from the satellite’s local disk to the boot server’s system disk, or the reverse) or to change file sizes:<br />
1. Create new PAGE and SWAP files on a shared device, as shown:<br />
$ MCR SYSGEN CREATE device:[dir]PAGEFILE.SYS/SIZE=block-count<br />
Note: If page and swap files will be created for a shadow set, you must edit SATELLITE_PAGE accordingly.<br />
2. Rename the SYS$SPECIFIC:[SYSEXE]PAGEFILE.SYS and SWAPFILE.SYS files to PAGEFILE.TMP and SWAPFILE.TMP.<br />
3. Reboot, and then delete the .TMP files.<br />
4. Modify the SYS$MANAGER:SYPAGSWPFILES.COM procedure to load the files.<br />
Value for local computer’s allocation class (ALLOCLASS or TAPE_ALLOCLASS) parameter: The ALLOCLASS parameter can be used for a node allocation class or, on Alpha computers, a port allocation class. Refer to Section 6.2.1 for complete information about specifying allocation classes.<br />
Physical device name of quorum disk: System manager specifies.<br />
For systems running DECnet for <strong>OpenVMS</strong><br />
Computer’s DECnet node<br />
address for Phase IV<br />
For the DECnet node address, you obtain this information as follows:<br />
• If you are adding a computer, the network manager supplies the address.<br />
• If you are removing a computer, use the SHOW NETWORK command (as shown<br />
in Table 8–2).<br />
Computer’s DECnet node name Network manager supplies. The name must be from 1 to 6 alphanumeric characters<br />
and cannot include dollar signs ( $ ) or underscores ( _ ).<br />
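The node-name rule above can be sketched as a short check. This is an illustrative Python sketch only, not DECnet's own validation; the function name and sample names are hypothetical:

```python
# Sketch of the DECnet node-name rule: 1 to 6 alphanumeric characters,
# with no dollar signs or underscores.

def valid_decnet_node_name(name):
    # str.isalnum() rejects "$" and "_" as well as other punctuation.
    return 1 <= len(name) <= 6 and name.isalnum()

print(valid_decnet_node_name("JUPITR"))   # True
print(valid_decnet_node_name("MY_SYS"))   # False (underscore)
print(valid_decnet_node_name("NEPTUNE"))  # False (7 characters)
```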
For systems running DECnet–Plus<br />
Computer’s DECnet node<br />
address for Phase IV (if you<br />
need Phase IV compatibility)<br />
For the DECnet node address, you obtain this information as follows:<br />
• If you are adding a computer, the network manager supplies the address.<br />
• If you are removing a computer, use the SHOW NETWORK command (as shown<br />
in Table 8–2).<br />
Node’s DECnet full name Determine the full name with the help of your network manager. Enter a string<br />
comprised of:<br />
• The namespace name, ending with a colon (:). This is optional.<br />
• The root directory, designated by a period (.).<br />
• Zero or more hierarchical directories, designated by a character string followed<br />
by a period (.).<br />
• The simple name, a character string that, combined with the directory names,<br />
uniquely identifies the node. For example:<br />
.SALES.NETWORKS.MYNODE<br />
MEGA:.INDIANA.JONES<br />
COLUMBUS:.FLATWORLD<br />
SCS node name for this node Enter the <strong>OpenVMS</strong> <strong>Cluster</strong> node name, which is a string of 6 or fewer alphanumeric<br />
characters.<br />
DECnet synonym Press Return to define a DECnet synonym, which is a short name for the node’s full<br />
name. Otherwise, enter N.<br />
Synonym name for this node Enter a string of 6 or fewer alphanumeric characters. By default, it is the first 6 characters of the last simple name in the full name. For example:<br />
Full name: BIGBANG:.GALAXY.NOVA.BLACKHOLE<br />
Synonym: BLACKH<br />
Note: The node synonym does not need to be the same as the <strong>OpenVMS</strong> <strong>Cluster</strong> node name.<br />
MOP service client name for this node Enter the name for the node’s MOP service client when the node is configured as a boot server. By default, it is the <strong>OpenVMS</strong> <strong>Cluster</strong> node name (for example, the SCS node name). This name does not need to be the same as the <strong>OpenVMS</strong> <strong>Cluster</strong> node name.<br />
†DECnet–Plus full-name functionality is VAX specific.<br />
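The default-synonym rule can be sketched as follows. This is an illustrative Python sketch only, not the configuration procedure's code; the function name is hypothetical:

```python
# Sketch: derive the default synonym from a DECnet-Plus full name by
# taking the first 6 characters of the last simple name.

def default_synonym(full_name):
    # Drop the optional leading namespace ("NAME:"), then take the last
    # period-separated simple name.
    local_part = full_name.split(":", 1)[-1]
    simple_name = local_part.split(".")[-1]
    return simple_name[:6]

print(default_synonym("BIGBANG:.GALAXY.NOVA.BLACKHOLE"))  # BLACKH
print(default_synonym(".SALES.NETWORKS.MYNODE"))          # MYNODE
```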
For systems running TCP/IP or the LANCP Utility for satellite booting, or both<br />
Computer’s SCS node name (SCSNODE) and SCS system ID (SCSSYSTEMID)<br />
These prompts are described in Section 4.2.3. If a system is running TCP/IP,<br />
the procedure does not ask for a TCP/IP host name because a cluster node name<br />
(SCSNODE) does not have to match a TCP/IP host name. The TCP/IP host name<br />
might be longer than six characters, whereas the SCSNODE name must be no more<br />
than six characters. Note that if the system is running both DECnet and IP, then<br />
the procedure uses the DECnet defaults.<br />
For LAN configurations<br />
<strong>Cluster</strong> group number and password<br />
This information is requested only when the CHANGE option is chosen. See<br />
Section 2.5 for information about assigning cluster group numbers and passwords.<br />
Satellite’s LAN hardware address<br />
The address has the form xx-xx-xx-xx-xx-xx. You must include the hyphens when you<br />
specify a hardware address. Proceed as follows:<br />
• ‡On Alpha systems, enter the following command at the satellite’s console:<br />
>>> SHOW NETWORK<br />
Note that you can also use the SHOW CONFIG command.<br />
• †On MicroVAX II and VAXstation II satellite nodes, when the DECnet for<br />
<strong>OpenVMS</strong> network is running on a boot server, enter the following commands<br />
at the satellite’s console:<br />
>>> B/100 XQA0<br />
Bootfile: READ_ADDR<br />
• †On MicroVAX 2000 and VAXstation 2000 satellite nodes, when the DECnet for<br />
<strong>OpenVMS</strong> network is running on a boot server, enter the following commands<br />
at successive console-mode prompts:<br />
>>> T 53<br />
2 ?>>> 3<br />
>>> B/100 ESA0<br />
Bootfile: READ_ADDR<br />
If the second prompt appears as 3 ?>>>, press Return.<br />
• †On MicroVAX 3xxx and 4xxx series satellite nodes, enter the following<br />
command at the satellite’s console:<br />
>>> SHOW ETHERNET<br />
†DECnet–Plus full-name functionality is VAX specific.<br />
‡Alpha specific.<br />
8.1.3 Invoking the Procedure<br />
Once you have made the necessary preparations, you can invoke the cluster<br />
configuration procedure to configure your <strong>OpenVMS</strong> <strong>Cluster</strong> system. Log in to the<br />
system manager account and make sure your default is SYS$MANAGER. Then,<br />
invoke the procedure at the DCL command prompt as follows:<br />
$ @CLUSTER_CONFIG_LAN<br />
or<br />
$ @CLUSTER_CONFIG<br />
Caution: Do not invoke multiple sessions simultaneously. You can run only one<br />
cluster configuration session at a time.<br />
Once invoked, both procedures display the following information and menu.<br />
(The only difference between CLUSTER_CONFIG_LAN.COM and CLUSTER_CONFIG.COM<br />
at this point is the command procedure name that is displayed.)<br />
Depending on the menu option you select, the procedure interactively requests<br />
configuration information from you. (Predetermine your answers as described in<br />
Table 8–3.)<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster member’s characteristics.<br />
4. CREATE a second system disk for JUPITR.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]:<br />
.<br />
..<br />
This chapter contains a number of sample sessions showing how to run the<br />
cluster configuration procedures. Although the CLUSTER_CONFIG_LAN.COM<br />
and CLUSTER_CONFIG.COM procedures function the same for both Alpha<br />
and VAX systems, the questions and format may appear slightly different<br />
depending on the type of computer system.<br />
8.2 Adding Computers<br />
In most cases, you invoke either CLUSTER_CONFIG_LAN.COM or<br />
CLUSTER_CONFIG.COM on an active <strong>OpenVMS</strong> <strong>Cluster</strong> computer and select the ADD<br />
function to enable a computer as an <strong>OpenVMS</strong> <strong>Cluster</strong> member. However, in<br />
some circumstances, you may need to perform extra steps to add computers. Use<br />
the information in Table 8–4 to determine your actions.<br />
Table 8–4 Preparing to Add Computers to an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
IF... THEN...<br />
You are adding your first satellite<br />
node to the <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
Follow these steps:<br />
1. Log in to the computer that will be enabled as the cluster<br />
boot server.<br />
2. Invoke the cluster configuration procedure, and execute<br />
the CHANGE function described in Section 8.4 to enable<br />
the local computer as a boot server.<br />
3. After the CHANGE function completes, execute the ADD<br />
function to add satellites to the cluster.<br />
Table 8–4 (Cont.) Preparing to Add Computers to an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
IF... THEN...<br />
The cluster uses DECdtm services. You must create a transaction log for the computer when<br />
you have configured it into your cluster. For step-by-step<br />
instructions on how to do this, see the chapter on DECdtm<br />
services in the <strong>OpenVMS</strong> System Manager’s Manual.<br />
You add a CI connected computer that boots from a cluster common system disk.<br />
You must create a new default bootstrap command procedure<br />
for the computer before booting it into the cluster. For<br />
instructions, refer to your computer-specific installation and<br />
operations guide.<br />
You are adding computers to a cluster with more than one common system disk.<br />
You must use a different device name for each system disk<br />
on which computers are added. For this reason, the cluster<br />
configuration procedure supplies as a default device name<br />
the logical volume name (for example, DISK$MARS_SYS1) of<br />
SYS$SYSDEVICE: on the local system.<br />
Using different device names ensures that each computer<br />
added has a unique root directory specification, even if the<br />
system disks contain roots with the same name—for example,<br />
DISK$MARS_SYS1:[SYS10] and DISK$MARS_SYS2:[SYS10].<br />
You add a voting member to the cluster.<br />
You must, after the ADD function completes, reconfigure the<br />
cluster according to the instructions in Section 8.6.<br />
Caution: If either the local or the new computer fails before the ADD function<br />
completes, you must, after normal conditions are restored, perform the REMOVE<br />
option to erase any invalid data and then restart the ADD option. Section 8.3<br />
describes the REMOVE option.<br />
8.2.1 Controlling Conversational Bootstrap Operations<br />
When you add a satellite to the cluster using either cluster configuration<br />
command procedure, the procedure asks whether you want to allow<br />
conversational bootstrap operations for the satellite (default is No).<br />
If you select the default, the NISCS_CONV_BOOT system parameter in the<br />
satellite’s system parameter file remains set to 0 to disable such operations. The<br />
parameter file (ALPHAVMSSYS.PAR for Alpha systems or VAXVMSSYS.PAR for<br />
VAX systems) resides in the satellite’s root directory on a boot server’s system<br />
disk (device:[SYSx.SYSEXE]). You can enable conversational bootstrap operations<br />
for a given satellite at any time by setting this parameter to 1.<br />
Example:<br />
To enable such operations for an <strong>OpenVMS</strong> VAX satellite booted from root 10 on<br />
device $1$DJA11, you would proceed as follows:<br />
Step Action<br />
1 Log in as system manager on the boot server.<br />
2 †On VAX systems, invoke the System Generation utility (SYSGEN) and enter the following<br />
commands:<br />
$ RUN SYS$SYSTEM:SYSGEN<br />
SYSGEN> USE $1$DJA11:[SYS10.SYSEXE]VAXVMSSYS.PAR<br />
SYSGEN> SET NISCS_CONV_BOOT 1<br />
SYSGEN> WRITE $1$DJA11:[SYS10.SYSEXE]VAXVMSSYS.PAR<br />
SYSGEN> EXIT<br />
$<br />
†VAX specific<br />
‡On an Alpha satellite, enter the same commands, replacing VAXVMSSYS.PAR with<br />
ALPHAVMSSYS.PAR.<br />
3 Modify the satellite’s MODPARAMS.DAT file so that NISCS_CONV_BOOT is set to 1.<br />
‡Alpha specific<br />
8.2.2 Common AUTOGEN Parameter Files<br />
When adding a node or a satellite to an <strong>OpenVMS</strong> <strong>Cluster</strong>, the cluster<br />
configuration command procedure adds one of the following lines to the<br />
MODPARAMS.DAT file:<br />
WHEN the node being added is a... THEN...<br />
Satellite node The following line is added to the MODPARAMS.DAT file:<br />
AGEN$INCLUDE_PARAMS SYS$MANAGER:AGEN$NEW_SATELLITE_DEFAULTS.DAT<br />
Nonsatellite node The following line is added to the MODPARAMS.DAT file:<br />
AGEN$INCLUDE_PARAMS SYS$MANAGER:AGEN$NEW_NODE_DEFAULTS.DAT<br />
The AGEN$NEW_SATELLITE_DEFAULTS.DAT and AGEN$NEW_NODE_DEFAULTS.DAT<br />
files hold AUTOGEN parameter settings that are common<br />
to all satellite nodes or nonsatellite nodes in the cluster. Use of these files<br />
simplifies system management, because you can maintain common system<br />
parameters in either one or both of these files. When you add or change<br />
common parameters, you do not need to modify the MODPARAMS.DAT files<br />
located on every node in the cluster.<br />
Initially, these files contain no parameter settings. You edit the AGEN$NEW_<br />
SATELLITE_DEFAULTS.DAT and AGEN$NEW_NODE_DEFAULTS.DAT files,<br />
as appropriate, to add, modify, or edit system parameters. For example, you<br />
might edit the AGEN$NEW_SATELLITE_DEFAULTS.DAT file to set the<br />
MIN_GBLPAGECNT parameter to 5000. AUTOGEN makes the MIN_GBLPAGECNT<br />
parameter and all other parameter settings in the AGEN$NEW_SATELLITE_DEFAULTS.DAT<br />
file common to all satellite nodes in the cluster.<br />
AUTOGEN uses the parameter settings in the AGEN$NEW_SATELLITE_DEFAULTS.DAT<br />
or AGEN$NEW_NODE_DEFAULTS.DAT files the first time it<br />
is run, and with every subsequent execution.<br />
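To make the include mechanism concrete, here is a hypothetical sketch (the MIN_GBLPAGECNT value is only an illustration, not a recommended setting):<br />

```
! SYS$MANAGER:AGEN$NEW_SATELLITE_DEFAULTS.DAT
! Parameter settings shared by every satellite node in the cluster
MIN_GBLPAGECNT = 5000

! Line that the cluster configuration procedure adds to each
! satellite's MODPARAMS.DAT:
AGEN$INCLUDE_PARAMS SYS$MANAGER:AGEN$NEW_SATELLITE_DEFAULTS.DAT
```

The next time AUTOGEN runs on a satellite, it reads the include file and applies MIN_GBLPAGECNT along with any other settings placed there.<br />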
8.2.3 Examples<br />
Examples 8–1, 8–2, and 8–3 illustrate the use of CLUSTER_CONFIG.COM on<br />
JUPITR to add, respectively, a boot server running DECnet for <strong>OpenVMS</strong>, a boot<br />
server running DECnet–Plus, and a satellite node.<br />
Example 8–1 Sample Interactive CLUSTER_CONFIG.COM Session to Add a<br />
Computer as a Boot Server<br />
$ @CLUSTER_CONFIG.COM<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster node’s characteristics.<br />
4. CREATE a second system disk for JUPITR.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]: Return<br />
The ADD function adds a new node to the cluster.<br />
If the node being added is a voting member, EXPECTED_VOTES in all<br />
other cluster members’ MODPARAMS.DAT must be adjusted. You must then<br />
reconfigure the cluster, using the procedure described in the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> manual.<br />
If the new node is a satellite, the network databases on JUPITR are<br />
updated. The network databases on all other cluster members must be<br />
updated.<br />
For instructions, see the <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> manual.<br />
What is the node’s DECnet node name? SATURN<br />
What is the node’s DECnet address? 2.3<br />
Will SATURN be a satellite [Y]? N<br />
Will SATURN be a boot server [Y]? Return<br />
This procedure will now ask you for the device name of SATURN’s system root.<br />
The default device name (DISK$VAXVMSRL5:) is the logical volume name of<br />
SYS$SYSDEVICE:.<br />
What is the device name for SATURN’s system root [DISK$VAXVMSRL5:]? Return<br />
What is the name of the new system root [SYSA]? Return<br />
Creating directory tree SYSA...<br />
%CREATE-I-CREATED, $1$DJA11: created<br />
%CREATE-I-CREATED, $1$DJA11: created<br />
.<br />
..<br />
System root SYSA created.<br />
Enter a value for SATURN’s ALLOCLASS parameter: 1<br />
Does this cluster contain a quorum disk [N]? Y<br />
What is the device name of the quorum disk? $1$DJA12<br />
Updating network database...<br />
Size of page file for SATURN [10000 blocks]? 50000<br />
Size of swap file for SATURN [8000 blocks]? 20000<br />
Will a local (non-HSC) disk on SATURN be used for paging and swapping? N<br />
If you specify a device other than DISK$VAXVMSRL5: for SATURN’s<br />
page and swap files, this procedure will create PAGEFILE_SATURN.SYS<br />
and SWAPFILE_SATURN.SYS in the directory on the device you<br />
specify.<br />
What is the device name for the page and swap files [DISK$VAXVMSRL5:]? Return<br />
%SYSGEN-I-CREATED, $1$DJA11:PAGEFILE.SYS;1 created<br />
%SYSGEN-I-CREATED, $1$DJA11:SWAPFILE.SYS;1 created<br />
Example 8–1 (Cont.) Sample Interactive CLUSTER_CONFIG.COM Session to<br />
Add a Computer as a Boot Server<br />
The configuration procedure has completed successfully.<br />
SATURN has been configured to join the cluster.<br />
Before booting SATURN, you must create a new default bootstrap<br />
command procedure for SATURN. See your processor-specific<br />
installation and operations guide for instructions.<br />
The first time SATURN boots, NET$CONFIGURE.COM and<br />
AUTOGEN.COM will run automatically.<br />
The following parameters have been set for SATURN:<br />
VOTES = 1<br />
QDSKVOTES = 1<br />
After SATURN has booted into the cluster, you must increment<br />
the value for EXPECTED_VOTES in every cluster member’s<br />
MODPARAMS.DAT. You must then reconfigure the cluster, using the<br />
procedure described in the <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> manual.<br />
Example 8–2 Sample Interactive CLUSTER_CONFIG.COM Session to Add a Computer<br />
Running DECnet–Plus<br />
$ @CLUSTER_CONFIG.COM<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster node’s characteristics.<br />
4. CREATE a second system disk for JUPITR.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]: Return<br />
The ADD function adds a new node to the cluster.<br />
If the node being added is a voting member, EXPECTED_VOTES in all<br />
other cluster members’ MODPARAMS.DAT must be adjusted, and the<br />
cluster must be rebooted.<br />
If the new node is a satellite, the network databases on JUPITR are<br />
updated. The network databases on all other cluster members must be<br />
updated.<br />
For instructions, see the <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> manual.<br />
For additional networking information, please refer to the<br />
DECnet-Plus Network Management manual.<br />
What is the node’s DECnet fullname? OMNI:.DISCOVERY.SATURN<br />
What is the SCS node name for this node? SATURN<br />
Do you want to define a DECnet synonym [Y]? Y<br />
What is the synonym name for this node [SATURN]? SATURN<br />
What is the MOP service client name for this node [SATURN]? VENUS<br />
What is the node’s DECnet Phase IV address? 17.129<br />
Will SATURN be a satellite [Y]? N<br />
Will SATURN be a boot server [Y]? Return<br />
Example 8–2 (Cont.) Sample Interactive CLUSTER_CONFIG.COM Session to Add a Computer<br />
Running DECnet–Plus<br />
This procedure will now ask you for the device name of SATURN’s system root.<br />
The default device name (DISK$VAXVMSRL5:) is the logical volume name of<br />
SYS$SYSDEVICE:.<br />
What is the device name for SATURN’s system root [DISK$VAXVMSRL5:]? Return<br />
What is the name of the new system root [SYSA]? Return<br />
Creating directory tree SYSA...<br />
%CREATE-I-CREATED, $1$DJA11: created<br />
%CREATE-I-CREATED, $1$DJA11: created<br />
.<br />
..<br />
System root SYSA created.<br />
Enter a value for SATURN’s ALLOCLASS parameter: 1<br />
Does this cluster contain a quorum disk [N]? Y<br />
What is the device name of the quorum disk? $1$DJA12<br />
Updating network database...<br />
Size of page file for SATURN [10000 blocks]? 50000<br />
Size of swap file for SATURN [8000 blocks]? 20000<br />
Will a local (non-HSC) disk on SATURN be used for paging and swapping? N<br />
If you specify a device other than DISK$VAXVMSRL5: for SATURN’s<br />
page and swap files, this procedure will create PAGEFILE_SATURN.SYS<br />
and SWAPFILE_SATURN.SYS in the directory on the device you<br />
specify.<br />
What is the device name for the page and swap files [DISK$VAXVMSRL5:]? Return<br />
%SYSGEN-I-CREATED, $1$DJA11:PAGEFILE.SYS;1 created<br />
%SYSGEN-I-CREATED, $1$DJA11:SWAPFILE.SYS;1 created<br />
The configuration procedure has completed successfully.<br />
SATURN has been configured to join the cluster.<br />
Before booting SATURN, you must create a new default bootstrap<br />
command procedure for SATURN. See your processor-specific<br />
installation and operations guide for instructions.<br />
The first time SATURN boots, NETCONFIG.COM and<br />
AUTOGEN.COM will run automatically.<br />
The following parameters have been set for SATURN:<br />
VOTES = 1<br />
QDSKVOTES = 1<br />
After SATURN has booted into the cluster, you must increment<br />
the value for EXPECTED_VOTES in every cluster member’s<br />
MODPARAMS.DAT. You must then reconfigure the cluster, using the<br />
procedure described in the <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> manual.<br />
Example 8–3 Sample Interactive CLUSTER_CONFIG.COM Session to Add a VAX Satellite with<br />
Local Page and Swap Files<br />
$ @CLUSTER_CONFIG.COM<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster node’s characteristics.<br />
4. CREATE a second system disk for JUPITR.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]: Return<br />
The ADD function adds a new node to the cluster.<br />
If the node being added is a voting member, EXPECTED_VOTES in all<br />
other cluster members’ MODPARAMS.DAT must be adjusted, and the<br />
cluster must be rebooted.<br />
If the new node is a satellite, the network databases on JUPITR are<br />
updated. The network databases on all other cluster members must be<br />
updated.<br />
For instructions, see the <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> manual.<br />
What is the node’s DECnet node name? EUROPA<br />
What is the node’s DECnet address? 2.21<br />
Will EUROPA be a satellite [Y]? Return<br />
Verifying circuits in network database...<br />
This procedure will now ask you for the device name of EUROPA’s system root.<br />
The default device name (DISK$VAXVMSRL5:) is the logical volume name of<br />
SYS$SYSDEVICE:.<br />
What is the device name for EUROPA’S system root [DISK$VAXVMSRL5:]? Return<br />
What is the name of the new system root [SYS10]? Return<br />
Allow conversational bootstraps on EUROPA [NO]? Return<br />
The following workstation windowing options are available:<br />
1. No workstation software<br />
2. VWS Workstation Software<br />
3. DECwindows Workstation Software<br />
Enter choice [1]: 3<br />
Creating directory tree SYS10...<br />
%CREATE-I-CREATED, $1$DJA11: created<br />
%CREATE-I-CREATED, $1$DJA11: created<br />
.<br />
..<br />
System root SYS10 created.<br />
Will EUROPA be a disk server [N]? Return<br />
What is EUROPA’s Ethernet hardware address? 08-00-2B-03-51-75<br />
Updating network database...<br />
Size of pagefile for EUROPA [10000 blocks]? 20000<br />
Size of swap file for EUROPA [8000 blocks]? 12000<br />
Will a local disk on EUROPA be used for paging and swapping? YES<br />
Creating temporary page file in order to boot EUROPA for the first time...<br />
%SYSGEN-I-CREATED, $1$DJA11:PAGEFILE.SYS;1 created<br />
This procedure will now wait until EUROPA joins the cluster.<br />
Once EUROPA joins the cluster, this procedure will ask you<br />
to specify a local disk on EUROPA for paging and swapping.<br />
Example 8–3 (Cont.) Sample Interactive CLUSTER_CONFIG.COM Session to Add a VAX<br />
Satellite with Local Page and Swap Files<br />
Please boot EUROPA now.<br />
Waiting for EUROPA to boot...<br />
.<br />
..<br />
(User enters boot command at satellite’s console-mode prompt (>>>).)<br />
. ..<br />
The local disks on EUROPA are:<br />
Device Device Error Volume Free Trans Mnt<br />
Name Status Count Label Blocks Count Cnt<br />
EUROPA$DUA0: Online 0<br />
EUROPA$DUA1: Online 0<br />
Which disk can be used for paging and swapping? EUROPA$DUA0:<br />
May this procedure INITIALIZE EUROPA$DUA0: [YES]? NO<br />
Mounting EUROPA$DUA0:...<br />
PAGEFILE.SYS already exists on EUROPA$DUA0:<br />
***************************************<br />
Directory EUROPA$DUA0:[SYS0.SYSEXE]<br />
PAGEFILE.SYS;1 23600/23600<br />
Total of 1 file, 23600/23600 blocks.<br />
***************************************<br />
What is the file specification for the page file on<br />
EUROPA$DUA0: [ PAGEFILE.SYS ]? Return<br />
%CREATE-I-EXISTS, EUROPA$DUA0: already exists<br />
This procedure will use the existing pagefile,<br />
EUROPA$DUA0:PAGEFILE.SYS;.<br />
SWAPFILE.SYS already exists on EUROPA$DUA0:<br />
***************************************<br />
Directory EUROPA$DUA0:[SYS0.SYSEXE]<br />
SWAPFILE.SYS;1 12000/12000<br />
Total of 1 file, 12000/12000 blocks.<br />
***************************************<br />
What is the file specification for the swap file on<br />
EUROPA$DUA0: [ SWAPFILE.SYS ]? Return<br />
This procedure will use the existing swapfile,<br />
EUROPA$DUA0:SWAPFILE.SYS;.<br />
AUTOGEN will now reconfigure and reboot EUROPA automatically.<br />
These operations will complete in a few minutes, and a<br />
completion message will be displayed at your terminal.<br />
The configuration procedure has completed successfully.<br />
8.2.4 Adding a Quorum Disk<br />
To enable a quorum disk on a node or nodes, use the cluster configuration<br />
procedure as described in Table 8–5.<br />
Table 8–5 Preparing to Add a Quorum Disk Watcher<br />
IF... THEN...<br />
Other cluster nodes are already enabled as quorum disk watchers.<br />
Perform the following steps:<br />
1. Log in to the computer that is to be enabled as the quorum<br />
disk watcher and run CLUSTER_CONFIG_LAN.COM or<br />
CLUSTER_CONFIG.COM.<br />
2. Execute the CHANGE function and select menu item 7 to<br />
enable a quorum disk. (See Section 8.4.)<br />
3. Update the current system parameters and reboot the node.<br />
(See Section 8.6.1.)<br />
The cluster does not contain any quorum disk watchers.<br />
Perform the following steps:<br />
1. Perform the preceding steps 1 and 2 for each node to be<br />
enabled as a quorum disk watcher.<br />
2. Reconfigure the cluster according to the instructions in<br />
Section 8.6.<br />
8.3 Removing Computers<br />
To disable a computer as an <strong>OpenVMS</strong> <strong>Cluster</strong> member:<br />
1. Determine whether removing a member will cause you to lose quorum. Use<br />
the SHOW CLUSTER command to display the CL_QUORUM and CL_VOTES<br />
values.<br />
IF removing members... THEN...<br />
Will cause you to lose quorum<br />
Perform the steps in the following list:<br />
Caution: Do not perform these steps until you are ready to<br />
reboot the entire <strong>OpenVMS</strong> <strong>Cluster</strong> system. Because you are<br />
reducing quorum for the cluster, the votes cast by the node being<br />
removed could cause a cluster partition to be formed.<br />
• Reset the EXPECTED_VOTES parameter in the<br />
AUTOGEN parameter files and current system parameter<br />
files (see Section 8.6.1).<br />
• Shut down the cluster (see Section 8.6.2), and reboot<br />
without the node that is being removed.<br />
Note: Be sure that you do not specify an automatic reboot<br />
on that node.<br />
Will not cause you to lose quorum<br />
Proceed as follows:<br />
• Perform an orderly shutdown on the node being removed by<br />
invoking the SYS$SYSTEM:SHUTDOWN.COM command<br />
procedure (described in Section 8.6.3).<br />
• If the node was a voting member, use the DCL command<br />
SET CLUSTER/EXPECTED_VOTES to reduce the value of<br />
quorum.<br />
Reference: Refer also to Section 10.12 for information about adjusting<br />
expected votes.<br />
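The quorum arithmetic behind this decision can be sketched in a few lines. OpenVMS derives quorum from EXPECTED_VOTES as (EXPECTED_VOTES + 2)/2, truncated to an integer; the helper functions below are illustrative only and are not part of any OpenVMS utility:<br />

```python
def quorum(expected_votes: int) -> int:
    """Quorum as OpenVMS derives it: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2

def removal_keeps_quorum(cl_votes: int, cl_quorum: int, votes_removed: int) -> bool:
    """True if the votes remaining after the removal still meet quorum."""
    return (cl_votes - votes_removed) >= cl_quorum

# A cluster with EXPECTED_VOTES = 3 has CL_QUORUM = 2.
print(quorum(3))                        # 2
print(removal_keeps_quorum(3, 2, 1))    # True: 2 votes remain
print(removal_keeps_quorum(3, 2, 2))    # False: quorum would be lost
```

If the second check fails, follow the steps for losing quorum before removing the member.<br />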
2. Invoke CLUSTER_CONFIG_LAN.COM or CLUSTER_CONFIG.COM on an<br />
active <strong>OpenVMS</strong> <strong>Cluster</strong> computer and select the REMOVE option.<br />
3. Use the information in Table 8–6 to determine whether additional actions are<br />
required.<br />
Table 8–6 Preparing to Remove Computers from an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
IF... THEN...<br />
You are removing a voting member. You must, after the REMOVE function completes, reconfigure<br />
the cluster according to the instructions in Section 8.6.<br />
The page and swap files for the computer being removed do not reside on the same disk as the computer’s root directory tree.<br />
The REMOVE function does not delete these files. It displays<br />
a message warning that the files will not be deleted, as in<br />
Example 8–4. If you want to delete the files, you must do so<br />
after the REMOVE function completes.<br />
You are removing a computer from a cluster that uses DECdtm services.<br />
Make sure that you have followed the step-by-step instructions<br />
in the chapter on DECdtm services in the <strong>OpenVMS</strong> System<br />
Manager’s Manual. These instructions describe how to remove<br />
a computer safely from the cluster, thereby preserving the<br />
integrity of your data.<br />
Note: When the REMOVE function deletes the computer’s entire root directory<br />
tree, it generates <strong>OpenVMS</strong> RMS informational messages while deleting the<br />
directory files. You can ignore these messages.<br />
8.3.1 Example<br />
Example 8–4 illustrates the use of CLUSTER_CONFIG.COM on JUPITR to<br />
remove satellite EUROPA from the cluster.<br />
Example 8–4 Sample Interactive CLUSTER_CONFIG.COM Session to Remove a Satellite with<br />
Local Page and Swap Files<br />
$ @CLUSTER_CONFIG.COM<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster node’s characteristics.<br />
4. CREATE a second system disk for JUPITR.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]: 2<br />
Example 8–4 (Cont.) Sample Interactive CLUSTER_CONFIG.COM Session to Remove a<br />
Satellite with Local Page and Swap Files<br />
The REMOVE function disables a node as a cluster member.<br />
o It deletes the node’s root directory tree.<br />
o It removes the node’s network information<br />
from the network database.<br />
If the node being removed is a voting member, you must adjust<br />
EXPECTED_VOTES in each remaining cluster member’s MODPARAMS.DAT.<br />
You must then reconfigure the cluster, using the procedure described<br />
in the <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> manual.<br />
What is the node’s DECnet node name? EUROPA<br />
Verifying network database...<br />
Verifying that SYS10 is EUROPA’s root...<br />
WARNING - EUROPA’s page and swap files will not be deleted.<br />
They do not reside on $1$DJA11:.<br />
Deleting directory tree SYS10...<br />
%DELETE-I-FILDEL, $1$DJA11:SYSCBI.DIR;1 deleted (1 block)<br />
%DELETE-I-FILDEL, $1$DJA11:SYSERR.DIR;1 deleted (1 block)<br />
.<br />
..<br />
System root SYS10 deleted.<br />
Updating network database...<br />
The configuration procedure has completed successfully.<br />
8.3.2 Removing a Quorum Disk<br />
To disable a quorum disk on a node or nodes, use the cluster configuration<br />
command procedure as described in Table 8–7.<br />
Table 8–7 Preparing to Remove a Quorum Disk Watcher<br />
IF other cluster nodes will still be enabled as quorum disk watchers, THEN perform the following steps:<br />
1. Log in to the computer that is to be disabled as the quorum disk watcher and run CLUSTER_CONFIG_LAN.COM or CLUSTER_CONFIG.COM.<br />
2. Execute the CHANGE function and select menu item 8 to disable a quorum disk (see Section 8.4).<br />
3. Reboot the node (see Section 8.6.7).<br />
IF all quorum disk watchers will be disabled, THEN perform the following steps:<br />
1. Perform the preceding steps 1 and 2 for all computers with the quorum disk enabled.<br />
2. Reconfigure the cluster according to the instructions in Section 8.6.<br />
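When you disable the quorum disk (step 2 in Table 8–7), the cluster configuration procedure records the change in MODPARAMS.DAT; the appended lines might look like the following (illustrative):<br />
DISK_QUORUM = " "  ! Blank value disables the quorum disk watcher<br />
QDSKVOTES = 1<br />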
8.4 Changing Computer Characteristics<br />
As your processing needs change, you may want to add satellites to an existing<br />
<strong>OpenVMS</strong> <strong>Cluster</strong>, or you may want to change an <strong>OpenVMS</strong> <strong>Cluster</strong> that is based<br />
on one interconnect (such as the CI or DSSI interconnect, or HSC subsystem) to<br />
include several interconnects.<br />
Table 8–8 describes the operations you can accomplish when you select the<br />
CHANGE option from the main menu of the cluster configuration command<br />
procedure.<br />
Note: All operations except changing a satellite’s LAN (Ethernet or FDDI)<br />
hardware address must be executed on the computer whose characteristics you<br />
want to change.<br />
Table 8–8 CHANGE Options of the <strong>Cluster</strong> Configuration Procedure<br />
Option: Enable the local computer as a disk server<br />
Operation performed: Loads the MSCP server by setting, in MODPARAMS.DAT, the value of the MSCP_LOAD parameter to 1 and the MSCP_SERVE_ALL parameter to 1 or 2.<br />
Option: Disable the local computer as a disk server<br />
Operation performed: Sets MSCP_LOAD to 0.<br />
Option: Enable the local computer as a boot server<br />
Operation performed: If you are setting up an <strong>OpenVMS</strong> <strong>Cluster</strong> that includes satellites, you must perform this operation once before you attempt to add satellites to the cluster. You thereby enable MOP service for the LAN adapter circuit that the computer uses to service operating system load requests from satellites. When you enable the computer as a boot server, it automatically becomes a disk server (if it is not one already) because it must serve its system disk to satellites.<br />
Option: Disable the local computer as a boot server<br />
Operation performed: Disables DECnet MOP service for the computer’s adapter circuit.<br />
Option: Enable the LAN for cluster communications on the local computer<br />
Operation performed: Loads the port driver PEDRIVER by setting the value of the NISCS_LOAD_PEA0 parameter to 1 in MODPARAMS.DAT. Creates the cluster security database file, SYS$COMMON:[SYSEXE]CLUSTER_AUTHORIZE.DAT, on the local computer’s system disk.<br />
Caution: The VAXCLUSTER system parameter must be set to 2 if the NISCS_LOAD_PEA0 parameter is set to 1. This ensures coordinated access to shared resources in the cluster and prevents accidental data corruption.<br />
Option: Disable the LAN for cluster communications on the local computer<br />
Operation performed: Sets NISCS_LOAD_PEA0 to 0.<br />
Option: Enable a quorum disk on the local computer<br />
Operation performed: In MODPARAMS.DAT, sets the DISK_QUORUM system parameter to a device name; sets the value of QDSKVOTES to 1 (default value).<br />
Option: Disable a quorum disk on the local computer<br />
Operation performed: In MODPARAMS.DAT, sets a blank value for the DISK_QUORUM system parameter; sets the value of QDSKVOTES to 1.<br />
Option: Change a satellite’s LAN hardware address<br />
Operation performed: Changes a satellite’s hardware address if its LAN device needs replacement. Both the permanent and volatile network databases and NETNODE_UPDATE.COM are updated on the local computer.<br />
Rule: You must perform this operation on each computer enabled as a boot server for the satellite.<br />
Option: Enable the local computer as a tape server<br />
Operation performed: Loads the TMSCP server by setting, in MODPARAMS.DAT, the value of the TMSCP_LOAD parameter to 1 and the TMSCP_SERVE_ALL parameter to 1 or 2.<br />
Option: Disable the local computer as a tape server<br />
Operation performed: Sets TMSCP_LOAD to zero.<br />
Option: Change the local computer’s node allocation class value<br />
Operation performed: Sets a value for the computer’s ALLOCLASS parameter in MODPARAMS.DAT.<br />
Option: Change the local computer’s tape allocation class value<br />
Operation performed: Sets a value from 1 to 255 for the computer’s TAPE_ALLOCLASS parameter in MODPARAMS.DAT. The default value is zero. You must specify a nonzero tape allocation class parameter if this node is locally connected to a dual-ported tape, or if it will be serving any multiple-host tapes (for example, TFnn or HSC connected tapes) to other cluster members. Satellites usually have TAPE_ALLOCLASS set to zero.<br />
Option: Change the local computer’s port allocation class value<br />
Operation performed: Sets a value for the computer’s ALLOCLASS parameter in MODPARAMS.DAT for all devices attached to it.<br />
Option: Enable MEMORY CHANNEL<br />
Operation performed: Sets MC_SERVICES_P2 to 1 to load the PMDRIVER (PMA0) cluster driver. This system parameter enables MEMORY CHANNEL on the local computer for node-to-node cluster communications.<br />
Option: Disable MEMORY CHANNEL<br />
Operation performed: Sets MC_SERVICES_P2 to 0 so that the PMDRIVER (PMA0) cluster driver is not loaded. The setting of 0 disables MEMORY CHANNEL on the local computer as the node-to-node cluster communications interconnect.<br />
8.4.1 Preparation<br />
You usually need to perform a number of steps before using the cluster<br />
configuration command procedure to change the configuration of your existing<br />
cluster.<br />
Table 8–9 suggests several typical configuration changes and describes the<br />
procedures required to make them.<br />
Table 8–9 Tasks Involved in Changing <strong>OpenVMS</strong> <strong>Cluster</strong> Configurations<br />
Task: Add satellite nodes<br />
Procedure: Perform these operations on the computer that will be enabled as a cluster boot server:<br />
1. Execute the CHANGE function to enable the first installed computer as a boot server (see Example 8–7).<br />
2. Execute the ADD function to add the satellite (as described in Section 8.2).<br />
3. Reconfigure the cluster according to the postconfiguration instructions in Section 8.6.<br />
Task: Change an existing CI or DSSI cluster to include satellite nodes<br />
Procedure: To enable cluster communications over the LAN (Ethernet or FDDI) on all computers, and to enable one or more computers as boot servers, proceed as follows:<br />
1. Log in as system manager on each computer, invoke either CLUSTER_CONFIG_LAN.COM or CLUSTER_CONFIG.COM, and execute the CHANGE function to enable LAN communications.<br />
Rule: You must perform this operation on all computers.<br />
Note: You must establish a cluster group number and password on all system disks in the <strong>OpenVMS</strong> <strong>Cluster</strong> before you can successfully add a satellite node using the CHANGE function of the cluster configuration procedure.<br />
2. Execute the CHANGE function to enable one or more computers as boot servers.<br />
3. Reconfigure the cluster according to the postconfiguration instructions in Section 8.6.<br />
Task: Change an existing LAN-based cluster to include CI and DSSI interconnects<br />
Procedure: Before performing the operations described here, be sure that the computers and HSC subsystems or RF disks you intend to include in your new configuration are correctly installed and checked for proper operation.<br />
The method you use to include CI and DSSI interconnects with an existing LAN-based cluster configuration depends on whether your current boot server is capable of being configured as a CI or DSSI computer.<br />
Note: The following procedures assume that the system disk containing satellite roots will reside on an HSC disk (for CI configurations) or an RF disk (for DSSI configurations).<br />
• If the boot server can be configured as a CI or DSSI computer, proceed as<br />
follows:<br />
1. Log in as system manager on the boot server and perform an image<br />
backup operation to back up the current system disk to a disk on an<br />
HSC subsystem or RF storage device. (For more information about backup<br />
operations, refer to the <strong>OpenVMS</strong> System Management Utilities Reference<br />
Manual.)<br />
2. Modify the computer’s default bootstrap command procedure to boot the<br />
computer from the HSC or RF disk, according to the instructions in the<br />
appropriate system-specific installation and operations guide.<br />
3. Shut down the cluster. Shut down the satellites first, and then shut down<br />
the boot server.<br />
4. Boot the boot server from the newly created system disk on the HSC or RF<br />
storage subsystem.<br />
5. Reboot the satellites.<br />
• If your current boot server cannot be configured as a CI or a DSSI computer,<br />
proceed as follows:<br />
1. Shut down the old local area cluster. Shut down the satellites first, and<br />
then shut down the boot server.<br />
2. Install the <strong>OpenVMS</strong> operating system on the new CI computer’s HSC<br />
system disk or on the new DSSI computer’s RF disk, as appropriate. When<br />
the installation procedure asks whether you want to enable the LAN for<br />
cluster communications, answer YES.<br />
3. When the installation completes, log in as system manager, and configure<br />
and start the DECnet for <strong>OpenVMS</strong> network as described in Chapter 4.<br />
4. Execute the CHANGE function of either CLUSTER_CONFIG_LAN.COM or<br />
CLUSTER_CONFIG.COM to enable the computer as a boot server.<br />
5. Log in as system manager on the newly added computer and execute the<br />
ADD function of either CLUSTER_CONFIG_LAN.COM or CLUSTER_<br />
CONFIG.COM to add the former LAN cluster members (including the<br />
former boot server) as satellites.<br />
• Reconfigure the cluster according to the postconfiguration instructions in<br />
Section 8.6.<br />
Task: Convert a standalone computer to an <strong>OpenVMS</strong> <strong>Cluster</strong> computer<br />
Procedure: Execute either CLUSTER_CONFIG_LAN.COM or CLUSTER_CONFIG.COM on a standalone computer to perform either of the following operations:<br />
• Add the standalone computer with its own system disk to an existing cluster.<br />
• Set up the standalone computer to form a new cluster if the computer was not set up as a cluster computer during installation of the operating system.<br />
Then reconfigure the cluster according to the postconfiguration instructions in Section 8.6.<br />
See Example 8–11, which illustrates the use of CLUSTER_CONFIG.COM on standalone computer PLUTO to convert PLUTO to a cluster boot server.<br />
If your cluster uses DECdtm services, you must create a transaction log for the computer when you have configured it into your cluster. For step-by-step instructions on how to do this, see the chapter on DECdtm services in the <strong>OpenVMS</strong> System Manager’s Manual.<br />
Task: Enable or disable disk-serving or tape-serving functions<br />
Procedure: After invoking either CLUSTER_CONFIG_LAN.COM or CLUSTER_CONFIG.COM to enable or disable the disk or tape serving functions, run AUTOGEN with the REBOOT option to reboot the local computer (see Section 8.6.1).<br />
Note: When the cluster configuration command procedure sets or changes values in MODPARAMS.DAT, the new values are always appended at the end of the file so that they override earlier values. You may want to edit the file occasionally and delete lines that specify earlier values.<br />
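For example, after the ALLOCLASS value has been changed twice, MODPARAMS.DAT might contain both of the following lines; because later values override earlier ones, AUTOGEN uses the value 2:<br />
ALLOCLASS = 1<br />
ALLOCLASS = 2  ! Appended later by the configuration procedure; this value takes effect<br />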
8.4.2 Examples<br />
Examples 8–5 through 8–11 show the use of CLUSTER_CONFIG.COM to perform<br />
the following operations:<br />
• Enable a computer as a disk server (Example 8–5).<br />
• Change a computer’s ALLOCLASS value (Example 8–6).<br />
• Enable a computer as a boot server (Example 8–7).<br />
• Specify a new hardware address for a satellite node that boots from a common<br />
system disk (Example 8–8).<br />
• Enable a computer as a tape server (Example 8–9).<br />
• Change a computer’s TAPE_ALLOCLASS value (Example 8–10).<br />
• Convert a standalone computer to a cluster boot server (Example 8–11).<br />
Example 8–5 Sample Interactive CLUSTER_CONFIG.COM Session to Enable the Local<br />
Computer as a Disk Server<br />
$ @CLUSTER_CONFIG.COM<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster node’s characteristics.<br />
4. CREATE a second system disk for URANUS.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]: 3<br />
CHANGE Menu<br />
1. Enable URANUS as a disk server.<br />
2. Disable URANUS as a disk server.<br />
3. Enable URANUS as a boot server.<br />
4. Disable URANUS as a boot server.<br />
5. Enable the LAN for cluster communications on URANUS.<br />
6. Disable the LAN for cluster communications on URANUS.<br />
7. Enable a quorum disk on URANUS.<br />
8. Disable a quorum disk on URANUS.<br />
9. Change URANUS’s satellite’s Ethernet or FDDI hardware address.<br />
10. Enable URANUS as a tape server.<br />
11. Disable URANUS as a tape server.<br />
12. Change URANUS’s ALLOCLASS value.<br />
13. Change URANUS’s TAPE_ALLOCLASS value.<br />
14. Change URANUS’s shared SCSI port allocation class value.<br />
15. Enable Memory Channel for cluster communications on Uranus.<br />
16. Disable Memory Channel for cluster communications on Uranus.<br />
Enter choice [1]: Return<br />
Will URANUS serve HSC disks [Y]? Return<br />
Enter a value for URANUS’s ALLOCLASS parameter: 2<br />
The configuration procedure has completed successfully.<br />
URANUS has been enabled as a disk server. MSCP_LOAD has been<br />
set to 1 in MODPARAMS.DAT. Please run AUTOGEN to reboot URANUS:<br />
$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT<br />
If you have changed URANUS’s ALLOCLASS value, you must reconfigure the<br />
cluster, using the procedure described in the <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> manual.<br />
Example 8–6 Sample Interactive CLUSTER_CONFIG.COM Session to Change the Local<br />
Computer’s ALLOCLASS Value<br />
$ @CLUSTER_CONFIG.COM<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster node’s characteristics.<br />
4. CREATE a second system disk for URANUS.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]: 3<br />
CHANGE Menu<br />
1. Enable URANUS as a disk server.<br />
2. Disable URANUS as a disk server.<br />
3. Enable URANUS as a boot server.<br />
4. Disable URANUS as a boot server.<br />
5. Enable the LAN for cluster communications on URANUS.<br />
6. Disable the LAN for cluster communications on URANUS.<br />
7. Enable a quorum disk on URANUS.<br />
8. Disable a quorum disk on URANUS.<br />
9. Change URANUS’s satellite’s Ethernet or FDDI hardware address.<br />
10. Enable URANUS as a tape server.<br />
11. Disable URANUS as a tape server.<br />
12. Change URANUS’s ALLOCLASS value.<br />
13. Change URANUS’s TAPE_ALLOCLASS value.<br />
14. Change URANUS’s shared SCSI port allocation class value.<br />
15. Enable Memory Channel for cluster communications on Uranus.<br />
16. Disable Memory Channel for cluster communications on Uranus.<br />
Enter choice [1]: 12<br />
Enter a value for URANUS’s ALLOCLASS parameter [2]: 1<br />
The configuration procedure has completed successfully.<br />
If you have changed URANUS’s ALLOCLASS value, you must reconfigure the<br />
cluster, using the procedure described in the <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> manual.<br />
Example 8–7 Sample Interactive CLUSTER_CONFIG.COM Session to Enable the Local<br />
Computer as a Boot Server<br />
$ @CLUSTER_CONFIG.COM<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster node’s characteristics.<br />
4. CREATE a second system disk for URANUS.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]: 3<br />
CHANGE Menu<br />
1. Enable URANUS as a disk server.<br />
2. Disable URANUS as a disk server.<br />
3. Enable URANUS as a boot server.<br />
4. Disable URANUS as a boot server.<br />
5. Enable the LAN for cluster communications on URANUS.<br />
6. Disable the LAN for cluster communications on URANUS.<br />
7. Enable a quorum disk on URANUS.<br />
8. Disable a quorum disk on URANUS.<br />
9. Change URANUS’s satellite’s Ethernet or FDDI hardware address.<br />
10. Enable URANUS as a tape server.<br />
11. Disable URANUS as a tape server.<br />
12. Change URANUS’s ALLOCLASS value.<br />
13. Change URANUS’s TAPE_ALLOCLASS value.<br />
14. Change URANUS’s shared SCSI port allocation class value.<br />
15. Enable Memory Channel for cluster communications on Uranus.<br />
16. Disable Memory Channel for cluster communications on Uranus.<br />
Enter choice [1]: 3<br />
Verifying circuits in network database...<br />
Updating permanent network database...<br />
In order to enable or disable DECnet MOP service in the volatile<br />
network database, DECnet traffic must be interrupted temporarily.<br />
Do you want to proceed [Y]? Return<br />
Enter a value for URANUS’s ALLOCLASS parameter [1]: Return<br />
The configuration procedure has completed successfully.<br />
URANUS has been enabled as a boot server. Disk serving and<br />
Ethernet capabilities are enabled automatically. If URANUS was<br />
not previously set up as a disk server, please run AUTOGEN to<br />
reboot URANUS:<br />
$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT<br />
If you have changed URANUS’s ALLOCLASS value, you must reconfigure the<br />
cluster, using the procedure described in the <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong><br />
manual.<br />
Example 8–8 Sample Interactive CLUSTER_CONFIG.COM Session to Change a Satellite’s<br />
Hardware Address<br />
$ @CLUSTER_CONFIG.COM<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster node’s characteristics.<br />
4. CREATE a second system disk for URANUS.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]: 3<br />
CHANGE Menu<br />
1. Enable URANUS as a disk server.<br />
2. Disable URANUS as a disk server.<br />
3. Enable URANUS as a boot server.<br />
4. Disable URANUS as a boot server.<br />
5. Enable the LAN for cluster communications on URANUS.<br />
6. Disable the LAN for cluster communications on URANUS.<br />
7. Enable a quorum disk on URANUS.<br />
8. Disable a quorum disk on URANUS.<br />
9. Change URANUS’s satellite’s Ethernet or FDDI hardware address.<br />
10. Enable URANUS as a tape server.<br />
11. Disable URANUS as a tape server.<br />
12. Change URANUS’s ALLOCLASS value.<br />
13. Change URANUS’s TAPE_ALLOCLASS value.<br />
14. Change URANUS’s shared SCSI port allocation class value.<br />
15. Enable Memory Channel for cluster communications on Uranus.<br />
16. Disable Memory Channel for cluster communications on Uranus.<br />
Enter choice [1]: 9<br />
What is the node’s DECnet node name? ARIEL<br />
What is the new hardware address [XX-XX-XX-XX-XX-XX]? 08-00-3B-05-37-78<br />
Updating network database...<br />
The configuration procedure has completed successfully.<br />
Example 8–9 Sample Interactive CLUSTER_CONFIG.COM Session to Enable the Local<br />
Computer as a Tape Server<br />
$ @CLUSTER_CONFIG.COM<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster node’s characteristics.<br />
4. CREATE a second system disk for URANUS.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]: 3<br />
CHANGE Menu<br />
1. Enable URANUS as a disk server.<br />
2. Disable URANUS as a disk server.<br />
3. Enable URANUS as a boot server.<br />
4. Disable URANUS as a boot server.<br />
5. Enable the LAN for cluster communications on URANUS.<br />
6. Disable the LAN for cluster communications on URANUS.<br />
7. Enable a quorum disk on URANUS.<br />
8. Disable a quorum disk on URANUS.<br />
9. Change URANUS’s satellite’s Ethernet or FDDI hardware address.<br />
10. Enable URANUS as a tape server.<br />
11. Disable URANUS as a tape server.<br />
12. Change URANUS’s ALLOCLASS value.<br />
13. Change URANUS’s TAPE_ALLOCLASS value.<br />
14. Change URANUS’s shared SCSI port allocation class value.<br />
15. Enable Memory Channel for cluster communications on Uranus.<br />
16. Disable Memory Channel for cluster communications on Uranus.<br />
Enter choice [1]: 10<br />
Enter a value for URANUS’s TAPE_ALLOCLASS parameter [1]: Return<br />
URANUS has been enabled as a tape server. TMSCP_LOAD has been<br />
set to 1 in MODPARAMS.DAT. Please run AUTOGEN to reboot URANUS:<br />
$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT<br />
If you have changed URANUS’s TAPE_ALLOCLASS value, you must reconfigure<br />
the cluster, using the procedure described in the <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong><br />
manual.<br />
Example 8–10 Sample Interactive CLUSTER_CONFIG.COM Session to Change the Local<br />
Computer’s TAPE_ALLOCLASS Value<br />
$ @CLUSTER_CONFIG.COM<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster node’s characteristics.<br />
4. CREATE a second system disk for URANUS.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]: 3<br />
CHANGE Menu<br />
1. Enable URANUS as a disk server.<br />
2. Disable URANUS as a disk server.<br />
3. Enable URANUS as a boot server.<br />
4. Disable URANUS as a boot server.<br />
5. Enable the LAN for cluster communications on URANUS.<br />
6. Disable the LAN for cluster communications on URANUS.<br />
7. Enable a quorum disk on URANUS.<br />
8. Disable a quorum disk on URANUS.<br />
9. Change URANUS’s satellite’s Ethernet or FDDI hardware address.<br />
10. Enable URANUS as a tape server.<br />
11. Disable URANUS as a tape server.<br />
12. Change URANUS’s ALLOCLASS value.<br />
13. Change URANUS’s TAPE_ALLOCLASS value.<br />
14. Change URANUS’s shared SCSI port allocation class value.<br />
15. Enable Memory Channel for cluster communications on Uranus.<br />
16. Disable Memory Channel for cluster communications on Uranus.<br />
Enter choice [1]: 13<br />
Enter a value for URANUS’s TAPE_ALLOCLASS parameter [1]: 2<br />
If you have changed URANUS’s TAPE_ALLOCLASS value, you must reconfigure<br />
the cluster, using the procedure described in the <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong><br />
manual.<br />
Example 8–11 Sample Interactive CLUSTER_CONFIG.COM Session to Convert a Standalone<br />
Computer to a <strong>Cluster</strong> Boot Server<br />
$ @CLUSTER_CONFIG.COM<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster node’s characteristics.<br />
4. CREATE a second system disk for URANUS.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]: 3<br />
CHANGE Menu<br />
1. Enable URANUS as a disk server.<br />
2. Disable URANUS as a disk server.<br />
3. Enable URANUS as a boot server.<br />
4. Disable URANUS as a boot server.<br />
5. Enable the LAN for cluster communications on URANUS.<br />
6. Disable the LAN for cluster communications on URANUS.<br />
7. Enable a quorum disk on URANUS.<br />
8. Disable a quorum disk on URANUS.<br />
9. Change URANUS’s satellite’s Ethernet or FDDI hardware address.<br />
10. Enable URANUS as a tape server.<br />
11. Disable URANUS as a tape server.<br />
12. Change URANUS’s ALLOCLASS value.<br />
13. Change URANUS’s TAPE_ALLOCLASS value.<br />
14. Change URANUS’s shared SCSI port allocation class value.<br />
15. Enable Memory Channel for cluster communications on Uranus.<br />
16. Disable Memory Channel for cluster communications on Uranus.<br />
Enter choice [1]: 3<br />
This procedure sets up this standalone node to join an existing<br />
cluster or to form a new cluster.<br />
What is the node’s DECnet node name? PLUTO<br />
What is the node’s DECnet address? 2.5<br />
Will the Ethernet be used for cluster communications (Y/N)? Y<br />
Enter this cluster’s group number: 3378<br />
Enter this cluster’s password:<br />
Re-enter this cluster’s password for verification:<br />
Will PLUTO be a boot server [Y]? Return<br />
Verifying circuits in network database...<br />
Enter a value for PLUTO’s ALLOCLASS parameter: 1<br />
Does this cluster contain a quorum disk [N]? Return<br />
AUTOGEN computes the SYSGEN parameters for your configuration<br />
and then reboots the system with the new parameters.<br />
8.5 Creating a Duplicate System Disk<br />
As you continue to add Alpha computers running on an Alpha common system<br />
disk or VAX computers running on a VAX common system disk, you eventually<br />
reach the disk’s storage or I/O capacity. In that case, you may want to add one or<br />
more common system disks to handle the increased load.<br />
Reminder: Remember that a system disk cannot be shared between VAX and<br />
Alpha computers. An Alpha system cannot be created from a VAX system disk,<br />
and a VAX system cannot be created from an Alpha system disk.<br />
Configuring an <strong>OpenVMS</strong> <strong>Cluster</strong> System 8–31
8.5.1 Preparation<br />
You can use either CLUSTER_CONFIG_LAN.COM or CLUSTER_CONFIG.COM<br />
to set up additional system disks. After you have coordinated cluster common<br />
files as described in Chapter 5, proceed as follows:<br />
1. Locate an appropriate scratch disk for use as an additional system disk.<br />
2. Log in as system manager.<br />
3. Invoke either CLUSTER_CONFIG_LAN.COM or CLUSTER_CONFIG.COM<br />
and select the CREATE option.<br />
8.5.2 Example<br />
As shown in Example 8–12, the cluster configuration command procedure:<br />
1. Prompts for the device names of the current and new system disks.<br />
2. Backs up the current system disk to the new one.<br />
3. Deletes all directory roots (except SYS0) from the new disk.<br />
4. Mounts the new disk clusterwide.<br />
Note: <strong>OpenVMS</strong> RMS error messages are displayed while the procedure deletes<br />
directory files. You can ignore these messages.<br />
Example 8–12 Sample Interactive CLUSTER_CONFIG.COM CREATE Session<br />
$ @CLUSTER_CONFIG.COM<br />
<strong>Cluster</strong> Configuration Procedure<br />
Use CLUSTER_CONFIG.COM to set up or change an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
To ensure that you have the required privileges, invoke this procedure<br />
from the system manager’s account.<br />
Enter ? for help at any prompt.<br />
1. ADD a node to the cluster.<br />
2. REMOVE a node from the cluster.<br />
3. CHANGE a cluster node’s characteristics.<br />
4. CREATE a second system disk for JUPITR.<br />
5. MAKE a directory structure for a new root on a system disk.<br />
6. DELETE a root from a system disk.<br />
Enter choice [1]: 4<br />
The CREATE function generates a duplicate system disk.<br />
o It backs up the current system disk to the new system disk.<br />
o It then removes from the new system disk all system roots.<br />
WARNING - Do not proceed unless you have defined appropriate<br />
logical names for cluster common files in your<br />
site-specific startup procedures. For instructions,<br />
see the <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> manual.<br />
Do you want to continue [N]? YES<br />
This procedure will now ask you for the device name of JUPITR’s system root.<br />
The default device name (DISK$VAXVMSRL5:) is the logical volume name of<br />
SYS$SYSDEVICE:.<br />
What is the device name of the current system disk [DISK$VAXVMSRL5:]? Return<br />
What is the device name for the new system disk? $1$DJA16:<br />
%DCL-I-ALLOC, _$1$DJA16: allocated<br />
%MOUNT-I-MOUNTED, SCRATCH mounted on _$1$DJA16:<br />
What is the unique label for the new system disk [JUPITR_SYS2]? Return<br />
Backing up the current system disk to the new system disk...<br />
Deleting all system roots...<br />
Deleting directory tree SYS1...<br />
%DELETE-I-FILDEL, $1$DJA16:DECNET.DIR;1 deleted (2 blocks)<br />
.<br />
..<br />
System root SYS1 deleted.<br />
Deleting directory tree SYS2...<br />
%DELETE-I-FILDEL, $1$DJA16:DECNET.DIR;1 deleted (2 blocks)<br />
.<br />
..<br />
System root SYS2 deleted.<br />
All the roots have been deleted.<br />
%MOUNT-I-MOUNTED, JUPITR_SYS2 mounted on _$1$DJA16:<br />
The second system disk has been created and mounted clusterwide.<br />
Satellites can now be added.<br />
8.6 Postconfiguration Tasks<br />
Some configuration functions, such as adding or removing a voting member or<br />
enabling or disabling a quorum disk, require one or more additional operations.<br />
These operations are listed in Table 8–10 and affect the integrity of the entire<br />
cluster. Follow the instructions in the table for the action you should take after<br />
executing either CLUSTER_CONFIG_LAN.COM or CLUSTER_CONFIG.COM to<br />
make major configuration changes.<br />
Table 8–10 Actions Required to Reconfigure a <strong>Cluster</strong><br />
After running the cluster configuration procedure to... You should...<br />
Add or remove a voting member: Update the AUTOGEN parameter files and the<br />
current system parameter files for all nodes in the cluster, as described in<br />
Section 8.6.1.<br />
Enable a quorum disk: Perform the following steps:<br />
1. Update the AUTOGEN parameter files and the current system<br />
parameter files for all quorum watchers in the cluster, as<br />
described in Section 8.6.1.<br />
2. Reboot the nodes that have been enabled as quorum disk<br />
watchers (Section 2.3.8).<br />
Reference: See also Section 8.2.4 for more information about adding<br />
a quorum disk.<br />
Disable a quorum disk: Perform the following steps:<br />
Caution: Do not perform these steps until you are ready to reboot<br />
the entire <strong>OpenVMS</strong> <strong>Cluster</strong> system. Because you are reducing<br />
quorum for the cluster, the votes cast by the quorum disk being<br />
removed could cause cluster partitioning.<br />
1. Update the AUTOGEN parameter files and the current system<br />
parameter files for all quorum watchers in the cluster, as<br />
described in Section 8.6.1.<br />
2. Evaluate whether or not quorum will be lost without the<br />
quorum disk:<br />
If quorum will not be lost: Use the DCL command<br />
SET CLUSTER/EXPECTED_VOTES to reduce the value of quorum,<br />
and then reboot the nodes that have been disabled as quorum<br />
disk watchers. (Quorum disk watchers are described in<br />
Section 2.3.8.)<br />
If quorum will be lost: Shut down and reboot the entire cluster.<br />
Reference: <strong>Cluster</strong> shutdown is described in Section 8.6.2.<br />
Reference: See also Section 8.3.2 for more information about<br />
removing a quorum disk.<br />
Add a satellite node: Perform these steps:<br />
• Update the volatile network databases on other cluster members<br />
(Section 8.6.4).<br />
• Optionally, alter the satellite’s local disk label (Section 8.6.5).<br />
Enable or disable the LAN for cluster communications: Update the current<br />
system parameter files and reboot the node on which you have enabled or<br />
disabled the LAN (Section 8.6.1).<br />
Change allocation class values: Refer to the appropriate section, as follows:<br />
• Change allocation class values on HSC subsystems<br />
(Section 6.2.2.2).<br />
• Change allocation class values on HSJ subsystems<br />
(Section 6.2.2.3).<br />
• Change allocation class values on DSSI ISE subsystems<br />
(Section 6.2.2.5).<br />
• Update the current system parameter files and shut down and<br />
reboot the entire cluster (Sections 8.6.1 and 8.6.2).<br />
Change the cluster group number or password: Shut down and reboot the<br />
entire cluster (Sections 8.6.2 and 8.6.7).<br />
8.6.1 Updating Parameter Files<br />
Either cluster configuration command procedure (CLUSTER_CONFIG_LAN.COM<br />
or CLUSTER_CONFIG.COM) can be used to modify parameters in the<br />
AUTOGEN parameter file for the node on which the procedure is run.<br />
In some cases, such as when you add or remove a voting cluster member, or when<br />
you enable or disable a quorum disk, you must update the AUTOGEN files for all<br />
the other cluster members.<br />
Use either of the methods described in the following table.<br />
Update MODPARAMS.DAT files: Edit MODPARAMS.DAT in all cluster members’<br />
[SYSx.SYSEXE] directories and adjust the value for the EXPECTED_VOTES<br />
system parameter appropriately.<br />
For example, if you add a voting member or if you enable a quorum disk,<br />
you must increment the value by the number of votes assigned to the new<br />
member (usually 1). If you add a voting member with one vote and enable<br />
a quorum disk with one vote on that computer, you must increment the<br />
value by 2.<br />
Update AGEN$ files: Update the parameter settings in the appropriate AGEN$<br />
include files:<br />
• For satellites, edit SYS$MANAGER:AGEN$NEW_SATELLITE_DEFAULTS.DAT.<br />
• For nonsatellites, edit SYS$MANAGER:AGEN$NEW_NODE_DEFAULTS.DAT.<br />
Reference: These files are described in Section 8.2.2.<br />
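For example, suppose a new voting member with one vote is being added to the<br />
cluster. A minimal, hypothetical MODPARAMS.DAT edit on each member might<br />
look like the following (the starting value of 3 is illustrative only; use your<br />
cluster’s actual vote counts):<br />

```
! Excerpt from MODPARAMS.DAT in each member's [SYSx.SYSEXE] directory
!
! EXPECTED_VOTES was 3; the new voting member contributes 1 vote.
EXPECTED_VOTES = 4
```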
You must also update the current system parameter files (VAXVMSSYS.PAR or<br />
ALPHAVMSSYS.PAR, as appropriate) so that the changes take effect on the next<br />
reboot.<br />
Use either of the methods described in the following table.<br />
SYSMAN utility: Perform the following steps:<br />
1. Log in as system manager.<br />
2. Run the SYSMAN utility to update the EXPECTED_VOTES system parameter<br />
on all nodes in the cluster. For example:<br />
$ RUN SYS$SYSTEM:SYSMAN<br />
SYSMAN> SET ENVIRONMENT/CLUSTER<br />
%SYSMAN-I-ENV, current command environment:<br />
<strong>Cluster</strong>wide on local cluster<br />
Username SYSTEM will be used on nonlocal nodes<br />
SYSMAN> PARAM USE CURRENT<br />
SYSMAN> PARAM SET EXPECTED_VOTES 2<br />
SYSMAN> PARAM WRITE CURRENT<br />
SYSMAN> EXIT<br />
AUTOGEN utility: Perform the following steps:<br />
1. Log in as system manager.<br />
2. Run the AUTOGEN utility to update the EXPECTED_VOTES system parameter<br />
on all nodes in the cluster. For example:<br />
$ RUN SYS$SYSTEM:SYSMAN<br />
SYSMAN> SET ENVIRONMENT/CLUSTER<br />
%SYSMAN-I-ENV, current command environment:<br />
<strong>Cluster</strong>wide on local cluster<br />
Username SYSTEM will be used on nonlocal nodes<br />
SYSMAN> DO @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS<br />
SYSMAN> EXIT<br />
Do not specify the SHUTDOWN or REBOOT option.<br />
Hint: If your next action is to shut down the node, you can specify SHUTDOWN<br />
or REBOOT (in place of SETPARAMS) in the DO @SYS$UPDATE:AUTOGEN<br />
GETDATA command.<br />
Both of these methods propagate the values to the computer’s<br />
ALPHAVMSSYS.PAR file on Alpha computers or to the VAXVMSSYS.PAR<br />
file on VAX computers. For these changes to take effect, continue with the<br />
instructions in Section 8.6.2 to shut down the cluster or in Section 8.6.3 to<br />
shut down the node.<br />
8.6.2 Shutting Down the <strong>Cluster</strong><br />
Using the SYSMAN utility, you can shut down the entire cluster from a single<br />
node in the cluster. Follow these steps to perform an orderly shutdown:<br />
1. Log in to the system manager’s account on any node in the cluster.<br />
2. Run the SYSMAN utility and specify the SET ENVIRONMENT/CLUSTER<br />
command. Be sure to specify the /CLUSTER_SHUTDOWN qualifier to the<br />
SHUTDOWN NODE command. For example:<br />
$ RUN SYS$SYSTEM:SYSMAN<br />
SYSMAN> SET ENVIRONMENT/CLUSTER<br />
%SYSMAN-I-ENV, current command environment:<br />
<strong>Cluster</strong>wide on local cluster<br />
Username SYSTEM will be used on nonlocal nodes<br />
SYSMAN> SHUTDOWN NODE/CLUSTER_SHUTDOWN/MINUTES_TO_SHUTDOWN=5 -<br />
_SYSMAN> /AUTOMATIC_REBOOT/REASON="<strong>Cluster</strong> Reconfiguration"<br />
%SYSMAN-I-SHUTDOWN, SHUTDOWN request sent to node<br />
%SYSMAN-I-SHUTDOWN, SHUTDOWN request sent to node<br />
SYSMAN><br />
SHUTDOWN message on JUPITR from user SYSTEM at JUPITR Batch 11:02:10<br />
JUPITR will shut down in 5 minutes; back up shortly via automatic reboot.<br />
Please log off node JUPITR.<br />
<strong>Cluster</strong> Reconfiguration<br />
SHUTDOWN message on JUPITR from user SYSTEM at JUPITR Batch 11:02:10<br />
PLUTO will shut down in 5 minutes; back up shortly via automatic reboot.<br />
Please log off node PLUTO.<br />
<strong>Cluster</strong> Reconfiguration<br />
For more information, see Section 10.7.<br />
8.6.3 Shutting Down a Single Node<br />
To stop a single node in an <strong>OpenVMS</strong> <strong>Cluster</strong>, you can use either the SYSMAN<br />
SHUTDOWN NODE command with the appropriate SET ENVIRONMENT<br />
command or the SHUTDOWN command procedure. These methods are described<br />
in the following table.<br />
SYSMAN utility: Follow these steps:<br />
1. Log in to the system manager’s account on any node in the <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
2. Run the SYSMAN utility to shut down the node, as follows:<br />
$ RUN SYS$SYSTEM:SYSMAN<br />
SYSMAN> SET ENVIRONMENT/NODE=JUPITR<br />
Individual nodes: JUPITR<br />
Username SYSTEM will be used on nonlocal nodes<br />
SYSMAN> SHUTDOWN NODE/REASON="Maintenance" -<br />
_SYSMAN> /MINUTES_TO_SHUTDOWN=5<br />
Hint: To shut down a subset of nodes in the cluster, you can enter several<br />
node names (separated by commas) on the SET ENVIRONMENT/NODE<br />
command. The following command shuts down nodes JUPITR and SATURN:<br />
SYSMAN> SET ENVIRONMENT/NODE=(JUPITR,SATURN)<br />
SHUTDOWN command procedure: Follow these steps:<br />
1. Log in to the system manager’s account on the node to be shut down.<br />
2. Invoke the SHUTDOWN command procedure as follows:<br />
$ @SYS$SYSTEM:SHUTDOWN<br />
For more information, see Section 10.7.<br />
8.6.4 Updating Network Data<br />
Whenever you add a satellite, the cluster configuration command procedure<br />
you use (CLUSTER_CONFIG_LAN.COM or CLUSTER_CONFIG.COM) updates<br />
both the permanent and volatile remote node network databases (NETNODE_<br />
REMOTE.DAT) on the boot server. However, the volatile databases on other<br />
cluster members are not automatically updated.<br />
To share the new data throughout the cluster, you must update the volatile<br />
databases on all other cluster members. Log in as system manager, invoke the<br />
SYSMAN utility, and enter the following commands at the SYSMAN> prompt:<br />
$ RUN SYS$SYSTEM:SYSMAN<br />
SYSMAN> SET ENVIRONMENT/CLUSTER<br />
%SYSMAN-I-ENV, current command environment:<br />
<strong>Cluster</strong>wide on local cluster<br />
Username SYSTEM will be used on nonlocal nodes<br />
SYSMAN> SET PROFILE/PRIVILEGES=(OPER,SYSPRV)<br />
SYSMAN> DO MCR NCP SET KNOWN NODES ALL<br />
%SYSMAN-I-OUTPUT, command execution on node X...<br />
. ..<br />
SYSMAN> EXIT<br />
$<br />
The file NETNODE_REMOTE.DAT must be located in the directory<br />
SYS$COMMON:[SYSEXE].<br />
8.6.5 Altering Satellite Local Disk Labels<br />
If you want to alter the volume label on a satellite node’s local page and swap<br />
disk, follow these steps after the satellite has been added to the cluster:<br />
Step Action<br />
1 Log in as system manager and enter a DCL command in the following format:<br />
SET VOLUME/LABEL=volume-label device-spec[:]<br />
Note: The SET VOLUME command requires write access (W) to the index file on the<br />
volume. If you are not the volume’s owner, you must have either a system user identification<br />
code (UIC) or the SYSPRV privilege.<br />
2 Update the [SYSn.SYSEXE]SATELLITE_PAGE.COM procedure on the boot server’s system<br />
disk to reflect the new label.<br />
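For example, the following hypothetical session relabels a satellite’s local page<br />
and swap disk (the device name DUA0: and the label PLUTO_PAGE are<br />
illustrative only; substitute your own device and label):<br />

```
$ SET VOLUME/LABEL=PLUTO_PAGE DUA0:
```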
8.6.6 Changing Allocation Class Values<br />
If you must change allocation class values on any HSC, HSJ, or DSSI ISE<br />
subsystem, you must do so while the entire cluster is shut down.<br />
Reference: To change allocation class values:<br />
• On computer systems, see Section 6.2.2.1.<br />
• On HSC subsystems, see Section 6.2.2.2.<br />
• On HSJ subsystems, see Section 6.2.2.3.<br />
• On HSD subsystems, see Section 6.2.2.4.<br />
• On DSSI subsystems, see Section 6.2.2.5.<br />
8.6.7 Rebooting<br />
The following table describes booting actions for satellite and storage subsystems:<br />
HSC and HSJ subsystems: Reboot each computer after all HSC and HSJ<br />
subsystems have been set and rebooted.<br />
Satellite nodes: Reboot boot servers before rebooting satellites.<br />
Note that several new messages might appear. For example, if you<br />
have used the CLUSTER_CONFIG.COM CHANGE function to enable<br />
cluster communications over the LAN, one message reports that the LAN<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> security database is being loaded.<br />
Reference: See also Section 9.3 for more information about booting<br />
satellites.<br />
DSSI ISE subsystems: Reboot the system after all the DSSI ISE subsystems<br />
have been set.<br />
For every disk-serving computer, a message reports that the MSCP server is<br />
being loaded.<br />
To verify that all disks are being served in the manner in which you designed<br />
the configuration, at the system prompt ($) of the node serving the disks, enter<br />
the SHOW DEVICE/SERVED command. For example, the following display<br />
represents a DSSI configuration:<br />
$ SHOW DEVICE/SERVED<br />
Device: Status Total Size Current Max Hosts<br />
$1$DIA0 Avail 1954050 0 0 0<br />
$1$DIA2 Avail 1800020 0 0 0<br />
Caution: If you boot a node into an existing <strong>OpenVMS</strong> <strong>Cluster</strong> using<br />
minimum startup (the system parameter STARTUP_P1 is set to MIN), a<br />
number of processes (for example, CACHE_SERVER, CLUSTER_SERVER,<br />
and CONFIGURE) are not started. Compaq recommends that you start these<br />
processes manually if you intend to run the node in an <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
Running a node without these processes enabled prevents the cluster from<br />
functioning properly.<br />
Reference: Refer to the <strong>OpenVMS</strong> System Manager’s Manual for more<br />
information about starting these processes manually.<br />
8.6.8 Rebooting Satellites Configured with <strong>OpenVMS</strong> on a Local Disk<br />
Satellite nodes can be set up to reboot automatically when recovering from system<br />
failures or power failures.<br />
Reboot behavior varies from system to system. Many systems provide a console<br />
variable that allows you to specify which device to boot from by default. However,<br />
some systems have predefined boot ‘‘sniffers’’ that automatically detect a bootable<br />
device. The following table describes the rebooting conditions.<br />
IF... your system does not allow you to specify the boot device for automatic<br />
reboot (that is, it has a boot sniffer)<br />
AND... an operating system is installed on the system’s local disk<br />
THEN... that disk will be booted in preference to requesting a satellite MOP<br />
load. To avoid this, you should take one of the measures in the following<br />
list before allowing any operation that causes an automatic reboot (for<br />
example, executing SYS$SYSTEM:SHUTDOWN.COM with the REBOOT<br />
option or using CLUSTER_CONFIG.COM to add that satellite to the<br />
cluster):<br />
• Rename the directory file ddcu:[000000]SYS0.DIR on the local disk<br />
to ddcu:[000000]SYSx.DIR (where SYSx is a root other than SYS0,<br />
SYSE, or SYSF). Then enter the DCL command SET FILE/REMOVE<br />
as follows to remove the old directory entry for the boot image<br />
SYSBOOT.EXE:<br />
$ RENAME DUA0:[000000]SYS0.DIR DUA0:[000000]SYS1.DIR<br />
$ SET FILE/REMOVE DUA0:[SYSEXE]SYSBOOT.EXE<br />
†On VAX systems, for subsequent reboots of VAX computers from<br />
the local disk, enter a command in the format B/x0000000 at the<br />
console-mode prompt (>>>). For example:<br />
>>> B/10000000<br />
• Disable the local disk. For instructions, refer to your computer-specific<br />
installation and operations guide. Note that this option is not available<br />
if the satellite’s local disk is being used for paging and swapping.<br />
†VAX specific<br />
8.7 Running AUTOGEN with Feedback<br />
8.7.1 Advantages<br />
AUTOGEN includes a mechanism called feedback. This mechanism examines<br />
data collected during normal system operations, and it adjusts system parameters<br />
on the basis of the collected data whenever you run AUTOGEN with the feedback<br />
option. For example, the system records each instance of a disk server waiting<br />
for buffer space to process a disk request. Based on this information, AUTOGEN<br />
can size the disk server’s buffer pool automatically to ensure that sufficient space<br />
is allocated.<br />
Execute SYS$UPDATE:AUTOGEN.COM manually as described in the <strong>OpenVMS</strong><br />
System Manager’s Manual.<br />
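For example, a manual run that saves current feedback data and then sets<br />
parameters (without rebooting) might be entered as follows; the choice of start<br />
and end phases here is illustrative, and the full set of phases is described in the<br />
<strong>OpenVMS</strong> System Manager’s Manual:<br />

```
$ @SYS$UPDATE:AUTOGEN SAVPARAMS SETPARAMS FEEDBACK
```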
To ensure that computers are configured adequately when they first join the<br />
cluster, you can run AUTOGEN with feedback automatically as part of the initial<br />
boot sequence. Although this step adds an additional reboot before the computer<br />
can be used, the computer’s performance can be substantially improved.<br />
Compaq strongly recommends that you use the feedback option. Without<br />
feedback, it is difficult for AUTOGEN to anticipate patterns of resource<br />
usage, particularly in complex configurations. Factors such as the number of<br />
computers and disks in the cluster and the types of applications being run require<br />
adjustment of system parameters for optimal performance.<br />
Compaq also recommends using AUTOGEN with feedback rather than the<br />
SYSGEN utility to modify system parameters, because AUTOGEN:<br />
• Uses parameter changes in MODPARAMS.DAT and AGEN$ files. (Changes<br />
recorded in MODPARAMS.DAT are not lost during updates to the <strong>OpenVMS</strong><br />
operating system.)<br />
• Reconfigures other system parameters to reflect changes.<br />
8.7.2 Initial Values<br />
When a computer is first added to an <strong>OpenVMS</strong> <strong>Cluster</strong>, system parameters that<br />
control the computer’s system resources are normally adjusted in several steps,<br />
as follows:<br />
1. The cluster configuration command procedure (CLUSTER_CONFIG_<br />
LAN.COM or CLUSTER_CONFIG.COM) sets initial parameters that are<br />
adequate to boot the computer in a minimum environment.<br />
2. When the computer boots, AUTOGEN runs automatically to size the static<br />
operating system (without using any dynamic feedback data), and the<br />
computer reboots into the <strong>OpenVMS</strong> <strong>Cluster</strong> environment.<br />
3. After the newly added computer has been subjected to typical use for a<br />
day or more, you should run AUTOGEN with feedback manually to adjust<br />
parameters for the <strong>OpenVMS</strong> <strong>Cluster</strong> environment.<br />
4. At regular intervals, and whenever a major change occurs in the cluster<br />
configuration or production environment, you should run AUTOGEN with<br />
feedback manually to readjust parameters for the changes.<br />
Because the first AUTOGEN operation (initiated by either CLUSTER_CONFIG_<br />
LAN.COM or CLUSTER_CONFIG.COM) is performed both in the minimum<br />
environment and without feedback, a newly added computer may be inadequately<br />
configured to run in the <strong>OpenVMS</strong> <strong>Cluster</strong> environment. For this reason, you<br />
might want to implement additional configuration measures like those described<br />
in Section 8.7.3 and Section 8.7.4.<br />
8.7.3 Obtaining Reasonable Feedback<br />
When a computer first boots into an <strong>OpenVMS</strong> <strong>Cluster</strong>, much of the computer’s<br />
resource utilization is determined by the current <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
Factors such as the number of computers, the number of disk servers, and the<br />
number of disks available or mounted contribute to a fixed minimum resource<br />
requirement. Because this minimum does not change with continued use of<br />
the computer, feedback information about the required resources is immediately<br />
valid.<br />
Other feedback information, however, such as that influenced by normal user<br />
activity, is not immediately available, because the only ‘‘user’’ has been the<br />
system startup process. If AUTOGEN were run with feedback at this point, some<br />
system values might be set too low.<br />
By running a simulated user load at the end of the first production boot, you<br />
can ensure that AUTOGEN has reasonable feedback information. The User<br />
Environment Test Package (UETP) supplied with your operating system contains<br />
a test that simulates such a load. You can run this test (the UETP LOAD phase)<br />
as part of the initial production boot, and then run AUTOGEN with feedback<br />
before a user is allowed to log in.<br />
To implement this technique, you can create a command file like that in step<br />
1 of the procedure in Section 8.7.4, and submit the file to the computer’s local<br />
batch queue from the cluster common SYSTARTUP procedure. Your command<br />
file conditionally runs the UETP LOAD phase and then reboots the computer<br />
with AUTOGEN feedback.<br />
8.7.4 Creating a Command File to Run AUTOGEN<br />
As shown in the following sample file, UETP lets you specify a typical user<br />
load to be run on the computer when it first joins the cluster. The UETP run<br />
generates data that AUTOGEN uses to set appropriate system parameter values<br />
for the computer when rebooting it with feedback. Note, however, that the<br />
default setting for the UETP user load assumes that the computer is used as a<br />
timesharing system. This calculation can produce system parameter values that<br />
might be excessive for a single-user workstation, especially if the workstation has<br />
large memory resources. Therefore, you might want to modify the default user<br />
load setting, as shown in the sample file.<br />
Follow these steps:<br />
1. Create a command file like the following:<br />
$!<br />
$! ***** SYS$COMMON:[SYSMGR]UETP_AUTOGEN.COM *****<br />
$!<br />
$! For initial boot only, run UETP LOAD phase and<br />
$! reboot with AUTOGEN feedback.<br />
$!<br />
$ SET NOON<br />
$ SET PROCESS/PRIVILEGES=ALL<br />
$!<br />
$! Run UETP to simulate a user load for a satellite<br />
$! with 8 simultaneously active user processes. For a<br />
$! CI connected computer, allow UETP to calculate the load.<br />
$!<br />
$ LOADS = "8"<br />
$ IF F$GETDVI("PAA0:","EXISTS") THEN LOADS = ""<br />
$ @UETP LOAD 1 ’loads’<br />
$!<br />
$! Create a marker file to prevent resubmission of<br />
$! UETP_AUTOGEN.COM at subsequent reboots.<br />
$!<br />
$ CREATE SYS$SPECIFIC:[SYSMGR]UETP_AUTOGEN.DONE<br />
$!<br />
$! Reboot with AUTOGEN to set SYSGEN values.<br />
$!<br />
$ @SYS$UPDATE:AUTOGEN SAVPARAMS REBOOT FEEDBACK<br />
$!<br />
$ EXIT<br />
2. Edit the cluster common SYSTARTUP file and add the following commands<br />
at the end of the file. Assume that queues have been started and that<br />
a batch queue is running on the newly added computer. Submit UETP_<br />
AUTOGEN.COM to the computer’s local batch queue.<br />
$!<br />
$ NODE = F$GETSYI("NODE")<br />
$ IF F$SEARCH ("SYS$SPECIFIC:[SYSMGR]UETP_AUTOGEN.DONE") .EQS. ""<br />
$ THEN<br />
$ SUBMIT /NOPRINT /NOTIFY /USERNAME=SYSTEST -<br />
_$ /QUEUE=’NODE’_BATCH SYS$MANAGER:UETP_AUTOGEN<br />
$ WAIT_FOR_UETP:<br />
$ WRITE SYS$OUTPUT "Waiting for UETP and AUTOGEN... ’’F$TIME()’"<br />
$ WAIT 00:05:00.00 ! Wait 5 minutes<br />
$ GOTO WAIT_FOR_UETP<br />
$ ENDIF<br />
$!<br />
Note: UETP must be run under the user name SYSTEST.<br />
3. Execute CLUSTER_CONFIG_LAN.COM or CLUSTER_CONFIG.COM to add<br />
the computer.<br />
When you boot the computer, it runs UETP_AUTOGEN.COM to simulate the<br />
user load you have specified, and it then reboots with AUTOGEN feedback to set<br />
appropriate system parameter values.<br />
9<br />
Building Large <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong><br />
This chapter provides guidelines for building <strong>OpenVMS</strong> <strong>Cluster</strong> systems that<br />
include many computers—approximately 20 or more—and describes procedures<br />
that you might find helpful. (Refer to the <strong>OpenVMS</strong> <strong>Cluster</strong> Software Software<br />
Product Description (SPD) for configuration limitations.) Typically, such<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> systems include a large number of satellites.<br />
Note that the recommendations in this chapter also can prove beneficial in some<br />
clusters with fewer than 20 computers. Areas of discussion include:<br />
• Booting<br />
• Availability of MOP and disk servers<br />
• Multiple system disks<br />
• Shared resource availability<br />
• Hot system files<br />
• System disk space<br />
• System parameters<br />
• Network problems<br />
• <strong>Cluster</strong> alias<br />
9.1 Setting Up the <strong>Cluster</strong><br />
When building a new large cluster, you must be prepared to run AUTOGEN and<br />
reboot the cluster several times during the installation. The parameters that<br />
AUTOGEN sets for the first computers added to the cluster will probably be<br />
inadequate when additional computers are added. Readjustment of parameters is<br />
critical for boot and disk servers.<br />
One solution to this problem is to run the UETP_AUTOGEN.COM command<br />
procedure (described in Section 8.7.4) to reboot computers at regular intervals<br />
as new computers or storage interconnects are added. For example, each time<br />
there is a 10% increase in the number of computers, storage, or interconnects,<br />
you should run UETP_AUTOGEN.COM. For best results, the last time you<br />
run the procedure should be as close as possible to the final <strong>OpenVMS</strong> <strong>Cluster</strong><br />
environment.<br />
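As a rough illustration of the 10% guideline above (a sketch, not part of the manual; the function name and growth figures are hypothetical), the following computes the cluster sizes at which UETP_AUTOGEN.COM would be rerun as the cluster grows:

```python
# Hypothetical sketch: cluster sizes at which to rerun UETP_AUTOGEN.COM,
# given the guideline of rerunning after roughly every 10% increase in
# the number of computers.
def rerun_points(start: int, final: int) -> list[int]:
    """Cluster sizes at which to rerun UETP_AUTOGEN.COM while growing."""
    points, n = [], start
    while n < final:
        n = min(max(n + 1, int(n * 1.1)), final)  # next ~10% step, at least 1 node
        points.append(n)
    return points

# Growing a cluster from 20 to 30 computers:
print(rerun_points(20, 30))  # -> [22, 24, 26, 28, 30]
```

The last rerun lands at the final cluster size, matching the advice that the final run should be as close as possible to the production environment.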
To set up a new, large <strong>OpenVMS</strong> <strong>Cluster</strong>, follow these steps:<br />
Step Task<br />
1 Configure boot and disk servers using the CLUSTER_CONFIG_LAN.COM or the CLUSTER_<br />
CONFIG.COM command procedure (described in Chapter 8).<br />
2 Install all layered products and site-specific applications required for the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
environment, or as many as possible.<br />
3 Prepare the cluster startup procedures so that they are as close as possible to those that will<br />
be used in the final <strong>OpenVMS</strong> <strong>Cluster</strong> environment.<br />
4 Add a small number of satellites (perhaps two or three) using the cluster configuration<br />
command procedure.<br />
5 Reboot the cluster to verify that the startup procedures work as expected.<br />
6 After you have verified that startup procedures work, run UETP_AUTOGEN.COM on<br />
every computer’s local batch queue to reboot the cluster again and to set initial production<br />
environment values. When the cluster has rebooted, all computers should have reasonable<br />
parameter settings. However, check the settings to be sure.<br />
7 Add additional satellites to double their number. Then rerun UETP_AUTOGEN on<br />
each computer’s local batch queue to reboot the cluster, and set values appropriately to<br />
accommodate the newly added satellites.<br />
8 Repeat the previous step until all satellites have been added.<br />
9 When all satellites have been added, run UETP_AUTOGEN a final time on each computer’s<br />
local batch queue to reboot the cluster and to set new values for the production environment.<br />
For best performance, do not run UETP_AUTOGEN on every computer<br />
simultaneously, because the procedure simulates a user load that is probably<br />
more demanding than that for the final production environment. A better method<br />
is to run UETP_AUTOGEN on several satellites (those with the least recently<br />
adjusted parameters) while adding new computers. This technique increases<br />
efficiency because little is gained when a satellite reruns AUTOGEN shortly after<br />
joining the cluster.<br />
For example, if the entire cluster is rebooted after 30 satellites have been added,<br />
few adjustments are made to system parameter values for the 28th satellite<br />
added, because only two satellites have joined the cluster since that satellite ran<br />
UETP_AUTOGEN as part of its initial configuration.<br />
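The add-and-double procedure in the steps above can be sketched as a schedule of satellite counts at which the cluster is rebooted with UETP_AUTOGEN (an illustration only; the function name and counts are hypothetical):

```python
# Hypothetical sketch of the add-and-double schedule from steps 4-8 above:
# start with a few satellites, then double until all have been added,
# rerunning UETP_AUTOGEN after each stage.
def doubling_stages(initial: int, total: int) -> list[int]:
    """Satellite counts after each add-and-double stage."""
    stages, n = [initial], initial
    while n < total:
        n = min(n * 2, total)  # double the satellites, capped at the final count
        stages.append(n)
    return stages

# Starting with 3 satellites and growing to 50:
print(doubling_stages(3, 50))  # -> [3, 6, 12, 24, 48, 50]
```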
9.2 General Booting Considerations<br />
Two general booting considerations, concurrent booting and minimizing boot time,<br />
are described in this section.<br />
9.2.1 Concurrent Booting<br />
One of the rare times when all <strong>OpenVMS</strong> <strong>Cluster</strong> computers are simultaneously<br />
active is during a cluster reboot—for example, after a power failure. All satellites<br />
are waiting to reload the operating system, and as soon as a boot server is<br />
available, they begin to boot in parallel. This booting activity places a significant<br />
I/O load on the system disk or disks, interconnects, and boot servers.<br />
For example, Table 9–1 shows a VAX system disk’s I/O activity and elapsed<br />
time until login for a single satellite with minimal startup procedures when the<br />
satellite is the only one booting. Table 9–2 shows system disk I/O activity and<br />
time elapsed between boot server response and login for various numbers of<br />
satellites booting from a single system disk. The disk in these examples has a<br />
capacity of 40 I/O operations per second.<br />
Note that the numbers in the tables are fabricated and are meant to provide only<br />
a generalized picture of booting activity. Elapsed time until login on satellites<br />
in any particular cluster depends on the complexity of the site-specific system<br />
startup procedures. Computers in clusters with many layered products or site-specific<br />
applications require more system disk I/O operations to complete booting<br />
operations.<br />
Table 9–1 Sample System Disk I/O Activity and Boot Time for a Single VAX Satellite<br />
Total I/O Requests to System Disk: 4200<br />
Average System Disk I/O Operations per Second: 6<br />
Elapsed Time Until Login (minutes): 12<br />
Table 9–2 Sample System Disk I/O Activity and Boot Times for Multiple VAX Satellites<br />
Number of Satellites | I/Os Requested per Second | I/Os Serviced per Second | Elapsed Time Until Login (minutes)<br />
1 | 6 | 6 | 12<br />
2 | 12 | 12 | 12<br />
4 | 24 | 24 | 12<br />
6 | 36 | 36 | 12<br />
8 | 48 | 40 | 14<br />
12 | 72 | 40 | 21<br />
16 | 96 | 40 | 28<br />
24 | 144 | 40 | 42<br />
32 | 192 | 40 | 56<br />
48 | 288 | 40 | 84<br />
64 | 384 | 40 | 112<br />
96 | 576 | 40 | 168<br />
While the elapsed times shown in Table 9–2 do not include the time required for<br />
the boot server itself to reload, they illustrate that the I/O capacity of a single<br />
system disk can be the limiting factor for cluster reboot time.<br />
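The figures in Tables 9–1 and 9–2 follow a simple saturation model: each satellite needs about 4200 system-disk I/Os to boot and requests them at 6 per second, while the disk services at most 40 per second in total. The following sketch reconstructs that model (the names and the model itself are inferred from the tables, not stated in the manual):

```python
# Illustrative saturation model behind Tables 9-1 and 9-2 (a reconstruction,
# not part of the original manual).
IOS_PER_SATELLITE = 4200   # total system-disk I/Os one satellite needs to boot
REQUEST_RATE = 6           # I/Os per second each satellite requests
DISK_CAPACITY = 40         # I/Os per second the system disk can service

def elapsed_minutes(satellites: int) -> float:
    """Minutes until the last satellite logs in (boot-server reload excluded)."""
    total_ios = IOS_PER_SATELLITE * satellites
    serviced = min(REQUEST_RATE * satellites, DISK_CAPACITY)
    return total_ios / serviced / 60

for n in (1, 8, 96):
    print(n, round(elapsed_minutes(n)))  # 1 -> 12, 8 -> 14, 96 -> 168
```

Below about 6 satellites the disk keeps up and boot time stays flat; beyond that, elapsed time grows linearly with the number of satellites, which is why 96 satellites take 168 minutes.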
9.2.2 Minimizing Boot Time<br />
A large cluster needs to be carefully configured so that there is sufficient capacity<br />
to boot the desired number of nodes in the desired amount of time. As shown<br />
in Table 9–2, 96 satellites rebooting simultaneously could induce an I/O bottleneck<br />
that can stretch <strong>OpenVMS</strong> <strong>Cluster</strong> reboot time into hours. The following list<br />
provides a few methods to minimize boot times.<br />
• Careful configuration techniques<br />
Guidelines for <strong>OpenVMS</strong> <strong>Cluster</strong> Configurations contains data on<br />
configurations and the capacity of the computers, system disks, and<br />
interconnects involved.<br />
• Adequate system disk throughput<br />
Achieving enough system disk throughput typically requires a combination of<br />
techniques. Refer to Section 9.5 for complete information.<br />
• Sufficient network bandwidth<br />
A single Ethernet is unlikely to have sufficient bandwidth to meet the needs<br />
of a large <strong>OpenVMS</strong> <strong>Cluster</strong>. Likewise, a single Ethernet adapter may become<br />
a bottleneck, especially for a disk server. Sufficient network bandwidth can<br />
be provided using some of the techniques listed in step 1 of Table 9–3.<br />
• Installation of only the required layered products and devices.<br />
9.3 Booting Satellites<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> satellite nodes use a single LAN adapter for the initial stages<br />
of booting. If a satellite is configured with multiple LAN adapters, the system<br />
manager can specify with the console BOOT command which adapter to use for<br />
the initial stages of booting. Once the system is running, the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
uses all available LAN adapters. This flexibility allows you to work around<br />
broken adapters or network problems.<br />
The procedures and utilities for configuring and booting satellite nodes are the<br />
same or vary only slightly between Alpha and VAX systems. These are described<br />
in Section 9.4.<br />
In addition, VAX nodes can MOP load Alpha satellites, and Alpha nodes can MOP<br />
load VAX satellites. Cross-architecture booting is described in Section 10.5.<br />
9.4 Configuring and Booting Satellite Nodes<br />
Complete the items in Table 9–3 before proceeding with satellite<br />
booting.<br />
Table 9–3 Checklist for Satellite Booting<br />
Step Action<br />
1 Configure disk server LAN adapters.<br />
Because disk-serving activity in an <strong>OpenVMS</strong> <strong>Cluster</strong> system can generate a substantial<br />
amount of I/O traffic on the LAN, boot and disk servers should use the highest-bandwidth<br />
LAN adapters in the cluster. The servers can also use multiple LAN adapters in a single<br />
system to distribute the load across the LAN adapters.<br />
The following list suggests ways to provide sufficient network bandwidth:<br />
• Select network adapters with sufficient bandwidth.<br />
• Use switches to segregate traffic and to provide increased total bandwidth.<br />
• Use multiple LAN adapters on MOP and disk servers.<br />
• Use a switch or a higher-speed LAN, fanning out to slower LAN segments.<br />
• Use multiple independent networks.<br />
• Provide sufficient MOP and disk server CPU capacity by selecting a computer with<br />
sufficient power and by configuring multiple server nodes to share the load.<br />
2 If the MOP server node and system-disk server node (Alpha or VAX) are not already<br />
configured as cluster members, follow the directions in Section 8.4 for using the cluster<br />
configuration command procedure to configure each of the VAX or Alpha nodes. Include<br />
multiple boot and disk servers to enhance availability and distribute I/O traffic over several<br />
cluster nodes.<br />
3 Configure additional memory for disk serving.<br />
4 Run the cluster configuration procedure on the Alpha or VAX node for each satellite you<br />
want to boot into the <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
9.4.1 Booting from a Single LAN Adapter<br />
To boot a satellite, enter the following command:<br />
>>> BOOT LAN-adapter-device-name<br />
In this command, LAN-adapter-device-name can be any valid LAN adapter<br />
name, for example, EZA0 or XQB0.<br />
If you need to perform a conversational boot, use the command appropriate for<br />
your system type, as follows.<br />
For an Alpha system, enter the following at the console prompt (>>>):<br />
>>> b -flags 0,1 eza0<br />
In this example, -flags stands for the flags command-line qualifier, which takes two<br />
values:<br />
• System root number<br />
The ‘‘0’’ tells the console to boot from the system root [SYS0]. This value is ignored<br />
when booting satellite nodes because the system root comes from the network database<br />
of the boot node.<br />
• Conversational boot flag<br />
The ‘‘1’’ indicates that the boot should be conversational.<br />
The argument eza0 is the LAN adapter to be used for booting.<br />
Finally, notice that a load file is not specified in this boot command line. For satellite<br />
booting, the load file is part of the node description in the DECnet or LANCP database.<br />
For a VAX system, specify the full device name in the boot command at the console<br />
prompt (>>>):<br />
>>> B/R5=1 XQB0<br />
The exact syntax varies depending on the system type. Refer to the hardware<br />
user’s guide for your system.<br />
If the boot fails:<br />
• If the configuration permits and the network database is properly set up,<br />
reenter the boot command using another LAN adapter (see Section 9.4.4).<br />
• See Section C.3.5 for information about troubleshooting satellite booting<br />
problems.<br />
9.4.2 Changing the Default Boot Adapter<br />
To change the default boot adapter, you need the physical address of the alternate<br />
LAN adapter. You use the address to update the satellite’s node definition in<br />
the DECnet or LANCP database on the MOP servers so that they recognize the<br />
satellite (described in Section 9.4.4).<br />
If the system is an Alpha system, use the SHOW CONFIG console command to find<br />
the LAN address of additional adapters.<br />
If the system is a VAX system, use one of the following methods to find the LAN<br />
address of additional adapters:<br />
• Enter the console command SHOW ETHERNET.<br />
• Boot the READ_ADDR program using the following commands:<br />
>>> B/100 XQB0<br />
Bootfile: READ_ADDR<br />
9.4.3 Booting from Multiple LAN Adapters (Alpha Only)<br />
On Alpha systems, availability can be increased by using multiple LAN adapters<br />
for booting because access to the MOP server and disk server can occur via<br />
different LAN adapters. To use multiple adapter booting, perform the steps in<br />
the following table.<br />
Step Task<br />
1 Obtain the physical addresses of the additional LAN adapters.<br />
2 Use these addresses to update the node definition in the DECnet or LANCP database on<br />
some of the MOP servers so that they recognize the satellite (described in Section 9.4.4).<br />
3 If the satellite is already defined in the DECnet database, skip to step 4. If the satellite<br />
is not defined in the DECnet database, specify the SYS$SYSTEM:APB.EXE downline load<br />
file in the Alpha network database (see Example 10–2).<br />
4 Specify multiple LAN adapters on the boot command line. (Use the SHOW DEVICE or<br />
SHOW CONFIG console command to obtain the names of adapters.)<br />
The following command line is the same as that used for booting from a single<br />
LAN adapter on an Alpha system (see Section 9.4.1) except that it lists two LAN<br />
adapters, eza0 and ezb0, as the devices from which to boot:<br />
>>> b -flags 0,1 eza0, ezb0<br />
In this command line:<br />
Stage What Happens<br />
1 MOP booting is attempted from the first device (eza0). If that fails, MOP booting is<br />
attempted from the next device (ezb0). When booting from network devices, if the MOP<br />
boot attempt fails from all devices, then the console starts again from the first device.<br />
2 Once the MOP load has completed, the boot driver starts the NISCA protocol on all of the<br />
LAN adapters. The NISCA protocol is used to access the system disk server and finish<br />
loading the operating system (see Appendix F).<br />
9.4.4 Enabling Satellites to Use Alternate LAN Adapters for Booting<br />
<strong>OpenVMS</strong> supports only one hardware address attribute per remote node<br />
definition in either a DECnet or LANCP database. To enable a satellite with<br />
multiple LAN adapters to use any LAN adapter to boot into the cluster, two<br />
different methods are available:<br />
• Define a pseudonode for each additional LAN adapter.<br />
• Create and maintain different node databases for different boot nodes.<br />
Defining Pseudonodes for Additional LAN Adapters<br />
When defining a pseudonode with a different DECnet or LANCP address:<br />
• Make sure the address points to the same cluster satellite root directory as<br />
the existing node definition (to associate the pseudonode with the satellite).<br />
• Specify the hardware address of the alternate LAN adapter in the pseudonode<br />
definition.<br />
For DECnet, follow the procedure shown in Table 9–4. For LANCP, follow the<br />
procedure shown in Table 9–5.<br />
Table 9–4 Procedure for Defining a Pseudonode Using DECnet MOP Services<br />
Step 1: Display the node’s existing definition using the following NCP commands:<br />
$ RUN SYS$SYSTEM:NCP<br />
NCP> SHOW NODE node-name CHARACTERISTICS<br />
This command displays a list of the satellite’s characteristics, such as its hardware<br />
address, load assist agent, load assist parameter, and more.<br />
Step 2: Create a pseudonode by defining a unique DECnet address and node name<br />
at the NCP command prompt, as follows (Alpha specific):<br />
DEFINE NODE pseudo-area.pseudo-number -<br />
NAME pseudo-node-name -<br />
LOAD FILE APB.EXE -<br />
LOAD ASSIST AGENT SYS$SHARE:NISCS_LAA.EXE -<br />
LOAD ASSIST PARAMETER disk$sys:[] -<br />
HARDWARE ADDRESS xx-xx-xx-xx-xx-xx<br />
This example is specific to an Alpha node. For a VAX node, replace the command<br />
LOAD FILE APB.EXE with TERTIARY LOADER SYS$SYSTEM:TERTIARY_VMB.EXE.<br />
Table 9–5 Procedure for Defining a Pseudonode Using LANCP MOP Services<br />
Step 1: Display the node’s existing definition using the following LANCP commands:<br />
$ RUN SYS$SYSTEM:LANCP<br />
LANCP> SHOW NODE node-name<br />
This command displays a list of the satellite’s characteristics, such as its hardware<br />
address and root directory address.<br />
Step 2: Create a pseudonode by defining a unique LANCP address and node name<br />
at the LANCP command prompt, as follows (Alpha specific):<br />
DEFINE NODE pseudo-node-name -<br />
/FILE=APB.EXE -<br />
/ROOT=disk$sys:[] -<br />
/ADDRESS=xx-xx-xx-xx-xx-xx<br />
This example is specific to an Alpha node. For a VAX node, replace the qualifier<br />
FILE=APB.EXE with FILE=NISCS_LOAD.EXE.<br />
Creating Different Node Databases for Different Boot Nodes<br />
When creating different DECnet or LANCP databases on different boot nodes:<br />
• Set up the databases so that a system booting from one LAN adapter receives<br />
responses from a subset of the MOP servers. The same system booting from a<br />
different LAN adapter receives responses from a different subset of the MOP<br />
servers.<br />
• In each database, list a different LAN address for the same node definition.<br />
The procedures are similar for DECnet and LANCP, but the database file names,<br />
utilities, and commands differ. For the DECnet procedure, see Table 9–6. For the<br />
LANCP procedure, see Table 9–7.<br />
Table 9–6 Procedure for Creating Different DECnet Node Databases<br />
Step 1: Define the logical name NETNODE_REMOTE to different values on different<br />
nodes so that it points to different files. The logical NETNODE_REMOTE points to<br />
the working copy of the remote node file you are creating.<br />
Step 2: Locate NETNODE_REMOTE.DAT files in the system-specific area for each<br />
node. A NETNODE_REMOTE.DAT file located in [SYS0.SYSEXE] overrides one<br />
located in [SYS0.SYSCOMMON.SYSEXE] for a system booting from system root 0.<br />
On each of the various boot servers, ensure that the hardware address is defined<br />
as a unique address that matches one of the adapters on the satellite. Enter the<br />
following commands at the NCP command prompt (Alpha specific):<br />
DEFINE NODE area.number -<br />
NAME node-name -<br />
LOAD FILE APB.EXE -<br />
LOAD ASSIST AGENT SYS$SHARE:NISCS_LAA.EXE -<br />
LOAD ASSIST PARAMETER disk$sys:[] -<br />
HARDWARE ADDRESS xx-xx-xx-xx-xx-xx<br />
If the NETNODE_REMOTE.DAT files are copies of each other, the node name,<br />
tertiary loader (for VAX) or LOAD FILE (for Alpha), load assist agent, and load<br />
assist parameter are already set up. You need only specify the new hardware<br />
address. Because the default hardware address is stored in NETUPDATE.COM,<br />
you must also edit this file on the second boot server.<br />
Table 9–7 Procedure for Creating Different LANCP Node Databases<br />
Step 1: Define the logical name LAN$NODE_DATABASE to different values on<br />
different nodes so that it points to different files. The logical LAN$NODE_DATABASE<br />
points to the working copy of the remote node file you are creating.<br />
Step 2: Locate LAN$NODE_DATABASE.DAT files in the system-specific area for<br />
each node. On each of the various boot servers, ensure that the hardware address<br />
is defined as a unique address that matches one of the adapters on the satellite.<br />
Enter the following commands at the LANCP command prompt (Alpha specific):<br />
DEFINE NODE node-name -<br />
/FILE=APB.EXE -<br />
/ROOT=disk$sys:[] -<br />
/ADDRESS=xx-xx-xx-xx-xx-xx<br />
If the LAN$NODE_DATABASE.DAT files are copies of each other, the node name<br />
and the FILE and ROOT qualifier values are already set up. You need only specify<br />
the new address.<br />
Once the satellite receives the MOP downline load from the MOP server, the<br />
satellite uses the booting LAN adapter to connect to any node serving the system<br />
disk. The satellite continues to use the LAN adapters on the boot command line<br />
exclusively until after the run-time drivers are loaded. The satellite then switches<br />
to using the run-time drivers and starts the local area <strong>OpenVMS</strong> <strong>Cluster</strong> protocol<br />
on all of the LAN adapters.<br />
For additional information about the NCP command syntax, refer to DECnet for<br />
<strong>OpenVMS</strong> Network Management Utilities.<br />
For DECnet–Plus: On an <strong>OpenVMS</strong> <strong>Cluster</strong> running DECnet–Plus, you do not<br />
need to take the same actions in order to support a satellite with more than one<br />
LAN adapter. The DECnet–Plus support to downline load a satellite allows for<br />
an entry in the database that contains a list of LAN adapter addresses. See the<br />
DECnet–Plus documentation for complete information.<br />
9.4.5 Configuring MOP Service<br />
On a boot node, CLUSTER_CONFIG.COM enables the DECnet MOP downline<br />
load service on the first circuit that is found in the DECnet database.<br />
On systems running DECnet for <strong>OpenVMS</strong>, display the circuit state and the<br />
service (MOP downline load service) state using the following command:<br />
$ MCR NCP SHOW CHAR KNOWN CIRCUITS<br />
.<br />
.<br />
.<br />
Circuit = SVA-0<br />
State = on<br />
Service = enabled<br />
.<br />
.<br />
.<br />
This example shows that circuit SVA-0 is in the ON state with the MOP downline<br />
service enabled. This is the correct state to support MOP downline loading for<br />
satellites.<br />
Enabling MOP service on additional LAN adapters (circuits) must be performed<br />
manually. For example, enter the following NCP commands to enable service for<br />
the circuit QNA-1:<br />
$ MCR NCP SET CIRCUIT QNA-1 STATE OFF<br />
$ MCR NCP SET CIRCUIT QNA-1 SERVICE ENABLED STATE ON<br />
$ MCR NCP DEFINE CIRCUIT QNA-1 SERVICE ENABLED<br />
Reference: For more details, refer to DECnet-Plus for <strong>OpenVMS</strong> Network<br />
Management.<br />
9.4.6 Controlling Satellite Booting<br />
You can control the satellite boot process in a number of ways. Table 9–8<br />
shows examples specific to DECnet for <strong>OpenVMS</strong>. Refer to the DECnet–Plus<br />
documentation for equivalent information.<br />
Table 9–8 Controlling Satellite Booting<br />
Method Comments<br />
Use DECbootsync<br />
To control the number of workstations starting up<br />
simulataneously, use DECbootsync, which is available through<br />
the NSIS Reusable Software library:<br />
http://eadc.aeo.dec.com/. DECbootsync uses the<br />
distributed lock manager to control the number of satellites<br />
allowed to continue their startup command procedures at the<br />
same time.<br />
Disable MOP service on MOP servers temporarily<br />
Until the MOP server can complete its own startup operations,<br />
boot requests can be temporarily disabled by setting the DECnet<br />
Ethernet circuit to a ‘‘Service Disabled’’ state as shown:<br />
1 To disable MOP service during startup of a MOP server,<br />
enter the following commands:<br />
$ MCR NCP DEFINE CIRCUIT MNA-1 -<br />
_$ SERVICE DISABLED<br />
$ @SYS$MANAGER:STARTNET<br />
$ MCR NCP DEFINE CIRCUIT MNA-1 -<br />
_$ SERVICE ENABLED<br />
2 To reenable MOP service later, enter the following<br />
commands in a command procedure so that they execute<br />
quickly and so that DECnet service to the users is not<br />
disrupted:<br />
$ MCR NCP<br />
NCP> SET CIRCUIT MNA-1 STATE OFF<br />
NCP> SET CIRCUIT MNA-1 SERVICE ENABLED<br />
NCP> SET CIRCUIT MNA-1 STATE ON<br />
9–10 Building Large <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong><br />
DECbootsync does not control booting of the<br />
<strong>OpenVMS</strong> operating system, but controls the<br />
execution of startup command procedures and<br />
installation of layered products on satellites.<br />
This method prevents the MOP server from<br />
servicing the satellites; it does not prevent the<br />
satellites from requesting a boot from other MOP<br />
servers.<br />
If a satellite that is requesting a boot receives no<br />
response, it will make fewer boot requests over<br />
time. Thus, booting the satellite may take longer<br />
than normal once MOP service is reenabled.<br />
1. MNA-1 represents the MOP service circuit.<br />
After entering these commands, service will<br />
be disabled in the volatile database. Do not<br />
disable service permanently.<br />
2. Reenable service as shown.<br />
(continued on next page)
Table 9–8 (Cont.) Controlling Satellite Booting<br />
Method Comments<br />
Disable MOP service for individual satellites<br />
You can disable requests temporarily on a per-node basis in order<br />
to clear a node’s information from the DECnet database. Clear<br />
a node’s information from DECnet database on the MOP server<br />
using NCP, then reenable nodes as desired to control booting:<br />
1 To disable MOP service for a given node, enter the following<br />
command:<br />
$ MCR NCP<br />
NCP> CLEAR NODE satellite HARDWARE ADDRESS<br />
2 To reenable MOP service for that node, enter the following<br />
command:<br />
$ MCR NCP<br />
NCP> SET NODE satellite ALL<br />
Building Large <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong><br />
9.4 Configuring and Booting Satellite Nodes<br />
This method does not prevent satellites from<br />
requesting boot service from another MOP server.<br />
1. After entering the commands, service will<br />
be disabled in the volatile database. Do not<br />
disable service permanently.<br />
2. Reenable service as shown.<br />
(continued on next page)<br />
Building Large <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong> 9–11
Building Large <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong><br />
9.4 Configuring and Booting Satellite Nodes<br />
Table 9–8 (Cont.) Controlling Satellite Booting<br />
Method Comments<br />
Bring satellites to console prompt on shutdown<br />
Use any of the following methods to halt a satellite so that it halts<br />
(rather than reboots) upon restoration of power.<br />
1 Use the VAXcluster Console System (VCS).<br />
2 Stop in console mode upon Halt or powerup:<br />
For Alpha computers:<br />
>>> (SET AUTO_ACTION HALT)<br />
For VAX 3100 or VAX 4000 series computers:<br />
>>> (SET HALT 3)<br />
For VAX 2000 series computers:<br />
>>> TEST 53<br />
2 ? >>> 3<br />
3 Set up a satellite so that it will stop in console mode when a<br />
HALT instruction is executed according to the instructions<br />
in the following list.<br />
a. Enter the following NCP commands so that a reboot<br />
will load an image that does a HALT instruction:<br />
$ MCR NCP<br />
NCP> CLEAR NODE node LOAD ASSIST PARAMETER<br />
NCP> CLEAR NODE node LOAD ASSIST AGENT<br />
NCP> SET NODE node LOAD FILE -<br />
_ MOM$LOAD:READ_ADDR.SYS<br />
b. Shut down the satellite, and specify an immediate<br />
reboot using the following SYSMAN command:<br />
$ MCR SYSMAN<br />
SYSMAN> SET ENVIRONMENT/NODE=satellite<br />
SYSMAN> DO @SYS$UPDATE:AUTOGEN REBOOT<br />
c. When you want to allow the satellite to boot normally,<br />
enter the following NCP commands so that <strong>OpenVMS</strong><br />
will be loaded later:<br />
$ MCR NCP<br />
NCP> SET NODE satellite ALL<br />
9–12 Building Large <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong><br />
If you plan to use the DECnet Trigger operation, it<br />
is important to use a program to perform a HALT<br />
instruction that causes the satellite to enter console<br />
mode. This is because systems that support remote<br />
triggering only support it while the system is in<br />
console mode.<br />
1. Some, but not all, satellites can be set up<br />
so they halt upon restoration of power or<br />
execution of a HALT instruction rather than<br />
automatically rebooting.<br />
Note: You need to enter the SET commands<br />
only once on each system because the settings<br />
are saved in nonvolatile RAM.<br />
2. The READ_ADDR.SYS program, which is<br />
normally used to find out the Ethernet address<br />
of a satellite node, also executes a HALT<br />
instruction upon its completion.<br />
(continued on next page)
Table 9–8 (Cont.) Controlling Satellite Booting<br />
Method Comments<br />
Boot satellites remotely with Trigger<br />
The console firmware in some satellites, such as the VAX 3100<br />
and VAX 4000, allow you to boot them remotely using the DECnet<br />
Trigger operation. You must turn on this capability at the console<br />
prompt before you enter the NCP command TRIGGER, as shown:<br />
1 To boot VAX satellites using the DECnet Trigger facility,<br />
enter these commands at the console prompt:<br />
>>> SET MOP 1<br />
>>> SET TRIGGER 1<br />
>>> SET PSWD<br />
2 Trigger a satellite boot remotely by entering the following<br />
commands at the NCP prompt, as follows:<br />
$ MCR NCP<br />
NCP> TRIGGER NODE satellite -<br />
_ VIA MNA-1 SERVICE PASSWORD password<br />
Building Large <strong>OpenVMS</strong> <strong>Cluster</strong> <strong>Systems</strong><br />
9.4 Configuring and Booting Satellite Nodes<br />
Optionally, you can set up the MOP server to run a<br />
command procedure and trigger 5 or 10 satellites at<br />
a time to stagger the boot-time work load. You can<br />
boot satellites in a priority order, for example, first<br />
boot your satellite, then high-priority satellites, and<br />
so on.<br />
1. The SET TRIGGER 1 command enables the<br />
DECnet MOP listener, and the SET PSWD<br />
command enables remote triggering. The SET<br />
PSWD command prompts you twice for a 16-digit<br />
hexadecimal password string that is used<br />
to validate a remote trigger request.<br />
Note: You need to enter the SET commands<br />
only once on each system, because the settings<br />
are saved in nonvolatile RAM.<br />
2. MNA-1 represents the MOP service circuit,<br />
and password is the hexadecimal number that<br />
you specified in step 1 with the SET PSWD<br />
command.<br />
Important: When the SET HALT command is set up as described in Table 9–8,<br />
a power failure will cause the satellite to stop at the console prompt instead of<br />
automatically rebooting when power is restored. This is appropriate for a mass<br />
power failure, but if someone trips over the power cord for a single satellite,<br />
it can result in unnecessary unavailability.<br />
You can recover satellites that halt this way by periodically running a batch<br />
job that performs the following tasks:<br />
1. Uses the DCL lexical function F$GETSYI to check each node that should be<br />
in the cluster.<br />
2. Checks the CLUSTER_MEMBER lexical item.<br />
3. Issues an NCP TRIGGER command for any satellite that is not currently a<br />
member of the cluster.<br />
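The periodic batch job described above can be sketched in DCL. The following is illustrative only: the node list, the MOP circuit name (MNA-1), and the service password are placeholder values that would differ at each site.<br />

```
$! REBOOT_SATL.COM -- hypothetical sketch of a periodic satellite-recovery job.
$! The node list, circuit name, and password below are example values.
$ NODES = "SATL01,SATL02,SATL03"
$ I = 0
$LOOP:
$ NODE = F$ELEMENT(I, ",", NODES)
$ IF NODE .EQS. "," THEN GOTO DONE          ! End of the node list
$ I = I + 1
$! F$GETSYI("CLUSTER_MEMBER",node) is true only for current cluster members
$ IF F$GETSYI("CLUSTER_MEMBER", NODE) THEN GOTO LOOP
$! Not currently a member: trigger a remote boot via the MOP service circuit
$ MCR NCP TRIGGER NODE 'NODE' VIA MNA-1 SERVICE PASSWORD FEFEFEFEFEFEFEFE
$ GOTO LOOP
$DONE:
$ EXIT
```

Such a procedure would normally resubmit itself to a batch queue so that the scan repeats at a fixed interval.<br />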
9.5 System-Disk Throughput<br />
Achieving sufficient system-disk throughput requires some combination of the<br />
following techniques:<br />
Technique Reference<br />
Avoid disk rebuilds at boot time. Section 9.5.1<br />
Offload work from the system disk. Section 9.5.2<br />
Configure multiple system disks. Section 9.5.3<br />
Use Volume Shadowing for <strong>OpenVMS</strong>. Section 6.6<br />
9.5.1 Avoiding Disk Rebuilds<br />
The <strong>OpenVMS</strong> file system maintains a cache of preallocated file headers and<br />
disk blocks. When a disk is not properly dismounted, such as when a system<br />
fails, this preallocated space becomes temporarily unavailable. When the disk is<br />
mounted again, <strong>OpenVMS</strong> scans the disk to recover that space. This is called a<br />
disk rebuild.<br />
A large <strong>OpenVMS</strong> <strong>Cluster</strong> system must ensure sufficient capacity to boot nodes<br />
in a reasonable amount of time. To minimize the impact of disk rebuilds at boot<br />
time, consider making the following changes:<br />
Action: Use the DCL command MOUNT/NOREBUILD for all user disks, at least on the<br />
satellite nodes. Enter this command into startup procedures that mount user disks.<br />
Result: It is undesirable to have a satellite node rebuild the disk, yet this is likely<br />
to happen if a satellite is the first to reboot after it or another node fails.<br />
Action: Set the system parameter ACP_REBLDSYSD to 0, at least for the satellite nodes.<br />
Result: This prevents a rebuild operation on the system disk when it is mounted<br />
implicitly by <strong>OpenVMS</strong> early in the boot process.<br />
Action: Avoid a disk rebuild during prime working hours by using the SET<br />
VOLUME/REBUILD command during times when the system is not so heavily used.<br />
Once the computer is running, you can run a batch job or a command procedure to<br />
execute the SET VOLUME/REBUILD command for each disk drive.<br />
Result: User response times can be degraded during a disk rebuild operation because<br />
most I/O activity on that disk is blocked. Because the SET VOLUME/REBUILD<br />
command determines whether a rebuild is needed, the job can execute the command<br />
for every disk. This job can be run during off hours, preferably on one of the more<br />
powerful nodes.<br />
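A minimal off-hours rebuild job might look like the following DCL sketch. The device and queue names are example values; SET VOLUME/REBUILD itself determines whether each disk actually needs rebuilding, so running it against an already-consistent disk is harmless.<br />

```
$! NIGHTLY_REBUILD.COM -- hypothetical sketch; submit to a batch queue
$! during off hours. DUA1:, DUA2:, and SYS$BATCH are example names.
$ SET NOON                          ! Continue past errors on individual disks
$ SET VOLUME/REBUILD DUA1:
$ SET VOLUME/REBUILD DUA2:
$! Resubmit this procedure so the check repeats each night at 02:00
$ SUBMIT/QUEUE=SYS$BATCH/AFTER="TOMORROW+2:00" NIGHTLY_REBUILD.COM
```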
Caution: In large <strong>OpenVMS</strong> <strong>Cluster</strong> systems, large amounts of disk space can<br />
be preallocated to caches. If many nodes abruptly leave the cluster (for example,<br />
during a power failure), this space becomes temporarily unavailable. If your<br />
system usually runs with nearly full disks, do not disable rebuilds on the server<br />
nodes at boot time.<br />
9.5.2 Offloading Work<br />
In addition to the system disk throughput issues during an entire <strong>OpenVMS</strong><br />
<strong>Cluster</strong> boot, access to particular system files even during steady-state operations<br />
(such as logging in, starting up applications, or issuing a PRINT command) can<br />
affect response times.<br />
You can identify hot system files using a performance or monitoring tool (such<br />
as those listed in Section 1.5.2), and use the techniques in the following table to<br />
reduce hot file I/O activity on system disks:<br />
Potential Hot Files Methods to Help<br />
Page and swap files When you run CLUSTER_CONFIG_LAN.COM or CLUSTER_<br />
CONFIG.COM to add computers, specify the sizes and locations of page and<br />
swap files, relocating the files as follows:<br />
• Move page and swap files for computers off system disks.<br />
• Set up page and swap files for satellites on the satellites’<br />
local disks, if such disks are available.
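One way to carry this out on a satellite is to mount the local disk and install its page and swap files from SYS$MANAGER:SYPAGSWPFILES.COM, which runs early in the boot sequence. The following is a sketch only; the device name, volume label, and file locations are assumed example values.<br />

```
$! Hypothetical fragment for a satellite's SYS$MANAGER:SYPAGSWPFILES.COM.
$! DKA100: and the PAGESWAP volume label are example values.
$ MOUNT/SYSTEM/NOASSIST DKA100: PAGESWAP
$ MCR SYSGEN
INSTALL DKA100:[SYSEXE]PAGEFILE.SYS /PAGEFILE
INSTALL DKA100:[SYSEXE]SWAPFILE.SYS /SWAPFILE
EXIT
```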
Move these high-activity files off<br />
the system disk:<br />
• SYSUAF.DAT<br />
• NETPROXY.DAT<br />
• RIGHTSLIST.DAT<br />
• ACCOUNTNG.DAT<br />
• VMSMAIL_PROFILE.DATA<br />
• QMAN$MASTER.DAT<br />
• †VMS$OBJECTS.DAT<br />
• Layered product and other<br />
application files<br />
†VAX specific<br />
Use any of the following methods:<br />
• Specify new locations for the files according to the<br />
instructions in Chapter 5.<br />
• Use caching in the HSC subsystem or in RF or RZ disks to<br />
improve the effective system-disk throughput.<br />
• Add a solid-state disk to your configuration. These devices<br />
have lower latencies and can handle a higher request rate<br />
than a regular magnetic disk. A solid-state disk can be used<br />
as a system disk or to hold system files.<br />
• Use DECram software to create RAMdisks on MOP servers<br />
to hold copies of selected hot read-only files to improve boot<br />
times. A RAMdisk is an area of main memory within a<br />
system that is set aside to store data, but it is accessed as if<br />
it were a disk.<br />
Moving these files from the system disk to a separate disk eliminates most of the<br />
write activity to the system disk. This raises the read/write ratio and, if you are<br />
using Volume Shadowing for <strong>OpenVMS</strong>, maximizes the performance of shadowing<br />
on the system disk.<br />
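Relocation of these files is typically done by defining their system logical names in SYLOGICALS.COM. The following sketch assumes a data disk with the example name WORK1: and an example directory; only the logical names themselves are standard.<br />

```
$! Hypothetical SYLOGICALS.COM entries pointing shared system files
$! at a common data disk. WORK1:[SYSFILES] is an example location.
$ DEFINE/SYSTEM/EXEC SYSUAF          WORK1:[SYSFILES]SYSUAF.DAT
$ DEFINE/SYSTEM/EXEC NETPROXY        WORK1:[SYSFILES]NETPROXY.DAT
$ DEFINE/SYSTEM/EXEC RIGHTSLIST      WORK1:[SYSFILES]RIGHTSLIST.DAT
$ DEFINE/SYSTEM/EXEC VMSMAIL_PROFILE WORK1:[SYSFILES]VMSMAIL_PROFILE.DATA
$ DEFINE/SYSTEM/EXEC QMAN$MASTER     WORK1:[SYSFILES]
```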
9.5.3 Configuring Multiple System Disks<br />
Depending on the number of computers to be included in a large cluster and the<br />
work being done, you must evaluate the tradeoffs involved in configuring a single<br />
system disk or multiple system disks.<br />
While a single system disk is easier to manage, a large cluster often requires<br />
more system disk I/O capacity than a single system disk can provide. To achieve<br />
satisfactory performance, multiple system disks may be needed. However,<br />
you should recognize the increased system management efforts involved in<br />
maintaining multiple system disks.<br />
Consider the following when determining the need for multiple system disks:<br />
• Concurrent user activity<br />
In clusters with many satellites, the amount and type of user activity on those<br />
satellites influence system-disk load and, therefore, the number of satellites<br />
that can be supported by a single system disk. For example:<br />
IF: Many users are active or run multiple applications simultaneously.<br />
THEN: The load on the system disk can be significant; multiple system disks may<br />
be required.<br />
Comments: Some <strong>OpenVMS</strong> <strong>Cluster</strong> systems may need to be<br />
configured on the assumption that all users are constantly active. Such working<br />
conditions may require a larger, more expensive <strong>OpenVMS</strong> <strong>Cluster</strong><br />
system that handles peak loads without performance degradation.<br />
IF: Few users are active simultaneously.<br />
THEN: A single system disk might support a large number of satellites.<br />
Comments: For most configurations, the probability is low that most users are<br />
active simultaneously. A smaller and less expensive <strong>OpenVMS</strong> <strong>Cluster</strong><br />
system can be configured for these typical working conditions but may suffer some<br />
performance degradation during peak load periods.<br />
IF: Most users run a single application for extended periods.<br />
THEN: A single system disk might support a large number of satellites if significant<br />
numbers of I/O requests can be directed to application data disks.<br />
Comments: Because each workstation user in an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
system has a dedicated computer, a user who runs large compute-bound jobs on that<br />
dedicated computer does not significantly affect users of other computers in the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system. For clustered workstations, the critical<br />
shared resource is a disk server. Thus, if a workstation user runs an I/O-intensive<br />
job, its effect on other workstations sharing the same disk server might be noticeable.<br />
• Concurrent booting activity<br />
One of the few times when all <strong>OpenVMS</strong> <strong>Cluster</strong> computers are<br />
simultaneously active is during a cluster reboot. All satellites are waiting to<br />
reload the operating system, and as soon as a boot server is available, they<br />
begin to boot in parallel. This booting activity places a significant I/O load on<br />
the boot server, system disk, and interconnect.<br />
Note: You can reduce overall cluster boot time by configuring multiple<br />
system disks and by distributing system roots for computers evenly across<br />
those disks. This technique has the advantage of increasing overall system<br />
disk I/O capacity, but it has the disadvantage of requiring additional system<br />
management effort. For example, installation of layered products or upgrades<br />
of the <strong>OpenVMS</strong> operating system must be repeated once for each system<br />
disk.<br />
• System management<br />
Because system management work load increases in direct proportion to the<br />
number of separate system disks that must be maintained, minimize the number<br />
of system disks you add to provide the required level of performance.<br />
Volume Shadowing for <strong>OpenVMS</strong> is an alternative to creating multiple system<br />
disks. Volume shadowing increases the read I/O capacity of a single system disk<br />
and minimizes the number of separate system disks that have to be maintained<br />
because installations or updates need only be applied once to a volume-shadowed<br />
system disk. For clusters with substantial system disk I/O requirements, you can<br />
use multiple system disks, each configured as a shadow set.<br />
Cloning the system disk is a way to manage multiple system disks. To clone the<br />
system disk:<br />
• Create a system disk (or shadow set) with roots for all <strong>OpenVMS</strong> <strong>Cluster</strong><br />
nodes.<br />
• Use this as a master copy, and perform all software upgrades on this system<br />
disk.<br />
• Back up the master copy to the other disks to create the cloned system disks.<br />
• Change the volume names so they are unique.<br />
• If you have not moved system files off the system disk, you must have the<br />
SYLOGICALS.COM startup file point to system files on the master system<br />
disk.<br />
• Before an upgrade, be sure to save any changes you need from the cloned<br />
disks since the last upgrade, such as MODPARAMS.DAT and AUTOGEN<br />
feedback data, accounting files for billing, and password history.<br />
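Under those guidelines, producing one cloned disk might look like the following sketch. The device names (DUA1: for the master, DUA2: for the clone) and the new volume label are example values, and the exact BACKUP qualifiers appropriate for a site may differ.<br />

```
$! Hypothetical cloning sketch. DUA1: holds the master system disk;
$! DUA2: receives the clone. Device names and label are examples.
$ MOUNT/FOREIGN DUA2:
$ BACKUP/IMAGE/IGNORE=(NOBACKUP,INTERLOCK) DUA1: DUA2:
$ DISMOUNT DUA2:
$! Give the clone a unique volume name
$ MOUNT/OVERRIDE=IDENTIFICATION DUA2:
$ SET VOLUME/LABEL=SYSDSK2 DUA2:
$ DISMOUNT DUA2:
```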
9.6 Conserving System Disk Space<br />
The essential files for a satellite root take up very little space, so that more than<br />
96 roots can easily fit on a single system disk. However, if you use separate dump<br />
files for each satellite node or put page and swap files for all the satellite nodes<br />
on the system disk, you quickly run out of disk space.<br />
9.6.1 Techniques<br />
To avoid running out of disk space, set up common dump files for all the satellites<br />
or for groups of satellite nodes. For debugging purposes, it is best to have<br />
separate dump files for each MOP and disk server. Also, you can use local disks<br />
on satellite nodes to hold page and swap files, instead of putting them on the<br />
system disk. In addition, move page and swap files for MOP and disk servers off<br />
the system disk.<br />
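Creating a single dump file in the cluster common root is one approach, sketched below. This assumes that a satellite with no SYSDUMP.SYS in its own SYS$SPECIFIC root resolves SYS$SYSTEM:SYSDUMP.SYS to the common copy; the size shown is an example value only.<br />

```
$! Hypothetical sketch: one dump file shared by satellites that have no
$! SYSDUMP.SYS in their own SYS$SPECIFIC root. The size is an example.
$ MCR SYSGEN
CREATE SYS$COMMON:[SYSEXE]SYSDUMP.SYS /SIZE=150000
EXIT
```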
Reference: See Section 10.8 to plan a strategy for managing dump files.<br />
9.7 Adjusting System Parameters<br />
As an <strong>OpenVMS</strong> <strong>Cluster</strong> system grows, certain data structures within <strong>OpenVMS</strong><br />
need to grow in order to accommodate the large number of nodes. If growth is<br />
not possible (for example, because of a shortage of nonpaged pool), the shortage<br />
can induce intermittent problems that are difficult to diagnose.<br />
You should run AUTOGEN with FEEDBACK frequently as a cluster grows, so<br />
that settings for many parameters can be adjusted. Refer to Section 8.7 for more<br />
information about running AUTOGEN.<br />
In addition to running AUTOGEN with FEEDBACK, you should check and<br />
manually adjust the following parameters:<br />
• SCSBUFFCNT<br />
• SCSRESPCNT<br />
• CLUSTER_CREDITS<br />
Prior to <strong>OpenVMS</strong> Version 7.2, customers were advised to also check the<br />
SCSCONNCNT parameter. However, beginning with <strong>OpenVMS</strong> Version 7.2, the<br />
SCSCONNCNT parameter is obsolete. SCS connections are now allocated and<br />
expanded only as needed, up to a limit of 65,000.<br />
9.7.1 The SCSBUFFCNT Parameter (VAX Only)<br />
Note: On Alpha systems, the SCS buffers are allocated as needed, and the<br />
SCSBUFFCNT parameter is reserved for <strong>OpenVMS</strong> use only.<br />
Description: On VAX systems, the SCSBUFFCNT parameter controls the<br />
number of buffer descriptor table (BDT) entries that describe data buffers used in<br />
block data transfers between nodes.<br />
Symptoms of entry shortages: A shortage of entries affects performance, most<br />
notably on nodes that perform MSCP serving.<br />
How to determine a shortage of BDT entries: Use the SDA utility (or the<br />
Show <strong>Cluster</strong> utility) to identify systems that have waited for BDT entries.<br />
SDA> READ SYS$SYSTEM:SCSDEF<br />
%SDA-I-READSYM, reading symbol table SYS$COMMON:[SYSEXE]SCSDEF.STB;1<br />
SDA> EXAM @SCS$GL_BDT + CIBDT$L_QBDT_CNT<br />
8046BB6C: 00000000 "...."<br />
SDA><br />
How to resolve shortages: If the SDA EXAMINE command displays a nonzero<br />
value, BDT waits have occurred. If the number is nonzero and continues to<br />
increase during normal operations, increase the value of SCSBUFFCNT.<br />
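On a VAX node that shows increasing BDT waits, the increase is usually made through MODPARAMS.DAT so that subsequent AUTOGEN runs preserve it. The value below is purely an example, not a recommendation.<br />

```
! Hypothetical MODPARAMS.DAT addition on the affected VAX node;
! 512 is an example value only.
MIN_SCSBUFFCNT = 512
```

After editing MODPARAMS.DAT, run AUTOGEN (for example, @SYS$UPDATE:AUTOGEN GETDATA REBOOT FEEDBACK) so the new minimum takes effect.<br />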
9.7.2 The SCSRESPCNT Parameter<br />
Description: The SCSRESPCNT parameter controls the number of response<br />
descriptor table (RDT) entries available for system use. An RDT entry is required<br />
for every in-progress message exchange between two nodes.<br />
Symptoms of entry shortages: A shortage of entries affects performance, since<br />
message transmissions must be delayed until a free entry is available.<br />
How to determine a shortage of RDT entries: Use the SDA utility as follows<br />
to check each system for requests that waited because there were not enough free<br />
RDTs.<br />
SDA> READ SYS$SYSTEM:SCSDEF<br />
%SDA-I-READSYM, reading symbol table SYS$COMMON:[SYSEXE]SCSDEF.STB;1<br />
SDA> EXAM @SCS$GL_RDT + RDT$L_QRDT_CNT<br />
8044DF74: 00000000 "...."<br />
SDA><br />
How to resolve shortages: If the SDA EXAMINE command displays a nonzero<br />
value, RDT waits have occurred. If you find a count that tends to increase over<br />
time under normal operations, increase SCSRESPCNT.<br />
9.7.3 The CLUSTER_CREDITS Parameter<br />
Description: The CLUSTER_CREDITS parameter specifies the number of<br />
per-connection buffers a node allocates to receiving VMS$VAXcluster communications.<br />
This system parameter is not dynamic; that is, if you change the value, you must<br />
reboot the node on which you changed it.<br />
Default: The default value is 10. The default value may be insufficient for a<br />
cluster that has very high locking rates.<br />
Symptoms of cluster credit problem: A shortage of credits affects<br />
performance, since message transmissions are delayed until free credits are<br />
available. These are visible as credit waits in the SHOW CLUSTER display.<br />
How to determine whether credit waits exist: Use the SHOW CLUSTER<br />
utility as follows:<br />
1. Run SHOW CLUSTER/CONTINUOUS.<br />
2. Type REMOVE SYSTEM/TYPE=HS.<br />
3. Type ADD LOC_PROC, CR_WAIT.<br />
4. Type SET CR_WAIT/WIDTH=10.<br />
5. Check to see whether the number of CR_WAITS (credit waits) logged<br />
against the VMS$VAXcluster connection for any remote node is incrementing<br />
regularly. Ideally, credit waits should not occur. However, occasional waits<br />
under very heavy load conditions are acceptable.<br />
How to resolve incrementing credit waits:<br />
If the number of CR_WAITS is incrementing more than once per minute, perform<br />
the following steps:<br />
1. Increase the CLUSTER_CREDITS parameter by 5 on the node against which<br />
the credit waits are being logged. The parameter should be modified on the<br />
remote node, not on the node that is running SHOW CLUSTER.<br />
2. Reboot the node.<br />
Note that it is not necessary for the CLUSTER_CREDITS parameter to be the<br />
same on every node.<br />
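Because CLUSTER_CREDITS is not dynamic, the change is normally recorded in MODPARAMS.DAT on the remote node and applied with AUTOGEN before the reboot. The resulting value below is an example only (the default of 10 plus the suggested increment of 5).<br />

```
! Hypothetical MODPARAMS.DAT entry on the remote node that is
! logging the credit waits. 15 = default 10 + increment of 5.
CLUSTER_CREDITS = 15
```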
9.8 Minimize Network Instability<br />
Network instability also affects <strong>OpenVMS</strong> <strong>Cluster</strong> operations. Table 9–9 lists<br />
techniques to minimize typical network problems.<br />
Table 9–9 Techniques to Minimize Network Problems<br />
Technique Recommendation<br />
Adjust the RECNXINTERVAL parameter.<br />
The RECNXINTERVAL system parameter specifies the number of seconds<br />
the <strong>OpenVMS</strong> <strong>Cluster</strong> system waits when it loses contact with a node,<br />
before removing the node from the configuration. Many large <strong>OpenVMS</strong><br />
<strong>Cluster</strong> configurations operate with the RECNXINTERVAL parameter set<br />
to 40 seconds (the default value is 20 seconds).<br />
Raising the value of RECNXINTERVAL can result in longer perceived<br />
application pauses, especially when the node leaves the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
system abnormally. The pause is caused by the connection manager<br />
waiting for the number of seconds specified by RECNXINTERVAL.<br />
Protect the network. Treat the LAN as if it were part of the <strong>OpenVMS</strong> <strong>Cluster</strong> system. For<br />
example, do not allow an environment in which a random user can<br />
disconnect a ThinWire segment to attach a new PC while 20 satellites<br />
hang.<br />
Choose your hardware and configuration carefully. Certain hardware is not suitable<br />
for use in a large <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
• Some network components can appear to work well with light loads,<br />
but are unable to operate properly under high traffic conditions.<br />
Improper operation can result in lost or corrupted packets that will<br />
require packet retransmissions. This reduces performance and can<br />
affect the stability of the <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
• Beware of bridges that cannot filter and forward at full line rates and<br />
repeaters that do not handle congested conditions well.<br />
• Refer to Guidelines for <strong>OpenVMS</strong> <strong>Cluster</strong> Configurations to determine<br />
appropriate <strong>OpenVMS</strong> <strong>Cluster</strong> configurations and capabilities.<br />
Use the LAVC$FAILURE_ANALYSIS facility. See Section D.5 for assistance in the<br />
isolation of network faults.<br />
9.9 DECnet <strong>Cluster</strong> Alias<br />
You should define a cluster alias name for the <strong>OpenVMS</strong> <strong>Cluster</strong> to ensure that<br />
remote access will be successful when at least one <strong>OpenVMS</strong> <strong>Cluster</strong> member is<br />
available to process the client program’s requests.<br />
The cluster alias acts as a single network node identifier for an <strong>OpenVMS</strong> <strong>Cluster</strong><br />
system. Computers in the cluster can use the alias for communications with<br />
other computers in a DECnet network. Note that it is possible for nodes running<br />
DECnet for <strong>OpenVMS</strong> to have a unique and separate cluster alias from nodes<br />
running DECnet–Plus. In addition, clusters running DECnet–Plus can have one<br />
cluster alias for VAX, one for Alpha, and another for both.<br />
Note: A single cluster alias can include nodes running either DECnet for<br />
<strong>OpenVMS</strong> or DECnet–Plus, but not both. Also, an <strong>OpenVMS</strong> <strong>Cluster</strong> running<br />
both DECnet for <strong>OpenVMS</strong> and DECnet–Plus requires multiple system disks<br />
(one for each).<br />
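For nodes running DECnet for <strong>OpenVMS</strong> (Phase IV), enabling an alias can be sketched with NCP as shown below; the alias name and address are example values, and DECnet–Plus uses a different, NCL-based procedure.<br />

```
$! Hypothetical sketch for DECnet for OpenVMS (Phase IV).
$! SOLAR and 2.100 are example values for the alias name and address.
$ MCR NCP
NCP> DEFINE NODE 2.100 NAME SOLAR
NCP> DEFINE EXECUTOR ALIAS NODE SOLAR
NCP> EXIT
```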
Reference: See Chapter 4 for more information about setting up and using a<br />
cluster alias in an <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
10<br />
Maintaining an <strong>OpenVMS</strong> <strong>Cluster</strong> System<br />
Once your cluster is up and running, you can implement routine, site-specific<br />
maintenance operations—for example, backing up disks or adding user accounts,<br />
performing software upgrades and installations, running AUTOGEN with the<br />
feedback option on a regular basis, and monitoring the system for performance.<br />
You should also maintain records of current configuration data, especially any<br />
changes to hardware or software components. If you are managing a cluster that<br />
includes satellite nodes, it is important to monitor LAN activity.<br />
From time to time, conditions may occur that require the following special<br />
maintenance operations:<br />
• Restoring cluster quorum after an unexpected computer failure<br />
• Executing conditional shutdown operations<br />
• Performing security functions in LAN and mixed-interconnect clusters<br />
10.1 Backing Up Data and Files<br />
As a part of the regular system management procedure, you should copy<br />
operating system files, application software files, and associated files to an<br />
alternate device using the <strong>OpenVMS</strong> Backup utility.<br />
Some backup operations are the same in an <strong>OpenVMS</strong> <strong>Cluster</strong> as they are on a<br />
single <strong>OpenVMS</strong> system, such as an incremental backup of a disk while it<br />
is in use or the backup of a nonshared disk.<br />
Backup tools for use in a cluster include those listed in Table 10–1.<br />
Table 10–1 Backup Methods<br />
Tool Usage<br />
Online backup Use from a running system to back up:<br />
• The system’s local disks<br />
• <strong>Cluster</strong>-shareable disks other than system disks<br />
• The system disk or disks<br />
Caution: Files open for writing at the time of the backup procedure may<br />
not be backed up correctly.<br />
Menu-driven or †standalone BACKUP (†VAX specific)<br />
Use one of the following methods:<br />
• If you have access to the <strong>OpenVMS</strong> Alpha or VAX distribution CD–<br />
ROM, back up your system using the menu system provided on that<br />
disc. This menu system, which is displayed automatically when you<br />
boot the CD–ROM, allows you to:<br />
Enter a DCL environment, from which you can perform backup<br />
and restore operations on the system disk (instead of using<br />
standalone BACKUP).<br />
Install or upgrade the operating system and layered products,<br />
using the POLYCENTER Software Installation utility.<br />
Reference: For more detailed information about using the menu-driven<br />
procedure, see the <strong>OpenVMS</strong> Upgrade and Installation<br />
Manual and the <strong>OpenVMS</strong> System Manager’s Manual.<br />
• If you do not have access to the <strong>OpenVMS</strong> VAX distribution CD–<br />
ROM, you should use standalone BACKUP to back up and restore<br />
your system disk. Standalone BACKUP:<br />
Should be used with caution because it does not:<br />
a. Participate in the cluster<br />
b. Synchronize volume ownership or file I/O with other<br />
systems in the cluster<br />
Can boot from the system disk instead of the console media.<br />
Standalone BACKUP is built in the reserved root on any system<br />
disk.<br />
Reference: For more information about standalone BACKUP, see<br />
the <strong>OpenVMS</strong> System Manager’s Manual.<br />
Plan to perform the backup process regularly, according to a schedule that is<br />
consistent with application and user needs. This may require creative scheduling<br />
so that you can coordinate backups with times when user and application system<br />
requirements are low.<br />
Reference: See the <strong>OpenVMS</strong> System Management Utilities Reference Manual:<br />
A–L for complete information about the <strong>OpenVMS</strong> Backup utility.<br />
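As an illustration of a routine online backup, the following sketch writes an image save set of a cluster-shareable data disk to tape. All device and save-set names are example values; /IGNORE=INTERLOCK is included because, per the caution in Table 10–1, files open for writing may not be backed up correctly.<br />

```
$! Hypothetical online backup of a data disk to tape.
$! DUA20: and MUA0: are example device names.
$ MOUNT/FOREIGN MUA0:
$ BACKUP/IMAGE/RECORD/IGNORE=INTERLOCK DUA20: MUA0:WEEKLY.BCK/SAVE_SET/REWIND
$ DISMOUNT MUA0:
```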
10.2 Updating the <strong>OpenVMS</strong> Operating System<br />
When updating the <strong>OpenVMS</strong> operating system, follow the steps in Table 10–2.<br />
Table 10–2 Upgrading the <strong>OpenVMS</strong> Operating System<br />
Step Action<br />
1 Back up the system disk.<br />
2 Perform the update procedure once for each system disk.<br />
3 Install any mandatory updates.<br />
4 Run AUTOGEN on each node that boots from that system disk.<br />
5 Run the user environment test package (UETP) to test the installation.<br />
6 Use the <strong>OpenVMS</strong> Backup utility to make a copy of the new system volume.<br />
Reference: See the appropriate <strong>OpenVMS</strong> upgrade and installation manual for<br />
complete instructions.<br />
10.2.1 Rolling Upgrades<br />
The <strong>OpenVMS</strong> operating system allows an <strong>OpenVMS</strong> <strong>Cluster</strong> system running on<br />
multiple system disks to continue to provide service while the system software is<br />
being upgraded. This process is called a rolling upgrade because each node is<br />
upgraded and rebooted in turn, until all the nodes have been upgraded.<br />
If you must first migrate your system from running on one system disk to running<br />
on two or more system disks, follow these steps:<br />
Step Action<br />
1 Follow the procedures in Section 8.5 to create a duplicate disk.<br />
2 Follow the instructions in Section 5.10 for information about coordinating system files.<br />
These sections help you add a system disk and prepare a common user<br />
environment on multiple system disks to make the shared system files such<br />
as the queue database, rightslists, proxies, mail, and other files available across<br />
the <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
10.3 LAN Network Failure Analysis<br />
The <strong>OpenVMS</strong> operating system provides a sample program to help you<br />
analyze <strong>OpenVMS</strong> <strong>Cluster</strong> network failures on the LAN. You can edit and use<br />
the SYS$EXAMPLES:LAVC$FAILURE_ANALYSIS.MAR program to detect<br />
and isolate failed network components. Using the network failure analysis<br />
program can help reduce the time required to detect and isolate a failed network<br />
component, thereby providing a significant increase in cluster availability.<br />
Reference: For a description of the network failure analysis program, refer to<br />
Appendix D.<br />
10.4 Recording Configuration Data<br />
To maintain an <strong>OpenVMS</strong> <strong>Cluster</strong> system effectively, you must keep accurate<br />
records about the current status of all hardware and software components and<br />
about any changes made to those components. Changes to cluster components<br />
can have a significant effect on the operation of the entire cluster. If a failure<br />
occurs, you may need to consult your records to aid problem diagnosis.<br />
Maintaining current records for your configuration is necessary both for routine<br />
operations and for eventual troubleshooting activities.<br />
10.4.1 Record Information<br />
At a minimum, your configuration records should include the following<br />
information:<br />
• A diagram of your physical cluster configuration. (Appendix D includes a<br />
discussion of keeping a LAN configuration diagram.)<br />
• SCSNODE and SCSSYSTEMID parameter values for all computers.<br />
• VOTES and EXPECTED_VOTES parameter values.<br />
• DECnet names and addresses for all computers.<br />
• Current values for cluster-related system parameters, especially ALLOCLASS<br />
and TAPE_ALLOCLASS values for HSC subsystems and computers.<br />
Reference: <strong>Cluster</strong> system parameters are described in Appendix A.<br />
• Names and locations of default bootstrap command procedures for all<br />
computers connected with the CI.<br />
• Names of cluster disk and tape devices.<br />
• In LAN and mixed-interconnect clusters, LAN hardware addresses for<br />
satellites.<br />
• Names of LAN adapters.<br />
• Names of LAN segments or rings.<br />
• Names of LAN bridges.<br />
• Names of wiring concentrators or of DELNI or DEMPR adapters.<br />
• Serial numbers of all hardware components.<br />
• Changes to any hardware or software components (including site-specific<br />
command procedures), along with dates and times when changes were made.<br />
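Much of this information can be captured with standard utilities. For example, a sketch of a SYSGEN session that records the cluster-related parameter values for one computer:<br />

```
$ MCR SYSGEN
SYSGEN> SHOW SCSNODE
SYSGEN> SHOW SCSSYSTEMID
SYSGEN> SHOW VOTES
SYSGEN> SHOW EXPECTED_VOTES
SYSGEN> SHOW ALLOCLASS
SYSGEN> SHOW TAPE_ALLOCLASS
SYSGEN> EXIT
```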
10.4.2 Satellite Network Data<br />
The first time you execute CLUSTER_CONFIG.COM to add a satellite, the<br />
procedure creates the file NETNODE_UPDATE.COM in the boot server’s<br />
SYS$SPECIFIC:[SYSMGR] directory. (For a common-environment cluster, you<br />
must rename this file to the SYS$COMMON:[SYSMGR] directory, as described<br />
in Section 5.10.2.) This file, which is updated each time you add or remove a<br />
satellite or change its Ethernet or FDDI hardware address, contains all essential<br />
network configuration data for the satellite.<br />
If an unexpected condition at your site causes configuration data to be lost, you<br />
can use NETNODE_UPDATE.COM to restore it. You can also read the file when<br />
you need to obtain data about individual satellites. Note that you may want to<br />
edit the file occasionally to remove obsolete entries.<br />
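For example, if the satellite entries are lost from the boot server's network database, they can be restored by executing the file from a suitably privileged account (a sketch; the directory depends on whether the file has been moved as described above):<br />

```
$ @SYS$SPECIFIC:[SYSMGR]NETNODE_UPDATE.COM
```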
Example 10–1 shows the contents of the file after satellites EUROPA and<br />
GANYMD have been added to the cluster.<br />
Example 10–1 Sample NETNODE_UPDATE.COM File<br />
$ RUN SYS$SYSTEM:NCP<br />
define node EUROPA address 2.21<br />
define node EUROPA hardware address 08-00-2B-03-51-75<br />
define node EUROPA load assist agent sys$share:niscs_laa.exe<br />
define node EUROPA load assist parameter $1$DJA11:<br />
define node EUROPA tertiary loader sys$system:tertiary_vmb.exe<br />
define node GANYMD address 2.22<br />
define node GANYMD hardware address 08-00-2B-03-58-14<br />
define node GANYMD load assist agent sys$share:niscs_laa.exe<br />
define node GANYMD load assist parameter $1$DJA11:<br />
define node GANYMD tertiary loader sys$system:tertiary_vmb.exe<br />
Reference: See the DECnet–Plus documentation for equivalent NCL command<br />
information.<br />
10.5 Cross-Architecture Satellite Booting<br />
Cross-architecture satellite booting permits VAX boot nodes to provide boot<br />
service to Alpha satellites and Alpha boot nodes to provide boot service to VAX<br />
satellites. For some <strong>OpenVMS</strong> <strong>Cluster</strong> configurations, cross-architecture boot<br />
support can simplify day-to-day system operation and reduce the complexity of<br />
managing <strong>OpenVMS</strong> <strong>Cluster</strong> systems that include both VAX and Alpha systems.<br />
Note: Compaq will continue to provide cross-architecture boot support while it<br />
is technically feasible. This support may be removed in future releases of the<br />
<strong>OpenVMS</strong> operating system.<br />
10.5.1 Sample Configurations<br />
The sample configurations that follow show how you might configure an<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> to include both Alpha and VAX boot nodes and satellite<br />
nodes. Note that each architecture must include a system disk that is used for<br />
installations and upgrades.<br />
Caution: The <strong>OpenVMS</strong> operating system and layered product installations<br />
and upgrades cannot be performed across architectures. For example, <strong>OpenVMS</strong><br />
Alpha software installations and upgrades must be performed using an Alpha<br />
system. When configuring <strong>OpenVMS</strong> <strong>Cluster</strong> systems that use the<br />
cross-architecture booting feature, configure at least one system of each architecture<br />
with a disk that can be used for installations and upgrades. In the configurations<br />
shown in Figure 10–1 and Figure 10–2, one of the workstations has been<br />
configured with a local disk for this purpose.<br />
In Figure 10–1, several Alpha workstations have been added to an existing<br />
VAXcluster configuration that contains two VAX boot nodes based on the DSSI<br />
interconnect and several VAX workstations. For high availability, the Alpha<br />
system disk is located on the DSSI for access by multiple boot servers.<br />
Figure 10–1 VAX Nodes Boot Alpha Satellites<br />
[Diagram ZK−6975A−GE: two VAX boot nodes on a DSSI, with a VAX system disk, an Alpha system disk, and an Alpha software installation disk on the DSSI; Alpha satellites and VAX satellites connected by the LAN.]<br />
In Figure 10–2, the configuration originally consisted of a VAX boot node and<br />
several VAX workstations. The VAX boot node has been replaced with a new,<br />
high-performance Alpha boot node. Some Alpha workstations have also been<br />
added. The original VAX workstations remain in the configuration and still<br />
require boot service. The new Alpha boot node can perform this service.<br />
Figure 10–2 Alpha and VAX Nodes Boot Alpha and VAX Satellites<br />
[Diagram ZK−6974A−GE: an Alpha boot node on a DSSI with an Alpha system disk, a VAX system disk, and a VAX software installation disk; Alpha satellites and VAX satellites connected by the LAN.]<br />
10.5.2 Usage Notes<br />
Consider the following guidelines when using the cross-architecture booting<br />
feature:<br />
• The <strong>OpenVMS</strong> software installation and upgrade procedures are architecture<br />
specific. The operating system must be installed and upgraded on a disk<br />
that is directly accessible from a system of the appropriate architecture.<br />
Configuring a boot server with a system disk of the opposite architecture<br />
involves three distinct system management procedures:<br />
Installation of the operating system on a disk that is directly accessible<br />
from a system of the same architecture.<br />
Moving the resulting system disk so that it is accessible by the target boot<br />
server. Depending on the specific configuration, this can be done using<br />
the Backup utility or by physically relocating the disk.<br />
Setting up the boot server’s network database to service satellite boot<br />
requests. Sample procedures for performing this step are included in<br />
Section 10.5.3.<br />
• System disks can contain only a single version of the <strong>OpenVMS</strong> operating<br />
system and are architecture specific. For example, <strong>OpenVMS</strong> VAX Version 7.1<br />
cannot coexist on a system disk with <strong>OpenVMS</strong> Alpha Version 7.1.<br />
• The CLUSTER_CONFIG command procedure can be used only to manage<br />
cluster nodes of the same architecture as the node executing the procedure.<br />
For example, when run from an Alpha system, CLUSTER_CONFIG can<br />
manipulate only Alpha system disks and perform node management<br />
procedures for Alpha systems.<br />
• No support is provided for cross-architecture installation of layered products.<br />
10.5.3 Configuring DECnet<br />
The following examples show how to configure DECnet databases to perform<br />
cross-architecture booting. Note that this feature is available for systems running<br />
DECnet for <strong>OpenVMS</strong> (Phase IV) only.<br />
Customize the command procedures in Examples 10–2 and 10–3 according to the<br />
following instructions.<br />
Replace... With...<br />
alpha_system_disk or vax_system_disk: The appropriate disk name on the server<br />
label: The appropriate label name for the disk on the server<br />
ccc-n: The server circuit name<br />
alpha or vax: The DECnet node name of the satellite<br />
xx.yyyy: The DECnet area.address of the satellite<br />
aa-bb-cc-dd-ee-ff: The hardware address of the LAN adapter on the satellite over which the satellite is to be loaded<br />
satellite_root: The root on the system disk (for example, SYS10) of the satellite<br />
Example 10–2 shows how to set up a VAX system to serve a locally mounted<br />
Alpha system disk.<br />
Example 10–2 Defining an Alpha Satellite in a VAX Boot Node<br />
$! VAX system to load Alpha satellite<br />
$!<br />
$! On the VAX system:<br />
$! -----------------<br />
$!<br />
$! Mount the system disk for MOP server access.<br />
$!<br />
$ MOUNT /SYSTEM alpha_system_disk: label ALPHA$SYSD<br />
$!<br />
$! Enable MOP service for this server.<br />
$!<br />
$ MCR NCP<br />
NCP> DEFINE CIRCUIT ccc-n SERVICE ENABLED STATE ON<br />
NCP> SET CIRCUIT ccc-n STATE OFF<br />
NCP> SET CIRCUIT ccc-n ALL<br />
NCP> EXIT<br />
$!<br />
$! Configure MOP service for the ALPHA satellite.<br />
$!<br />
$ MCR NCP<br />
NCP> DEFINE NODE alpha ADDRESS xx.yyyy<br />
NCP> DEFINE NODE alpha HARDWARE ADDRESS aa-bb-cc-dd-ee-ff<br />
NCP> DEFINE NODE alpha LOAD ASSIST AGENT SYS$SHARE:NISCS_LAA.EXE<br />
NCP> DEFINE NODE alpha LOAD ASSIST PARAMETER ALPHA$SYSD:[satellite_root.]<br />
NCP> DEFINE NODE alpha LOAD FILE APB.EXE<br />
NCP> SET NODE alpha ALL<br />
NCP> EXIT<br />
Example 10–3 shows how to set up an Alpha system to serve a locally mounted<br />
VAX system disk.<br />
Example 10–3 Defining a VAX Satellite in an Alpha Boot Node<br />
$! Alpha system to load VAX satellite<br />
$!<br />
$! On the Alpha system:<br />
$! --------------------<br />
$!<br />
$! Mount the system disk for MOP server access.<br />
$!<br />
$ MOUNT /SYSTEM vax_system_disk: label VAX$SYSD<br />
$!<br />
$! Enable MOP service for this server.<br />
$!<br />
$ MCR NCP<br />
NCP> DEFINE CIRCUIT ccc-n SERVICE ENABLED STATE ON<br />
NCP> SET CIRCUIT ccc-n STATE OFF<br />
NCP> SET CIRCUIT ccc-n ALL<br />
NCP> EXIT<br />
$!<br />
$! Configure MOP service for the VAX satellite.<br />
$!<br />
$ MCR NCP<br />
NCP> DEFINE NODE vax ADDRESS xx.yyyy<br />
NCP> DEFINE NODE vax HARDWARE ADDRESS aa-bb-cc-dd-ee-ff<br />
NCP> DEFINE NODE vax TERTIARY LOADER SYS$SYSTEM:TERTIARY_VMB.EXE<br />
NCP> DEFINE NODE vax LOAD ASSIST AGENT SYS$SHARE:NISCS_LAA.EXE<br />
NCP> DEFINE NODE vax LOAD ASSIST PARAMETER VAX$SYSD:[satellite_root.]<br />
NCP> SET NODE vax ALL<br />
NCP> EXIT<br />
Then, to boot the satellite, perform these steps:<br />
1. Execute the appropriate command procedure from a privileged account on the<br />
server.<br />
2. Boot the satellite over the adapter represented by the hardware address you<br />
entered into the command procedure earlier.<br />
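Console boot syntax is platform specific. Purely as an illustration, on many Alpha consoles a satellite is booted over a particular Ethernet adapter with a command such as the following (the adapter name ESA0 is an example, not a fixed value):<br />

```
>>> B ESA0
```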
10.6 Controlling OPCOM Messages<br />
When a satellite joins the cluster, the Operator Communications Manager<br />
(OPCOM) has the following default states:<br />
• For all systems in an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration except workstations:<br />
OPA0: is enabled for all message classes.<br />
The log file SYS$MANAGER:OPERATOR.LOG is opened for all classes.<br />
• For workstations in an <strong>OpenVMS</strong> <strong>Cluster</strong> configuration, even though the<br />
OPCOM process is running:<br />
OPA0: is not enabled.<br />
No log file is opened.<br />
10.6.1 Overriding OPCOM Defaults<br />
Table 10–3 shows how to define the following system logical names in the<br />
command procedure SYS$MANAGER:SYLOGICALS.COM to override the<br />
OPCOM default states.<br />
Table 10–3 OPCOM System Logical Names<br />
System Logical Name Function<br />
OPC$OPA0_ENABLE If defined to be true, OPA0: is enabled as an operator console. If defined to be<br />
false, OPA0: is not enabled as an operator console. DCL considers any string<br />
beginning with T or Y or any odd integer to be true; all other values are false.<br />
OPC$OPA0_CLASSES Defines the operator classes to be enabled on OPA0:. The logical name can be<br />
a search list of the allowed classes, a list of classes, or a combination of the<br />
two. For example:<br />
$ DEFINE/SYSTEM OPC$OPA0_CLASSES CENTRAL,DISKS,TAPE<br />
$ DEFINE/SYSTEM OPC$OPA0_CLASSES "CENTRAL,DISKS,TAPE"<br />
$ DEFINE/SYSTEM OPC$OPA0_CLASSES "CENTRAL,DISKS",TAPE<br />
You can define OPC$OPA0_CLASSES even if OPC$OPA0_ENABLE is not<br />
defined. In this case, the classes are used for any operator consoles that are<br />
enabled, but the default is used to determine whether to enable the operator<br />
console.<br />
OPC$LOGFILE_ENABLE If defined to be true, an operator log file is opened. If defined to be false, no<br />
log file is opened.<br />
OPC$LOGFILE_CLASSES Defines the operator classes to be enabled for the log file. The logical name<br />
can be a search list of the allowed classes, a comma-separated list, or a<br />
combination of the two. You can define this system logical even when the<br />
OPC$LOGFILE_ENABLE system logical is not defined. In this case, the<br />
classes are used for any log files that are open, but the default is used to<br />
determine whether to open the log file.<br />
OPC$LOGFILE_NAME Supplies information that is used in conjunction with the default name<br />
SYS$MANAGER:OPERATOR.LOG to define the name of the log file. If the<br />
log file is directed to a disk other than the system disk, you should include<br />
commands to mount that disk in the SYLOGICALS.COM command procedure.<br />
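For example, a workstation's SYLOGICALS.COM might define these logicals to enable the console and log file (a sketch; the class lists shown are site choices, not required values):<br />

```
$ DEFINE/SYSTEM OPC$OPA0_ENABLE TRUE
$ DEFINE/SYSTEM OPC$OPA0_CLASSES "CENTRAL,SECURITY,DISKS"
$ DEFINE/SYSTEM OPC$LOGFILE_ENABLE TRUE
$ DEFINE/SYSTEM OPC$LOGFILE_CLASSES "CENTRAL,SECURITY"
$ DEFINE/SYSTEM OPC$LOGFILE_NAME SYS$MANAGER:OPERATOR.LOG
```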
10.6.2 Example<br />
The following example shows how to use the OPC$OPA0_CLASSES system<br />
logical to define the operator classes to be enabled. The following command<br />
prevents SECURITY class messages from being displayed on OPA0:.<br />
$ DEFINE/SYSTEM OPC$OPA0_CLASSES CENTRAL,PRINTER,TAPES,DISKS,DEVICES, -<br />
_$ CARDS,NETWORK,CLUSTER,LICENSE,OPER1,OPER2,OPER3,OPER4,OPER5, -<br />
_$ OPER6,OPER7,OPER8,OPER9,OPER10,OPER11,OPER12<br />
In large clusters, state transitions (computers joining or leaving the cluster)<br />
generate many multiline OPCOM messages on a boot server’s console<br />
device. You can avoid such messages by including the DCL command<br />
REPLY/DISABLE=CLUSTER in the appropriate site-specific startup command<br />
file or by entering the command interactively from the system manager’s<br />
account.<br />
10.7 Shutting Down a <strong>Cluster</strong><br />
The SHUTDOWN command of the SYSMAN utility provides five options for<br />
shutting down <strong>OpenVMS</strong> <strong>Cluster</strong> computers:<br />
• NONE (the default)<br />
• REMOVE_NODE<br />
• CLUSTER_SHUTDOWN<br />
• REBOOT_CHECK<br />
• SAVE_FEEDBACK<br />
These options are described in the following sections.<br />
10.7.1 The NONE Option<br />
If you select the default SHUTDOWN option NONE, the shutdown procedure<br />
performs the normal operations for shutting down a standalone computer. If you<br />
want to shut down a computer that you expect will rejoin the cluster shortly, you<br />
can specify the default option NONE. In that case, cluster quorum is not adjusted<br />
because the operating system assumes that the computer will soon rejoin the<br />
cluster.<br />
In response to the ‘‘Shutdown options [NONE]:’’ prompt, you can specify the<br />
DISABLE_AUTOSTART=n option, where n is the number of minutes before<br />
autostart queues are disabled in the shutdown sequence. For more information<br />
about this option, see Section 7.13.<br />
10.7.2 The REMOVE_NODE Option<br />
If you want to shut down a computer that you expect will not rejoin the cluster for<br />
an extended period, use the REMOVE_NODE option. For example, a computer<br />
may be waiting for new hardware, or you may decide that you want to use a<br />
computer for standalone operation indefinitely.<br />
When you use the REMOVE_NODE option, the active quorum in the remainder of<br />
the cluster is adjusted downward to reflect the fact that the removed computer’s<br />
votes no longer contribute to the quorum value. The shutdown procedure<br />
readjusts the quorum by issuing the SET CLUSTER/EXPECTED_VOTES<br />
command, which is subject to the usual constraints described in Section 10.12.<br />
Note: The system manager is still responsible for changing the EXPECTED_<br />
VOTES system parameter on the remaining <strong>OpenVMS</strong> <strong>Cluster</strong> computers to<br />
reflect the new configuration.<br />
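The quorum that the connection manager derives from EXPECTED_VOTES is (EXPECTED_VOTES + 2) / 2, with the result truncated to an integer. The following Python sketch (illustration only, not part of OpenVMS) shows why EXPECTED_VOTES should be lowered after a voting node is removed:<br />

```python
def quorum(expected_votes):
    """Cluster quorum derived from EXPECTED_VOTES (integer truncation)."""
    return (expected_votes + 2) // 2

# Three voting nodes, one vote each: quorum is 2, so the cluster
# survives the loss of any single voting node.
print(quorum(3))   # 2

# After removing one voting node, EXPECTED_VOTES should be reduced
# to 2; quorum stays at 2, so both remaining nodes are required.
print(quorum(2))   # 2

# A five-vote cluster needs a majority of 3.
print(quorum(5))   # 3
```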
10.7.3 The CLUSTER_SHUTDOWN Option<br />
When you choose the CLUSTER_SHUTDOWN option, the computer completes<br />
all shutdown activities up to the point where the computer would leave the<br />
cluster in a normal shutdown situation. At this point the computer waits until<br />
all other nodes in the cluster have reached the same point. When all nodes<br />
have completed their shutdown activities, the entire cluster dissolves in one<br />
synchronized operation. The advantage of this is that individual nodes do not<br />
complete shutdown independently, and thus do not trigger state transitions or<br />
potentially leave the cluster without quorum.<br />
When performing a CLUSTER_SHUTDOWN, you must specify this option on<br />
every <strong>OpenVMS</strong> <strong>Cluster</strong> computer. If any computer is not included, clusterwide<br />
shutdown cannot occur.<br />
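For this option to take effect clusterwide, each computer's shutdown dialogue must name it. A sketch of the relevant part of the exchange (surrounding prompts vary by version):<br />

```
$ @SYS$SYSTEM:SHUTDOWN
...
Shutdown options [NONE]: CLUSTER_SHUTDOWN
```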
10.7.4 The REBOOT_CHECK Option<br />
When you choose the REBOOT_CHECK option, the shutdown procedure checks<br />
for the existence of basic system files that are needed to reboot the computer<br />
successfully and notifies you if any files are missing. You should replace such files<br />
before proceeding. If all files are present, the following informational message<br />
appears:<br />
%SHUTDOWN-I-CHECKOK, Basic reboot consistency check completed.<br />
Note: You can use the REBOOT_CHECK option separately or in conjunction<br />
with either the REMOVE_NODE or the CLUSTER_SHUTDOWN option. If you<br />
choose REBOOT_CHECK with one of the other options, you must specify the<br />
options in the form of a comma-separated list.<br />
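For example, to remove a node and verify its reboot files in the same pass, the response at the options prompt is a comma-separated list:<br />

```
Shutdown options [NONE]: REMOVE_NODE,REBOOT_CHECK
```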
10.7.5 The SAVE_FEEDBACK Option<br />
Use the SAVE_FEEDBACK option to enable the AUTOGEN feedback operation.<br />
Note: Select this option only when a computer has been running long enough to<br />
reflect your typical work load.<br />
Reference: For detailed information about AUTOGEN feedback, see the<br />
<strong>OpenVMS</strong> System Manager’s Manual.<br />
10.8 Dump Files<br />
Whether your <strong>OpenVMS</strong> <strong>Cluster</strong> system uses a single common system disk or<br />
multiple system disks, you should plan a strategy to manage dump files.<br />
10.8.1 Controlling Size and Creation<br />
Dump-file management is especially important for large clusters with a single<br />
system disk. For example, on a 256 MB <strong>OpenVMS</strong> Alpha computer, AUTOGEN<br />
creates a dump file in excess of 500,000 blocks.<br />
In the event of a software-detected system failure, each computer normally writes<br />
the contents of memory to a full dump file on its system disk for analysis. By<br />
default, this full dump file is the size of physical memory plus a small number<br />
of pages. If system disk space is limited (as is probably the case if a single<br />
system disk is used for a large cluster), you may want to specify that no dump<br />
file be created for satellites or that AUTOGEN create a selective dump file. The<br />
selective dump file is typically 30% to 60% of the size of a full dump file.<br />
You can control dump-file size and creation for each computer by specifying<br />
appropriate values for the AUTOGEN symbols DUMPSTYLE and DUMPFILE<br />
in the computer’s MODPARAMS.DAT file. Specify dump files as shown in<br />
Table 10–4.<br />
Table 10–4 AUTOGEN Dump-File Symbols<br />
Value Specified Result<br />
DUMPSTYLE = 0 Full dump file created (default)<br />
DUMPSTYLE = 1 Selective dump file created<br />
DUMPFILE = 0 No dump file created<br />
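As an illustration, a satellite's MODPARAMS.DAT might carry the following lines (a sketch using the symbols from Table 10–4):<br />

```
! MODPARAMS.DAT fragment for a satellite with limited system disk space
DUMPSTYLE = 1       ! create a selective rather than a full dump file
```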
Caution: Although you can configure computers without dump files, the lack of a<br />
dump file can make it difficult or impossible to determine the cause of a system<br />
failure.<br />
For example, use the following commands to modify the system dump-file size on<br />
large-memory systems:<br />
$ MCR SYSGEN<br />
SYSGEN> USE CURRENT<br />
SYSGEN> SET DUMPSTYLE 1<br />
SYSGEN> CREATE SYS$SYSTEM:SYSDUMP.DMP/SIZE=70000<br />
SYSGEN> WRITE CURRENT<br />
SYSGEN> EXIT<br />
$ @SHUTDOWN<br />
The dump-file size of 70,000 blocks is sufficient to cover about 32 MB of memory.<br />
This size is usually large enough to encompass the information needed to analyze<br />
a system failure.<br />
After the system reboots, you can purge SYSDUMP.DMP.<br />
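The arithmetic behind these sizes is simple: an OpenVMS disk block holds 512 bytes, so each megabyte of memory needs 2048 blocks, plus a small header overhead. A Python sketch of the calculation (illustration only; AUTOGEN's exact overhead varies):<br />

```python
BLOCK_SIZE = 512                           # bytes per OpenVMS disk block
BLOCKS_PER_MB = 1024 * 1024 // BLOCK_SIZE  # 2048 blocks per megabyte

def full_dump_blocks(memory_mb):
    """Minimum blocks needed to hold a full dump of memory_mb megabytes."""
    return memory_mb * BLOCKS_PER_MB

# 32 MB of memory fits in 65,536 blocks, so the 70,000-block file
# created above leaves room for headers and a margin.
print(full_dump_blocks(32))    # 65536

# A 256 MB Alpha system needs over 500,000 blocks for a full dump,
# matching the AUTOGEN figure quoted in Section 10.8.1.
print(full_dump_blocks(256))   # 524288
```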
10.8.2 Sharing Dump Files<br />
Another option for saving dump-file space is to share a single dump file among<br />
multiple computers. This technique makes it possible to analyze isolated<br />
computer failures. But dumps are lost if multiple computers fail at the same<br />
time or if a second computer fails before you can analyze the first failure.<br />
Because boot server failures have a greater impact on cluster operation than do<br />
failures of other computers, you should configure full dump files on boot servers to<br />
help ensure speedy analysis of problems.<br />
VAX systems cannot share dump files with Alpha computers and vice versa.<br />
However, you can share a single dump file among multiple Alpha computers and<br />
another single dump file among VAX computers. Follow these steps for each<br />
operating system:<br />
Step Action<br />
1 Decide whether to use full or selective dump files.<br />
2 Determine the size of the largest dump file needed by any satellite.<br />
3 Select a satellite whose memory configuration is the largest of any in the cluster and do the<br />
following:<br />
1. Specify DUMPSTYLE = 0 (or DUMPSTYLE = 1) in that satellite’s MODPARAMS.DAT<br />
file.<br />
2. Remove any DUMPFILE symbol from the satellite’s MODPARAMS.DAT file.<br />
3. Run AUTOGEN on that satellite to create a dump file.<br />
4 Rename the dump file to SYS$COMMON:[SYSEXE]SYSDUMP-COMMON.DMP or create a<br />
new dump file named SYSDUMP-COMMON.DMP in SYS$COMMON:[SYSEXE].<br />
5 For each satellite that is to share the dump file, do the following:<br />
1. Create a file synonym entry for the dump file in the system-specific root. For example,<br />
to create a synonym for the satellite using root SYS1E, enter a command like the<br />
following:<br />
$ SET FILE SYS$COMMON:[SYSEXE]SYSDUMP-COMMON.DMP -<br />
_$ /ENTER=SYS$SYSDEVICE:[SYS1E.SYSEXE]SYSDUMP.DMP<br />
2. Add the following lines to the satellite’s MODPARAMS.DAT file:<br />
DUMPFILE = 0<br />
DUMPSTYLE = 0 (or DUMPSTYLE = 1)<br />
6 Rename the old system-specific dump file on each system that has its own dump file:<br />
$ RENAME SYS$SYSDEVICE:[SYSn.SYSEXE]SYSDUMP.DMP .OLD<br />
The value of n in the command line is the root for each system (for example, SYS0 or SYS1).<br />
Rename the file so that the operating system software does not use it as the dump file when<br />
the system is rebooted.<br />
7 Reboot each node so it can map to the new common dump file. The operating system<br />
software cannot use the new file for a crash dump until you reboot the system.<br />
8 After you reboot, delete the SYSDUMP.OLD file in each system-specific root. Do not delete<br />
any file called SYSDUMP.DMP; instead, rename it, reboot, and then delete it as described in<br />
steps 6 and 7.<br />
10.9 Maintaining the Integrity of <strong>OpenVMS</strong> <strong>Cluster</strong> Membership<br />
Because multiple LAN and mixed-interconnect clusters can coexist on a single<br />
extended LAN, the operating system provides mechanisms to ensure the integrity<br />
of individual clusters and to prevent access to a cluster by an unauthorized<br />
computer.<br />
The following mechanisms are designed to ensure the integrity of the cluster:<br />
• A cluster authorization file (SYS$COMMON:[SYSEXE]CLUSTER_<br />
AUTHORIZE.DAT), which is initialized during installation of the operating<br />
system or during execution of the CLUSTER_CONFIG.COM CHANGE<br />
function. The file is maintained with the SYSMAN utility.<br />
• Control of conversational bootstrap operations on satellites.<br />
The purpose of the cluster group number and password is to prevent accidental<br />
access to the cluster by an unauthorized computer. Under normal conditions, the<br />
system manager specifies the cluster group number and password either during<br />
installation or when you run CLUSTER_CONFIG.COM (see Example 8–11) to<br />
convert a standalone computer to run in an <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> systems use these mechanisms to protect the integrity of<br />
the cluster in order to prevent problems that could otherwise occur under<br />
circumstances like the following:<br />
• When setting up a new cluster, the system manager specifies a group number<br />
identical to that of an existing cluster on the same Ethernet.<br />
• A satellite user with access to a local system disk tries to join a cluster by<br />
executing a conversational SYSBOOT operation at the satellite’s console.<br />
Reference: These mechanisms are discussed in Section 10.9.1 and Section 8.2.1,<br />
respectively.<br />
10.9.1 <strong>Cluster</strong> Group Data<br />
The cluster authorization file, SYS$COMMON:[SYSEXE]CLUSTER_<br />
AUTHORIZE.DAT, contains the cluster group number and (in scrambled form)<br />
the cluster password. The CLUSTER_AUTHORIZE.DAT file is accessible only to<br />
users with the SYSPRV privilege.<br />
Under normal conditions, you need not alter records in the CLUSTER_<br />
AUTHORIZE.DAT file interactively. However, if you suspect a security breach,<br />
you may want to change the cluster password. In that case, you use the SYSMAN<br />
utility to make the change.<br />
To change the cluster password, follow these instructions:<br />
Step Action<br />
1 Log in as system manager on a boot server.<br />
2 Invoke the SYSMAN utility.<br />
3 Enter the following command:<br />
$ RUN SYS$SYSTEM:SYSMAN<br />
SYSMAN><br />
4 At the SYSMAN> prompt, enter any of the CONFIGURATION commands in the following<br />
list.<br />
• CONFIGURATION SET CLUSTER_AUTHORIZATION<br />
Updates the cluster authorization file, CLUSTER_AUTHORIZE.DAT, in the directory<br />
SYS$COMMON:[SYSEXE]. (The SET command creates this file if it does not already<br />
exist.) You can include the following qualifiers on this command:<br />
/GROUP_NUMBER—Specifies a cluster group number. Group number must be in<br />
the range from 1 to 4095 or 61440 to 65535.<br />
/PASSWORD—Specifies a cluster password. Password may be from 1 to 31<br />
characters in length and may include alphanumeric characters, dollar signs ( $ ),<br />
and underscores ( _ ).<br />
• CONFIGURATION SHOW CLUSTER_AUTHORIZATION<br />
Displays the cluster group number.<br />
• HELP CONFIGURATION SET CLUSTER_AUTHORIZATION<br />
Explains the command’s functions.<br />
5 If your configuration has multiple system disks, each disk must have a copy of CLUSTER_<br />
AUTHORIZE.DAT. You must run the SYSMAN utility to update all copies.<br />
Caution: If you change either the group number or the password, you must reboot the entire cluster.<br />
For instructions, see Section 8.6.<br />
10.9.2 Example<br />
Example 10–4 illustrates the use of the SYSMAN utility to change the cluster<br />
password.<br />
Example 10–4 Sample SYSMAN Session to Change the <strong>Cluster</strong> Password<br />
$ RUN SYS$SYSTEM:SYSMAN<br />
SYSMAN> SET ENVIRONMENT/CLUSTER<br />
%SYSMAN-I-ENV, current command environment:<br />
<strong>Cluster</strong>wide on local cluster<br />
Username SYSTEM will be used on nonlocal nodes<br />
SYSMAN> SET PROFILE/PRIVILEGES=SYSPRV<br />
SYSMAN> CONFIGURATION SET CLUSTER_AUTHORIZATION/PASSWORD=NEWPASSWORD<br />
%SYSMAN-I-CAFOLDGROUP, existing group will not be changed<br />
%SYSMAN-I-CAFREBOOT, cluster authorization file updated<br />
The entire cluster should be rebooted.<br />
SYSMAN> EXIT<br />
$<br />
10.10 Adjusting Maximum Packet Size for LAN Configurations<br />
You can adjust the maximum packet size for LAN configurations with the NISCS_<br />
MAX_PKTSZ system parameter.<br />
10.10.1 System Parameter Settings for LANs<br />
Starting with <strong>OpenVMS</strong> Version 7.3, the operating system (PEdriver)<br />
automatically detects the maximum packet size of all the virtual circuits to which<br />
the system is connected. If the maximum packet size of the system’s interconnects<br />
is smaller than the default packet-size setting, PEdriver automatically reduces<br />
the default packet size.<br />
For earlier versions of <strong>OpenVMS</strong> (VAX Version 6.0 to Version 7.2; Alpha Version<br />
1.5 to Version 7.2-1), the NISCS_MAX_PKTSZ parameter should be set to 1498<br />
for Ethernet clusters and to 4468 for FDDI clusters.<br />
10.10.2 How to Use NISCS_MAX_PKTSZ<br />
To obtain this parameter’s current, default, minimum, and maximum values,<br />
issue the following command:<br />
$ MC SYSGEN SHOW NISCS_MAX_PKTSZ<br />
You can use the NISCS_MAX_PKTSZ parameter to reduce packet size, which in<br />
turn can reduce memory consumption. However, reducing packet size can also<br />
increase CPU utilization for block data transfers, because more packets will be<br />
required to transfer a given amount of data. Lock message packets are smaller<br />
than the minimum value, so the NISCS_MAX_PKTSZ setting will not affect<br />
locking performance.<br />
You can also use NISCS_MAX_PKTSZ to force use of a common packet size on all<br />
LAN paths by bounding the packet size to that of the LAN path with the smallest<br />
packet size. Using a common packet size can avoid virtual circuit (VC) closure caused<br />
by a packet-size reduction when the cluster fails over to a slower network with a<br />
smaller packet size.<br />
If a memory-constrained system, such as a workstation, has adapters to a<br />
network path with large-size packets, such as FDDI or Gigabit Ethernet with<br />
jumbo packets, then you may want to conserve memory by reducing the value of<br />
the NISCS_MAX_PKTSZ parameter.<br />
10.10.3 Editing Parameter Files<br />
If you decide to change the value of the NISCS_MAX_PKTSZ parameter, edit the<br />
SYS$SPECIFIC:[SYSEXE]MODPARAMS.DAT file to permit AUTOGEN to factor<br />
the changed packet size into its calculations.<br />
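As a hedged illustration of such an edit (the value 1498 below is only an example, suitable for the Ethernet-only configurations described in Section 10.10.1; substitute a value appropriate for your own interconnects):<br />

```
! SYS$SPECIFIC:[SYSEXE]MODPARAMS.DAT  (illustrative fragment)
! Bound PEdriver packets to the common Ethernet packet size.
NISCS_MAX_PKTSZ = 1498
```

After editing the file, run AUTOGEN so that the changed value is factored into its calculations.<br />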
10.11 Determining Process Quotas<br />
On Alpha systems, process quota default values in SYSUAF.DAT are often higher<br />
than the SYSUAF.DAT defaults on VAX systems. How, then, do you choose<br />
values for processes that could run on Alpha systems or on VAX systems in an<br />
<strong>OpenVMS</strong> <strong>Cluster</strong>? Understanding how a process is assigned quotas when the<br />
process is created in a dual-architecture <strong>OpenVMS</strong> <strong>Cluster</strong> configuration will help<br />
you manage this task.<br />
10.11.1 Quota Values<br />
The quotas to be used by a new process are determined by the <strong>OpenVMS</strong><br />
LOGINOUT software. LOGINOUT works the same on <strong>OpenVMS</strong> Alpha<br />
and <strong>OpenVMS</strong> VAX systems. When a user logs in and a process is started,<br />
LOGINOUT uses the larger of:<br />
• The value of the quota defined in the process’s SYSUAF.DAT record<br />
• The current value of the corresponding PQL_Mquota system parameter on the<br />
host node in the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
Example: LOGINOUT compares the value of the account’s ASTLM process limit<br />
(as defined in the common SYSUAF.DAT) with the value of the PQL_MASTLM<br />
system parameter on the host Alpha system or on the host VAX system in the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
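The larger-of-two rule that LOGINOUT applies can be sketched as follows. This is an illustration only; the comparison actually happens inside LOGINOUT, and the quota values are invented:<br />

```python
# Sketch of the LOGINOUT quota rule: a new process receives the larger of
# the value in its SYSUAF.DAT record and the host node's PQL_Mquota
# system parameter. All values here are hypothetical.

def effective_quota(sysuaf_value, pql_m_value):
    """Return the quota LOGINOUT assigns to a new process."""
    return max(sysuaf_value, pql_m_value)

# ASTLM of 100 in the common SYSUAF.DAT, PQL_MASTLM of 120 on the host:
print(effective_quota(100, 120))  # -> 120
# A SYSUAF.DAT value above the host minimum is kept as is:
print(effective_quota(250, 120))  # -> 250
```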
10.11.2 PQL Parameters<br />
The letter M in PQL_M means minimum. The PQL_Mquota system parameters<br />
set a minimum value for the quotas. In the Current and Default columns of<br />
the following edited SYSMAN display, note how the current value of each PQL_<br />
Mquota parameter exceeds its system-defined default value in most cases. Note<br />
that the following display is Alpha specific. A similar SYSMAN display on a VAX<br />
system would show ‘‘Pages’’ in the Unit column instead of ‘‘Pagelets’’.<br />
SYSMAN> PARAMETER SHOW/PQL<br />
%SYSMAN-I-USEACTNOD, a USE ACTIVE has been defaulted on node DASHER<br />
Node DASHER: Parameters in use: ACTIVE<br />
Parameter Name Current Default Minimum Maximum Unit Dynamic<br />
-------------- ------- ------- ------- ------- ---- -------<br />
PQL_MASTLM 120 4 -1 -1 Ast D<br />
PQL_MBIOLM 100 4 -1 -1 I/O D<br />
PQL_MBYTLM 100000 1024 -1 -1 Bytes D<br />
PQL_MCPULM 0 0 -1 -1 10Ms D<br />
PQL_MDIOLM 100 4 -1 -1 I/O D<br />
PQL_MFILLM 100 2 -1 -1 Files D<br />
PQL_MPGFLQUOTA 65536 2048 -1 -1 Pagelets D<br />
PQL_MPRCLM 10 0 -1 -1 Processes D<br />
PQL_MTQELM 0 0 -1 -1 Timers D<br />
PQL_MWSDEFAULT 2000 2000 -1 -1 Pagelets<br />
PQL_MWSQUOTA 4000 4000 -1 -1 Pagelets D<br />
PQL_MWSEXTENT 8192 4000 -1 -1 Pagelets D<br />
PQL_MENQLM 300 4 -1 -1 Locks D<br />
PQL_MJTQUOTA 0 0 -1 -1 Bytes D<br />
10.11.3 Examples<br />
In this display, the values for many PQL_Mquota parameters increased from<br />
the defaults to their current values. Typically, this happens over time when<br />
AUTOGEN feedback is run periodically on your system. The PQL_Mquota values<br />
also can change, of course, when you modify the values in MODPARAMS.DAT or<br />
in SYSMAN. As you consider the use of a common SYSUAF.DAT in an <strong>OpenVMS</strong><br />
<strong>Cluster</strong> with both VAX and Alpha computers, keep the dynamic nature of the<br />
PQL_Mquota parameters in mind.<br />
The following table summarizes common SYSUAF.DAT scenarios and probable<br />
results on VAX and Alpha computers in an <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
Table 10–5 Common SYSUAF.DAT Scenarios and Probable Results<br />
WHEN you set values at... THEN a process that starts on... Will result in...<br />
Alpha level An Alpha node Execution with the values you deemed appropriate.<br />
Alpha level A VAX node LOGINOUT not using the system-specific PQL_Mquota<br />
values defined on the VAX system because LOGINOUT<br />
finds higher values for each quota in the Alpha-style<br />
SYSUAF.DAT. This could cause VAX processes in the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> to use inappropriately high resources.<br />
VAX level A VAX node Execution with the values you deemed appropriate.<br />
VAX level An Alpha node LOGINOUT ignoring the typically lower VAX-level<br />
values in the SYSUAF and instead using the current<br />
PQL_Mquota value for each quota on the Alpha<br />
system. Monitor the current values of the PQL_Mquota<br />
system parameters if you choose to try this approach.<br />
Increase as necessary the appropriate PQL_Mquota<br />
values on the Alpha system in MODPARAMS.DAT.<br />
You might decide to experiment with the higher process-quota values that usually<br />
are associated with an <strong>OpenVMS</strong> Alpha system’s SYSUAF.DAT as you determine<br />
values for a common SYSUAF.DAT in an <strong>OpenVMS</strong> <strong>Cluster</strong> environment. The<br />
higher Alpha-level process quotas might be appropriate for processes created on<br />
host VAX nodes in the <strong>OpenVMS</strong> <strong>Cluster</strong> if the VAX systems have large available<br />
memory resources.<br />
You can determine the values that are appropriate for processes on your VAX and<br />
Alpha systems by experimentation and modification over time. Factors in your<br />
decisions about appropriate limit and quota values for each process will include<br />
the following:<br />
• Amount of available memory<br />
• CPU processing power<br />
• Average work load of the applications<br />
• Peak work loads of the applications<br />
10.12 Restoring <strong>Cluster</strong> Quorum<br />
During the life of an <strong>OpenVMS</strong> <strong>Cluster</strong> system, computers join and leave the<br />
cluster. For example, you may need to add more computers to the cluster to<br />
extend the cluster’s processing capabilities, or a computer may shut down because<br />
of a hardware or fatal software error. The connection management software<br />
coordinates these cluster transitions and controls cluster operation.<br />
When a computer shuts down, the remaining computers, with the help of the<br />
connection manager, reconfigure the cluster, excluding the computer that shut<br />
down. The cluster can survive the failure of the computer and continue process<br />
operations as long as the cluster votes total is greater than the cluster quorum<br />
value. If the cluster votes total falls below the cluster quorum value, the cluster<br />
suspends the execution of all processes.<br />
10.12.1 Restoring Votes<br />
For process execution to resume, the cluster votes total must be restored to a<br />
value greater than or equal to the cluster quorum value. Often, the required<br />
votes are added as computers join or rejoin the cluster. However, waiting for a<br />
computer to join the cluster and increasing the votes value is not always a simple<br />
or convenient remedy. An alternative solution, for example, might be to shut<br />
down and reboot all the computers with a reduced quorum value.<br />
After the failure of a computer, you may want to run the Show <strong>Cluster</strong> utility<br />
and examine values for the VOTES, EXPECTED_VOTES, CL_VOTES, and<br />
CL_QUORUM fields. (See the <strong>OpenVMS</strong> System Management Utilities Reference<br />
Manual for a complete description of these fields.) The VOTES and EXPECTED_<br />
VOTES fields show the settings for each cluster member; the CL_VOTES and<br />
CL_QUORUM fields show the cluster votes total and the current cluster quorum<br />
value.<br />
To examine these values, enter the following commands:<br />
$ SHOW CLUSTER/CONTINUOUS<br />
COMMAND> ADD CLUSTER<br />
Note: If you want to enter SHOW CLUSTER commands interactively, you must<br />
specify the /CONTINUOUS qualifier as part of the SHOW CLUSTER command<br />
string. If you do not specify this qualifier, SHOW CLUSTER produces a single<br />
static display of cluster status information and returns you to the DCL command<br />
level.<br />
If the display from the Show <strong>Cluster</strong> utility shows the CL_VOTES value equal to<br />
the CL_QUORUM value, the cluster cannot survive the failure of any remaining<br />
voting member. If one of these computers shuts down, all process activity in the<br />
cluster stops.<br />
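The quorum arithmetic described in this section can be sketched as follows. This is a simplified illustration of the connection manager's rule, not actual OpenVMS code, and the vote counts are invented:<br />

```python
# Quorum is derived from EXPECTED_VOTES as (EXPECTED_VOTES + 2) / 2,
# using integer division. The cluster continues processing only while
# the cluster votes total is at least the quorum value.

def quorum(expected_votes):
    return (expected_votes + 2) // 2

def processing_continues(votes_total, quorum_value):
    return votes_total >= quorum_value

q = quorum(3)                      # three expected votes -> quorum of 2
print(q)                           # -> 2
print(processing_continues(3, q))  # all voters present -> True
print(processing_continues(1, q))  # votes fell below quorum -> False
```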
10.12.2 Reducing <strong>Cluster</strong> Quorum Value<br />
To prevent the disruption of cluster process activity, you can reduce the cluster<br />
quorum value as described in Table 10–6.<br />
Table 10–6 Reducing the Value of <strong>Cluster</strong> Quorum<br />
Technique Description<br />
Use the DCL command SET CLUSTER/EXPECTED_VOTES to adjust the cluster quorum to a<br />
value you specify. If you do not specify a value, the operating system calculates an appropriate<br />
value for you. You need to enter the command on only one computer to propagate the new value<br />
throughout the cluster. When you enter the command, the operating system reports the new value.<br />
Suggestion: Normally, you use the SET CLUSTER/EXPECTED_VOTES command only<br />
after a computer has left the cluster for an extended period. (For more information about<br />
this command, see the <strong>OpenVMS</strong> DCL Dictionary.)<br />
Example: If you want to change expected votes to set the cluster quorum to 2,<br />
enter the following command:<br />
$ SET CLUSTER/EXPECTED_VOTES=3<br />
The resulting value for quorum is (3 + 2)/2 = 2.<br />
Note: No matter what value you specify for the SET CLUSTER/EXPECTED_VOTES<br />
command, you cannot increase quorum to a value that is greater than the number of the<br />
votes present, nor can you reduce quorum to a value that is half or fewer of the votes present.<br />
When a computer that previously was a cluster member is ready to rejoin, you must reset<br />
the EXPECTED_VOTES system parameter to its original value in MODPARAMS.DAT on<br />
all computers and then reconfigure the cluster according to the instructions in Section 8.6.<br />
You do not need to use the SET CLUSTER/EXPECTED_VOTES command to increase cluster<br />
quorum, because the quorum value is increased automatically when the computer rejoins the<br />
cluster.<br />
Use the IPC Q command to recalculate the quorum. Refer to the <strong>OpenVMS</strong> System<br />
Manager’s Manual, Volume 1: Essentials for a description of the Q command.<br />
Select one of the cluster-related shutdown options. Refer to Section 10.7 for a description of<br />
the shutdown options.<br />
10.13 <strong>Cluster</strong> Performance<br />
Sometimes performance issues involve monitoring and tuning applications and<br />
the system as a whole. Tuning involves collecting and reporting on system and<br />
network processes to improve performance. A number of tools can help you collect<br />
information about an active system and its applications.<br />
10.13.1 Using the SHOW Commands<br />
The following table briefly describes the SHOW commands available with the<br />
<strong>OpenVMS</strong> operating system. Use the SHOW DEVICE commands and qualifiers<br />
shown in the table.<br />
Command Purpose<br />
SHOW DEVICE/FULL Shows the complete status of a device, including:<br />
• Whether the disk is available to the cluster<br />
• Whether the disk is MSCP served or dual ported<br />
• The name and type (VAX or HSC) of the primary and secondary hosts<br />
• Whether the disk is mounted on the system where you enter the<br />
command<br />
• The systems in the cluster on which the disk is mounted
SHOW DEVICE/FILES Displays a list of the names of all files open on a volume and their<br />
associated process name and process identifier (PID). The command:<br />
• Lists files opened only on this node.<br />
• Finds all open files on a disk. You can use either the SHOW<br />
DEVICE/FILES command or SYSMAN commands on each node<br />
that has the disk mounted.<br />
SHOW DEVICE/SERVED Displays information about disks served by the MSCP server on the node<br />
where you enter the command. Use the following qualifiers to customize<br />
the information:<br />
• /HOST displays the names of processors that have devices online<br />
through the local MSCP server, and the number of devices.<br />
• /RESOURCE displays the resources available to the MSCP server,<br />
total amount of nonpaged dynamic memory available for I/O buffers,<br />
and number of I/O request packets.<br />
• /COUNT displays the number of each size and type of I/O operation<br />
the MSCP server has performed since it was started.<br />
• /ALL displays all of the information listed for the SHOW<br />
DEVICE/SERVED command.<br />
The SHOW CLUSTER command displays a variety of information about the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system. The display output provides a view of the cluster as<br />
seen from a single node, rather than a complete view of the cluster.<br />
Reference: The <strong>OpenVMS</strong> System Management Utilities Reference Manual<br />
contains complete information about all the SHOW commands and the Show<br />
<strong>Cluster</strong> utility.<br />
10.13.2 Using the Monitor Utility<br />
The following table describes using the <strong>OpenVMS</strong> Monitor utility to locate disk<br />
I/O bottlenecks. I/O bottlenecks can cause the <strong>OpenVMS</strong> <strong>Cluster</strong> system to<br />
appear to hang.<br />
Step Action<br />
1 To determine which clusterwide disks may be problem disks:<br />
1. Create a node-by-node summary of disk I/O using the MONITOR/NODE command<br />
2. Adjust the ‘‘row sum’’ column for MSCP served disks as follows:<br />
• I/O rate on serving node includes local requests and all requests from other nodes<br />
• I/O rate on other nodes includes requests generated from that node<br />
• Requests from remote nodes are counted twice in the row sum column<br />
3. Note disks with a row sum of more than 8 I/Os per second<br />
4. Eliminate from the list of cluster problem disks the disks that are:<br />
• Not shared<br />
• Dedicated to an application<br />
• In the process of being backed up<br />
2 For each node, determine the impact of potential problem disks:<br />
• If a disproportionate amount of a disk’s I/O comes from a particular node, the problem<br />
is most likely specific to that node.<br />
• If a disk’s I/O is spread evenly over the cluster, the problem may be clusterwide overuse.<br />
• If the average queue length for a disk on a given node is less than 0.2, then the disk is<br />
having little impact on the node.<br />
3 For each problem disk, determine whether:<br />
• Page and swap files from any node are on the disk.<br />
• Commonly used programs or data files are on the disk (use the SHOW DEVICE/FILES<br />
command).<br />
• Users with default directories on the disk are causing the problem.<br />
10.13.3 Using Compaq Availability Manager and DECamds<br />
Compaq Availability Manager and DECamds are real-time monitoring, diagnostic,<br />
and correction tools used by system managers to improve the availability and<br />
throughput of a system. Availability Manager runs on <strong>OpenVMS</strong> Alpha or on a<br />
Windows node. DECamds runs on both <strong>OpenVMS</strong> VAX and <strong>OpenVMS</strong> Alpha and<br />
uses the DECwindows interface.<br />
These products, which are included with the operating system, help system<br />
managers correct system resource utilization problems for CPU usage, low<br />
memory, lock contention, hung or runaway processes, I/O, disks, page files, and<br />
swap files.<br />
Availability Manager enables you to monitor one or more <strong>OpenVMS</strong> nodes on an<br />
extended LAN from either an <strong>OpenVMS</strong> Alpha or a Windows node. Availability<br />
Manager collects system and process data from multiple <strong>OpenVMS</strong> nodes<br />
simultaneously. It analyzes the data and displays the output using a native Java<br />
GUI.<br />
DECamds collects and analyzes data from multiple nodes (VAX and Alpha)<br />
simultaneously, directing all output to a centralized DECwindows display.<br />
DECamds helps you observe and troubleshoot availability problems, as follows:<br />
• Alerts users to resource availability problems, suggests paths for further<br />
investigation, and recommends actions to improve availability.<br />
• Centralizes management of remote nodes within an extended LAN.<br />
• Allows real-time intervention, including adjustment of node and process<br />
parameters, even when remote nodes are hung.<br />
• Adjusts to site-specific requirements through a wide range of customization<br />
options.<br />
Reference: For more information about Availability Manager, see the<br />
Availability Manager User’s Guide and the Availability Manager web site,<br />
which you can access from the Compaq <strong>OpenVMS</strong> site:<br />
http://www.openvms.compaq.com/openvms/<br />
For more information about DECamds, see the DECamds User’s Guide.<br />
10.13.4 Monitoring LAN Activity<br />
It is important to monitor LAN activity on a regular basis. Using the SCA<br />
(<strong>Systems</strong> Communications Architecture) Control Program (SCACP), you can<br />
monitor LAN activity as well as set and show default ports, start and stop LAN<br />
devices, and assign priority values to channels.<br />
Reference: For more information about SCACP, see the <strong>OpenVMS</strong> System<br />
Management Utilities Reference Manual: A–L.<br />
Using NCP commands like the following, you can set up a convenient monitoring<br />
procedure to report activity for each 12-hour period. Note that DECnet event<br />
logging for event 0.2 (automatic line counters) must be enabled.<br />
Reference: For detailed information on DECnet for <strong>OpenVMS</strong> event logging,<br />
refer to the DECnet for <strong>OpenVMS</strong> Network Management Utilities manual.<br />
In these sample commands, BNA-0 is the line ID of the Ethernet line.<br />
NCP> DEFINE LINE BNA-0 COUNTER TIMER 43200<br />
NCP> SET LINE BNA-0 COUNTER TIMER 43200<br />
At every timer interval (in this case, 12 hours), DECnet will create an event that<br />
sends counter data to the DECnet event log. If you experience a performance<br />
degradation in your cluster, check the event log for increases in counter values<br />
that exceed normal variations for your cluster. If all computers show the same<br />
increase, there may be a general problem with your Ethernet configuration. If, on<br />
the other hand, only one computer shows a deviation from usual values, there is<br />
probably a problem with that computer or with its Ethernet interface device.<br />
The following layered products can be used in conjunction with one of Compaq’s<br />
LAN bridges to monitor the LAN traffic levels: RBMS, DECelms, DECmcc, and<br />
LAN Traffic Monitor (LTM).<br />
A<br />
<strong>Cluster</strong> System Parameters<br />
For systems to boot properly into a cluster, certain system parameters must be<br />
set on each cluster computer. Table A–1 lists system parameters used in cluster<br />
configurations.<br />
A.1 Values for Alpha and VAX Computers<br />
Some system parameters for Alpha computers are in units of pagelets, whereas<br />
others are in pages. AUTOGEN determines the hardware page size and records<br />
it in the PARAMS.DAT file.<br />
Caution: When reviewing AUTOGEN recommended values or when setting<br />
system parameters with SYSGEN, note carefully which units are required for<br />
each parameter.<br />
Table A–1 describes system parameters that are specific to <strong>OpenVMS</strong> <strong>Cluster</strong><br />
configurations that may require adjustment in certain configurations. Table A–2<br />
describes <strong>OpenVMS</strong> <strong>Cluster</strong> specific system parameters that are reserved for<br />
<strong>OpenVMS</strong> use.<br />
Reference: System parameters, including cluster and volume shadowing system<br />
parameters, are documented in the <strong>OpenVMS</strong> System Management Utilities<br />
Reference Manual.<br />
Table A–1 Adjustable <strong>Cluster</strong> System Parameters<br />
Parameter Description<br />
ALLOCLASS Specifies a numeric value from 0 to 255 to be assigned as the disk allocation<br />
class for the computer. The default value is 0.<br />
CHECK_CLUSTER Serves as a VAXCLUSTER parameter sanity check. When CHECK_<br />
CLUSTER is set to 1, SYSBOOT outputs a warning message and forces<br />
a conversational boot if it detects the VAXCLUSTER parameter is set to 0.<br />
CLUSTER_CREDITS Specifies the number of per-connection buffers a node allocates to receiving<br />
VMS$VAXcluster communications.<br />
If the SHOW CLUSTER command displays a high number of credit waits<br />
for the VMS$VAXcluster connection, you might consider increasing the<br />
value of CLUSTER_CREDITS on the other node. However, in large cluster<br />
configurations, setting this value unnecessarily high will consume a large<br />
quantity of nonpaged pool. Each receive buffer is at least SCSMAXMSG<br />
bytes in size but might be substantially larger depending on the underlying<br />
transport.<br />
It is not required that all nodes in the cluster have the same value for<br />
CLUSTER_CREDITS. For small or memory-constrained systems, the default<br />
value of CLUSTER_CREDITS should be adequate.<br />
CWCREPRC_ENABLE Controls whether an unprivileged user can create a process on another<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> node. The default value of 1 allows an unprivileged user<br />
to create a detached process with the same UIC on another node. A value<br />
of 0 requires that a user have DETACH or CMKRNL privilege to create a<br />
process on another node.<br />
DISK_QUORUM The physical device name, in ASCII, of an optional quorum disk. ASCII<br />
spaces indicate that no quorum disk is being used. DISK_QUORUM must<br />
be defined on one or more cluster computers capable of having a direct (not<br />
MSCP served) connection to the disk. These computers are called quorum<br />
disk watchers. The remaining computers (computers with a blank value for<br />
DISK_QUORUM) recognize the name defined by the first watcher computer<br />
with which they communicate.<br />
‡DR_UNIT_BASE Specifies the base value from which unit numbers for DR devices<br />
(StorageWorks RAID Array 200 Family logical RAID drives) are counted.<br />
DR_UNIT_BASE provides a way for unique RAID device numbers to<br />
be generated. DR devices are numbered starting with the value of<br />
DR_UNIT_BASE and then counting from there. For example, setting<br />
DR_UNIT_BASE to 10 will produce device names such as $1$DRA10,<br />
$1$DRA11, and so on. Setting DR_UNIT_BASE to appropriate,<br />
nonoverlapping values on all cluster members that share the same (nonzero)<br />
allocation class will ensure that no two RAID devices are given the same<br />
name.<br />
EXPECTED_VOTES Specifies a setting that is used to derive the initial quorum value. This<br />
setting is the sum of all VOTES held by potential cluster members.<br />
By default, the value is 1. The connection manager sets a quorum value to a<br />
number that will prevent cluster partitioning (see Section 2.3). To calculate<br />
quorum, the system uses the following formula:<br />
estimated quorum = (EXPECTED_VOTES + 2)/2<br />
LOCKDIRWT Lock manager directory system weight. Determines the portion of lock<br />
manager directory to be handled by this system. The default value is<br />
adequate for most systems.<br />
†LRPSIZE For VAX computers running VMS Version 5.5–2 and earlier, the LRPSIZE<br />
parameter specifies the size, in bytes, of the large request packets. The<br />
actual physical memory consumed by a large request packet is LRPSIZE plus<br />
overhead for buffer management. Normally, the default value is adequate.<br />
The value of LRPSIZE affects the transfer size used by VAX nodes on an<br />
FDDI ring.<br />
FDDI supports transfers using large packets (up to 4468 bytes). PEDRIVER<br />
does not use large packets by default, but can take advantage of the larger<br />
packet sizes if you increase the LRPSIZE system parameter to 4474 or<br />
higher.<br />
PEDRIVER uses the full FDDI packet size if the LRPSIZE is set to 4474<br />
or higher. However, only FDDI nodes connected to the same ring use large<br />
packets. Nodes connected to an Ethernet segment restrict packet size to that<br />
of an Ethernet packet (1498 bytes).<br />
†VAX specific<br />
‡Alpha specific<br />
‡MC_SERVICES_P0 (dynamic) Controls whether other MEMORY CHANNEL nodes in the cluster continue<br />
to run if this node bugchecks or shuts down.<br />
A value of 1 causes other nodes in the MEMORY CHANNEL cluster to fail<br />
with bugcheck code MC_FORCED_CRASH if this node bugchecks or shuts<br />
down.<br />
The default value is 0. A setting of 1 is intended only for debugging<br />
purposes; the parameter should otherwise be left at its default state.<br />
‡MC_SERVICES_P2 (static) Specifies whether to load the PMDRIVER (PMA0) MEMORY CHANNEL<br />
cluster port driver. PMDRIVER is a new driver that serves as the MEMORY<br />
CHANNEL cluster port driver. It works together with MCDRIVER (the<br />
MEMORY CHANNEL device driver and device interface) to provide<br />
MEMORY CHANNEL clustering. If PMDRIVER is not loaded, cluster<br />
connections will not be made over the MEMORY CHANNEL interconnect.<br />
The default for MC_SERVICES_P2 is 1. This default value causes<br />
PMDRIVER to be loaded when you boot the system.<br />
Compaq recommends that this value not be changed. This parameter value<br />
must be the same on all nodes connected by MEMORY CHANNEL.<br />
‡MC_SERVICES_P3 (dynamic) Specifies the maximum number of tags supported. The maximum value is<br />
2048 and the minimum value is 100.<br />
The default value is 800. Compaq recommends that this value not be<br />
changed.<br />
This parameter value must be the same on all nodes connected by MEMORY<br />
CHANNEL.<br />
‡MC_SERVICES_P4 (static) Specifies the maximum number of regions supported. The maximum value<br />
is 4096 and the minimum value is 100.<br />
The default value is 200. Compaq recommends that this value not be<br />
changed.<br />
This parameter value must be the same on all nodes connected by MEMORY<br />
CHANNEL.<br />
‡MC_SERVICES_P6 (static) Specifies MEMORY CHANNEL message size, the body of an entry in a free<br />
queue, or a work queue. The maximum value is 65536 and the minimum<br />
value is 544. The default value is 992, which is suitable in all cases except<br />
systems with highly constrained memory.<br />
For such systems, you can reduce the memory consumption of MEMORY<br />
CHANNEL by slightly reducing the default value of 992. This value must<br />
always be equal to or greater than the result of the following calculation:<br />
1. Select the larger of SCS_MAXMSG and SCS_MAXDG.<br />
2. Round that value to the next quadword.<br />
This parameter value must be the same on all nodes connected by MEMORY<br />
CHANNEL.<br />
‡MC_SERVICES_P7 (dynamic) Specifies whether to suppress or display messages about cluster activities on<br />
this node. Can be set to a value of 0, 1, or 2. The meanings of these values<br />
are:<br />
Value Meaning<br />
0 Nonverbose mode—no informational messages will appear on the<br />
console or in the error log.<br />
1 Verbose mode—informational messages from both MCDRIVER<br />
and PMDRIVER will appear on the console and in the error log.<br />
2 Same as verbose mode plus PMDRIVER stalling and recovery<br />
messages.<br />
The default value is 0. Compaq recommends that this value not be changed<br />
except for debugging MEMORY CHANNEL problems or adjusting the MC_<br />
SERVICES_P9 parameter.<br />
‡MC_SERVICES_P9 (static) Specifies the number of initial entries in a single channel’s free queue. The<br />
maximum value is 2048 and the minimum value is 10.<br />
Note that MC_SERVICES_P9 is not a dynamic parameter; you must reboot<br />
the system after each change in order for the change to take effect.<br />
The default value is 150. Compaq recommends that this value not be<br />
changed.<br />
This parameter value must be the same on all nodes connected by MEMORY<br />
CHANNEL.<br />
‡MPDEV_AFB_INTVL<br />
(disks only)<br />
Specifies the automatic failback interval in seconds. The automatic failback<br />
interval is the minimum number of seconds that must elapse before the<br />
system will attempt another failback from an MSCP path to a direct path on<br />
the same device.<br />
MPDEV_POLLER must be set to ON to enable automatic failback. You can<br />
disable automatic failback without disabling the poller by setting MPDEV_<br />
AFB_INTVL to 0. The default is 300 seconds.<br />
‡MPDEV_D1 (disks only) Reserved for use by the operating system.<br />
‡MPDEV_D2 (disks only) Reserved for use by the operating system.<br />
‡MPDEV_D3 (disks only) Reserved for use by the operating system.<br />
‡MPDEV_D4 (disks only) Reserved for use by the operating system.<br />
‡MPDEV_ENABLE Enables the formation of multipath sets when set to ON ( 1 ). When set to<br />
OFF ( 0 ), the formation of additional multipath sets and the addition of new<br />
paths to existing multipath sets is disabled. However, existing multipath<br />
sets remain in effect. The default is ON.<br />
MPDEV_REMOTE and MPDEV_AFB_INTVL have no effect when MPDEV_ENABLE<br />
is set to OFF.<br />
‡MPDEV_LCRETRIES Controls the number of times the system retries the direct paths to the<br />
(disks only)<br />
controller that the logical unit is online to, before moving on to direct paths<br />
to the other controller, or to an MSCP served path to the device. The valid<br />
range for retries is 1 through 256. The default is 1.<br />
‡Alpha specific<br />
‡MPDEV_POLLER Enables polling of the paths to multipath set members when set to ON ( 1 ).<br />
Polling allows early detection of errors on inactive paths. If a path becomes<br />
unavailable or returns to service, the system manager is notified with an<br />
OPCOM message. When set to OFF ( 0 ), multipath polling is disabled.<br />
The default is ON. Note that this parameter must be set to ON to use the<br />
automatic failback feature.<br />
‡MPDEV_REMOTE (disks<br />
only)<br />
Enables MSCP served disks to become members of a multipath set when<br />
set to ON ( 1 ). When set to OFF ( 0 ), only local paths to a SCSI or Fibre<br />
Channel device are used in the formation of additional multipath sets.<br />
MPDEV_REMOTE is enabled by default. However, setting this parameter to<br />
OFF has no effect on existing multipath sets that have remote paths.<br />
To use multipath failover to a served path, MPDEV_REMOTE must be<br />
enabled on all systems that have direct access to shared SCSI/Fibre Channel<br />
devices. The first release to provide this feature is <strong>OpenVMS</strong> Alpha Version<br />
7.3-1. Therefore, all nodes on which MPDEV_REMOTE is enabled must be<br />
running <strong>OpenVMS</strong> Alpha Version 7.3-1 (or later).<br />
If MPDEV_ENABLE is set to OFF (0), the setting of MPDEV_REMOTE has<br />
no effect because the addition of all new paths to multipath sets is disabled.<br />
The default is ON.<br />
MSCP_BUFFER This buffer area is the space used by the server to transfer data between<br />
client systems and local disks.<br />
On VAX systems, MSCP_BUFFER specifies the number of pages to be<br />
allocated to the MSCP server’s local buffer area.<br />
On Alpha systems, MSCP_BUFFER specifies the number of pagelets to be<br />
allocated to the MSCP server’s local buffer area.<br />
MSCP_CMD_TMO Specifies the time in seconds that the <strong>OpenVMS</strong> MSCP server uses to detect<br />
MSCP command timeouts. The MSCP server must complete the command<br />
within a built-in time of approximately 40 seconds plus the value of the<br />
MSCP_CMD_TMO parameter.<br />
An MSCP_CMD_TMO value of 0 is normally adequate. A value of 0 provides<br />
the same behavior as in previous releases of <strong>OpenVMS</strong> (which did not have<br />
an MSCP_CMD_TMO system parameter). A nonzero setting increases the<br />
amount of time before an MSCP command times out.<br />
If command timeout errors are being logged on client nodes, setting the<br />
parameter to a nonzero value on <strong>OpenVMS</strong> servers reduces the number<br />
of errors logged. Increasing the value of this parameter reduces the number of<br />
client MSCP command timeouts and increases the time it takes to detect<br />
faulty devices.<br />
If you need to decrease the number of command timeout errors, set an initial<br />
value of 60. If timeout errors continue to be logged, you can increase this<br />
value in increments of 20 seconds.<br />
MSCP_CREDITS Specifies the number of outstanding I/O requests that can be active from one<br />
client system.<br />
MSCP_LOAD Controls whether the MSCP server is loaded. Specify 1 to load the server,<br />
and use the default CPU load rating. A value greater than 1 loads the server<br />
and uses this value as a constant load rating. By default, the value is set to<br />
0 and the server is not loaded.<br />
MSCP_SERVE_ALL Controls the serving of disks. The settings take effect when the system<br />
boots. You cannot change the settings when the system is running.<br />
Starting with <strong>OpenVMS</strong> Version 7.2, the serving types are implemented as a<br />
bit mask. To specify the type of serving your system will perform, locate the<br />
type you want in the following table and specify its value. For some systems,<br />
you may want to specify two serving types, such as serving the system disk<br />
and serving locally attached disks. To specify such a combination, add the<br />
values of each type, and specify the sum.<br />
In a mixed-version cluster that includes any systems running <strong>OpenVMS</strong><br />
Version 7.1-x or earlier, serving all available disks is restricted to serving all<br />
disks except those whose allocation class does not match the system’s node<br />
allocation class (pre-Version 7.2 meaning). To specify this type of serving,<br />
use the value 9 (which sets bit 0 and bit 3).<br />
The following table describes the serving type controlled by each bit and its<br />
decimal value.<br />
Bit and Value When Set Description<br />
Bit 0 (1) Serve all available disks (locally attached and those connected<br />
to HSx and DSSI controllers). Disks with allocation classes<br />
that differ from the system’s allocation class (set by the<br />
ALLOCLASS parameter) are also served if bit 3 is not set.<br />
Bit 1 (2) Serve locally attached (non-HSx and non-DSSI) disks.<br />
Bit 2 (4) Serve the system disk. This is the default setting. This<br />
setting is important when other nodes in the cluster rely on<br />
this system being able to serve its system disk. This setting<br />
prevents obscure contention problems that can occur when<br />
a system attempts to complete I/O to a remote system disk<br />
whose system has failed.<br />
Bit 3 (8) Restrict the serving specified by bit 0. All disks except those<br />
with allocation classes that differ from the system’s allocation<br />
class (set by the ALLOCLASS parameter) are served.<br />
This is pre-Version 7.2 behavior. If your cluster includes<br />
systems running <strong>OpenVMS</strong> Version 7.1-x or earlier, and you want to serve<br />
all available disks, you must specify 9, the result of setting<br />
this bit and bit 0.<br />
Although the serving types are now implemented as a bit mask, the values<br />
of 0, 1, and 2, specified by bit 0 and bit 1, retain their original meanings:<br />
• 0 — Do not serve any disks (the default for earlier versions of<br />
<strong>OpenVMS</strong>).<br />
• 1 — Serve all available disks.<br />
• 2 — Serve only locally attached (non-HSx and non-DSSI) disks.<br />
If the MSCP_LOAD system parameter is 0, MSCP_SERVE_ALL is ignored.<br />
For more information about this system parameter, see Section 6.3.1.<br />
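The bit-mask arithmetic described above can be sketched as follows. The constant names are ours, chosen for readability; the bit values and the two combinations (2 + 4, and the value 9 from bits 0 and 3) come from the table above. Python is used purely for illustration:<br />

```python
SERVE_ALL          = 1  # bit 0: serve all available disks
SERVE_LOCAL        = 2  # bit 1: serve locally attached disks
SERVE_SYSTEM_DISK  = 4  # bit 2: serve the system disk (the default)
RESTRICT_ALLOCLASS = 8  # bit 3: restrict the serving specified by bit 0

# Serve the system disk plus locally attached disks:
print(SERVE_LOCAL | SERVE_SYSTEM_DISK)   # -> 6

# Pre-Version 7.2 "serve all" behavior: bit 0 plus bit 3:
print(SERVE_ALL | RESTRICT_ALLOCLASS)    # -> 9
```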
NISCS_CONV_BOOT During booting as an <strong>OpenVMS</strong> <strong>Cluster</strong> satellite, specifies whether<br />
conversational bootstraps are enabled on the computer. The default value of<br />
0 specifies that conversational bootstraps are disabled. A value of 1 enables<br />
conversational bootstraps.<br />
NISCS_LAN_OVRHD Starting with <strong>OpenVMS</strong> Version 7.3, this parameter is obsolete. This<br />
parameter was formerly provided to reserve space in a LAN packet for<br />
encryption fields applied by external encryption devices. PEDRIVER now<br />
automatically determines the maximum packet size a LAN path can deliver,<br />
including any packet-size reductions required by external encryption devices.<br />
NISCS_LOAD_PEA0 Specifies whether the port driver (PEDRIVER) is to be loaded to enable<br />
cluster communications over the local area network (LAN). The default value<br />
of 0 specifies that the driver is not loaded. A value of 1 specifies that the<br />
driver is loaded.<br />
Caution: If the NISCS_LOAD_PEA0 parameter is set to 1, the<br />
VAXCLUSTER system parameter must be set to 2. This ensures coordinated<br />
access to shared resources in the <strong>OpenVMS</strong> <strong>Cluster</strong> and prevents accidental<br />
data corruption.<br />
NISCS_MAX_PKTSZ Specifies an upper limit, in bytes, on the size of the user data area in the<br />
largest packet sent by NISCA on any local area network (LAN).<br />
The NISCS_MAX_PKTSZ parameter allows the system manager to change<br />
the packet size used for cluster communications on network communication<br />
paths. PEDRIVER automatically allocates memory to support the largest<br />
packet size that is usable by any virtual circuit connected to the system up<br />
to the limit set by this parameter.<br />
Its default values are different for <strong>OpenVMS</strong> Alpha and <strong>OpenVMS</strong> VAX.<br />
• On Alpha, the default value is the largest packet size currently<br />
supported by <strong>OpenVMS</strong>, in order to optimize performance.<br />
• On VAX, the default value is the Ethernet packet size.<br />
PEDRIVER uses NISCS_MAX_PKTSZ to compute the maximum amount of<br />
data to transmit in any LAN packet as follows:<br />
LAN packet size
NISCS_PORT_SERV NISCS_PORT_SERV provides flag bits for PEDRIVER port services. Setting<br />
bits 0 and 1 (decimal value 3) enables data checking. The remaining bits are<br />
reserved for future use. Starting with <strong>OpenVMS</strong> Version 7.3-1, you can use<br />
the SCACP command SET VC/CHECKSUMMING to specify data checking<br />
on the VCs to certain nodes. You can do this on a running system. (Refer to<br />
the SCACP documentation in the <strong>OpenVMS</strong> System Management Utilities<br />
Reference Manual for more information.)<br />
On the other hand, changing the setting of NISCS_PORT_SERV requires a<br />
reboot. Furthermore, this parameter applies to all virtual circuits between<br />
the node on which it is set and other nodes in the cluster.<br />
NISCS_PORT_SERV has the AUTOGEN attribute.<br />
PASTDGBUF Specifies the number of datagram receive buffers to queue initially for the<br />
cluster port driver’s configuration poller. The initial value is expanded<br />
during system operation, if needed.<br />
MEMORY CHANNEL devices ignore this parameter.<br />
QDSKINTERVAL Specifies, in seconds, the disk quorum polling interval. The maximum is<br />
32767, the minimum is 1, and the default is 3. Lower values trade increased<br />
overhead cost for greater responsiveness.<br />
This parameter should be set to the same value on each cluster computer.<br />
QDSKVOTES Specifies the number of votes contributed to the cluster votes total by a<br />
quorum disk. The maximum is 127, the minimum is 0, and the default is 1.<br />
This parameter is used only when DISK_QUORUM is defined.<br />
RECNXINTERVAL Specifies, in seconds, the interval during which the connection manager<br />
attempts to reconnect a broken connection to another computer. If a new<br />
connection cannot be established during this period, the connection is<br />
declared irrevocably broken, and either this computer or the other must<br />
leave the cluster. This parameter trades faster response to certain types<br />
of system failures for the ability to survive transient faults of increasing<br />
duration.<br />
This parameter should be set to the same value on each cluster computer.<br />
This parameter also affects the tolerance of the <strong>OpenVMS</strong> <strong>Cluster</strong> system for<br />
LAN bridge failures (see Section 3.4.7).<br />
SCSBUFFCNT †On VAX systems, SCSBUFFCNT is the number of buffer descriptors<br />
configured for all SCS devices. If no SCS device is configured on your<br />
system, this parameter is ignored. Generally, each data transfer needs a<br />
buffer descriptor; thus, the number of buffer descriptors limits the number<br />
of possible simultaneous I/Os. Various performance monitors report when a<br />
system is out of buffer descriptors for a given work load, indicating that a<br />
larger value for SCSBUFFCNT is worth considering.<br />
Note: AUTOGEN provides feedback for this parameter on VAX systems<br />
only.<br />
‡On Alpha systems, the SCS buffers are allocated as needed, and<br />
SCSBUFFCNT is reserved for <strong>OpenVMS</strong> use only.<br />
†VAX specific<br />
SCSCONNCNT The initial number of SCS connections that are configured for use by all<br />
system applications, including the one used by Directory Service Listen. The<br />
initial number will be expanded by the system if needed.<br />
If no SCS ports are configured on your system, this parameter is ignored.<br />
The default value is adequate for all SCS hardware combinations.<br />
Note: AUTOGEN provides feedback for this parameter on VAX systems<br />
only.<br />
SCSNODE 1<br />
Specifies the name of the computer. This parameter is not dynamic.<br />
Specify SCSNODE as a string of up to six characters. Enclose the string in<br />
quotation marks.<br />
If the computer is in an <strong>OpenVMS</strong> <strong>Cluster</strong>, specify a value that is unique<br />
within the cluster. Do not specify the null string.<br />
If the computer is running DECnet for <strong>OpenVMS</strong>, the value must be the<br />
same as the DECnet node name.<br />
SCSRESPCNT SCSRESPCNT is the total number of response descriptor table entries<br />
(RDTEs) configured for use by all system applications.<br />
If no SCS or DSA port is configured on your system, this parameter is<br />
ignored.<br />
SCSSYSTEMID 1<br />
Specifies a number that identifies the computer. This parameter is not<br />
dynamic. SCSSYSTEMID is the low-order 32 bits of the 48-bit system<br />
identification number.<br />
If the computer is in an <strong>OpenVMS</strong> <strong>Cluster</strong>, specify a value that is unique<br />
within the cluster.<br />
If the computer is running DECnet for <strong>OpenVMS</strong>, calculate the value from<br />
the DECnet address using the following formula:<br />
SCSSYSTEMID = (DECnet-area-number * 1024)<br />
+ DECnet-node-number<br />
Example: If the DECnet address is 2.211, calculate the value as follows:<br />
SCSSYSTEMID = (2 * 1024) + 211 = 2259<br />
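The formula above is easy to check mechanically. The helper below is ours, not an <strong>OpenVMS</strong> interface, and Python is used only for illustration:<br />

```python
def scssystemid(decnet_area, decnet_node):
    """Low-order 32 bits of the system ID, per the formula above."""
    return decnet_area * 1024 + decnet_node

# DECnet address 2.211, as in the manual's example:
print(scssystemid(2, 211))  # -> 2259
```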
SCSSYSTEMIDH Specifies the high-order 16 bits of the 48-bit system identification number.<br />
This parameter must be set to 0. It is reserved by <strong>OpenVMS</strong> for future use.<br />
TAPE_ALLOCLASS Specifies a numeric value from 0 to 255 to be assigned as the tape allocation<br />
class for tape devices connected to the computer. The default value is 0.<br />
TIMVCFAIL Specifies the time required for a virtual circuit failure to be detected.<br />
Compaq recommends that you use the default value. Compaq further<br />
recommends that you decrease this value only in <strong>OpenVMS</strong> <strong>Cluster</strong> systems<br />
of three or fewer CPUs, use the same value on each computer in the cluster,<br />
and use dedicated LAN segments for cluster I/O.<br />
TMSCP_LOAD Controls whether the TMSCP server is loaded. Specify a value of 1 to load<br />
the server and set all available TMSCP tapes served. By default, the value<br />
is set to 0, and the server is not loaded.<br />
1 Once a computer has been recognized by another computer in the cluster, you cannot change the SCSSYSTEMID or<br />
SCSNODE parameter without either changing both or rebooting the entire cluster.<br />
TMSCP_SERVE_ALL Controls the serving of tapes. The settings take effect when the system<br />
boots. You cannot change the settings when the system is running.<br />
Starting with <strong>OpenVMS</strong> Version 7.2, the serving types are implemented as a<br />
bit mask. To specify the type of serving your system will perform, locate the<br />
type you want in the following table and specify its value. For some systems,<br />
you may want to specify two serving types, such as serving all tapes except<br />
those whose allocation class does not match. To specify such a combination,<br />
add the values of each type, and specify the sum.<br />
In a mixed-version cluster that includes any systems running <strong>OpenVMS</strong><br />
Version 7.1-x or earlier, serving all available tapes is restricted to serving<br />
all tapes except those whose allocation class does not match the system’s<br />
allocation class (pre-Version 7.2 meaning). To specify this type of serving,<br />
use the value 9, which sets bit 0 and bit 3.<br />
The following table describes the serving type controlled by each bit and its<br />
decimal value.<br />
Bit Value When Set Description<br />
Bit 0 1 Serve all available tapes (locally attached and those<br />
connected to HSx and DSSI controllers). Tapes<br />
with allocation classes that differ from the system’s<br />
allocation class (set by the ALLOCLASS parameter)<br />
are also served if bit 3 is not set.<br />
Bit 1 2 Serve locally attached (non-HSx and non-DSSI)<br />
tapes.<br />
Bit 2 n/a Reserved.<br />
Bit 3 8 Restrict the serving specified by bit 0. All tapes<br />
except those with allocation classes that differ from<br />
the system’s allocation class (set by the ALLOCLASS<br />
parameter) are served.<br />
This is pre-Version 7.2 behavior. If your cluster<br />
includes systems running <strong>OpenVMS</strong> Version 7.1-x<br />
or earlier, and you want to serve all available tapes,<br />
you must specify 9, the result of setting this bit and<br />
bit 0.<br />
Although the serving types are now implemented as a bit mask, the values<br />
of 0, 1, and 2, specified by bit 0 and bit 1, retain their original meanings:<br />
• 0 — Do not serve any tapes (the default for earlier versions of<br />
<strong>OpenVMS</strong>).<br />
• 1 — Serve all available tapes.<br />
• 2 — Serve only locally attached (non-HSx and non-DSSI) tapes.<br />
If the TMSCP_LOAD system parameter is 0, TMSCP_SERVE_ALL is<br />
ignored.<br />
VAXCLUSTER Controls whether the computer should join or form a cluster. This parameter<br />
accepts the following three values:<br />
• 0 — Specifies that the computer will not participate in a cluster.<br />
• 1 — Specifies that the computer should participate in a cluster if<br />
hardware supporting SCS (CI or DSSI) is present or if NISCS_LOAD_<br />
PEA0 is set to 1, indicating that cluster communications is enabled over<br />
the local area network (LAN).<br />
• 2 — Specifies that the computer should participate in a cluster.<br />
You should always set this parameter to 2 on computers intended to run in a<br />
cluster, to 0 on computers that boot from a UDA disk controller and are not<br />
intended to be part of a cluster, and to 1 (the default) otherwise.<br />
Caution: If the NISCS_LOAD_PEA0 system parameter is set to 1, the<br />
VAXCLUSTER parameter must be set to 2. This ensures coordinated<br />
access to shared resources in the <strong>OpenVMS</strong> <strong>Cluster</strong> system and prevents<br />
accidental data corruption. Data corruption may occur on shared resources<br />
if the NISCS_LOAD_PEA0 parameter is set to 1 and the VAXCLUSTER<br />
parameter is set to 0.<br />
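In practice, these two parameters are typically set together in SYS$SYSTEM:MODPARAMS.DAT before running AUTOGEN. The following is a minimal sketch for a LAN-connected cluster member, using the values from the caution above:<br />

```
! MODPARAMS.DAT fragment -- LAN-connected cluster member
VAXCLUSTER = 2        ! always participate in the cluster
NISCS_LOAD_PEA0 = 1   ! load PEDRIVER for cluster communications over the LAN
```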
VOTES Specifies the number of votes toward a quorum to be contributed by the<br />
computer. The default is 1.<br />
Table A–2 lists system parameters that should not require adjustment at any<br />
time. These parameters are provided for use in system debugging. Compaq<br />
recommends that you do not change these parameters unless you are advised<br />
to do so by your Compaq support representative. Incorrect adjustment of these<br />
parameters can result in cluster failures.<br />
Table A–2 <strong>Cluster</strong> System Parameters Reserved for <strong>OpenVMS</strong> Use Only<br />
Parameter Description<br />
‡MC_SERVICES_P1 The value of this parameter must be the same on all nodes connected by<br />
(dynamic)<br />
MEMORY CHANNEL.<br />
‡MC_SERVICES_P5 This parameter must remain at the default value of 8000000. This<br />
(dynamic)<br />
parameter value must be the same on all nodes connected by MEMORY<br />
CHANNEL.<br />
‡MC_SERVICES_P8 (static) This parameter must remain at the default value of 0. This parameter value<br />
must be the same on all nodes connected by MEMORY CHANNEL.<br />
‡MPDEV_D1 A multipath system parameter.<br />
PAMAXPORT PAMAXPORT specifies the maximum port number to be polled on each CI<br />
and DSSI. The CI and DSSI port drivers poll to discover newly initialized<br />
ports or the absence or failure of previously responding remote ports.<br />
A system will not detect the existence of ports whose port numbers are<br />
higher than this parameter’s value. Thus, this parameter should be set to a<br />
value that is greater than or equal to the highest port number being used on<br />
any CI or DSSI connected to the system.<br />
You can decrease this parameter to reduce polling activity if the hardware<br />
configuration has fewer than 16 ports. For example, if the CI or DSSI with<br />
the largest configuration has a total of five ports assigned to port numbers 0<br />
through 4, you could set PAMAXPORT to 4.<br />
If no CI or DSSI devices are configured on your system, this parameter is<br />
ignored.<br />
The default for this parameter is 15 (poll for all possible ports 0 through 15).<br />
Compaq recommends that you set this parameter to the same value on each<br />
cluster computer.<br />
PANOPOLL Disables CI and DSSI polling for ports if set to 1. (The default is 0.) When<br />
PANOPOLL is set, a computer will not promptly discover that another<br />
computer has shut down or powered down and will not discover a new<br />
computer that has booted. This parameter is useful when you want to bring<br />
up a computer detached from the rest of the cluster for debugging purposes.<br />
PANOPOLL is functionally equivalent to uncabling the system from the<br />
DSSI or star coupler. This parameter does not affect <strong>OpenVMS</strong> <strong>Cluster</strong><br />
communications over the LAN.<br />
The default value of 0 is the normal setting and is required if you are booting<br />
from an HSC controller or if your system is joining an <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
This parameter is ignored if there are no CI or DSSI devices configured on<br />
your system.<br />
PANUMPOLL Establishes the number of CI and DSSI ports to be polled during each polling<br />
interval. The normal setting for PANUMPOLL is 16.<br />
On older systems with less powerful CPUs, the parameter may be useful<br />
in applications sensitive to the amount of contiguous time that the system<br />
spends at IPL 8. Reducing PANUMPOLL reduces the amount of time spent<br />
at IPL 8 during each polling interval while increasing the number of polling<br />
intervals needed to discover new or failed ports.<br />
If no CI or DSSI devices are configured on your system, this parameter is<br />
ignored.<br />
PAPOLLINTERVAL Specifies, in seconds, the polling interval the CI port driver uses to poll for<br />
a newly booted computer, a broken port-to-port virtual circuit, or a failed<br />
remote computer.<br />
This parameter trades polling overhead against quick response to virtual<br />
circuit failures. This parameter should be set to the same value on each<br />
cluster computer.<br />
PAPOOLINTERVAL Specifies, in seconds, the interval at which the port driver checks available<br />
nonpaged pool after a pool allocation failure.<br />
This parameter trades faster response to pool allocation failures for increased<br />
system overhead.<br />
If CI or DSSI devices are not configured on your system, this parameter is<br />
ignored.<br />
PASANITY PASANITY controls whether the CI and DSSI port sanity timers are enabled<br />
to permit remote systems to detect a system that has been hung at IPL 8 or<br />
higher for 100 seconds. It also controls whether virtual circuit checking gets<br />
enabled on the local system. The TIMVCFAIL parameter controls the time<br />
(1–99 seconds).<br />
PASANITY is normally set to 1 and should be set to 0 only if you are<br />
debugging with XDELTA or planning to halt the CPU for periods of 100<br />
seconds or more.<br />
PASANITY is only semidynamic. A new value of PASANITY takes effect on<br />
the next CI or DSSI port reinitialization.<br />
If CI or DSSI devices are not configured on your system, this parameter is<br />
ignored.<br />
PASTDGBUF The number of datagram receive buffers to queue initially for each CI or<br />
DSSI port driver’s configuration poller; the initial value is expanded during<br />
system operation, if needed.<br />
If no CI or DSSI devices are configured on your system, this parameter is<br />
ignored.<br />
PASTIMOUT The basic interval at which the CI port driver wakes up to perform time-based<br />
bookkeeping operations. It is also the period after which a timeout<br />
will be declared if no response to a start handshake datagram has been<br />
received.<br />
If no CI or DSSI device is configured on your system, this parameter is<br />
ignored.<br />
PRCPOLINTERVAL Specifies, in seconds, the polling interval used to look for SCS applications,<br />
such as the connection manager and MSCP disks, on other computers. Each<br />
computer is polled, at most, once each interval.<br />
This parameter trades polling overhead against quick recognition of new<br />
computers or servers as they appear.<br />
SCSMAXMSG The maximum number of bytes of system application data in one sequenced<br />
message. The amount of physical memory consumed by one message is<br />
SCSMAXMSG plus the overhead for buffer management.<br />
If an SCS port is not configured on your system, this parameter is ignored.<br />
SCSMAXDG Specifies the maximum number of bytes of application data in one datagram.<br />
If an SCS port is not configured on your system, this parameter is ignored.<br />
SCSFLOWCUSH Specifies the lower limit for receive buffers at which point SCS starts to<br />
notify the remote SCS of new receive buffers. For each connection, SCS<br />
tracks the number of receive buffers available. SCS communicates this<br />
number to the SCS at the remote end of the connection. However, SCS does<br />
not need to do this for each new receive buffer added. Instead, SCS notifies<br />
the remote SCS of new receive buffers if the number of receive buffers falls<br />
as low as the SCSFLOWCUSH value.<br />
If an SCS port is not configured on your system, this parameter is ignored.<br />
B<br />
Building Common Files<br />
This appendix provides guidelines for building a common user authorization file<br />
(UAF) from computer-specific files. It also describes merging RIGHTSLIST.DAT<br />
files.<br />
For more detailed information about how to set up a computer-specific<br />
authorization file, see the descriptions in the <strong>OpenVMS</strong> Guide to System<br />
Security.<br />
B.1 Building a Common SYSUAF.DAT File<br />
To build a common SYSUAF.DAT file, follow the steps in Table B–1.<br />
Table B–1 Building a Common SYSUAF.DAT File<br />
Step Action<br />
1 Print a listing of SYSUAF.DAT on each computer. To print this listing, invoke AUTHORIZE<br />
and specify the AUTHORIZE command LIST as follows:<br />
$ SET DEF SYS$SYSTEM<br />
$ RUN AUTHORIZE<br />
UAF> LIST/FULL [*,*]<br />
2 Use the listings to compare the accounts from each computer. On the listings, mark any<br />
necessary changes. For example:<br />
• Delete any accounts that you no longer need.<br />
• Make sure that UICs are set appropriately:<br />
User UICs<br />
Check each user account in the cluster to see whether it should have a unique user<br />
identification code (UIC). For example, <strong>OpenVMS</strong> <strong>Cluster</strong> member VENUS may<br />
have a user account JONES that has the same UIC as user account SMITH on<br />
computer MARS. When computers VENUS and MARS are joined to form a cluster,<br />
accounts JONES and SMITH will exist in the cluster environment with the same<br />
UIC. If the UICs of these accounts are not differentiated, each user will have the<br />
same access rights to various objects in the cluster. In this case, you should assign<br />
each account a unique UIC.<br />
Group UICs<br />
Make sure that accounts that perform the same type of work have the same group<br />
UIC. Accounts in a single-computer environment probably follow this convention.<br />
However, there may be groups of users on each computer that will perform the<br />
same work in the cluster but that have group UICs unique to their local computer.<br />
As a rule, the group UIC for any given work category should be the same on each<br />
computer in the cluster. For example, data entry accounts on VENUS should have<br />
the same group UIC as data entry accounts on MARS.<br />
Note: If you change the UIC for a particular user, you should also change the owner<br />
UICs for that user’s existing files and directories. You can use the DCL commands SET<br />
FILE and SET DIRECTORY to make these changes. These commands are described in<br />
detail in the <strong>OpenVMS</strong> DCL Dictionary.<br />
3 Choose the SYSUAF.DAT file from one of the computers to be a master SYSUAF.DAT.<br />
Note: The default values for a number of SYSUAF process limits and quotas are higher<br />
on an Alpha computer than they are on a VAX computer. See A Comparison of System<br />
Management on <strong>OpenVMS</strong> AXP and <strong>OpenVMS</strong> VAX¹ for information about setting values on<br />
both computers.<br />
¹ This manual has been archived but is available in PostScript and DECW$BOOK (Bookreader)<br />
formats on the <strong>OpenVMS</strong> Documentation CD–ROM. A printed book can be ordered through DECdirect<br />
(800-354-4825).<br />
4 Merge the SYSUAF.DAT files from the other computers to the master SYSUAF.DAT by<br />
running the Convert utility (CONVERT) on the computer that owns the master SYSUAF.DAT.<br />
(See the <strong>OpenVMS</strong> Record Management Utilities Reference Manual for a description of<br />
CONVERT.) To use CONVERT to merge the files, each SYSUAF.DAT file must be accessible<br />
to the computer that is running CONVERT.<br />
Syntax: To merge the UAFs into the master SYSUAF.DAT file, specify the CONVERT<br />
command in the following format:<br />
CONVERT SYSUAF1,SYSUAF2,...SYSUAFn MASTER_SYSUAF<br />
Note that if a given user name appears in more than one source file, only the first occurrence<br />
of that name appears in the merged file.<br />
Example: The following command sequence example creates a new SYSUAF.DAT file from<br />
the combined contents of the two input files:<br />
$ SET DEFAULT SYS$SYSTEM<br />
$ CONVERT/MERGE [SYS1.SYSEXE]SYSUAF.DAT, -<br />
_$ [SYS2.SYSEXE]SYSUAF.DAT SYSUAF.DAT<br />
The CONVERT command in this example adds the records from the files<br />
[SYS1.SYSEXE]SYSUAF.DAT and [SYS2.SYSEXE]SYSUAF.DAT to the file SYSUAF.DAT<br />
on the local computer.<br />
After you run CONVERT, you have a master SYSUAF.DAT that contains records from the<br />
other SYSUAF.DAT files.<br />
5 Use AUTHORIZE to modify the accounts in the master SYSUAF.DAT according to the<br />
changes you marked on the initial listings of the SYSUAF.DAT files from each computer.<br />
6 Place the master SYSUAF.DAT file in SYS$COMMON:[SYSEXE].<br />
7 Remove all node-specific SYSUAF.DAT files.<br />
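For example, assuming the master SYSUAF.DAT is in your current default directory and that the computers use node-specific roots [SYS0] and [SYS1] (the root names shown here are illustrative), a command sequence like the following performs steps 6 and 7:<br />
$ COPY SYSUAF.DAT SYS$COMMON:[SYSEXE]SYSUAF.DAT<br />
$ DELETE SYS$SYSDEVICE:[SYS0.SYSEXE]SYSUAF.DAT;*<br />
$ DELETE SYS$SYSDEVICE:[SYS1.SYSEXE]SYSUAF.DAT;*<br />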
B.2 Merging RIGHTSLIST.DAT Files<br />
If you need to merge RIGHTSLIST.DAT files, you can use a command sequence<br />
like the following:<br />
$ ACTIVE_RIGHTSLIST = F$PARSE("RIGHTSLIST","SYS$SYSTEM:.DAT")<br />
$ CONVERT/SHARE/STAT ’ACTIVE_RIGHTSLIST’ RIGHTSLIST.NEW<br />
$ CONVERT/MERGE/STAT/EXCEPTION=RIGHTSLIST_DUPLICATES.DAT -<br />
_$ [SYS1.SYSEXE]RIGHTSLIST.DAT, [SYS2.SYSEXE]RIGHTSLIST.DAT RIGHTSLIST.NEW<br />
$ DUMP/RECORD RIGHTSLIST_DUPLICATES.DAT<br />
$ CONVERT/NOSORT/FAST/STAT RIGHTSLIST.NEW ’ACTIVE_RIGHTSLIST’<br />
The commands in this example add the RIGHTSLIST.DAT files from two<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> computers to the master RIGHTSLIST.DAT file in the current<br />
default directory. For detailed information about creating and maintaining<br />
RIGHTSLIST.DAT files, see the security guide for your system.<br />
C<br />
<strong>Cluster</strong> Troubleshooting<br />
This appendix contains information to help you perform troubleshooting<br />
operations for the following:<br />
• Failures of computers to boot or to join the cluster<br />
• <strong>Cluster</strong> hangs<br />
• CLUEXIT bugchecks<br />
• Port device problems<br />
C.1 Diagnosing Computer Failures<br />
C.1.1 Preliminary Checklist<br />
Before you initiate diagnostic procedures, be sure to verify that these conditions<br />
are met:<br />
• All cluster hardware components are correctly connected and checked for<br />
proper operation.<br />
• When you attempt to add a new or recently repaired CI computer to the<br />
cluster, verify that the CI cables are correctly connected, as described in<br />
Section C.10.5.<br />
• <strong>OpenVMS</strong> <strong>Cluster</strong> computers and mass storage devices are configured<br />
according to requirements specified in the <strong>OpenVMS</strong> <strong>Cluster</strong> Software<br />
Software Product Description (SPD 29.78.xx).<br />
• When attempting to add a satellite to a cluster, you must verify that the LAN<br />
is configured according to requirements specified in the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
Software SPD. You must also verify that you have correctly configured and<br />
started the network, following the procedures described in Chapter 4.<br />
If, after performing preliminary checks and taking appropriate corrective action,<br />
you find that a computer still fails to boot or to join the cluster, you can follow the<br />
procedures in Sections C.2 through C.4 to attempt recovery.<br />
C.1.2 Sequence of Booting Events<br />
To perform diagnostic and recovery procedures effectively, you must understand<br />
the events that occur when a computer boots and attempts to join the cluster.<br />
This section outlines those events and shows typical messages displayed at the<br />
console.<br />
Note that events vary, depending on whether a computer is the first to boot in<br />
a new cluster or whether it is booting in an active cluster. Note also that some<br />
events (such as loading the cluster database containing the password and group<br />
number) occur only in <strong>OpenVMS</strong> <strong>Cluster</strong> systems on a LAN.<br />
The normal sequence of events is shown in Table C–1.<br />
Table C–1 Sequence of Booting Events<br />
Step Action<br />
1 The computer boots. If the computer is a satellite, a message like the following shows<br />
the name and LAN address of the MOP server that has downline loaded the satellite.<br />
At this point, the satellite has completed communication with the MOP server and<br />
further communication continues with the system disk server, using <strong>OpenVMS</strong> <strong>Cluster</strong><br />
communications.<br />
%VAXcluster-I-SYSLOAD, system loaded from Node X...<br />
For any booting computer, the <strong>OpenVMS</strong> ‘‘banner message’’ is displayed in the following<br />
format:<br />
operating-system Version n.n dd-mmm-yyyy hh:mm.ss<br />
2 The computer attempts to form or join the cluster, and the following message appears:<br />
waiting to form or join an <strong>OpenVMS</strong> <strong>Cluster</strong> system<br />
If the computer is a member of an <strong>OpenVMS</strong> <strong>Cluster</strong> based on the LAN, the cluster security<br />
database (containing the cluster password and group number) is loaded. Optionally, the<br />
MSCP server and TMSCP server can be loaded:<br />
%VAXcluster-I-LOADSECDB, loading the cluster security database<br />
%MSCPLOAD-I-LOADMSCP, loading the MSCP disk server<br />
%TMSCPLOAD-I-LOADTMSCP, loading the TMSCP tape server<br />
3 If the computer discovers a cluster, the computer attempts to join it. If a cluster is found, the<br />
connection manager displays one or more messages in the following format:<br />
%CNXMAN, Sending VAXcluster membership request to system X...<br />
Otherwise, the connection manager forms the cluster when it has enough votes to establish<br />
quorum (that is, when enough voting computers have booted).<br />
4 As the booting computer joins the cluster, the connection manager displays a message in the<br />
following format:<br />
%CNXMAN, now a VAXcluster member -- system X...<br />
Note that if quorum is lost while the computer is booting, or if a computer is unable to join<br />
the cluster within 2 minutes of booting, the connection manager displays messages like the<br />
following:<br />
%CNXMAN, Discovered system X...<br />
%CNXMAN, Deleting CSB for system X...<br />
%CNXMAN, Established "connection" to quorum disk<br />
%CNXMAN, Have connection to system X...<br />
%CNXMAN, Have "connection" to quorum disk<br />
The last two messages show any connections that have already been formed.<br />
5 If the cluster includes a quorum disk, you may also see messages like the following:<br />
%CNXMAN, Using remote access method for quorum disk<br />
%CNXMAN, Using local access method for quorum disk<br />
The first message indicates that the connection manager is unable to access the quorum disk<br />
directly, either because the disk is unavailable or because it is accessed through the MSCP<br />
server. Another computer in the cluster that can access the disk directly must verify that a<br />
reliable connection to the disk exists.<br />
The second message indicates that the connection manager can access the quorum disk<br />
directly and can supply information about the status of the disk to computers that cannot<br />
access the disk directly.<br />
Note: The connection manager may not see the quorum disk initially because the disk may<br />
not yet be configured. In that case, the connection manager first uses remote access, then<br />
switches to local access.<br />
6 Once the computer has joined the cluster, normal startup procedures execute. One of the first<br />
functions is to start the OPCOM process:<br />
%%%%%%%%%%% OPCOM 15-JAN-1994 16:33:55.33 %%%%%%%%%%%<br />
Logfile has been initialized by operator _X...$OPA0:<br />
Logfile is SYS$SYSROOT:[SYSMGR]OPERATOR.LOG;17<br />
%%%%%%%%%%% OPCOM 15-JAN-1994 16:33:56.43 %%%%%%%%%%%<br />
16:32:32.93 Node X... (csid 0002000E) is now a VAXcluster member<br />
7 As other computers join the cluster, OPCOM displays messages like the following:<br />
%%%%% OPCOM 15-JAN-1994 16:34:25.23 %%%%% (from node X...)<br />
16:34:24.42 Node X... (csid 000100F3)<br />
received VAXcluster membership request from node X...<br />
As startup procedures continue, various messages report startup events.<br />
Hint: For troubleshooting purposes, you can include in your site-specific startup<br />
procedures messages announcing each phase of the startup process—for example,<br />
mounting disks or starting queues.<br />
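Following this hint, you might add lines like the following to your site-specific startup procedure; the command procedure name MOUNT_DISKS.COM is a hypothetical example of a site-specific file:<br />
$ WRITE SYS$OUTPUT "Startup: mounting cluster disks..."<br />
$ @SYS$MANAGER:MOUNT_DISKS.COM<br />
$ WRITE SYS$OUTPUT "Startup: starting queues..."<br />
$ START/QUEUE/MANAGER<br />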
C.2 Computer on the CI Fails to Boot<br />
If a CI computer fails to boot, perform the following checks:<br />
Step Action<br />
1 Verify that the computer’s SCSNODE and SCSSYSTEMID parameters are unique in the<br />
cluster. If they are not, you must either alter both values or reboot all other computers.<br />
2 Verify that you are using the correct bootstrap command file. This file must specify the<br />
internal bus computer number (if applicable), the HSC or HSJ node number, and the<br />
disk from which the computer is to boot. Refer to your processor-specific installation<br />
and operations guide for information about setting values in default bootstrap command<br />
procedures.<br />
3 Verify that the PAMAXPORT system parameter is set to a value greater than or equal to<br />
the largest CI port number.<br />
4 Verify that the CI port has a unique hardware station address.<br />
5 Verify that the HSC subsystem is on line. The ONLINE switch on the HSC operator<br />
control panel should be pressed in.<br />
6 Verify that the disk is available. The correct port switches on the disk’s operator control<br />
panel should be pressed in.<br />
7 Verify that the computer has access to the HSC subsystem. The SHOW HOSTS command<br />
of the HSC SETSHO utility displays status for all computers (hosts) in the cluster. If the<br />
computer in question appears in the display as DISABLED, use the SETSHO utility to set<br />
the computer to the ENABLED state.<br />
Reference: For complete information about the SETSHO utility, consult the HSC<br />
hardware documentation.<br />
8 Verify that the HSC subsystem allows access to the boot disk. Invoke the SETSHO utility<br />
to ensure that the boot disk is available to the HSC subsystem. The utility’s SHOW DISKS<br />
command displays the current state of all disks visible to the HSC subsystem and displays<br />
all disks in the no-host-access table.<br />
If the boot disk appears in the no-host-access table, use the SETSHO utility to set the boot disk<br />
to host access.<br />
If the boot disk is available or mounted and host access is enabled, but the disk does not appear<br />
in the no-host-access table, contact your support representative and explain both the problem and<br />
the steps you have taken.<br />
C.3 Satellite Fails to Boot<br />
To boot successfully, a satellite must communicate with a MOP server over the<br />
LAN. You can use DECnet event logging to verify this communication. Proceed as<br />
follows:<br />
Step Action<br />
1 Log in as system manager on the MOP server.<br />
2 If event logging for management-layer events is not already enabled, enter the following<br />
NCP commands to enable it:<br />
NCP> SET LOGGING MONITOR EVENT 0.*<br />
NCP> SET LOGGING MONITOR STATE ON<br />
3 Enter the following DCL command to enable the terminal to receive DECnet messages<br />
reporting downline load events:<br />
$ REPLY/ENABLE=NETWORK
4 Boot the satellite. If the satellite and the MOP server can communicate and all boot<br />
parameters are correctly set, messages like the following are displayed at the MOP<br />
server’s terminal:<br />
DECnet event 0.3, automatic line service<br />
From node 2.4 (URANUS), 15-JAN-1994 09:42:15.12<br />
Circuit QNA-0, Load, Requested, Node = 2.42 (OBERON)<br />
File = SYS$SYSDEVICE:, Operating system<br />
Ethernet address = 08-00-2B-07-AC-03<br />
DECnet event 0.3, automatic line service<br />
From node 2.4 (URANUS), 15-JAN-1994 09:42:16.76<br />
Circuit QNA-0, Load, Successful, Node = 2.44 (ARIEL)<br />
File = SYS$SYSDEVICE:, Operating system<br />
Ethernet address = 08-00-2B-07-AC-13<br />
When the satellite cannot communicate with the MOP server (VAX or Alpha), no message for that<br />
satellite appears. There may be a problem with a LAN cable connection or adapter service.<br />
When the satellite’s data in the DECnet database is incorrectly specified (for example, if the<br />
hardware address is incorrect), a message like the following displays the correct address and<br />
indicates that a load was requested:<br />
DECnet event 0.7, aborted service request<br />
From node 2.4 (URANUS) 15-JAN-1994<br />
Circuit QNA-0, Line open error<br />
Ethernet address=08-00-2B-03-29-99<br />
Note the absence of the node name, node address, and system root.<br />
Sections C.3.2 through C.3.5 provide more information about satellite boot<br />
troubleshooting and often recommend that you ensure that the system<br />
parameters are set correctly.<br />
C.3.1 Displaying Connection Messages<br />
To enable the display of connection messages during a conversational boot,<br />
perform the following steps:<br />
Step Action<br />
1 Enable conversational booting by setting the satellite’s NISCS_CONV_BOOT system<br />
parameter to 1. On Alpha systems, update the ALPHAVMSSYS.PAR file, and on VAX<br />
systems update the VAXVMSSYS.PAR file in the system root on the disk server.<br />
2 Perform a conversational boot.<br />
‡On Alpha systems, enter the following command at the console:<br />
>>> b -flags 0,1<br />
†On VAX systems, set bit 0 in register R5. For example, on a VAXstation 3100 system,<br />
enter the following command at the console:<br />
>>> B/1<br />
†VAX specific<br />
‡Alpha specific<br />
3 Observe connection messages.<br />
Display connection messages during a satellite boot to determine which system in a large<br />
cluster is serving the system disk to a cluster satellite during the boot process. If booting<br />
problems occur, you can use this display to help isolate the problem with the system that is<br />
currently serving the system disk. Then, if your server system has multiple LAN adapters,<br />
you can isolate specific LAN adapters.<br />
4 Isolate LAN adapters.<br />
Isolate a LAN adapter by methodically rebooting with only one adapter connected. That is,<br />
disconnect all but one of the LAN adapters on the server system and reboot the satellite.<br />
If the satellite boots when it is connected to the system disk server, then follow the same<br />
procedure using a different LAN adapter. Continue these steps until you have located the<br />
bad adapter.<br />
Reference: See also Appendix C for help with troubleshooting satellite booting<br />
problems.<br />
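For example, assuming the satellite’s local root is [SYS17.] on disk $69$DUA121: (the device and root names used in the examples in Section C.3.2, shown here only for illustration), you could set this parameter from an Alpha system that has access to the satellite’s system disk:<br />
$ RUN SYS$SYSTEM:SYSGEN<br />
SYSGEN> USE $69$DUA121:[SYS17.SYSEXE]ALPHAVMSSYS.PAR<br />
SYSGEN> SET NISCS_CONV_BOOT 1<br />
SYSGEN> WRITE $69$DUA121:[SYS17.SYSEXE]ALPHAVMSSYS.PAR<br />
SYSGEN> EXIT<br />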
C.3.2 General <strong>OpenVMS</strong> <strong>Cluster</strong> Satellite-Boot Troubleshooting<br />
If a satellite fails to boot, use the steps outlined in this section to diagnose and<br />
correct problems in <strong>OpenVMS</strong> <strong>Cluster</strong> systems.<br />
Step Action<br />
1 Verify that the boot device is available. This check is particularly important for clusters in<br />
which satellites boot from multiple system disks.<br />
2 Verify that the DECnet network is up and running.<br />
3 Check the cluster group code and password. The cluster group code and password are set<br />
using the CLUSTER_CONFIG.COM procedure.<br />
4 Verify that you have installed the correct <strong>OpenVMS</strong> Alpha and <strong>OpenVMS</strong> VAX licenses.<br />
5 Verify system parameter values on each satellite node, as follows:<br />
VAXCLUSTER = 2<br />
NISCS_LOAD_PEA0 = 1<br />
NISCS_LAN_OVRHD = 0<br />
NISCS_MAX_PKTSZ = 1498¹<br />
SCSNODE is the name of the computer.<br />
SCSSYSTEMID is a number that identifies the computer.<br />
VOTES = 0<br />
The SCS parameter values are set differently depending on your system configuration.<br />
Reference: Appendix A describes how to set these SCS parameters.<br />
¹ For Ethernet adapters, the value of NISCS_MAX_PKTSZ is 1498. For FDDI adapters, the value is<br />
4468.<br />
To check system parameter values on a satellite node that cannot boot, invoke the<br />
SYSGEN utility on a running system in the <strong>OpenVMS</strong> <strong>Cluster</strong> that has access to the<br />
satellite node’s local root. (Note that you must invoke the SYSGEN utility from a node<br />
that is running the same type of operating system—for example, to troubleshoot an Alpha<br />
satellite node, you must run the SYSGEN utility on an Alpha system.) Check system<br />
parameters as follows:<br />
Step Action<br />
A Find the local root of the satellite node on the system disk. The following<br />
example is from an Alpha system running DECnet for <strong>OpenVMS</strong>:<br />
$ MCR NCP SHOW NODE HOME CHARACTERISTICS<br />
Node Volatile Characteristics as of 10-JAN-1994 09:32:56<br />
Remote node = 63.333 (HOME)<br />
Hardware address = 08-00-2B-30-96-86<br />
Load file = APB.EXE<br />
Load Assist Agent = SYS$SHARE:NISCS_LAA.EXE<br />
Load Assist Parameter = ALPHA$SYSD:[SYS17.]<br />
The local root in this example is ALPHA$SYSD:[SYS17.].<br />
Reference: Refer to the DECnet–Plus documentation for equivalent information<br />
using NCL commands.<br />
B Enter the SHOW LOGICAL command at the system prompt to translate the<br />
logical name for ALPHA$SYSD.<br />
$ SHO LOG ALPHA$SYSD<br />
"ALPHA$SYSD" = "$69$DUA121:" (LNM$SYSTEM_TABLE)<br />
C Invoke the SYSGEN utility on the system from which you can access the<br />
satellite’s local disk. (This example invokes the SYSGEN utility on an<br />
Alpha system using the Alpha parameter file ALPHAVMSSYS.PAR. The<br />
SYSGEN utility on VAX systems differs in that it uses the VAX parameter file<br />
VAXVMSSYS.PAR). The following example illustrates how to enter the SYSGEN<br />
command USE with the system parameter file on the local root for the satellite<br />
node and then enter the SHOW command to query the parameters in question.<br />
$ MCR SYSGEN<br />
SYSGEN> USE $69$DUA121:[SYS17.SYSEXE]ALPHAVMSSYS.PAR<br />
SYSGEN> SHOW VOTES<br />
Parameter<br />
Name Current Default Min. Max. Unit Dynamic<br />
--------- ------- ------- --- ----- ---- -------<br />
VOTES 0 1 0 127 Votes<br />
SYSGEN> EXIT
C.3.3 MOP Server Troubleshooting<br />
To diagnose and correct problems for MOP servers, follow the steps outlined in<br />
this section.<br />
Step Action<br />
1 Perform the steps outlined in Section C.3.2.<br />
2 Verify the NCP circuit state is on and the service is enabled. Enter the following<br />
commands to run the NCP utility and check the NCP circuit state.<br />
$ MCR NCP<br />
NCP> SHOW CIRCUIT ISA-0 CHARACTERISTICS<br />
Circuit Volatile Characteristics as of 12-JAN-1994 10:08:30<br />
Circuit = ISA-0<br />
State = on<br />
Service = enabled<br />
Designated router = 63.1021<br />
Cost = 10<br />
Maximum routers allowed = 33<br />
Router priority = 64<br />
Hello timer = 15<br />
Type = Ethernet<br />
Adjacent node = 63.1021<br />
Listen timer = 45<br />
3 If service is not enabled, you can enter NCP commands like the following to enable it:<br />
NCP> SET CIRCUIT circuit-id STATE OFF<br />
NCP> DEFINE CIRCUIT circuit-id SERVICE ENABLED<br />
NCP> SET CIRCUIT circuit-id SERVICE ENABLED STATE ON<br />
The DEFINE command updates the permanent database and ensures that service is<br />
enabled the next time you start the network. Note that DECnet traffic is interrupted<br />
while the circuit is off.<br />
4 Verify that the load assist parameter points to the system disk and the system root for the<br />
satellite.<br />
5 Verify that the satellite’s system disk is mounted on the MOP server node.<br />
6 ‡On Alpha systems, verify that the load file is APB.EXE.<br />
7 For MOP booting, the satellite node’s parameter file (ALPHAVMSSYS.PAR for Alpha<br />
computers and VAXVMSSYS.PAR for VAX computers) must be located in the [SYSEXE]<br />
directory of the satellite system root.<br />
8 Ensure that the file CLUSTER_AUTHORIZE.DAT is located in the<br />
[SYSCOMMON.SYSEXE] directory of the satellite system root.<br />
‡Alpha specific<br />
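For example, to confirm steps 4 and 5 on the MOP server, you can display the satellite node’s characteristics and verify that its system disk is mounted; the node name HOME and the device $69$DUA121: are taken from the examples in Section C.3.2 and are illustrative:<br />
$ MCR NCP SHOW NODE HOME CHARACTERISTICS<br />
$ SHOW DEVICES $69$DUA121:<br />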
C.3.4 Disk Server Troubleshooting<br />
To diagnose and correct problems for disk servers, follow the steps outlined in<br />
this section.<br />
Step Action<br />
1 Perform the steps in Section C.3.2.<br />
2 For each satellite node, verify the following system parameter values:<br />
MSCP_LOAD = 1<br />
MSCP_SERVE_ALL = 1<br />
3 The disk servers for the system disk must be connected directly to the disk.<br />
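You can verify the parameter values in step 2 with the SYSGEN utility on the node in question (the display format varies with the <strong>OpenVMS</strong> version):<br />
$ RUN SYS$SYSTEM:SYSGEN<br />
SYSGEN> SHOW MSCP_LOAD<br />
SYSGEN> SHOW MSCP_SERVE_ALL<br />
SYSGEN> EXIT<br />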
C.3.5 Satellite Booting Troubleshooting<br />
To diagnose and correct problems for satellite booting, follow the steps outlined in<br />
this section.<br />
Step Action<br />
1 Perform the steps in Sections C.3.2, C.3.3, and C.3.4.<br />
2 For each satellite node, verify that the VOTES system parameter is set to 0.<br />
3 ‡On Alpha systems, verify the DECnet network database on the MOP servers by running<br />
the NCP utility and entering the following commands to display node characteristics. The<br />
following example displays information about an Alpha node named UTAH:<br />
$ MCR NCP<br />
NCP> SHOW NODE UTAH CHARACTERISTICS<br />
Node Volatile Characteristics as of 15-JAN-1994 10:28:09<br />
Remote node = 63.227 (UTAH)<br />
Hardware address = 08-00-2B-2C-CE-E3<br />
Load file = APB.EXE<br />
Load Assist Agent = SYS$SHARE:NISCS_LAA.EXE<br />
Load Assist Parameter = $69$DUA100:[SYS17.]<br />
The load file must be APB.EXE. In addition, when booting Alpha nodes, for each LAN<br />
adapter specified on the boot command line, the load assist parameter must point to the<br />
same system disk and root number.<br />
4 †On VAX systems, verify the DECnet network database on the MOP servers by running<br />
the NCP utility and entering the following commands to display node characteristics. The<br />
following example displays information about a VAX node named ARIEL:<br />
†VAX specific<br />
‡Alpha specific<br />
$ MCR NCP<br />
NCP> SHOW CHAR NODE ARIEL<br />
Node Volatile Characteristics as of 15-JAN-1994 13:15:28<br />
Remote node = 2.41 (ARIEL)<br />
Hardware address = 08-00-2B-03-27-95<br />
Tertiary loader = SYS$SYSTEM:TERTIARY_VMB.EXE<br />
Load Assist Agent = SYS$SHARE:NISCS_LAA.EXE<br />
Load Assist Parameter = DISK$VAXVMSRL5:<br />
Note that on VAX nodes, the tertiary loader is SYS$SYSTEM:TERTIARY_VMB.EXE.
5 On Alpha and VAX systems, verify the following information in the NCP display:<br />
Step Action<br />
A Verify the DECnet address for the node.<br />
B Verify the load assist agent is SYS$SHARE:NISCS_LAA.EXE.<br />
C Verify the load assist parameter points to the satellite system disk and correct root.<br />
D Verify that the hardware address matches the satellite’s Ethernet address. At the<br />
satellite’s console prompt, use the information shown in Table 8–3 to obtain the<br />
satellite’s current LAN hardware address.<br />
Compare the hardware address values displayed by NCP and at the satellite’s<br />
console. The values should be identical and should also match the value shown in<br />
the SYS$MANAGER:NETNODE_UPDATE.COM file. If the values do not match,<br />
you must make appropriate adjustments. For example, if you have recently replaced<br />
the satellite’s LAN adapter, you must execute CLUSTER_CONFIG.COM CHANGE<br />
function to update the network database and NETNODE_UPDATE.COM on the<br />
appropriate MOP server.<br />
6 Perform a conversational boot to determine more precisely why the satellite is having<br />
trouble booting. The conversational boot procedure displays messages that can help you<br />
solve network booting problems. The messages provide information about the state of the<br />
network and the communications process between the satellite and the system disk server.<br />
Reference: Section C.3.6 describes booting messages for Alpha systems.<br />
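As an illustration of step 5D, on many Alpha systems you can display the LAN adapters and their hardware addresses at the satellite’s console with the following command (the exact command and output format depend on the system model; see Table 8–3 for the model-specific commands):<br />
>>> SHOW DEVICE<br />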
C.3.6 Alpha Booting Messages (Alpha Only)<br />
On Alpha systems, the messages are displayed as shown in Table C–2.<br />
Table C–2 Alpha Booting Messages (Alpha Only)<br />
Message Comments<br />
%VMScluster-I-MOPSERVER, MOP server for downline load was node UTAH<br />
This message displays the name of the system providing the DECnet MOP<br />
downline load. This message acknowledges that control was properly<br />
transferred from the console performing the MOP load to the image that<br />
was loaded. If this message is not displayed, either the MOP load failed<br />
or the wrong file was MOP downline loaded.<br />
%VMScluster-I-BUSONLINE, LAN adapter is now running 08-00-2B-2C-CE-E3<br />
This message displays the LAN address of the Ethernet or FDDI adapter<br />
specified in the boot command. Multiple lines can be displayed if multiple<br />
LAN devices were specified in the boot command line. The booting<br />
satellite can now attempt to locate the system disk by sending a message<br />
to the cluster multicast address. If this message is not displayed, the<br />
LAN adapter is not initialized properly. Check the physical network<br />
connection. For FDDI, the adapter must be on the ring.<br />
%VMScluster-I-VOLUNTEER, System disk service volunteered by node EUROPA AA-00-04-00-4C-FD<br />
This message displays the name of a system claiming to serve the satellite<br />
system disk. This system has responded to the multicast message sent by<br />
the booting satellite to locate the servers of the system disk. If this<br />
message is not displayed, one or more of the following situations may be<br />
causing the problem:<br />
• The network path between the satellite and the boot server either is<br />
broken or is filtering the local area <strong>OpenVMS</strong> <strong>Cluster</strong> multicast messages.<br />
• The system disk is not being served.<br />
• The CLUSTER_AUTHORIZE.DAT file on the system disk does not match the<br />
other cluster members.<br />
%VMScluster-I-CREATECH, Creating channel to node EUROPA 08-00-2B-2C-CE-E2 08-00-2B-12-AE-A2<br />
This message displays the LAN address of the local LAN adapter (first<br />
address) and of the remote LAN adapter (second address) that form a<br />
communications path through the network. These adapters can be used<br />
to support a NISCA virtual circuit for booting. Multiple messages can<br />
be displayed if either multiple LAN adapters were specified on the boot<br />
command line or the system serving the system disk has multiple LAN<br />
adapters. If you do not see as many of these messages as you expect,<br />
there may be network problems related to the LAN adapters whose<br />
addresses are not displayed. Use the Local Area <strong>OpenVMS</strong> <strong>Cluster</strong><br />
Network Failure Analysis Program for better troubleshooting (see Section D.5).<br />
%VMScluster-I-OPENVC, Opening virtual circuit to node EUROPA<br />
This message displays the name of a system that has established an<br />
NISCA virtual circuit to be used for communications during the boot<br />
process. Booting uses this virtual circuit to connect to the remote MSCP<br />
server.<br />
%VMScluster-I-MSCPCONN, Connected to a MSCP server for the system disk, node EUROPA<br />
This message displays the name of a system that is actually serving the<br />
satellite system disk. If this message is not displayed, the system that<br />
claimed to serve the system disk could not serve the disk. Check the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
%VMScluster-W-SHUTDOWNCH, Shutting down channel to node EUROPA 08-00-2B-2C-CE-E3 08-00-2B-12-AE-A2<br />
This message displays the LAN address of the local LAN adapter (first<br />
address) and of the remote LAN adapter (second address) that have just<br />
lost communications. Depending on the type of failure, multiple messages<br />
may be displayed if either the booting system or the system serving the<br />
system disk has multiple LAN adapters.<br />
%VMScluster-W-CLOSEVC, Closing virtual circuit to node EUROPA<br />
This message indicates that NISCA communications have failed to the<br />
system whose name is displayed.<br />
C–12 <strong>Cluster</strong> Troubleshooting<br />
Table C–2 (Cont.) Alpha Booting Messages (Alpha Only)<br />
Message Comments<br />
%VMScluster-I-RETRY, Attempting to reconnect to a system disk server<br />
This message indicates that an attempt will be made to locate another<br />
system serving the system disk. The LAN adapters will be reinitialized<br />
and all communications will be restarted.<br />
%VMScluster-W-PROTOCOL_TIMEOUT, NISCA protocol timeout<br />
Either the booting node has lost connections to the remote system or the<br />
remote system is no longer responding to requests made by the booting<br />
system. In either case, the booting system has declared a failure and will<br />
reestablish communications to a boot server.<br />
C.4 Computer Fails to Join the <strong>Cluster</strong><br />
If a computer fails to join the cluster, follow the procedures in this section to<br />
determine the cause.<br />
C.4.1 Verifying <strong>OpenVMS</strong> <strong>Cluster</strong> Software Load<br />
To verify that <strong>OpenVMS</strong> <strong>Cluster</strong> software has been loaded, follow these<br />
instructions:<br />
Step Action<br />
1 Look for connection manager (%CNXMAN) messages like those shown in Section C.1.2.<br />
2 If no such messages are displayed, <strong>OpenVMS</strong> <strong>Cluster</strong> software probably was not loaded at<br />
boot time. Reboot the computer in conversational mode. At the SYSBOOT> prompt, set<br />
the VAXCLUSTER parameter to 2.<br />
3 For <strong>OpenVMS</strong> <strong>Cluster</strong> systems communicating over the LAN or mixed interconnects, set<br />
NISCS_LOAD_PEA0 to 1 and VAXCLUSTER to 2. These parameters should also be set in<br />
the computer’s MODPARAMS.DAT file. (For more information about booting a computer<br />
in conversational mode, consult your installation and operations guide.)<br />
4 For <strong>OpenVMS</strong> <strong>Cluster</strong> systems on the LAN, verify that the cluster security database file<br />
(SYS$COMMON:CLUSTER_AUTHORIZE.DAT) exists and that you have specified the<br />
correct group number for this cluster (see Section 10.9.1).<br />
C.4.2 Verifying Boot Disk and Root<br />
To verify that the computer has booted from the correct disk and system root,<br />
follow these instructions:<br />
Step Action<br />
1 If %CNXMAN messages are displayed, and if, after the conversational reboot, the<br />
computer still does not join the cluster, check the console output on all active computers<br />
and look for messages indicating that one or more computers found a remote computer<br />
that conflicted with a known or local computer. Such messages suggest that two computers<br />
have booted from the same system root.<br />
2 Review the boot command files for all CI computers and ensure that all are booting from<br />
the correct disks and from unique system roots.<br />
3 If you find it necessary to modify the computer’s bootstrap command procedure (console<br />
media), you may be able to do so on another processor that is already running in the<br />
cluster.<br />
Replace the running processor’s console media with the media to be modified, and use the<br />
Exchange utility and a text editor to make the required changes. Consult the appropriate<br />
processor-specific installation and operations guide for information about examining and<br />
editing boot command files.<br />
C.4.3 Verifying SCSNODE and SCSSYSTEMID Parameters<br />
To be eligible to join a cluster, a computer must have unique SCSNODE and<br />
SCSSYSTEMID parameter values.<br />
Step Action<br />
1 Check that the current values do not duplicate any values set for existing <strong>OpenVMS</strong> <strong>Cluster</strong><br />
computers. To check values, you can perform a conversational bootstrap operation.<br />
2 If the values of SCSNODE or SCSSYSTEMID are not unique, do either of the following:<br />
• Alter both values.<br />
• Reboot all other computers.<br />
Note: To modify values, you can perform a conversational bootstrap operation. However, for<br />
reliable future bootstrap operations, specify appropriate values for these parameters in the<br />
computer’s MODPARAMS.DAT file.<br />
WHEN you change... THEN...<br />
• The SCSNODE parameter: Change the DECnet node name too, because both names<br />
must be the same.<br />
• Either the SCSNODE parameter or the SCSSYSTEMID parameter on a node that was<br />
previously an <strong>OpenVMS</strong> <strong>Cluster</strong> member: Change the DECnet node number, too,<br />
because both numbers must be the same. Reboot the entire cluster.<br />
C.4.4 Verifying <strong>Cluster</strong> Security Information<br />
To verify the cluster group code and password, follow these instructions:<br />
Step Action<br />
1 Verify that the database file SYS$COMMON:CLUSTER_AUTHORIZE.DAT exists.<br />
2 For clusters with multiple system disks, ensure that the correct (same) group number and<br />
password were specified for each.<br />
Reference: See Section 10.9 to view the group number and to reset the password in the<br />
CLUSTER_AUTHORIZE.DAT file using the SYSMAN utility.<br />
C.5 Startup Procedures Fail to Complete<br />
If a computer boots and joins the cluster but appears to hang before startup<br />
procedures complete—that is, before you are able to log in to the system—be sure<br />
that you have allowed sufficient time for the startup procedures to execute.
IF... THEN...<br />
• The startup procedures fail to complete after a period that is normal for your<br />
site: Try to access the procedures from another <strong>OpenVMS</strong> <strong>Cluster</strong> computer and<br />
make appropriate adjustments. For example, verify that all required devices are<br />
configured and available. One cause of such a failure could be the lack of some<br />
system resource, such as NPAGEDYN or page file space.<br />
• You suspect that the value for the NPAGEDYN parameter is set too low: Perform<br />
a conversational bootstrap operation to increase it. Use SYSBOOT to check the<br />
current value, and then double the value.<br />
• You suspect a shortage of page file space, and another <strong>OpenVMS</strong> <strong>Cluster</strong><br />
computer is available: Log in on that computer and use the System Generation<br />
utility (SYSGEN) to provide adequate page file space for the problem computer.<br />
Note: Insufficient page-file space on the booting computer might cause other<br />
computers to hang.<br />
• The computer still cannot complete the startup procedures: Contact your Compaq<br />
support representative.<br />
C.6 Diagnosing LAN Component Failures<br />
Section D.5 provides troubleshooting techniques for LAN component failures (for<br />
example, broken LAN bridges). That appendix also describes techniques for using<br />
the Local Area <strong>OpenVMS</strong> <strong>Cluster</strong> Network Failure Analysis Program.<br />
Intermittent LAN component failures (for example, packet loss) can cause<br />
problems in the NISCA transport protocol that delivers System Communications<br />
Services (SCS) messages to other nodes in the <strong>OpenVMS</strong> <strong>Cluster</strong>. Appendix F<br />
describes troubleshooting techniques and requirements for LAN analyzer tools.<br />
C.7 Diagnosing <strong>Cluster</strong> Hangs<br />
Conditions like the following can cause an <strong>OpenVMS</strong> <strong>Cluster</strong> computer to suspend<br />
process or system activity (that is, to hang):<br />
Condition Reference<br />
<strong>Cluster</strong> quorum is lost. Section C.7.1<br />
A shared cluster resource is inaccessible. Section C.7.2<br />
C.7.1 <strong>Cluster</strong> Quorum is Lost<br />
The <strong>OpenVMS</strong> <strong>Cluster</strong> quorum algorithm coordinates activity among <strong>OpenVMS</strong><br />
<strong>Cluster</strong> computers and ensures the integrity of shared cluster resources. (The<br />
quorum algorithm is described fully in Chapter 2.) Quorum is checked after any<br />
change to the cluster configuration—for example, when a voting computer leaves<br />
or joins the cluster. If quorum is lost, process and I/O activity on all computers in<br />
the cluster is blocked.<br />
Information about the loss of quorum and about clusterwide events that cause<br />
loss of quorum is sent to the OPCOM process, which broadcasts messages<br />
to designated operator terminals. The information is also broadcast to each<br />
computer’s operator console (OPA0), unless broadcast activity is explicitly<br />
disabled on that terminal. However, because quorum may be lost before OPCOM<br />
has been able to inform the operator terminals, the messages sent to OPA0 are<br />
the most reliable source of information about events that cause loss of quorum.<br />
If quorum is lost, you can restore it by adding or rebooting a node with additional votes.<br />
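The quorum arithmetic behind this behavior can be sketched briefly. The formula below, quorum = (EXPECTED_VOTES + 2) // 2 using integer division, is the one the connection manager uses (see Chapter 2); the node names and vote counts are hypothetical:

```python
# Minimal sketch of the OpenVMS Cluster quorum check: quorum is derived
# from EXPECTED_VOTES as (EXPECTED_VOTES + 2) // 2 (integer division),
# and process and I/O activity block when surviving votes fall below it.
# Node names and vote values here are hypothetical.

def quorum(expected_votes):
    """Quorum as computed by the connection manager."""
    return (expected_votes + 2) // 2

def cluster_has_quorum(votes_present, expected_votes):
    return sum(votes_present.values()) >= quorum(expected_votes)

members = {"ALPHA1": 1, "ALPHA2": 1, "VAX1": 1}   # one vote each
expected = 3                                       # EXPECTED_VOTES

print(quorum(expected))                            # 2
print(cluster_has_quorum(members, expected))       # True: 3 votes >= 2

del members["ALPHA2"], members["VAX1"]             # two voters leave
print(cluster_has_quorum(members, expected))       # False: 1 vote < 2, cluster hangs
```

This is why adding or rebooting a node with additional votes can unblock a hung cluster: it raises the vote total back to the quorum threshold.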
Reference: See also the information about cluster quorum in Section 10.12.<br />
C.7.2 Inaccessible <strong>Cluster</strong> Resource<br />
Access to shared cluster resources is coordinated by the distributed lock manager.<br />
If a particular process is granted a lock on a resource (for example, a shared<br />
data file), other processes in the cluster that request incompatible locks on that<br />
resource must wait until the original lock is released. If the original process<br />
retains its lock for an extended period, other processes waiting for the lock to be<br />
released may appear to hang.<br />
Occasionally, a system activity must acquire a restrictive lock on a resource for<br />
an extended period. For example, to perform a volume rebuild, system software<br />
takes out an exclusive lock on the volume being rebuilt. While this lock is held,<br />
no processes can allocate space on the disk volume. If they attempt to do so, they<br />
may appear to hang.<br />
Access to files that contain data necessary for the operation of the system itself<br />
is coordinated by the distributed lock manager. For this reason, a process that<br />
acquires a lock on one of these resources and is then unable to proceed may cause<br />
the cluster to appear to hang.<br />
For example, this condition may occur if a process locks a portion of the system<br />
authorization file (SYS$SYSTEM:SYSUAF.DAT) for write access. Any activity<br />
that requires access to that portion of the file, such as logging in to an account<br />
with the same or similar user name or sending mail to that user name, is blocked<br />
until the original lock is released. Normally, this lock is released quickly, and<br />
users do not notice the locking operation.<br />
However, if the process holding the lock is unable to proceed, other processes<br />
could enter a wait state. Because the authorization file is used during login<br />
and for most process creation operations (for example, batch and network<br />
jobs), blocked processes could rapidly accumulate in the cluster. Because the<br />
distributed lock manager is functioning normally under these conditions, users<br />
are not notified by broadcast messages or other means that a problem has<br />
occurred.<br />
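The way blocked processes accumulate behind one stalled lock holder can be modeled with a short sketch. This is an illustrative queue model only, not the OpenVMS distributed lock manager API; the process names are hypothetical:

```python
from collections import deque

# Illustrative model of waiters accumulating behind one stalled exclusive
# lock on a shared resource, such as a record in SYS$SYSTEM:SYSUAF.DAT.
# This is not the distributed lock manager interface.

class ExclusiveLock:
    def __init__(self):
        self.holder = None
        self.waiters = deque()

    def request(self, process):
        if self.holder is None:
            self.holder = process
            return "granted"
        self.waiters.append(process)      # incompatible request: must wait
        return "waiting"

    def release(self):
        # hand the lock to the oldest waiter, if any
        self.holder = self.waiters.popleft() if self.waiters else None

uaf_record = ExclusiveLock()
uaf_record.request("MAIL_DELIVERY")       # granted, then the holder stalls
for job in ("LOGIN_1", "BATCH_7", "NET_4"):
    uaf_record.request(job)               # each blocks behind the stalled holder
print(len(uaf_record.waiters))            # 3 -- blocked processes accumulating
```

Because logins and process creation all funnel through the same file, one stalled holder is enough to make the whole cluster appear hung, exactly as the text describes.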
C.8 Diagnosing CLUEXIT Bugchecks<br />
The operating system performs bugcheck operations only when it detects<br />
conditions that could compromise normal system activity or endanger data<br />
integrity. A CLUEXIT bugcheck is a type of bugcheck initiated by the<br />
connection manager, the <strong>OpenVMS</strong> <strong>Cluster</strong> software component that manages the<br />
interaction of cooperating <strong>OpenVMS</strong> <strong>Cluster</strong> computers. Most such bugchecks are<br />
triggered by conditions resulting from hardware failures (particularly failures in<br />
communications paths), configuration errors, or system management errors.<br />
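The most common such failure, described in the next section, involves the RECNXINTERVAL system parameter: an outage that outlasts it means the connection is declared irrevocably broken, and a later reconnection forces a CLUEXIT. A minimal sketch of that timeout decision (the parameter value and outage times here are hypothetical):

```python
RECNXINTERVAL = 20          # hypothetical setting, in seconds

def connection_state(outage_seconds):
    # The connection manager tolerates an outage of up to RECNXINTERVAL
    # seconds. Beyond that, the connection is declared irrevocably broken,
    # and a later reconnection shuts one node down with a CLUEXIT bugcheck.
    if outage_seconds <= RECNXINTERVAL:
        return "recovered"
    return "irrevocably broken -> CLUEXIT on reconnection"

print(connection_state(5))    # recovered
print(connection_state(45))   # irrevocably broken -> CLUEXIT on reconnection
```

This is why the recommended fix for slow power-failure recovery is simply to raise RECNXINTERVAL on all computers.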
C.8.1 Conditions Causing Bugchecks<br />
The most common conditions that result in CLUEXIT bugchecks are as follows:<br />
Possible Bugcheck Causes and Recommendations<br />
The cluster connection between two computers is broken for longer than<br />
RECNXINTERVAL seconds. Thereafter, the connection is declared irrevocably<br />
broken. If the connection is later reestablished, one of the computers shuts down<br />
with a CLUEXIT bugcheck. This condition can occur:<br />
• Upon recovery with battery backup after a power failure<br />
• After the repair of an SCS communication link<br />
• After the computer was halted for a period longer than the number of seconds<br />
specified for the RECNXINTERVAL parameter and was restarted with a<br />
CONTINUE command entered at the operator console<br />
Recommendation: Determine the cause of the interrupted connection and correct<br />
the problem. For example, if recovery from a power failure takes longer than<br />
RECNXINTERVAL seconds, you may want to increase the value of the<br />
RECNXINTERVAL parameter on all computers.<br />
<strong>Cluster</strong> partitioning occurs. A member of a cluster discovers or establishes<br />
connection to a member of another cluster, or a foreign cluster is detected in the<br />
quorum file.<br />
Recommendation: Review the setting of EXPECTED_VOTES on all computers.<br />
The value specified for the SCSMAXMSG system parameter on a computer is too<br />
small.<br />
Recommendation: Verify that the value of SCSMAXMSG on all <strong>OpenVMS</strong><br />
<strong>Cluster</strong> computers is set to at least the default value.<br />
C.9 Port Communications<br />
These sections provide detailed information about port communications to assist<br />
in diagnosing port communication problems.<br />
C.9.1 Port Polling<br />
Shortly after a CI computer boots, the CI port driver (PADRIVER) begins<br />
configuration polling to discover other active ports on the CI. Normally, the<br />
poller runs every 5 seconds (the default value of the PAPOLLINTERVAL system<br />
parameter). In the first polling pass, all addresses are probed over cable path A;<br />
on the second pass, all addresses are probed over path B; on the third pass, path<br />
A is probed again; and so on.<br />
The poller probes by sending Request ID (REQID) packets to all possible port<br />
numbers, including itself. Active ports receiving the REQIDs return an ID Received<br />
packet (IDREC) to the port issuing the REQID. A port might respond to a REQID<br />
even if the computer attached to the port is not running.<br />
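The alternating-path polling schedule can be sketched as follows. This is an illustrative model of the behavior just described, not PADRIVER itself; the port numbers and active set are hypothetical, and REQID/IDREC handling is reduced to a membership test:

```python
# Sketch of CI configuration polling: every PAPOLLINTERVAL seconds a pass
# probes all possible port numbers, alternating between cable paths A and
# B on successive passes. Active ports answer a REQID with an IDREC.

PAPOLLINTERVAL = 5            # default polling interval, in seconds
MAX_PORT = 15                 # hypothetical highest CI port number

def poll_pass(pass_number, active_ports):
    path = "A" if pass_number % 2 == 1 else "B"   # pass 1: A, pass 2: B, ...
    responders = []
    for port in range(MAX_PORT + 1):              # including the poller itself
        # send REQID on `path`; active ports answer with an IDREC
        if port in active_ports:
            responders.append((port, path))
    return responders

print(poll_pass(1, {0, 3, 7}))   # [(0, 'A'), (3, 'A'), (7, 'A')]
print(poll_pass(2, {0, 3, 7}))   # [(0, 'B'), (3, 'B'), (7, 'B')]
```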
For <strong>OpenVMS</strong> <strong>Cluster</strong> systems communicating over the CI, DSSI, or a<br />
combination of these interconnects, the port drivers perform a start handshake<br />
when a pair of ports and port drivers has successfully exchanged ID packets. The<br />
port drivers exchange datagrams containing information about the computers,<br />
such as the type of computer and the operating system version. If this exchange<br />
is successful, each computer declares a virtual circuit open. An open virtual<br />
circuit is prerequisite to all other activity.<br />
C.9.2 LAN Communications<br />
For clusters that include Ethernet or FDDI interconnects, a multicast scheme<br />
is used to locate computers on the LAN. Approximately every 3 seconds, the<br />
port emulator driver (PEDRIVER) sends a HELLO datagram message through<br />
each LAN adapter to a cluster-specific multicast address that is derived from the<br />
cluster group number. The driver also enables the reception of these messages<br />
from other computers. When the driver receives a HELLO datagram message<br />
from a computer with which it does not currently share an open virtual circuit,<br />
it attempts to create a circuit. HELLO datagram messages received from a<br />
computer with a currently open virtual circuit indicate that the remote computer<br />
is operational.<br />
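The HELLO-driven discovery logic above can be sketched in a few lines. This is an illustrative model, not PEDRIVER; the node name and timestamps are hypothetical:

```python
# Sketch of HELLO datagram handling: a HELLO from a node with no open
# virtual circuit triggers circuit creation; a HELLO from a known node
# simply confirms that the remote computer is still operational.

open_circuits = {}            # node name -> time of last HELLO received

def on_hello(node, now):
    if node not in open_circuits:
        open_circuits[node] = now
        return "create virtual circuit"
    open_circuits[node] = now          # remote computer is operational
    return "refresh"

print(on_hello("EUROPA", 0.0))   # create virtual circuit
print(on_hello("EUROPA", 3.0))   # refresh (next ~3-second HELLO)
```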
A standard, three-message exchange handshake is used to create a virtual circuit.<br />
The handshake messages contain information about the transmitting computer<br />
and its record of the cluster password. These parameters are verified at the<br />
receiving computer, which continues the handshake only if its verification is<br />
successful. Thus, each computer authenticates the other. After the final message,<br />
the virtual circuit is opened for use by both computers.<br />
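The mutual check at the heart of the handshake can be sketched as follows. This is a hedged illustration only: the real NISCA handshake carries more state, and the HMAC digest used here is a stand-in, not the actual NISCA verification scheme; the password and nonce values are hypothetical:

```python
import hmac, hashlib

# Sketch of mutual verification: each side continues the three-message
# handshake only if the peer's record of the cluster password verifies
# against its own. The digest construction here is illustrative, not the
# actual NISCA wire format.

LOCAL_PASSWORD = b"cluster-secret"        # hypothetical group password

def digest(password, nonce):
    return hmac.new(password, nonce, hashlib.sha256).digest()

def verify_peer(peer_digest, nonce):
    # continue the handshake only if verification succeeds
    return hmac.compare_digest(peer_digest, digest(LOCAL_PASSWORD, nonce))

nonce = b"handshake-1"
print(verify_peer(digest(b"cluster-secret", nonce), nonce))  # True: open circuit
print(verify_peer(digest(b"wrong-password", nonce), nonce))  # False: reject peer
```

Because each side performs this check on the other's message, each computer authenticates the other before the circuit opens, as the text describes.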
C.9.3 System Communications Services (SCS) Connections<br />
System services such as the disk class driver, connection manager, and the<br />
MSCP and TMSCP servers communicate between computers with a protocol<br />
called System Communications Services (SCS). SCS is responsible primarily for<br />
forming and breaking intersystem process connections and for controlling flow of<br />
message traffic over those connections. SCS is implemented in the port driver (for<br />
example, PADRIVER, PBDRIVER, PEDRIVER, PIDRIVER), and in a loadable<br />
piece of the operating system called SCSLOA.EXE (loaded automatically during<br />
system initialization).<br />
When a virtual circuit has been opened, a computer periodically probes a remote<br />
computer for system services that the remote computer may be offering. The<br />
SCS directory service, which makes known services that a computer is offering,<br />
is always present on both computers and HSC subsystems. As system services<br />
discover their counterparts on other computers and HSC subsystems, they<br />
establish SCS connections to each other. These connections are full duplex and<br />
are associated with a particular virtual circuit. Multiple connections are typically<br />
associated with a virtual circuit.<br />
C.10 Diagnosing Port Failures<br />
This section describes the hierarchy of communication paths and describes where<br />
failures can occur.<br />
C.10.1 Hierarchy of Communication Paths<br />
Taken together, SCS, the port drivers, and the port itself support a hierarchy of<br />
communication paths. Starting with the most fundamental level, these are as<br />
follows:<br />
• The physical wires. The Ethernet is a single coaxial cable. FDDI typically<br />
has a pair of fiber-optic cables for redundancy. The CI has two pairs of<br />
transmitting and receiving cables (path A transmit and receive and path B<br />
transmit and receive). For the CI, the operating system software normally<br />
sends traffic in automatic path-select mode. The port chooses the free path or,<br />
if both are free, an arbitrary path (implemented in the cables and star coupler<br />
and managed by the port).<br />
• The virtual circuit (implemented partly in the CI port or LAN port emulator<br />
driver (PEDRIVER) and partly in SCS software).<br />
• The SCS connections (implemented in system software).<br />
C.10.2 Where Failures Occur<br />
Failures can occur at each communication level and in each component. Failures<br />
at one level translate into failures elsewhere, as described in Table C–3.<br />
Table C–3 Port Failures<br />
Communication<br />
Level Failures<br />
Wires If the LAN fails or is disconnected, LAN traffic stops or is interrupted,<br />
depending on the nature of the failure. For the CI, either path A or B<br />
can fail while the virtual circuit remains intact. All traffic is directed over<br />
the remaining good path. When the wire is repaired, the repair is detected<br />
automatically by port polling, and normal operations resume on all ports.<br />
Virtual circuit If no path works between a pair of ports, the virtual circuit fails and is closed.<br />
A path failure is discovered as follows:<br />
For the CI, when polling fails or when attempts are made to send normal<br />
traffic, and the port reports that neither path yielded transmit success.<br />
For the LAN, when no multicast HELLO datagram message or incoming<br />
traffic is received from another computer.<br />
When a virtual circuit fails, every SCS connection on it is closed. The<br />
software automatically reestablishes connections when the virtual circuit is<br />
reestablished. Normally, reestablishing a virtual circuit takes several seconds<br />
after the problem is corrected.<br />
CI port If a port fails, all virtual circuits to that port fail, and all SCS connections<br />
on those virtual circuits are closed. If the port is successfully reinitialized,<br />
virtual circuits and connections are reestablished automatically. Normally, port<br />
reinitialization and reestablishment of connections take several seconds.<br />
LAN adapter If a LAN adapter device fails, attempts are made to restart it. If repeated<br />
attempts fail, all channels using that adapter are broken. A channel is a pair<br />
of LAN addresses, one local and one remote. If the last open channel for a<br />
virtual circuit fails, the virtual circuit is closed and the connections are broken.<br />
SCS connection When the software protocols fail or, in some instances, when the software<br />
detects a hardware malfunction, a connection is terminated. Other connections<br />
are usually unaffected, as is the virtual circuit. Breaking of connections is<br />
also used under certain conditions as an error recovery mechanism—most<br />
commonly when there is insufficient nonpaged pool available on the computer.<br />
Computer If a computer fails because of operator shutdown, bugcheck, or halt, all other<br />
computers in the cluster record the shutdown as failures of their virtual<br />
circuits to the port on the shut down computer.<br />
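The dependency chain in Table C–3 (channels support a virtual circuit, which in turn carries SCS connections) can be modeled with a short sketch. This is illustrative only; the LAN addresses are taken from the earlier message examples, and the connection names are hypothetical:

```python
# Illustrative model of the failure cascade in Table C-3: a channel is a
# pair of LAN addresses (one local, one remote); when the last open
# channel of a virtual circuit fails, the circuit closes and every SCS
# connection on it is broken.

class VirtualCircuit:
    def __init__(self, channels, connections):
        self.channels = set(channels)          # (local addr, remote addr) pairs
        self.connections = set(connections)    # SCS connections on this circuit
        self.open = True

    def channel_failed(self, channel):
        self.channels.discard(channel)
        if not self.channels:                  # last open channel lost
            self.open = False
            self.connections.clear()           # all SCS connections closed

vc = VirtualCircuit(
    channels=[("08-00-2B-2C-CE-E2", "08-00-2B-12-AE-A2"),
              ("08-00-2B-2C-CE-E3", "08-00-2B-12-AE-A2")],
    connections=["disk class driver", "connection manager"])

vc.channel_failed(("08-00-2B-2C-CE-E2", "08-00-2B-12-AE-A2"))
print(vc.open)                 # True: one channel remains, circuit survives
vc.channel_failed(("08-00-2B-2C-CE-E3", "08-00-2B-12-AE-A2"))
print(vc.open)                 # False: circuit closed, connections broken
```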
C.10.3 Verifying CI Port Functions<br />
Before you boot into a cluster a CI-connected computer that is new, just repaired,<br />
or suspected of having a problem, you should have Compaq services verify that the<br />
computer runs correctly on its own.<br />
C.10.4 Verifying Virtual Circuits<br />
To diagnose communication problems, you can invoke the Show <strong>Cluster</strong> utility<br />
using the instructions in Table C–4.<br />
Table C–4 How to Verify Virtual Circuit States<br />
Step 1: Tailor the SHOW CLUSTER report by entering the SHOW CLUSTER<br />
command ADD CIRCUIT,CABLE_STATUS. This command adds a class of<br />
information about all the virtual circuits as seen from the computer on which<br />
you are running SHOW CLUSTER. CABLE_STATUS indicates the status of the<br />
path for the circuit from the CI interface on the local system to the CI interface<br />
on the remote system.<br />
What to look for: Primarily, you are checking whether there is a virtual circuit<br />
in the OPEN state to the failing computer. Common causes of failure to open a<br />
virtual circuit and keep it open are the following:<br />
• Port errors on one side or the other<br />
• Cabling errors<br />
• A port set off line because of software problems<br />
• Insufficient nonpaged pool on both sides<br />
• Failure to set correct values for the SCSNODE, SCSSYSTEMID,<br />
PAMAXPORT, PANOPOLL, PASTIMOUT, and PAPOLLINTERVAL<br />
system parameters<br />
If no virtual circuit is open to the failing computer, check the bottom of the<br />
SHOW CLUSTER display:<br />
• For information about circuits to the port of the failing computer. Virtual<br />
circuits in partially open states are shown at the bottom of the display. If<br />
the circuit is shown in a state other than OPEN, communications between<br />
the local and remote ports are taking place, and the failure is probably at<br />
a higher level than in port or cable hardware.<br />
• To see whether both path A and path B to the failing port are good. The loss<br />
of one path should not prevent a computer from participating in a cluster.<br />
Step 2: Run SHOW CLUSTER from each active computer in the cluster to verify<br />
whether each computer’s view of the failing computer is consistent with every<br />
other computer’s view.<br />
• WHEN all the active computers have a consistent view of the failing<br />
computer, THEN the problem may be in the failing computer.<br />
• WHEN only one of several active computers detects that the newcomer is<br />
failing, THEN that particular computer may have a problem.<br />
C.10.5 Verifying CI Cable Connections<br />
Whenever the configuration poller finds that no virtual circuits are open and<br />
that no handshake procedures are currently opening virtual circuits, the poller<br />
analyzes its environment. It does so by using the send-loopback-datagram facility<br />
of the CI port in the following fashion:<br />
1. The send-loopback-datagram facility tests the connections between the CI<br />
port and the star coupler by routing messages across them. The messages are<br />
called loopback datagrams. (The port processes other self-directed messages<br />
without using the star coupler or external cables.)<br />
2. The configuration poller makes entries in the error log whenever it detects<br />
a change in the state of a circuit. Note, however, that it is possible that two<br />
changed-to-failed-state messages can be entered in the log without an<br />
intervening changed-to-succeeded-state message. Such a series of entries<br />
means that the circuit state continues to be faulty.<br />
C.10.6 Diagnosing CI Cabling Problems<br />
The following paragraphs discuss various incorrect CI cabling configurations and<br />
the entries made in the error log when these configurations exist. Figure C–1<br />
shows a two-computer configuration with all cables correctly connected.<br />
Figure C–2 shows a CI cluster with a pair of crossed cables.<br />
Figure C–1 Correctly Connected Two-Computer CI <strong>Cluster</strong><br />
[Figure: the local CI port’s transmit and receive cable pairs (TA, TB, RA, RB)<br />
connect through the star coupler to the corresponding receive and transmit pairs<br />
on the remote CI port.]<br />
Figure C–2 Crossed CI Cable Pair<br />
[Figure: the same two-computer configuration, but with one cable pair crossed<br />
between the local CI port and the star coupler.]<br />
If a pair of transmitting cables or a pair of receiving cables is crossed, a message<br />
sent on TA is received on RB, and a message sent on TB is received on RA. This<br />
is a hardware error condition from which the port cannot recover. An entry is<br />
made in the error log indicating that a single pair of crossed cables exists. The<br />
entry contains the following lines:<br />
DATA CABLE(S) CHANGE OF STATE<br />
PATH 1. LOOPBACK HAS GONE FROM GOOD TO BAD<br />
If this situation exists, you can correct it by reconnecting the cables properly. The<br />
cables could be misconnected in several places. The coaxial cables that connect<br />
the port boards to the bulkhead cable connectors can be crossed, or the cables can<br />
be misconnected to the bulkhead or the star coupler.<br />
Configuration 1: The information illustrated in Figure C–2 is represented more<br />
simply in Example C–1. It shows the cables positioned as in Figure C–2, but it<br />
does not show the star coupler or the computers. The labels LOC (local) and REM<br />
(remote) indicate the pairs of transmitting ( T ) and receiving ( R ) cables on the<br />
local and remote computers, respectively.<br />
Example C–1 Crossed Cables: Configuration 1<br />
T x = R<br />
R = = T<br />
LOC REM<br />
The pair of crossed cables causes loopback datagrams to fail on the local computer<br />
but to succeed on the remote computer. Crossed pairs of transmitting cables and<br />
crossed pairs of receiving cables cause the same behavior.<br />
Note that only an odd number of crossed cable pairs causes these problems. If<br />
an even number of cable pairs is crossed, communications succeed. An error log<br />
entry is made in some cases, however, and the contents of the entry depend on<br />
which pairs of cables are crossed.<br />
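The odd/even rule can be checked with a small parity model: loopback on a computer succeeds when its own transmit and receive pairs have the same crossed or straight state, and end-to-end communications survive when the total number of crossed pairs around the loop is even. This model is inferred from Configurations 1 through 5 below, not taken from the CI hardware specification:

```python
# Parity model of crossed CI cable pairs, inferred from the configurations
# discussed in this section (not from the CI hardware specification).
# Each of the four cable pairs is either straight (False) or crossed (True).

def loopback_ok(t_crossed, r_crossed):
    # A loopback datagram leaves on this computer's transmit pair and
    # returns on its receive pair; it succeeds when the crossings cancel.
    return t_crossed == r_crossed

def communications_ok(loc_t, rem_r, rem_t, loc_r):
    # End-to-end traffic (with its same-path acknowledgment) survives when
    # the total number of crossed pairs around the loop is even.
    return (loc_t + rem_r + rem_t + loc_r) % 2 == 0

# Configuration 1: only the local transmit pair is crossed.
print(loopback_ok(True, False))                      # False: local loopback fails
print(loopback_ok(False, False))                     # True: remote loopback succeeds
print(communications_ok(True, False, False, False))  # False: odd number crossed
```

The model reproduces the later configurations too: two crossed pairs (even) leave communications possible, and four crossed pairs succeed with no error-log evidence, exactly as described below.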
Configuration 2: Example C–2 shows two-computer clusters with the<br />
combinations of two crossed cable pairs. These crossed pairs cause the following<br />
entry to be made in the error log of the computer that has the cables crossed:<br />
DATA CABLE(S) CHANGE OF STATE<br />
CABLES HAVE GONE FROM UNCROSSED TO CROSSED<br />
Loopback datagrams succeed on both computers, and communications are<br />
possible.<br />
Example C–2 Crossed Cables: Configuration 2<br />
T x = R T = x R<br />
R x = T R = x T<br />
LOC REM LOC REM<br />
Configuration 3: Example C–3 shows the possible combinations of two pairs of<br />
crossed cables that cause loopback datagrams to fail on both computers in the<br />
cluster. Communications can still take place between the computers. An entry<br />
stating that cables are crossed is made in the error log of each computer.<br />
Example C–3 Crossed Cables: Configuration 3<br />
T x = R T = x R<br />
R = x T R x = T<br />
LOC REM LOC REM
Configuration 4: Example C–4 shows the possible combinations of two pairs of<br />
crossed cables that cause loopback datagrams to fail on both computers in the<br />
cluster but that allow communications. No entry stating that cables are crossed<br />
is made in the error log of either computer.<br />
Example C–4 Crossed Cables: Configuration 4<br />
T x x R T = = R<br />
R = = T R x x T<br />
LOC REM LOC REM<br />
Configuration 5: Example C–5 shows the possible combinations of four pairs of<br />
crossed cables. In each case, loopback datagrams fail on the computer that has<br />
only one crossed pair of cables. Loopback datagrams succeed on the computer<br />
with both pairs crossed. No communications are possible.<br />
Example C–5 Crossed Cables: Configuration 5<br />
T x x R T x = R T = x R T x x R<br />
R x = T R x x T R x x T R = x T<br />
LOC REM LOC REM LOC REM LOC REM<br />
If all four cable pairs between two computers are crossed, communications<br />
succeed, loopback datagrams succeed, and no crossed-cable message entries are<br />
made in the error log. You might detect such a condition by noting error log<br />
entries made by a third computer in the cluster, but this occurs only if the third<br />
computer has one of the crossed-cable cases described.<br />
C.10.7 Repairing CI Cables<br />
This section describes some ways in which Compaq support representatives can<br />
make repairs on a running computer. This information is provided to aid system<br />
managers in scheduling repairs.<br />
For cluster software to survive cable-checking or cable-replacement activities,<br />
you must be sure that either path A or path B is intact at all times between<br />
each port and every other port in the cluster.<br />
For example, you can remove path A and path B in turn from a particular port to<br />
the star coupler. To make sure that the configuration poller finds a path that was<br />
previously faulty but is now operational, follow these steps:<br />
Step Action<br />
1 Remove path B.<br />
2 After the poller has discovered that path B is faulty, reconnect path B.<br />
3 Wait two poller intervals,1 and then take either of the following actions:<br />
• Enter the DCL command SHOW CLUSTER to make sure that the poller has<br />
reestablished path B.<br />
• Enter the DCL command SHOW CLUSTER/CONTINUOUS followed by the SHOW<br />
CLUSTER command ADD CIRCUITS, CABLE_ST.<br />
4 Wait for SHOW CLUSTER to tell you that path B has been reestablished.<br />
5 Remove path A.<br />
6 After the poller has discovered that path A is faulty, reconnect path A.<br />
7 Wait two poller intervals1 to make sure that the poller has reestablished path A.<br />
1 Approximately 10 seconds at the default system parameter settings<br />
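The waiting pattern in steps 3 and 7 amounts to a simple polling loop. The sketch below is an illustrative Python model, not OpenVMS code; the `path_is_good` callback and the 5-second default interval are assumptions standing in for the configuration poller's view of the path.<br />

```python
import time

def wait_for_path(path_is_good, poller_interval_s=5.0, intervals=2):
    """Wait the recommended two poller intervals (about 10 seconds at the
    default system parameter settings), checking whether the previously
    faulty path has been reestablished."""
    for _ in range(intervals):
        time.sleep(poller_interval_s)
        if path_is_good():
            return True
    return path_is_good()
```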
If both paths are lost at the same time, the virtual circuits are lost between the<br />
port with the broken cables and all other ports in the cluster. This condition<br />
will in turn result in loss of SCS connections over the broken virtual circuits.<br />
However, recovery from this situation is automatic after an interruption in<br />
service on the affected computer. The length of the interruption varies, but it is<br />
approximately two poller intervals at the default system parameter settings.<br />
C.10.8 Verifying LAN Connections<br />
The Local Area <strong>OpenVMS</strong> <strong>Cluster</strong> Network Failure Analysis Program described<br />
in Section D.4 uses HELLO datagram messages to continuously verify the<br />
network paths (channels) used by PEDRIVER. This verification process, combined<br />
with physical description of the network, can:<br />
• Isolate failing network components<br />
• Group failing channels together and map them onto the physical network<br />
description<br />
• Call out the common components related to the channel failures<br />
C.11 Analyzing Error-Log Entries for Port Devices<br />
Monitoring events recorded in the error log can help you anticipate and avoid<br />
potential problems. From the total error count (displayed by the DCL command<br />
SHOW DEVICES device-name), you can determine whether errors are increasing.<br />
If so, you should examine the error log.<br />
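The check described here, comparing successive error totals reported by SHOW DEVICES, amounts to watching a counter's growth. A minimal sketch (the function name and list-of-samples form are illustrative, not a VMS interface):<br />

```python
def errors_increasing(samples):
    """Given successive total error counts for a device (oldest first),
    return True if any reading grew, i.e. new errors were logged."""
    return any(later > earlier
               for earlier, later in zip(samples, samples[1:]))
```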
C.11.1 Examine the Error Log<br />
The DCL command ANALYZE/ERROR_LOG invokes the Error Log utility to<br />
report the contents of an error-log file.<br />
Reference: For more information about the Error Log utility, see the <strong>OpenVMS</strong><br />
System Management Utilities Reference Manual.<br />
Some error-log entries are informational only, while others require action.<br />
Table C–5 Informational and Other Error-Log Entries<br />
Error Type: Informational error-log entries require no action. For example, if<br />
you shut down a computer in the cluster, all other active computers that have<br />
open virtual circuits between themselves and the computer that has been shut<br />
down make entries in their error logs. Such computers record up to three<br />
errors for the event:<br />
• Path A received no response.<br />
• Path B received no response.<br />
• The virtual circuit is being closed.<br />
Action Required? No.<br />
Purpose: These messages are normal and reflect the change of state in the<br />
circuits to the computer that has been shut down.<br />
Error Type: Other error-log entries indicate problems that degrade operation<br />
or nonfatal hardware problems. The operating system might continue to run<br />
satisfactorily under these conditions.<br />
Action Required? Yes.<br />
Purpose: Detecting these problems early is important to preventing nonfatal<br />
problems (such as loss of a single CI path) from becoming serious problems<br />
(such as loss of both paths).<br />
C.11.2 Formats<br />
Errors and other events on the CI, DSSI, or LAN cause port drivers to enter<br />
information in the system error log in one of two formats:<br />
• Device attention<br />
Device-attention entries for the CI record events that, in general, are<br />
indicated by the setting of a bit in a hardware register. For the LAN,<br />
device-attention entries typically record errors on a LAN adapter device.<br />
• Logged message<br />
Logged-message entries record the receipt of a message packet that contains<br />
erroneous data or that signals an error condition.<br />
Sections C.11.3 and C.11.6 describe those formats.<br />
C.11.3 CI Device-Attention Entries<br />
Example C–6 shows device-attention entries for the CI. The left column gives<br />
the name of a device register or a memory location. The center column gives<br />
the value contained in that register or location, and the right column gives an<br />
interpretation of that value.<br />
Example C–6 CI Device-Attention Entries<br />
************************* ENTRY 83. **************************** !<br />
ERROR SEQUENCE 10. LOGGED ON: SID 0150400A<br />
DATE/TIME 15-JAN-1994 11:45:27.61 SYS_TYPE 01010000 "<br />
DEVICE ATTENTION KA780 #<br />
SCS NODE: MARS<br />
CI SUB-SYSTEM, MARS$PAA0: - PORT POWER DOWN $<br />
CNFGR 00800038<br />
PMCSR 000000CE<br />
PSR 80000001<br />
PFAR 00000000<br />
PESR 00000000<br />
PPR 03F80001<br />
ADAPTER IS CI<br />
ADAPTER POWER-DOWN<br />
MAINTENANCE TIMER DISABLE<br />
MAINTENANCE INTERRUPT ENABLE<br />
MAINTENANCE INTERRUPT FLAG<br />
PROGRAMMABLE STARTING ADDRESS<br />
UNINITIALIZED STATE<br />
RESPONSE QUEUE AVAILABLE<br />
MAINTENANCE ERROR<br />
UCB$B_ERTCNT 32 %<br />
50. RETRIES REMAINING<br />
UCB$B_ERTMAX 32 &<br />
50. RETRIES ALLOWABLE<br />
UCB$L_CHAR 0C450000<br />
SHAREABLE<br />
AVAILABLE<br />
ERROR LOGGING<br />
CAPABLE OF INPUT<br />
CAPABLE OF OUTPUT<br />
UCB$W_STS 0010<br />
ONLINE<br />
UCB$W_ERRCNT 000B ’<br />
11. ERRORS THIS UNIT<br />
The following table describes the device-attention entries in Example C–6.<br />
Entry Description<br />
! The first two lines are the entry heading. These lines contain the number of the entry in<br />
this error log file, the sequence number of this error, and the identification number (SID) of<br />
this computer. Each entry in the log file contains such a heading.<br />
" This line contains the date, the time, and the computer type.
Entry Description<br />
<strong>Cluster</strong> Troubleshooting<br />
C.11 Analyzing Error-Log Entries for Port Devices<br />
# The next two lines contain the entry type, the processor type (KA780), and the computer’s<br />
SCS node name.<br />
$ This line shows the name of the subsystem and the device that caused the entry and the<br />
reason for the entry. The CI subsystem’s device PAA0 on MARS was powered down.<br />
The next 15 lines contain the names of hardware registers in the port, their contents, and<br />
interpretations of those contents. See the appropriate CI hardware manual for a description<br />
of all the CI port registers.<br />
% The UCB$B_ERTCNT field contains the number of reinitializations that the port driver can<br />
still attempt. The difference between this value and UCB$B_ERTMAX is the number of<br />
reinitializations already attempted.<br />
& The UCB$B_ERTMAX field contains the maximum number of times the port can be<br />
reinitialized by the port driver.<br />
’ The UCB$W_ERRCNT field contains the total number of errors that have occurred on this<br />
port since it was booted. This total includes both errors that caused reinitialization of the<br />
port and errors that did not.<br />
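Because the dump prints these fields in hexadecimal while the interpretations are decimal, a quick conversion confirms the relationship. An illustrative helper (not part of the Error Log utility):<br />

```python
def reinits_attempted(ertcnt_hex: str, ertmax_hex: str) -> int:
    """Number of reinitializations already attempted:
    UCB$B_ERTMAX minus UCB$B_ERTCNT (both fields dumped in hex)."""
    return int(ertmax_hex, 16) - int(ertcnt_hex, 16)

# From Example C-6: both fields show hex 32, i.e. 50 decimal retries,
# so no reinitializations have been attempted yet.
assert int("32", 16) == 50
```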
C.11.4 Error Recovery<br />
The CI port can recover from many errors, but not all. When an error occurs from<br />
which the CI cannot recover, the following process occurs:<br />
Step Action<br />
1 The port notifies the port driver.<br />
2 The port driver logs the error and attempts to reinitialize the port.<br />
3 If the port fails after 50 such initialization attempts, the driver takes it off line, unless the<br />
system disk is connected to the failing port or unless this computer is supposed to be a<br />
cluster member.<br />
4 If the CI port is required for system disk access or cluster participation and all 50<br />
reinitialization attempts have been used, then the computer bugchecks with a CIPORT-type<br />
bugcheck.<br />
Once a CI port is off line, you can put the port back on line only by rebooting the<br />
computer.<br />
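The four steps above amount to a small decision procedure. The sketch below models it in Python with hypothetical names; the real logic lives in the CI port driver.<br />

```python
MAX_REINIT_ATTEMPTS = 50  # the port driver's retry limit described above

def ci_port_recovery_action(failed_attempts: int,
                            holds_system_disk: bool,
                            cluster_member: bool) -> str:
    """Model of steps 2-4: retry up to 50 times, then either bugcheck
    (if the port is required) or take the port off line."""
    if failed_attempts < MAX_REINIT_ATTEMPTS:
        return "reinitialize"        # step 2: log the error and retry
    if holds_system_disk or cluster_member:
        return "CIPORT bugcheck"     # step 4: the port is required
    return "offline"                 # step 3: a reboot restores the port
```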
C.11.5 LAN Device-Attention Entries<br />
Example C–7 shows device-attention entries for the LAN. The left column gives<br />
the name of a device register or a memory location. The center column gives<br />
the value contained in that register or location, and the right column gives an<br />
interpretation of that value.<br />
Example C–7 LAN Device-Attention Entry<br />
************************* ENTRY 80. **************************** !<br />
ERROR SEQUENCE 26. LOGGED ON: SID 08000000<br />
DATE/TIME 15-JAN-1994 11:30:53.07 SYS_TYPE 01010000 "<br />
DEVICE ATTENTION KA630 #<br />
SCS NODE: PHOBOS<br />
NI-SCS SUB-SYSTEM, PHOBOS$PEA0: $<br />
FATAL ERROR DETECTED BY DATALINK %<br />
STATUS1 0000002C &<br />
STATUS2 00000000<br />
DATALINK UNIT 0001 ’<br />
DATALINK NAME 41515803 (<br />
00000000<br />
00000000<br />
00000000<br />
DATALINK NAME = XQA1:<br />
REMOTE NODE 00000000 )<br />
00000000<br />
00000000<br />
00000000<br />
REMOTE ADDR 00000000 +><br />
0000<br />
LOCAL ADDR 000400AA +?<br />
4C07<br />
ETHERNET ADDR = AA-00-04-00-07-4C<br />
ERROR CNT 0001 +@<br />
1. ERROR OCCURRENCES THIS ENTRY<br />
UCB$W_ERRCNT 0007<br />
7. ERRORS THIS UNIT<br />
The following table describes the LAN device-attention entries in Example C–7.<br />
Entry Description<br />
! The first two lines are the entry heading. These lines contain the number of the entry in<br />
this error log file, the sequence number of this error, and the identification number (SID) of<br />
this computer. Each entry in the log file contains such a heading.<br />
" This line contains the date and time and the computer type.<br />
# The next two lines contain the entry type, the processor type (KA630), and the computer’s<br />
SCS node name.<br />
$ This line shows the name of the subsystem and component that caused the entry.<br />
% This line shows the reason for the entry. The LAN driver has shut down the data link<br />
because of a fatal error. The data link will be restarted automatically, if possible.<br />
& STATUS1 shows the I/O completion status returned by the LAN driver. STATUS2 is the<br />
VCI event code delivered to PEDRIVER by the LAN driver. The event values and meanings<br />
are described in the following table:<br />
Event Code Meaning<br />
1200 Port usable<br />
1201 Port unusable<br />
1202 Change address<br />
If a message transmit was involved, the status applies to that transmit.<br />
’ DATALINK UNIT shows the unit number of the LAN device on which the error occurred.<br />
( DATALINK NAME is the name of the LAN device on which the error occurred.<br />
) REMOTE NODE is the name of the remote node to which the packet was being sent. If<br />
zeros are displayed, either no remote node was available or no packet was associated with<br />
the error.<br />
+> REMOTE ADDR is the LAN address of the remote node to which the packet was being sent.<br />
If zeros are displayed, no packet was associated with the error.<br />
+? LOCAL ADDR is the LAN address of the local node.<br />
+@ ERROR CNT. Because some errors can occur at extremely high rates, some error log entries<br />
represent more than one occurrence of an error. This field indicates how many. The errors<br />
counted occurred in the 3 seconds preceding the timestamp on the entry.<br />
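The address fields in this entry are dumped as a longword and a word whose bytes are stored little-endian, which is why LOCAL ADDR 000400AA / 4C07 is printed back as ETHERNET ADDR = AA-00-04-00-07-4C. An illustrative decoder (not part of the Error Log utility):<br />

```python
def ethernet_addr(longword: int, word: int) -> str:
    """Rebuild the printed Ethernet address from the two little-endian
    fields shown in the error-log dump."""
    raw = longword.to_bytes(4, "little") + word.to_bytes(2, "little")
    return "-".join(f"{b:02X}" for b in raw)

# Matches Example C-7: ETHERNET ADDR = AA-00-04-00-07-4C
addr = ethernet_addr(0x000400AA, 0x4C07)
```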
C.11.6 Logged Message Entries<br />
Logged-message entries are made when the CI or LAN port receives a response<br />
that contains either data that the port driver cannot interpret or an error code in<br />
status field of the response.<br />
Example C–8 shows a logged-message entry with an error code in the status field<br />
PPD$B_STATUS for a CI port.<br />
Example C–8 CI Port Logged-Message Entry<br />
************************* ENTRY 3. *************************** !<br />
ERROR SEQUENCE 3. LOGGED ON SID 01188542<br />
ERL$LOGMESSAGE, 15-JAN-1994 13:40:25.13 "<br />
KA780 REV #3. SERIAL #1346. MFG PLANT 15. #<br />
CI SUB-SYSTEM, MARS$PAA0: $<br />
DATA CABLE(S) STATE CHANGE - PATH #0. WENT FROM GOOD TO BAD %<br />
LOCAL STATION ADDRESS, 000000000002 (HEX) &<br />
LOCAL SYSTEM ID, 000000000001 (HEX) ’<br />
REMOTE STATION ADDRESS, 000000000004 (HEX) (<br />
REMOTE SYSTEM ID, 0000000000A9 (HEX) )<br />
UCB$B_ERTCNT 32 50. RETRIES REMAINING +><br />
UCB$B_ERTMAX 32 50. RETRIES ALLOWABLE<br />
UCB$W_ERRCNT 0001 1. ERRORS THIS UNIT<br />
PPD$B_PORT 04 REMOTE NODE #4. +?<br />
PPD$B_STATUS A5 FAIL +@<br />
PATH #0., NO RESPONSE<br />
PATH #1., "ACK" OR NOT USED<br />
NO PATH<br />
PPD$B_OPC 05 IDREQ +A<br />
PPD$B_FLAGS 03 RESPONSE QUEUE BIT +B<br />
SELECT PATH #0.<br />
"CI" MESSAGE +C<br />
00000000<br />
00000000<br />
80000004<br />
0000FE15<br />
4F503000<br />
00000507<br />
00000000<br />
00000000<br />
00000000<br />
00000000<br />
00000000<br />
00000000<br />
00000000<br />
00000000<br />
00000000<br />
00000000<br />
00000000<br />
The following table describes the logged-message entries in Example C–8.<br />
Entry Description<br />
! The first two lines are the entry heading. These lines contain the number of the entry in<br />
this error log file, the sequence number of the error, and the identification number (SID) of<br />
the computer. Each entry in the log file contains a heading.<br />
" This line contains the entry type, the date, and the time.<br />
# This line contains the processor type (KA780), the hardware revision number of the<br />
computer (REV #3), the serial number of the computer (SERIAL #1346), and the<br />
plant number (15).<br />
$ This line shows the name of the subsystem and the device that caused the entry.<br />
% This line gives the reason for the entry (one or more data cables have changed state) and a<br />
more detailed reason for the entry. Path 0, which the port used successfully before, cannot<br />
be used now.<br />
Note: ANALYZE/ERROR_LOG uses the notation "path 0" and "path 1"; cable labels<br />
use the notation "path A (=0)" and "path B (=1)".<br />
& The local (&) and remote (() station addresses are the port numbers (range 0 to 15) of the<br />
local and remote ports. The port numbers are set in hardware switches by Compaq support<br />
representatives.<br />
’ The local (’) and remote ()) system IDs are the SCS system IDs set by the system<br />
parameter SCSSYSTEMID for the local and remote systems. For HSC subsystems, the<br />
system ID is set with the HSC console.<br />
( See &.<br />
) See ’.<br />
+> The next three lines consist of the entry fields that begin with UCB$. These fields give<br />
information on the contents of the unit control block (UCB) for this CI device.<br />
+? The lines that begin with PPD$ are fields in the message packet that the local port has<br />
received. PPD$B_PORT contains the station address of the remote port. In a loopback<br />
datagram, however, this field contains the local station address.<br />
+@ The PPD$B_STATUS field contains information about the nature of the failure that occurred<br />
during the current operation. When the operation completes without error, ERF prints the<br />
word NORMAL beside this field; otherwise, ERF decodes the error information contained in<br />
PPD$B_STATUS. Here a NO PATH error occurred because of a lack of response on path 0,<br />
the selected path.<br />
+A The PPD$B_OPC field contains the code for the operation that the port was attempting<br />
when the error occurred. The port was trying to send a request-for-ID message.<br />
+B The PPD$B_FLAGS field contains bits that indicate, among other things, the path that was<br />
selected for the operation.<br />
+C ‘‘CI’’ MESSAGE is a hexadecimal listing of bytes 16 through 83 (decimal) of the response<br />
(message or datagram). Because responses are of variable length, depending on the port<br />
opcode, bytes 16 through 83 may contain either more or fewer bytes than actually belong to<br />
the message.<br />
C.11.7 Error-Log Entry Descriptions<br />
This section describes error-log entries for the CI and LAN ports. Each entry<br />
shown is followed by a brief description of what the associated port driver (for<br />
example, PADRIVER, PBDRIVER, PEDRIVER) does, and the suggested action<br />
a system manager should take. In cases where you are advised to contact your<br />
Compaq support representative and save crash dumps, it is important to capture<br />
the crash dumps as soon as possible after the error. For CI entries, note that<br />
path A and path 0 are the same path, and that path B and path 1 are the same<br />
path.<br />
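This path-naming correspondence can be captured in a tiny lookup, shown here purely as an illustrative helper:<br />

```python
# ANALYZE/ERROR_LOG numbers the paths 0 and 1; cable labels use the
# letters A and B for the same paths.
PATH_LETTER = {0: "A", 1: "B"}

def cable_label(path_number: int) -> str:
    """Map an error-log path number to its cable label."""
    return f"path {PATH_LETTER[path_number]}"
```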
Table C–6 lists error-log messages.<br />
Table C–6 Port Messages for All Devices<br />
Message: BIIC FAILURE<br />
Result: The port driver attempts to reinitialize the port; after 50 failed<br />
attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative.<br />
Message: CI PORT TIMEOUT<br />
Result: The port driver attempts to reinitialize the port; after 50 failed<br />
attempts, it marks the device off line.<br />
User Action: Increase the PAPOLLINTERVAL system parameter. If the problem<br />
disappears and you are not running privileged user-written software, contact<br />
your Compaq support representative.<br />
Message: 11/750 CPU MICROCODE NOT ADEQUATE FOR PORT<br />
Result: The port driver sets the port off line with no retries attempted. In<br />
addition, if this port is needed because the computer is booted from an HSC<br />
subsystem or is participating in a cluster, the computer bugchecks with a<br />
UCODEREV code bugcheck.<br />
User Action: Read the appropriate section in the current <strong>OpenVMS</strong> <strong>Cluster</strong><br />
Software SPD for information on required computer microcode revisions. Contact<br />
your Compaq support representative, if necessary.<br />
Message: PORT MICROCODE REV NOT CURRENT, BUT SUPPORTED<br />
Result: The port driver detected that the microcode is not at the current<br />
level, but the port driver will continue normally. This error is logged as a<br />
warning only.<br />
User Action: Contact your Compaq support representative when it is convenient<br />
to have the microcode updated.<br />
Message: PORT MICROCODE REV NOT SUPPORTED<br />
Result: The port driver sets the port off line without attempting any retries.<br />
User Action: Read the <strong>OpenVMS</strong> <strong>Cluster</strong> Software SPD for information on the<br />
required CI port microcode revisions. Contact your Compaq support<br />
representative, if necessary.<br />
Message: DATA CABLE(S) STATE CHANGE\CABLES HAVE GONE FROM CROSSED TO UNCROSSED<br />
Result: The port driver logs this event.<br />
User Action: No action needed.<br />
Message: DATA CABLE(S) STATE CHANGE\CABLES HAVE GONE FROM UNCROSSED TO CROSSED<br />
Result: The port driver logs this event.<br />
User Action: Check for crossed cable pairs. (See Section C.10.5.)<br />
Message: DATA CABLE(S) STATE CHANGE\PATH 0. WENT FROM BAD TO GOOD<br />
Result: The port driver logs this event.<br />
User Action: No action needed.<br />
Message: DATA CABLE(S) STATE CHANGE\PATH 0. WENT FROM GOOD TO BAD<br />
Result: The port driver logs this event.<br />
User Action: Check path A cables to see that they are not broken or improperly<br />
connected.<br />
Message: DATA CABLE(S) STATE CHANGE\PATH 0. LOOPBACK IS NOW GOOD, UNCROSSED<br />
Result: The port driver logs this event.<br />
User Action: No action needed.<br />
Message: DATA CABLE(S) STATE CHANGE\PATH 0. LOOPBACK WENT FROM GOOD TO BAD<br />
Result: The port driver logs this event.<br />
User Action: Check for crossed cable pairs or faulty CI hardware. (See<br />
Sections C.10.3 and C.10.5.)<br />
Message: DATA CABLE(S) STATE CHANGE\PATH 1. WENT FROM BAD TO GOOD<br />
Result: The port driver logs this event.<br />
User Action: No action needed.<br />
Message: DATA CABLE(S) STATE CHANGE\PATH 1. WENT FROM GOOD TO BAD<br />
Result: The port driver logs this event.<br />
User Action: Check path B cables to see that they are not broken or improperly<br />
connected.<br />
Message: DATA CABLE(S) STATE CHANGE\PATH 1. LOOPBACK IS NOW GOOD, UNCROSSED<br />
Result: The port driver logs this event.<br />
User Action: No action needed.<br />
Message: DATA CABLE(S) STATE CHANGE\PATH 1. LOOPBACK WENT FROM GOOD TO BAD<br />
Result: The port driver logs this event.<br />
User Action: Check for crossed cable pairs or faulty CI hardware. (See<br />
Sections C.10.3 and C.10.5.)<br />
Message: DATAGRAM FREE QUEUE INSERT FAILURE<br />
Result: The port driver attempts to reinitialize the port; after 50 failed<br />
attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative. This error is caused<br />
by a failure to obtain access to an interlocked queue. Possible sources of the<br />
problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI<br />
(8200, 8300, and 8800) contention.<br />
Message: DATAGRAM FREE QUEUE REMOVE FAILURE<br />
Result: The port driver attempts to reinitialize the port; after 50 failed<br />
attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative. This error is caused<br />
by a failure to obtain access to an interlocked queue. Possible sources of the<br />
problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI<br />
(8200, 8300, and 8800) contention.<br />
Message: FAILED TO LOCATE PORT MICROCODE IMAGE<br />
Result: The port driver marks the device off line and makes no retries.<br />
User Action: Make sure the console volume contains the microcode file<br />
CI780.BIN (for the CI780, CI750, or CIBCI) or the microcode file CIBCA.BIN for<br />
the CIBCA–AA. Then reboot the computer.<br />
Message: HIGH PRIORITY COMMAND QUEUE INSERT FAILURE<br />
Result: The port driver attempts to reinitialize the port; after 50 failed<br />
attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative. This error is caused<br />
by a failure to obtain access to an interlocked queue. Possible sources of the<br />
problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI<br />
(8200, 8300, and 8800) contention.<br />
Message: MSCP ERROR LOGGING DATAGRAM RECEIVED<br />
Result: On receipt of an error message from the HSC subsystem, the port driver<br />
logs the error and takes no other action. You should disable the sending of<br />
HSC informational error-log datagrams with the appropriate HSC console command<br />
because such datagrams take considerable space in the error-log data file.<br />
User Action: Error-log datagrams are useful to read only if they are not<br />
captured on the HSC console for some reason (for example, if the HSC console<br />
ran out of paper). This logged information duplicates messages logged on the<br />
HSC console.<br />
Message: INAPPROPRIATE SCA CONTROL MESSAGE<br />
Result: The port driver closes the port-to-port virtual circuit to the remote<br />
port.<br />
User Action: Contact your Compaq support representative. Save the error logs<br />
and the crash dumps from the local and remote computers.<br />
Message: INSUFFICIENT NON-PAGED POOL FOR INITIALIZATION<br />
Result: The port driver marks the device off line and makes no retries.<br />
User Action: Reboot the computer with a larger value for NPAGEDYN or<br />
NPAGEVIR.<br />
Message: LOW PRIORITY CMD QUEUE INSERT FAILURE<br />
Result: The port driver attempts to reinitialize the port; after 50 failed<br />
attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative. This error is caused<br />
by a failure to obtain access to an interlocked queue. Possible sources of the<br />
problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI<br />
(8200, 8300, and 8800) contention.<br />
Message: MESSAGE FREE QUEUE INSERT FAILURE<br />
Result: The port driver attempts to reinitialize the port; after 50 failed<br />
attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative. This error is caused<br />
by a failure to obtain access to an interlocked queue. Possible sources of the<br />
problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI<br />
(8200, 8300, and 8800) contention.<br />
Message: MESSAGE FREE QUEUE REMOVE FAILURE<br />
Result: The port driver attempts to reinitialize the port; after 50 failed<br />
attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative. This error is caused<br />
by a failure to obtain access to an interlocked queue. Possible sources of the<br />
problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI<br />
(8200, 8300, and 8800) contention.<br />
Message: MICRO-CODE VERIFICATION ERROR<br />
Result: The port driver detected an error while reading the microcode that it<br />
just loaded into the port. The driver attempts to reinitialize the port; after<br />
50 failed attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative.<br />
Message: NO PATH-BLOCK DURING VIRTUAL CIRCUIT CLOSE<br />
Result: The port driver attempts to reinitialize the port; after 50 failed<br />
attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative. Save the error log<br />
and a crash dump from the local computer.<br />
Message: NO TRANSITION FROM UNINITIALIZED TO DISABLED<br />
Result: The port driver attempts to reinitialize the port; after 50 failed<br />
attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative.<br />
Message: PORT ERROR BIT(S) SET<br />
Result: The port driver attempts to reinitialize the port; after 50 failed<br />
attempts, it marks the device off line.<br />
User Action: A maintenance timer expiration bit may mean that the PASTIMOUT<br />
system parameter is set too low and should be increased, especially if the<br />
local computer is running privileged user-written software. For all other<br />
bits, call your Compaq support representative.<br />
Message: PORT HAS CLOSED VIRTUAL CIRCUIT<br />
Result: The port driver closed the virtual circuit that the local port opened<br />
to the remote port.<br />
User Action: Check the PPD$B_STATUS field of the error-log entry for the<br />
reason the virtual circuit was closed. This error is normal if the remote<br />
computer failed or was shut down. For PEDRIVER, ignore the PPD$B_OPC field<br />
value; it is an unknown opcode. If PEDRIVER logs a large number of these<br />
errors, there may be a problem either with the LAN or with a remote system, or<br />
nonpaged pool may be insufficient on the local system.<br />
Message: PORT POWER DOWN<br />
Result: The port driver halts port operations and then waits for power to<br />
return to the port hardware.<br />
User Action: Restore power to the port hardware.<br />
Message: PORT POWER UP<br />
Result: The port driver reinitializes the port and restarts port operations.<br />
User Action: No action needed.<br />
Message: RECEIVED CONNECT WITHOUT PATH-BLOCK<br />
Result: The port driver attempts to reinitialize the port; after 50 failed<br />
attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative. Save the error log<br />
and a crash dump from the local computer.<br />
<strong>Cluster</strong> Troubleshooting<br />
C.11 Analyzing Error-Log Entries for Port Devices<br />
Table C–6 (Cont.) Port Messages for All Devices<br />
Message: REMOTE SYSTEM CONFLICTS WITH KNOWN SYSTEM<br />
Result: The configuration poller discovered a remote computer with SCSSYSTEMID and/or SCSNODE equal to that of another computer to which a virtual circuit is already open.<br />
User Action: Shut down the new computer as soon as possible. Reboot it with a unique SCSSYSTEMID and SCSNODE. Do not leave the new computer up any longer than necessary. If you are running a cluster, and two computers with conflicting identity are polling when any other virtual circuit failure takes place in the cluster, then computers in the cluster may shut down with a CLUEXIT bugcheck.<br />
Message: RESPONSE QUEUE REMOVE FAILURE<br />
Result: The port driver attempts to reinitialize the port; after 50 failed attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative. This error is caused by a failure to obtain access to an interlocked queue. Possible sources of the problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI (8200, 8300, and 8800) contention.<br />
Message: SCSSYSTEMID MUST BE SET TO NON-ZERO VALUE<br />
Result: The port driver sets the port off line without attempting any retries.<br />
User Action: Reboot the computer with a conversational boot and set the SCSSYSTEMID to the correct value. At the same time, check that SCSNODE has been set to the correct nonblank value.<br />
Message: SOFTWARE IS CLOSING VIRTUAL CIRCUIT<br />
Result: The port driver closes the virtual circuit to the remote port.<br />
User Action: Check error-log entries for the cause of the virtual circuit closure. Faulty transmission or reception on both paths, for example, causes this error and may be detected from the one or two previous error-log entries noting bad paths to this remote computer.<br />
Message: SOFTWARE SHUTTING DOWN PORT<br />
Result: The port driver attempts to reinitialize the port; after 50 failed attempts, it marks the device off line.<br />
User Action: Check other error-log entries for the possible cause of the port reinitialization failure.<br />
Message: UNEXPECTED INTERRUPT<br />
Result: The port driver attempts to reinitialize the port; after 50 failed attempts, it marks the device off line.<br />
User Action: Contact your Compaq support representative.<br />
Message: UNRECOGNIZED SCA PACKET<br />
Result: The port driver closes the virtual circuit to the remote port. If the virtual circuit is already closed, the port driver inhibits datagram reception from the remote port.<br />
User Action: Contact your Compaq support representative. Save the error-log file that contains this entry and the crash dumps from both the local and remote computers.<br />
Message: VIRTUAL CIRCUIT TIMEOUT<br />
Result: The port driver closes the virtual circuit that the local CI port opened to the remote port. This closure occurs if the remote computer is running CI microcode Version 7 or later, and if the remote computer has failed to respond to any messages sent by the local computer.<br />
User Action: This error is normal if the remote computer has halted, failed, or was shut down. This error may mean that the local computer’s TIMVCFAIL system parameter is set too low, especially if the remote computer is running privileged user-written software.<br />
Table C–6 (Cont.) Port Messages for All Devices<br />
Message: INSUFFICIENT NON-PAGED POOL FOR VIRTUAL CIRCUITS<br />
Result: The port driver closes virtual circuits because of insufficient pool.<br />
User Action: Enter the DCL command SHOW MEMORY to determine pool requirements, and then adjust the appropriate system parameter requirements.<br />
The descriptions in Table C–7 apply only to LAN devices.<br />
Table C–7 Port Messages for LAN Devices<br />
Message: FATAL ERROR DETECTED BY DATALINK<br />
Completion Status: First longword SS$_NORMAL (00000001), second longword (00001201)<br />
Explanation: The LAN driver stopped the local area <strong>OpenVMS</strong> <strong>Cluster</strong> protocol on the device. This completion status is returned when the SYS$LAVC_STOP_BUS routine completes successfully. The SYS$LAVC_STOP_BUS routine is called either from within the LAVC$STOP_BUS.MAR program found in SYS$EXAMPLES or from a user-written program. The local area <strong>OpenVMS</strong> <strong>Cluster</strong> protocol remains stopped on the specified device until the SYS$LAVC_START_BUS routine executes successfully. The SYS$LAVC_START_BUS routine is called from within the LAVC$START_BUS.MAR program found in SYS$EXAMPLES or from a user-written program.<br />
User Action: If the protocol on the device was stopped inadvertently, then restart the protocol by assembling and executing the LAVC$START_BUS program found in SYS$EXAMPLES. Reference: See Appendix D for an explanation of the local area <strong>OpenVMS</strong> <strong>Cluster</strong> sample programs. Otherwise, this error message can be safely ignored.<br />
Message: FATAL ERROR DETECTED BY DATALINK<br />
Completion Status: First longword is any value other than (00000001), second longword (00001201)<br />
Explanation: The LAN driver has shut down the device because of a fatal error and is returning all outstanding transmits with SS$_OPINCOMPL. The LAN device is restarted automatically.<br />
User Action: Infrequent occurrences of this error are typically not a problem. If the error occurs frequently or is accompanied by loss or reestablishment of connections to remote computers, there may be a hardware problem. Check for the proper LAN adapter revision level or contact your Compaq support representative.<br />
Message: FATAL ERROR DETECTED BY DATALINK<br />
Completion Status: First longword (undefined), second longword (00001200)<br />
Explanation: The LAN driver has restarted the device successfully after a fatal error. This error-log message is usually preceded by a FATAL ERROR DETECTED BY DATALINK error-log message whose first completion status longword is anything other than 00000001 and whose second completion status longword is 00001201.<br />
User Action: No action needed.<br />
Message: TRANSMIT ERROR FROM DATALINK<br />
Completion Status: SS$_OPINCOMPL (000002D4)<br />
Explanation: The LAN driver is in the process of restarting the data link because an error forced the driver to shut down the controller and all users (see FATAL ERROR DETECTED BY DATALINK).<br />
Table C–7 (Cont.) Port Messages for LAN Devices<br />
Message: TRANSMIT ERROR FROM DATALINK<br />
Completion Status: SS$_DEVREQERR (00000334)<br />
Explanation: The LAN controller tried to transmit the packet 16 times and failed because of defers and collisions. This condition indicates that LAN traffic is heavy.<br />
Completion Status: SS$_DISCONNECT (0000204C)<br />
Explanation: There was a loss of carrier during or after the transmit.<br />
User Action: The port emulator automatically recovers from any of these errors, but many such errors indicate either that the LAN controller is faulty or that the LAN is overloaded. If you suspect either of these conditions, contact your Compaq support representative.<br />
Message: INVALID CLUSTER PASSWORD RECEIVED<br />
Explanation: A computer is trying to join the cluster using the correct cluster group number for this cluster but an invalid password. The port emulator discards the message. The probable cause is that another cluster on the LAN is using the same cluster group number.<br />
User Action: Provide all clusters on the same LAN with unique cluster group numbers.<br />
Message: NISCS PROTOCOL VERSION MISMATCH RECEIVED<br />
Explanation: A computer is trying to join the cluster using a version of the cluster LAN protocol that is incompatible with the one in use on this cluster.<br />
User Action: Install a version of the operating system that uses a compatible protocol, or change the cluster group number so that the computer joins a different cluster.<br />
C.12 OPA0 Error-Message Logging and Broadcasting<br />
Port drivers detect certain error conditions and attempt to log them. The port<br />
driver attempts both OPA0 error broadcasting and standard error logging under<br />
any of the following circumstances:<br />
• The system disk has not yet been mounted.<br />
• The system disk is undergoing mount verification.<br />
• During mount verification, the system disk drive contains the wrong volume.<br />
• Mount verification for the system disk has timed out.<br />
• The local computer is participating in a cluster, and quorum has been lost.<br />
Note the implicit assumption that the system and error-logging devices are one<br />
and the same.<br />
The following table describes error-logging methods and their reliability.<br />
Method: Standard error logging to an error-logging device.<br />
Reliability: Under some circumstances, attempts to log errors to the error-logging device can fail. Such failures can occur because the error-logging device is not accessible when attempts are made to log the error condition.<br />
Comments: Because of the central role that the port device plays in clusters, the loss of error-logged information in such cases makes it difficult to diagnose and fix problems.<br />
Method: Broadcasting selected information about the error condition to OPA0. (This is in addition to the port driver’s attempt to log the error condition to the error-logging device.)<br />
Reliability: This method of reporting errors is not entirely reliable, because some error conditions may not be reported due to the way OPA0 error broadcasting is performed. This situation occurs whenever a second error condition is detected before the port driver has been able to broadcast the first error condition to OPA0. In such a case, only the first error condition is reported to OPA0, because that condition is deemed to be the more important one.<br />
Comments: This second, redundant method of error logging captures at least some of the information about port-device error conditions that would otherwise be lost.<br />
Note: Certain error conditions are always broadcast to OPA0, regardless of<br />
whether the error-logging device is accessible. In general, these are errors that<br />
cause the port to shut down either permanently or temporarily.<br />
C.12.1 OPA0 Error Messages<br />
One OPA0 error message for each error condition is always logged. The text of<br />
each error message is similar to the text in the summary displayed by formatting<br />
the corresponding standard error-log entry using the Error Log utility. (See<br />
Section C.11.7 for a list of Error Log utility summary messages and their<br />
explanations.)<br />
Table C–8 lists the OPA0 error messages. The table is divided into units by error<br />
type. Many of the OPA0 error messages contain some optional information, such<br />
as the remote port number, CI packet information (flags, port operation code,<br />
response status, and port number fields), or specific CI port registers. The codes<br />
specify whether the message is always logged on OPA0 or is logged only when the<br />
system device is inaccessible.<br />
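As a concrete illustration of this message shape, the following Python sketch splits an OPA0 broadcast string into its device name, message text, and optional remote-port or remote-system suffix. This helper is not part of OpenVMS; the pattern and field names are assumptions for demonstration only, with the message text taken from Table C–8.<br />

```python
import re

# Illustrative parser for the OPA0 message shape
# "%Pxxn, <text>[--REMOTE PORT xxx | --REMOTE SYSTEM name]".
# The device-name pattern (one letter pair plus a unit digit) is an
# assumption for demonstration, not an OpenVMS specification.
OPA0_PATTERN = re.compile(
    r"^%(?P<device>P[A-Z]{2}\d), (?P<text>.+?)"
    r"(?:\u2014REMOTE (?:PORT (?P<rport>\d+)|SYSTEM (?P<rnode>\S+)))?$"
)

def parse_opa0(message):
    """Split an OPA0 broadcast into device name, message text, and the
    optional remote port number or remote SCS node name."""
    m = OPA0_PATTERN.match(message)
    if m is None:
        return None
    return {
        "device": m.group("device"),
        "text": m.group("text"),
        "remote_port": m.group("rport"),
        "remote_node": m.group("rnode"),
    }
```

For example, `parse_opa0("%PAA0, Virtual Circuit Timeout\u2014REMOTE PORT 12")` yields the device name PAA0 and remote port 12, while a message with no suffix leaves both remote fields empty.<br />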
Table C–8 OPA0 Messages<br />
Error Message (Logged or Inaccessible)<br />
Software Errors During Initialization<br />
%Pxxn, Insufficient Non-Paged Pool for Initialization Logged<br />
%Pxxn, Failed to Locate Port Micro-code Image Logged<br />
%Pxxn, SCSSYSTEMID has NOT been set to a Non-Zero Value Logged<br />
Table C–8 (Cont.) OPA0 Messages<br />
Hardware Errors<br />
%Pxxn, BIIC failure—BICSR/BER/CNF xxxxxx/xxxxxx/xxxxxx Logged<br />
%Pxxn, Micro-code Verification Error Logged<br />
%Pxxn, Port Transition Failure—CNF/PMC/PSR xxxxxx/xxxxxx/xxxxxx Logged<br />
%Pxxn, Port Error Bit(s) Set—CNF/PMC/PSR xxxxxx/xxxxxx/xxxxxx Logged<br />
%Pxxn, Port Power Down Logged<br />
%Pxxn, Port Power Up Logged<br />
%Pxxn, Unexpected Interrupt—CNF/PMC/PSR xxxxxx/xxxxxx/xxxxxx Logged<br />
%Pxxn, CI Port Timeout Logged<br />
%Pxxn, CI port ucode not at required rev level. —RAM/PROM rev is xxxx/xxxx Logged<br />
%Pxxn, CI port ucode not at current rev level.—RAM/PROM rev is xxxx/xxxx Logged<br />
%Pxxn, CPU ucode not at required rev level for CI activity Logged<br />
Queue Interlock Failures<br />
%Pxxn, Message Free Queue Remove Failure Logged<br />
%Pxxn, Datagram Free Queue Remove Failure Logged<br />
%Pxxn, Response Queue Remove Failure Logged<br />
%Pxxn, High Priority Command Queue Insert Failure Logged<br />
%Pxxn, Low Priority Command Queue Insert Failure Logged<br />
%Pxxn, Message Free Queue Insert Failure Logged<br />
%Pxxn, Datagram Free Queue Insert Failure Logged<br />
Errors Signaled with a CI Packet<br />
%Pxxn, Unrecognized SCA Packet—FLAGS/OPC/STATUS/PORT xx/xx/xx/xx Logged<br />
%Pxxn, Port has Closed Virtual Circuit—REMOTE PORT1 xxx Logged<br />
%Pxxn, Software Shutting Down Port Logged<br />
%Pxxn, Software is Closing Virtual Circuit—REMOTE PORT1 xxx Logged<br />
%Pxxn, Received Connect Without Path-Block—FLAGS/OPC/STATUS/PORT xx/xx/xx/xx Logged<br />
%Pxxn, Inappropriate SCA Control Message—FLAGS/OPC/STATUS/PORT xx/xx/xx/xx Logged<br />
%Pxxn, No Path-Block During Virtual Circuit Close—REMOTE PORT1 xxx Logged<br />
%Pxxn, HSC Error Logging Datagram Received—REMOTE PORT1 xxx Inaccessible<br />
%Pxxn, Remote System Conflicts with Known System—REMOTE PORT1 xxx Logged<br />
%Pxxn, Virtual Circuit Timeout—REMOTE PORT1 xxx Logged<br />
%Pxxn, Parallel Path is Closing Virtual Circuit— REMOTE PORT1 xxx Logged<br />
1 If the port driver can identify the remote SCS node name of the affected computer, the driver replaces the ‘‘REMOTE<br />
PORT xxx’’ text with ‘‘REMOTE SYSTEM X...’’, where X... is the value of the system parameter SCSNODE on the remote<br />
computer. If the remote SCS node name is not available, the port driver uses the existing message format.<br />
Key to CI Port Registers:<br />
CNF—configuration register<br />
PMC—port maintenance and control register<br />
PSR—port status register<br />
See also the CI hardware documentation for a detailed description of the CI port registers.<br />
Table C–8 (Cont.) OPA0 Messages<br />
Error Message (Logged or Inaccessible)<br />
Errors Signaled with a CI Packet<br />
%Pxxn, Insufficient Nonpaged Pool for Virtual Circuits Logged<br />
Cable Change-of-State Notification<br />
%Pxxn, Path #0. Has gone from GOOD to BAD—REMOTE PORT1 xxx Inaccessible<br />
%Pxxn, Path #1. Has gone from GOOD to BAD—REMOTE PORT1 xxx Inaccessible<br />
%Pxxn, Path #0. Has gone from BAD to GOOD—REMOTE PORT1 xxx Inaccessible<br />
%Pxxn, Path #1. Has gone from BAD to GOOD—REMOTE PORT1 xxx Inaccessible<br />
%Pxxn, Cables have gone from UNCROSSED to CROSSED—REMOTE PORT1 xxx Inaccessible<br />
%Pxxn, Cables have gone from CROSSED to UNCROSSED—REMOTE PORT1 xxx Inaccessible<br />
%Pxxn, Path #0. Loopback has gone from GOOD to BAD—REMOTE PORT1 xxx Logged<br />
%Pxxn, Path #1. Loopback has gone from GOOD to BAD—REMOTE PORT1 xxx Logged<br />
%Pxxn, Path #0. Loopback has gone from BAD to GOOD—REMOTE PORT1 xxx Logged<br />
%Pxxn, Path #1. Loopback has gone from BAD to GOOD—REMOTE PORT1 xxx Logged<br />
%Pxxn, Path #0. Has become working but CROSSED to Path #1.— REMOTE PORT1 xxx Inaccessible<br />
%Pxxn, Path #1. Has become working but CROSSED to Path #0.— REMOTE PORT1 xxx Inaccessible<br />
C.12.2 CI Port Recovery<br />
Two other messages concerning the CI port appear on OPA0:<br />
%Pxxn, CI port is reinitializing (xxx retries left.)<br />
%Pxxn, CI port is going off line.<br />
The first message indicates that a previous error requiring the port to shut<br />
down is recoverable and that the port will be reinitialized. The ‘‘xxx retries left’’<br />
specifies how many more reinitializations are allowed before the port must be<br />
left permanently off line. Each reinitialization of the port (for reasons other than<br />
power fail recovery) causes approximately 2 KB of nonpaged pool to be lost.<br />
The second message indicates that a previous error is not recoverable and that<br />
the port will be left off line. In this case, the only way to recover the port is to<br />
reboot the computer.
D<br />
Sample Programs for LAN Control<br />
Sample programs are provided to start and stop the NISCA protocol on a LAN<br />
adapter, and to enable LAN network failure analysis. The following programs are<br />
located in SYS$EXAMPLES:<br />
Program Description<br />
LAVC$START_BUS.MAR Starts the NISCA protocol on a specified LAN adapter.<br />
LAVC$STOP_BUS.MAR Stops the NISCA protocol on a specified LAN adapter.<br />
LAVC$FAILURE_ANALYSIS.MAR Enables LAN network failure analysis.<br />
LAVC$BUILD.COM Assembles and links the sample programs.<br />
Reference: The NISCA protocol, responsible for carrying messages across<br />
Ethernet and FDDI LANs to other nodes in the cluster, is described in<br />
Appendix F.<br />
D.1 Purpose of Programs<br />
The port emulator driver, PEDRIVER, starts the NISCA protocol on<br />
all of the LAN adapters in the cluster. LAVC$START_BUS.MAR and<br />
LAVC$STOP_BUS.MAR are provided for cluster managers who want to split<br />
the network load according to protocol type and therefore do not want the NISCA<br />
protocol running on all of the LAN adapters.<br />
Reference: See Section D.5 for information about editing and using the network<br />
failure analysis program.<br />
D.2 Starting the NISCA Protocol<br />
The sample program LAVC$START_BUS.MAR, provided in SYS$EXAMPLES,<br />
starts the NISCA protocol on a specific LAN adapter.<br />
To build the program, perform the following steps:<br />
Step Action<br />
1 Copy the files LAVC$START_BUS.MAR and LAVC$BUILD.COM from SYS$EXAMPLES<br />
to your local directory.<br />
2 Assemble and link the sample program using the following command:<br />
$ @LAVC$BUILD.COM LAVC$START_BUS.MAR<br />
D.2.1 Start the Protocol<br />
To start the protocol on a LAN adapter, perform the following steps:<br />
Step Action<br />
1 Use an account that has the PHY_IO privilege—you need this to run LAVC$START_<br />
BUS.EXE.<br />
2 Define the foreign command (DCL symbol).<br />
3 Execute the foreign command (LAVC$START_BUS.EXE), followed by the name of the LAN<br />
adapter on which you want to start the protocol.<br />
Example: The following example shows how to start the NISCA protocol on LAN<br />
adapter ETA0:<br />
$ START_BUS:==$SYS$DISK:[]LAVC$START_BUS.EXE<br />
$ START_BUS ETA0<br />
D.3 Stopping the NISCA Protocol<br />
The sample program LAVC$STOP_BUS.MAR, provided in SYS$EXAMPLES,<br />
stops the NISCA protocol on a specific LAN adapter.<br />
Caution: Stopping the NISCA protocol on all LAN adapters causes satellites to<br />
hang and could cause cluster systems to fail with a CLUEXIT bugcheck.<br />
Follow the steps below to build the program:<br />
Step Action<br />
1 Copy the files LAVC$STOP_BUS.MAR and LAVC$BUILD.COM from SYS$EXAMPLES to<br />
your local directory.<br />
2 Assemble and link the sample program using the following command:<br />
$ @LAVC$BUILD.COM LAVC$STOP_BUS.MAR
D.3.1 Stop the Protocol<br />
To stop the NISCA protocol on a LAN adapter, perform the following steps:<br />
Step Action<br />
1 Use an account that has the PHY_IO privilege—you need this to run LAVC$STOP_<br />
BUS.EXE.<br />
2 Define the foreign command (DCL symbol).<br />
3 Execute the foreign command (LAVC$STOP_BUS.EXE), followed by the name of the LAN<br />
adapter on which you want to stop the protocol.<br />
Example: The following example shows how to stop the NISCA protocol on LAN<br />
adapter ETA0:<br />
$ STOP_BUS:==$SYS$DISK:[]LAVC$STOP_BUS.EXE<br />
$ STOP_BUS ETA0<br />
D.3.2 Verify Successful Execution<br />
When the LAVC$STOP_BUS module executes successfully, the following device-attention<br />
entry is written to the system error log:<br />
DEVICE ATTENTION . . .<br />
NI-SCS SUB-SYSTEM . . .<br />
FATAL ERROR DETECTED BY DATALINK . . .<br />
In addition, the following hexadecimal values are written to the STATUS field of<br />
the entry:<br />
First longword (00000001)<br />
Second longword (00001201)<br />
The error-log entry indicates expected behavior and can be ignored. However, if<br />
the first longword of the STATUS field contains a value other than hexadecimal<br />
value 00000001, an error has occurred and further investigation may be<br />
necessary.<br />
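The check described above can be sketched as follows. The function name and boolean result are illustrative assumptions, not an OpenVMS interface; only the two longword values are taken from this section.<br />

```python
# Sketch of the Section D.3.2 check: after LAVC$STOP_BUS runs, the
# device-attention error-log entry normally carries these two STATUS
# longwords.  Constant and function names are illustrative only.
EXPECTED_FIRST = 0x00000001   # first STATUS longword for a normal stop
EXPECTED_SECOND = 0x00001201  # second STATUS longword for a normal stop

def stop_bus_needs_investigation(first_longword):
    """Per the text above, a first STATUS longword other than 00000001
    indicates an error that may need further investigation; 00000001
    is the expected, ignorable entry."""
    return first_longword != EXPECTED_FIRST
```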
D.4 Analyzing Network Failures<br />
LAVC$FAILURE_ANALYSIS.MAR is a sample program, located in<br />
SYS$EXAMPLES, that you can edit and use to help detect and isolate a<br />
failed network component. When the program executes, it provides the physical<br />
description of your cluster communications network to the set of routines that<br />
perform the failure analysis.<br />
D.4.1 Failure Analysis<br />
Using the network failure analysis program can help reduce the time necessary<br />
for detection and isolation of a failing network component and, therefore,<br />
significantly increase cluster availability.<br />
D.4.2 How the LAVC$FAILURE_ANALYSIS Program Works<br />
The following table describes how the LAVC$FAILURE_ANALYSIS program<br />
works.<br />
Step Program Action<br />
1 The program groups channels that fail and compares them with the physical description of<br />
the cluster network.<br />
2 The program then develops a list of nonworking network components related to the<br />
failed channels and uses OPCOM messages to display the names of components with a<br />
probability of causing one or more channel failures.<br />
If the network failure analysis cannot verify that a portion of a path (containing multiple<br />
components) works, the program:<br />
1. Calls out the first component in the path as the primary suspect (%LAVC-W-PSUSPECT)<br />
2. Lists the other components as secondary or additional suspects (%LAVC-I-ASUSPECT)<br />
3 When the component works again, OPCOM displays the message %LAVC-S-WORKING.<br />
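The suspect-selection idea in steps 1 and 2 can be sketched in Python. This is an illustration of the logic only, not the actual MACRO implementation in LAVC$FAILURE_ANALYSIS.MAR; the channel representation is an assumption for demonstration.<br />

```python
def analyze_failures(channels):
    """Group failed channels against the physical description: every
    component on at least one working channel is taken as verified; the
    remaining components of each failed channel become suspects.  The
    first unverified component of a failed path is the primary suspect
    (%LAVC-W-PSUSPECT); the rest are additional suspects
    (%LAVC-I-ASUSPECT).  channels maps a channel name to a tuple
    (worked, [components on its path]).  Illustration only."""
    working = set()
    for worked, path in channels.values():
        if worked:
            working.update(path)
    primary, additional = set(), set()
    for worked, path in channels.values():
        if worked:
            continue
        unverified = [c for c in path if c not in working]
        if unverified:
            primary.add(unverified[0])
            additional.update(unverified[1:])
    return primary, additional - primary
```

For instance, if the channel through adapter A1, repeater MPR_A, segment Sa, and adapter D1 fails while another channel through A1 and MPR_A still works, only Sa and D1 remain unverified, with Sa reported as the primary suspect.<br />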
D.5 Using the Network Failure Analysis Program<br />
Table D–1 describes the steps you perform to edit and use the network failure<br />
analysis program.<br />
Table D–1 Procedure for Using the LAVC$FAILURE_ANALYSIS.MAR Program<br />
Step Action Reference<br />
1 Collect and record information specific to your cluster communications network. (Section D.5.1)<br />
2 Edit a copy of LAVC$FAILURE_ANALYSIS.MAR to include the information you collected. (Section D.5.2)<br />
3 Assemble, link, and debug the program. (Section D.5.3)<br />
4 Modify startup files to run the program only on the node for which you supplied data. (Section D.5.4)<br />
5 Execute the program on one or more of the nodes where you plan to perform the network failure analysis. (Section D.5.5)<br />
6 Modify MODPARAMS.DAT to increase the values of nonpaged pool parameters. (Section D.5.6)<br />
7 Test the Local Area <strong>OpenVMS</strong> <strong>Cluster</strong> Network Failure Analysis Program. (Section D.5.7)<br />
D.5.1 Create a Network Diagram<br />
Follow the steps in Table D–2 to create a physical description of the network<br />
configuration and include it in electronic form in the LAVC$FAILURE_<br />
ANALYSIS.MAR program.<br />
Table D–2 Creating a Physical Description of the Network<br />
Step Action Comments<br />
1 Draw a diagram of your <strong>OpenVMS</strong> <strong>Cluster</strong><br />
communications network.<br />
When you edit LAVC$FAILURE_ANALYSIS.MAR,<br />
you include this drawing (in electronic form) in the<br />
program. Your drawing should show the physical<br />
layout of the cluster and include the following<br />
components:<br />
• LAN segments or rings<br />
• LAN bridges<br />
• Wiring concentrators, DELNI interconnects, or<br />
DEMPR repeaters<br />
• LAN adapters<br />
• VAX and Alpha systems<br />
2 Give each component in the drawing a unique label.<br />
For large clusters, you may need to verify the<br />
configuration by tracing the cables.<br />
If your <strong>OpenVMS</strong> <strong>Cluster</strong> contains a large number<br />
of nodes, you may want to replace each node name<br />
with a shorter abbreviation. Abbreviating node<br />
names can help save space in the electronic<br />
form of the drawing when you include it in<br />
LAVC$FAILURE_ANALYSIS.MAR. For example,<br />
you can replace the node name ASTRA with A and<br />
call node ASTRA’s two LAN adapters A1 and A2.<br />
3 List the following information for each component:<br />
• Unique label<br />
• Type [SYSTEM, LAN_ADP, DELNI]<br />
• Location (the physical location of the component)<br />
• LAN address or addresses (if applicable)<br />
Devices such as DELNI interconnects, DEMPR<br />
repeaters, and cables do not have LAN addresses.<br />
Table D–2 (Cont.) Creating a Physical Description of the Network<br />
Step Action Comments<br />
4 Classify each component into one of the following categories:<br />
• Node: VAX or Alpha system in the <strong>OpenVMS</strong> <strong>Cluster</strong> configuration.<br />
• Adapter: LAN adapter on the system that is normally used for <strong>OpenVMS</strong> <strong>Cluster</strong> communications.<br />
• Component: Generic component in the network. Components in this category can usually be shown to be working if at least one path through them is working. Wiring concentrators, DELNI interconnects, DEMPR repeaters, LAN bridges, and LAN segments and rings typically fall into this category.<br />
• Cloud: Generic component in the network. Components in this category cannot be shown to be working even if one or more paths are shown to be working.<br />
The cloud component is necessary only when multiple paths exist between two points within the network, such as with redundant bridging between LAN segments. At a high level, multiple paths can exist; however, during operation, this bridge configuration allows only one path to exist at one time. In general, this bridge example is probably better handled by representing the active bridge in the description as a component and ignoring the standby bridge. (You can identify the active bridge with such network monitoring software as RBMS or DECelms.) With the default bridge parameters, failure of the active bridge will be called out.<br />
5 Use the component labels from step 3 to describe each of the connections in the <strong>OpenVMS</strong> <strong>Cluster</strong> communications network.<br />
6 Choose a node or group of nodes to run the network failure analysis program.<br />
You should run the program only on a node that you included in the physical description when you edited LAVC$FAILURE_ANALYSIS.MAR. The network failure analysis program on one node operates independently from other systems in the <strong>OpenVMS</strong> <strong>Cluster</strong>. So, for executing the network failure analysis program, you should choose systems that are not normally shut down. Other good candidates for running the program are systems with the following characteristics:<br />
• Faster CPU speed<br />
• Larger amounts of memory<br />
• More LAN adapters (running the NISCA protocol)<br />
Note: The physical description is loaded into nonpaged pool, and all processing is performed at IPL 8. CPU use increases as the average number of network components in the network path increases. CPU use also increases as the total number of network paths increases.<br />
D.5.2 Edit the Source File<br />
Follow these steps to edit the LAVC$FAILURE_ANALYSIS.MAR program.<br />
Step Action<br />
1 Copy the following files from SYS$EXAMPLES to your local directory:<br />
• LAVC$FAILURE_ANALYSIS.MAR<br />
• LAVC$BUILD.COM<br />
2 Use the <strong>OpenVMS</strong> <strong>Cluster</strong> network map and the other information you collected to edit<br />
the copy of LAVC$FAILURE_ANALYSIS.MAR.<br />
Example D–1 shows the portion of LAVC$FAILURE_ANALYSIS.MAR that you<br />
edit.<br />
Example D–1 Portion of LAVC$FAILURE_ANALYSIS.MAR to Edit<br />
; *** Start edits here ***<br />
;<br />
;<br />
;<br />
;<br />
;<br />
Edit 1.<br />
Define the hardware components needed to describe<br />
the physical configuration.<br />
NEW_COMPONENT<br />
NEW_COMPONENT<br />
NEW_COMPONENT<br />
NEW_COMPONENT<br />
NEW_COMPONENT<br />
NEW_COMPONENT<br />
SYSTEM<br />
LAN_ADP<br />
DEMPR<br />
DELNI<br />
SEGMENT<br />
NET_CLOUD<br />
NODE<br />
ADAPTER<br />
COMPONENT<br />
COMPONENT<br />
COMPONENT<br />
CLOUD<br />
; Edit 2.<br />
; ; Diagram of a multi-adapter local area <strong>OpenVMS</strong> <strong>Cluster</strong><br />
; ;; Sa -------+---------------+---------------+---------------+-------<br />
; | | | |<br />
; | MPR_A | |<br />
; | .----+----. | |<br />
; | 1| 1| 1| |<br />
; BrA ALPHA BETA DELTA BrB<br />
; | 2| 2| 2| |<br />
; | ‘----+----’ | |<br />
; | LNI_A | |<br />
; | | | |<br />
; Sb -------+---------------+---------------+---------------+-------<br />
; ;; Edit 3.<br />
;<br />
; Label<br />
; -----<br />
Node<br />
------<br />
Description<br />
-----------------------------------------------<br />
SYSTEM A,<br />
LAN_ADP A1,<br />
LAN_ADP A2,<br />
ALPHA,<br />
,<br />
,<br />
< - MicroVAX II; In the Computer room> . . .<br />
, . . .<br />
, . . .<br />
SYSTEM B,<br />
LAN_ADP B1,<br />
LAN_ADP B2,<br />
BETA,<br />
,<br />
,<br />
< - MicroVAX 3500; In the Computer room> . . .<br />
, . . .<br />
, . . .<br />
(continued on next page)<br />
Sample Programs for LAN Control D–7
SYSTEM      D,   DELTA,  < - VAXstation II; In Dan's office> . . .
LAN_ADP     D1,  ,       , . . .
LAN_ADP     D2,  ,       , . . .
;
; Edit 4.
;
;    Label each of the other network components.
;
DEMPR       MPR_A,   ,  < >
DELNI       LNI_A,   ,  < >
SEGMENT     Sa,      ,  < >
SEGMENT     Sb,      ,  < >
NET_CLOUD   BRIDGES, ,  < >
;
; Edit 5.
;
;    Describe the network connections.
;
CONNECTION  Sa, MPR_A
CONNECTION  MPR_A, A1
CONNECTION  A1, A
CONNECTION  MPR_A, B1
CONNECTION  B1, B
CONNECTION  Sa, D1
CONNECTION  D1, D
CONNECTION  Sa, BRIDGES
CONNECTION  Sb, BRIDGES
CONNECTION  Sb, LNI_A
CONNECTION  LNI_A, A2
CONNECTION  A2, A
CONNECTION  LNI_A, B2
CONNECTION  B2, B
CONNECTION  Sb, D2
CONNECTION  D2, D
.PAGE
; *** End of edits ***
In the program, each numbered edit (Edit 1, Edit 2, and so on) identifies a place
where you edit the program to incorporate information about your network. Make
the following edits to the program:
Location Action<br />
Edit 1 Define a category for each component in the configuration. Use the information from<br />
step 5 in Section D.5.1. Use the following format:<br />
NEW_COMPONENT component_type category<br />
Example: The following example shows how to define a DEMPR repeater as part of
the COMPONENT category:
NEW_COMPONENT DEMPR COMPONENT
Edit 2 Incorporate the network map you drew for step 1 of Section D.5.1. Including the map<br />
here in LAVC$FAILURE_ANALYSIS.MAR gives you an electronic record of the map<br />
that you can locate and update more easily than a drawing on paper.<br />
Edit 3 List each OpenVMS Cluster node and its LAN adapters. Use one line for each node,
separating the items of information with commas so that the lines form a table. Each
line should include the following information:
• Component type, followed by a comma.<br />
• Label from the network map, followed by a comma.<br />
• Node name (for SYSTEM components only). If there is no node name, enter a<br />
comma.<br />
• Descriptive text that the network failure analysis program displays if it detects a<br />
failure with this component. Put this text within angle brackets (< >). This text<br />
should include the component’s physical location.<br />
• LAN hardware address (for LAN adapters).<br />
• DECnet LAN address for the LAN adapter that DECnet uses.<br />
Edit 4 List each of the other network components. Use one line for each component. Each line<br />
should include the following information:<br />
• Component name and category you defined with NEW_COMPONENT.<br />
• Label from the network map.<br />
• Descriptive text that the network failure analysis program displays if it detects a<br />
failure with this component. Include a description of the physical location of the<br />
component.<br />
• LAN hardware address (optional).<br />
• Alternate LAN address (optional).<br />
Edit 5 Define the connections between the network components. Use the CONNECTION<br />
macro and the labels for the two components that are connected. Include the following<br />
information:<br />
• CONNECTION macro name<br />
• First component label<br />
• Second component label<br />
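As a sketch, using labels from the sample configuration in Example D–1, the path from segment Sa through the repeater MPR_A and adapter A1 into node A is described by three CONNECTION macro calls:

```
CONNECTION  Sa, MPR_A       ; Segment Sa to the DEMPR repeater
CONNECTION  MPR_A, A1       ; Repeater to node A's first LAN adapter
CONNECTION  A1, A           ; Adapter A1 to system A
```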
Reference: You can find more detailed information about this exercise within the source module<br />
SYS$EXAMPLES:LAVC$FAILURE_ANALYSIS.MAR.<br />
D.5.3 Assemble and Link the Program<br />
Use the following command procedure to assemble and link the program:<br />
$ @LAVC$BUILD.COM LAVC$FAILURE_ANALYSIS.MAR<br />
Make the edits necessary to fix the assembly or link errors, such as errors caused<br />
by mistyping component labels in the path description. Assemble the program<br />
again.<br />
D.5.4 Modify Startup Files<br />
Before you execute LAVC$FAILURE_ANALYSIS.EXE, modify the
startup files to run the program only on the node for which you supplied data.
Example: To execute the program on node OMEGA, you would modify the<br />
startup files in SYS$COMMON:[SYSMGR] to include the following conditional<br />
statement:<br />
$ IF F$GETSYI("NODENAME") .EQS. "OMEGA"
$ THEN
$   RUN SYS$MANAGER:LAVC$FAILURE_ANALYSIS.EXE
$ ENDIF
D.5.5 Execute the Program<br />
To run the LAVC$FAILURE_ANALYSIS.EXE program, follow these steps:<br />
Step Action<br />
1 Use an account that has the PHY_IO privilege.<br />
2 Execute the program on each of the nodes that will perform the network failure analysis:<br />
$ RUN SYS$MANAGER:LAVC$FAILURE_ANALYSIS.EXE<br />
After it executes, the program displays the approximate amount of nonpaged pool<br />
required for the network description. The display is similar to the following:<br />
Non-paged Pool Usage: ~ 10004 bytes<br />
D.5.6 Modify MODPARAMS.DAT<br />
On each system running the network failure analysis, modify the file<br />
SYS$SPECIFIC:[SYSEXE]MODPARAMS.DAT to include the following lines,<br />
replacing value with the value that was displayed for nonpaged pool usage:<br />
ADD_NPAGEDYN = value<br />
ADD_NPAGEVIR = value<br />
D.5.7 Test the Program<br />
Run AUTOGEN on each system for which you modified MODPARAMS.DAT.<br />
Test the program by causing a failure. For example, disconnect a transceiver<br />
cable or ThinWire segment, or cause a power failure on a bridge, a DELNI<br />
interconnect, or a DEMPR repeater. Then check the OPCOM messages to see<br />
whether LAVC$FAILURE_ANALYSIS reports the failed component correctly. If<br />
it does not report the failure, check your edits to the network failure analysis<br />
program.<br />
D.5.8 Display Suspect Components<br />
When an <strong>OpenVMS</strong> <strong>Cluster</strong> network component failure occurs, OPCOM displays<br />
a list of suspected components. Displaying the list through OPCOM allows the<br />
system manager to enable and disable selectively the display of these messages.<br />
The following are sample displays:<br />
%%%%%%%%%%% OPCOM 1-JAN-1994 14:16:13.30 %%%%%%%%%%%<br />
(from node BETA at 1-JAN-1994 14:15:55.38)<br />
Message from user SYSTEM on BETA %LAVC-W-PSUSPECT, component_name<br />
%%%%%%%%%%% OPCOM 1-JAN-1994 14:16:13.41 %%%%%%%%%%%<br />
(from node BETA at 1-JAN-1994 14:15:55.49)<br />
Message from user SYSTEM on BETA %LAVC-W-PSUSPECT, component_name<br />
%%%%%%%%%%% OPCOM 1-JAN-1994 14:16:13.50 %%%%%%%%%%%<br />
(from node BETA at 1-JAN-1994 14:15:55.58)<br />
Message from user SYSTEM on BETA %LAVC-I-ASUSPECT, component_name<br />
The OPCOM display of suspected failures uses the following prefixes to list<br />
suspected failures:<br />
• %LAVC-W-PSUSPECT—Primary suspects<br />
• %LAVC-I-ASUSPECT—Secondary or additional suspects<br />
• %LAVC-S-WORKING—Suspect component is now working<br />
The text following the message prefix is the description of the network component<br />
you supplied when you edited LAVC$FAILURE_ANALYSIS.MAR.<br />
E
Subroutines for LAN Control
E.1 Introduction
In addition to the sample programs described in Appendix D, a number of<br />
subroutines are provided as a way of extending the capabilities of the sample<br />
programs. Table E–1 describes the subroutines.<br />
Table E–1 Subroutines for LAN Control<br />
Subroutine Description<br />
To manage LAN adapters:<br />
SYS$LAVC_START_BUS Directs PEDRIVER to start the NISCA protocol on a<br />
specific LAN adapter.<br />
SYS$LAVC_STOP_BUS Directs PEDRIVER to stop the NISCA protocol on a<br />
specific LAN adapter.<br />
To control the network failure analysis system:<br />
SYS$LAVC_DEFINE_NET_COMPONENT Creates a representation of a physical network<br />
component.<br />
SYS$LAVC_DEFINE_NET_PATH Creates a directed list of network components between<br />
two network nodes.<br />
SYS$LAVC_ENABLE_ANALYSIS Enables the network failure analysis, which makes it<br />
possible to analyze future channel failures.<br />
SYS$LAVC_DISABLE_ANALYSIS Stops the network failure analysis and deallocates the<br />
memory used for the physical network description.<br />
E.1.1 Purpose of the Subroutines<br />
The subroutines described in this appendix are used by the LAN control<br />
programs, LAVC$FAILURE_ANALYSIS.MAR, LAVC$START_BUS.MAR, and<br />
LAVC$STOP_BUS.MAR. Although these programs are sufficient for controlling<br />
LAN networks, you may also find it helpful to use the LAN control subroutines to<br />
further manage LAN adapters.<br />
E.2 Starting the NISCA Protocol<br />
The SYS$LAVC_START_BUS subroutine starts the NISCA protocol on a specified<br />
LAN adapter. To use the routine SYS$LAVC_START_BUS, specify the following<br />
parameter:<br />
Parameter Description<br />
BUS_NAME String descriptor representing the LAN adapter name buffer, passed by reference.<br />
The LAN adapter name must consist of 15 characters or fewer.<br />
Example: The following Fortran sample program uses SYS$LAVC_START_BUS<br />
to start the NISCA protocol on the LAN adapter XQA:<br />
      PROGRAM START_BUS
      EXTERNAL SYS$LAVC_START_BUS
      INTEGER*4 SYS$LAVC_START_BUS
      INTEGER*4 STATUS
      STATUS = SYS$LAVC_START_BUS ( 'XQA0:' )
      CALL SYS$EXIT ( %VAL ( STATUS ))
      END
E.2.1 Status<br />
The SYS$LAVC_START_BUS subroutine returns a status value in register R0, as<br />
described in Table E–2.<br />
Table E–2 SYS$LAVC_START_BUS Status<br />
Status Result<br />
Success Indicates that PEDRIVER is attempting to start the NISCA protocol on the<br />
specified adapter.<br />
Failure Indicates that PEDRIVER cannot start the protocol on the specified LAN adapter.<br />
E.2.2 Error Messages<br />
SYS$LAVC_START_BUS can return the error condition codes shown in the<br />
following table.<br />
E–2 Subroutines for LAN Control<br />
Condition Code Description<br />
SS$_ACCVIO This status is returned for the following conditions:<br />
• No access to the argument list<br />
• No access to the LAN adapter name buffer descriptor<br />
• No access to the LAN adapter name buffer<br />
SS$_DEVACTIVE Bus already exists. PEDRIVER is already trying to use this LAN adapter<br />
for the NISCA protocol.<br />
SS$_INSFARG Not enough arguments supplied.<br />
SS$_INSFMEM Insufficient nonpaged pool to create the bus data structure.<br />
SS$_INVBUSNAM Invalid bus name specified. The device specified does not represent a LAN<br />
adapter that can be used for the protocol.<br />
SS$_IVBUFLEN This status value is returned under the following conditions:<br />
• The LAN adapter name contains no characters (length = 0).<br />
• The LAN adapter name contains more than 15 characters.
SS$_NOSUCHDEV This status value is returned under the following conditions:<br />
• The LAN adapter name specified does not correspond to a LAN<br />
device available to PEDRIVER on this system.<br />
• No LAN drivers are loaded in this system; the value for NET$AR_<br />
LAN_VECTOR is 0.<br />
• PEDRIVER is not initialized; PEDRIVER’s PORT structure is not<br />
available.<br />
Note: Calling this routine may generate an error-log message.<br />
SS$_NOTNETDEV PEDRIVER does not support the specified LAN device.<br />
SS$_SYSVERDIF The specified LAN device’s driver does not support the VCI interface<br />
version required by PEDRIVER.<br />
PEDRIVER can return additional errors that indicate it has failed to create the<br />
connection to the specified LAN adapter.<br />
E.3 Stopping the NISCA Protocol<br />
The SYS$LAVC_STOP_BUS routine stops the NISCA protocol on a specific LAN<br />
adapter.<br />
Caution: Stopping the NISCA protocol on all LAN adapters causes satellites to<br />
hang and could cause cluster systems to fail with a CLUEXIT bugcheck.<br />
To use this routine, specify the parameter described in the following table.<br />
Parameter Description<br />
BUS_NAME String descriptor representing the LAN adapter name buffer, passed by reference.<br />
The LAN adapter name must consist of 15 characters or fewer.<br />
Example: The following Fortran sample program shows how SYS$LAVC_STOP_<br />
BUS is used to stop the NISCA protocol on the LAN adapter XQB:<br />
      PROGRAM STOP_BUS
      EXTERNAL SYS$LAVC_STOP_BUS
      INTEGER*4 SYS$LAVC_STOP_BUS
      INTEGER*4 STATUS
      STATUS = SYS$LAVC_STOP_BUS ( 'XQB' )
      CALL SYS$EXIT ( %VAL ( STATUS ))
      END
E.3.1 Status<br />
The SYS$LAVC_STOP_BUS subroutine returns a status value in register R0, as<br />
described in Table E–3.<br />
Table E–3 SYS$LAVC_STOP_BUS Status<br />
Status Result<br />
Success Indicates that PEDRIVER is attempting to shut down the NISCA protocol on the<br />
specified adapter.<br />
Failure Indicates that PEDRIVER cannot shut down the protocol on the specified LAN<br />
adapter. However, PEDRIVER performs the shutdown asynchronously, and there<br />
could be other reasons why PEDRIVER is unable to complete the shutdown.<br />
When the LAVC$STOP_BUS module executes successfully, the following
device-attention entry is written to the system error log:
DEVICE ATTENTION . . .<br />
NI-SCS SUB-SYSTEM . . .<br />
FATAL ERROR DETECTED BY DATALINK . . .<br />
In addition, the following hexadecimal values are written to the STATUS field of<br />
the entry:<br />
First longword (00000001)<br />
Second longword (00001201)<br />
This error-log entry indicates expected behavior and can be ignored. However, if<br />
the first longword of the STATUS field contains a value other than hexadecimal<br />
value 00000001, an error has occurred and further investigation may be<br />
necessary.<br />
E.3.2 Error Messages<br />
SYS$LAVC_STOP_BUS can return the error condition codes shown in the<br />
following table.<br />
Condition Code Description<br />
SS$_ACCVIO This status is returned for the following conditions:<br />
• No access to the argument list<br />
• No access to the LAN adapter name buffer descriptor<br />
• No access to the LAN adapter name buffer<br />
SS$_INVBUSNAM Invalid bus name specified. The device specified does not<br />
represent a LAN adapter that can be used for the NISCA<br />
protocol.<br />
SS$_IVBUFLEN This status value is returned under the following conditions:<br />
• The LAN adapter name contains no characters (length =<br />
0).<br />
• The LAN adapter name has more than 15 characters.
SS$_NOSUCHDEV This status value is returned under the following conditions:<br />
• The LAN adapter name specified does not correspond<br />
to a LAN device that is available to PEDRIVER on this<br />
system.<br />
• No LAN drivers are loaded in this system. NET$AR_<br />
LAN_VECTOR is zero.<br />
• PEDRIVER is not initialized. PEDRIVER’s PORT<br />
structure is not available.<br />
E.4 Creating a Representation of a Network Component<br />
The SYS$LAVC_DEFINE_NET_COMPONENT subroutine creates a<br />
representation for a physical network component.<br />
Use the following format to specify the parameters:<br />
STATUS = SYS$LAVC_DEFINE_NET_COMPONENT (<br />
component_description,<br />
nodename_length,<br />
component_type,<br />
lan_hardware_addr,<br />
lan_decnet_addr,<br />
component_id_value )<br />
Table E–4 describes the SYS$LAVC_DEFINE_NET_COMPONENT parameters.<br />
Table E–4 SYS$LAVC_DEFINE_NET_COMPONENT Parameters<br />
Parameter Description<br />
component_description Address of a string descriptor representing the network component
name buffer. The length of the network component name must not
exceed COMP$C_MAX_NAME_LEN characters.
nodename_length Address of the length of the node name. For COMP$C_NODE
component types, the node name is located at the beginning of the
network component name buffer. Specify zero for other component
types.
component_type Address of the component type. These values are defined by<br />
$PEMCOMPDEF, found in SYS$LIBRARY:LIB.MLB.<br />
lan_hardware_addr Address of a string descriptor of a buffer containing the<br />
component’s LAN hardware address (6 bytes). You must specify<br />
this value for COMP$C_ADAPTER types. For other component<br />
types, this value is optional.<br />
lan_decnet_addr String descriptor of a buffer containing the component’s LAN<br />
DECnet address (6 bytes). This is an optional parameter for all<br />
component types.<br />
component_id_value Address of a longword that is written with the component ID<br />
value.<br />
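Following the pattern of the SYS$LAVC_START_BUS example in Section E.2, a call might look like the Fortran sketch below. The component name, the variable names, and the use of null arguments for the two optional LAN address parameters are illustrative assumptions, not prescribed usage; the component-type value must be obtained from the $PEMCOMPDEF definitions in SYS$LIBRARY:LIB.MLB (not shown here).

```
      PROGRAM DEFINE_COMP
      EXTERNAL SYS$LAVC_DEFINE_NET_COMPONENT
      EXTERNAL SYS$EXIT
      INTEGER*4 SYS$LAVC_DEFINE_NET_COMPONENT
      INTEGER*4 STATUS
      INTEGER*4 NAME_LEN / 0 /     ! Zero: not a COMP$C_NODE component
      INTEGER*4 COMP_TYPE          ! Set from $PEMCOMPDEF (value not shown)
      INTEGER*4 COMP_ID            ! Receives the component ID value
      STATUS = SYS$LAVC_DEFINE_NET_COMPONENT ( 'SEGMENT A',
     +         NAME_LEN, COMP_TYPE, , , COMP_ID )
      CALL SYS$EXIT ( %VAL ( STATUS ))
      END
```

The component ID written to COMP_ID is the value later placed in the component list passed to SYS$LAVC_DEFINE_NET_PATH.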
E.4.1 Status<br />
If successful, the SYS$LAVC_DEFINE_NET_COMPONENT subroutine creates
a COMP data structure and returns its ID value. This subroutine copies
user-specified parameters into the data structure and sets the reference count to
zero.
The component ID value is a 32-bit value that has a one-to-one association with<br />
a network component. Lists of these component IDs are passed to SYS$LAVC_<br />
DEFINE_NET_PATH to specify the components used when a packet travels from<br />
one node to another.<br />
E.4.2 Error Messages<br />
SYS$LAVC_DEFINE_NET_COMPONENT can return the error condition codes<br />
shown in the following table.<br />
Condition Code Description<br />
SS$_ACCVIO This status is returned for the following conditions:<br />
• No access to the network component name buffer descriptor<br />
• No access to the network component name buffer<br />
• No access to the component’s LAN hardware address if a nonzero<br />
value was specified<br />
• No access to the component’s LAN DECnet address if a nonzero<br />
value was specified<br />
• No access to the lan_hardware_addr string descriptor<br />
• No access to the lan_decnet_addr string descriptor<br />
• No write access to the component_id_value address<br />
• No access to the component_type address<br />
• No access to the nodename_length address<br />
• No access to the argument list<br />
SS$_DEVACTIVE Analysis program already running. You must stop the analysis by
calling SYS$LAVC_DISABLE_ANALYSIS before you define the
network components and the network component lists.
SS$_INSFARG Not enough arguments supplied.<br />
SS$_INVCOMPTYPE The component type is either 0 or greater than or equal to COMP$C_<br />
INVALID.
SS$_IVBUFLEN This status value is returned under the following conditions:
• The component name has no characters (length = 0).
• Length of the component name is greater than COMP$C_MAX_NAME_LEN.
• The node name has no characters (length = 0) and the component type is COMP$C_NODE.
• The node name has more than 8 characters and the component type is COMP$C_NODE.
• The lan_hardware_addr string descriptor has fewer than 6 characters.
• The lan_decnet_addr has fewer than 6 characters.
E.5 Creating a Network Component List
The SYS$LAVC_DEFINE_NET_PATH subroutine creates a directed list of
network components between two network nodes. A directed list is a list of
all the components through which a packet passes as it travels from the failure
analysis node to other nodes in the cluster network.
Use the following format to specify the parameters:<br />
STATUS = SYS$LAVC_DEFINE_NET_PATH (<br />
network_component_list,<br />
used_for_analysis_status,<br />
bad_component_id )<br />
Table E–5 describes the SYS$LAVC_DEFINE_NET_PATH parameters.<br />
Table E–5 SYS$LAVC_DEFINE_NET_PATH Parameters<br />
Parameter Description<br />
network_component_list Address of a string descriptor for a buffer containing the<br />
component ID values for each of the components in the path.<br />
List the component ID values in the order in which a network<br />
message travels through them. Specify components in the<br />
following order:<br />
1. Local node<br />
2. Local LAN adapter<br />
3. Intermediate network components<br />
4. Remote network LAN adapter<br />
5. Remote node<br />
You must list two nodes and two LAN adapters in the network<br />
path. The buffer length must be greater than 15 bytes and<br />
less than 509 bytes.<br />
(continued on next page)<br />
Table E–5 (Cont.) SYS$LAVC_DEFINE_NET_PATH Parameters<br />
Parameter Description<br />
used_for_analysis_status Address of a longword status value that is written. This<br />
status indicates whether this network path has any value for<br />
the network failure analysis.<br />
bad_component_id Address of a longword value that contains the erroneous<br />
component ID if an error is detected while processing the<br />
component list.<br />
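Assuming component ID values have already been collected from calls to SYS$LAVC_DEFINE_NET_COMPONENT, a call might be sketched in Fortran as follows. The five-element path (local node, local adapter, one intermediate component, remote adapter, remote node), the variable names, and the use of %DESCR to pass the ID buffer are illustrative assumptions.

```
      PROGRAM DEFINE_PATH
      EXTERNAL SYS$LAVC_DEFINE_NET_PATH
      EXTERNAL SYS$EXIT
      INTEGER*4 SYS$LAVC_DEFINE_NET_PATH
      INTEGER*4 STATUS
      INTEGER*4 PATH_STATUS        ! used_for_analysis_status
      INTEGER*4 BAD_ID             ! bad_component_id
      INTEGER*4 COMP_LIST ( 5 )    ! Node, adapter, segment, adapter, node
C     Fill COMP_LIST with the component ID values returned by
C     SYS$LAVC_DEFINE_NET_COMPONENT, in the order a message travels
C     from the local node to the remote node (code not shown).
      STATUS = SYS$LAVC_DEFINE_NET_PATH ( %DESCR ( COMP_LIST ),
     +         PATH_STATUS, BAD_ID )
      CALL SYS$EXIT ( %VAL ( STATUS ))
      END
```

Note that the 20-byte buffer (five longwords) satisfies the stated length constraints: greater than 15 bytes, less than 509 bytes, and a multiple of 4.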
E.5.1 Status<br />
This subroutine creates a directed list of network components that describe<br />
a specific network path. If SYS$LAVC_DEFINE_NET_PATH is successful, it<br />
creates a CLST data structure. If one node is the local node, then this data<br />
structure is associated with a PEDRIVER channel. In addition, the reference<br />
count for each network component in the list is incremented. If neither node<br />
is the local node, then the used_for_analysis_status address contains an error<br />
status.<br />
The SYS$LAVC_DEFINE_NET_PATH subroutine returns a status value in<br />
register R0, as described in Table E–6, indicating whether the network component<br />
list has the correct construction.<br />
Table E–6 SYS$LAVC_DEFINE_NET_PATH Status<br />
Status Result<br />
Success The used_for_analysis_status value indicates whether the network path is useful<br />
for network analysis performed on the local node.<br />
Failure If the failure status returned in R0 is SS$_INVCOMPID, the bad_component_id
address contains the invalid component ID found in the buffer.
E.5.2 Error Messages<br />
SYS$LAVC_DEFINE_NET_PATH can return the error condition codes shown in<br />
the following table.<br />
Condition Code Description<br />
SS$_ACCVIO This status value can be returned under the following conditions:<br />
• No access to the descriptor or the network component ID value<br />
buffer<br />
• No access to the argument list<br />
• No write access to the used_for_analysis_status address<br />
• No write access to the bad_component_id address<br />
SS$_DEVACTIVE Analysis already running. You must stop the analysis by calling the<br />
SYS$LAVC_DISABLE_ANALYSIS function before defining the network<br />
components and the network component lists.<br />
SS$_INSFARG Not enough arguments supplied.<br />
SS$_INVCOMPID Invalid network component ID specified in the buffer. The bad_<br />
component_id address contains the failed component ID.
SS$_INVCOMPLIST This status value can be returned under the following conditions:<br />
• Fewer than two nodes were specified in the node list.<br />
• More than two nodes were specified in the list.<br />
• The first network component ID was not a COMP$C_NODE type.<br />
• The last network component ID was not a COMP$C_NODE type.<br />
• Fewer than two adapters were specified in the list.<br />
• More than two adapters were specified in the list.<br />
SS$_IVBUFLEN Length of the network component ID buffer is less than 16, is not a<br />
multiple of 4, or is greater than 508.<br />
SS$_RMTPATH Network path is not associated with the local node. This status is<br />
returned only to indicate whether this path was needed for network<br />
failure analysis on the local node.<br />
E.6 Starting Network Component Failure Analysis<br />
The SYS$LAVC_ENABLE_ANALYSIS subroutine starts the network component<br />
failure analysis.<br />
Example: The following is an example of using the SYS$LAVC_ENABLE_<br />
ANALYSIS subroutine:<br />
STATUS = SYS$LAVC_ENABLE_ANALYSIS ( )<br />
E.6.1 Status<br />
This subroutine attempts to enable the network component failure analysis code.<br />
The attempt will succeed if at least one component list is defined.<br />
SYS$LAVC_ENABLE_ANALYSIS returns a status in register R0.<br />
E.6.2 Error Messages<br />
SYS$LAVC_ENABLE_ANALYSIS can return the error condition codes shown in<br />
the following table.<br />
Condition Code Description<br />
SS$_DEVOFFLINE PEDRIVER is not properly initialized. ROOT or PORT block<br />
is not available.<br />
SS$_NOCOMPLSTS No network connection lists exist. Network analysis is not<br />
possible.<br />
SS$_WASSET Network component analysis is already running.<br />
E.7 Stopping Network Component Failure Analysis<br />
The SYS$LAVC_DISABLE_ANALYSIS subroutine stops the network component<br />
failure analysis.<br />
Example: The following is an example of using SYS$LAVC_DISABLE_<br />
ANALYSIS:<br />
STATUS = SYS$LAVC_DISABLE_ANALYSIS ( )<br />
This subroutine disables the network component failure analysis code and, if<br />
analysis was enabled, deletes all the network component definitions and network<br />
component list data structures from nonpaged pool.<br />
E.7.1 Status<br />
SYS$LAVC_DISABLE_ANALYSIS returns a status in register R0.<br />
E.7.2 Error Messages<br />
SYS$LAVC_DISABLE_ANALYSIS can return the error condition codes shown in<br />
the following table.<br />
Condition Code Description<br />
SS$_DEVOFFLINE PEDRIVER is not properly initialized. ROOT or PORT block is not<br />
available.<br />
SS$_WASCLR Network component analysis already stopped.<br />
F<br />
Troubleshooting the NISCA Protocol<br />
NISCA is the transport protocol responsible for carrying messages, such as disk<br />
I/Os and lock messages, across Ethernet and FDDI LANs to other nodes in the<br />
cluster. The acronym NISCA refers to the protocol that implements an Ethernet<br />
or FDDI network interconnect (NI) according to the System Communications<br />
Architecture (SCA).<br />
Using the NISCA protocol, an <strong>OpenVMS</strong> software interface emulates the CI<br />
port interface—that is, the software interface is identical to that of the CI bus,<br />
except that data is transferred over a LAN. The NISCA protocol allows <strong>OpenVMS</strong><br />
<strong>Cluster</strong> communication over the LAN without the need for any special hardware.<br />
This appendix describes the NISCA transport protocol and provides
troubleshooting strategies to help a network manager pinpoint network-related
problems. Because troubleshooting hard component failures in the LAN
is best accomplished using a LAN analyzer, this appendix also describes the
features and setup of a LAN analysis tool.
Note<br />
Additional troubleshooting information specific to the revised PEDRIVER<br />
is planned for the next revision of this manual.<br />
F.1 How NISCA Fits into the SCA<br />
The NISCA protocol is an implementation of the Port-to-Port Driver (PPD)<br />
protocol of the SCA.<br />
F.1.1 SCA Protocols<br />
As described in Chapter 2, the SCA is a software architecture that provides<br />
efficient communication services to low-level distributed applications (for example,<br />
device drivers, file services, network managers).<br />
The SCA specifies a number of protocols for <strong>OpenVMS</strong> <strong>Cluster</strong> systems, including<br />
System Applications (SYSAP), System Communications Services (SCS), the<br />
Port-to-Port Driver (PPD), and the Physical Interconnect (PI) of the device driver<br />
and LAN adapter. Figure F–1 shows these protocols as interdependent levels<br />
that make up the SCA architecture. Figure F–1 shows the NISCA protocol as a<br />
particular implementation of the PPD layer of the SCA architecture.<br />
Figure F–1 Protocols in the SCA Architecture

    System Communications Architecture
        SYSAP (System Applications)
        SCS (System Communications Services)
        PPD (Port-to-Port Driver)
        PI (Physical Interconnect)

    NISCA Protocol (an implementation of the PPD layer)
        PPC (Port-to-Port Communication)
        TR (Transport)
        CC (Channel Control)
        DX (Datagram Exchange)
Table F–1 describes the levels of the SCA protocol shown in Figure F–1.<br />
Table F–1 SCA Protocol Layers<br />
Protocol Description<br />
SYSAP Represents clusterwide system applications that execute on each node. These system applications share<br />
communication paths in order to send messages between nodes. Examples of system applications are disk<br />
class drivers (such as DUDRIVER), the MSCP server, and the connection manager.<br />
SCS Manages connections around the <strong>OpenVMS</strong> <strong>Cluster</strong> and multiplexes messages between system<br />
applications over a common transport called a virtual circuit (see Section F.1.2). The SCS layer also<br />
notifies individual system applications when a connection fails so that they can respond appropriately.<br />
For example, an SCS notification might trigger DUDRIVER to fail over a disk, trigger a cluster state<br />
transition, or notify the connection manager to start timing reconnect (RECNXINTERVAL) intervals.<br />
PPD Provides a message delivery service to other nodes in the <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
PPD Level Description<br />
Port-to-Port Driver (PPD) Establishes virtual circuits and handles errors.<br />
Port-to-Port Communication (PPC) Provides port-to-port communication, datagrams, sequenced messages, and block<br />
transfers. ‘‘Segmentation’’ also occurs at the PPC level. Segmentation of large<br />
blocks of data is done differently on a LAN than on a CI or a DSSI bus. LAN<br />
data packets are fragmented according to the size allowed by the particular LAN<br />
communications path, as follows:<br />
Port-to-Port Communications Packet Size Allowed<br />
Ethernet-to-Ethernet 1498 bytes<br />
FDDI-to-Ethernet 1498 bytes<br />
FDDI-to-Ethernet-to-FDDI 1498 bytes<br />
FDDI-to-FDDI 4468 bytes<br />
Note: The default value is 1498 bytes for both Ethernet and FDDI.<br />
Transport (TR) Provides an error-free path, called a virtual circuit (see Section F.1.2), between<br />
nodes. The PPC level uses a virtual circuit for transporting sequenced messages<br />
and datagrams between two nodes in the cluster.<br />
Channel Control (CC) Manages network paths, called channels, between nodes in an <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
The CC level maintains channels by sending HELLO datagram messages between<br />
nodes. A node sends a HELLO datagram message to indicate that it is still functioning.<br />
The TR level uses channels to carry virtual circuit traffic.<br />
Datagram Exchange (DX) Interfaces to the LAN driver.<br />
PI Provides connections to LAN devices. PI represents LAN drivers and adapters over which packets are sent<br />
and received.<br />
PI Component Description<br />
LAN drivers Multiplex NISCA and many other clients (such as DECnet, TCP/IP, LAT,<br />
LAD/LAST) and provide them with datagram services on Ethernet and FDDI<br />
network interfaces.<br />
LAN adapters Consist of the LAN network driver and adapter hardware.<br />
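The PPC-level segmentation rule in Table F–1 can be illustrated with a short sketch. This is a toy model only, not PEDRIVER source: the function name is invented, and the path sizes come directly from the table above.<br />

```python
# Illustrative sketch of PPC segmentation: a large block transfer is
# split into fragments no larger than the packet size allowed by the
# LAN communications path (values from Table F-1).
PATH_PACKET_SIZE = {
    "Ethernet-to-Ethernet": 1498,
    "FDDI-to-Ethernet": 1498,
    "FDDI-to-Ethernet-to-FDDI": 1498,
    "FDDI-to-FDDI": 4468,
}

def fragment(data: bytes, path: str) -> list:
    """Split a block of data into chunks that fit the path's packet size."""
    size = PATH_PACKET_SIZE[path]
    return [data[i:i + size] for i in range(0, len(data), size)]
```

Note how the same block needs fewer fragments on an FDDI-to-FDDI path than on any path that includes an Ethernet segment.<br />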
F.1.2 Paths Used for Communication<br />
The NISCA protocol controls communications over the paths described in<br />
Table F–2.<br />
Table F–2 Communication Paths<br />
Path Description<br />
Virtual circuit A common transport that provides reliable port-to-port communication<br />
between <strong>OpenVMS</strong> <strong>Cluster</strong> nodes in order to:<br />
• Ensure the delivery of messages without duplication or loss. To ensure<br />
delivery, each port maintains a virtual circuit with every other remote port.<br />
• Ensure the sequential ordering of messages. To ensure ordering, virtual<br />
circuit sequence numbers are used on the individual packets. Each transmitted<br />
message carries a sequence number; duplicates are discarded.<br />
The virtual circuit descriptor table in each port indicates the status of its<br />
port-to-port circuits. After a virtual circuit is formed between two ports,<br />
communication can be established between SYSAPs in the nodes.<br />
Channel A logical communication path between two LAN adapters located on different<br />
nodes. Channels between nodes are determined by the pairs of adapters and<br />
the connecting network. For example, two nodes, each having two adapters,<br />
could establish four channels. The messages carried by a particular virtual<br />
circuit can be sent over any of the channels connecting the two nodes.<br />
Note: The difference between a channel and a virtual circuit is that channels<br />
provide a path for datagram service. Virtual circuits, layered on channels, provide<br />
an error-free path between nodes. Multiple channels can exist between nodes<br />
in an <strong>OpenVMS</strong> <strong>Cluster</strong> but only one virtual circuit can exist between any two<br />
nodes at a time.<br />
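The relationship between adapters, channels, and the single virtual circuit per node pair can be sketched as follows. This is an illustrative model (the function and adapter names are invented), not cluster software.<br />

```python
from itertools import product

# Toy model of Table F-2: a channel exists per pair of LAN adapters on
# two nodes, while only one virtual circuit exists between any two
# nodes at a time, layered on top of whichever channels are available.
def possible_channels(local_adapters, remote_adapters):
    """Each (local adapter, remote adapter) pairing is a potential channel."""
    return list(product(local_adapters, remote_adapters))

def virtual_circuits(node_pair_count):
    """One virtual circuit per pair of nodes, regardless of channel count."""
    return node_pair_count
```

Two nodes with two adapters each yield four channels, as in the example in the table, but still only one virtual circuit.<br />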
F.1.3 PEDRIVER<br />
The port emulator driver, PEDRIVER, implements the NISCA protocol and<br />
establishes and controls channels for communication between local and remote<br />
LAN ports.<br />
PEDRIVER implements a packet delivery service (at the TR level of the NISCA<br />
protocol) that guarantees the sequential delivery of messages. The messages<br />
carried by a particular virtual circuit can be sent over any of the channels<br />
connecting two nodes. The choice of channel is determined by the sender<br />
(PEDRIVER) of the message. Because a node sending a message can choose any<br />
channel, PEDRIVER, as a receiver, must be prepared to receive messages over<br />
any channel.<br />
At any point in time, the TR level makes use of a single ‘‘preferred channel’’ to<br />
carry the traffic for a particular virtual circuit.<br />
Reference: See Appendix G for more information about how transmit channels<br />
are selected.<br />
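Because the sender may use any channel, the TR-level guarantee of sequential delivery can be pictured with a minimal receiver sketch. This is hypothetical illustration only, not PEDRIVER internals.<br />

```python
# Minimal sketch of sequenced delivery at the TR level: a message may
# arrive over any channel, each sequence number is delivered exactly
# once, and retransmitted duplicates are discarded.
class SequencedReceiver:
    def __init__(self):
        self.next_seq = 0
        self.delivered = []

    def receive(self, seq, payload, channel=None):
        """Deliver the next expected message; drop duplicates, whatever
        channel they arrive on."""
        if seq == self.next_seq:
            self.delivered.append(payload)
            self.next_seq += 1
            return True
        return False  # duplicate (or out-of-order) packet: discarded
```

A duplicate arriving over a different channel is still recognized by its sequence number and dropped.<br />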
F.2 Addressing LAN Communication Problems<br />
This section describes LAN communication problems and how to address them.<br />
F.2.1 Symptoms<br />
Communication trouble in <strong>OpenVMS</strong> <strong>Cluster</strong> systems may be indicated by<br />
symptoms such as the following:<br />
• Poor performance<br />
• Console messages<br />
‘‘Virtual circuit closed’’ messages from PEA0 (PEDRIVER) on the console<br />
‘‘Connection loss’’ OPCOM messages on the console<br />
CLUEXIT bugchecks<br />
‘‘Excessive packet losses on LAN Path’’ messages on the console<br />
• Repeated loss of a virtual circuit or multiple virtual circuits over a short<br />
period of time (fewer than 10 minutes)<br />
Before you initiate complex diagnostic procedures, do not overlook the obvious.<br />
Always make sure the hardware is configured and connected properly and that<br />
the network is started. Also, make sure system parameters are set correctly on<br />
all nodes in the <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
F.2.2 Traffic Control<br />
Keep in mind that an <strong>OpenVMS</strong> <strong>Cluster</strong> system generates substantially heavier<br />
traffic than other LAN protocols. In many cases, cluster behavior problems<br />
that appear to be related to the network might actually be related to software,<br />
hardware, or user errors. For example, a large amount of traffic does not<br />
necessarily indicate a problem with the <strong>OpenVMS</strong> <strong>Cluster</strong> network. The amount<br />
of traffic generated depends on how the users utilize the system and the way that<br />
the <strong>OpenVMS</strong> <strong>Cluster</strong> is configured with additional interconnects (such as DSSI<br />
and CI).<br />
If the amount of traffic generated by the <strong>OpenVMS</strong> <strong>Cluster</strong> exceeds the expected<br />
or desired levels, then you might be able to reduce the level of traffic by:<br />
• Adding DSSI or CI interconnects<br />
• Shifting the user load between machines<br />
• Adding LAN segments and reconfiguring the LAN connections across the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system<br />
F.2.3 Excessive Packet Losses on LAN Paths<br />
Prior to <strong>OpenVMS</strong> Version 7.3, an SCS virtual circuit closure was the first<br />
indication that a LAN path had become unusable. In <strong>OpenVMS</strong> Version 7.3,<br />
whenever the last usable LAN path is losing packets at an excessive rate,<br />
PEDRIVER displays the following console message:<br />
%PEA0, Excessive packet losses on LAN path from local-device-name<br />
to device-name on REMOTE NODE node-name<br />
This message is displayed when PEDRIVER recently had to perform an<br />
excessively high rate of packet retransmissions on the LAN path consisting<br />
of the local device, the intervening network, and the device on the remote node.<br />
The message indicates that the LAN path has degraded and is approaching,<br />
or has reached, the point where reliable communications with the remote node<br />
are no longer possible. It is likely that the virtual circuit to the remote node<br />
will close if the losses continue. Furthermore, continued operation with high<br />
LAN packet losses can result in significant loss in performance because of the<br />
communication delays resulting from the packet loss detection timeouts and<br />
packet retransmission.<br />
The corrective steps to take are:<br />
1. Check the local and remote LAN device error counts to see whether a problem<br />
exists on the devices. Issue the following commands on each node:<br />
$ SHOW DEVICE local-device-name<br />
$ MC SCACP<br />
SCACP> SHOW LAN device-name<br />
$ MC LANCP<br />
LANCP> SHOW DEVICE device-name/COUNT<br />
2. If device error counts on the local devices are within normal bounds, contact<br />
your network administrators to request that they diagnose the LAN path<br />
between the devices.<br />
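After gathering the error counts with the commands above, you might tabulate them per device before deciding whether to involve your network administrators. The helper below is a hypothetical illustration (the function name and baseline are invented), not part of any OpenVMS utility.<br />

```python
# Hypothetical helper: given error counts collected from SHOW DEVICE,
# SCACP, and LANCP on each node, report the devices whose counts
# exceed a site-chosen baseline and therefore deserve diagnosis.
def devices_needing_attention(counts, baseline=0):
    """Return the names of devices whose error count exceeds the baseline."""
    return sorted(name for name, errors in counts.items() if errors > baseline)
```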
F.2.4 Preliminary Network Diagnosis<br />
If the symptoms and preliminary diagnosis indicate that you might have a<br />
network problem, troubleshooting LAN communication failures should start<br />
with the step-by-step procedures described in Appendix C. Appendix C helps you<br />
diagnose and solve common Ethernet and FDDI LAN communication failures<br />
during the following stages of <strong>OpenVMS</strong> <strong>Cluster</strong> activity:<br />
• When a computer or a satellite fails to boot<br />
• When a computer fails to join the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
• During run time when startup procedures fail to complete<br />
• When an <strong>OpenVMS</strong> <strong>Cluster</strong> hangs<br />
The procedures in Appendix C require that you verify a number of parameters<br />
during the diagnostic process. Because system parameter settings play a key role<br />
in effective <strong>OpenVMS</strong> <strong>Cluster</strong> communications, Section F.2.6 describes several<br />
system parameters that are especially important to the timing of LAN bridges,<br />
disk failover, and channel availability.<br />
F.2.5 Tracing Intermittent Errors<br />
Because PEDRIVER communication is based on channels, LAN network problems<br />
typically fall into these areas:<br />
• Channel formation and maintenance<br />
Channels are formed when HELLO datagram messages are received from a<br />
remote system. A failure can occur when the HELLO datagram messages are<br />
not received or when the channel control message contains the wrong data.<br />
• Retransmission<br />
A well-configured <strong>OpenVMS</strong> <strong>Cluster</strong> system should not perform excessive<br />
retransmissions between nodes. Retransmissions between any nodes<br />
that occur more frequently than once every few seconds deserve network<br />
investigation.<br />
Diagnosing failures at this level becomes more complex because the errors<br />
are usually intermittent. Moreover, even though PEDRIVER is aware when a<br />
channel is unavailable and performs error recovery based on this information, it<br />
does not provide notification when a channel failure occurs; PEDRIVER provides<br />
notification only for virtual circuit failures.<br />
However, the Local Area <strong>OpenVMS</strong> <strong>Cluster</strong> Network Failure Analysis Program<br />
(LAVC$FAILURE_ANALYSIS), available in SYS$EXAMPLES, can help you use<br />
PEDRIVER information about channel status. The LAVC$FAILURE_ANALYSIS<br />
program (documented in Appendix D) analyzes long-term channel outages, such<br />
as hard failures in LAN network components that occur during run time.<br />
This program uses tables in which you describe your LAN hardware<br />
configuration. During a channel failure, PEDRIVER uses the hardware<br />
configuration represented in the table to isolate which component might be<br />
causing the failure. PEDRIVER reports the suspected component through<br />
an OPCOM display. You can then isolate the LAN component for repair or<br />
replacement.<br />
Reference: Section F.7 addresses the kinds of problems you might find in the<br />
NISCA protocol and provides methods for diagnosing and solving them.<br />
F.2.6 Checking System Parameters<br />
Table F–3 describes several system parameters relevant to the recovery and<br />
failover time limits for LANs in an <strong>OpenVMS</strong> <strong>Cluster</strong>.<br />
Table F–3 System Parameters for Timing<br />
Parameter Use<br />
RECNXINTERVAL<br />
Defines the amount of time to wait before<br />
removing a node from the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
after detection of a virtual circuit failure, which<br />
could result from a LAN bridge failure.<br />
If your network uses multiple paths and you<br />
want the <strong>OpenVMS</strong> <strong>Cluster</strong> to survive failover<br />
between LAN bridges, make sure the value of<br />
RECNXINTERVAL is greater than the time it<br />
takes to fail over those paths.<br />
Reference: The formula for calculating this<br />
parameter is discussed in Section 3.4.7.<br />
MVTIMEOUT<br />
Defines the amount of time the <strong>OpenVMS</strong><br />
operating system tries to recover a path to a<br />
disk before returning failure messages to the<br />
application. MVTIMEOUT is relevant when an<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> configuration is set up to serve disks over<br />
either the Ethernet or FDDI. MVTIMEOUT<br />
is similar to RECNXINTERVAL except that<br />
RECNXINTERVAL is CPU to CPU, and<br />
MVTIMEOUT is CPU to disk.<br />
(continued on next page)<br />
Table F–3 (Cont.) System Parameters for Timing<br />
Parameter Use<br />
SHADOW_MBR_TIMEOUT<br />
Defines the amount of time that the Volume<br />
Shadowing for <strong>OpenVMS</strong> tries to recover from<br />
a transient disk error on a single member of a<br />
multiple-member shadow set.<br />
SHADOW_MBR_TIMEOUT differs from<br />
MVTIMEOUT because it removes a failing<br />
shadow set member quickly. The remaining<br />
shadow set members can recover more rapidly<br />
once the failing member is removed.<br />
Note: The TIMVCFAIL system parameter, which optimizes the amount of time<br />
needed to detect a communication failure, is not recommended for use with<br />
LAN communications. This parameter is intended for CI and DSSI connections.<br />
PEDRIVER (which is for Ethernet and FDDI) usually surpasses the detection<br />
provided by TIMVCFAIL with the listen timeout of 8 to 9 seconds.<br />
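The RECNXINTERVAL guidance in Table F–3 amounts to a simple inequality, sketched below. The function name and the sample values are illustrative only; see Section 3.4.7 for the actual formula.<br />

```python
# Sketch of the Table F-3 rule: for the cluster to survive failover
# between LAN bridges, RECNXINTERVAL must exceed the worst-case time
# the paths take to fail over.
def recnxinterval_survives_failover(recnxinterval_s, worst_failover_s):
    """True when a bridge failover completes before the node would be
    removed from the cluster."""
    return recnxinterval_s > worst_failover_s
```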
F.2.7 Channel Timeouts<br />
Channel timeouts are detected by PEDRIVER as described in Table F–4.<br />
Table F–4 Channel Timeout Detection<br />
PEDRIVER Actions Comments<br />
Listens for HELLO datagram messages, which<br />
are sent over channels at least once every 3<br />
seconds.<br />
Every node in the <strong>OpenVMS</strong> <strong>Cluster</strong> multicasts<br />
HELLO datagram messages on each LAN<br />
adapter to notify other nodes that it is still<br />
functioning. Receiving nodes know that the<br />
network connection is still good.<br />
Closes a channel when HELLO datagrams or<br />
sequenced messages have not been received for a<br />
period of 8 to 9 seconds.<br />
Because HELLO datagram messages are<br />
transmitted at least once every 3 seconds,<br />
PEDRIVER times out a channel only if at least<br />
two HELLO datagram messages are lost and<br />
there is no sequenced message traffic.<br />
Closes a virtual circuit when:<br />
• No channels are available.<br />
• The packet size of the only available<br />
channels is insufficient.<br />
The virtual circuit is not closed if any other<br />
channels to the node are available except<br />
when the packet sizes of available channels<br />
are smaller than the channel being used for<br />
the virtual circuit. For example, if a channel<br />
fails over from FDDI to Ethernet, PEDRIVER<br />
may close the virtual circuit and then reopen it<br />
after negotiating the smaller packet size that is<br />
necessary for Ethernet segmentation.<br />
Does not report errors when a channel is closed.<br />
OPCOM ‘‘Connection loss’’ errors or SYSAP<br />
messages are not sent to users or other system<br />
applications until after the virtual circuit<br />
shuts down. This fact is significant, especially<br />
if there are multiple paths to a node and a<br />
LAN hardware failure occurs. In this case, you<br />
might not receive an error message; PEDRIVER<br />
continues to use the virtual circuit over another<br />
available channel.<br />
Reestablishes a virtual circuit when a channel<br />
becomes available again.<br />
PEDRIVER reopens a channel when HELLO<br />
datagram messages are received again.<br />
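The timing rules in Table F–4 can be captured in a toy model. This sketch assumes the documented values (HELLO at least every 3 seconds, listen timeout of 8 to 9 seconds); it is not PEDRIVER code.<br />

```python
# Toy model of Table F-4: a channel stays open while HELLO datagrams or
# sequenced messages have been heard within the listen timeout, so at
# least two consecutive HELLOs must be lost before a channel times out.
HELLO_INTERVAL_S = 3
LISTEN_TIMEOUT_S = 8  # PEDRIVER uses 8 to 9 seconds

def channel_open(now_s, last_traffic_s):
    """A channel remains open while traffic was received within the timeout."""
    return (now_s - last_traffic_s) < LISTEN_TIMEOUT_S

def hellos_lost_before_timeout():
    """Number of HELLO intervals that fit inside the listen timeout."""
    return LISTEN_TIMEOUT_S // HELLO_INTERVAL_S
```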
F.3 Using SDA to Monitor LAN Communications<br />
This section describes how to use SDA to monitor LAN communications.<br />
F.3.1 Isolating Problem Areas<br />
If your system shows symptoms of intermittent failures during run time, you<br />
need to determine whether there is a network problem or whether the symptoms<br />
are caused by some other activity in the system.<br />
Generally, you can diagnose problems in the NISCA protocol or the network using<br />
the <strong>OpenVMS</strong> System Dump Analyzer utility (SDA). SDA is an effective tool for<br />
isolating problems on specific nodes running in the <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
Reference: The following sections describe the use of some SDA commands<br />
and qualifiers. You should also refer to the <strong>OpenVMS</strong> Alpha System Analysis<br />
Tools Manual or the <strong>OpenVMS</strong> VAX System Dump Analyzer Utility Manual for<br />
complete information about SDA for your system.<br />
F.3.2 SDA Command SHOW PORT<br />
The SDA command SHOW PORT provides relevant information that is useful in<br />
troubleshooting PEDRIVER and LAN adapters in particular. Begin by entering<br />
the SHOW PORT command, which causes SDA to define cluster symbols.<br />
Example F–1 illustrates how the SHOW PORT command provides a summary of<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> data structures.<br />
Example F–1 SDA Command SHOW PORT Display<br />
$ ANALYZE/SYSTEM<br />
SDA> SHOW PORT<br />
VAXcluster data structures<br />
--------------------------<br />
--- PDT Summary Page ---<br />
PDT Address Type Device Driver Name<br />
----------- ---- ------- -----------<br />
80C3DBA0 pa PAA0 PADRIVER<br />
80C6F7A0 pe PEA0 PEDRIVER<br />
F.3.3 Monitoring Virtual Circuits<br />
To examine information about the virtual circuit (VC) that carries messages<br />
between the local node (where you are running SDA) and another remote node,<br />
enter the SDA command SHOW PORT/VC=VC_remote-node-name. Example F–2<br />
shows how to examine information about the virtual circuit running between a<br />
local node and the remote node, NODE11.<br />
Example F–2 SDA Command SHOW PORT/VC Display<br />
SDA> SHOW PORT/VC=VC_NODE11<br />
VAXcluster data structures<br />
--------------------------<br />
--- Virtual Circuit (VC) 98625380 ---<br />
Remote System Name: NODE11 (0:VAX)<br />
Remote SCSSYSTEMID: 19583 Local System ID: 217 (D9)<br />
Status: 0005 open,path<br />
------ Transmit ------- ----- VC Closures ----- ’--- Congestion Control ----<br />
Msg Xmt! 46193196 SeqMsg TMO 0 Pipe Quota/Slo/Max( 31/ 7/31<br />
Unsequence 3 CC DFQ Empty 0 Pipe Quota Reached) 213481<br />
Sequence 41973703 Topology Change% 0 Xmt C/T+> 0/1984<br />
ReXmt" 128/106 NPAGEDYN Low& 0 RndTrp uS+? 18540+7764<br />
Lone ACK 4219362 UnAcked Msgs 0<br />
Bytes Xmt 137312089 CMD Queue Len/Max 0/21<br />
------- Receive ------- - Messages Discarded - ----- Channel Selection -----<br />
Msg Rcv# 47612604 No Xmt Chan 0 Preferred Channel 9867F400<br />
Unsequence 3 Rcv Short Msg 0 Delay Time FAAD63E0<br />
Sequence 37877271 Illegal Seq Msg 0 Buffer Size 1424<br />
ReRcv$ 13987 Bad Checksum 0 Channel Count 18<br />
Lone ACK 9721030 TR DFQ Empty 0 Channel Selections 32138<br />
Cache 314 TR MFQ Empty 0 Protocol 1.3.0<br />
Ill ACK 0 CC MFQ Empty 0 Open+@ 8-FEB-1994 17:00:05.12<br />
Bytes Rcv 3821742649 Cache Miss 0 Cls+A 17-NOV-1858 00:00:00.00<br />
The SHOW PORT/VC=VC_remote-node-name command displays a number of<br />
performance statistics about the virtual circuit for the target node. The display<br />
groups the statistics into general categories that summarize such things as packet<br />
transmissions to the remote node, packets received from the remote node, and<br />
congestion control behavior. The statistics most useful for problem isolation are<br />
called out in Example F–2 and described in Table F–5.<br />
Note: The counters shown in Example F–2 are stored in fixed-size fields and are<br />
automatically reset to 0 when a field reaches its maximum value (or when the<br />
system is rebooted). Because fields have different maximum sizes and growth<br />
rates, the field counters are likely to reset at different times. Thus, for a system<br />
that has been running for a long time, some field values may seem illogical and<br />
appear to contradict others.<br />
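Because the counters wrap, a meaningful difference between two samples has to be computed modulo the field size. The sketch below assumes a 32-bit field for illustration; the actual field widths vary, as the note explains.<br />

```python
# Illustrative sketch: the delta between two samples of a wrapping
# counter, computed modulo the field size so that a counter that has
# wrapped to 0 still yields the correct difference.
def counter_delta(old, new, field_bits=32):
    """Difference between two samples of a fixed-size wrapping counter."""
    return (new - old) % (1 << field_bits)
```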
Table F–5 SHOW PORT/VC Display<br />
Field Description<br />
! Msg Xmt (messages transmitted) Shows the total number of packets transmitted over the virtual circuit to the<br />
remote node, including both sequenced and unsequenced (channel control)<br />
messages, and lone acknowledgments. (All application data is carried<br />
in sequenced messages.) The counters for sequenced messages and lone<br />
acknowledgments grow more quickly than most other fields.<br />
" ReXmt (retransmission) Indicates the number of retransmissions and retransmit related timeouts for<br />
the virtual circuit.<br />
• The rightmost number (106) in the ReXmt field indicates the number<br />
of times a timeout occurred. A timeout indicates one of the following<br />
problems:<br />
The remote system NODE11 did not receive the sequenced message<br />
sent by UPNVMS.<br />
The sequenced message arrived but was delayed in transit to<br />
NODE11.<br />
The local system UPNVMS did not receive the acknowledgment to<br />
the message sent to remote node NODE11.<br />
The acknowledgment arrived but was delayed in transit from<br />
NODE11.<br />
Congestion either in the network or at one of the nodes can cause the<br />
following problems:<br />
Congestion in the network can result in delayed or lost packets.<br />
Network hardware problems can also result in lost packets.<br />
Congestion in UPNVMS or NODE11 can result either in packet<br />
delay because of queuing in the adapter or in packet discard<br />
because of insufficient buffer space.<br />
• The leftmost number (128) indicates the number of packets actually<br />
retransmitted. For example, if the network loses two packets at the<br />
same time, one timeout is counted but two packets are retransmitted.<br />
A retransmission occurs when the local node does not receive an<br />
acknowledgment for a transmitted packet within a predetermined<br />
timeout interval.<br />
Although you should expect to see a certain number of retransmissions<br />
(especially in heavily loaded networks), an excessive number of<br />
retransmissions wastes network bandwidth and indicates excessive<br />
load or intermittent hardware failure. If the leftmost value in the<br />
ReXmt field is greater than about 0.01% to 0.05% of the total number<br />
of the transmitted messages shown in the Msg Xmt field, the <strong>OpenVMS</strong><br />
<strong>Cluster</strong> system probably is experiencing excessive network problems or<br />
local loss from congestion.<br />
# Msg Rcv (messages received) Indicates the total number of messages received by local node UPNVMS<br />
over this virtual circuit. The values for sequenced messages and lone<br />
acknowledgments usually increase at a rapid rate.<br />
(continued on next page)<br />
Table F–5 (Cont.) SHOW PORT/VC Display<br />
Field Description<br />
$ ReRcv (rereceive) Displays the number of packets received redundantly by this system. A<br />
remote system may retransmit packets even though the local node has<br />
already successfully received them. This happens when the cumulative delay<br />
of the packet and its acknowledgment is longer than the estimated roundtrip<br />
time being used as a timeout value by the remote node. Therefore, the<br />
remote node retransmits the packet even though it is unnecessary.<br />
Underestimation of the round-trip delay by the remote node is not directly<br />
harmful, but the retransmission and subsequent congestion-control behavior<br />
on the remote node have a detrimental effect on data throughput. Large<br />
numbers indicate frequent bursts of congestion in the network or adapters<br />
leading to excessive delays. If the value in the ReRcv field is greater than<br />
approximately 0.01% to 0.05% of the total messages received, there may be a<br />
problem with congestion or network delays.<br />
% Topology Change Indicates the number of times PEDRIVER has performed a failover from<br />
FDDI to Ethernet, which necessitated closing and reopening the virtual<br />
circuit. In Example F–2, there have been no failovers. However, if the field<br />
indicates a number of failovers, a problem may exist on the FDDI ring.<br />
& NPAGEDYN (nonpaged dynamic pool) Displays the number of times the virtual circuit was closed because of a pool<br />
allocation failure on the local node. If this value is nonzero, you probably<br />
need to increase the value of the NPAGEDYN system parameter on the local<br />
node.<br />
’ Congestion Control Displays the information the virtual circuit uses to control the pipe quota<br />
(the number of messages that can be sent to the remote node [put into the<br />
‘‘pipe’’] before receiving an acknowledgment) and the retransmission timeout.<br />
PEDRIVER varies the pipe quota and the timeout value to control the<br />
amount of network congestion.<br />
( Pipe Quota/Slo/Max Indicates the current thresholds governing the pipe quota.<br />
• The leftmost number (31) is the current value of the pipe quota<br />
(transmit window). After a timeout, the pipe quota is reset to<br />
1 to decrease congestion and is allowed to increase quickly as<br />
acknowledgments are received.<br />
• The middle number (7) is the slow-growth threshold (the size at which<br />
the rate of increase is slowed) to avoid congestion on the network again.<br />
• The rightmost number (31) is the maximum value currently allowed for<br />
the VC based on channel limitations.<br />
Reference: See Appendix G for PEDRIVER congestion control and channel<br />
selection information.<br />
) Pipe Quota Reached Indicates the number of times the entire transmit window was full. If<br />
this number is small as compared with the number of sequenced messages<br />
transmitted, it indicates that the local node is not sending large bursts of<br />
data to the remote node.<br />
+> Xmt C/T (transmission count/target) Shows both the number of successful transmissions since the last time the<br />
pipe quota was increased and the target value at which the pipe quota is<br />
allowed to increase. In the example, the count is 0 because the pipe quota is<br />
already at its maximum value (31), so successful transmissions are not being<br />
counted.<br />
+? RndTrp uS (round trip in microseconds) Displays values that are used to calculate the retransmission timeout in<br />
microseconds. The leftmost number (18540) is the average round-trip time,<br />
and the rightmost number (7764) is the average variation in round-trip<br />
time. In the example, the values indicate that the round trip is about 19<br />
milliseconds plus or minus about 8 milliseconds.<br />
(continued on next page)
Table F–5 (Cont.) SHOW PORT/VC Display<br />
Field Description<br />
+@ Open and Cls Displays open (Open) and closed (Cls) timestamps for the last significant<br />
changes in the virtual circuit. The repeated loss of one or more virtual<br />
circuits over a short period of time (fewer than 10 minutes) indicates<br />
network problems.<br />
+A Cls If you are analyzing a crash dump, you should check whether the crashdump<br />
time corresponds to the timestamp for channel closures (Cls).<br />
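The rules of thumb given for the ReXmt and ReRcv fields in Table F–5 reduce to a ratio check, sketched below. The function name and the 0.05% threshold are taken from the guidance above but are illustrative, not a fixed interface.<br />

```python
# Sketch of the Table F-5 rule of thumb: retransmissions (ReXmt) or
# redundant receives (ReRcv) greater than roughly 0.01%-0.05% of the
# corresponding message total suggest network problems or congestion.
def loss_rate_excessive(count, total, threshold=0.0005):
    """True when count exceeds ~0.05% of the message total."""
    return total > 0 and (count / total) > threshold

# Example F-2 numbers: 128 retransmitted packets out of 46193196
# transmitted messages, and 13987 redundant receives out of 47612604
# received messages, are both under the threshold.
```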
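The congestion-control fields can also be read together as a simple model: the pipe quota resets after a timeout and regrows toward its maximum, slowing past the slow-growth threshold, while the retransmission timeout is derived from the round-trip average and variation. The sketch below is a simplified illustration using the Example F–2 values; PEDRIVER's actual algorithms are described in Appendix G.<br />

```python
# Toy model of the congestion-control fields in Table F-5. The growth
# rule and the timeout formula (average plus a multiple of the
# variation) are simplifications, not PEDRIVER's exact algorithm.
class PipeQuota:
    def __init__(self, slow_growth=7, maximum=31):
        self.quota = maximum
        self.slow = slow_growth
        self.max = maximum

    def on_timeout(self):
        self.quota = 1  # reset to 1 to decrease congestion

    def on_ack(self):
        # grow quickly below the slow-growth threshold, slowly above it
        step = 2 if self.quota < self.slow else 1
        self.quota = min(self.quota + step, self.max)

def retransmit_timeout_us(avg_rtt_us, avg_dev_us, k=4):
    """Retransmission timeout from average round trip plus k times the
    average variation (k is an assumed multiplier)."""
    return avg_rtt_us + k * avg_dev_us
```

With the Example F–2 values (18540 and 7764 microseconds), this gives a timeout of roughly 50 milliseconds.<br />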
F.3.4 Monitoring PEDRIVER Buses<br />
The SDA command SHOW PORT/BUS=BUS_LAN-device command is useful for<br />
displaying the PEDRIVER representation of a LAN adapter. To PEDRIVER,<br />
a bus is the logical representation of the LAN adapter. (To list the names and<br />
addresses of buses, enter the SDA command SHOW PORT/ADDR=PE_PDT and<br />
then press the Return key twice.) Example F–3 shows a display for the LAN<br />
adapter named EXA.<br />
Example F–3 SDA Command SHOW PORT/BUS Display<br />
SDA> SHOW PORT/BUS=BUS_EXA<br />
VAXcluster data structures<br />
--------------------------<br />
--- BUS: 817E02C0 (EXA) Device: EX_DEMNA LAN Address: AA-00-04-00-64-4F ---<br />
LAN Hardware Address: 08-00-2B-2C-20-B5<br />
Status: 00000803 run,online!,restart<br />
------- Transmit ------ ------- Receive ------- ---- Structure Addresses ---<br />
Msg Xmt 20290620 Msg Rcv 67321527 PORT Address 817E1140<br />
Mcast Msgs 1318437 Mcast Msgs 39773666 VCIB Addr 817E0478<br />
Mcast Bytes 168759936 Mcast Bytes 159660184 HELLO Message Addr 817E0508<br />
Bytes Xmt 2821823510 Bytes Rcv 3313602089 BYE Message Addr 817E0698<br />
Outstand I/Os 0 Buffer Size 1424 Delete BUS Rtn Adr 80C6DA46<br />
Xmt Errors" 15896 Rcv Ring Size 31<br />
Last Xmt Error 0000005C<br />
--- Receive Errors ----<br />
TR Mcast Rcv 0<br />
Rcv Bad SCSID 0<br />
Rcv Short Msg 0<br />
Time of Last Xmt Error# 21-JAN-1994 15:33:38.96<br />
------ BUS Timer ------ ----- Datalink Events ------<br />
Handshake TMO 80C6F070 Last 7-DEC-1992 17:15:42.18<br />
Listen TMO 80C6F074 Last Event 00001202<br />
HELLO timer 3 Port Usable 1<br />
Fail CH Alloc 0 Port Unusable 0<br />
Fail VC Alloc 0 Address Change 1<br />
Wrong PORT 0 Port Restart Fail 0<br />
HELLO Xmt err$ 1623<br />
Field Description<br />
! Status: The Status line should always display a status of ‘‘online’’ to indicate that<br />
PEDRIVER can access its LAN adapter.<br />
" Xmt Errors (transmission errors) Indicates the number of times PEDRIVER has been unable to transmit a<br />
packet using this LAN adapter.<br />
# Time of Last Xmt Error You can compare the time shown in this field with the Open and Cls times<br />
shown in the VC display in Example F–2 to determine whether the time of<br />
the LAN adapter failure is close to the time of a virtual circuit failure.<br />
Note: Transmission errors at the LAN adapter bus level cause a virtual<br />
circuit breakage.<br />
Field Description<br />
$ HELLO Xmt err (HELLO<br />
transmission error)<br />
Indicates how many times a message transmission failure has ‘‘dropped’’ a<br />
PEDRIVER HELLO datagram message. (The Channel Control [CC] level<br />
description in Section F.1 briefly describes the purpose of HELLO datagram<br />
messages.) If many HELLO transmission errors occur, PEDRIVER on other<br />
nodes probably is timing out a channel, which could eventually result in<br />
closure of the virtual circuit.<br />
The 1623 HELLO transmission failures shown in Example F–3 contributed<br />
to the high number of transmission errors (15896). Note that, because HELLO<br />
transmission errors are included in the total transmission error count, the total<br />
can never be lower than the number of HELLO transmission errors.<br />
F.3.5 Monitoring LAN Adapters<br />
Use the SDA command SHOW LAN/COUNTERS to display information about<br />
the LAN adapters as maintained by the LAN device driver (the command<br />
shows counters for all protocols, not just PEDRIVER [SCA] related counters).<br />
Example F–4 shows a sample display from the SHOW LAN/COUNT command.<br />
Example F–4 SDA Command SHOW LAN/COUNTERS Display<br />
$ ANALYZE/SYSTEM<br />
SDA> SHOW LAN/COUNTERS<br />
LAN Data Structures<br />
-------------------<br />
-- EXA Counters Information 22-JAN-1994 11:21:19 --<br />
Seconds since zeroed       3953329   Station failures               0<br />
Octets received        13962888501   Octets sent          11978817384<br />
PDUs received            121899287   PDUs sent               76872280<br />
Mcast octets received   7494809802   Mcast octets sent      183142023<br />
Mcast PDUs received       58046934   Mcast PDUs sent          1658028<br />
Unrec indiv dest PDUs            0   PDUs sent, deferred      4608431<br />
Unrec mcast dest PDUs            0   PDUs sent, one coll      3099649<br />
Data overruns                    2   PDUs sent, mul coll      2439257<br />
Unavail station buffs!           0   Excessive collisions"       5059<br />
Unavail user buffers             0   Carrier check failure          0<br />
Frame check errors             483   Short circuit failure          0<br />
Alignment errors             10215   Open circuit failure           0<br />
Frames too long                142   Transmits too long             0<br />
Rcv data length error            0   Late collisions            14931<br />
802E PDUs received           28546   Coll detect chk fail           0<br />
802 PDUs received                0   Send data length err           0<br />
Eth PDUs received        122691742   Frame size errors              0<br />
-------------------<br />
-- EXA Internal Counters Information 22-JAN-1994 11:22:28 --<br />
Internal counters address 80C58257   Internal counters size        24<br />
Number of ports                  0   Global page transmits          0<br />
No work transmits          3303771   SVAPTE/BOFF transmits          0<br />
Bad PTE transmits                0   Buffer_Adr transmits           0<br />
Fatal error count                0   RDL errors                     0<br />
Transmit timeouts                0   Last fatal error            None<br />
Restart failures                 0   Prev fatal error            None<br />
Power failures                   0   Last error CSR          00000000<br />
Hardware errors                  0   Fatal error code            None<br />
Control timeouts                 0   Prev fatal error            None<br />
Loopback sent                    0   Loopback failures              0<br />
System ID sent                   0   System ID failures             0<br />
ReqCounters sent                 0   ReqCounters failures           0<br />
-- EXA1 60-07 (SCA) Counters Information 22-JAN-1994 11:22:31 --<br />
Last receive#      22-JAN 11:22:31   Last transmit#   22-JAN 11:22:31<br />
Octets received         7616615830   Octets sent           2828248622<br />
PDUs received             67375315   PDUs sent               20331888<br />
Mcast octets received            0   Mcast octets sent              0<br />
Mcast PDUs received              0   Mcast PDUs sent                0<br />
Unavail user buffer              0   Last start attempt          None<br />
Last start done     7-DEC 17:12:29   Last start failed           None<br />
   .<br />
   .<br />
   .<br />
The SHOW LAN/COUNTERS display usually includes device counter information<br />
about several LAN adapters. However, for purposes of example, only one device<br />
is shown in Example F–4.<br />
Field Description<br />
! Unavail station buffs (unavailable<br />
station buffers)<br />
Records the number of times that fixed station buffers in the LAN driver<br />
were unavailable for incoming packets. The node receiving a message can<br />
lose packets when the node does not have enough LAN station buffers. (LAN<br />
buffers are used by a number of consumers other than PEDRIVER, such as<br />
DECnet, TCP/IP, and LAT.) Packet loss because of insufficient LAN station<br />
buffers is a symptom of either LAN adapter congestion or the system’s<br />
inability to reuse the existing buffers fast enough.<br />
" Excessive collisions Indicates the number of unsuccessful attempts to transmit messages on the<br />
adapter. This problem is often caused by:<br />
• A LAN loading problem resulting from heavy traffic (70% to 80%<br />
utilization) on the specific LAN segment.<br />
• A component called a screamer. A screamer is an adapter whose<br />
protocol does not adhere to Ethernet or FDDI hardware protocols.<br />
A screamer does not wait for permission to transmit packets on the<br />
adapter, thereby causing collision errors to register in this field.<br />
If a significant number of transmissions with multiple collisions have<br />
occurred, then <strong>OpenVMS</strong> <strong>Cluster</strong> performance is degraded. You might<br />
be able to improve performance either by removing some nodes from the<br />
LAN segment or by adding another LAN segment to the cluster. The overall<br />
goal is to reduce traffic on the existing LAN segment, thereby making more<br />
bandwidth available to the <strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
# Last receive and Last transmit The difference in the times shown in the Last receive and Last transmit<br />
message fields should not be large. Minimally, the timestamps in these<br />
fields should reflect that HELLO datagram messages are being sent across<br />
channels every 3 seconds. Large time differences might indicate:<br />
• A hardware failure<br />
• That the LAN driver no longer sees the NISCA protocol as active on a<br />
specific LAN adapter<br />
F.4 Troubleshooting NISCA Communications<br />
F.4.1 Areas of Trouble<br />
Sections F.5 and F.6 describe two likely areas of trouble for LAN networks:<br />
channel formation and retransmission. The discussions of these two problems<br />
often include references to the use of a LAN analyzer tool to isolate information<br />
in the NISCA protocol.<br />
Reference: As you read about how to diagnose NISCA problems, you may also<br />
find it helpful to refer to Section F.7, which describes the NISCA protocol packet,<br />
and Section F.8, which describes how to choose and use a LAN network failure<br />
analyzer.<br />
F.5 Channel Formation<br />
Channel-formation problems occur when two nodes cannot communicate properly<br />
between LAN adapters.<br />
F.5.1 How Channels Are Formed<br />
Table F–6 provides a step-by-step description of channel formation.<br />
Table F–6 Channel Formation<br />
Step Action<br />
1 Channels are formed when a node sends a HELLO datagram from its LAN adapter to a<br />
LAN adapter on another cluster node. If this is a new remote LAN adapter address, or<br />
if the corresponding channel is closed, the remote node receiving the HELLO datagram<br />
sends a CCSTART datagram to the originating node after a delay of up to 2 seconds.<br />
2 Upon receiving a CCSTART datagram, the originating node verifies the cluster password<br />
and, if the password is correct, the node responds with a VERF datagram and waits for up<br />
to 5 seconds for the remote node to send a VACK datagram. (VERF, VACK, CCSTART, and<br />
HELLO datagrams are described in Section F.7.6.)<br />
3 Upon receiving a VERF datagram, the remote node verifies the cluster password; if the<br />
password is correct, the node responds with a VACK datagram and marks the channel as<br />
open. (See Figure F–2.)<br />
4<br />
WHEN the local node... THEN...<br />
Does not receive the VACK datagram within 5 seconds: The channel state goes back<br />
to closed and the handshake timeout counter is incremented.<br />
Receives the VACK datagram within 5 seconds and the cluster password is correct:<br />
The channel is opened.<br />
5 Once a channel has been formed, it is maintained (kept open) by the regular multicast of<br />
HELLO datagram messages. Each node multicasts a HELLO datagram message at least<br />
once every 3.0 seconds over each LAN adapter. Either of the nodes sharing a channel<br />
closes the channel with a listen timeout if it does not receive a HELLO datagram or a<br />
sequence message from the other node within 8 to 9 seconds. If you receive a ‘‘Port closed<br />
virtual circuit’’ message, it indicates a channel was formed but there is a problem receiving<br />
traffic on time. When this happens, look for HELLO datagram messages getting lost.<br />
Figure F–2 shows a message exchange during a successful channel-formation<br />
handshake.<br />
Figure F–2 Channel-Formation Handshake<br />
[Diagram: the local node multicasts a HELLO datagram; the remote node replies<br />
with CCSTART; the local node answers with VERF; the remote node completes the<br />
handshake with VACK. (ZK−5940A−GE)]<br />
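The handshake steps in Table F–6 can be sketched as a small state machine. This is illustrative Python only; the class, method, and state names are invented and are not PEDRIVER source.

```python
# Illustrative sketch of the Table F-6 channel-formation handshake.
# All names are invented for illustration; this is not PEDRIVER code.

CLOSED, WAIT_VACK, OPEN = "closed", "wait_vack", "open"

class Channel:
    def __init__(self, cluster_password):
        self.state = CLOSED
        self.password = cluster_password
        self.handshake_timeouts = 0   # corresponds to the Handshake TMO counter

    def on_ccstart(self, password):
        # Step 2: verify the cluster password; if correct, send VERF and
        # wait up to 5 seconds for the remote node's VACK.
        if password == self.password:
            self.state = WAIT_VACK
            return "VERF"
        return None

    def on_vack(self, password):
        # Steps 3-4: a timely VACK with the correct password opens the channel.
        if self.state == WAIT_VACK and password == self.password:
            self.state = OPEN
        return self.state

    def on_vack_timeout(self):
        # Step 4: no VACK within 5 seconds -- back to closed, and the
        # handshake timeout counter is incremented.
        self.state = CLOSED
        self.handshake_timeouts += 1

    def on_listen_timeout(self):
        # Step 5: no HELLO or sequence message for 8 to 9 seconds closes
        # the channel (a listen timeout).
        self.state = CLOSED

ch = Channel(cluster_password="GROUP-PW")
assert ch.on_ccstart("GROUP-PW") == "VERF"
assert ch.on_vack("GROUP-PW") == OPEN
```

After a listen timeout the channel returns to the closed state, and a new HELLO/CCSTART exchange is needed to reopen it.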
F.5.2 Techniques for Troubleshooting<br />
When there is a break in communications between two nodes and you suspect<br />
problems with channel formation, follow these instructions:<br />
Step Action<br />
1 Check the obvious:<br />
• Is the remote node powered on?<br />
• Is the remote node booted?<br />
• Are the required network connections connected?<br />
• Do the cluster multicast datagrams pass through all of the required bridges in both<br />
directions?<br />
• Are the cluster group code and password values the same on all nodes?<br />
2 Check for dead channels by using SDA. The SDA command SHOW<br />
PORT/CHANNEL/VC=VC_remote_node can help you determine whether a channel ever<br />
existed; the command displays the channel’s state.<br />
Reference: Refer to Section F.3 for examples of the SHOW PORT command.<br />
Section F.10.1 describes how to use a LAN analyzer to troubleshoot channel formation<br />
problems.<br />
3 See also Appendix D for information about using the LAVC$FAILURE_ANALYSIS<br />
program to troubleshoot channel problems.<br />
F.6 Retransmission Problems<br />
Retransmissions occur when the local node does not receive acknowledgment of a<br />
message in a timely manner.<br />
F.6.1 Why Retransmissions Occur<br />
The first time the sending node transmits the datagram containing the sequenced<br />
message data, PEDRIVER sets the value of the REXMT flag bit in the TR<br />
header to 0. If the datagram requires retransmission, PEDRIVER sets the<br />
REXMT flag bit to 1 and resends the datagram. PEDRIVER retransmits the<br />
datagram until either the datagram is received or the virtual circuit is closed. If<br />
multiple channels are available, PEDRIVER attempts to retransmit the message<br />
on a different channel in an attempt to avoid the problem that caused the<br />
retransmission.<br />
Retransmission typically occurs when a node runs out of a critical resource, such<br />
as large request packets (LRPs) or nonpaged pool, and a message is lost after<br />
it reaches the remote node. Other potential causes of retransmissions include<br />
overloaded LAN bridges, slow LAN adapters (such as the DELQA), and heavily<br />
loaded systems, which delay packet transmission or reception. Figure F–3 shows<br />
an unsuccessful transmission followed by a successful retransmission.<br />
Figure F–3 Lost Messages Cause Retransmissions<br />
[Diagram: the local node sends message n, which is lost, so the remote node sends<br />
no ACK; after a timeout, the local node retransmits message n and the remote node<br />
returns ACK n. Note: n represents a number that identifies each sequenced<br />
message. (ZK−5941A−GE)]<br />
Because the first message was lost, the local node does not receive<br />
acknowledgment (ACK) from the remote node. The remote node acknowledged<br />
the second (successful) transmission of the message.<br />
Retransmission can also occur if the cables are seated improperly, if the network<br />
is too busy and the datagram cannot be sent, or if the datagram is corrupted or<br />
lost during transmission either by the originating LAN adapter or by any bridges<br />
or repeaters. Figure F–4 illustrates another type of retransmission.<br />
Figure F–4 Lost ACKs Cause Retransmissions<br />
[Diagram: the local node sends message n and the remote node returns ACK n, but<br />
the ACK is lost; after a timeout, the local node retransmits message n and the<br />
remote node returns ACK n again. Note: n represents a number that identifies<br />
each sequenced message. (ZK−5942A−GE)]<br />
In Figure F–4, the remote node receives the message and transmits an<br />
acknowledgment (ACK) to the sending node. However, because the ACK from the<br />
receiving node is lost, the sending node retransmits the message.<br />
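The retransmission behavior described in Section F.6.1 can be sketched as follows. This is illustrative Python; the function and channel names are invented and are not PEDRIVER source.

```python
# Sketch of the Section F.6.1 retransmission behavior: the first transmission
# carries REXMT = 0, every retransmission carries REXMT = 1, and PEDRIVER
# prefers a different channel on each retry. Illustrative only.

REXMT = 1 << 5   # TR-header datagram flag bit 5 (retransmission)

def transmit_attempts(channels, max_tries):
    """Yield (channel, flags) for each successive transmission attempt."""
    for attempt in range(max_tries):
        flags = 0 if attempt == 0 else REXMT        # REXMT set only on resends
        channel = channels[attempt % len(channels)]  # rotate across channels
        yield channel, flags

attempts = list(transmit_attempts(["EXA", "FXA"], 3))
assert attempts[0] == ("EXA", 0)        # first send: REXMT clear
assert attempts[1] == ("FXA", REXMT)    # retry moves to the other channel
assert attempts[2] == ("EXA", REXMT)
```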
F.6.2 Techniques for Troubleshooting<br />
You can troubleshoot cluster retransmissions using a LAN protocol analyzer for<br />
each LAN segment. If multiple segments are used for cluster communications,<br />
then the LAN analyzers need to support a distributed enable and trigger<br />
mechanism (see Section F.8). See also Section G.1 for more information about<br />
how PEDRIVER chooses channels on which to transmit datagrams.<br />
Reference: Techniques for isolating the retransmitted datagram using a<br />
LAN analyzer are discussed in Section F.10.2. See also Appendix G for more<br />
information about congestion control and PEDRIVER message retransmission.<br />
F.7 Understanding NISCA Datagrams<br />
Troubleshooting NISCA protocol communication problems requires an<br />
understanding of the NISCA protocol packet that is exchanged across the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system.<br />
F.7.1 Packet Format<br />
The format of packets on the NISCA protocol is defined by the $NISCADEF<br />
macro, which is located in [DRIVER.LIS] on VAX systems and in [LIB.LIS] for<br />
Alpha systems on your CD listing disk.<br />
Figure F–5 shows the general form of NISCA datagrams. A NISCA datagram<br />
consists of the following headers, which are usually followed by user data:<br />
• LAN headers, including an Ethernet or an FDDI header<br />
• Datagram exchange (DX) header<br />
• Channel control (CC) or transport (TR) header<br />
Figure F–5 NISCA Headers<br />
[Diagram: a NISCA datagram begins with a LAN header, followed by a DX header,<br />
followed by a CC or TR header, followed by the data. (ZK−5920A−GE)]<br />
Caution: The NISCA protocol is subject to change without notice.<br />
F.7.2 LAN Headers<br />
The NISCA protocol is supported on LANs consisting of Ethernet and FDDI,<br />
described in Sections F.7.3 and F.7.4. These headers contain information that is<br />
useful for diagnosing problems that occur between LAN adapters.<br />
Reference: See Section F.9.4 for methods of isolating information in LAN<br />
headers.<br />
F.7.3 Ethernet Header<br />
Each datagram that is transmitted or received on the Ethernet is prefixed with<br />
an Ethernet header. The Ethernet header, shown in Figure F–6 and described in<br />
Table F–7, is 16 bytes long.<br />
Figure F–6 Ethernet Header<br />
[Diagram: byte offsets 0, 6, 12, 14, 16 delimit the Destination Address (6 bytes),<br />
Source Address (6 bytes), Protocol Type (2 bytes), and Length (2 bytes) fields,<br />
followed by the data. (ZK−5921A−GE)]<br />
Table F–7 Fields in the Ethernet Header<br />
Field Description<br />
Destination address LAN address of the adapter that should receive the datagram<br />
Source address LAN address of the adapter sending the datagram<br />
Protocol type NISCA protocol (60–07) hexadecimal<br />
Length Number of data bytes in the datagram following the length field<br />
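As a sketch, the fields in Table F–7 can be pulled out of a captured frame by byte offset. Python is used here for illustration only; the little-endian byte order assumed for the length field and the sample addresses are assumptions, not stated in the table.

```python
# Decode the 16-byte Ethernet header of Figure F-6 / Table F-7.
# The little-endian length field is an assumption, not stated in the table.

def parse_ethernet_header(frame: bytes) -> dict:
    return {
        "destination": frame[0:6].hex("-").upper(),
        "source": frame[6:12].hex("-").upper(),
        "protocol_type": frame[12:14].hex("-").upper(),   # 60-07 for NISCA
        "length": int.from_bytes(frame[14:16], "little"),
    }

# A made-up NISCA frame: two SCS-style addresses, protocol type 60-07,
# and 64 data bytes following the length field.
frame = (bytes.fromhex("aa000400bc4f") + bytes.fromhex("aa000400c15d")
         + bytes.fromhex("6007") + (64).to_bytes(2, "little") + b"\x00" * 64)
hdr = parse_ethernet_header(frame)
assert hdr["protocol_type"] == "60-07"
assert hdr["length"] == 64
```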
F.7.4 FDDI Header<br />
Each datagram that is transmitted or received on the FDDI is prefixed with an<br />
FDDI header. The NISCA protocol uses mapped Ethernet format datagrams on<br />
the FDDI. The FDDI header, shown in Figure F–7 and described in Table F–8, is<br />
23 bytes long.<br />
Figure F–7 FDDI Header<br />
[Diagram: byte offsets 0, 1, 7, 13, 16, 19, 21, 23 delimit the Frame Control<br />
(1 byte), Destination Address (6 bytes), Source Address (6 bytes), SNAP SAP<br />
(3 bytes), SNAP PID (3 bytes), Protocol Type (2 bytes), and Length (2 bytes)<br />
fields, followed by the data. (ZK−5922A−GE)]<br />
Table F–8 Fields in the FDDI Header<br />
Field Description<br />
Frame control NISCA datagrams are logical link control (LLC) frames with a<br />
priority value (5x). The low-order 3 bits of the frame-control byte<br />
contain the priority value. All NISCA frames are transmitted<br />
with a nonzero priority field. Frames received with a zero-priority<br />
field are assumed to have traveled over an Ethernet segment<br />
because Ethernet packets do not have a priority value and<br />
because Ethernet-to-FDDI bridges generate a priority value of<br />
0.<br />
Destination address LAN address of the adapter that should receive the datagram.<br />
Source address LAN address of the adapter sending the datagram.<br />
SNAP SAP Subnetwork access protocol; service access point. The value of the<br />
access point is AA–AA–03 hexadecimal.<br />
SNAP PID Subnetwork access protocol; protocol identifier. The value of the<br />
identifier is 00–00–00 hexadecimal.<br />
Protocol type NISCA protocol (60–07) hexadecimal.<br />
Length Number of data bytes in the datagram following the length field.<br />
F.7.5 Datagram Exchange (DX) Header<br />
The datagram exchange (DX) header for the <strong>OpenVMS</strong> <strong>Cluster</strong> protocol is used<br />
to address the data to the correct <strong>OpenVMS</strong> <strong>Cluster</strong> node. The DX header,<br />
shown in Figure F–8 and described in Table F–9, is 14 bytes long. It contains<br />
information that describes the <strong>OpenVMS</strong> <strong>Cluster</strong> connection between two nodes.<br />
See Section F.9.3 about methods of isolating data for the DX header.<br />
Figure F–8 DX Header<br />
[Diagram: following the LAN header, byte offsets 0, 6, 8, 14 delimit the<br />
Destination SCS Address (6 bytes), <strong>Cluster</strong> Group Number (2 bytes), and Source<br />
SCS Address (6 bytes) fields, followed by the data. (ZK−5923A−GE)]<br />
Table F–9 Fields in the DX Header<br />
Field Description<br />
Destination SCS address Manufactured using the address AA–00–04–00–remote-node-<br />
SCSSYSTEMID. Append the remote node’s SCSSYSTEMID<br />
system parameter value for the low-order 16 bits. This address<br />
represents the destination SCS transport address or the<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> multicast address.<br />
<strong>Cluster</strong> group number The cluster group number specified by the system manager. See<br />
Chapter 8 for more information about cluster group numbers.<br />
Source SCS address Represents the source SCS transport address and is<br />
manufactured using the address AA–00–04–00–local-node-<br />
SCSSYSTEMID. Append the local node’s SCSSYSTEMID system<br />
parameter value as the low-order 16 bits.<br />
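The address construction in Table F–9 can be sketched as follows. This is illustrative Python; the little-endian ordering of the two appended SCSSYSTEMID bytes is an assumption, made by analogy with the DECnet Phase IV address convention.

```python
# Sketch of building an SCS transport address per Table F-9: the fixed
# prefix AA-00-04-00 followed by the low-order 16 bits of the node's
# SCSSYSTEMID. The little-endian byte order is an assumption.

def scs_lan_address(scssystemid: int) -> str:
    low16 = scssystemid & 0xFFFF
    addr = bytes([0xAA, 0x00, 0x04, 0x00]) + low16.to_bytes(2, "little")
    return addr.hex("-").upper()

assert scs_lan_address(0x0000C15D) == "AA-00-04-00-5D-C1"
```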
F.7.6 Channel Control (CC) Header<br />
The channel control (CC) message is used to form and maintain working network<br />
paths between nodes in the <strong>OpenVMS</strong> <strong>Cluster</strong> system. The important fields for<br />
network troubleshooting are the datagram flags/type and the cluster password.<br />
Note that because the CC and TR headers occupy the same space, there is<br />
a TR/CC flag that identifies the type of message being transmitted over the<br />
channel. Figure F–9 shows the portions of the CC header needed for network<br />
troubleshooting, and Table F–10 describes these fields.<br />
Figure F–9 CC Header<br />
[Diagram: following the LAN and DX headers, byte 0 holds the Datagram<br />
Flags/Type field (1 byte) and bytes 38 through 45 hold the <strong>Cluster</strong> Password<br />
field (8 bytes); the intervening fields are not needed for troubleshooting.<br />
(ZK−5924A−GE)]<br />
Table F–10 Fields in the CC Header<br />
Field Description<br />
Datagram type (bits <3:0>) Identifies the type of message on the Channel Control level. The following table<br />
shows the datagrams and their functions.<br />
Value Abbreviated Datagram Type Expanded Datagram Type Function<br />
0 HELLO HELLO datagram message Multicast datagram that initiates the formation of a<br />
channel between cluster nodes and tests and maintains the existing channels. This<br />
datagram does not contain a valid cluster password.<br />
1 BYE Node-stop notification Datagram that signals the departure of a cluster node.<br />
2 CCSTART Channel start Datagram that starts the channel-formation handshake<br />
between two cluster nodes. This datagram is sent in response to receiving a HELLO<br />
datagram from an unknown LAN adapter address.<br />
3 VERF Verify Datagram that acknowledges the CCSTART datagram and continues<br />
the channel-formation handshake. The datagram is sent in response to receiving a<br />
CCSTART or SOLICIT_SRV datagram.<br />
4 VACK Verify acknowledge Datagram that completes the channel-formation<br />
handshake. The datagram is sent in response to receiving a VERF datagram.<br />
5 Reserved<br />
6 SOLICIT_SERVICE Solicit Datagram sent by a booting node to form a channel to<br />
its disk server. The server responds by sending a VERF, which forms the channel.<br />
7–15 Reserved<br />
Datagram flags (bits <7:4>) Provide additional information about the control datagram. The following bits are<br />
defined:<br />
• Bit <4> (AUTHORIZE)—Set to 1 if the cluster password field is valid.<br />
• Bit <5> (Reserved)—Set to 1.<br />
• Bit <6> (Reserved)—Set to 0.<br />
• Bit <7> (TR/CC flag)—Set to 1 to indicate the CC datagram.<br />
<strong>Cluster</strong> password Contains the cluster password.<br />
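As an illustration, the Datagram Flags/Type byte can be decoded as shown below, assuming the type occupies the low-order four bits and the flag bits occupy the high-order four bits. This is a Python sketch only, not PEDRIVER source.

```python
# Decode the CC header's Datagram Flags/Type byte, assuming type in the
# low nibble and flags in the high nibble. Illustrative sketch only.

CC_TYPES = {0: "HELLO", 1: "BYE", 2: "CCSTART", 3: "VERF",
            4: "VACK", 6: "SOLICIT_SERVICE"}

def decode_cc_flags_type(byte: int) -> dict:
    return {
        "type": CC_TYPES.get(byte & 0x0F, "Reserved"),
        "authorize": bool(byte & 0x10),   # bit <4>: password field valid
        "tr_cc": bool(byte & 0x80),       # bit <7>: 1 = CC datagram
    }

# CCSTART with AUTHORIZE set, the always-set reserved bit <5>, and the
# TR/CC flag indicating a CC datagram: 0xB2 = 1011 0010.
d = decode_cc_flags_type(0xB2)
assert d == {"type": "CCSTART", "authorize": True, "tr_cc": True}
```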
F.7.7 Transport (TR) Header<br />
The transport (TR) header is used to pass SCS datagrams and sequenced<br />
messages between cluster nodes. The important fields for network<br />
troubleshooting are the TR datagram flags, message acknowledgment, and<br />
sequence numbers. Note that because the CC and TR headers occupy the same<br />
space, a TR/CC flag identifies the type of message being transmitted over the<br />
channel.<br />
Figure F–10 shows the portions of the TR header that are needed for network<br />
troubleshooting, and Table F–11 describes these fields.<br />
Figure F–10 TR Header<br />
[Diagram: following the LAN and DX headers, byte 0 holds the Datagram Flags<br />
field, bytes 2 through 3 hold the Message Acknowledgment, bytes 4 through 5 the<br />
Sequence Number, bytes 10 through 13 the Extended Message Acknowledgment,<br />
and bytes 14 through 17 the Extended Sequence Number; the remaining fields are<br />
not needed for troubleshooting. (ZK−5925A−GE)]<br />
Note: The TR header shown in Figure F–10 is used when both nodes are running<br />
Version 1.4 or later of the NISCA protocol. If one or both nodes are running<br />
Version 1.3 or an earlier version of the protocol, then both nodes will use the<br />
message acknowledgment and sequence number fields in place of the extended<br />
message acknowledgment and extended sequence number fields, respectively.<br />
Table F–11 Fields in the TR Header<br />
Field Description<br />
Datagram flags (bits <7:0>) Provide additional information about the transport datagram.<br />
Value Abbreviated Datagram Type Expanded Datagram Type Function<br />
0 DATA Packet data Contains data to be delivered to the upper levels of software.<br />
1 SEQ Sequence flag Set to 1 if this is a sequenced message and the sequence<br />
number is valid.<br />
2 Reserved Set to 0.<br />
3 ACK Acknowledgment Set to 1 if the acknowledgment field is valid.<br />
4 RSVP Reply flag Set when an ACK datagram is needed immediately.<br />
5 REXMT Retransmission Set for all retransmissions of a sequenced message.<br />
6 Reserved Set to 0.<br />
7 TR/CC flag Transport flag Set to 0; indicates a TR datagram.<br />
Message acknowledgment An increasing value that specifies the last sequenced message segment received by<br />
the local node. All messages prior to this value are also acknowledged. This field<br />
is used when one or both nodes are running Version 1.3 or earlier of the NISCA<br />
protocol.<br />
Extended message acknowledgment An increasing value that specifies the last sequenced message segment received by<br />
the local node. All messages prior to this value are also acknowledged. This field is<br />
used when both nodes are running Version 1.4 or later of the NISCA protocol.<br />
Sequence number An increasing value that specifies the order of datagram transmission from the local<br />
node. This number is used to provide guaranteed delivery of this sequenced message<br />
segment to the remote node. This field is used when one or both nodes are running<br />
Version 1.3 or earlier of the NISCA protocol.<br />
Extended sequence number An increasing value that specifies the order of datagram transmission from the local<br />
node. This number is used to provide guaranteed delivery of this sequenced message<br />
segment to the remote node. This field is used when both nodes are running Version<br />
1.4 or later of the NISCA protocol.<br />
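For illustration, the flag bits listed in Table F–11 can be tested with simple masks. This is a Python sketch, not PEDRIVER source.

```python
# Name the TR-header flag bits from Table F-11 and report which are set.
# Illustrative sketch only.

SEQ, ACK, RSVP, REXMT, TR_CC = 1 << 1, 1 << 3, 1 << 4, 1 << 5, 1 << 7

def describe_tr_flags(flags: int) -> list:
    names = []
    if flags & SEQ:   names.append("SEQ")     # sequenced message
    if flags & ACK:   names.append("ACK")     # acknowledgment field valid
    if flags & RSVP:  names.append("RSVP")    # immediate ACK requested
    if flags & REXMT: names.append("REXMT")   # retransmission
    if flags & TR_CC: names.append("CC")      # CC rather than TR datagram
    return names

# A retransmitted sequenced message: SEQ and REXMT set, TR/CC clear.
assert describe_tr_flags(SEQ | REXMT) == ["SEQ", "REXMT"]
```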
F.8 Using a LAN Protocol Analysis Program<br />
Some failures, such as packet loss resulting from congestion, intermittent<br />
network interruptions of less than 20 seconds, problems with backup bridges,<br />
and intermittent performance problems, can be difficult to diagnose. Intermittent<br />
failures may require the use of a LAN analysis tool to isolate and troubleshoot<br />
the NISCA protocol levels described in Section F.1.<br />
As you evaluate the various network analysis tools currently available, you<br />
should look for certain capabilities when comparing LAN analyzers. The<br />
following sections describe the required capabilities.<br />
F.8.1 Single or Multiple LAN Segments<br />
Whether you need to troubleshoot problems on a single LAN segment or on<br />
multiple LAN segments, a LAN analyzer should help you isolate specific patterns<br />
of data. Choose a LAN analyzer that can isolate data matching unique patterns<br />
that you define. You should be able to define data patterns located in the data<br />
regions following the LAN header (described in Section F.7.2). In order to<br />
troubleshoot the NISCA protocol properly, a LAN analyzer should be able to<br />
match multiple data patterns simultaneously.<br />
To troubleshoot single or multiple LAN segments, you must minimally define and<br />
isolate transmitted and retransmitted data in the TR header (see Section F.7.7).<br />
Additionally, for effective network troubleshooting across multiple LAN segments,<br />
a LAN analysis tool should include the following functions:<br />
• A distributed enable function that allows you to synchronize multiple<br />
LAN analyzers that are set up at different locations so that they can capture<br />
information about the same event as it travels through the LAN configuration<br />
• A distributed combination trigger function that automatically triggers<br />
multiple LAN analyzers at different locations so that they can capture<br />
information about the same event<br />
The purpose of distributed enable and distributed combination trigger functions<br />
is to capture packets as they travel across multiple LAN segments. The<br />
implementations of these functions discussed in the following sections use<br />
multicast messages to reach all LAN segments of the extended LAN in the system<br />
configuration. By providing the ability to synchronize several LAN analyzers at<br />
different locations across multiple LAN segments, the distributed enable and<br />
combination trigger functions allow you to troubleshoot LAN configurations that<br />
span multiple sites over several miles.<br />
F.8.2 Multiple LAN Segments<br />
To troubleshoot multiple LAN segments, LAN analyzers must be able to capture<br />
the multicast packets and dynamically enable the trigger function of the LAN<br />
analyzer, as follows:<br />
Step Action<br />
1 Start capturing the data according to the rules specific to your LAN analyzer. Compaq<br />
recommends that only one LAN analyzer transmit a distributed enable multicast packet<br />
on the LAN. The packet must be transmitted according to the media access-control rules.<br />
2 Wait for the distributed enable multicast packet. When the packet is received, enable the<br />
distributed combination trigger function. Prior to receiving the distributed enable packet,<br />
all LAN analyzers must be able to ignore the trigger condition. This feature is required<br />
in order to set up multiple LAN analyzers capable of capturing the same event. Note that<br />
the LAN analyzer transmitting the distributed enable should not wait to receive it.<br />
3 Wait for an explicit (user-defined) trigger event or a distributed trigger packet. When the<br />
LAN analyzer receives either of these triggers, the LAN analyzer should stop the data<br />
capture.<br />
Prior to receiving either trigger, the LAN analyzer should continue to capture the<br />
requested data. This feature is required in order to allow multiple LAN analyzers to<br />
capture the same event.<br />
4 Once triggered, the LAN analyzer completes the distributed trigger function to stop the<br />
other LAN analyzers from capturing data related to the event that has already occurred.<br />
The <strong>HP</strong> 4972A LAN Protocol Analyzer, available from the Hewlett-Packard<br />
Company, is one example of a network failure analysis tool that provides the<br />
required functions described in this section.<br />
Reference: Section F.10 provides examples that use the <strong>HP</strong> 4972A LAN Protocol<br />
Analyzer.<br />
F.9 Data Isolation Techniques<br />
The following sections describe the types of data you should isolate when you use<br />
a LAN analysis tool to capture <strong>OpenVMS</strong> <strong>Cluster</strong> data between nodes and LAN<br />
adapters.<br />
F.9.1 All <strong>OpenVMS</strong> <strong>Cluster</strong> Traffic<br />
To isolate all <strong>OpenVMS</strong> <strong>Cluster</strong> traffic on a specific LAN segment, capture all the<br />
packets whose LAN header contains the protocol type 60–07.<br />
Reference: See also Section F.7.2 for a description of the LAN headers.<br />
F.9.2 Specific <strong>OpenVMS</strong> <strong>Cluster</strong> Traffic<br />
To isolate <strong>OpenVMS</strong> <strong>Cluster</strong> traffic for a specific cluster on a specific LAN<br />
segment, capture packets in which:<br />
• The LAN header contains the protocol type 60–07.<br />
• The DX header contains the cluster group number specific to that <strong>OpenVMS</strong><br />
<strong>Cluster</strong>.<br />
Reference: See Sections F.7.2 and F.7.5 for descriptions of the LAN and DX<br />
headers.<br />
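The group-code isolation rule above can be sketched as a short filter program. The following Python fragment is illustrative only (it is not an HP 4972A program); the byte offsets for the TYPE and LAVC_GROUP_CODE fields follow Table F–14 later in this appendix, and the example group code 12–34 is a placeholder.<br />

```python
# Sketch: isolate OpenVMS Cluster (NISCA) traffic in a raw Ethernet capture.
# Byte offsets follow Table F-14 (1-based in the manual, 0-based here).

PROTOCOL_TYPE = bytes([0x60, 0x07])  # NISCA protocol type 60-07

def is_cluster_frame(frame, group_code=None):
    """True if the frame carries NISCA traffic (optionally for one cluster)."""
    if len(frame) < 24:
        return False
    if frame[12:14] != PROTOCOL_TYPE:           # TYPE field (manual byte 13)
        return False
    if group_code is not None and frame[22:24] != group_code:
        return False                            # LAVC_GROUP_CODE (manual byte 23)
    return True

def isolate(frames, group_code=None):
    """Keep only the frames belonging to the cluster of interest."""
    return [f for f in frames if is_cluster_frame(f, group_code)]
```

A capture tool would apply `isolate` to raw frames; a real LAN analyzer performs the same comparison in its filter hardware.<br />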
F–26 Troubleshooting the NISCA Protocol
F.9.3 Virtual Circuit (Node-to-Node) Traffic<br />
To isolate virtual circuit traffic between a specific pair of nodes, capture packets<br />
in which the LAN header contains:<br />
• The protocol type 60–07<br />
• The destination SCS address<br />
• The source SCS address<br />
You can further isolate virtual circuit traffic between a specific pair of nodes to a<br />
specific LAN segment by capturing the following additional information from the<br />
DX header:<br />
• The cluster group code specific to that <strong>OpenVMS</strong> <strong>Cluster</strong><br />
• The destination SCS transport address<br />
• The source SCS transport address<br />
Reference: See Sections F.7.2 and F.7.5 for LAN and DX header information.<br />
F.9.4 Channel (LAN Adapter–to–LAN Adapter) Traffic<br />
To isolate channel information, capture all packet information on every<br />
channel between LAN adapters. The DX header contains information useful<br />
for diagnosing heavy communication traffic between a pair of LAN adapters.<br />
Capture packets in which the LAN header contains:<br />
• The destination LAN adapter address<br />
• The source LAN adapter address<br />
Because nodes can use multiple LAN adapters, specifying the source and<br />
destination LAN addresses may not capture all of the traffic for the node.<br />
Therefore, to isolate traffic on a specific channel, specify the channel as the<br />
pair of source and destination LAN adapter addresses.<br />
Reference: See Section F.7.2 for information about the LAN header.<br />
F.9.5 Channel Control Traffic<br />
To isolate channel control traffic, capture packets in which:<br />
• The LAN header contains the protocol type 60–07.<br />
• The TR/CC flag in the CC header datagram flags byte is set to 1.<br />
Reference: See Sections F.7.2 and F.7.6 for a description of the LAN and CC<br />
headers.<br />
F.9.6 Transport Data<br />
To isolate transport data, capture packets in which:<br />
• The LAN header contains the protocol type 60–07.<br />
• The TR/CC flag in the TR header datagram flags byte is set to 0.<br />
Reference: See Sections F.7.2 and F.7.7 for a description of the LAN and TR<br />
headers.<br />
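The TR/CC distinction in Sections F.9.5 and F.9.6 amounts to a single bit test on the datagram flags byte. In the following illustrative Python sketch, both the flags-byte offset and the bit mask are placeholders; take the actual CC and TR header layouts from Sections F.7.2, F.7.6, and F.7.7.<br />

```python
# Sketch: classify captured NISCA datagrams as channel-control vs. transport
# by testing the TR/CC flag. The offset and mask below are PLACEHOLDERS, not
# documented values; substitute the real field positions from the CC/TR
# header descriptions.

FLAGS_OFFSET = 14   # placeholder: first byte after a 14-byte LAN header
TRCC_MASK = 0x40    # placeholder mask for the TR/CC flag bit

def classify(frame):
    """Return 'channel-control' if the TR/CC flag is 1, else 'transport'."""
    return "channel-control" if frame[FLAGS_OFFSET] & TRCC_MASK else "transport"
```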
F.10 Setting Up an <strong>HP</strong> 4972A LAN Protocol Analyzer<br />
The <strong>HP</strong> 4972A LAN Protocol Analyzer, available from the Hewlett-Packard<br />
Company, is highlighted here because it meets all of the requirements listed<br />
in Section F.8. However, the <strong>HP</strong> 4972A LAN Protocol Analyzer is merely<br />
representative of the type of product useful for LAN network troubleshooting.<br />
Note: Use of this particular product as an example here should not be construed<br />
as a specific purchase requirement or endorsement.<br />
This section provides some examples of how to set up the <strong>HP</strong> 4972A LAN Protocol<br />
Analyzer to troubleshoot the local area <strong>OpenVMS</strong> <strong>Cluster</strong> system protocol for<br />
channel formation and retransmission problems.<br />
F.10.1 Analyzing Channel Formation Problems<br />
If you have a LAN protocol analyzer, you can set up filters to capture data related<br />
to the channel control header (described in Section F.7.6).<br />
You can trigger the LAN analyzer by using the following datagram fields:<br />
• Protocol type set to 60–07 hexadecimal<br />
• Correct cluster group number<br />
• TR/CC flag set to 1<br />
Then look for the HELLO, CCSTART, VERF, and VACK datagrams in the<br />
captured data. The CCSTART, VERF, VACK, and SOLICIT_SRV datagrams<br />
should have the AUTHORIZE bit set in the CC flags byte. Additionally,<br />
these messages should contain the scrambled cluster password (nonzero<br />
authorization field). You can find the scrambled cluster password and the<br />
cluster group number in the first four longwords of SYS$SYSTEM:CLUSTER_<br />
AUTHORIZE.DAT file.<br />
Reference: See Sections F.9.3 through F.9.5 for additional data isolation<br />
techniques.<br />
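As an illustration, the following Python sketch unpacks the first four longwords of a copy of the CLUSTER_AUTHORIZE.DAT file so that the cluster group number and scrambled password can be compared against captured datagrams. The order of the fields within those 16 bytes is not documented here, so the sketch simply returns all four longwords for inspection.<br />

```python
import struct

def first_longwords(data, count=4):
    """Unpack the first `count` little-endian longwords (VAX/Alpha byte
    order) from the raw file contents of CLUSTER_AUTHORIZE.DAT."""
    return struct.unpack("<%dL" % count, data[:4 * count])

# Typical use, on a copy of the file taken from SYS$SYSTEM:
# with open("CLUSTER_AUTHORIZE.DAT", "rb") as f:
#     longwords = first_longwords(f.read())
```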
F.10.2 Analyzing Retransmission Problems<br />
Using a LAN analyzer, you can trace datagrams as they travel across an<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> system, as described in Table F–12.<br />
Table F–12 Tracing Datagrams<br />
Step Action<br />
1 Trigger the analyzer using the following datagram fields:<br />
• Protocol type set to 60–07<br />
• Correct cluster group number<br />
• TR/CC flag set to 0<br />
• REXMT flag set to 1<br />
2 Use the distributed enable function to allow the same event to be captured by several<br />
LAN analyzers at different locations. The LAN analyzers should start the data capture,<br />
wait for the distributed enable message, and then wait for the explicit trigger event or the<br />
distributed trigger message. Once triggered, the analyzer should complete the distributed<br />
trigger function to stop the other LAN analyzers from capturing data.<br />
3 Once all the data is captured, locate the sequence number (for nodes running the NISCA<br />
protocol Version 1.3 or earlier) or the extended sequence number (for nodes running the<br />
NISCA protocol Version 1.4 or later) for the datagram being retransmitted (the datagram<br />
with the REXMT flag set). Then, search through the previously captured data for another<br />
datagram between the same two nodes (not necessarily the same LAN adapters) with the<br />
following characteristics:<br />
• Protocol type set to 60–07<br />
• Same DX header as the datagram with the REXMT flag set<br />
• TR/CC flag set to 0<br />
• REXMT flag set to 0<br />
• Same sequence number or extended sequence number as the datagram with the<br />
REXMT flag set<br />
4 The following techniques provide a way of searching for the problem’s origin.<br />
IF the datagram appears to be corrupt, THEN use the LAN analyzer to search in<br />
the direction of the source node for the cause of the corruption.<br />
IF the datagram appears to be correct, THEN search in the direction of the<br />
destination node to ensure that the datagram gets to its destination.<br />
IF the datagram arrives successfully at its destination LAN segment, THEN look<br />
for a TR packet from the destination node containing the sequence number (for<br />
nodes running the NISCA protocol Version 1.3 or earlier) or the extended sequence<br />
number (for nodes running the NISCA protocol Version 1.4 or later) in the message<br />
acknowledgment or extended message acknowledgment field. ACK datagrams have<br />
the following fields set:<br />
• Protocol type set to 60–07<br />
• Same DX header as the datagram with the REXMT flag set<br />
• TR/CC flag set to 0<br />
• ACK flag set to 1<br />
IF the acknowledgment was not sent, or a significant delay occurred between the<br />
reception of the message and the transmission of the acknowledgment, THEN look<br />
for a problem with the destination node and LAN adapter. Then follow the ACK<br />
packet through the network.<br />
IF the ACK arrives back at the node that sent the retransmission packet, THEN<br />
either of the following conditions may exist:<br />
• The retransmitting node is having trouble receiving LAN data.<br />
• The round-trip delay of the original datagram exceeded the estimated timeout<br />
value.<br />
You can verify the second possibility by using SDA and looking at the ReRcv field<br />
of the virtual circuit display of the system receiving the retransmitted datagram.<br />
Reference: See Example F–2 for an example of this type of SDA display.<br />
Reference: See Appendix G for more information about congestion control and<br />
PEDRIVER message retransmission.<br />
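Step 3 of Table F–12 (matching a retransmitted datagram back to its original transmission) can be sketched as follows. The Python fragment is illustrative; the record fields are assumed to have been decoded from the LAN, DX, and TR headers by other means.<br />

```python
# Sketch: given decoded datagram records, find the original transmission that
# matches a retransmitted one (same node pair, same sequence number, REXMT
# flag clear). Records are plain dicts with assumed keys.

def find_original(capture, rexmt):
    """Return the first non-retransmitted datagram matching `rexmt`, or None."""
    for dg in capture:
        if (dg["src_node"] == rexmt["src_node"]
                and dg["dst_node"] == rexmt["dst_node"]
                and dg["seq"] == rexmt["seq"]
                and not dg["rexmt"]):
            return dg
    return None
```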
F.11 Filters<br />
This section describes:<br />
• How to use the <strong>HP</strong> 4972A LAN Protocol Analyzer filters to isolate packets<br />
that have been retransmitted or that are specific to a particular <strong>OpenVMS</strong><br />
<strong>Cluster</strong>.<br />
• How to enable the distributed enable and trigger functions.<br />
F.11.1 Capturing All LAN Retransmissions for a Specific <strong>OpenVMS</strong> <strong>Cluster</strong><br />
Use the values shown in Table F–13 to set up a filter, named LAVc_TR_ReXMT,<br />
for all of the LAN retransmissions for a specific cluster. Fill in the value for the<br />
local area <strong>OpenVMS</strong> <strong>Cluster</strong> group code (nn–nn) to isolate a specific <strong>OpenVMS</strong><br />
<strong>Cluster</strong> on the LAN.<br />
Table F–13 Capturing Retransmissions on the LAN (LAVc_TR_ReXMT)<br />
Byte Number    Field    Value<br />
1 DESTINATION xx–xx–xx–xx–xx–xx<br />
7 SOURCE xx–xx–xx–xx–xx–xx<br />
13 TYPE 60–07<br />
23 LAVC_GROUP_CODE nn–nn<br />
31 TR FLAGS 0x1xxxxx (binary)<br />
33 ACKING MESSAGE xx–xx<br />
35 SENDING MESSAGE xx–xx<br />
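A filter such as LAVc_TR_ReXMT can be approximated in software as a list of byte-offset rules, as in the following illustrative Python sketch. Offsets are the 1-based byte numbers from Table F–13; fields shown as xx–xx (don’t care) are simply omitted from the rule list, and the group code 12–34 is a placeholder.<br />

```python
# Sketch: a generic byte-offset filter in the style of Table F-13. Each rule
# is (1-based byte offset, expected bytes); offsets are converted to 0-based.

LAVC_TR_REXMT = [
    (13, bytes([0x60, 0x07])),   # TYPE: NISCA protocol type 60-07
    (23, bytes([0x12, 0x34])),   # LAVC_GROUP_CODE: placeholder group nn-nn
]

def matches(frame, rules):
    """True if every rule's bytes appear at its offset in the frame."""
    return all(frame[off - 1:off - 1 + len(val)] == val for off, val in rules)
```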
F.11.2 Capturing All LAN Packets for a Specific <strong>OpenVMS</strong> <strong>Cluster</strong><br />
Use the values shown in Table F–14 to filter all of the LAN packets for a specific<br />
cluster. Fill in the value for <strong>OpenVMS</strong> <strong>Cluster</strong> group code (nn–nn) to isolate a<br />
specific <strong>OpenVMS</strong> <strong>Cluster</strong> on the LAN. The filter is named LAVc_all.<br />
Table F–14 Capturing All LAN Packets (LAVc_all)<br />
Byte Number    Field    Value<br />
1 DESTINATION xx–xx–xx–xx–xx–xx<br />
7 SOURCE xx–xx–xx–xx–xx–xx<br />
13 TYPE 60–07<br />
23 LAVC_GROUP_CODE nn–nn<br />
33 ACKING MESSAGE xx–xx<br />
35 SENDING MESSAGE xx–xx<br />
F.11.3 Setting Up the Distributed Enable Filter<br />
Use the values shown in Table F–15 to set up a filter, named Distrib_Enable,<br />
for the distributed enable packet received event. Use this filter to troubleshoot<br />
multiple LAN segments.<br />
Table F–15 Setting Up a Distributed Enable Filter (Distrib_Enable)<br />
Byte Number    Field    Value    ASCII<br />
1 DESTINATION 01–4C–41–56–63–45 .LAVcE<br />
7 SOURCE xx–xx–xx–xx–xx–xx<br />
13 TYPE 60–07 ‘.<br />
15 TEXT xx<br />
F.11.4 Setting Up the Distributed Trigger Filter<br />
Use the values shown in Table F–16 to set up a filter, named Distrib_Trigger,<br />
for the distributed trigger packet received event. Use this filter to troubleshoot<br />
multiple LAN segments.<br />
Table F–16 Setting Up the Distributed Trigger Filter (Distrib_Trigger)<br />
Byte Number    Field    Value    ASCII<br />
1 DESTINATION 01–4C–41–56–63–54 .LAVcT<br />
7 SOURCE xx–xx–xx–xx–xx–xx<br />
13 TYPE 60–07 ‘.<br />
15 TEXT xx<br />
F.12 Messages<br />
This section describes how to set up the distributed enable and distributed trigger<br />
messages.<br />
F.12.1 Distributed Enable Message<br />
Table F–17 shows how to define the distributed enable message (Distrib_Enable)<br />
by creating a new message. You must replace the source address (nn nn nn nn<br />
nn nn) with the LAN address of the LAN analyzer.<br />
Table F–17 Setting Up the Distributed Enable Message (Distrib_Enable)<br />
Field    Byte Number    Value    ASCII<br />
Destination 1 01 4C 41 56 63 45 .LAVcE<br />
Source 7 nn nn nn nn nn nn<br />
Protocol 13 60 07 ‘.<br />
Text 15 44 69 73 74 72 69 62 75 74 65 Distribute<br />
25 64 20 65 6E 61 62 6C 65 20 66 d enable f<br />
35 6F 72 20 74 72 6F 75 62 6C 65 or trouble<br />
45 73 68 6F 6F 74 69 6E 67 20 74 shooting t<br />
55 68 65 20 4C 6F 63 61 6C 20 41 he Local A<br />
65 72 65 61 20 56 4D 53 63 6C 75 rea VMSclu<br />
75 73 74 65 72 20 50 72 6F 74 6F ster Proto<br />
85 63 6F 6C 3A 20 4E 49 53 43 41 col: NISCA<br />
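For reference, the following illustrative Python sketch assembles the Distrib_Enable message from the field values in Table F–17 (multicast destination .LAVcE, protocol type 60–07, and the ASCII text shown above). The source address argument stands in for the LAN analyzer’s own address.<br />

```python
# Sketch: build the distributed enable message of Table F-17 as raw bytes.

def distrib_enable(source):
    """Return the Distrib_Enable frame; `source` is the analyzer's 6-byte
    LAN address (the nn nn nn nn nn nn field in Table F-17)."""
    dest = bytes([0x01, 0x4C, 0x41, 0x56, 0x63, 0x45])   # ".LAVcE" multicast
    proto = bytes([0x60, 0x07])                          # NISCA protocol type
    text = (b"Distributed enable for troubleshooting "
            b"the Local Area VMScluster Protocol: NISCA")
    return dest + source + proto + text
```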
F.12.2 Distributed Trigger Message<br />
Table F–18 shows how to define the distributed trigger message (Distrib_Trigger)<br />
by creating a new message. You must replace the source address (nn nn nn nn<br />
nn nn) with the LAN address of the LAN analyzer.<br />
Table F–18 Setting Up the Distributed Trigger Message (Distrib_Trigger)<br />
Field    Byte Number    Value    ASCII<br />
Destination 1 01 4C 41 56 63 54 .LAVcT<br />
Source 7 nn nn nn nn nn nn<br />
Protocol 13 60 07 ‘.<br />
Text 15 44 69 73 74 72 69 62 75 74 65 Distribute<br />
25 64 20 74 72 69 67 67 65 72 20 d trigger<br />
35 66 6F 72 20 74 72 6F 75 62 6C for troubl<br />
45 65 73 68 6F 6F 74 69 6E 67 20 eshooting<br />
55 74 68 65 20 4C 6F 63 61 6C 20 the Local<br />
65 41 72 65 61 20 56 4D 53 63 6C Area VMScl<br />
75 75 73 74 65 72 20 50 72 6F 74 uster Prot<br />
85 6F 63 6F 6C 3A 20 4E 49 53 43 ocol: NISC<br />
95 41 A<br />
F.13 Programs That Capture Retransmission Errors<br />
You can program the <strong>HP</strong> 4972A LAN Protocol Analyzer, as shown in the following<br />
source code, to capture retransmission errors. The starter program initiates the<br />
capture across all of the LAN analyzers. Only one LAN analyzer should run a<br />
copy of the starter program. Other LAN analyzers should run either the partner<br />
program or the scribe program. The partner program is used when the initial<br />
location of the error is unknown and when all analyzers should cooperate in<br />
the detection of the error. Use the scribe program to trigger on a specific LAN<br />
segment as well as to capture data from other LAN segments.<br />
F.13.1 Starter Program<br />
The starter program initially sends the distributed enable signal to the other LAN<br />
analyzers. Next, this program captures all of the LAN traffic and terminates either<br />
when this LAN analyzer detects a retransmitted packet or when it receives the<br />
distributed trigger sent from another LAN analyzer running the partner program.<br />
The starter program shown in the following example is used to initiate data<br />
capture on multiple LAN segments using multiple LAN analyzers. The goal is to<br />
capture the data during the same time interval on all of the LAN segments so<br />
that the reason for the retransmission can be located.<br />
Store: frames matching LAVc_all<br />
or Distrib_Enable<br />
or Distrib_Trigger<br />
ending with LAVc_TR_ReXMT<br />
or Distrib_Trigger<br />
Log file: not used<br />
Block 1: Enable_the_other_analyzers<br />
Send message Distrib_Enable<br />
and then<br />
Go to block 2<br />
Block 2: Wait_for_the_event<br />
When frame matches LAVc_TR_ReXMT then go to block 3<br />
Block 3: Send the distributed trigger<br />
Mark frame<br />
and then<br />
Send message Distrib_Trigger<br />
F.13.2 Partner Program<br />
The partner program waits for the distributed enable; then it captures all of<br />
the LAN traffic and terminates as a result of either a retransmission or the<br />
distributed trigger. Upon termination, this program transmits the distributed<br />
trigger to make sure that other LAN analyzers also capture the data at about<br />
the same time as when the retransmitted packet was detected on this segment<br />
or another segment. After the data capture completes, the data from multiple<br />
LAN segments can be reviewed to locate the initial copy of the data that was<br />
retransmitted. The partner program is shown in the following example:<br />
Store: frames matching LAVc_all<br />
or Distrib_Enable<br />
or Distrib_Trigger<br />
ending with Distrib_Trigger<br />
Log file: not used<br />
Block 1: Wait_for_distributed_enable<br />
When frame matches Distrib_Enable then go to block 2<br />
Block 2: Wait_for_the_event<br />
When frame matches LAVc_TR_ReXMT then go to block 3<br />
Block 3: Send the distributed trigger<br />
Mark frame<br />
and then<br />
Send message Distrib_Trigger<br />
F.13.3 Scribe Program<br />
The scribe program waits for the distributed enable and then captures all of<br />
the LAN traffic and terminates as a result of the distributed trigger. The scribe<br />
program allows a network manager to capture data at about the same time as<br />
when the retransmitted packet was detected on another segment. After the data<br />
capture has completed, the data from multiple LAN segments can be reviewed to<br />
locate the initial copy of the data that was retransmitted. The scribe program is<br />
shown in the following example:<br />
Store: frames matching LAVc_all<br />
or Distrib_Enable<br />
or Distrib_Trigger<br />
ending with Distrib_Trigger<br />
Log file: not used<br />
Block 1: Wait_for_distributed_enable<br />
When frame matches Distrib_Enable then go to block 2<br />
Block 2: Wait_for_the_event<br />
When frame matches LAVc_TR_ReXMT then go to block 3<br />
Block 3: Mark_the_frames<br />
Mark frame<br />
and then<br />
Go to block 2<br />
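The control flow shared by the three programs can be summarized as a small state machine. The following Python sketch is a loose model, not HP 4972A code: the event names mirror the Section F.11 filters, the starter begins armed, and the partner and scribe wait for the distributed enable.<br />

```python
# Sketch: the starter/partner/scribe control flow over a stream of frame
# events. The starter begins enabled; the others arm on Distrib_Enable.

def run(role, events):
    """Return the action taken when the capture ends, or None."""
    armed = (role == "starter")
    for ev in events:
        if not armed:
            armed = (ev == "Distrib_Enable")   # wait for distributed enable
            continue
        if ev == "LAVc_TR_ReXMT" and role in ("starter", "partner"):
            return "send Distrib_Trigger"      # stop the other analyzers
        if ev == "Distrib_Trigger":
            return "stop"                      # capture ends for everyone
    return None
```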
G<br />
NISCA Transport Protocol Channel Selection<br />
and Congestion Control<br />
G.1 NISCA Transmit Channel Selection<br />
This appendix describes PEDRIVER running on <strong>OpenVMS</strong> Version 7.3 (Alpha<br />
and VAX) and PEDRIVER running on earlier versions of <strong>OpenVMS</strong> Alpha and<br />
VAX.<br />
G.1.1 Multiple-Channel Load Distribution on <strong>OpenVMS</strong> Version 7.3 (Alpha and<br />
VAX) or Later<br />
While all available channels to a node can be used to receive datagrams from<br />
that node, not all channels are necessarily used to transmit datagrams to that<br />
node. The NISCA protocol chooses a set of equally desirable channels to be used<br />
for datagram transmission, from the set of all available channels to a remote<br />
node. This set of transmit channels is called the equivalent channel set (ECS).<br />
Datagram transmissions are distributed in round-robin fashion across all the<br />
ECS members, thus maximizing internode cluster communications throughput.<br />
G.1.1.1 Equivalent Channel Set Selection<br />
When multiple node-to-node channels are available, the <strong>OpenVMS</strong> <strong>Cluster</strong><br />
software bases the choice of which set of channels to use on the following criteria,<br />
which are evaluated in strict precedence order:<br />
1. Packet loss history<br />
Channels that have recently been losing LAN packets at a high rate are<br />
termed lossy and will be excluded from consideration. Channels that have<br />
an acceptable loss history are termed tight and will be further considered for<br />
use.<br />
2. Capacity<br />
Next, capacity criteria for the current set of tight channels are evaluated. The<br />
capacity criteria are:<br />
a. Priority<br />
Management priority values can be assigned both to individual channels<br />
and to local LAN devices. A channel’s priority value is the sum of these<br />
management-assigned priority values. Only tight channels with a priority<br />
value equal to, or one less than, the highest priority value of any tight<br />
channel will be further considered for use.<br />
b. Packet size<br />
Tight, equivalent-priority channels whose maximum usable packet size is<br />
equivalent to that of the largest maximum usable packet size of any tight<br />
equivalent-priority channel will be further considered for use.<br />
A channel that satisfies all of these capacity criteria is classified as a peer. A<br />
channel that is deficient with respect to any capacity criteria is classified as<br />
inferior. A channel that exceeds one or more of the current capacity criteria,<br />
and meets the other capacity criteria is classified as superior.<br />
Note that detection of a superior channel will immediately result in<br />
recalculation of the capacity criteria for membership. This recalculation<br />
will result in the superior channel’s capacity criteria becoming the ECS’s<br />
capacity criteria, against which all tight channels will be evaluated.<br />
Similarly, if the last peer channel becomes unavailable or lossy, the capacity<br />
criteria for ECS membership will be recalculated. This will likely result in<br />
previously inferior channels becoming classified as peers.<br />
Channels whose capacity values have not been evaluated against the current<br />
ECS membership capacity criteria will sometimes be classified as ungraded.<br />
Since they cannot affect the current ECS membership criteria, lossy channels<br />
are marked as ungraded as a computational expedient when a complete<br />
recalculation of ECS membership is being performed.<br />
3. Delay<br />
Channels that meet the preceding ECS membership criteria will be used if<br />
their average round-trip delays are closely matched to that of the fastest<br />
such channel—that is, they are fast. A channel that does not meet the ECS<br />
membership delay criteria is considered slow.<br />
The delay of each channel currently in the ECS is measured using cluster<br />
communications traffic sent using that channel. If a channel has not been<br />
used to send a datagram for a few seconds, its delay will be measured using<br />
a round-trip handshake. Thus, a lossy or slow channel will be measured at<br />
intervals of a few seconds to determine whether its delay, or datagram loss<br />
rate, has improved enough so that it meets the ECS membership criteria.<br />
Using the terminology introduced in this section, the ECS members are the<br />
current set of tight, peer, and fast channels.<br />
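The three-stage selection described above can be sketched in a few lines. The following Python fragment is illustrative; the field names and the delay-matching factor of 1.5 are assumptions, since PEDRIVER’s exact thresholds are not given here.<br />

```python
# Sketch: ECS selection per Section G.1.1.1. Channels are dicts with assumed
# fields; the 1.5x delay factor is an illustrative stand-in for "closely
# matched" round-trip delays.

def select_ecs(channels):
    # 1. Packet loss history: keep only tight (non-lossy) channels.
    tight = [c for c in channels if not c["lossy"]]
    if not tight:
        return []
    # 2a. Priority: highest priority, or one less than it.
    best_prio = max(c["priority"] for c in tight)
    eligible = [c for c in tight if c["priority"] >= best_prio - 1]
    # 2b. Packet size: largest maximum usable packet size among those.
    best_size = max(c["max_packet"] for c in eligible)
    peers = [c for c in eligible if c["max_packet"] == best_size]
    # 3. Delay: round-trip delay closely matched to the fastest peer.
    fastest = min(c["delay"] for c in peers)
    return [c for c in peers if c["delay"] <= 1.5 * fastest]
```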
G.1.1.2 Local and Remote LAN Adapter Load Distribution<br />
Once the ECS member channels are selected, they are ordered using an<br />
algorithm that attempts to arrange them so as to use all local adapters for<br />
packet transmissions before returning to reuse a local adapter. Also, the ordering<br />
algorithm attempts to do the same with all remote LAN adapters. Once the order<br />
is established, it is used round robin for packet transmissions.<br />
With these algorithms, PEDRIVER will make a best effort at utilizing multiple<br />
LAN adapters on a server node that communicates continuously with a client<br />
that also has multiple LAN adapters, as well as with a number of clients. In<br />
a two-node cluster, PEDRIVER will actively attempt to use all available LAN<br />
adapters that have usable LAN paths to the other node’s LAN adapters, and that<br />
have comparable capacity values. Thus, additional adapters provide both higher<br />
availability and alternative paths that can be used to avoid network congestion.<br />
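One way to picture the ordering algorithm is a greedy pass that cycles through the local adapters before any adapter repeats. The following Python sketch is illustrative: channels are modeled only as (local adapter, remote adapter) pairs, whereas PEDRIVER’s actual ordering also balances across remote adapters.<br />

```python
from collections import deque

# Sketch: order ECS channels so successive transmissions use each local
# adapter once before any local adapter is reused (Section G.1.1.2).

def order_channels(channels):
    by_local = {}
    for ch in channels:                      # bucket channels per local adapter
        by_local.setdefault(ch[0], deque()).append(ch)
    ordered, queues = [], deque(by_local.values())
    while queues:                            # take one channel per adapter in turn
        q = queues.popleft()
        ordered.append(q.popleft())
        if q:
            queues.append(q)
    return ordered
```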
G.1.2 Preferred Channel (<strong>OpenVMS</strong> Version 7.2 and Earlier)<br />
This section describes the transmit-channel selection algorithm used by <strong>OpenVMS</strong><br />
VAX and Alpha prior to <strong>OpenVMS</strong> Version 7.3.<br />
All available channels to a node can be used to receive datagrams from that<br />
node. PEDRIVER chooses a single channel on which to transmit datagrams, from<br />
the set of available channels to a remote node.<br />
The driver software chooses a transmission channel to each remote node. A<br />
selection algorithm for the transmission channel makes a best effort to ensure<br />
that messages are sent in the order they are expected to be received. Sending<br />
the messages in this way also maintains compatibility with previous versions of<br />
the operating system. The currently selected transmission channel is called the<br />
preferred channel.<br />
At any point in time, the TR level of the NISCA protocol can modify its choice of<br />
a preferred channel based on the following:<br />
Minimum measured incoming delay<br />
The NISCA protocol routinely measures HELLO message delays and uses<br />
these measurements to pick the most lightly loaded channel on which to send<br />
messages.<br />
Maximum datagram size<br />
PEDRIVER favors channels with large datagram sizes. For example, an<br />
FDDI-to-FDDI channel is favored over an FDDI-to-Ethernet channel or an<br />
Ethernet-to-Ethernet channel. If your configuration uses FDDI to Ethernet<br />
bridges, the PPC level of the NISCA protocol segments messages into the<br />
smaller Ethernet datagram sizes before transmitting them.<br />
PEDRIVER continually uses received HELLO messages to compute the incoming<br />
network delay value for each channel. Thus each channel’s incoming delay is<br />
recalculated at intervals of approximately 2 to 3 seconds. PEDRIVER assumes that the<br />
network is a broadcast medium (for example, an Ethernet wire or an FDDI ring) and,<br />
therefore, that incoming and outgoing delays are symmetric.<br />
PEDRIVER switches the preferred channel based on observed network delays or<br />
network component failures. Switching to a new transmission channel sometimes<br />
causes messages to be received out of the desired order. PEDRIVER uses a<br />
receive resequencing cache to reorder these messages instead of discarding them,<br />
which eliminates unnecessary retransmissions.<br />
With these algorithms, PEDRIVER has a greater chance of utilizing multiple<br />
adapters on a server node that communicates continuously with a number of<br />
clients. In a two-node cluster, PEDRIVER will actively use at most two LAN<br />
adapters: one to transmit and one to receive. Additional adapters provide both<br />
higher availability and alternative paths that can be used to avoid network<br />
congestion. As more nodes are added to the cluster, PEDRIVER is more likely to<br />
use the additional adapters.<br />
G.2 NISCA Congestion Control<br />
Network congestion occurs as the result of complex interactions of workload<br />
distribution and network topology, including the speed and buffer capacity of<br />
individual hardware components.<br />
Network congestion can have a negative impact on cluster performance in several<br />
ways:<br />
• Moderate levels of congestion can lead to increased queue lengths in network<br />
components (such as adapters and bridges) that in turn can lead to increased<br />
latency and slower response.<br />
• Higher levels of congestion can result in the discarding of packets because of<br />
queue overflow.<br />
• Packet loss can lead to packet retransmissions and, potentially, even more<br />
congestion. In extreme cases, packet loss can result in the loss of <strong>OpenVMS</strong><br />
<strong>Cluster</strong> connections.<br />
At the cluster level, these congestion effects appear as delays in cluster<br />
communications (for example, delays of lock transactions, served I/Os, and ICC<br />
messages). The user-visible effects of network congestion can be sluggish<br />
application response or loss of throughput.<br />
Thus, although a particular network component or protocol cannot guarantee the<br />
absence of congestion, the NISCA transport protocol implemented in PEDRIVER<br />
incorporates several mechanisms to mitigate the effects of congestion on<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> traffic and to avoid having cluster traffic exacerbate congestion<br />
when it occurs. These mechanisms affect the retransmission of packets carrying<br />
user data and the multicast HELLO datagrams used to maintain connectivity.<br />
G.2.1 Congestion Caused by Retransmission<br />
Associated with each virtual circuit from a given node is a transmission window<br />
size, which indicates the number of packets that can be outstanding to the remote<br />
node (for example, the number of packets that can be sent to the node at the<br />
other end of the virtual circuit before receiving an acknowledgment [ACK]).<br />
If the window size is 8 for a particular virtual circuit, then the sender can<br />
transmit up to 8 packets in a row but, before sending the ninth, must wait until<br />
receiving an ACK indicating that at least the first of the 8 has arrived.<br />
If an ACK is not received, a timeout occurs, the packet is assumed lost, and<br />
must be retransmitted. If another timeout occurs for a retransmitted packet,<br />
the timeout interval is significantly increased and the packet is retransmitted<br />
again. After a large number of consecutive retransmissions of the same packet<br />
has occurred, the virtual circuit will be closed.<br />
G.2.1.1 <strong>OpenVMS</strong> VAX Version 6.0 or <strong>OpenVMS</strong> AXP Version 1.5, or Later<br />
This section pertains to PEDRIVER running on <strong>OpenVMS</strong> VAX Version 6.0 or<br />
<strong>OpenVMS</strong> AXP Version 1.5, or later.<br />
The retransmission mechanism is an adaptation of the algorithms developed for<br />
the Internet TCP protocol by Van Jacobson and improves on the old mechanism<br />
by making both the window size and the retransmission timeout interval adapt to<br />
network conditions.<br />
• When a timeout occurs because of a lost packet, the window size is decreased<br />
immediately to reduce the load on the network. The window size is allowed<br />
to grow only after congestion subsides. More specifically, when a packet<br />
loss occurs, the window size is decreased to 1 and remains there, allowing<br />
the transmitter to send only one packet at a time until all the original<br />
outstanding packets have been acknowledged.<br />
After this occurs, the window is allowed to grow quickly until it reaches half<br />
its previous size. Once reaching the halfway point, the window size is allowed<br />
to increase relatively slowly to take advantage of available network capacity<br />
until it reaches a maximum value determined by the configuration variables<br />
(for example, the minimum of the number of adapter buffers and the remote<br />
node’s resequencing cache).<br />
• The retransmission timeout interval is set based on measurements of actual<br />
round-trip times, and the average variance from that average, for packets that<br />
are transmitted over the virtual circuit. This allows PEDRIVER to be more<br />
responsive to packet loss in most networks but avoids premature timeouts<br />
for networks in which the actual round-trip delay is consistently long. The<br />
algorithm can accomodate average delays of up to a few seconds.<br />
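Van Jacobson's method derives the timeout from a smoothed round-trip average plus a multiple of a smoothed mean deviation. The sketch below is a generic illustration of that calculation, not PEDRIVER code; the gains (1/8, 1/4), the factor of 4, and the clamp bounds are the classic textbook values, used here as assumptions:

```python
def update_rto(est, sample, min_rto=0.1, max_rto=4.0):
    # est holds the smoothed round-trip time ("srtt") and the smoothed
    # mean deviation ("rttvar"); each new measurement nudges both.
    if est["srtt"] is None:                      # first measurement
        est["srtt"], est["rttvar"] = sample, sample / 2
    else:
        err = sample - est["srtt"]
        est["srtt"] += err / 8                   # smooth the average
        est["rttvar"] += (abs(err) - est["rttvar"]) / 4  # smooth deviation
    # Timeout is the average plus a multiple of the deviation, clamped.
    return min(max(est["srtt"] + 4 * est["rttvar"], min_rto), max_rto)
```

Because the smoothed average tracks the measured delay, a network whose round trips are consistently long simply settles on a long timeout value instead of timing out prematurely.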
G.2.1.2 VMS Version 5.5 or Earlier<br />
This section pertains to PEDRIVER running on VMS Version 5.5 or earlier.<br />
• The window size is relatively static (usually 8, 16, or 31), and the<br />
retransmission policy assumes that all outstanding packets are lost and thus<br />
retransmits them all at once. Retransmitting an entire window of packets<br />
under congestion conditions tends to exacerbate the congestion significantly.<br />
• The timeout interval for determining that a packet is lost is fixed (3 seconds).<br />
This means that the loss of a single packet can interrupt communication<br />
between cluster nodes for as long as 3 seconds.<br />
G.2.2 HELLO Multicast Datagrams<br />
PEDRIVER periodically multicasts a HELLO datagram over each network<br />
adapter attached to the node. The HELLO datagram serves two purposes:<br />
• It informs other nodes of the existence of the sender so that they can form<br />
channels and virtual circuits.<br />
• It helps to keep communications open once they are established.<br />
HELLO datagram congestion and loss of HELLO datagrams can prevent<br />
connections from forming or cause connections to be lost. Table G–1 describes<br />
conditions causing HELLO datagram congestion and how PEDRIVER helps avoid<br />
the problems. The result is a substantial decrease in the probability of HELLO<br />
datagram synchronization and thus a decrease in HELLO datagram congestion.<br />
Table G–1 Conditions that Create HELLO Datagram Congestion<br />
Conditions that cause congestion How PEDRIVER avoids congestion<br />
If all nodes receiving a HELLO datagram from a<br />
new node responded immediately, the receiving<br />
network adapter on the new node could be<br />
overrun with HELLO datagrams and be forced<br />
to drop some, resulting in connections not being<br />
formed. This is especially likely in large clusters.<br />
To avoid this problem:<br />
• On VMS Version 5.5–2 or earlier, nodes that receive HELLO<br />
datagrams delay for a random time interval of up to 1 second<br />
before responding.<br />
• On <strong>OpenVMS</strong> VAX Version 6.0 or later, or <strong>OpenVMS</strong> AXP<br />
Version 1.5 or later, this random delay is a maximum of 2<br />
seconds to support large <strong>OpenVMS</strong> <strong>Cluster</strong> systems.<br />
If a large number of nodes in a network became<br />
synchronized and transmitted their HELLO<br />
datagrams at or near the same time, receiving<br />
nodes could drop some datagrams and time out<br />
channels.<br />
On nodes running VMS Version 5.5–2 or earlier, PEDRIVER<br />
multicasts HELLO datagrams over each adapter every 3 seconds,<br />
making HELLO datagram congestion more likely.<br />
On nodes running <strong>OpenVMS</strong> VAX Version 6.0 or later, or <strong>OpenVMS</strong><br />
AXP Version 1.5 or later, PEDRIVER prevents this form of<br />
HELLO datagram congestion by distributing its HELLO datagram<br />
multicasts randomly over time. A HELLO datagram is still<br />
multicast over each adapter approximately every 3 seconds but not<br />
over all adapters at once. Instead, if a node has multiple network<br />
adapters, PEDRIVER attempts to distribute its HELLO datagram<br />
multicasts so that it sends a HELLO datagram over some of its<br />
adapters during each second of the 3-second interval.<br />
In addition, rather than multicasting precisely every 3 seconds,<br />
PEDRIVER varies the time between HELLO datagram multicasts<br />
from approximately 1.6 to 3 seconds, reducing the average interval<br />
from 3 seconds to approximately 2.3 seconds.
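The effect of this randomization can be checked with a short calculation. Assuming a uniform draw between 1.6 and 3.0 seconds (the manual does not specify the distribution), the expected gap is the midpoint, (1.6 + 3.0) / 2 = 2.3 seconds:

```python
import random

def next_hello_delay(rng=random):
    # Each node picks its next inter-HELLO gap independently, which
    # keeps nodes from transmitting HELLO datagrams in lockstep.
    return rng.uniform(1.6, 3.0)

# The long-run average gap approaches the midpoint, about 2.3 seconds.
mean = sum(next_hello_delay() for _ in range(100_000)) / 100_000
```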
Index<br />
A<br />
Access control lists<br />
See ACLs<br />
ACLs (access control lists)<br />
building a common file, 5–17<br />
ACP_REBLDSYSD system parameter<br />
rebuilding system disks, 6–26<br />
Adapters<br />
booting from multiple LAN, 9–6<br />
AGEN$ files<br />
updating, 8–35<br />
AGEN$INCLUDE_PARAMS file, 8–11, 8–35<br />
AGEN$NEW_NODE_DEFAULTS.DAT file, 8–11,<br />
8–35<br />
AGEN$NEW_SATELLITE_DEFAULTS.DAT file,<br />
8–11, 8–35<br />
Allocation classes<br />
MSCP controllers, 6–7<br />
node, 6–6<br />
assigning value to computers, 6–9<br />
assigning value to HSC subsystems, 6–9<br />
assigning value to HSD subsystems, 6–10<br />
assigning value to HSJ subsystems, 6–10<br />
rules for specifying, 6–7<br />
using in a distributed environment, 6–29<br />
port, 6–6<br />
constraints removed by, 6–15<br />
reasons for using, 6–14<br />
SCSI devices, 6–6<br />
specifying, 6–17<br />
ALLOCLASS system parameter, 4–5, 6–9, 6–12<br />
ALPHAVMSSYS.PAR file<br />
created by CLUSTER_CONFIG.COM procedure,<br />
8–1<br />
ANALYZE/ERROR_LOG command<br />
error logging, C–24<br />
Applications<br />
shared, 5–1<br />
Architecture<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> systems, 2–1<br />
Asterisk ( * )<br />
as wildcard character<br />
in START/QUEUE/MANAGER command,<br />
7–2<br />
ATM (asynchronous transfer mode), 2–2<br />
configurations, 3–4<br />
Audit server databases, 5–18<br />
AUTHORIZE flag, F–23<br />
Authorize utility (AUTHORIZE), 1–11, B–1<br />
See also CLUSTER_AUTHORIZE.DAT files and<br />
Security management<br />
cluster common system authorization file, 5–6<br />
network proxy database, 5–18<br />
rights identifier database, 5–19<br />
system user authorization file, 5–20<br />
AUTOGEN.COM command procedure, 1–10<br />
building large <strong>OpenVMS</strong> cluster systems, 9–1,<br />
9–17<br />
cloning system disks, 9–16<br />
common parameter files, 8–11<br />
controlling satellite booting, 9–12<br />
enabling or disabling disk server, 8–20<br />
executed by CLUSTER_CONFIG.COM<br />
command procedure, 8–1<br />
running with feedback option, 8–40, 10–1<br />
SAVE_FEEDBACK option, 10–12<br />
specifying dump file, 10–12<br />
upgrading the <strong>OpenVMS</strong> operating system,<br />
10–3<br />
using with MODPARAMS.DAT, 6–9<br />
Autologin facility, 5–19<br />
Availability<br />
after LAN component failures, D–3<br />
booting from multiple LAN adapters, 9–6<br />
DSSI, 3–3<br />
of data, 2–5, 6–27<br />
of network, D–3<br />
of queue manager, 7–2<br />
B<br />
Backup utility (BACKUP), 1–11<br />
upgrading the operating system, 10–3<br />
Batch queues, 5–1, 7–1, 7–9<br />
See also Queues and Queue managers<br />
assigning unique name to, 7–10<br />
clusterwide generic, 7–10<br />
initializing, 7–10<br />
setting up, 7–8<br />
starting, 7–10<br />
SYS$BATCH, 7–10<br />
Index–1
Booting<br />
See also Satellite nodes, booting<br />
avoiding system disk rebuilds, 9–14<br />
computer on CI fails to boot, C–3<br />
computer on CI fails to join cluster, C–13<br />
minimizing boot time, 9–3<br />
nodes into existing <strong>OpenVMS</strong> <strong>Cluster</strong>, 8–35,<br />
8–41<br />
sequence of events, C–1<br />
Boot nodes<br />
See Boot servers<br />
Boot servers<br />
after configuration change, 8–33<br />
defining maximum DECnet address value, 4–7<br />
functions, 3–5<br />
rebooting a satellite, 8–39<br />
Broadcast messages, 10–9<br />
Buffer descriptor table entries, 9–18<br />
BYE datagram, F–23<br />
C<br />
Cables<br />
configurations, C–21<br />
troubleshooting CI problems, C–21<br />
CC protocol<br />
CC header, F–22<br />
part of NISCA transport protocol, F–3<br />
setting the TR/CC flag, F–23<br />
CCSTART datagram, F–16, F–23<br />
Channel Control protocol<br />
See CC protocol<br />
Channel formation<br />
acknowledging with VERF datagram, F–16,<br />
F–23<br />
BYE datagram, F–23<br />
completing with VACK datagram, F–16, F–23<br />
handshake, F–16<br />
HELLO datagrams, F–16, F–23<br />
multiple, F–19<br />
opening with CCSTART datagram, F–16<br />
problems, F–16<br />
Channels<br />
definition, F–4<br />
established and implemented by PEDRIVER,<br />
F–4<br />
CHECK_CLUSTER system parameter, A–1<br />
CI (computer interconnect), 2–2<br />
cable repair, C–23<br />
changing to mixed interconnect, 8–21<br />
communication path, C–18<br />
computers<br />
adding, 8–10<br />
failure to boot, C–3<br />
failure to join the cluster, C–13<br />
configurations, 3–2<br />
device-attention entry, C–26<br />
CI (computer interconnect) (cont’d)<br />
error-log entry, C–31<br />
analyzing, C–24<br />
formats, C–25<br />
error recovery, C–27, C–40<br />
logged-message entry, C–29<br />
MSCP server access to shadow sets, 6–29<br />
port<br />
loopback datagram facility, C–20<br />
polling, C–17<br />
verifying function, C–19<br />
troubleshooting, C–1<br />
CISCEs<br />
See Star couplers<br />
CLUEXIT bugcheck<br />
diagnosing, C–16<br />
<strong>Cluster</strong><br />
See <strong>OpenVMS</strong> <strong>Cluster</strong> systems<br />
<strong>Cluster</strong> aliases<br />
definition, 4–14, 4–15<br />
enabling operations, 4–16<br />
limits, 9–20<br />
<strong>Cluster</strong> authorization file<br />
See CLUSTER_AUTHORIZE.DAT files<br />
<strong>Cluster</strong> group numbers<br />
in DX header, F–22<br />
location of, F–28<br />
on extended LANs, 2–11, 3–4, 10–14<br />
setting, 2–12, 4–4, 10–15<br />
<strong>Cluster</strong> passwords, 2–11, 2–12, 4–4, F–16, F–23<br />
See also Security management<br />
error example, 2–12<br />
location of, F–28<br />
multiple LAN configurations, 3–4<br />
on extended LANs, 2–11, 10–14<br />
requested by CLUSTER_CONFIG.COM<br />
procedure, 8–7<br />
setting, 4–4, 10–15<br />
<strong>Cluster</strong>wide logical names<br />
database, 5–12<br />
defining, 5–12<br />
in applications, 5–11<br />
system management, 5–6<br />
<strong>Cluster</strong>wide process services, 1–9<br />
CLUSTER_AUTHORIZE.DAT files, 2–12, F–28<br />
See also Authorize utility (AUTHORIZE),<br />
<strong>Cluster</strong> group numbers, <strong>Cluster</strong> passwords,<br />
and Security management<br />
enabling LAN for cluster communications,<br />
8–20<br />
ensuring cluster integrity, 10–14<br />
multiple, 10–15<br />
troubleshooting MOP servers, C–9<br />
updating, 10–15<br />
verifying presence of <strong>OpenVMS</strong> <strong>Cluster</strong><br />
software, C–13
CLUSTER_AUTHORIZE.DAT files (cont’d)<br />
verifying the cluster security information,<br />
C–14<br />
CLUSTER_CONFIG.COM command procedure,<br />
6–18<br />
adding a quorum disk, 8–16<br />
change options, 8–20<br />
converting standalone computer to cluster<br />
computer, 8–24<br />
creating a duplicate system disk, 8–31<br />
disabling a quorum disk, 8–19<br />
enabling disk server, 6–20, 8–24<br />
enabling tape server, 8–28<br />
functions, 8–1<br />
modifying satellite LAN hardware address,<br />
8–21<br />
preparing to execute, 8–4<br />
removing computers, 8–17<br />
required information, 8–5<br />
system files created for satellites, 8–1<br />
CLUSTER_CONFIG_LAN.COM command<br />
procedure, 6–18<br />
functions, 8–1<br />
CLUSTER_CREDITS system parameter, 9–18,<br />
A–1<br />
CLUSTER_SERVER process<br />
initializing clusterwide logical name database,<br />
5–12<br />
CLUSTER_SHUTDOWN option, 10–10<br />
/CLUSTER_SHUTDOWN qualifier, 8–36<br />
Communications<br />
channel-formation problems, F–16<br />
mechanisms, 1–5<br />
PEDRIVER, F–4<br />
retransmission problems, F–17<br />
SCS interprocessor, 1–4<br />
troubleshooting NISCA protocol levels, F–16<br />
Computer interconnect<br />
See CI and Interconnects<br />
Computers<br />
removing from cluster, 8–17<br />
Configurations, 3–1<br />
changing from CI or DSSI to mixed<br />
interconnect, 8–21<br />
changing from local area to mixed interconnect,<br />
8–22<br />
CI, 3–1<br />
DECnet, 4–13<br />
DSSI, 3–3<br />
FDDI network, 3–8<br />
Fibre Channel, 3–7<br />
guidelines for growing your <strong>OpenVMS</strong> <strong>Cluster</strong>,<br />
9–1<br />
LAN, 3–4<br />
MEMORY CHANNEL, 3–10<br />
of shadow sets, 6–27<br />
reconfiguring, 8–33<br />
recording data, 10–3<br />
Configurations (cont’d)<br />
SCSI, 3–12<br />
Congestion control<br />
in NISCA Transport Protocol, G–3<br />
in PEDRIVER, G–3<br />
retransmissions, G–4<br />
Connection manager, 1–4<br />
overview, 2–5<br />
state transitions, 2–9<br />
Controllers<br />
dual-pathed devices, 6–3<br />
dual-ported devices, 6–2<br />
HSx storage subsystems, 1–4<br />
Convert utility (CONVERT)<br />
syntax, B–3<br />
using to merge SYSUAF.DAT files, B–2<br />
Credit waits, 9–18<br />
$CREPRC system service, 2–14<br />
CWCREPRC_ENABLE system parameter, A–1<br />
D<br />
Data availability<br />
See Availability<br />
Datagram Exchange protocol<br />
See DX protocol<br />
Datagrams<br />
ACK flag, F–24<br />
AUTHORIZE flag, F–23<br />
BYE, F–23<br />
CC header, F–22<br />
CCSTART, F–16, F–23<br />
DATA flag, F–24<br />
DX header, F–21<br />
Ethernet headers, F–20<br />
FDDI headers, F–20<br />
flags, F–23<br />
format of the NISCA protocol packet, F–19<br />
HELLO, F–16, F–23<br />
multicast, F–23<br />
NISCA, F–19<br />
reserved flag, F–23, F–24<br />
retransmission problems, F–17<br />
REXMT flag, F–24<br />
RSVP flag, F–24<br />
SEQ flag, F–24<br />
TR/CC flag, F–23<br />
TR flags, F–24<br />
TR header, F–23<br />
VACK, F–16, F–23<br />
VERF, F–16, F–23<br />
Data integrity<br />
connection manager, 2–9<br />
Debugging<br />
satellite booting, C–5<br />
DECamds<br />
operations management, 1–9, 10–22<br />
DECbootsync, 9–10<br />
controlling workstation startup, 9–10<br />
DECdtm services<br />
creating a transaction log, 8–9, 8–24<br />
determining computer use of, 8–3<br />
removing a node, 8–3, 8–18<br />
DECelms software<br />
monitoring LAN traffic, 10–23<br />
DECmcc software<br />
monitoring LAN traffic, 10–23<br />
DECnet/OSI<br />
See DECnet software<br />
DECnet for <strong>OpenVMS</strong><br />
See DECnet software<br />
DECnet–Plus<br />
See DECnet software<br />
DECnet software<br />
cluster alias, 4–14, 4–15, 4–16<br />
limits, 9–20<br />
cluster satellite pseudonode name, 9–7<br />
configuring, 4–13<br />
disabling LAN device, 4–13<br />
downline loading, 9–9<br />
enabling circuit service for cluster MOP server,<br />
4–7<br />
installing network license, 4–5<br />
LAN network troubleshooting, D–3<br />
making databases available clusterwide, 4–13<br />
making remote node data available clusterwide,<br />
4–13<br />
maximum address value<br />
defining for cluster boot server, 4–7<br />
modifying satellite LAN hardware address,<br />
8–21<br />
monitoring LAN activity, 10–23<br />
NCP (Network Control Program), 4–14<br />
NETNODE_REMOTE.DAT file<br />
renaming to SYS$COMMON directory,<br />
4–13<br />
network cluster functions, 1–7<br />
restoring satellite configuration data, 10–4<br />
starting, 4–15<br />
tailoring, 4–7<br />
DECram software<br />
improving performance, 9–15<br />
Device drivers<br />
loading, 5–15<br />
port, 1–5<br />
Device names<br />
cluster, 6–6<br />
RAID Array 210 and 230, 6–13<br />
SCSI, 6–6<br />
Devices<br />
dual-pathed, 6–3<br />
dual-ported, 6–2<br />
floppy disk<br />
naming with port allocation classes, 6–16<br />
Devices (cont’d)<br />
IDE<br />
naming with port allocation classes, 6–16<br />
PCI RAID<br />
naming with port allocation classes, 6–16<br />
port error-log entries, C–24<br />
SCSI support, 6–28<br />
shared disks, 6–24<br />
types of interconnect, 1–3<br />
DEVICE_NAMING system parameter, 6–18<br />
Digital Storage Architecture<br />
See DSA<br />
Digital Storage <strong>Systems</strong> Interconnect<br />
See DSSI<br />
Directories<br />
system, 5–5<br />
Directory structures<br />
on common system disk, 5–4<br />
Disaster Tolerant <strong>Cluster</strong> Services for <strong>OpenVMS</strong>,<br />
1–1<br />
Disaster-tolerant <strong>OpenVMS</strong> <strong>Cluster</strong> systems, 1–1<br />
Disk class drivers, 1–4, 2–15<br />
Disk controllers, 1–4<br />
Disk mirroring<br />
See Volume shadowing and Shadow sets<br />
Disks<br />
altering label, 8–38<br />
assigning allocation classes, 6–9<br />
cluster-accessible, 1–3, 6–1<br />
storing common procedures on, 5–15<br />
configuring, 6–25<br />
data, 5–1<br />
dual-pathed, 6–2, 6–3, 6–11<br />
setting up, 5–15<br />
dual-ported, 6–2<br />
local<br />
clusterwide access, 6–20<br />
setting up, 5–15<br />
managing, 6–1<br />
mounting<br />
clusterwide, 8–32<br />
MSCPMOUNT.COM file, 6–25<br />
node allocation class, 6–7<br />
quorum, 2–7<br />
rebuilding, 6–26<br />
rebuild operation, 6–26<br />
restricted access, 6–1<br />
selecting server, 8–4<br />
served by MSCP, 6–2, 6–20<br />
shared, 1–3, 5–1, 6–24<br />
mounting, 6–24, 6–25<br />
system, 5–1<br />
avoiding rebuilds, 9–14<br />
backing up, 8–32<br />
configuring in large cluster, 9–13, 9–14<br />
configuring multiple, 9–15<br />
controlling dump files, 9–17
Disks<br />
system (cont’d)<br />
creating duplicate, 8–31<br />
directory structure, 5–4<br />
dismounting, 6–19<br />
mixed architecture, 4–1, 10–8<br />
mounting clusterwide, 8–33<br />
moving high-activity files, 9–14<br />
rebuilding, 6–26<br />
shadowing across an <strong>OpenVMS</strong> <strong>Cluster</strong>,<br />
6–30<br />
troubleshooting I/O bottlenecks, 10–21<br />
Disk servers<br />
configuring LAN adapter, 9–4<br />
configuring memory, 9–4<br />
functions, 3–5<br />
MSCP on LAN configurations, 3–5<br />
selecting, 8–4<br />
troubleshooting, C–10<br />
DISK_QUORUM system parameter, 2–8, A–2<br />
Distributed combination trigger function, F–19,<br />
F–25<br />
filter, F–32<br />
message, F–33<br />
Distributed enable function, F–19, F–25<br />
filter, F–31<br />
message, F–32<br />
Distributed file system, 1–4, 2–15<br />
Distributed job controller, 1–4, 2–16<br />
separation from queue manager, 7–1<br />
Distributed lock manager, 1–4, 2–13<br />
DECbootsync use of, 9–10<br />
device names, 6–6<br />
inaccessible cluster resource, C–16<br />
LOCKDIRWT system parameter, A–2<br />
lock limit, 2–14<br />
Distributed processing, 5–1, 7–1<br />
Distrib_Enable filter<br />
<strong>HP</strong> 4972 LAN Protocol Analyzer, F–31<br />
Distrib_Trigger filter<br />
<strong>HP</strong> 4972 LAN Protocol Analyzer, F–32<br />
DKDRIVER, 2–2<br />
Drivers<br />
DKDRIVER, 2–2<br />
DSDRIVER, 1–4, 2–15<br />
DUDRIVER, 1–4, 2–2, 2–15<br />
Ethernet E*driver, 2–2<br />
FDDI F*driver, 2–2<br />
load balancing, 6–23<br />
MCDRIVER, 2–2<br />
PADRIVER, C–17, C–18<br />
PBDRIVER, C–18<br />
PEDRIVER, 2–2, C–18, G–1<br />
PIDRIVER, 2–2, C–18<br />
PK*DRIVER, 2–2<br />
PMDRIVER, 2–2<br />
PNDRIVER, 2–2<br />
port, 1–5<br />
Drivers (cont’d)<br />
TUDRIVER, 1–4, 2–2, 2–16<br />
DR_UNIT_BASE system parameter, 6–13, A–2<br />
DSA (Digital Storage Architecture)<br />
disks and tapes in <strong>OpenVMS</strong> <strong>Cluster</strong>, 1–3<br />
served devices, 6–20<br />
served tapes, 6–20<br />
support for compliant hardware, 6–27<br />
DSDRIVER, 1–4, 2–15<br />
load balancing, 6–23<br />
DSSI (DIGITAL Storage <strong>Systems</strong> Interconnect),<br />
2–2<br />
changing allocation class values on DSSI<br />
subsystems, 6–10, 8–38<br />
changing to mixed interconnect, 8–21<br />
configurations, 3–3<br />
ISE peripherals, 3–3<br />
MSCP server access to shadow sets, 6–29<br />
DTCS<br />
See Disaster Tolerant <strong>Cluster</strong> Services for<br />
<strong>OpenVMS</strong><br />
DUDRIVER, 1–4, 2–2, 2–15<br />
load balancing, 6–23<br />
DUMPFILE AUTOGEN symbol, 10–12<br />
Dump files<br />
in large clusters, 9–17<br />
managing, 10–12<br />
DUMPSTYLE AUTOGEN symbol, 10–12<br />
DX protocol<br />
DX header, F–21<br />
part of NISCA transport protocol, F–3<br />
E<br />
ENQLM process limit, 2–14<br />
Error Log utility (ERROR LOG)<br />
invoking, C–24<br />
Errors<br />
capturing retransmission, F–33<br />
CI port recovery, C–40<br />
fatal errors detected by data link, C–36, E–4<br />
recovering from CI port, C–27<br />
returned by SYS$LAVC_DEFINE_NET_<br />
COMPONENT subroutine, E–6<br />
returned by SYS$LAVC_DEFINE_NET_PATH<br />
subroutine, E–8<br />
returned by SYS$LAVC_DISABLE_ANALYSIS<br />
subroutine, E–10<br />
returned by SYS$LAVC_ENABLE_ANALYSIS<br />
subroutine, E–9<br />
returned by SYS$LAVC_START_BUS<br />
subroutine, E–2<br />
returned by SYS$LAVC_STOP_BUS subroutine,<br />
E–4<br />
stopping NISCA protocol on LAN adapters,<br />
D–2<br />
stopping the LAN on all LAN adapters, C–36,<br />
E–4<br />
Errors (cont’d)<br />
when stopping the NISCA protocol, D–2<br />
Ethernet, 2–2<br />
configurations, 3–4<br />
configuring adapter, 9–4<br />
error-log entry, C–31<br />
hardware address, 8–5<br />
header for datagrams, F–20<br />
large-packet support, 10–16<br />
monitoring activity, 10–23<br />
MSCP server access to shadow sets, 6–29<br />
port, C–18<br />
setting up LAN analyzer, F–28<br />
Ethernet E*driver physical device driver, 2–2<br />
EXPECTED_VOTES system parameter, 2–6,<br />
8–11, 8–33, 8–35, 10–19, A–2<br />
Extended message acknowledgment, F–24<br />
Extended sequence numbers<br />
for datagram flags, F–25<br />
F<br />
Failover<br />
dual-host <strong>OpenVMS</strong> <strong>Cluster</strong> configuration, 3–3<br />
dual-ported disks, 6–3<br />
dual-ported DSA tape, 6–13<br />
preferred disk, 6–5<br />
FDDI (Fiber Distributed Data Interface), 2–2<br />
configurations, 3–4<br />
configuring adapter, 9–4<br />
error-log entry, C–31<br />
hardware address, 8–5<br />
header for datagrams, F–20<br />
influence of LRPSIZE on, 10–16<br />
large-packet support, 10–16<br />
massively distributed shadowing, 6–29<br />
monitoring activity, 10–23<br />
port, C–18<br />
use of priority field, 10–16<br />
FDDI F*driver physical device driver, 2–2<br />
Feedback option, 8–40<br />
Fiber Distributed Data Interface<br />
See FDDI<br />
Fibre Channel interconnect, 2–2, 3–13<br />
Files<br />
See also Dump files<br />
cluster-accessible, 6–1<br />
security, 5–17<br />
shared, 5–20<br />
startup command procedure, 5–13<br />
system<br />
clusterwide coordination, 5–22<br />
moving off system disk, 9–14<br />
File system<br />
distributed, 1–4, 2–15<br />
Filters<br />
distributed enable, F–31<br />
distributed trigger, F–32<br />
Filters (cont’d)<br />
<strong>HP</strong> 4972 LAN Protocol Analyzer, F–30<br />
LAN analyzer, F–30<br />
local area <strong>OpenVMS</strong> <strong>Cluster</strong> packet, F–31<br />
local area <strong>OpenVMS</strong> <strong>Cluster</strong> retransmission,<br />
F–31<br />
Flags<br />
ACK transport datagram, F–24<br />
AUTHORIZE datagram flag in CC header,<br />
F–23<br />
datagram flags field, F–24<br />
DATA transport datagram, F–24<br />
in the CC datagram, F–23<br />
reserved, F–23, F–24<br />
REXMT datagram, F–24<br />
RSVP datagram, F–24<br />
SEQ datagram, F–24<br />
TR/CC datagram, F–23<br />
G<br />
Galaxy configurations, 1–3<br />
Galaxy Configuration Utility (GCU), 1–9<br />
$GETSYI system service, 5–11<br />
Graphical Configuration Manager (GCM), 1–9<br />
Group numbers<br />
See <strong>Cluster</strong> group numbers<br />
H<br />
Hang conditions<br />
diagnosing, C–15<br />
Hardware components, 1–2<br />
Headers<br />
CC, F–22<br />
DX, F–21<br />
Ethernet, F–20<br />
FDDI, F–20<br />
TR, F–23<br />
HELLO datagram, F–16, F–23<br />
congestion, G–5<br />
Hierarchical storage controller subsystems<br />
See HSC subsystems<br />
<strong>HP</strong> 4972 LAN Protocol Analyzer, F–30<br />
HSC subsystems, 1–4<br />
assigning allocation classes, 6–9<br />
changing allocation class values, 6–9, 8–38<br />
dual-pathed disks, 6–11<br />
dual-ported devices, 6–2, 6–4<br />
served devices, 6–20<br />
HSD subsystems, 1–4<br />
assigning allocation classes, 6–10<br />
HSJ subsystems, 1–4<br />
assigning allocation classes, 6–10<br />
changing allocation class values, 6–10, 8–38<br />
dual-ported devices, 6–4
HSZ subsystems, 1–4<br />
I<br />
Installation procedures<br />
layered products, 4–6<br />
Integrated storage elements<br />
See ISEs<br />
Interconnects, 1–3, 2–2, 3–1<br />
See also Configurations<br />
Interprocessor communication, 6–29<br />
IO$_SETPRFPATH $QIO function, 6–5<br />
ISEs (integrated storage elements)<br />
peripherals, 3–3<br />
use in an <strong>OpenVMS</strong> <strong>Cluster</strong>, 1–3<br />
J<br />
Job controller<br />
See Distributed job controller<br />
L<br />
LAN$DEVICE_DATABASE.DAT file, 4–9<br />
LAN$NODE_DATABASE.DAT file, 4–9<br />
LAN$POPULATE.COM, 4–10<br />
LAN adapters<br />
BYE datagram, F–23<br />
capturing traffic data on, F–27<br />
datagram flags, F–24<br />
overloaded, F–12<br />
selecting, 4–13<br />
sending CCSTART datagram, F–16, F–23<br />
sending HELLO datagrams, F–16, F–23<br />
stopping, C–36, E–3<br />
stopping NISCA protocol, D–2<br />
VACK datagram, F–16, F–23<br />
VERF datagram, F–16, F–23<br />
LAN analyzers, F–19, F–30 to F–34<br />
analyzing retransmission errors, F–33<br />
distributed enable filter, F–31<br />
distributed enable messages, F–32<br />
distributed trigger filter, F–32<br />
distributed trigger messages, F–33<br />
filtering retransmissions, F–31<br />
packet filter, F–31<br />
scribe program, F–34<br />
starter program, F–33<br />
LAN bridges<br />
use of FDDI priority field, 10–16<br />
LAN Control Program (LANCP) utility, 1–10<br />
booting cluster satellites, 4–7, 4–9<br />
LAN$DEVICE_DATABASE.DAT file, 4–9<br />
LAN$NODE_DATABASE.DAT file, 4–9<br />
LANCP<br />
See LAN Control Program (LANCP) utility<br />
LAN path<br />
See NISCA transport protocol<br />
See PEDRIVER<br />
load distribution, G–1<br />
LAN protocol analysis program, F–19<br />
troubleshooting NISCA protocol levels, F–25<br />
LANs (local area networks)<br />
alternate adapter booting, 9–6<br />
analyzing retransmission errors, F–33<br />
capturing retransmissions, F–31, F–33<br />
changing to mixed interconnect, 8–22<br />
configurations, 3–4<br />
configuring adapter, 4–13, 9–4<br />
controlling with sample programs, D–1<br />
creating a network component list, E–7<br />
creating a network component representation,<br />
E–5<br />
data capture, F–26<br />
debugging satellite booting, C–5<br />
device-attention entry, C–27<br />
distributed enable messages, F–32<br />
distributed trigger messages, F–33<br />
downline loading, 9–9<br />
drivers in PI protocol, F–3<br />
enabling data capture, F–31<br />
error-log entry, C–31<br />
Ethernet troubleshooting, F–28<br />
hardware address, 8–5<br />
LAN address for satellite, 9–7<br />
LAN control subroutines, E–1<br />
large-packet support for FDDI, 10–16<br />
LRPSIZE system parameter, 10–16<br />
maximum packet size, 10–16<br />
monitoring LAN activity, 10–23<br />
multiple paths, G–1<br />
network failure analysis, C–15, D–3<br />
NISCA protocol, F–19<br />
NISCA Protocol, G–1<br />
NISCS_CONV_BOOT system parameter, C–5<br />
OPCOM messages, D–10<br />
path selection and congestion control appendix,<br />
G–1<br />
port, C–18<br />
required tools for troubleshooting, F–25<br />
sample programs, D–2, D–3<br />
satellite booting, 4–7, 9–5, 9–6, 9–7<br />
single-adapter booting, 9–5<br />
starting network failure analysis, E–9<br />
starting NISCA protocol, D–1<br />
on LAN adapters, E–1<br />
stopping network failure analysis, E–10<br />
stopping NISCA protocol, D–2<br />
on LAN adapters, C–36, E–3<br />
stopping on all LAN adapters, E–3<br />
subroutine package, E–1, E–3, E–5, E–7, E–9,<br />
E–10<br />
transmit path selection, G–1<br />
LANs (local area networks) (cont’d)<br />
troubleshooting NISCA communications, F–16<br />
LAN Traffic Monitor<br />
See LTM<br />
LAVC$FAILURE_ANALYSIS.MAR program, F–30<br />
to F–34<br />
distributed enable filter, F–31<br />
distributed enable messages, F–32<br />
distributed trigger filter, F–32<br />
distributed trigger messages, F–33<br />
filtering LAN packets, F–31<br />
filtering LAN retransmissions, F–31<br />
filters, F–30<br />
partner program, F–34<br />
retransmission errors, F–33<br />
sample program, D–3<br />
scribe program, F–34<br />
starter program, F–33<br />
LAVC$START_BUS.MAR sample program, D–1<br />
LAVC$STOP_BUS.MAR sample program, D–2<br />
%LAVC-I-ASUSPECT OPCOM message, D–10<br />
LAVc protocol<br />
See NISCA transport protocol<br />
See PEDRIVER<br />
%LAVC-S-WORKING OPCOM message, D–10<br />
%LAVC-W-PSUSPECT OPCOM message, D–10<br />
LAVc_all filter<br />
<strong>HP</strong> 4972 LAN Protocol Analyzer, F–31<br />
LAVc_TR_ReXMT filter<br />
<strong>HP</strong> 4972 LAN Protocol Analyzer, F–31<br />
Layered products<br />
installing, 4–6<br />
License database, 4–6<br />
Licenses<br />
DECnet, 4–5<br />
installing, 4–5<br />
<strong>OpenVMS</strong> <strong>Cluster</strong> systems, 4–5<br />
<strong>OpenVMS</strong> operating system, 4–5<br />
LMF (License Management Facility), 1–9<br />
Load balancing<br />
determining failover target, 6–5<br />
devices served by MSCP, 6–22<br />
MSCP I/O, 6–22<br />
dynamic, 6–22, 6–23<br />
static, 6–22, 6–23<br />
queue database files, 7–4<br />
queues, 5–1, 7–1<br />
Load capacity ratings, 6–23<br />
Load file<br />
satellite booting, 9–5<br />
Local area networks<br />
See LANs<br />
Local area <strong>OpenVMS</strong> <strong>Cluster</strong> environments<br />
capturing distributed trigger event, F–32<br />
debugging satellite booting, C–1, C–5<br />
network failure analysis, C–24<br />
Lock database, 2–9<br />
LOCKDIRWT system parameter, A–2<br />
used to control lock resource trees, 2–14<br />
Lock manager<br />
See Distributed lock manager<br />
Logical names<br />
clusterwide<br />
in applications, 5–11<br />
invalid, 5–10<br />
system management, 5–6<br />
defining<br />
for QMAN$MASTER.DAT, 5–23<br />
system, 5–5<br />
LOGINOUT<br />
determining process quota values, 10–17<br />
Logins<br />
controlling, 5–22<br />
disabling, 4–7<br />
LRPSIZE system parameter, 10–16, A–2<br />
LTM (LAN Traffic Monitor), 10–23<br />
M<br />
Macros<br />
NISCA, F–19<br />
Maintenance Operations Protocol<br />
See MOP servers<br />
MASSBUS disks<br />
dual-ported, 6–2<br />
Mass storage control protocol servers<br />
See MSCP servers<br />
MCDRIVER, 2–2<br />
Members<br />
definition, 1–3<br />
managing<br />
cluster group numbers, 2–11<br />
cluster passwords, 2–11<br />
cluster security, 5–16<br />
state transitions, 1–4<br />
shadow set, 6–27<br />
MEMORY CHANNEL, 2–2<br />
configurations, 3–10<br />
system parameters, A–2<br />
Messages<br />
acknowledgment, F–24<br />
distributed enable, F–32<br />
distributed trigger, F–33<br />
OPCOM, D–10<br />
Mirroring<br />
See Volume shadowing and Shadow sets<br />
MKDRIVER, 2–2<br />
MODPARAMS.DAT file<br />
adding a node or a satellite, 8–11<br />
adjusting maximum packet size, 10–17<br />
cloning system disks, 9–16<br />
created by CLUSTER_CONFIG.COM procedure,<br />
8–1
MODPARAMS.DAT file (cont’d)<br />
example, 6–9<br />
specifying dump file, 10–12<br />
specifying MSCP disk-serving parameters,<br />
6–20<br />
specifying TMSCP tape-serving parameters,<br />
6–20<br />
updating, 8–35<br />
Monitor utility (MONITOR), 1–11<br />
locating disk I/O bottlenecks, 10–21<br />
MOP downline load services<br />
See also DECnet software and LAN Control<br />
Program (LANCP) utility<br />
DECnet MOP, 4–8<br />
LAN MOP, 4–8<br />
servers<br />
enabling, 9–9<br />
functions, 3–5<br />
satellite booting, 3–5<br />
selecting, 8–4<br />
MOUNT/GROUP command, 6–25<br />
MOUNT/NOREBUILD command, 6–26, 9–14<br />
MOUNT/SYSTEM command, 6–24<br />
Mount utility (MOUNT), 1–11<br />
CLU_MOUNT_DISK.COM command procedure,<br />
5–23<br />
mounting disks clusterwide, 8–32<br />
MSCPMOUNT.COM command procedure, 5–15<br />
remounting disks, 9–14<br />
SATELLITE_PAGE.COM command procedure,<br />
8–6<br />
shared disks, 6–24<br />
MPDEV_AFB_INTVL system parameter, A–4<br />
MPDEV_D1 system parameter, A–4, A–12<br />
MPDEV_ENABLE system parameter, A–4<br />
MPDEV_LCRETRIES system parameter, A–4<br />
MPDEV_POLLER system parameter, A–4<br />
MPDEV_REMOTE system parameter, A–5<br />
MSCPMOUNT.COM file, 5–15, 6–25<br />
MSCP servers, 1–4, 2–15<br />
access to shadow sets, 6–29<br />
ALLOCLASS parameter, 6–12<br />
booting sequence, C–2<br />
boot server, 3–5<br />
cluster-accessible disks, 6–20<br />
cluster-accessible files, 6–2<br />
configuring, 4–4, 8–20<br />
enabling, 6–20<br />
functions, 6–20<br />
LAN disk server, 3–5<br />
load balancing, 6–22<br />
serving a shadow set, 3–8, 6–29<br />
shared disks, 6–2<br />
SHOW DEVICE/FULL command, 10–20<br />
specifying preferred path, 6–5<br />
MSCP_ALLOCATION_CLASS command, 6–10<br />
MSCP_BUFFER system parameter, A–5<br />
MSCP_CMD_TMO system parameter, A–5<br />
MSCP_CREDITS system parameter, A–5<br />
MSCP_LOAD system parameter, 4–4, 6–12, 6–20,<br />
8–20, A–5<br />
MSCP_SERVE_ALL system parameter, 4–4,<br />
6–20, 8–20, A–5<br />
Multicast datagram, F–23<br />
Multiple-site OpenVMS Cluster systems, 1–1<br />
N<br />
NCP (Network Control Program)<br />
See also DECnet software<br />
defining cluster alias, 4–14<br />
disabling LAN adapter, 4–13<br />
enabling MOP service, 9–9, C–9<br />
logging events, C–4<br />
logging line counters, 10–23<br />
NET$CONFIGURE.COM command procedure<br />
See DECnet software<br />
NET$PROXY.DAT files<br />
DECnet–Plus authorization elements, 5–18<br />
NETCONFIG.COM command procedure<br />
See DECnet software<br />
NETNODE_REMOTE.DAT file<br />
renaming to SYS$COMMON directory, 4–13<br />
updating, 8–37<br />
NETNODE_UPDATE.COM command procedure,<br />
10–4<br />
authorization elements, 5–23<br />
NETOBJECT.DAT file<br />
authorization elements, 5–18<br />
NETPROXY.DAT files<br />
authorization elements, 5–18<br />
intracluster security, 5–22<br />
setting up, 5–22<br />
Network connections<br />
See DECnet software<br />
Network Control Program (NCP)<br />
See NCP<br />
Networks<br />
congestion causes of packet loss, F–12, G–3<br />
HELLO datagram congestion, G–5<br />
maintaining configuration data, 5–23<br />
PEDRIVER implementation, F–4<br />
retransmission problems, F–17<br />
security, 5–22<br />
troubleshooting<br />
See LAVC$FAILURE_ANALYSIS.MAR<br />
program<br />
updating satellite data in NETNODE_<br />
REMOTE.DAT, 8–37<br />
NISCA transport protocol, F–1<br />
capturing data, F–27<br />
CC header, F–22<br />
CC protocol, F–3<br />
channel formation problems, F–16<br />
channel selection, G–1<br />
channel selection and congestion control<br />
appendix, G–1<br />
congestion control, G–3<br />
datagram flags, F–24<br />
datagrams, F–19<br />
definition, F–1<br />
diagnosing with a LAN analyzer, F–19<br />
DX header, F–21<br />
DX protocol, F–3<br />
Equivalent Channel Set, G–1<br />
function, F–3<br />
LAN Ethernet header, F–20<br />
LAN FDDI header, F–20<br />
load distribution, G–1<br />
packet format, F–19<br />
packet loss, F–12<br />
PEDRIVER implementation, F–4<br />
PI protocol, F–3<br />
PPC protocol, F–3<br />
PPD protocol, F–3<br />
preferred channel, G–2<br />
retransmission problems, F–17<br />
TR header, F–23<br />
troubleshooting, F–1<br />
TR protocol, F–3<br />
NISCS_CONV_BOOT system parameter, 8–10,<br />
A–6, C–5<br />
NISCS_LAN_OVRHD system parameter, A–7<br />
NISCS_LOAD_PEA0 system parameter, 4–4, A–7,<br />
C–13<br />
caution when setting to 1, 8–20, A–7<br />
NISCS_MAX_PKTSZ system parameter, 10–16,<br />
A–7<br />
NISCS_PORT_SERV system parameter, A–8<br />
Node allocation classes<br />
See Allocation classes<br />
O<br />
OPCOM (Operator Communication Manager),<br />
10–9, D–10<br />
OpenVMS Alpha systems<br />
RAID device-naming problems, 6–13<br />
OpenVMS Cluster sample programs, D–1<br />
LAVC$FAILURE_ANALYSIS.MAR, D–3<br />
LAVC$START_BUS.MAR, D–1<br />
LAVC$STOP_BUS.MAR, D–2<br />
OpenVMS Cluster systems<br />
adding a computer, 8–10, 8–33, 8–41<br />
adjusting EXPECTED_VOTES, 8–35<br />
architecture, 2–1<br />
benefits, 1–2<br />
clusterwide logical names, 5–6, 5–11<br />
common environment, 5–2<br />
startup command procedures, 5–14<br />
common SYSUAF.DAT file, 10–17<br />
configurations, 3–1, 9–1<br />
keeping records, 10–3<br />
preconfiguration tasks, 8–3<br />
procedures, 8–1, 8–4<br />
disaster tolerant, 1–1<br />
distributed file system, 1–4, 2–15<br />
distributed processing, 5–1, 7–1<br />
hang condition, C–15<br />
interprocessor communications, 6–29<br />
local resources, 5–2<br />
maintenance, 10–1<br />
members, 1–3<br />
mixed-architecture<br />
booting, 4–1, 10–5, 10–8<br />
system disks, 3–5, 10–8<br />
multiple environments, 5–2, 5–3<br />
functions that must remain specific, 5–14<br />
startup functions, 5–15<br />
operating environments, 5–2<br />
preparing, 4–1<br />
partitioning, 2–5, C–17<br />
reconfiguring a cluster, 8–33<br />
recovering from startup procedure failure,<br />
C–14<br />
removing a computer, 8–33<br />
adjusting EXPECTED_VOTES, 8–35<br />
security management, 5–16, 10–14<br />
single security domain, 5–16<br />
system applications, 1–5<br />
System Communications Services (SCS), 1–4<br />
system management overview, 1–7<br />
system parameters, A–1<br />
tools and utilities, 1–7, 1–11<br />
troubleshooting, C–1<br />
voting member, 2–6<br />
adding, 8–33<br />
removing, 8–33<br />
OpenVMS Management Station, 1–10<br />
Operating systems<br />
coordinating files, 5–22<br />
installing, 4–1<br />
on common system disk, 5–4<br />
upgrading, 4–1<br />
Operator Communication Manager<br />
See OPCOM<br />
P<br />
Packet loss<br />
caused by network congestion, F–12, G–3<br />
caused by too many HELLO datagrams, G–5<br />
NISCA retransmissions, F–12<br />
Packets<br />
capturing data, F–26<br />
maximum size on Ethernet, 10–16<br />
maximum size on FDDI, 10–16<br />
transmission window size, G–4<br />
PADRIVER port driver, C–17, C–18<br />
Page files (PAGEFILE.SYS)<br />
created by CLUSTER_CONFIG.COM procedure,<br />
8–1, 8–6<br />
Page sizes<br />
hardware<br />
AUTOGEN determination, A–1<br />
PAKs (Product Authorization Keys), 4–6<br />
PANUMPOLL system parameter, A–13<br />
PAPOLLINTERVAL system parameter, A–13<br />
Partner programs<br />
capturing retransmitted packets, F–34<br />
PASANITY system parameter, A–14<br />
Passwords<br />
See Cluster passwords<br />
VMS$PASSWORD_DICTIONARY.DATA file,<br />
5–17, 5–21<br />
VMS$PASSWORD_HISTORY.DATA file, 5–17,<br />
5–21<br />
VMS$PASSWORD_POLICY.EXE file, 5–17,<br />
5–21<br />
PASTDGBUF system parameter, A–8, A–14<br />
PASTIMOUT system parameter, A–14<br />
Paths<br />
specifying preferred for MSCP-served disks,<br />
6–5<br />
PBDRIVER port driver, C–18<br />
PEDRIVER<br />
channel selection, G–1<br />
channel selection and congestion control, G–1<br />
Equivalent Channel Set, G–1<br />
load distribution, G–1<br />
PEDRIVER port driver, 2–2, C–18<br />
congestion control, G–3<br />
HELLO multicasts, G–5<br />
implementing the NISCA protocol, F–4<br />
NISCS_LOAD_PEA0 system parameter, 4–4<br />
preferred channel, G–2<br />
retransmission, F–19<br />
SDA monitoring, F–13<br />
Physical Interconnect protocol<br />
See PI protocol<br />
PIDRIVER port driver, 2–2, C–18<br />
PI protocol<br />
part of the SCA architecture, F–3<br />
PK*DRIVER port driver, 2–2<br />
PMDRIVER port driver, 2–2<br />
PNDRIVER port driver, 2–2<br />
POLYCENTER Console Manager (PCM), 1–9<br />
Port<br />
software controllable selection, 6–5<br />
Port allocation classes<br />
See Allocation classes<br />
Port communications, 1–5, C–17<br />
Port drivers, 1–5, C–17<br />
device error-log entries, C–17<br />
error-log entries, C–24<br />
Port failures, C–18<br />
Port polling, C–17<br />
Port-to-Port Communication protocol<br />
See PPC protocol<br />
Port-to-Port Driver level<br />
See PPD level<br />
Port-to-Port Driver protocol<br />
See PPD protocol<br />
PPC protocol<br />
part of NISCA transport protocol, F–3<br />
PPD level<br />
part of PPD protocol, F–3<br />
PPD protocol<br />
part of SCA architecture, F–3<br />
Preferred channels, G–2<br />
Print queues, 5–1, 7–1<br />
See also Queues and Queue managers<br />
assigning unique name to, 7–5<br />
clusterwide generic, 7–7<br />
setting up clusterwide, 7–4<br />
starting, 7–7<br />
Processes<br />
quotas, 2–14, 10–17<br />
Programs<br />
analyze retransmission errors, F–33<br />
LAN analyzer partner, F–34<br />
LAN analyzer scribe, F–34<br />
LAN analyzer starter, F–33<br />
Protocols<br />
Channel Control (CC), F–3<br />
Datagram Exchange (DX), F–3<br />
PEDRIVER implementation of NISCA, F–4<br />
Physical Interconnect (PI), F–3<br />
Port-to-Port Communication (PPC), F–3<br />
Port-to-Port Driver (PPD), F–3<br />
System Application (SYSAP), F–2<br />
System Communications Services (SCS), F–2<br />
Transport (TR), F–3<br />
Proxy logins<br />
controlling, 5–22<br />
records, 5–22<br />
Q<br />
QDSKINTERVAL system parameter, A–9<br />
QDSKVOTES system parameter, 2–8, A–9<br />
QMAN$MASTER.DAT file, 5–23, 7–2, 7–4<br />
authorization elements, 5–19<br />
Queue managers<br />
availability of, 7–2<br />
clusterwide, 7–1<br />
Queues<br />
See also Batch queues and Print queues<br />
common command procedure, 7–12<br />
controlling, 7–1<br />
database files<br />
creating, 7–2<br />
default location, 7–4<br />
setting up in SYSTARTUP_COMMON.COM<br />
procedure, 5–15<br />
Quorum<br />
adding voting members, 8–33<br />
algorithm, 2–5<br />
calculating cluster votes, 2–6<br />
changing expected votes, 10–20<br />
definition, 2–5<br />
DISK_QUORUM system parameter, 2–8<br />
enabling a quorum disk watcher, 2–8<br />
EXPECTED_VOTES system parameter, 2–6,<br />
10–19<br />
QUORUM.DAT file, 2–8<br />
reasons for loss, C–15<br />
removing voting members, 8–33<br />
restoring after unexpected computer failure,<br />
10–19<br />
system parameters, 2–6<br />
VOTES system parameter, 2–6<br />
QUORUM.DAT file, 2–8<br />
Quorum disk<br />
adding, 8–16<br />
adjusting EXPECTED_VOTES, 8–35<br />
disabling, 8–19, 8–33<br />
enabling, 8–33<br />
mounting, 2–8<br />
QDSKVOTES system parameter, 2–8<br />
QUORUM.DAT file, 2–8<br />
restricted from shadow sets, 6–28<br />
watcher<br />
enabling, 2–8<br />
mounting the quorum disk, 2–8<br />
system parameters, 2–8<br />
Quotas<br />
process, 2–14, 10–17<br />
R<br />
RAID (redundant arrays of independent disks),<br />
6–27<br />
device naming problem, 6–13<br />
RBMS (Remote Bridge Management Software),<br />
D–6<br />
monitoring LAN traffic, 10–23<br />
REBOOT_CHECK option, 10–10<br />
RECNXINTERVAL system parameter, A–9<br />
Redundant arrays of independent disks<br />
See RAID<br />
Remote Bridge Management Software<br />
See RBMS<br />
REMOVE_NODE option, 10–10<br />
Request descriptor table entries, 9–17<br />
Resource sharing, 1–1<br />
components that manage, 2–5<br />
making DECnet databases available<br />
clusterwide, 4–13<br />
printers, 5–1<br />
processing, 5–1<br />
Retransmissions<br />
analyzing errors, F–33<br />
caused by HELLO datagram congestion, G–5<br />
caused by lost ACKs, F–19<br />
caused by lost messages, F–18<br />
problems, F–17<br />
under congestion conditions, G–4<br />
REXMT flag, F–19<br />
RIGHTSLIST.DAT files<br />
authorization elements, 5–19<br />
merging, B–3<br />
security mechanism, 5–17<br />
RMS<br />
use of distributed lock manager, 2–15<br />
S<br />
Satellite nodes<br />
adding, 8–11<br />
altering local disk labels, 8–38<br />
booting, 3–5, 4–7, 9–5, 9–6<br />
controlling, 9–10<br />
conversational bootstrap operations, 8–10<br />
cross-architecture, 4–1, 10–5<br />
DECnet MOP downline load service, 4–8<br />
LANCP as booting mechanism, 4–7<br />
LAN MOP downline load service, 4–8<br />
preparing for, 9–4<br />
troubleshooting, C–4, C–13<br />
downline loading, 9–9<br />
functions, 3–5<br />
LAN hardware addresses<br />
modifying, 8–21<br />
obtaining, 8–5<br />
local disk used for paging and swapping, 3–5<br />
maintaining network configuration data, 5–23,<br />
10–4<br />
rebooting, 8–39<br />
removing, 8–17<br />
restoring network configuration data, 10–4<br />
system files created by CLUSTER_<br />
CONFIG.COM procedure, 8–1<br />
updating network data in NETNODE_<br />
REMOTE.DAT, 8–37<br />
SATELLITE_PAGE.COM command procedure,<br />
8–6<br />
SAVE_FEEDBACK option, 10–10<br />
SCA (System Communications Architecture)<br />
NISCA transport protocol, F–1, F–3<br />
protocol levels, F–1, F–3<br />
SCA Control Program (SCACP), 10–23<br />
SCACP<br />
See SCA Control Program<br />
Scribe programs<br />
capturing traffic data, F–34<br />
SCS (System Communications Services)<br />
connections, C–18<br />
definition, 1–5<br />
DX header for protocol, F–21<br />
for interprocessor communication, 1–4<br />
part of the SCA architecture, F–2<br />
port polling, C–17<br />
system parameters, A–9 to A–10<br />
SCSMAXDG, A–14<br />
SCSMAXMSG, A–14<br />
SCSRESPCNT, A–10<br />
SCSBUFFCNT system parameter, 9–17<br />
SCSI (Small Computer Systems Interface), 2–2<br />
cluster-accessible disks, 1–3<br />
clusterwide reboot requirements, 6–19<br />
configurations, 3–12<br />
device names, 6–6<br />
reboot requirements, 6–19<br />
disks, 1–4, 2–15<br />
support for compliant hardware, 6–28<br />
SCSLOA.EXE image, C–18<br />
SCSNODE system parameter, 4–3<br />
SCSRESPCNT system parameter, 9–18<br />
SCSSYSTEMID system parameter, 4–3<br />
SDI (standard disk interface), 1–4, 2–15<br />
Search lists, 5–5<br />
Security management, 5–16, 10–14<br />
See also Authorize utility (AUTHORIZE) and<br />
CLUSTER_AUTHORIZE.DAT files<br />
controlling conversational bootstrap operations,<br />
8–10<br />
membership integrity, 5–16<br />
network, 5–22<br />
security-relevant files, 5–17<br />
Sequence numbers<br />
for datagram flags, F–24<br />
Servers, 1–4<br />
actions during booting, 3–5<br />
boot, 3–5<br />
configuration memory and LAN adapters, 9–4<br />
disk, 3–5<br />
enabling circuit service for MOP, 4–7<br />
MOP, 3–5<br />
MOP and disk, 8–4<br />
MSCP, 6–29<br />
tape, 3–5<br />
TMSCP, 1–4, 2–16<br />
used for downline load, 3–5<br />
SET ALLOCATE command, 6–9<br />
SET AUDIT command, 5–18<br />
SET LOGINS command, 4–7<br />
SET PREFERRED_PATH command, 6–5<br />
SET TIME command<br />
setting time across a cluster, 5–24<br />
SET VOLUME/REBUILD command, 6–27<br />
Shadow sets<br />
See also Volume shadowing<br />
accessed through MSCP server, 6–29f<br />
definition, 6–27<br />
distributing, 6–27<br />
maximum number, 6–30<br />
virtual unit, 6–27<br />
SHOW CLUSTER command, 10–21<br />
Show <strong>Cluster</strong> utility (SHOW CLUSTER), 1–10,<br />
10–19<br />
CL_QUORUM command, 10–19<br />
CL_VOTES command, 10–19<br />
EXPECTED_VOTES command, 10–19<br />
SHOW DEVICE commands, 10–20<br />
SHUTDOWN command<br />
shutting down a node, 10–11<br />
shutting down a node or subset of nodes, 8–37<br />
shutting down the cluster, 8–36, 10–11<br />
Shutting down a node, 8–37, 10–11<br />
Shutting down the cluster, 8–36, 10–10<br />
SMCI, 1–3<br />
Standalone computers<br />
converting to cluster computer, 8–24<br />
Standard disk interface<br />
See SDI<br />
Standard tape interface<br />
See STI<br />
Star couplers, 3–2<br />
START/QUEUE/MANAGER command<br />
/NEW_VERSION qualifier, 7–2<br />
/ON qualifier, 7–2<br />
Starter programs<br />
capturing retransmitted packets, F–33<br />
Startup command procedures<br />
controlling satellites, 9–10<br />
coordinating, 5–13<br />
site-specific, 5–15<br />
template files, 5–14<br />
STARTUP_P1 system parameter<br />
does not start all processes, 8–35, 8–41<br />
minimum startup, 2–8<br />
State transitions, 1–4, 2–9<br />
Status<br />
returned by SYS$LAVC_START_BUS<br />
subroutine, E–2<br />
STI (standard tape interface), 1–4<br />
Storage Library System<br />
See SLS<br />
StorageWorks RAID Array 210 Subsystem<br />
naming devices, 6–13<br />
StorageWorks RAID Array 230 Subsystem<br />
naming devices, 6–13<br />
Stripe sets<br />
shadowed, 6–29<br />
Swap files (SWAPFILE.SYS)<br />
created by CLUSTER_CONFIG.COM procedure,<br />
8–1, 8–6<br />
SYLOGICALS.COM startup file<br />
cloning system disks, 9–16<br />
clusterwide logical names, 5–12<br />
SYS$COMMON:[SYSMGR] directory<br />
template files, 5–14<br />
SYS$DEVICES.DAT text file, 6–19<br />
SYS$LAVC_DEFINE_NET_COMPONENT<br />
subroutine, E–5<br />
SYS$LAVC_DEFINE_NET_PATH subroutine,<br />
E–7<br />
SYS$LAVC_DISABLE_ANALYSIS subroutine,<br />
E–10<br />
SYS$LAVC_ENABLE_ANALYSIS subroutine,<br />
E–9<br />
SYS$LAVC_START_BUS subroutine, E–1<br />
SYS$LAVC_STOP_BUS subroutine, E–3<br />
SYS$LIBRARY system directory, 5–5<br />
SYS$MANAGER:SYCONFIG.COM command<br />
procedure, 5–14<br />
SYS$MANAGER:SYLOGICALS.COM command<br />
procedure, 5–14<br />
SYS$MANAGER:SYPAGSWPFILES.COM<br />
command procedure, 5–14<br />
SYS$MANAGER:SYSECURITY.COM command<br />
procedure, 5–14<br />
SYS$MANAGER:SYSTARTUP_VMS.COM<br />
command procedure, 5–14<br />
SYS$MANAGER system directory, 5–5<br />
SYS$QUEUE_MANAGER.QMAN$JOURNAL file,<br />
7–2<br />
SYS$QUEUE_MANAGER.QMAN$QUEUES file,<br />
7–2<br />
SYS$SPECIFIC directory, 5–5<br />
SYS$SYSROOT logical name, 5–5<br />
SYS$SYSTEM:STARTUP.COM command<br />
procedure, 5–14<br />
SYS$SYSTEM system directory, 5–5<br />
SYSALF.DAT file<br />
authorization elements, 5–19<br />
SYSAP protocol<br />
definition, F–2<br />
use of SCS, 1–5<br />
SYSAPs, 1–5<br />
SYSBOOT<br />
SET/CLASS command, 6–18<br />
SYSBOOT.EXE image<br />
renaming before rebooting satellite, 8–40<br />
SYSGEN parameters<br />
See System parameters<br />
SYSMAN (System Management utility)<br />
See System Management utility<br />
SYSMAN utility<br />
/CLUSTER_SHUTDOWN qualifier, 10–10<br />
SHUTDOWN NODE command, 8–37<br />
SYSTARTUP.COM procedures<br />
setting up, 5–15<br />
SYSTARTUP_VMS.COM startup file<br />
clusterwide logical names, 5–12<br />
System Application protocol<br />
See SYSAP protocol<br />
System applications (SYSAPs)<br />
See SYSAPs<br />
System Communications Architecture<br />
See SCA<br />
System Communications Services<br />
See SCS<br />
System Dump Analyzer utility (SDA), 1–10<br />
monitoring PEDRIVER, F–13<br />
System management, 1–7<br />
AUTOGEN command procedure, 1–10<br />
operating environments, 5–2<br />
products, 1–8<br />
SYSMAN utility, 1–10<br />
System Dump Analyzer, 1–10<br />
tools for daily operations, 1–7, 1–11<br />
System Management utility (SYSMAN), 1–10<br />
enabling cluster alias operations, 4–16<br />
modifying cluster group data, 10–15<br />
System parameters<br />
ACP_REBLDSYSD, 6–26<br />
adjusting for cluster growth, 9–17<br />
adjusting LRPSIZE parameter, 10–16<br />
adjusting NISCS_MAX_PKTSZ parameter,<br />
10–16<br />
ALLOCLASS, 6–9<br />
caution to prevent data corruption, 8–20, A–7,<br />
A–12<br />
CHECK_CLUSTER, A–1<br />
cluster parameters, A–1 to A–12<br />
CLUSTER_CREDITS, 9–18, A–1<br />
CWCREPRC_ENABLE, A–1<br />
DISK_QUORUM, A–2<br />
DR_UNIT_BASE, A–2<br />
EXPECTED_VOTES, 2–6, 8–11, 8–33, A–2<br />
LOCKDIRWT, 2–14, A–2<br />
LRPSIZE, A–2<br />
MPDEV_AFB_INTVL, A–4<br />
MPDEV_D1, A–4, A–12<br />
MPDEV_ENABLE, A–4<br />
MPDEV_LCRETRIES, A–4<br />
MPDEV_POLLER, A–4<br />
MPDEV_REMOTE, A–5<br />
MSCP_BUFFER, A–5<br />
MSCP_CMD_TMO, A–5<br />
MSCP_CREDITS, A–5<br />
MSCP_LOAD, 6–20, A–5<br />
MSCP_SERVE_ALL, 6–20, A–5<br />
NISCS_CONV_BOOT, 8–10, A–6, C–5<br />
NISCS_LAN_OVRHD, A–7<br />
NISCS_LOAD_PEA0, A–7, C–13<br />
NISCS_MAX_PKTSZ, A–7<br />
NISCS_PORT_SERV, A–8<br />
PASTDGBUF, A–8<br />
QDSKINTERVAL, A–9<br />
QDSKVOTES, A–9<br />
quorum, 2–6<br />
RECNXINTERVAL, A–9<br />
retaining with feedback option, 8–40<br />
SCSBUFFCNT, 9–17<br />
SCSMAXDG, A–14<br />
SCSMAXMSG, A–14<br />
SCSRESPCNT, 9–18<br />
setting parameters in MODPARAMS.DAT file,<br />
6–9<br />
STARTUP_P1 set to MIN, 2–8<br />
TAPE_ALLOCLASS, 6–9, A–10<br />
TIMVCFAIL, A–10<br />
TMSCP_LOAD, 6–20, A–10<br />
TMSCP_SERVE_ALL, A–10<br />
updating in MODPARAMS.DAT and AGEN$<br />
files, 8–35<br />
VAXCLUSTER, A–11<br />
VOTES, 2–6, A–12<br />
System time<br />
setting clusterwide, 5–24<br />
SYSUAF.DAT files<br />
authorization elements, 5–20<br />
creating common version, B–2<br />
determining process limits and quotas, 10–17<br />
merging, B–1<br />
printing listing of, B–1<br />
setting up, 5–22<br />
SYSUAFALT.DAT files<br />
authorization elements, 5–20<br />
T<br />
Tape class drivers, 1–4, 2–16<br />
Tape controllers, 1–4<br />
Tape mass storage control protocol servers<br />
See TMSCP servers<br />
Tapes<br />
assigning allocation classes, 6–9<br />
cluster-accessible, 1–3, 6–1<br />
clusterwide access to local, 6–20<br />
dual-pathed, 6–1, 6–13<br />
dual-ported, 6–2<br />
managing, 6–1<br />
node allocation class, 6–7<br />
restricted access, 6–1<br />
served by TMSCP, 6–1, 6–20<br />
serving, 6–1<br />
setting allocation class, 6–13<br />
shared, 5–1<br />
TUDRIVER, 1–4, 2–16<br />
Tape servers<br />
TMSCP on LAN configurations, 3–5<br />
TAPE_ALLOCLASS system parameter, 6–9, 6–12,<br />
6–13, 8–21, A–10<br />
TCPIP$PROXY.DAT file, 5–18<br />
Time<br />
See System time<br />
TIMVCFAIL system parameter, A–10<br />
TMSCP servers<br />
booting sequence, C–2<br />
cluster-accessible files, 6–2<br />
cluster-accessible tapes, 6–1, 6–20<br />
configuring, 8–20<br />
functions, 6–20<br />
LAN tape server, 3–5<br />
SCSI retention command restriction, 6–22<br />
TAPE_ALLOCLASS parameter, 6–12<br />
TUDRIVER, 1–4, 2–16<br />
TMSCP_ALLOCATION_CLASS command, 6–10<br />
TMSCP_LOAD system parameter, 6–20, 8–21,<br />
A–10<br />
TMSCP_SERVE_ALL system parameter, 8–20,<br />
A–10<br />
TR/CC flag<br />
setting in the CC header, F–22<br />
setting in the TR header, F–23<br />
Traffic<br />
isolating OpenVMS Cluster data, F–26<br />
Transmit channel<br />
selection, G–1<br />
selection and congestion control, G–1<br />
Transport<br />
See NISCA transport protocol<br />
Transport header<br />
See TR header<br />
Transport protocol<br />
See TR protocol<br />
TR header, F–24<br />
$TRNLNM system service, 5–11<br />
Troubleshooting<br />
See also LAVC$FAILURE_ANALYSIS.MAR<br />
program<br />
analyzing port error-log entries, C–24<br />
channel formation, F–16<br />
CI booting problems, C–3<br />
CI cables, C–23<br />
CI cabling problems, C–21<br />
CLUEXIT bugcheck, C–16<br />
data isolation techniques, F–26<br />
disk I/O bottlenecks, 10–21<br />
disk servers, C–10<br />
distributed enable messages, F–32<br />
distributed trigger messages, F–33<br />
error-log entries for CI and LAN ports, C–31<br />
failure of computer to boot, C–1, C–6<br />
failure of computer to join the cluster, C–1,<br />
C–13<br />
failure of startup procedure to complete, C–14<br />
hang condition, C–15<br />
LAN component failures, C–15<br />
LAN network components, D–3<br />
loss of quorum, C–15<br />
MOP servers, C–9<br />
multiple LAN segments, F–31<br />
network retransmission filters, F–31<br />
NISCA communications, F–16<br />
NISCA transport protocol, F–1<br />
OPA0 error messages, C–38<br />
port device problem, C–17<br />
retransmission errors, F–33<br />
retransmission problems, F–17<br />
satellite booting, C–6<br />
on CI, C–10<br />
shared resource is inaccessible, C–16<br />
using distributed trigger filter, F–32<br />
using Ethernet LAN analyzers, F–28<br />
using LAN analyzer filters, F–30<br />
using packet filters, F–31<br />
verifying CI cable connections, C–20<br />
verifying CI port, C–19<br />
verifying virtual circuit state open, C–20<br />
TR protocol<br />
part of NISCA transport protocol, F–3<br />
PEDRIVER implements packet delivery service,<br />
F–4<br />
TUDRIVER (tape class driver), 1–4, 2–2, 2–16<br />
U<br />
UAFs (user authorization files), 5–1<br />
building a common file, 5–17, B–2<br />
UETP (User Environment Test Package), 8–41<br />
use in upgrading the operating system, 10–3<br />
UETP_AUTOGEN.COM command procedure<br />
building large OpenVMS Cluster systems, 9–1<br />
UICs (user identification codes)<br />
building common file, 5–17, B–2<br />
Unknown opcode errors, C–34<br />
Upgrades, 4–1<br />
for multiple-system disk VAXclusters, 10–3<br />
rolling, 10–3<br />
User accounts<br />
comparing, B–1<br />
coordinating, B–2<br />
group UIC, B–2<br />
User authorization files<br />
See UAFs<br />
User-defined patterns<br />
ability of LAN protocol analyzer to detect, F–25<br />
User environment<br />
computer-specific functions, 5–14<br />
creating a common-environment cluster, 5–15<br />
defining, 5–22<br />
User Environment Test Package<br />
See UETP<br />
User identification codes<br />
See UICs<br />
V<br />
VACK datagram, F–16, F–23<br />
VAXCLUSTER system parameter, 4–3, A–11<br />
caution when setting to zero, 8–20, A–12<br />
VAXVMSSYS.PAR file<br />
created by CLUSTER_CONFIG.COM procedure,<br />
8–1<br />
VCS for OpenVMS (VMScluster Console System),<br />
1–9<br />
VERF datagram, F–16, F–23<br />
Virtual circuits, C–17<br />
definition, F–4<br />
OPEN state, C–20<br />
transmission window size, G–4<br />
Virtual units, 6–27<br />
VMS$AUDIT_SERVER.DAT file<br />
authorization elements, 5–18<br />
VMS$OBJECTS.DAT file<br />
authorization elements, 5–21<br />
VMS$PASSWORD_DICTIONARY.DATA file, 5–17<br />
authorization elements, 5–21<br />
VMS$PASSWORD_HISTORY.DATA file, 5–17<br />
authorization elements, 5–21<br />
VMS$PASSWORD_POLICY.EXE file, 5–17<br />
authorization elements, 5–21<br />
VMSMAIL_PROFILE.DATA file<br />
authorization elements, 5–21<br />
Volume labels<br />
modifying for satellite’s local disk, 8–6<br />
Volume sets<br />
shadowed, 6–29<br />
Volume shadowing<br />
See also Shadow sets<br />
defined, 6–27<br />
interprocessor communication, 6–29<br />
overview, 6–27<br />
system disk, 9–16<br />
VOTES system parameter, 2–6, A–12<br />
Voting members, 2–6<br />
adding, 8–10, 8–33<br />
removing, 8–18, 8–33<br />