Lustre 1.6 Operations Manual


In Lustre, failover means that a client that tries to do I/O to a failed OST continues to retry (forever) until it gets an answer. Userspace applications see nothing unusual, other than that the I/O may take a very long time to complete.

Lustre failover requires two nodes (a failover pair), which must be connected to a shared storage device. Lustre supports failover for both metadata and object storage servers. Failover is achieved most simply by powering off the failed node (to be absolutely sure the MDT cannot be mounted twice) and mounting the MDT on the partner. When the primary comes back, it MUST NOT mount the MDT while the secondary has it mounted. The secondary can unmount the MDT and the primary can then mount it again.
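As a sketch of that procedure (the device path, mount point, and node names below are illustrative, not prescribed by Lustre), a manual MDT failover might look like this:

  # Power off the failed primary (through its power-control hardware) so the
  # MDT cannot be mounted on two nodes at once, then mount it on the partner:
  secondary# mount -t lustre /dev/sdb1 /mnt/mdt

  # When the primary returns, fail back only after the secondary releases the MDT:
  secondary# umount /mnt/mdt
  primary# mount -t lustre /dev/sdb1 /mnt/mdt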

The Lustre filesystem only supports failover at the server level. Lustre itself does not provide the system-level tools needed for a complete failover solution (node failure detection, power control, and so on).1

Lustre failover depends on either the primary or the backup OST being available to recover the filesystem, so you need to set up an external HA mechanism. The recommended choice is the Heartbeat package, available at:

www.linux-ha.org

Heartbeat is responsible for detecting failure of the primary server node and controlling the failover. The HA software controls Lustre with a provided service script (which mounts the OST or MDT on the secondary node). Although Heartbeat is recommended, Lustre works with any HA software that supports resource (I/O) fencing.
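As a minimal sketch of how such a resource might be declared (the node name, device, and mount point are assumptions for illustration), a Heartbeat version 1 haresources entry can use the standard Filesystem resource agent to mount an OST on whichever node currently owns the resource:

  # /etc/ha.d/haresources -- 'oss1' is the preferred node for this resource
  oss1 Filesystem::/dev/sdc1::/mnt/ost1::lustre

In a real deployment the service script shipped with Lustre, or an equivalent agent, performs the mount; the entry above only illustrates the general shape of a Heartbeat resource definition.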

The hardware setup requires a pair of servers with a shared connection to physical storage (such as a SAN, NAS, hardware RAID, SCSI or FC devices). The method of sharing storage should be essentially transparent at the device level; that is, the same physical LUN should be visible from both nodes. To ensure high availability at the level of physical storage, we encourage the use of RAID arrays to protect against drive-level failures.
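One simple way to confirm that the two nodes really see the same LUN (the device path below is only an example) is to compare the persistent device identifiers reported on each node:

  oss1# blkid /dev/sdc1
  oss2# blkid /dev/sdc1
  # The UUIDs should match; the entries under /dev/disk/by-id/ can be compared the same way.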

To have a fully-automated, highly-available Lustre system, you need power management software and HA software, which must provide the following:

■ Resource fencing - Physical storage must be protected from simultaneous access by two nodes.

■ Resource control - Starting and stopping the Lustre processes as a part of failover, maintaining the cluster state, and so on.

■ Health monitoring - Verifying the availability of hardware and network resources, and responding to health indications given by Lustre (a simple polling sketch follows the footnote below).

1. This functionality has been available for some time in third-party tools.
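As a rough sketch of server-side health monitoring (the polling interval and the logging action are assumptions; in practice the HA software performs this checking itself), an agent can poll the health status that Lustre exports under /proc/fs/lustre/health_check:

  # Report when Lustre declares itself unhealthy on this server.
  while true; do
      if ! grep -q healthy /proc/fs/lustre/health_check; then
          logger -p daemon.err "Lustre reports NOT HEALTHY on $(hostname)"
      fi
      sleep 30
  done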

