
Linux File Systems - Craig Chamberlain



Linux File Systems
Peter J. Braam
Cluster File Systems, Inc


This talk
- Emphasis on Open Source projects
- Linux enterprise storage requirements
- Linux kernel storage stack
- Journal file systems
- LVM, virtualization
- Lustre object based cluster file system

2 - Q2 2002


Enterprise storage requirements
- Larger storage
  - Terabytes
  - 10,000’s of systems
- Logical volume management
  - Manage many storage devices
  - Resize, aggregate, backup, mirror/synchronize
- Clustering
  - Need fast recovery in clusters for failover/transition
  - File systems for replicated and shared storage


Linux in the storage fabric
- NAS servers: Samba, NFS
- File servers
- Clustering
- Embedded Linux storage controllers
- Linux storage will be EVERYWHERE!


Linux requirements
- Catch up with major corporations
  - Storage management
  - Clustering
  - Performance
- In future: get there first
  - Future SAN (Storage Area Networks)
  - Scalable clusters
  - New interconnects
  - Storage controllers


Storage stack


Traditional Storage Stack
[Diagram: layers from top to bottom]
- Applications: FS based applications (WWW/mail/office), databases
- Kernel software: file systems, block devices, logical volumes & RAID, device drivers
- Hardware: interconnects, storage devices


File System Internals
- Files: handles to open files
- Dentries: path names in the kernel
- Inodes: file and directory objects in kernel with block allocations
- Superblocks: for a mounted instance of a file system, manages inode allocations
- File systems: map file & folder commands to pages/buffers for devices
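The relationship between these objects can be illustrated with a small model. This is only a sketch in Python, not kernel code: the class names mirror the slide, but every field, the `alloc_inode` method, and the `lookup` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Inode:
    """File or directory object; records which disk blocks hold its data."""
    number: int
    is_dir: bool = False
    blocks: List[int] = field(default_factory=list)


@dataclass
class Superblock:
    """One per mounted file system instance; manages inode allocation."""
    next_ino: int = 1
    inodes: Dict[int, Inode] = field(default_factory=dict)

    def alloc_inode(self, is_dir: bool = False) -> Inode:
        inode = Inode(self.next_ino, is_dir)
        self.inodes[inode.number] = inode
        self.next_ino += 1
        return inode


@dataclass
class Dentry:
    """One path name component, bound to the inode it names."""
    name: str
    inode: Inode
    children: Dict[str, "Dentry"] = field(default_factory=dict)


@dataclass
class File:
    """Handle to an open file: a dentry plus a current position."""
    dentry: Optional[Dentry]
    pos: int = 0


def lookup(root: Dentry, path: str) -> Optional[Dentry]:
    """Walk the dentry tree one path component at a time."""
    node = root
    for part in [p for p in path.split("/") if p]:
        node = node.children.get(part)
        if node is None:
            return None
    return node


# Build /dir/newfile on a freshly "mounted" file system and open it.
sb = Superblock()
root = Dentry("/", sb.alloc_inode(is_dir=True))
directory = Dentry("dir", sb.alloc_inode(is_dir=True))
root.children["dir"] = directory
directory.children["newfile"] = Dentry("newfile", sb.alloc_inode())
handle = File(lookup(root, "/dir/newfile"))
```

The point of the split is that names (dentries) and objects (inodes) are separate: several dentries could refer to one inode, and an open file handle only adds per-open state such as the position.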


Block devices
- Page cache
  - Pages belonging to inodes managed by page cache
- Buffer cache
  - Cache of pages for reading/writing to block devices
- Mapping a buffer
  - Compute block numbers for buffers & pages
- Logical volumes
  - Perform re-mapping of buffers
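"Mapping a buffer" comes down to arithmetic on the file offset plus a lookup in the inode's block allocation. A minimal sketch, assuming a 4096-byte block and a flat per-inode block list (both are assumptions; real file systems use indirect blocks or extents, and `map_buffer` is a hypothetical name):

```python
BLOCK_SIZE = 4096  # assumed block size in bytes


def map_buffer(inode_blocks, offset):
    """Return (device block number, byte offset inside that block)."""
    index, within = divmod(offset, BLOCK_SIZE)
    return inode_blocks[index], within


# A file whose three blocks live at device blocks 120, 121 and 500.
inode_blocks = [120, 121, 500]
block, within = map_buffer(inode_blocks, 2 * BLOCK_SIZE + 10)
```

A logical volume layer would then translate the resulting block number once more before the request reaches a physical device.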


Distributed File Systems
- Replicated vs shared
- File servers
  - Kernel or user level
  - Interact with block devices and file systems
  - Enforce security
- Clients
  - Use networking to reach servers
  - May have a persistent cache on client disk


New device drivers
- Networked block devices
  - Network block devices: DRBD
  - Important for clusters
- Important new interconnects
  - InfiniBand / VIA – channel I/O
  - Gigabit ethernet, iSCSI, WARP – remote DMA
- Object based storage drivers
  - Block devices only do: read/write block
  - Object storage: device drivers manage more


Journaling Disk File Systems


Disk file system characteristics
- Performance
  - Recovery performance: journaled vs. fsck
  - Data and metadata performance
- Parameters
  - File sizes, file system sizes
- Features
  - Resizing / Versioning


FS Recovery
- Fast recovery is key to:
  - Large file systems & clusters
- Journaled recovery is fast
  - Keep it consistent with transactions: subtle
  - No scanning: recover TB in seconds
- FSCK recovery
  - Find inconsistencies, then repair
  - Slow, complicated


Journal recovery
open("/dir/newfile", O_CREAT); -- create a new file
1. Create inode for new file
2. Create directory entry for name "newfile"
- Several disk blocks affected: dir data, new inode, allocation
- Changes must be implemented atomically
[Diagram: journal transaction = hdr (3 bufs) | dir data buf | inode buf | alloc buf]
1. Append buffers to journal file
2. When journal is on disk, write to FS location
- Recovery: copy journal buffers to FS
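The two-step write and the recovery rule can be modeled in a few lines. This is a toy sketch, not how ext3 or any real journal is implemented: the `JournaledDisk` class, its method names, and the block numbers 10, 20 and 30 are all hypothetical, and the "disk" is just a dictionary.

```python
class JournaledDisk:
    """Toy model: in-place blocks plus an append-only journal."""

    def __init__(self):
        self.fs = {}        # block number -> contents at the final FS location
        self.journal = []   # transactions: {"hdr": buffer count, "bufs": [...]}

    def log(self, buffers):
        """Step 1: append a header and all buffers as one transaction."""
        self.journal.append({"hdr": len(buffers), "bufs": list(buffers)})

    def replay(self):
        """Step 2 and recovery: copy journaled buffers to their FS location."""
        for txn in self.journal:
            if len(txn["bufs"]) != txn["hdr"]:
                continue  # incomplete transaction: never committed, ignore it
            for block, data in txn["bufs"]:
                self.fs[block] = data  # rewriting the same data is harmless
        self.journal.clear()


# The three buffers touched by open("/dir/newfile", O_CREAT).
create = [
    (10, "dir data: newfile -> inode 7"),
    (20, "inode 7"),
    (30, "allocation map: inode 7 in use"),
]

disk = JournaledDisk()
disk.log(create)
disk.fs[10] = create[0][1]  # crash after only one of three in-place writes
disk.replay()               # recovery completes the remaining writes
```

Because replay only copies whole committed transactions, the file system ends up with either all three changes or none of them, and recovery time depends on the journal size rather than on the size of the file system.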


Journal File Systems
- EXT3
  - Compatible with Ext2
  - Written by Stephen Tweedie
- ReiserFS
  - New file layout: advanced, fast, novel
- XFS, JFS
  - Both working well
- Benchmarks: all very close


Storage Targets & virtualization
- Embedded Linux Storage Targets
- iSCSI, FC target mode drivers
- LVM
- DRBD


Storage mainframes
Key features:
- Storage virtualization
- LUN mapping
- Snapshots
- Redundancy
- SAN backup support
[Diagram labels: SAN (SCSI/FC, TCP/IP); storage mainframes; server farm]


Mainframe -> Embedded Linux
- Use Linux infrastructure to offer many features
- Much lower price point
- FalconStor has a commercial product
- Target mode systems: storage devices on SAN
- Interconnect can be FC, InfiniBand or iSCSI
[Diagram labels: commodity hardware, Linux; disk data center]


What is LVM?
- Logical Volume Management:
  - LVM – Sistina
  - EVMS – IBM
- Subdivide physical storage devices
  - Physical extents
- Combine extents, build block devices
  - These are logical volumes
- Can be resized, moved, snapshotted
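The extent idea can be shown as a lookup table from logical extents to (device, physical extent) pairs. The sketch below is an illustration only and does not reflect the on-disk format of LVM or EVMS: `LogicalVolume`, `extend`, `map_block`, the device names, and the extent size of 1024 blocks are all assumptions.

```python
EXTENT_BLOCKS = 1024  # assumed number of blocks per physical extent


class LogicalVolume:
    """A block device assembled from physical extents of several devices."""

    def __init__(self):
        self.extents = []  # logical extent index -> (device, physical extent)

    def extend(self, device, physical_extent):
        """Grow the volume by one extent; existing data is untouched."""
        self.extents.append((device, physical_extent))

    def size_blocks(self):
        return len(self.extents) * EXTENT_BLOCKS

    def map_block(self, logical_block):
        """Re-map a logical block to (device, block on that device)."""
        extent, offset = divmod(logical_block, EXTENT_BLOCKS)
        device, physical_extent = self.extents[extent]
        return device, physical_extent * EXTENT_BLOCKS + offset


lv = LogicalVolume()
lv.extend("sda", 5)  # first logical extent lives in extent 5 of sda
lv.extend("sdb", 0)  # resizing: append an extent from a second device
```

Resizing is an append to the table, and moving data is a copy followed by changing one table entry, which is why these operations do not require the file system above to know where its blocks physically live.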


Why?
- Growing storage
  - Key issue in enterprise storage management
- Combine LVM with file system resizers
  - Ext2/3, XFS, Reiser, JFS have resizers


DRBD
- Make block devices available remotely
- Efficient replication characteristics


Lustre
The Inter-Galactic File System for the National Laboratories!
(or for your home)
http://www.lustre.org
an Open Source project of..


A new storage architecture
- Use object oriented storage concepts
- Advanced cluster file system
- Scalable clustering ideas


History
- 99: CMU – Seagate – Stelias Computing
- 00: Los Alamos
- 01-04: Sandia – Los Alamos – Livermore
  - Try to get Lustre to run on their largest clusters
  - Much industry interest (BlueArc, HP, Intel…)
- First deployments:
  - Livermore MCR 1000 node


Big Lustre picture


[Diagram labels: Clients; Access Control; Security and Resource Databases; Meta Data Control; Coherence Management; Data Transfers; Storage Area Network (FC, GigE, IB); Storage Management; Object Storage Targets]


Orders of magnitude
- Clients (aka compute servers): 10,000’s
- Storage controllers: 1000’s to control PB’s of storage (PB = 10**15 Bytes)
- Cluster control nodes: 10’s
- Aggregate bandwidth: 100’s GB/sec


Lustre System
[Diagram labels: clients; Metadata servers (MDS); Object storage targets (OST); System & Parallel File I/O, File locking; Directory Operations, Metadata & Concurrency; Recovery, File status, File creation]


[Diagram labels: Clustered Object Based File System on host A; Clustered Object Based File System on host B; OBD Client Driver Type RPC; OBD Client Driver Type VIA; Fast storage networking; OBD Target Type RPC; OBD Target Type VIA; Direct OBD; Shared object storage]


Key issues: Scalability
- I/O throughput
  - How to avoid bottlenecks
- Meta data scalability
  - How can 10,000’s of nodes work on files in the same folder
- Cluster recovery
  - If something fails, how can transparent recovery happen
- Management
  - Adding, removing, replacing systems; data migration & backup


Basic components
- Clients:
  - Metadata cluster handles namespace requests
  - Deal with object storage targets for file I/O
- Object storage targets
  - Hold objects that make up files (files can be striped)
  - Negotiate object id’s with the metadata cluster
- Metadata cluster
  - Holds directories & inodes
  - File inodes are redirectors to objects
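The idea that a file inode is a redirector to objects, and that a file can be striped over several of them, can be sketched as an offset calculation. The slides do not specify a stripe layout, so the round-robin scheme below is an assumption, as are the 1 MiB stripe size, the `locate` function, and the object identifiers.

```python
STRIPE_SIZE = 1 << 20  # assumed stripe size: 1 MiB


def locate(offset, object_ids, stripe_size=STRIPE_SIZE):
    """Map a file offset to (object id, offset inside that object).

    Assumes simple round-robin striping across the listed objects.
    """
    stripe, within = divmod(offset, stripe_size)
    obj = object_ids[stripe % len(object_ids)]
    obj_offset = (stripe // len(object_ids)) * stripe_size + within
    return obj, obj_offset


# What the metadata cluster might hand to a client for one file:
# a list of objects, each living on a different object storage target.
file_objects = ["ost0:obj17", "ost1:obj4", "ost2:obj9"]

first = locate(0, file_objects)
second = locate(STRIPE_SIZE, file_objects)
wrapped = locate(3 * STRIPE_SIZE + 5, file_objects)
```

Once a client holds the object list, it can compute locations itself and talk to the storage targets directly, so file I/O does not have to pass through the metadata cluster.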


Lustre Client
[Diagram labels: OS bypass File I/O; clients; OBD client; Lustre client file system; MDS client; metadata wb cache; lock client; recovery; networking. Legend: req for LL; req for L; CFS deliverable; LLNL/CFS joint work]


Roadmap
- Q3 2002: Lustre Light
  - Betas will appear
  - Scales to ~1000s of nodes
- Gradual transition to
- 2003: Lustre
  - More security
  - Scalable metadata cluster


Thanks for your interest!
- Open Source Linux Storage is
  - mature now
  - incredibly exciting
