Some Lustre FS basics

Lustre is a parallel distributed file system that offers high performance and open licensing. It is often used in supercomputers and extremely large-scale hosting facilities (think Netflix via their third-party hosts). Most of the world's largest supercomputers run one or more Lustre file systems, including the world's fastest supercomputer, Titan.

Lustre runs exclusively on Linux-based systems and is composed of four major parts:

1. MDS (Meta Data Server): this single node hosts all file system metadata for a single file system instance, and may host the management service as well. It consists of a supported Linux installation running a supported kernel, and it hosts the single major component, the MDT (Meta Data Target), which is usually a redundant high-speed storage array of some kind.


When choosing hardware for your MDS it's critical that it is highly reliable and well tested. It's also critical that it can perform small reads and writes against your storage target with as much speed as possible. Multiple MDS units can be set up to run against a shared MDT; however, these units must run in an actively managed active/passive failover configuration.
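To make the MDS/MDT relationship concrete, here is a minimal sketch of bringing one up with the standard mkfs.lustre and mount tools. The device path, file system name, and mount point are placeholders for your own environment, not recommendations:

```
# Format the MDT; --mgs co-locates the management service on this node
# (device, fsname, and index are example values)
mkfs.lustre --fsname=testfs --mgs --mdt --index=0 /dev/sdb

# Mounting the formatted target is what actually starts the MDS service
mkdir -p /mnt/mdt
mount -t lustre /dev/sdb /mnt/mdt
```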


2. MGS (Management Server): this single node may be part of the MDS (described above) or hosted on its own dedicated node. It generally serves as a site-wide management and configuration node for multiple file systems; however, with new features such as imperative recovery, the role of this node will be greatly expanded in the future.
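If you do give the MGS its own node, the format step is even simpler, since the MGS needs only a small dedicated device. Again, a hedged sketch with placeholder paths:

```
# Format and start a standalone MGS on its own small device
# (device and mount point are example values)
mkfs.lustre --mgs /dev/sda2
mkdir -p /mnt/mgs
mount -t lustre /dev/sda2 /mnt/mgs
```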


3. OSS (Object Storage Server): these are one or more dedicated, high-bandwidth servers which host all object data on one or more OSTs (Object Storage Targets). The OSTs provide the primary data store of the file system, and they transmit data directly back to the client nodes.
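Each OST is formatted on its OSS and pointed at the MGS so it can register with the file system. Roughly, assuming the MGS lives at the example NID 10.0.0.1@tcp0:

```
# Format an OST and tell it where the MGS lives
# (NID, fsname, index, and device are example values)
mkfs.lustre --fsname=testfs --ost --index=0 --mgsnode=10.0.0.1@tcp0 /dev/sdc

# As with the MDT, mounting the target starts serving it
mkdir -p /mnt/ost0
mount -t lustre /dev/sdc /mnt/ost0
```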


OSS nodes benefit from as much memory as possible (for file caching) and as much system bus bandwidth as possible; fast CPUs with good memory management and bus architecture help here. The other thing that I've found works very well is mdraid-based storage targets. These arrays allow a huge amount of direct tuning and tinkering, but they require considerably more work to squeeze out maximum performance.
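For the mdraid route, the backing array is built with mdadm before mkfs.lustre ever sees it. Something like the following, where the disk list, RAID level, and chunk size are example values you would tune for your own drives and workload:

```
# Assemble six disks into a RAID-6 array to back an OST
# (disk list, level, and chunk size are examples; tune per workload)
mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=128 /dev/sd[b-g]

# Keep an eye on the initial resync before formatting the array as an OST
cat /proc/mdstat
```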


4. Clients: like the server nodes described above, these are dedicated systems, and they can serve compute environment needs or file system export needs (say, exporting via CIFS/Samba, NFS, dCache, etc.).
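On the client side there is no formatting at all; the client simply asks the MGS where the MDT and OSTs live and mounts the result. With the same example NID and file system name as above:

```
# Mount the file system on a client via the MGS NID
# (NID, fsname, and mount point are example values)
mkdir -p /mnt/testfs
mount -t lustre 10.0.0.1@tcp0:/testfs /mnt/testfs
```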

The big concern when fitting out your file system is that it will only be as fast as its slowest part. GigE, and even 10GigE or better, are reasonable choices; however, remember that you're losing roughly 30% of throughput to TCP/IP overhead, and you're also burning CPU cycles processing all those packets. A better choice is Infiniband interconnects, which can carry Lustre's LNET networking (which I will describe in more detail in the next article). A quick taste of LNET configuration is sketched below.
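As a small preview of that article, the interconnect choice shows up as a one-line LNET module option on every node. Interface names here are examples:

```
# /etc/modprobe.d/lustre.conf
# Run LNET over Infiniband using the o2ib driver on ib0
options lnet networks="o2ib0(ib0)"

# A plain-Ethernet node would instead use something like:
# options lnet networks="tcp0(eth0)"
```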

In the next article I’ll discuss file system configuration and LNET (Lustre Networking).
