Another year, another LUG.

Well, it’s been another year and another Lustre User Group meeting. Many interesting discussions took place, though in general I found the most useful session to be the one held after the LUG proper: the developer meeting, for me at least, was an excellent use of my time.


Got to hang out with some colleagues and old friends.

Quite a few ‘primary’ developers attended the developer meeting.

Lustre DNE (Distributed Namespace) basics

What is Lustre DNE?

The Lustre Distributed Namespace (DNE) feature is a new technology developed for the Lustre file system. The feature itself allows file and directory metadata to be distributed across multiple Lustre Metadata Targets (MDTs). With this, you can now effectively scale your metadata load across a near unlimited number of targets. This both reduces the hardware requirements of a single metadata server and greatly expands the maximum number of files your Lustre file system may contain.

The metadata structure appears as follows:
[Figure: remote directory layout, taken from Intel’s DNE proposal document]

Here we see that the parent directory (which in this case is the root of the file system) spawns a subdirectory indexed against the secondary Metadata Target (MDT1).

How is Lustre DNE Setup / Enabled?

Lustre DNE was originally introduced in Lustre 2.4 and further refined in later versions. In this case my examples below are from a recent Lustre 2.5 installation.

Lustre DNE can be enabled on any Lustre file system running version 2.4.0 or later. To enable DNE, all one must do is create a new Metadata Target with an index greater than 0 that points at your Lustre management node (MGS). Multiple Metadata Targets may exist on the same node, however this is not recommended: you will run into both CPU and memory contention, and may not see much gain due to resource starvation (bus speed, etc.).
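
As a rough sketch (the device path and MGS NID below are placeholders for your own environment), formatting and mounting an additional metadata target looks something like this:

# format a second metadata target, index 1, pointing at the existing MGS
mkfs.lustre --mdt --fsname=lustre --index 1 --mgsnode=192.168.1.10@tcp0 /dev/sdb

# mount it on the metadata server that will host it
mkdir -p /mnt/lustre-mdt1
mount -t lustre /dev/sdb /mnt/lustre-mdt1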

Once formatted and mounted, DNE will be available. That said, any clients already mounted to the file system will need to re-mount it to make use of the new Metadata Target(s). This is because online DNE changes are not currently supported, though there are plans to allow this in the future.
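
As a minimal example (assuming the file system is named lustre, the MGS NID is 192.168.1.10@tcp0, and the client mount point is /lustre), the client remount is simply:

umount /lustre
mount -t lustre 192.168.1.10@tcp0:/lustre /lustre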

How do I create directories which will exist on different DNE targets?

Creating directories which point to different DNE targets (Metadata Targets) is quite simple. Once you’ve mounted all your MDTs against your file system, and mounted (or remounted) your clients, you can create directories which target specific Metadata Targets on specific metadata servers. To do this, use the new lfs mkdir command:

# lfs mkdir -i 1 MDT1
# lfs mkdir -i 2 MDT2

The above commands will generate two directories: MDT1, which utilizes metadata target index 1, and MDT2, which utilizes metadata target index 2.

Once complete, all new subdirectories and files created within either the MDT1 or MDT2 directory will have their metadata stored on the specified metadata target (in this case the targets with index numbers 1 and 2).
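
To double check where a directory landed, recent lfs versions can report the backing MDT index of a directory; a quick sanity check using the directory names from the example above:

# should report index 1 and 2 respectively
lfs getstripe -M MDT1
lfs getstripe -M MDT2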

Files or directories moved outside of, say, MDT1 into MDT2 will be moved from metadata target 1 to metadata target 2, and so on.

How do I disable DNE or remove Metadata Targets I no longer use?

Disabling DNE and removing Metadata Targets requires that you first move all file and directory metadata off of every metadata target with an index greater than 0. An example of this would be a file system structured like so:

(MDT0) ROOT /
(MDT1) ROOT / MDT1
(MDT1) ROOT / MDT1 / HOMES
(MDT1) ROOT / MDT1 / DATA
(MDT2) ROOT / MDT2
(MDT2) ROOT / MDT2 / SCRATCH
(MDT0) ROOT / USER_DATA

To disable MDT2 you will need to move SCRATCH and all associated subdirectories and files into either ROOT/MDT1/. or ROOT/.

Once you have removed the data from the MDT2 DNE target you can unmount the provider for MDT2.

To completely disable DNE you will need to move all data on MDT1 and MDT2 DNE targets back to MDT0, in this case ROOT/. or ROOT/USER_DATA/.

In either case of MDT target removal, a tunefs.lustre --writeconf is required against all file system targets (primary MGS, MDT, and all OSTs).
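
As a rough outline (device paths below are placeholders, and the file system must be completely stopped first), the writeconf pass looks something like this:

# unmount all clients, then all OSTs, then the MDT(s)/MGS

# regenerate the configuration logs on every remaining target
tunefs.lustre --writeconf /dev/mdt0_device      # combined MGS/MDT0
tunefs.lustre --writeconf /dev/ost0_device      # repeat for every OST

# remount in order: MGS/MDT first, then the OSTs, then the clients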

mkfs.lustre common problems (ongoing)

When installing Lustre it is necessary to format the storage targets, which is done with mkfs.lustre. Normally this happens without issue, however from time to time it doesn’t. The following are issues I’ve seen, with solutions provided.

One of the most common issues is seen below (in this case, while we were trying to generate a metadata target):

mkfs.lustre --mdt --mgs --index 0 --fsname=lustre /dev/loop0

   Permanent disk data:
Target:     lustre:MDT0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x65
              (MDT MGS first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:

checking for existing Lustre data: not found
device size = 1024MB
formatting backing filesystem ldiskfs on /dev/loop0
        target name  lustre:MDT0000
        4k blocks     262144
        options        -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre:MDT0000  -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/loop0 262144
mkfs.lustre: Unable to mount /dev/loop0: No such device
Is the ldiskfs module available?

mkfs.lustre FATAL: failed to write local files
mkfs.lustre: exiting with 19 (No such device)

As the error above clearly indicates, the ldiskfs module isn’t loaded. Usually this means you forgot to load the Lustre modules, or worse, didn’t install the Lustre kernel.
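
A quick sanity check on the server (package names assume an RPM-based Lustre server install):

# confirm the modules are present and load
modprobe ldiskfs
modprobe lustre
lsmod | grep -E 'ldiskfs|lustre'

# confirm the running kernel matches the installed Lustre server kernel
uname -r
rpm -qa | grep -E 'lustre|kernel'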

Another error commonly seen, especially on more modern systems, is the one below:

mkfs.lustre --mdt --mgs --index 0 --fsname=lustre /dev/loop0

   Permanent disk data:
Target:     lustre:MDT0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x65
              (MDT MGS first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:

checking for existing Lustre data: not found
device size = 1024MB
formatting backing filesystem ldiskfs on /dev/loop0
        target name  lustre:MDT0000
        4k blocks     262144
        options        -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre:MDT0000  -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/loop0 262144
mkfs.lustre: Can't make configs dir /tmp/mnt0tovaB/CONFIGS (Permission denied)

mkfs.lustre FATAL: failed to write local files
mkfs.lustre: exiting with -1 (Unknown error 18446744073709551615)

The above error can indicate an issue with the e2fsprogs version installed; try updating it. Another possible cause is the presence of SELinux and its restrictive permissions. The simplest solution is to correct the permissions problem within SELinux, or to disable it.
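
If SELinux is the suspect, a quick check and temporary workaround looks like the following; disabling it permanently (SELINUX=disabled in /etc/selinux/config, then reboot) is a site policy decision:

# check the current SELinux mode
getenforce

# drop to permissive mode for this boot and retry the mkfs.lustre
setenforce 0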

What is Lustre?

From the Wikipedia article on Lustre File Systems:

Lustre is a type of parallel distributed file system, generally used for large-scale cluster computing. The name Lustre is a portmanteau word derived from Linux and cluster.[3] Lustre file system software is available under the GNU General Public License (version 2 only) and provides high performance file systems for computer clusters ranging in size from small workgroup clusters to large-scale, multi-site clusters.

Because Lustre file systems have high performance capabilities and open licensing, it is often used in supercomputers. At one time, six of the top 10 and more than 60 of the top 100 supercomputers in the world have Lustre file systems in them, including the world’s #2 ranked TOP500 supercomputer, Titan in 2013.[4]

Lustre file systems are scalable and can be part of multiple computer clusters with tens of thousands of client nodes, tens of petabytes (PB) of storage on hundreds of servers, and more than a terabyte per second (TB/s) of aggregate I/O throughput.[5][6] This makes Lustre file systems a popular choice for businesses with large data centers, including those in industries such as meteorology, simulation, oil and gas, life science, rich media, and finance.[7]

Lustre 2.x and insane inode numbers…

In Lustre 2 and above, the inode numbers, which on a standard file system represent the static integer ID of a file, now appear to be totally insane, being full 64-bit integers.

[root@coolermaster lustre]# stat x y z
  File: `x'
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d    Inode: 144115238843759750  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2013-10-08 12:40:26.000000000 -0600
Modify: 2013-10-08 12:40:26.000000000 -0600
Change: 2013-10-08 12:40:26.000000000 -0600
  File: `y'
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d    Inode: 144115238843759751  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2013-10-08 13:23:06.000000000 -0600
Modify: 2013-10-08 13:23:06.000000000 -0600
Change: 2013-10-08 13:23:06.000000000 -0600
  File: `z'
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d    Inode: 144115238843759752  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2013-10-08 13:23:06.000000000 -0600
Modify: 2013-10-08 13:23:06.000000000 -0600

Um... interesting, right? This is on a very small, single-node file system with 3 files, and a lifetime total file creation count in the tens of thousands, not the hundreds of trillions. After a quick discussion, Green (Oleg Drokin) correctly pointed out that the new inode numbering scheme is used to provide FIDs: file identifiers unique to Lustre which identify the file uniquely across all nodes.
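
You can look at these identifiers directly; a small sketch, assuming the file system is named lustre and mounted at /mnt/lustre (the FID in the second command is just an illustrative placeholder):

# show the Lustre FID behind a file
lfs path2fid /mnt/lustre/x

# map a FID back to its path
lfs fid2path lustre '[0x200000400:0x6:0x0]'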

So mystery solved, and the joy of insane inode numbers now begins.

Some Lustre FS basics

Lustre is a parallel distributed file system which provides high performance capabilities and open licensing; it is often used in supercomputers and extremely large scale hosting facilities (think Netflix via their 3rd party hosts). Generally, most of the world’s largest supercomputers utilize one or more Lustre file systems, including Titan, one of the world’s fastest supercomputers.

Lustre runs exclusively on Linux-based systems and is composed of 4 major parts:

1. MDS (Meta Data Server): this single node hosts all file system metadata for a single file system instance, and may host the management service as well. It is composed of a supported Linux installation running a supported kernel. It hosts the single major component, the MDT (Meta Data Target), which is usually a redundant, high speed storage array of some make-up.

 

When choosing hardware for your MDS it’s critical that it is highly reliable and well tested. It’s also critical that it can perform small reads and writes against your storage target with as much speed as possible. Multiple MDS units can be set up and installed to run against a shared MDT; however, these units must run in an actively managed Active<->Passive schema.

 

2. MGS (Management Server): this single node may be part of the MDS (described above) or hosted on its own dedicated node. It’s generally used as a site-wide, multi-file-system management and configuration node, however with new features such as imperative recovery the role of this node will be greatly expanded in the future.

 

3. OSS (Object Storage Server): these can be one or more dedicated, high bandwidth servers which host all object data on one or more OSTs (Object Storage Targets). The OSTs provide the primary data store of the file system. They also facilitate data transmission directly back to the client nodes.

 

OSS nodes benefit from as much memory as possible (for file caching), as well as as much system bus bandwidth as possible. Fast CPUs with good memory management and bus architecture help here. The other thing that I’ve found works very well is mdraid-based storage targets. These arrays allow for a huge amount of direct tuning and tinkering, but require much more work to get as much performance as possible out of them.
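
As one example of that tinkering (the md device name and values are purely illustrative, and the right numbers depend entirely on your hardware), a few of the knobs commonly poked on an mdraid-backed OST:

# check array layout and health
cat /proc/mdstat
mdadm --detail /dev/md0

# enlarge the RAID5/6 stripe cache (pages per device; tune and benchmark)
echo 4096 > /sys/block/md0/md/stripe_cache_size

# raise readahead on the array device
blockdev --setra 16384 /dev/md0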

 

4. Clients: like the server nodes described above, these are dedicated systems which serve compute environment needs or file system export needs (say, exporting to CIFS/Samba, NFS, dCache, etc.).

The big concern when fitting out your file system is that it will only be as fast as your slowest part. GigE, and even 10GigE or better, are good choices, however remember you’re losing 30% to TCP/IP overhead, and you’re also hurting your CPU with excess processing cycles to deal with those packets. A better choice is InfiniBand interconnects, which can run Lustre LNET networking (which I will describe in more detail in the next article).
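
As a small preview (interface names here are assumptions for a node with one InfiniBand port and one Ethernet port), LNET networks are typically declared through module options:

# /etc/modprobe.d/lustre.conf
options lnet networks=o2ib0(ib0),tcp0(eth0)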

In the next article I’ll discuss file system configuration and LNET (Lustre Networking).