LVM cheatsheet

Originally taken from: http://www.datadisk.co.uk/html_docs/redhat/rh_lvm.htm

Logical Volume Manager (LVM)

This is a quick and dirty cheat sheet on LVM under Linux. I have highlighted many of the common attributes for each command; however, this is not an exhaustive list, so make sure you look up the command.

With the pvs, vgs and lvs commands, each additional -v increases the verbosity, for example pvs -vvvvv

Directories and Files

# Directories
/etc/lvm – default lvm directory location
/etc/lvm/backup – where the automatic backups go
/etc/lvm/cache – persistent filter cache
/etc/lvm/archive – where automatic archives go after a volume group change
/var/lock/lvm – lock files to prevent metadata corruption

# Files
/etc/lvm/lvm.conf – main lvm configuration file
$HOME/.lvm – lvm history
Tools
diagnostic lvmdump
lvmdump -d <dir>
dmsetup [info|ls|status]

Note: by default the lvmdump command creates a tar ball
Physical Volumes
display
pvdisplay -v
pvs -v
pvs -a
pvs --segments (see the disk segments used)

pvs attributes are:
1. (a)llocatable
2. e(x)ported

scanning pvscan -v

Note: scans all disks for both LVM and non-LVM devices
adding pvcreate /dev/sdb1

## Create physical volume with specific UUID, used to recover volume groups (see miscellaneous section)
pvcreate --uuid <UUID> /dev/sdb1

Common Attributes that you may want to use:

-M2 create an LVM2 physical volume
removing pvremove /dev/sdb1
checking pvck -v /dev/sdb1

Note: check the consistency of the LVM metadata
change physical attributes
## disallow allocation of extents on this drive; the partition must already be in a VG, otherwise you get an error
pvchange -x n /dev/sdb1

Common Attributes that you may want to use:

--addtag add a tag
-x allowed to allocate extents
-u change the uuid

moving pvmove -v /dev/sdb2 /dev/sdb3

Note: moves any used extents from this volume to another volume, in readiness for removing the volume. Note that you cannot use this on mirrored volumes; you must first convert back to non-mirrored using "lvconvert -m 0"
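
For example, a full disk-replacement workflow built from the commands above might look like this (device and volume group names are illustrative):

## add the replacement disk to the volume group
pvcreate /dev/sdc1
vgextend VolData00 /dev/sdc1

## migrate all used extents off the old disk, then retire it
pvmove -v /dev/sdb2 /dev/sdc1
vgreduce VolData00 /dev/sdb2
pvremove /dev/sdb2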
Volume Groups
display vgdisplay -v
vgs -v
vgs -a -o +devices

vgs flags:
#PV – number of physical devices
#LV – number of configured volumes

vgs attributes are:
1. permissions (r)|(w)
2. resi(z)eable
3. e(x)ported
4. (p)artial
5. allocation policy – (c)ontiguous, c(l)ing, (n)ormal, (a)nywhere, (i)nherited
6. (c)luster
scanning vgscan -v
creating
vgcreate VolData00 /dev/sdb1 /dev/sdb2 /dev/sdb3
vgcreate VolData00 /dev/sdb[123]

## Use 32MB extent size
vgcreate VolData00 -s 32 /dev/sdb1

Common Attributes that you may want to use:

-l maximum logical volumes
-p maximum physical volumes
-s physical extent size (default is 4MB)
-A autobackup
extending vgextend VolData00 /dev/sdb3
reducing vgreduce VolData00 /dev/sdb3

vgreduce --removemissing --force VolData00
removing vgremove VolData00

Common Attributes that you may want to use:

-f force the removal of any logical volumes
checking vgck VolData00

Note: check the consistency of the LVM metadata
change volume attributes vgchange -a n VolData00

Common Attributes that you may want to use:

-a control availability of volumes within the group
-l maximum logical volumes
-p maximum physical volumes
-s physical extent size (default is 4MB)
-x resizable yes or no (see VG status in vgdisplay)
renaming vgrename VolData00 Data_Vol_01

note: the volume group must not have any active logical volumes
converting metadata type vgconvert -M2 VolData00

Note: vgconvert allows you to convert from one metadata format to another, for example from LVM1 to LVM2; LVM2 offers bigger capacity, clustering and mirroring
merging ## the old volume group will be merged into the new volume group
vgmerge New_Vol_Group Old_Vol_Group

Note: you must unmount any filesystems and deactivate the vg being merged ("vgchange -a n <vg>"); you can activate it again afterwards ("vgchange -a y <vg>"), then perform a vgscan. Don't forget to back up the configuration.
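
As a sketch, the full merge sequence described in the note might look like:

vgchange -a n Old_Vol_Group
vgmerge New_Vol_Group Old_Vol_Group
vgchange -a y New_Vol_Group
vgscan
vgcfgbackup New_Vol_Group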
splitting vgsplit Old_Vol_Group New_Vol_Group [physical volumes] [-n logical volume name]
importing vgimport VolData00

Common Attributes that you may want to use:

-a import all exported volume groups
exporting ## to see if a volume group has already been exported, use "vgs" and check the third attribute, which should be an x
vgexport VolData00

Common Attributes that you may want to use:

-a export all inactive volume groups
backing up
## Backup to default location (/etc/lvm/backup)
vgcfgbackup VolData00

# Backup to specific location
vgcfgbackup -f /var/backup/VolData00_bkup VolData00

# Backup all volume groups to a specific location (notice the %s)
vgcfgbackup -f /var/backup/vg_backups_%s

Note: the backups are written in plain text and are by default located in /etc/lvm/backup

restoring vgcfgrestore -f /var/backup/VolData00_bkup VolData00

Common Attributes that you may want to use:

-l list backups of file
-f backup file
-M metadata type 1 or 2
cloning vgimportclone /dev/sdb1

Note: used to import and rename a duplicated volume group
special files vgmknodes VolData00

Note: recreates volume group directory and logical volume special files in /dev
Logical Volumes
display
lvdisplay -v
lvdisplay --maps (display mirror volumes)

lvs -v
lvs -a -o +devices

## lvs commands for mirror volumes
lvs -a -o +devices
lvs -a -o +seg_pe_ranges --segments

## Stripe size
lvs -v --segments
lvs -a -o +stripes,stripesize

## use complex command
lvs -a -o +devices,stripes,stripesize,seg_pe_ranges --segments

lvs attributes are:
1. volume type: (m)irrored, (M)irrored without initial sync, (o)rigin, (p)vmove, (s)napshot, invalid (S)napshot, (v)irtual, mirror (i)mage
mirror (I)mage out-of-sync, under (c)onversion
2. permissions: (w)rite, (r)ead-only
3. allocation policy – (c)ontiguous, c(l)ing, (n)ormal, (a)nywhere, (i)nherited
4. fixed (m)inor
5. state: (a)ctive, (s)uspended, (I)nvalid snapshot, invalid (S)uspended snapshot, mapped (d)evice present without tables,
mapped device present with (i)nactive table
6. device (o)pen (mounted in other words)

scanning lvscan -v
lvmdiskscan
creating
## plain old volume
lvcreate -L 10M VolData00

## plain old volume but use extents, use 10 4MB extents (if extent size is 4MB)
lvcreate -l 10 VolData00

## plain old volume but with a specific name web01
lvcreate -L 10M -n web01 VolData00

## plain old volume but on a specific disk
lvcreate -L 10M VolData00 /dev/sdb1

## a striped volume called lvol1 (note: lowercase -i sets the number of stripes, capital -I sets the stripe size); you can use -l (extents) instead of -L
lvcreate -i 3 -L 24M -n lvol1 vg01

## Mirrored volume
lvcreate -L 10M -m1 -n data01 vg01

## Mirrored volume without a mirror log file
lvcreate -L 10M -m1 --mirrorlog core -n data01 vg01

Common Attributes that you may want to use:

-L size of the volume [kKmMgGtT]
-l number of extents
-C contiguous [y|n]
-i stripes
-I stripe size
-m mirrors
--mirrorlog
-n volume name

extending
lvextend -L 20M /dev/VolData00/vol01

Common Attributes that you may want to use:

-L size of the volume [kKmMgGtT]
-l number of extents
-C contiguous [y|n]
-i stripes
-I stripe size

Note: you can extend an ext2/ext3 filesystem using the "resize2fs" or "fsadm" command

fsadm resize /dev/VolData01/data01
resize2fs -p /dev/mapper/VolData01-data01 [size]

The -p option displays progress bars while extending the filesystem

reducing/resizing
lvreduce -L 5M /dev/VolData00/vol01
lvresize -L 5M /dev/VolData00/vol01

Note: sizes are rounded to the next extent boundary (4MB by default) when extending and reducing volumes; you can use resize2fs or fsadm to shrink the filesystem (see the sketch below)

fsadm resize /dev/VolData01/data01 [size]
resize2fs -p /dev/mapper/VolData01-data01 [size]
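
When shrinking, reduce the filesystem before reducing the volume, otherwise you will truncate live data. A minimal sketch (sizes, names and mount point are illustrative):

umount /data01
e2fsck -f /dev/VolData01/data01
resize2fs /dev/VolData01/data01 5M
lvreduce -L 5M /dev/VolData01/data01
mount /dev/VolData01/data01 /data01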

removing lvremove /dev/VolData00/vol01
adding a mirror to a non-mirrored volume
lvconvert -m1 --mirrorlog core /dev/VolData00/vol01 /dev/sdb2

Note: you can also use the above command to remove an unwanted log

removing a mirror from a mirrored volume
lvconvert -m0 /dev/VolData00/vol01 /dev/sdb2

Note: the disk in the command is the one you want to remove

Mirror a volume that has stripes lvconvert --stripes 3 -m1 --mirrorlog core /dev/VolData00/data01 /dev/sdd1 /dev/sde1 /dev/sdf1
change volume attributes
lvchange -a n /dev/VolData00/vol01

Common Attributes that you may want to use:

-a availability
-C contiguous [y|n]
renaming lvrename /dev/VolData00/vol_old /dev/VolData00/vol_new
snapshotting lvcreate --size 100M --snapshot --name snap /dev/vg01/data01
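
A snapshot is typically mounted for a consistent backup and then discarded. A quick sketch (mount point and backup path are illustrative):

mkdir -p /mnt/snap
mount -o ro /dev/vg01/snap /mnt/snap
tar czf /backup/data01.tar.gz -C /mnt/snap .
umount /mnt/snap
lvremove /dev/vg01/snap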
Miscellaneous
Simulating a disk failure dd if=/dev/zero of=/dev/sdb2 count=10
repairing a failed mirror, no LVM corruption ## check the volume, presume /dev/sdb2 has failed
lvs -a -o +devices

# remove the failed disk from the volume group (if not already done); this converts the volume into a non-mirrored volume
vgreduce --removemissing --force VolData00

## replace the disk physically, remembering to partition it with type 8e
fdisk /dev/sdb
……..

## add new disk to LVM
pvcreate /dev/sdb2

## add the disk back into volume group
vgextend VolData00 /dev/sdb2

## mirror up the volume
lvconvert -m1 --mirrorlog core /dev/VolData00/vol02 /dev/sdb2
corrupt LVM metadata without replacing drive # attempt to bring the volume group online
vgchange -a y VolData00

# Restore the LVM configuration
vgcfgrestore VolData00

# attempt to bring the volume group online
vgchange -a y VolData00

# file system check
e2fsck /dev/VolData00/data01
corrupt LVM metadata but replacing the faulty disk
# attempt to bring the volume group online; if you get UUID conflict errors, make note of the UUID number
vgchange -a y VolData00
vgchange -a n VolData00

## sometimes it may only be a logical volume problem
lvchange -a y /dev/VolData00/web02
lvchange -a n /dev/VolData00/web02

## replace the disk physically, remembering to partition it with type 8e
fdisk /dev/sdb
……..

# after replacing the faulty drive, the disk must be given the previous UUID number; you can get it from the errors above or from the /etc/lvm directory
pvcreate --uuid <previous UUID number taken from above command> /dev/sdb2

# Restore the LVM configuration
vgcfgrestore VolData00

# attempt to bring the volume group online or logical volume
vgchange -a y VolData00
lvchange -a y /dev/VolData00/web02

# file system check
e2fsck /dev/VolData00/data01

Note: if you have backed up the volume group configuration, you can obtain the UUID number from the backup file (by default located in /etc/lvm/backup) or by running "pvs -v"


Some thoughts about external journaling

As a general rule of thumb, external journaling on ext4 / ldiskfs type file systems can greatly improve overall write performance. This is because you are offloading small (4K) journal writes to an external disk or array of disks. Since these journal writes are linear and no longer require moving the heads around on the primary data target, great gains can be achieved (especially when performing sequential large I/O writes).

So this is all good, right? Faster, more optimized writes, and the safety of having a journal tracking all writes in the event of an unexpected storage target failure or outage.

Except there's a problem… If your journal device experiences some kind of failure event that results in journal record corruption, or in-transit memory/CPU-based errors occur, things can go from good to very, very bad quickly.

This is because, upon mount (after either a clean or unclean shutdown of the file system), the journal is blindly replayed, regardless of its content. This means that if you have a corrupted transaction record within your journal, or a transaction record is corrupted in flight (due to a memory or CPU error), you are at risk of severe file system damage and likely data loss.

There is a solution: either avoid external journaling (which only partially addresses the issue) or, better, enable the journal checksum feature (journal_checksum). This will go a long way toward preventing corrupted journal data from reaching your file system and should help ensure that the file system structure remains undamaged.
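
As a sketch of one way to set this up (device names and mount point are illustrative; note that the external journal's block size must match the filesystem's):

## create a dedicated external journal device
mke2fs -O journal_dev -b 4096 /dev/sdc1

## create the filesystem pointing at the external journal
mkfs.ext4 -b 4096 -J device=/dev/sdc1 /dev/VolData00/data01

## mount with journal checksumming enabled
mount -o journal_checksum /dev/VolData00/data01 /data01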

Lustre DNE (Distributed Namespace) basics

What is Lustre DNE?

The Lustre Distributed Namespace (DNE) feature is a new technology developed for the Lustre file system. The feature itself allows the distribution of file and directory metadata across multiple Lustre Metadata Targets (MDTs). With this, you can now effectively scale your metadata load across a nearly unlimited number of targets. This both reduces single metadata server hardware requirements and greatly expands the maximum number of files your Lustre file system may contain.

The metadata structure appears as follows:

[Figure: dne_remote_dir, from Intel's DNE proposal document]

Here we see that the parent directory (which in this case is the root of the file system) spawns a subdirectory indexed against the secondary Metadata Target (MDT1).

How is Lustre DNE Setup / Enabled?

Lustre DNE was originally introduced in Lustre 2.4 and further refined in later versions. In this case my examples below are from a recent Lustre 2.5 installation.

Lustre DNE can be enabled on any Lustre file system of version 2.4.0 or greater. To enable DNE, all one must do is create a new Metadata Target with an index greater than 0, pointed at your Lustre management node (MGS). Multiple Metadata Targets may exist on the same node, however this is not recommended, as you will run into CPU and memory contention and may not see much gain due to resource starvation (bus speed, etc.).

Once formatted and mounted, DNE will be available; that said, any clients already mounted will need to re-mount the file system to make use of the new Metadata Target(s). This is because online DNE changes are currently not supported, though there are plans to allow this in the future.
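
As a sketch (the device name and MGS NID are illustrative), adding a second metadata target with index 1 might look like:

mkfs.lustre --mdt --index 1 --fsname=lustre --mgsnode=mgs@tcp0 /dev/sdc
mkdir -p /mnt/mdt1
mount -t lustre /dev/sdc /mnt/mdt1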

How do I create directories which will exist on different DNE targets?

Creating directories that point to different DNE targets (Metadata Targets) is quite simple. Once you've mounted all your MDTs against your file system, and mounted (or remounted) your clients, you can create directories that target specific Metadata Targets on specific Metadata Servers. To do this, utilize the new lfs mkdir command:

# lfs mkdir -i 1 MDT1
# lfs mkdir -i 2 MDT2

The above commands will generate two directories: MDT1, which utilizes metadata target index 1, and MDT2, which utilizes metadata target index 2.

Once complete, all new subdirectories and files created within either the MDT1 or MDT2 directories will exist on the specified metadata target (in this case, targets with index numbers 1 and 2).

Files or directories moved out of, say, MDT1 into MDT2 will be moved from metadata target 1 to metadata target 2, and so on.
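
To confirm where a directory actually landed, recent lfs versions can report the MDT index; treat the exact option as an assumption to verify against your release:

# lfs getstripe -M /lustre/MDT1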

How do I disable DNE or remove Metadata Targets I no longer use?

Disabling DNE and removing Metadata targets requires that you first move all file and directory metadata off of all metadata targets greater than 0. An example of this would be a file system structured like so:

(MDT0) ROOT /
(MDT1) ROOT / MDT1
(MDT1) ROOT / MDT1 / HOMES
(MDT1) ROOT / MDT1 / DATA
(MDT2) ROOT / MDT2
(MDT2) ROOT / MDT2 / SCRATCH
(MDT0) ROOT / USER_DATA

To disable MDT2 you will need to move SCRATCH and all associated sub directories and files into either ROOT/MDT1/. or ROOT/.

Once you have removed the data from the MDT2 DNE target you can unmount the provider for MDT2.

To completely disable DNE you will need to move all data on MDT1 and MDT2 DNE targets back to MDT0, in this case ROOT/. or ROOT/USER_DATA/.

In either case of MDT target removal, a tunefs.lustre --writeconf is required against all file system targets (primary MGS, MDTs, and all OSTs).
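
A sketch of that procedure (device names are illustrative; stop the file system completely, unmounting clients first):

tunefs.lustre --writeconf /dev/mdt_device
tunefs.lustre --writeconf /dev/ost0_device
tunefs.lustre --writeconf /dev/ost1_device

## then remount in order: MGS/MDT0 first, any remaining MDTs, the OSTs, and finally the clients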

mkfs.lustre common problems (ongoing)

When installing Lustre it is necessary to generate the storage targets; this is done with mkfs.lustre. Normally this happens without issue, but from time to time it doesn't. The following are issues I've seen, with solutions provided.

One of the most common issues is seen below (in this case, while we were trying to generate a metadata target):

mkfs.lustre --mdt --mgs --index 0 --fsname=lustre /dev/loop0

   Permanent disk data:
Target:     lustre:MDT0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x65
              (MDT MGS first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:

checking for existing Lustre data: not found
device size = 1024MB
formatting backing filesystem ldiskfs on /dev/loop0
        target name  lustre:MDT0000
        4k blocks     262144
        options        -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre:MDT0000  -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/loop0 262144
mkfs.lustre: Unable to mount /dev/loop0: No such device
Is the ldiskfs module available?

mkfs.lustre FATAL: failed to write local files
mkfs.lustre: exiting with 19 (No such device)

As the error above clearly indicates, the ldiskfs module isn't loaded; usually this means you forgot to load the Lustre modules, or worse, didn't install the Lustre kernel.
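
A quick check and fix, assuming the Lustre kernel packages are installed:

modprobe lustre
modprobe ldiskfs
lsmod | grep -E 'lustre|ldiskfs'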

Another error commonly seen, especially on more modern systems, is shown below:

mkfs.lustre --mdt --mgs --index 0 --fsname=lustre /dev/loop0

   Permanent disk data:
Target:     lustre:MDT0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x65
              (MDT MGS first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:

checking for existing Lustre data: not found
device size = 1024MB
formatting backing filesystem ldiskfs on /dev/loop0
        target name  lustre:MDT0000
        4k blocks     262144
        options        -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre:MDT0000  -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/loop0 262144
mkfs.lustre: Can't make configs dir /tmp/mnt0tovaB/CONFIGS (Permission denied)

mkfs.lustre FATAL: failed to write local files
mkfs.lustre: exiting with -1 (Unknown error 18446744073709551615)

The above error can indicate an issue with the installed e2fsprogs version; try updating it. Another possible cause is the presence of SELinux and restrictive permissions. The simplest solution is to correct the permissions problem within SELinux, or disable it.
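
As a sketch, to check whether SELinux is the culprit and relax it temporarily (make the change permanent via SELINUX=permissive or disabled in /etc/selinux/config):

getenforce
setenforce 0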

What is Lustre?

From the Wikipedia article on Lustre File Systems:

Lustre is a type of parallel distributed file system, generally used for large-scale cluster computing. The name Lustre is a portmanteau word derived from Linux and cluster.[3] Lustre file system software is available under the GNU General Public License (version 2 only) and provides high performance file systems for computer clusters ranging in size from small workgroup clusters to large-scale, multi-site clusters.

Because Lustre file systems have high performance capabilities and open licensing, it is often used in supercomputers. At one time, six of the top 10 and more than 60 of the top 100 supercomputers in the world had Lustre file systems in them, including the world's #2 ranked TOP500 supercomputer, Titan, in 2013.[4]

Lustre file systems are scalable and can be part of multiple computer clusters with tens of thousands of client nodes, tens of petabytes (PB) of storage on hundreds of servers, and more than a terabyte per second (TB/s) of aggregate I/O throughput.[5][6] This makes Lustre file systems a popular choice for businesses with large data centers, including those in industries such as meteorology, simulation, oil and gas, life science, rich media, and finance.[7]

Lustre 2.x and insane inode numbers…

In Lustre 2 and above, the inode numbers, which on a standard file system represent the static integer ID of a file, now appear to be totally insane: full 64-bit integers.

[root@coolermaster lustre]# stat x y z
  File: `x'
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d    Inode: 144115238843759750  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2013-10-08 12:40:26.000000000 -0600
Modify: 2013-10-08 12:40:26.000000000 -0600
Change: 2013-10-08 12:40:26.000000000 -0600
  File: `y'
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d    Inode: 144115238843759751  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2013-10-08 13:23:06.000000000 -0600
Modify: 2013-10-08 13:23:06.000000000 -0600
Change: 2013-10-08 13:23:06.000000000 -0600
  File: `z'
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d    Inode: 144115238843759752  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2013-10-08 13:23:06.000000000 -0600
Modify: 2013-10-08 13:23:06.000000000 -0600

Um.. interesting, right? This is on a very small, single-node file system with 3 files and a lifetime total file creation in the tens of thousands, not the hundreds of trillions. In a quick discussion, Green (Oleg Drokin) correctly pointed out that the new inode numbering scheme is used to provide FIDs; these are file identifiers unique to Lustre that identify the file uniquely across all nodes.

So mystery solved, and the joy of insane inode numbers now begins.
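
If you want to work with these identifiers directly, lfs can translate between paths and FIDs (paths are illustrative):

# lfs path2fid /lustre/x
# lfs fid2path /lustre <FID>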

mdtest usage file / readme

Below is the README file for mdtest, an application designed to test the metadata performance of a given file system.

/******************************************************************************\
*                                                                              *
*       Copyright (c) 2003, The Regents of the University of California        *
*     See the file COPYRIGHT for a complete copyright notice and license.      *
*                                                                              *
\******************************************************************************/

Usage: mdtest [-b #] [-B] [-c] [-C] [-d testdir] [-D] [-f first] [-F] [-h]
               [-i iterations] [-I #] [-l last] [-L] [-n #] [-N #] [-p seconds]
               [-r] [-R[#]] [-s #] [-S] [-t] [-T] [-u] [-v] [-V #] [-w #] [-y]
               [-z #]

    -b: branching factor of hierarchical directory structure
    -B: no barriers between phases (create/stat/remove)
    -c: collective creates: task 0 does all creates and deletes
    -C: only create files/dirs
    -d: the directory in which the tests will run
    -D: perform test on directories only (no files)
    -f: first number of tasks on which the test will run
    -F: perform test on files only (no directories)
    -h: prints help message
    -i: number of iterations the test will run
    -I: number of items per tree node
    -l: last number of tasks on which the test will run
    -L: files/dirs created only at leaf level
    -n: every task will create/stat/remove # files/dirs per tree
    -N: stride # between neighbor tasks for file/dir stat (local=0)
    -p: pre-iteration delay (in seconds)
    -r: only remove files/dirs
    -R: randomly stat files/dirs (optional seed can be provided)
    -s: stride between the number of tasks for each test
    -S: shared file access (file only, no directories)
    -t: time unique working directory overhead
    -T: only stat files/dirs
    -u: unique working directory for each task
    -v: verbosity (each instance of option increments by one)
    -V: verbosity value
    -w: number of bytes to write to each file
    -y: sync file after write completion
    -z: depth of hierarchical directory structure

NOTES:
 * -N allows a "read-your-neighbor" approach by setting stride to
    tasks-per-node
 * -d allows multiple paths of the form '-d fullpath1@fullpath2@fullpath3'
 * -B allows each task to time itself. The aggregate results reflect this
    change.
 * -n and -I cannot be used together. -I specifies the number of files/dirs
   created per tree node, whereas the -n specifies the total number of
   files/dirs created over an entire tree. When using -n, integer division is
   used to determine the number of files/dirs per tree node. (E.g. if -n is
   10 and there are 4 tree nodes (z=1 and b=3), there will be 2 files/dirs per
   tree node.)
 * -R and -T can be used separately. -R merely indicates that if files/dirs
   are going to be stat'ed, then they will be stat'ed randomly.

Illustration of terminology:

                     Hierarchical directory structure (tree)

                                   =======
                                  |       |  (tree node)
                                   =======
                                  /   |   \
                            ------    |    ------
                           /          |          \
                       =======     =======     =======
                      |       |   |       |   |       |  (leaf level)
                       =======     =======     =======

    In this example, the tree has a depth of one (z=1) and branching factor of
    three (b=3). The node at the top of the tree is the root node. The level
    of nodes furthest from the root is the leaf level. All trees created by
    mdtest are balanced.

    To see how mdtest operates, do a simple run like the following:

        mdtest -z 1 -b 3 -I 10 -C -i 3

    This command will create a tree like the one above, then each task will
    create 10 files/dirs per tree node. Three of these trees will be created
    (one for each iteration).

Example usages:

mdtest -I 10 -z 5 -b 2

    A directory tree is created in the current working directory that has a
    depth of 5 and a branching factor of 2. Each task operates on 10
    files/dirs in each tree node.

mdtest -I 10 -z 5 -b 2 -R

    This example is the same as the previous one except that the files/dirs are
    stat'ed randomly.

mdtest -I 10 -z 5 -b 2 -R4

    Again, this example is the same as the previous except a seed of 4 is
    passed to the random number generator.

mdtest -I 10 -z 5 -b 2 -L

    A directory tree is created as described above, but in this example
    files/dirs exist only at the leaf level of the tree.

mdtest -n 100 -i 3 -d /users/me/testing

    Each task creates 100 files/dirs in a root node (there are no branches
    out of the root node) within the path /users/me/testing. This is done
    three times. Aggregate values are calculated over the iterations.

mdtest -n 100 -F -C

    Each task only creates 100 files in the current directory.
    Directories are not created. The files are neither stat'ed nor
    removed.

mdtest -I 5 -z 3 -b 5 -u -d /users/me/testing

    Each task creates a directory tree in the /users/me/testing
    directory. Each tree has a depth of 3 and a branching factor of
    5. Five files/dirs are operated upon in each node of each tree.

mdtest -I 5 -z 3 -b 5 -u -d /users/me/testing@/some/other/location

    This run is the same as the previous except that each task creates
    its tree in a different directory. Task 0 will create a tree in
    /users/me/testing. Task 1 will create a tree in /some/other/location.
    After all of the directories are used, the remaining tasks round-
    robin over the directories supplied. (I.e. Task 2 will create a
    tree in /users/me/testing, etc.)

IOR usage file

IOR is a file system testing tool designed to work in a clustered environment. Below are the usage instructions.

/*****************************************************************************\
*                                                                             *
*       Copyright (c) 2003, The Regents of the University of California       *
*     See the file COPYRIGHT for a complete copyright notice and license.     *
*                                                                             *
\*****************************************************************************/
/********************* Modifications to IOR-2.10.1 ****************************
* Hodson, 8/18/2008:                                                          *
* Documentation updated for the following new option:                         *
* The modifications to IOR-2.10.1 extend existing random I/O capabilities and *
* enhance performance output statistics.                                      *
*                                                                             *
*  cli        script                  Description                             *
* -----      -----------------        ----------------------------------------*
* 1)  -A N   testnum                - test reference number for easier test   *
*                                     identification in log files             *
* 2)  -Q N   taskpernodeoffset      - for read tests. Use with -C & -Z options*
*                                     If -C (reordertasks) specified,         *
*                                     then node offset read by CONSTANT  N.   *
*                                     If -Z (reordertasksrandom) specified,   *
*                                     then node offset read by RANDOM >= N.   *
* 3)  -Z     reordertasksrandom     - random node task ordering for read tests*
*                                     In this case, processes read            *
*                                     data written by other processes with    *
*                                     node offsets specified by the -Q option *
*                                     and -X option.                          *
* 4)  -X N   reordertasksrandomseed - random seed for -Z (reordertasksrandom) *
*                                     If N>=0, use same seed for all iters    *
*                                     If N< 0, use different seed for ea. iter*
* 5)  -Y     fsyncperwrite          - perform fsync after every POSIX write   *
\*****************************************************************************/

                                 IOR USER GUIDE

Index:
  * Basics
      1.  Description
      2.  Building IOR
      3.  Running IOR
      4.  Options

  * More Information
      5.  Option details
      6.  Verbosity levels
      7.  Using Scripts

  * Troubleshooting
      8.  Compatibility with older versions

  * Frequently Asked Questions
      9.  How do I . . . ?

  * Output
     10.  Enhanced output description

*******************
* 1.  DESCRIPTION *
*******************
IOR can be used for testing performance of parallel file systems using various 
interfaces and access patterns.  IOR uses MPI for process synchronization.  
IOR version 2 is a complete rewrite of the original IOR (Interleaved-Or-Random) 
version 1 code.  

*******************
* 2. BUILDING IOR *
*******************
Build Instructions:

  Type 'gmake [posix|mpiio|hdf5|ncmpi|all]' from the IOR/ directory. 
  On some platforms, e.g., Cray XT, specify "CC=cc" to build using "cc" instead of "mpicc".
  In IOR/src/C, the file Makefile.config currently has settings for AIX, Linux,
  OSF1 (TRU64), and IRIX64 to model on.  Note that MPI must be present for
  building/running IOR, and that MPI I/O must be available for MPI I/O, HDF5,
  and Parallel netCDF builds.  As well, HDF5 and Parallel netCDF libraries are
  necessary for those builds.  All IOR builds include the POSIX interface.

  You can build IOR as a native Windows application. One way to do this is to 
  use the "Microsoft Windows SDK", which is available as a free download and 
  includes development tools like the Visual C++ compiler. To build IOR with 
  MPI-IO support, also download and install the "Microsoft HPC Pack SDK". 
  Once these packages are installed on your Windows build system, follow these 
  steps:
    1. Open a "CMD Shell" under the Start menu Microsoft Windows SDK group.
    2. cd to the IOR directory containing ior.vcproj
    3. Run: vcbuild ior.vcproj "Release|x64"

  ior.exe will be created in the directory IOR/x64/Release. "Debug|x64",
  "Release|Win32", and "Debug|Win32" configurations can also be built.  
  To build IOR without MPI-IO support, first edit ior.vcproj and replace 
  aiori-MPIIO.c with aiori-noMPIIO.c.

******************
* 3. RUNNING IOR *
******************
Two ways to run IOR:

  * Command line with arguments -- executable followed by command line options.

    E.g., to execute:  IOR -w -r -o filename
    This performs a write and a read to the file 'filename'.

  * Command line with scripts -- any arguments on the command line will 
    establish the default for the test run, but a script may be used in
    conjunction with this for varying specific tests during an execution of the
    code.

    E.g., to execute:  IOR -W -f script
    This defaults all tests in 'script' to use write data checking.

**************
* 4. OPTIONS *
**************
These options are to be used on the command line. E.g., 'IOR -a POSIX -b 4K'.
  -A N  testNum -- test number for reference in some output
  -a S  api --  API for I/O [POSIX|MPIIO|HDF5|NCMPI]
  -b N  blockSize -- contiguous bytes to write per task  (e.g.: 8, 4k, 2m, 1g)
  -B    useO_DIRECT -- uses O_DIRECT for POSIX, bypassing I/O buffers
  -c    collective -- collective I/O
  -C    reorderTasks -- changes task ordering to n+1 ordering for readback
  -Q N  taskPerNodeOffset for read tests use with -C & -Z options (-C constant N, -Z at least N) [!HDF5]
  -Z    reorderTasksRandom -- changes task ordering to random ordering for readback
  -X N  reorderTasksRandomSeed -- random seed for -Z option
  -d N  interTestDelay -- delay between reps in seconds
  -D N  deadlineForStonewalling -- seconds before stopping write or read phase
  -Y    fsyncPerWrite -- perform fsync after each POSIX write
  -e    fsync -- perform fsync upon POSIX write close
  -E    useExistingTestFile -- do not remove test file before write access
  -f S  scriptFile -- test script name
  -F    filePerProc -- file-per-process
  -g    intraTestBarriers -- use barriers between open, write/read, and close
  -G N  setTimeStampSignature -- set value for time stamp signature
  -h    showHelp -- displays options and help
  -H    showHints -- show hints
  -i N  repetitions -- number of repetitions of test
  -I    individualDataSets -- datasets not shared by all procs [not working]
  -j N  outlierThreshold -- warn on outlier N seconds from mean
  -J N  setAlignment -- HDF5 alignment in bytes (e.g.: 8, 4k, 2m, 1g)
  -k    keepFile -- don't remove the test file(s) on program exit
  -K    keepFileWithError  -- keep error-filled file(s) after data-checking
  -l    storeFileOffset -- use file offset as stored signature
  -m    multiFile -- use number of reps (-i) for multiple file count
  -n    noFill -- no fill in HDF5 file creation
  -N N  numTasks -- number of tasks that should participate in the test
  -o S  testFile -- full name for test
  -O S  string of IOR directives (e.g. -O checkRead=1,lustreStripeCount=32)
  -p    preallocate -- preallocate file size
  -P    useSharedFilePointer -- use shared file pointer [not working]
  -q    quitOnError -- during file error-checking, abort on error
  -r    readFile -- read existing file
  -R    checkRead -- check read after read
  -s N  segmentCount -- number of segments
  -S    useStridedDatatype -- put strided access into datatype [not working]
  -t N  transferSize -- size of transfer in bytes (e.g.: 8, 4k, 2m, 1g)
  -T N  maxTimeDuration -- max time in minutes to run tests
  -u    uniqueDir -- use unique directory name for each file-per-process
  -U S  hintsFileName -- full name for hints file
  -v    verbose -- output information (repeating flag increases level)
  -V    useFileView -- use MPI_File_set_view
  -w    writeFile -- write file
  -W    checkWrite -- check read after write
  -x    singleXferAttempt -- do not retry transfer if incomplete
  -z    randomOffset -- access is to random, not sequential, offsets within a file

NOTES: * S is a string, N is an integer number.
       * For transfer and block sizes, the case-insensitive K, M, and G
         suffixes are recognized.  I.e., '4k' or '4K' is accepted as 4096.

*********************
* 5. OPTION DETAILS *
*********************
For each of the general settings, note the default is shown in brackets.
IMPORTANT NOTE: For all true/false options below [1]=true, [0]=false
IMPORTANT NOTE: Contrary to appearance, the script options below are NOT case sensitive

GENERAL:
========
  * testNum              - test reference number for some output [-1]

  * api                  - must be set to one of POSIX, MPIIO, HDF5, or NCMPI
                           depending on test [POSIX]

  * testFile             - name of the output file [testFile]
                           NOTE: with filePerProc set, the tasks can round 
                                 robin across multiple file names '-o S@S@S'

  * hintsFileName        - name of the hints file []

  * repetitions          - number of times to run each test [1]

  * multiFile            - creates multiple files for single-shared-file or
                           file-per-process modes; i.e. each iteration creates
                           a new file [0=FALSE]

  * reorderTasksConstant - reorders tasks by a constant node offset for writing/reading neighbor's
                           data from different nodes [0=FALSE]

  * taskPerNodeOffset    - for read tests. Use with -C & -Z options. [1]
                           With reorderTasks, constant N. With reordertasksrandom, >= N

  * reorderTasksRandom   - reorders tasks to random ordering for readback [0=FALSE]

  * reorderTasksRandomSeed - random seed for reordertasksrandom option. [0]
                              >0, same seed for all iterations. <0, different seed for each iteration

  * quitOnError          - upon error encountered on checkWrite or checkRead,
                           display current error and then stop execution;
                           if not set, count errors and continue [0=FALSE]

  * numTasks             - number of tasks that should participate in the test
                           [0]
                           NOTE: 0 denotes all tasks

  * interTestDelay       - this is the time in seconds to delay before
                           beginning a write or read in a series of tests [0]
                           NOTE: it does not delay before a check write or
                                 check read

  * outlierThreshold     - gives warning if any task is more than this number
                           of seconds from the mean of all participating tasks.
                           If so, the task is identified, its time (start,
                           elapsed create, elapsed transfer, elapsed close, or
                           end) is reported, as is the mean and standard
                           deviation for all tasks.  The default for this is 0,
                           which turns it off.  If set to a positive value, for
                           example 3, any task not within 3 seconds of the mean
                           displays its times. [0]

  * intraTestBarriers    - use barrier between open, write/read, and close [0=FALSE]

  * uniqueDir            - create and use unique directory for each
                           file-per-process [0=FALSE]

  * writeFile            - writes file(s), first deleting any existing file [1=TRUE]
                           NOTE: the defaults for writeFile and readFile are
                                 set such that if there is not at least one of
                                 the following -w, -r, -W, or -R, it is assumed
                                 that -w and -r are expected and are
                                 consequently used -- this is only true with
                                 the command line, and may be overridden in
                                 a script

  * readFile             - reads existing file(s) (from current or previous
                           run) [1=TRUE]
                           NOTE: see writeFile notes

  * filePerProc          - accesses a single file for each processor; default
                           is a single file accessed by all processors [0=FALSE]

  * checkWrite           - read data back and check for errors against known
                           pattern; can be used independently of writeFile [0=FALSE]
                           NOTES: * data checking is not timed and does not
                                    affect other performance timings
                                  * all errors tallied and returned as program
                                    exit code, unless quitOnError set

  * checkRead            - reread data and check for errors between reads; can
                           be used independently of readFile [0=FALSE]
                           NOTE: see checkWrite notes

  * keepFile             - stops removal of test file(s) on program exit [0=FALSE]

  * keepFileWithError    - ensures that with any error found in data-checking,
                           the error-filled file(s) will not be deleted [0=FALSE]

  * useExistingTestFile  - do not remove test file before write access [0=FALSE]

  * segmentCount         - number of segments in file [1]
                           NOTES: * a segment is a contiguous chunk of data
                                    accessed by multiple clients each writing/
                                    reading their own contiguous data;
                                    comprised of blocks accessed by multiple
                                    clients
                                  * with HDF5 this repeats the pattern of an
                                    entire shared dataset

  * blockSize            - size (in bytes) of a contiguous chunk of data
                           accessed by a single client; it is comprised of one
                           or more transfers [1048576]

  * transferSize         - size (in bytes) of a single data buffer to be
                           transferred in a single I/O call [262144]

  * verbose              - output information [0]
                           NOTE: this can be set to levels 0-5 on the command
                                 line; repeating the -v flag will increase
                                 verbosity level

  * setTimeStampSignature - set value for time stamp signature [0]
                            NOTE: used to rerun tests with the exact data
                                  pattern by setting data signature to contain
                                  positive integer value as timestamp to be
                                  written in data file; if set to 0, is
                                  disabled

  * showHelp             - display options and help [0=FALSE]

  * storeFileOffset      - use file offset as stored signature when writing
                           file [0=FALSE]
                           NOTE: this will affect performance measurements

  * maxTimeDuration      - max time in minutes to run tests [0]
                           NOTES: * setting this to zero (0) unsets this option
                                  * this option allows the current read/write
                                    to complete without interruption

  * deadlineForStonewalling - seconds before stopping write or read phase [0]
                           NOTES: * used for measuring the amount of data moved
                                    in a fixed time.  After the barrier, each
                                    task starts its own timer, begins moving
                                    data, and the stops moving data at a pre-
                                    arranged time.  Instead of measuring the
                                    amount of time to move a fixed amount of
                                    data, this option measures the amount of
                                    data moved in a fixed amount of time.  The
                                    objective is to prevent tasks slow to
                                    complete from skewing the performance. 
                                  * setting this to zero (0) unsets this option
                                  * this option is incompatible w/data checking

  * randomOffset         - access is to random, not sequential, offsets within a file [0=FALSE]
                           NOTES: * this option is currently incompatible with:
                                    -checkRead
                                    -storeFileOffset
                                    -MPIIO collective or useFileView
                                    -HDF5 or NCMPI

POSIX-ONLY:
===========
  * useO_DIRECT          - use O_DIRECT for POSIX, bypassing I/O buffers [0]

  * singleXferAttempt    - will not continue to retry transfer entire buffer
                           until it is transferred [0=FALSE]
                           NOTE: when performing a write() or read() in POSIX,
                                 there is no guarantee that the entire
                                 requested size of the buffer will be
                                 transferred; this flag keeps the retrying a
                                 single transfer until it completes or returns
                                 an error

  * fsyncPerWrite        - perform fsync after each POSIX write  [0=FALSE]
  * fsync                - perform fsync after POSIX write close [0=FALSE]

MPIIO-ONLY:
===========
  * preallocate          - preallocate the entire file before writing [0=FALSE]

  * useFileView          - use an MPI datatype for setting the file view option
                           to use individual file pointer [0=FALSE]
                           NOTE: default IOR uses explicit file pointers

  * useSharedFilePointer - use a shared file pointer [0=FALSE] (not working)
                           NOTE: default IOR uses explicit file pointers

  * useStridedDatatype   - create a datatype (max=2GB) for strided access; akin
                           to MULTIBLOCK_REGION_SIZE [0] (not working)

HDF5-ONLY:
==========
  * individualDataSets   - within a single file each task will access its own
                           dataset [0=FALSE] (not working)
                           NOTE: default IOR creates a dataset the size of
                                 numTasks * blockSize to be accessed by all
                                 tasks

  * noFill               - no pre-filling of data in HDF5 file creation [0=FALSE]

  * setAlignment         - HDF5 alignment in bytes (e.g.: 8, 4k, 2m, 1g) [1]

MPIIO-, HDF5-, AND NCMPI-ONLY:
==============================
  * collective           - uses collective operations for access [0=FALSE]

  * showHints            - show hint/value pairs attached to open file [0=FALSE]
                           NOTE: not available in NCMPI

LUSTRE-SPECIFIC:
================
  * lustreStripeCount    - set the lustre stripe count for the test file(s) [0]

  * lustreStripeSize     - set the lustre stripe size for the test file(s) [0]

  * lustreStartOST       - set the starting OST for the test file(s) [-1]

  * lustreIgnoreLocks    - disable lustre range locking [0]

***********************
* 6. VERBOSITY LEVELS *
***********************
The verbosity of output for IOR can be set with -v.  Increasing the number of
-v instances on a command line sets the verbosity higher.

Here is an overview of the information shown for different verbosity levels:
  0 - default; only bare essentials shown
  1 - max clock deviation, participating tasks, free space, access pattern,
      commence/verify access notification w/time
  2 - rank/hostname, machine name, timer used, individual repetition
      performance results, timestamp used for data signature
  3 - full test details, transfer block/offset compared, individual data
      checking errors, environment variables, task writing/reading file name,
      all test operation times
  4 - task id and offset for each transfer
  5 - each 8-byte data signature comparison (WARNING: more data to STDOUT
      than stored in file, use carefully)

********************
* 7. USING SCRIPTS *
********************

IOR can use a script with the command line.  Any options on the command line
will be considered the default settings for running the script.  (I.e.,
'IOR -W -f script' will have all tests in the script run with the -W option as
default.)  The script itself can override these settings and may be set to run
run many different tests of IOR under a single execution.  The command line is:

  IOR/bin/IOR -f script

In IOR/scripts, there are scripts of testcases for simulating I/O behavior of
various application codes.  Details are included in each script as necessary.

An example of a script:
===============> start script <===============
IOR START
  api=[POSIX|MPIIO|HDF5|NCMPI]
  testFile=testFile
  hintsFileName=hintsFile
  repetitions=8
  multiFile=0
  interTestDelay=5
  readFile=1
  writeFile=1
  filePerProc=0
  checkWrite=0
  checkRead=0
  keepFile=1
  quitOnError=0
  segmentCount=1
  blockSize=32k
  outlierThreshold=0
  setAlignment=1
  transferSize=32
  singleXferAttempt=0
  individualDataSets=0
  verbose=0
  numTasks=32
  collective=1
  preallocate=0
  useFileView=0
  keepFileWithError=0
  setTimeStampSignature=0
  useSharedFilePointer=0
  useStridedDatatype=0
  uniqueDir=0
  fsync=0
  storeFileOffset=0
  maxTimeDuration=60
  deadlineForStonewalling=0
  useExistingTestFile=0
  useO_DIRECT=0
  showHints=0
  showHelp=0
RUN
  # additional tests are optional
  <snip>
RUN
  <snip>
RUN
IOR STOP
===============> stop script <===============

NOTES: * Not all test parameters need be set.  The defaults can be viewed in
         IOR/src/C/defaults.h. 
       * White space is ignored in script, as are comments starting with '#'.

****************************************
* 8. COMPATIBILITY WITH OLDER VERSIONS *
****************************************
1)  IOR version 1 (c. 1996-2002) and IOR version 2 (c. 2003-present) are
    incompatible.  Input decks from one will not work on the other.  As version
    1 is not included in this release, this shouldn't be cause for concern.  All
    subsequent compatibility issues are for IOR version 2.

2)  IOR versions prior to release 2.8 provided data size and rates in powers
    of two.  E.g., 1 MB/sec referred to 1,048,576 bytes per second.  With the
    IOR release 2.8 and later versions, MB is now defined as 1,000,000 bytes
    and MiB is 1,048,576 bytes.

3)  In IOR versions 2.5.3 to 2.8.7, IOR could be run without any command line
    options.  This assumed that if both write and read options (-w -r) were
    omitted, the run with them both set as default.  Later, it became clear
    that in certain cases (data checking, e.g.) this caused difficulties.  In
    IOR versions 2.8.8 and later, if not one of the -w -r -W or -R options is
    set, then -w and -r are set implicitly.

*********************************
* 9. FREQUENTLY ASKED QUESTIONS *
*********************************
HOW DO I PERFORM MULTIPLE DATA CHECKS ON AN EXISTING FILE?

  Use this command line:  IOR -k -E -W -i 5 -o file

  -k keeps the file after the access rather than deleting it
  -E uses the existing file rather than truncating it first
  -W performs the writecheck
  -i number of iterations of checking
  -o filename

  On versions of IOR prior to 2.8.8, you need the -r flag also, otherwise
  you'll first overwrite the existing file.  (In earlier versions, omitting -w
  and -r implied using both.  This semantic has been subsequently altered to be
  omitting -w, -r, -W, and -R implied using both -w and -r.)

  If you're running new tests to create a file and want repeat data checking on 
  this file multiple times, there is an undocumented option for this.  It's -O 
  multiReRead=1, and you'd need to have an IOR version compiled with the 
  USE_UNDOC_OPT=1 (in iordef.h).  The command line would look like this:

  IOR -k -E -w -W -i 5 -o file -O multiReRead=1

  For the first iteration, the file would be written (w/o data checking).  Then
  for any additional iterations (four, in this example) the file would be
  reread for whatever data checking option is used.

HOW DOES IOR CALCULATE PERFORMANCE?

  IOR gets a time stamp START, then has all participating tasks open a
  shared or independent file, transfer data, close the file(s), and then get a
  STOP time.  A stat() or MPI_File_get_size() is performed on the file(s) and
  compared against the aggregate amount of data transferred.  If this value
  does not match, a warning is issued and the amount of data transferred as
  calculated from write(), e.g., return codes is used.  The calculated
  bandwidth is the amount of data transferred divided by the elapsed
  STOP-minus-START time.

  IOR also gets time stamps to report the open, transfer, and close times.
  Each of these times is based on the earliest start time for any task and the
  latest stop time for any task.  Without using barriers between these
  operations (-g), the sum of the open, transfer, and close times may not equal
  the elapsed time from the first open to the last close.

HOW DO I ACCESS MULTIPLE FILE SYSTEMS IN IOR?

  It is possible when using the filePerProc option to have tasks round-robin
  across multiple file names.  Rather than use a single file name '-o file',
  additional names '-o file1@file2@file3' may be used.  In this case, a file
  per process would have three different file names (which may be full path
  names) to access.  The '@' delimiter is arbitrary, and may be set in the
  FILENAME_DELIMITER definition in iordef.h.

  Note that this option of multiple filenames only works with the filePerProc
  -F option.  This will not work for shared files.

HOW DO I BALANCE LOAD ACROSS MULTIPLE FILE SYSTEMS?

  As for the balancing of files per file system where different file systems
  offer different performance, additional instances of the same destination
  path can generally achieve good balance.

  For example, with FS1 getting 50% better performance than FS2, set the '-o'
  flag such that there are additional instances of the FS1 directory.  In this
  case, '-o FS1/file@FS1/file@FS1/file@FS2/file@FS2/file' should adjust for
  the performance difference and balance accordingly.

HOW DO I USE STONEWALLING?

  To use stonewalling (-D), it's generally best to separate write testing from
  read testing.  Start with writing a file with '-D 0' (stonewalling disabled)
  to determine how long the file takes to be written.  If it takes 10 seconds
  for the data transfer, run again with a shorter duration, '-D 7' e.g., to
  stop before the file would be completed without stonewalling.  For reading,
  it's best to create a full file (not an incompletely written file from a
  stonewalling run) and then run with stonewalling set on this preexisting
  file.  If a write and read test are performed in the same run with
  stonewalling, it's likely that the read will encounter an error upon hitting
  the EOF.  Separating the runs can correct for this.  E.g.,

  IOR -w -k -o file -D 10  # write and keep file, stonewall after 10 seconds
  IOR -r -E -o file -D 7   # read existing file, stonewall after 7 seconds

  Also, when running multiple iterations of a read-only stonewall test, it may
  be necessary to set the -D value high enough so that each iteration is not
  reading from cache.  Otherwise, in some cases, the first iteration may show
  100 MB/s, the next 200 MB/s, the third 300 MB/s.  Each of these tests is
  actually reading the same amount from disk in the allotted time, but they
  are also reading the cached data from the previous test each time to get the
  increased performance.  Setting -D high enough so that the cache is
  overfilled will prevent this.  

HOW DO I BYPASS CACHING WHEN READING BACK A FILE I'VE JUST WRITTEN?

  One issue with testing file systems is handling cached data.  When a file is
  written, that data may be stored locally on the node writing the file.  When
  the same node attempts to read the data back from the file system either for
  performance or data integrity checking, it may be reading from its own cache
  rather from the file system.

  The reorderTasks '-C' option attempts to address this by having a different
  node read back data than wrote it.  For example, node N writes the data to
  file, node N+1 reads back the data for read performance, node N+2 reads back
  the data for write data checking, and node N+3 reads the data for read data
  checking, comparing this with the reread data from node N+4.  The objective
  is to make sure on file access that the data is not being read from cached
  data.

    Node 0: writes data
    Node 1: reads data
    Node 2: reads written data for write checking
    Node 3: reads written data for read checking
    Node 4: reads written data for read checking, comparing with Node 3

  The algorithm for skipping from N to N+1, e.g., expects consecutive task
  numbers on nodes (block assignment), not those assigned round robin (cyclic
  assignment).  For example, a test running 6 tasks on 3 nodes would expect
  tasks 0,1 on node 0; tasks 2,3 on node 1; and tasks 4,5 on node 2.  Were the
  assignment for tasks-to-node in round robin fashion, there would be tasks 0,3
  on node 0; tasks 1,4 on node 1; and tasks 2,5 on node 2.  In this case, there
  would be no expectation that a task would not be reading from data cached on
  a node.

HOW DO I USE HINTS?

  It is possible to pass hints to the I/O library or file system layers
  following this form:
    'setenv IOR_HINT__<layer>__<hint> <value>'
  For example:
    'setenv IOR_HINT__MPI__IBM_largeblock_io true'
    'setenv IOR_HINT__GPFS__important_hint true'
  or, in a file in the form:
    'IOR_HINT__<layer>__<hint>=<value>'
  Note that hints to MPI from the HDF5 or NCMPI layers are of the form:
    'setenv IOR_HINT__MPI__<hint> <value>'

HOW DO I EXPLICITLY SET THE FILE DATA SIGNATURE?

  The data signature for a transfer contains the MPI task number, transfer-
  buffer offset, and also timestamp for the start of iteration.  As IOR works
  with 8-byte long long ints, the even-numbered long longs written contain a
  32-bit MPI task number and a 32-bit timestamp.  The odd-numbered long longs
  contain a 64-bit transferbuffer offset (or file offset if the '-l'
  storeFileOffset option is used).  To set the timestamp value, use '-G' or
  setTimeStampSignature.

HOW DO I EASILY CHECK OR CHANGE A BYTE IN AN OUTPUT DATA FILE?

  There is a simple utility IOR/src/C/cbif/cbif.c that may be built.  This is a
  stand-alone, serial application called cbif (Change Byte In File).  The
  utility allows a file offset to be checked, returning the data at that
  location in IOR's data check format.  It also allows a byte at that location
  to be changed.

HOW DO I CORRECT FOR CLOCK SKEW BETWEEN NODES IN A CLUSTER?

  To correct for clock skew between nodes, IOR compares times between nodes,
  then broadcasts the root node's timestamp so all nodes can adjust by the
  difference.  To see an egregious outlier, use the '-j' option.  Be sure
  to set this value high enough to only show a node outside a certain time
  from the mean.

WHAT HAPPENED TO THE GUI?

  In versions of IOR earlier than 2.9.x, there was a GUI available.  Over time
  it became clear that it wasn't finding enough use to warrant maintenance.  It
  was retired in IOR-2.10.x.

**************************
* 10. OUTPUT DESCRIPTION *
**************************
(FIXME -- this section needs updating and some rewrite.)

Output Statistics:

The quantity "aggregate operations/sec" was added to the existing "aggregate data rate" test log print file.
An "operation" is defined to be a write or read within an open/close [open/write|read/close]. 
Multiple writes or reads within an open/close are also counted as multiple operations. 
Also various other test relevant quantities are printed on a single "grepable" line using the pattern EXCEL. 
This way, output from large parameter space runs can easily be imported to excel for analysis. Below is an example.

grep EXCEL IOR.o406550

Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev   Mean(s)
---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------   -------
read          309.30      17.20      164.67     73.80     309.30      17.20      164.67     73.80   0.06581
(line-continued)
#Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz   xsize   aggsize
   8     2 100     1     0     1     0         0     1     1048576 1048576 8388608 5 EXCEL

Where:
Max (MiB) - Maximum aggregate data rate of 100 iterations (reps)
Min (MiB) - Minimum aggregate data rate of 100 iterations (reps)
Mean(MiB) - Mean    aggregate data rate of 100 iterations (reps) 
Std Dev   - Standard deviation aggregate data rate of 100 iterations (reps)
Max (OPs) - Maximum aggregate operations per second of 100 iterations (reps)
Min (OPs) - Minimum aggregate operations per second of 100 iterations (reps)
Mean (OPs)- Mean    aggregate operations per second of 100 iterations (reps)
Std Dev   - Standard deviation aggregate operations per second of 100 iteration (reps)
Mean(s)   - Mean time per iteration (seconds)
#Tasks    - number of I/O processes
tPN       - number of I/O processes per node (per shared memory environment)
reps      - number of times (iterations) each test is run.
            The max,min,mean,ave,sdev above are calculated over "reps" tests
fPP       - files per process
reord     - constant node offset flag for reads
reordoff  -          node offset      for reads
reordrand - random   node offset flag for reads
seed      - random seed for node random node offset
segcnt    - number of segments per file
blksiz    - total MBytes written/read per process
xsize     - total MBytes written/read per process per operation
aggsize   - total MBytes written/read by all processes per operation
5         - testnum
EXCEL     - grep pattern for this summary print

More detailed information can be obtained with AT LEAST "-v -v" [verbose=0] level and grepping "XXCEL". 
This includes a file "contention" histogram showing the number of files accessed 0,1,2,3,... times for a specified random pattern.