Compiled in 2003 — Based on many FAQs and Usenet group contributions.
"SUPERCOMPUTER: what it sounded like before you bought it."
Linux-based clusters are very much in fashion these days. The ability to use cheap off-the-shelf hardware together with open source software makes Linux an ideal platform for supercomputing. Linux clusters come in three flavors: "Beowulf" clusters, "MOSIX" clusters and "High-Availability" clusters. The cluster type is chosen depending on the application requirements, which usually fall into one of the following categories: computationally intensive (Beowulf, Mosix), I/O intensive (supercomputers) or high availability (failover, servers...). In addition to clusters, there are two related architectures: distributed systems (seti@home, folding@home, Condor...), which can run on computers with totally different architectures spread out over the Internet; and multiprocessor machines (those, like the SGIs, are much superior to clusters when memory needs to be shared quickly between processes).
New 2006: Although I don't work on clusters anymore, I try to follow what's going on. There are several recent developments such as Starfish and Amazon EC2 (Elastic Compute Cloud). The first is implemented in Ruby and takes only a few lines of code to get up and running. The latter runs a Linux virtual image remotely on as many servers as you need. There are many grid services available, but their economics are sometimes hard to fathom; Sun's service, for instance, has been a failure because it requires far too much work on the programmer's part.
High Availability (HA) clusters are highly fault tolerant server systems where 100% up-time is required. In the event of a node failure, the other nodes in the cluster take over the functionality of the failed node transparently. HA clusters are typically used for DNS, proxy and web servers.
Clusters are used where tremendous raw processing power is required, as in simulations and rendering. They use parallel computing to reduce the time required for a CPU intensive job: the workload is distributed among the nodes that make up the cluster and the instructions are executed in parallel, so more nodes mean faster execution.
The difference between the two main cluster types, Mosix and Beowulf, is that OpenMosix is a kernel implementation of process migration, whereas Beowulf is a programming model for parallel computation. Beowulf clusters need a distributed application programming environment such as PVM (Parallel Virtual Machine) or MPI (Message Passing Interface). PVM has long been the standard interface for parallel computing, but MPI is becoming the industry standard. PVM still has an edge over MPI in that there are more PVM-aware applications than MPI-based ones, but this could soon change as MPI gains popularity.
MOSIX is a software package that was specifically designed to enhance the Linux kernel with cluster computing capabilities. The core of MOSIX is a set of adaptive (on-line) load-balancing, memory ushering and file I/O optimization algorithms that respond to variations in the use of cluster resources, e.g., uneven load distribution or excessive disk swapping due to lack of free memory in one of the nodes. In such cases, MOSIX migrates processes from one node to another to balance the load, to move a process to a node with sufficient free memory, or to reduce the number of remote file I/O operations. MOSIX clusters are typically used in data centers and data warehouses.
Mosix works best when running plenty of separate CPU intensive tasks. Shared memory is its big drawback, as with Beowulf: for applications using shared memory, such as web servers or database servers, there will not be any benefit from [Open]Mosix, because all processes accessing that shared memory must reside on the same node.
MPI and OpenMosix play very nicely together: the processes that MPI spawns on remote nodes are migrated by the OpenMosix load-balancing algorithm like any other process. This helps in a dynamic environment where several MPI programs, unaware of each other, might otherwise overload an individual node.
Here's an Excel spreadsheet of the hardware list and total cost involved in our cluster at CSU: about $24,000 (including shipping) for a 12-node, 24-processor cluster. Some of the key characteristics: AMD Athlon MP 2400+ CPUs with better heatsinks, Tyan Thunder K7X Pro motherboards (without SCSI), an internal 2Tb RAID-5 IDE/ATA array, Gb LAN, 24Gb of RAM, Linux RedHat 8.0 with OpenMosix...
A few quick tips:
- To duplicate a system disk, remount the source read-only (mount -o remount,ro /), then copy it with dd if=/dev/sda of=/dev/sdb. Mount the new disk to update a few files (like /etc/sysconfig/network-scripts/ifcfg-eth0).
- Put your data under /mfs/N/dir and just launch your code; it will move to that node in order to minimize I/O.
- dsh -r ssh "uname -a" runs a command on all the nodes at once.
Except for the first one, most of those books are either outdated, very vague, too specific (only source code) or with bad reviews. You've been warned.
There is a bug in mfs that causes some recursive commands to fail painfully. For instance the following commands will cause the node to hang: find /, du -ks /... In other words, do not launch recursive commands from / (the root directory) or /mfs. A workaround is to use find / \! -fstype mfs or locate instead. Note: that problem may be solved now (by unmounting a shared memory module), but I don't want to risk it...
Use rpm -Uvh --oldpackage fileutils-4.1-10.1.i386.rpm to avoid a bug in cp and mv returning "cp: skipping file `foo', as it was replaced while being copied".
Neumann is an x86 Linux cluster running OpenMosix. There are twelve nodes in the cluster, each node having two AMD Athlon MP 2400+ CPUs, 2Gb of RAM and 60Gb of disk space. The main node, on which you login, also has a 2Tb RAID array. A 1Gbps ethernet switch interconnects the nodes, and only the master node is connected to the outside via a 100Mbps LAN. Each CPU has 3 floating point units able to work simultaneously.
Each node runs Red Hat Linux 8.0, fully patched, with a customized 2.4.20 kernel. OpenMosix is a kernel modification that makes a group of computers act like a multiprocessor (SMP) machine, so the simplest way to use Neumann is to just ignore it: your CPU intensive jobs will run on whatever node has spare CPU.
In addition to the current page, you might want to take a look at the following information:
The simplest way to use OpenMosix is to just ignore it. Just compile and run your CPU intensive jobs on the master node, and they will be dispatched automagically to whatever node has spare CPU cycles. And when a node becomes overloaded, processes will migrate to a more idle node.
One very useful part of OpenMosix is the MFS (Mosix FileSystem). You can access any drive on any node from any node using a path like /mfs/3/usr/src/. Just prepend /mfs/NodeNum to the normal path. Note that you don't need to do this for your home directory as it's already linked to /mfs/1/home/username. Note that when doing I/O from a local disk (like /), the processes need to migrate back so they can read the proper drive (/ on node 1, not / on node N); if on the other hand you use mfs, processes can read without migrating. So Command </mfs/1/dir/InputFile >/mfs/1/dir/OutputFile
is much preferable to Command <InputFile >OutputFile
Remember this syntax when hard-coding pathnames in your programs, it is the simplest and best way to optimize your programs for OpenMosix.
In addition to /mfs/NodeNum/, there are virtual directories accessible on a 'per process' base such as /mfs/home or /mfs/here for quick access. Read the HowTo for more.
If you want more in-depth control, here are a few tools, both X-windows and command line:

X-windows | Command line |
---|---|
openmosixview | mosctl |
openmosixmigmon | migrate, mosrun |
If you write shell scripts, you can use the power of OpenMosix with a simple '&' to fork jobs to other nodes (see the sketch below). In C, use the fork() function; there is a code sample in chapter 12 of the OpenMosix HowTo. Also see the HowTo for more information on Python, Perl, Blast, Pvm, MPICH... Note that Java doesn't migrate.
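For example, a minimal batch script along these lines lets OpenMosix spread the work over the nodes; the program name and data paths are placeholders, not anything installed on the cluster:

```sh
#!/bin/sh
# Fork one background process per input file; OpenMosix migrates the
# CPU-bound children to idle nodes on its own.  ./mysim and the data
# file names are placeholders for your own program and files.
for f in /mfs/1/home/me/data/input*.dat
do
    ./mysim <"$f" >"${f%.dat}.out" &
done
wait   # return only when every forked job has finished
```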
OpenMosix is designed to let you run multiple jobs simultaneously in a simple way. You cannot run a single job (even multithreaded) on more than one processor. If you need to do that, you have to use the MPI (Message Passing Interface) library. MPICH is the MPI implementation that is compatible with OpenMosix; version 1.2.5 is installed in /usr/local/mpich for use with the PGI compilers (not the GNU compilers). There is also an MPI library shipped with RedHat Linux, but there is no guarantee about its OpenMosix compatibility, so make sure you are using the proper one. Also read this short page about MPICH on OpenMosix (it tells you to use a MOSRUN_ARGS environment variable), and a more recent discussion about Looking for Portable MPI I/O Implementation.
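Assuming the usual MPICH layout under /usr/local/mpich (bin/mpicc, bin/mpirun; adjust if the local install differs), a session could look like this sketch:

```sh
# Compile with the MPICH wrapper (set up here to call the PGI compilers).
/usr/local/mpich/bin/mpicc -O2 -o hello hello.c

# Launch 8 MPI processes; OpenMosix can then balance them across the nodes.
/usr/local/mpich/bin/mpirun -np 8 ./hello
```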
If you need a checkpoint and restart facility for extremely time consuming problems (the cluster can reboot and the computation continues), you can ask for Condor to be installed on top of OpenMosix.
You normally do not need to log onto the slave nodes. But it is possible. You just need to type ssh neumann2
(3..12). Note that you can use the usual ssh-keygen -t dsa
, cat ~/.ssh/*.pub >>~/.ssh/authorized_keys
, ssh-agent $SHELL
and ssh-add
to define a (possibly empty) passphrase so that you won't have to enter your password each time. Note that you can access neumann# only from neumann, '#' being the node number (2 to 12). See below.
DSH allows you to run a command on all the nodes of the cluster simultaneously. For instance dsh df runs df on every node, in order (option -c executes concurrently). To set it up so that ssh (and therefore dsh) works without a password:
- Run ssh-keygen -t dsa from a shell. It creates a secret key and a public key. Use an empty passphrase.
- Run cat ~/.ssh/*.pub >>~/.ssh/authorized_keys. Here we copy to the same machine because the home directory is shared by every node (except for root, where you need to do cat ~/.ssh/*.pub >>/mfs/Node#/root/.ssh/authorized_keys for every node).
- Add ssh-agent $SHELL to your login file (.profile or .bash_profile).
- Run ssh-add each time after you log in and type your passphrase.
- You can now ssh neumann# to log onto any node (or use dsh) without a password. The first time, you have to answer yes to add the node to the list of known hosts.

Several kernels are available to boot from (listed in the GRUB menu), depending on the use and the issue to troubleshoot.
To rebuild a kernel, run make xconfig dep clean bzImage modules modules_install install. The last phase (install) cannot work if you try to recompile the kernel currently in use; reboot into another kernel first.
Important note: if you recompile a kernel, you need to edit /boot/grub/grub.conf after running make install
and change root=LABEL=/
to root=/dev/hda3
(master) or hda7
(slaves). If you get a kernel panic on boot, you probably forgot to do that.
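For reference, after that edit the kernel stanza in grub.conf should look roughly like the snippet below; the title and kernel file name here are illustrative, only the root= parameter is the point:

```
title Custom 2.4.20-openmosix
        root (hd0,0)
        # root=/dev/hda3 on the master, root=/dev/hda7 on the slaves
        kernel /vmlinuz-2.4.20-openmosix ro root=/dev/hda3
        initrd /initrd-2.4.20-openmosix.img
```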
There is a complete disk image of the neumann2 slave in the file Slave.img.gz. It was created by putting the slave drive in the master node and doing dd if=/dev/hdd | gzip >Slave.img.gz
. You can create a new slave by putting a blank hard drive in the master (or any node) and doing zcat Slave.img.gz >/dev/hdd
. It takes about 12 hours and OpenMosix should be disabled while you do that (/etc/init.d/openmosix stop
). After completion, you need to change the node hostname and IP with /usr/bin/neat
and also various settings that may have changed since the image file was created (last kernel...). Note: if you connect the drive as primary slave it is /dev/hdb, secondary master is /dev/hdc and secondary slave is /dev/hdd.
The user IDs must be the same on all the nodes. In other words /etc/passwd
, /etc/shadow
and /etc/group
must be identical on all nodes. You can edit and run /root/AddUser.sh
to add a user on all nodes.
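The script itself is not reproduced here; a minimal sketch of what an AddUser-style script has to do, assuming it is run as root on the master and uses the MFS paths described above, could look like this:

```sh
#!/bin/sh
# Hypothetical sketch only -- the real /root/AddUser.sh may differ.
# Run as root on the master node (node 1).
USER="$1"
[ -z "$USER" ] && { echo "usage: $0 username"; exit 1; }

# Create the account on the master; home directories live on node 1
# and are reached from the slaves through /mfs/1/home.
useradd -m "$USER"
passwd "$USER"

# Copy the account files to every slave so the user IDs stay identical
# on all nodes (each slave's root filesystem is visible under /mfs/N).
for N in $(seq 2 12); do
    for F in /etc/passwd /etc/shadow /etc/group; do
        cp "$F" "/mfs/$N$F"
    done
done
```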
Use /etc/init.d/openmosix {start|stop|status|restart}
to control openmosix on the current node.
/etc/mosix.map
contains the list of nodes and must be identical on all nodes. You want to keep it consistent with /etc/hosts
. Autodiscovery is not used.
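The map format is one line per range of nodes: the first node number, that node's IP address, and how many nodes follow at consecutive addresses. A minimal illustration (the addresses below are made up, not Neumann's actual LAN numbering):

```
# /etc/mosix.map
# <node number>  <IP address>  <number of nodes at consecutive addresses>
1  192.168.1.1  12
```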
If a slave node crashes, the jobs that had migrated to it are lost (and the cluster may hang for a while while it sorts things out). If the master node crashes, all jobs are lost (except those started somewhere else). This is because a job that has migrated away keeps some references/pointers/resources on its home node.
To keep the master node from running too many jobs, either migrate them manually with migrate or openmosixmigmon, or set the master's speed to some low value with mosctl setspeed 7000.
Partitions are slightly different on the master than on the slaves.
Mount | Type | Size | Master | Slaves |
---|---|---|---|---|
/boot | ext3 | 100Mb | /dev/hda1 | /dev/hda1 |
swap | swap | 4Gb | /dev/hda2 | /dev/hda3 |
/ | ext3 | 1Gb | /dev/hda3 | /dev/hda7 |
/usr | ext3 | 4Gb | /dev/hda5 | /dev/hda2 |
/var | ext3 | 1Gb | /dev/hda6 | /dev/hda6 |
/tmp | ext3 | 1Gb | /dev/hda7 | /dev/hda5 |
/home | ext3 | 47Gb | /dev/hda8 | |
/spare | ext3 | 47Gb | | /dev/hda8 |
/raid | ext3 | 1.7Tb | /dev/sda1 | |
Note that there's no specific use for the 11 /spare partitions (totaling about 500Gb) present on the slaves. Here is my suggestion: if you have I/O intensive jobs, put your data on one of those drives (say /mfs/7/spare/) and then run your job in a similar way: mosrun -j7 -F -i "job </mfs/7/spare/LargeDataFile"
so that the job stays close to its data and doesn't use the network. Remember to delete your data afterwards (or back it up on /raid).
The RAID is based on a 3ware hardware controller model Escalade 7500-8 with 8 250Gb Maxtor hard drives. The RAID is configured as a RAID-5 where data parity is spread over all the drives. One drive may fail without causing any data loss. After a drive failure, the array must either be rebuilt on the remaining drives, or a blank drive may be inserted to replace the failed one prior to rebuilding. In case of drive failure, an email is sent to the admin.
Configuration is done either at boot time in the 3ware utility (for basic configuration), with the cli command-line tool, or preferably with 3dm through a web browser. From neumann itself, type mozilla http://localhost:1080/ to manage the array.
Some NFS mounts are available under /mnt/, for instance personal network files are mounted under /mnt/udrive/yourusername. Note that this is the Engineering network username, not the Neumann username, which may be different.
The AMD Athlon MP processors run native x86 code, but for full performance you should use the special athlon compiling options (SSE and 3DNow!) whenever your compiler supports them. One of the main differences between Intel and AMD is that the Athlon has 3 floating point units (FPUs), leading to a theoretical performance of 2 double precision multiplications and 2 single precision additions per clock cycle. The actual performance is 3.4 FLOPs per clock, leading to a peak of about 160 Gflops for the entire cluster (24 CPUs x 2.0 GHz x 3.4 FLOPs per clock).
The GNU compilers (g77, gcc, gcj) are the Linux defaults. Use g77 --target-help or gcc --target-help to get the relevant compiling options for optimizing your programs (add them to your makefile).
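As an illustration, a minimal Makefile fragment using the flags from the table below; the variable values are the only part specific to this cluster, everything else is a placeholder:

```make
# Compiler settings for the Athlon MP nodes (flags from the table below).
CC     = gcc
CFLAGS = -O3 -march=athlon-mp
F77    = g77
FFLAGS = -O3 -march=athlon-mp
```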
For the PGI compilers (pgcc, pgCC, pgdbg, pgf77, pgf90, pghpf...) you need the proper variables set ($PGI, $MANPATH...) in your ~/.profile or ~/.bash_profile before you can use them or read their man pages. Note that PGI Workstation is limited to 4 processors per thread and that you can compile only from node 1.
CC | CFLAGS | Description |
---|---|---|
gcc | -O3 -march=athlon-mp | GNU compiler 3.2 |
pgcc | -O2 -tp athlonxp -fast | Portland Group C compiler |
pgCC | -O2 -tp athlonxp -fast | Portland Group C++ compiler |
bcc | -3 | Bruce's C compiler |

FC | FFLAGS | Description |
---|---|---|
g77 | -O3 -march=athlon-mp | G77 3.2 |
pgf77 | -O2 -tp athlonxp -fast | Portland Group Fortran 77 compiler |
pgf90 | -O2 -tp athlonxp -fast | Portland Group Fortran 90 compiler |
pghpf | -O2 -tp athlonxp -fast | Portland Group High Performance Fortran compiler |

Other languages | Flags | Description |
---|---|---|
gcj | ? | Ahead-of-time GNU compiler for the Java language |
perl | ? | The Perl interpreter |
python | ? | An interpreted, interactive, object-oriented programming language |
pgdbg | ? | The PGI X-windows debugger |
Active services | Disabled services |
---|---|
anacron apmd atd autofs crond gpm httpd isdn keytable kudzu lpd netfs network nfslock ntpd openmosix pcmcia portmap random rawdevices rhnsd sendmail sshd syslog xfs xinetd | aep1000 bcm5820 firstboot iptables irda named nfs nscd saslauthd smb snmpd snmptrapd tux winbind |
xinetd based: mosstatd sgi_fam | xinetd based: chargen chargen-udp daytime daytime-udp echo echo-udp rsync servers services time time-udp |
Type /mnt/fluent/bin/fluent
to start the fluent application (remotely mounted from gohan). Also installed is xfig. I'm having trouble compiling Grace, a successor to xmgr.
If you need to work with MS Office documents, you can start OpenOffice with /usr/lib/openoffice/program/soffice
or /usr/lib/openoffice/program/scalc
or more in that directory...
The PGI Workstation suite of compilers are installed in /usr/pgi (see above).
Here's the short list below. For a complete list with version numbers and comments, see the complete RPM list with rpm -qa --info
at the command line.
4Suite gal libf2c netconfig setuptool a2ps gawk libgal19 netpbm sgml acl gcc libgcc newt sh alchemist gconf libgcj nfs shadow anacron GConf libghttp nmap slang apmd GConf2 libglade nscd slocate ash gd libglade2 nss_ldap slrn aspell gdb libgnat ntp sox at gdbm libgnome ntsysv specspo atk gdk libgnomecanvas oaf splint attr gdm libgnomeprint octave star audiofile gedit libgnomeprint15 Omni statserial authconfig gettext libgnomeprintui omtest strace autoconf gftp libgnomeui openjade stunnel autofs ggv libgtop2 openldap sudo automake ghostscript libIDL openmosix swig automake14 gimp libjpeg openmosixview switchdesk automake15 glib libmng openmotif sysklogd basesystem glib2 libmrproject openoffice syslinux bash glibc libogg openssh SysVinit bc Glide3 libole2 openssl talk bdflush glut libpcap orbit tar bind gmp libpng ORBit tcl binutils gnome libpng10 ORBit2 tcpdump bison gnumeric librpm404 pam tcp_wrappers bitmap gnupg librsvg2 pam_krb5 tcsh blas gnuplot libstdc++ pam_smb telnet bonobo gpg libtermcap pango termcap byacc gphoto2 libtiff parted tetex bzip2 gpm libtool passivetex texinfo cdecl gqview libungif passwd textutils chkconfig grep libunicode patch time chkfontpath groff libusb patchutils timeconfig compat grub libuser pax tk comps gtk libvorbis pciutils tmpwatch control gtk+ libwnck pcre traceroute cpio gtk2 libwvstreams perl transfig cpp gtkam libxml php ttfprint cracklib gtkhtml libxml2 pilot tux crontabs gtkhtml2 libxslt pine umb cups guile lilo pinfo units curl gzip linc pkgconfig unix2dos cvs hdparm linuxdoc pmake unzip cyrus hesiod lockdev pnm2ppa up2date db4 hotplug logrotate popt urw desktop hpijs logwatch portmap usbutils dev htmlview lokkit postfix usermode dev86 httpd losetup ppp utempter dhclient hwbrowser LPRng procmail util dia hwcrypto lrzsz procps VFlib2 dialog hwdata lsof psmisc vim diffstat ImageMagick ltrace pspell vixie diffutils imlib lvm psutils vte docbook indent lynx pvm w3c dos2unix indexhtml m4 pygtk2 webalizer dosfstools info magicdev pyOpenSSL wget doxygen initscripts mailcap python which dump intltool mailx pyxf86config whois e2fsprogs iproute make PyXML WindowMaker ed iptables MAKEDEV qt wireless eel2 iputils man quota words eject irda maximum raidtools wvdial emacs isdn4k mc rcs Xaw3d eog jadetex memprof rdate xdelta esound jfsutils metacity rdist xfig ethereal kbd mingetty readline XFree86 ethtool kbdconfig minicom redhat Xft evolution kernel mkbootdisk reiserfs xinetd expat krb5 mkinitrd rhl xinitrc expect krbafs mktemp rhn xisdnload fam ksymoops mod_perl rhnlib xml fbset kudzu mod_python rhpl xmltex fetchmail lam mod_ssl rmt xmlto file lapack modutils rootfiles xpdf filesystem less mount rp xsane fileutils lesstif mouseconfig rpm xscreensaver findutils lftp mozilla rpm404 xsri finger lha mpage rsh Xtest firstboot libacl mrproject rsync yelp flex libaio mt samba yp fontconfig libart_lgpl mtools sane ypbind foomatic libattr mtr screen zip fortune libbonobo mutt scrollkeeper zlib freetype libbonoboui nautilus sed ftp libcap ncurses sendmail gail libcapplet0 nedit setserial gaim libelf net setup