Compiled in 2003 — Based on many FAQs and Usenet group contributions.
"SUPERCOMPUTER: what it sounded like before you bought it."
Linux-based clusters are very much in fashion these days. The ability to use cheap off-the-shelf hardware together with open source software makes Linux an ideal platform for supercomputing. Linux clusters come in three flavors: "Beowulf" clusters, "MOSIX" clusters and "High-Availability" clusters. The cluster type is chosen depending on the application requirements, which usually fall into one of the following categories: computationally intensive (Beowulf, Mosix), I/O intensive (supercomputers) or high availability (failover, servers...). In addition to clusters, there are two related architectures: distributed systems (seti@home, folding@home, Condor...), which can run on computers with totally different architectures spread out over the Internet; and multiprocessor machines (those, like the SGIs, are much superior to clusters when memory needs to be shared quickly between processes).
New 2006: Although I don't work on clusters anymore, I try to follow what's going on. There are several recent developments such as Starfish and Amazon EC2 (Elastic Compute Cloud). The first is implemented in Ruby and takes only a few lines of code to get up and running. The latter runs a Linux virtual image remotely on as many servers as you need. There are many grid services available, but their economics are sometimes hard to fathom; Sun's service, for instance, has been a failure because it requires far too much work on the programmer's part.
High Availability (HA) clusters are highly fault tolerant server systems where 100% up-time is required. In the event of a node failure, the other nodes in the cluster take over the functionality of the failed node transparently. HA clusters are typically used for DNS, proxy and web servers.
Clusters are used where tremendous raw processing power is required, as in simulations and rendering. They use parallel computing to reduce the time required for a CPU intensive job: the workload is distributed among the nodes that make up the cluster and the instructions are executed in parallel, so more nodes mean faster execution.
The difference between the two main cluster types, Mosix and Beowulf, is that OpenMosix is a kernel implementation of process migration, whereas Beowulf is a programming model for parallel computation. Beowulf clusters need a distributed application programming environment such as PVM (Parallel Virtual Machine) or MPI (Message Passing Interface). PVM has long been the standard interface for parallel computing, but MPI is becoming the industry standard. PVM still has an edge over MPI in that there are more PVM-aware applications than MPI-based ones, but this could soon change as MPI gains popularity.
MOSIX is a software package that was specifically designed to enhance the Linux kernel with cluster computing capabilities. The core of MOSIX is a set of adaptive (on-line) load-balancing, memory ushering and file I/O optimization algorithms that respond to variations in the use of cluster resources, e.g., uneven load distribution or excessive disk swapping due to lack of free memory in one of the nodes. In such cases, MOSIX migrates processes from one node to another to balance the load, to move a process to a node with sufficient free memory, or to reduce the number of remote file I/O operations. MOSIX clusters are typically used in data centers and data warehouses.
Mosix works best when running plenty of separate CPU intensive tasks. Shared memory is its big drawback, as with Beowulf: for applications using shared memory, such as web servers or database servers, there will not be any benefit from [Open]Mosix, because all processes accessing that shared memory must reside on the same node.
MPI and OpenMosix play very nicely together: the processes that MPI spawns on remote nodes are migrated by the OpenMosix load-balancing algorithm like any other process. This helps in a dynamic environment where several MPI programs, unaware of each other, might otherwise overload an individual node.
Here's an Excel spreadsheet of the hardware list and total cost involved in our cluster at CSU: about $24,000 (including shipping) for a 12-node, 24-processor cluster. Some of the key characteristics: AMD Athlon MP 2400+ CPUs with better heatsinks, Tyan Thunder K7X Pro motherboards (without SCSI), an internal 2Tb RAID-5 IDE/ATA array, Gb LAN, 24Gb of RAM, Linux RedHat 8.0 with OpenMosix...
A few quick tips:
- To duplicate a system disk, remount the source read-only (mount -o remount,ro /), then copy it with dd if=/dev/sda of=/dev/sdb. Mount the new disk to update a few files (like /etc/sysconfig/network-scripts/ifcfg-eth0).
- Put your data under /mfs/N/dir and just launch your code; it will move to that node in order to minimize I/O.
- dsh -r ssh "uname -a" runs a command on all the nodes at once.
Except for the first one, most of those books are either outdated, very vague, too specific (only source code) or with bad reviews. You've been warned.
There is a bug in mfs that causes some recursive commands to fail painfully. For instance the following commands will cause the node to hang: find /, du -ks /... In other words, do not launch recursive commands from / (the root directory) or /mfs. A workaround is to use find / \! -fstype mfs or locate instead. Note: that problem may be solved now (by unmounting a shared memory module), but I don't want to risk it...
Use rpm -Uvh --oldpackage fileutils-4.1-10.1.i386.rpm to avoid a bug in cp and mv returning "cp: skipping file `foo', as it was replaced while being copied".
Neumann is an x86 Linux cluster running OpenMosix. There are twelve nodes in the cluster, each node having two AMD Athlon MP 2400+ CPUs, 2Gb of RAM and 60Gb of disk space. The main node, on which you login, also has a 2Tb RAID array. A 1Gbps ethernet switch interconnects the nodes, and only the master node is connected to the outside via a 100Mbps LAN. Each CPU has 3 floating point units able to work simultaneously.
Each node runs Red Hat Linux 8.0, fully patched, with a customized 2.4.20 kernel. OpenMosix is a kernel modification that makes a group of computers act like a multiprocessor (SMP) machine, so the simplest way to use Neumann is to just ignore it: your CPU intensive jobs will run on whatever node has spare CPU.
In addition to the current page, you might want to take a look at the following information:
The simplest way to use OpenMosix is to just ignore it. Just compile and run your CPU intensive jobs on the master node, and they will be dispatched automagically to whatever node has spare CPU cycles. And when a node becomes overloaded, processes will migrate to a more idle node.
One very useful part of OpenMosix is the MFS (Mosix FileSystem). You can access any drive on any node from any node using a path like /mfs/3/usr/src/. Just prepend /mfs/NodeNum to the normal path. Note that you don't need to do this for your home directory as it's already linked to /mfs/1/home/username. Note that when doing I/O from a local disk (like /), the processes need to migrate back so they can read the proper drive (/ on node 1, not / on node N); if on the other hand you use mfs, processes can read without migrating. So Command </mfs/1/dir/InputFile >/mfs/1/dir/OutputFile
is much preferable to Command <InputFile >OutputFile
Remember this syntax when hard-coding pathnames in your programs, it is the simplest and best way to optimize your programs for OpenMosix.
In addition to /mfs/NodeNum/, there are virtual directories accessible on a 'per process' base such as /mfs/home or /mfs/here for quick access. Read the HowTo for more.
If you want more in-depth control, here are a few tools, both X-windows and command line:

X-windows | Command line |
---|---|
openmosixview | mosctl |
openmosixmigmon | migrate, mosrun |
If you write shell scripts, you can use the power of OpenMosix with a simple '&' to fork jobs to other nodes (see the sketch below). In C, use the fork() function; there is a code sample in chapter 12 of the OpenMosix HowTo. Also see the HowTo for more information on Python, Perl, Blast, Pvm, MPICH... Note that Java doesn't migrate.
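For example, a minimal batch script along these lines lets OpenMosix spread the work over the nodes; the program name and data paths are placeholders, not anything installed on the cluster:

```sh
#!/bin/sh
# Fork one background process per input file; OpenMosix migrates the
# CPU-bound children to idle nodes on its own.  ./mysim and the data
# file names are placeholders for your own program and files.
for f in /mfs/1/home/me/data/input*.dat
do
    ./mysim <"$f" >"${f%.dat}.out" &
done
wait   # return only when every forked job has finished
```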
OpenMosix is designed to let you run multiple jobs simultaneously in a simple way. You cannot run a single job (even multithreaded) on more than one processor. If you need to do that, you have to use the MPI (Message Passing Interface) library. MPICH is the MPI implementation that is compatible with OpenMosix; version 1.2.5 is installed in /usr/local/mpich for use with the PGI compilers (not the GNU compilers). There is also an MPI library shipped with RedHat Linux, but there is no guarantee about its OpenMosix compatibility, so make sure you are using the proper one. Also read this short page about MPICH on OpenMosix (it tells you to use a MOSRUN_ARGS environment variable), and a more recent discussion about Looking for Portable MPI I/O Implementation.
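Assuming the usual MPICH layout under /usr/local/mpich (bin/mpicc, bin/mpirun; adjust if the local install differs), a session could look like this sketch:

```sh
# Compile with the MPICH wrapper (set up here to call the PGI compilers).
/usr/local/mpich/bin/mpicc -O2 -o hello hello.c

# Launch 8 MPI processes; OpenMosix can then balance them across the nodes.
/usr/local/mpich/bin/mpirun -np 8 ./hello
```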
If you need a checkpoint and restart facility for extremely time consuming problems (the cluster can reboot and the computation continues), you can ask for Condor to be installed on top of OpenMosix.
You normally do not need to log onto the slave nodes. But it is possible. You just need to type ssh neumann2
(3..12). Note that you can use the usual ssh-keygen -t dsa
, cat ~/.ssh/*.pub >>~/.ssh/authorized_keys
, ssh-agent $SHELL
and ssh-add
to define a (possibly empty) passphrase so that you won't have to enter your password each time. Note that you can access neumann# only from neumann, '#' being the node number (2 to 12). See below.
DSH allows you to run a command on all the nodes of the cluster simultaneously. For instance dsh df runs df on every node, in order (option -c executes concurrently). To set it up so that ssh (and therefore dsh) works without a password:
- Run ssh-keygen -t dsa from a shell. It creates a secret key and a public key. Use an empty passphrase.
- Run cat ~/.ssh/*.pub >>~/.ssh/authorized_keys. Here we copy to the same machine because the home directory is shared by every node (except for root, where you need to do cat ~/.ssh/*.pub >>/mfs/Node#/root/.ssh/authorized_keys for every node).
- Add ssh-agent $SHELL to your login file (.profile or .bash_profile).
- Run ssh-add each time after you log in and type your passphrase.
- You can now ssh neumann# to log onto any node (or use dsh) without a password. The first time, you have to answer yes to add the node to the list of known hosts.

Several kernels are available to boot from (listed in the GRUB menu), depending on the use and the issue to troubleshoot.
To rebuild a kernel, run make xconfig dep clean bzImage modules modules_install install. The last phase (install) cannot work if you try to recompile the kernel currently in use; reboot into another kernel first.
Important note: if you recompile a kernel, you need to edit /boot/grub/grub.conf after running make install
and change root=LABEL=/
to root=/dev/hda3
(master) or hda7
(slaves). If you get a kernel panic on boot, you probably forgot to do that.
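For reference, after that edit the kernel stanza in grub.conf should look roughly like the snippet below; the title and kernel file name here are illustrative, only the root= parameter is the point:

```
title Custom 2.4.20-openmosix
        root (hd0,0)
        # root=/dev/hda3 on the master, root=/dev/hda7 on the slaves
        kernel /vmlinuz-2.4.20-openmosix ro root=/dev/hda3
        initrd /initrd-2.4.20-openmosix.img
```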
There is a complete disk image of the neumann2 slave in the file Slave.img.gz. It was created by putting the slave drive in the master node and doing dd if=/dev/hdd | gzip >Slave.img.gz
. You can create a new slave by putting a blank hard drive in the master (or any node) and doing zcat Slave.img.gz >/dev/hdd
. It takes about 12 hours and OpenMosix should be disabled while you do that (/etc/init.d/openmosix stop
). After completion, you need to change the node hostname and IP with /usr/bin/neat
and also various settings that may have changed since the image file was created (last kernel...). Note: if you connect the drive as primary slave it is /dev/hdb, secondary master is /dev/hdc and secondary slave is /dev/hdd.
The user IDs must be the same on all the nodes. In other words /etc/passwd
, /etc/shadow
and /etc/group
must be identical on all nodes. You can edit and run /root/AddUser.sh
to add a user on all nodes.
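The script itself is not reproduced here; a minimal sketch of what an AddUser-style script has to do, assuming it is run as root on the master and uses the MFS paths described above, could look like this:

```sh
#!/bin/sh
# Hypothetical sketch only -- the real /root/AddUser.sh may differ.
# Run as root on the master node (node 1).
USER="$1"
[ -z "$USER" ] && { echo "usage: $0 username"; exit 1; }

# Create the account on the master; home directories live on node 1
# and are reached from the slaves through /mfs/1/home.
useradd -m "$USER"
passwd "$USER"

# Copy the account files to every slave so the user IDs stay identical
# on all nodes (each slave's root filesystem is visible under /mfs/N).
for N in $(seq 2 12); do
    for F in /etc/passwd /etc/shadow /etc/group; do
        cp "$F" "/mfs/$N$F"
    done
done
```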
Use /etc/init.d/openmosix {start|stop|status|restart}
to control openmosix on the current node.
/etc/mosix.map
contains the list of nodes and must be identical on all nodes. You want to keep it consistent with /etc/hosts
. Autodiscovery is not used.
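The map format is one line per range of nodes: the first node number, that node's IP address, and how many nodes follow at consecutive addresses. A minimal illustration (the addresses below are made up, not Neumann's actual LAN numbering):

```
# /etc/mosix.map
# <node number>  <IP address>  <number of nodes at consecutive addresses>
1  192.168.1.1  12
```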
If a slave node crashes, the jobs that had migrated to it are lost (and the cluster may hang for a while while it sorts things out). If the master node crashes, all jobs are lost (except those started somewhere else). This is because a job that has migrated away keeps some references/pointers/resources on its home node.
To keep the master node from running too many jobs, either migrate them manually with migrate or openmosixmigmon, or set the master's speed to some low value with mosctl setspeed 7000.
Partitions are slightly different on the master than on the slaves.
Mount | Type | Size | Master | Slaves |
---|---|---|---|---|
/boot | ext3 | 100Mb | /dev/hda1 | /dev/hda1 |
swap | swap | 4Gb | /dev/hda2 | /dev/hda3 |
/ | ext3 | 1Gb | /dev/hda3 | /dev/hda7 |
/usr | ext3 | 4Gb | /dev/hda5 | /dev/hda2 |
/var | ext3 | 1Gb | /dev/hda6 | /dev/hda6 |
/tmp | ext3 | 1Gb | /dev/hda7 | /dev/hda5 |
/home | ext3 | 47Gb | /dev/hda8 | |
/spare | ext3 | 47Gb | | /dev/hda8 |
/raid | ext3 | 1.7Tb | /dev/sda1 | |
Note that there's no specific use for the 11 /spare partitions (totaling about 500Gb) present on the slaves. Here is my suggestion: if you have I/O intensive jobs, put your data on one of those drives (say /mfs/7/spare/) and then run your job in a similar way: mosrun -j7 -F -i "job </mfs/7/spare/LargeDataFile"
so that the job stays close to its data and doesn't use the network. Remember to delete your data afterwards (or back it up on /raid).
The RAID is based on a 3ware hardware controller model Escalade 7500-8 with 8 250Gb Maxtor hard drives. The RAID is configured as a RAID-5 where data parity is spread over all the drives. One drive may fail without causing any data loss. After a drive failure, the array must either be rebuilt on the remaining drives, or a blank drive may be inserted to replace the failed one prior to rebuilding. In case of drive failure, an email is sent to the admin.
Configuration is done either at boot time in the 3ware utility (for basic configuration), with the cli command-line tool, or preferably with 3dm through a web browser. From neumann itself, type mozilla http://localhost:1080/ to manage the array.
Some NFS mounts are available under /mnt/, for instance personal network files are mounted under /mnt/udrive/yourusername. Note that this is the Engineering network username, not the Neumann username, which may be different.
The AMD Athlon MP processors run native x86 code, but for full performance you should use the special athlon compiling options (SSE and 3DNow!) whenever your compiler supports them. One of the main differences between Intel and AMD is that the Athlon has 3 floating point units (FPUs), leading to a theoretical performance of 2 double precision multiplications and 2 single precision additions per clock cycle. The actual performance is 3.4 FLOPs per clock, leading to a peak of about 160 Gflops for the entire cluster (24 CPUs x 2.0 GHz x 3.4 FLOPs per clock).
The GNU compilers (g77, gcc, gcj) are the Linux defaults. Use g77 --target-help or gcc --target-help to get the relevant compiling options for optimizing your programs (add them to your makefile).
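As an illustration, a minimal Makefile fragment using the flags from the table below; the variable values are the only part specific to this cluster, everything else is a placeholder:

```make
# Compiler settings for the Athlon MP nodes (flags from the table below).
CC     = gcc
CFLAGS = -O3 -march=athlon-mp
F77    = g77
FFLAGS = -O3 -march=athlon-mp
```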
For the PGI compilers (pgcc, pgCC, pgdbg, pgf77, pgf90, pghpf...) you need the proper variables set ($PGI, $MANPATH...) in your ~/.profile or ~/.bash_profile before you can use them or read their man pages. Note that PGI Workstation is limited to 4 processors per thread and that you can compile only from node 1.
CC | CFLAGS | Description |
---|---|---|
gcc | -O3 -march=athlon-mp | GNU compiler 3.2 |
pgcc | -O2 -tp athlonxp -fast | Portland Group C compiler |
pgCC | -O2 -tp athlonxp -fast | Portland Group C++ compiler |
bcc | -3 | Bruce's C compiler |

FC | FFLAGS | Description |
---|---|---|
g77 | -O3 -march=athlon-mp | G77 3.2 |
pgf77 | -O2 -tp athlonxp -fast | Portland Group Fortran 77 compiler |
pgf90 | -O2 -tp athlonxp -fast | Portland Group Fortran 90 compiler |
pghpf | -O2 -tp athlonxp -fast | Portland Group High Performance Fortran compiler |

Other languages | Flags | Description |
---|---|---|
gcj | ? | Ahead-of-time GNU compiler for the Java language |
perl | ? | The Perl interpreter |
python | ? | An interpreted, interactive, object-oriented programming language |
pgdbg | ? | The PGI X-windows debugger |
Active services | Disabled services |
---|---|
anacron apmd atd autofs crond gpm httpd isdn keytable kudzu lpd netfs network nfslock ntpd openmosix pcmcia portmap random rawdevices rhnsd sendmail sshd syslog xfs xinetd | aep1000 bcm5820 firstboot iptables irda named nfs nscd saslauthd smb snmpd snmptrapd tux winbind |
xinetd based: mosstatd sgi_fam | xinetd based: chargen chargen-udp daytime daytime-udp echo echo-udp rsync servers services time time-udp |
Type /mnt/fluent/bin/fluent
to start the fluent application (remotely mounted from gohan). Also installed is xfig. I'm having trouble compiling Grace, a successor to xmgr.
If you need to work with MS Office documents, you can start OpenOffice with /usr/lib/openoffice/program/soffice
or /usr/lib/openoffice/program/scalc
or more in that directory...
The PGI Workstation suite of compilers are installed in /usr/pgi (see above).
Here's the short list below. For a complete list with version numbers and comments, see the complete RPM list with rpm -qa --info
at the command line.
4Suite gal libf2c netconfig setuptool a2ps gawk libgal19 netpbm sgml acl gcc libgcc newt sh alchemist gconf libgcj nfs shadow anacron GConf libghttp nmap slang apmd GConf2 libglade nscd slocate ash gd libglade2 nss_ldap slrn aspell gdb libgnat ntp sox at gdbm libgnome ntsysv specspo atk gdk libgnomecanvas oaf splint attr gdm libgnomeprint octave star audiofile gedit libgnomeprint15 Omni statserial authconfig gettext libgnomeprintui omtest strace autoconf gftp libgnomeui openjade stunnel autofs ggv libgtop2 openldap sudo automake ghostscript libIDL openmosix swig automake14 gimp libjpeg openmosixview switchdesk automake15 glib libmng openmotif sysklogd basesystem glib2 libmrproject openoffice syslinux bash glibc libogg openssh SysVinit bc Glide3 libole2 openssl talk bdflush glut libpcap orbit tar bind gmp libpng ORBit tcl binutils gnome libpng10 ORBit2 tcpdump bison gnumeric librpm404 pam tcp_wrappers bitmap gnupg librsvg2 pam_krb5 tcsh blas gnuplot libstdc++ pam_smb telnet bonobo gpg libtermcap pango termcap byacc gphoto2 libtiff parted tetex bzip2 gpm libtool passivetex texinfo cdecl gqview libungif passwd textutils chkconfig grep libunicode patch time chkfontpath groff libusb patchutils timeconfig compat grub libuser pax tk comps gtk libvorbis pciutils tmpwatch control gtk+ libwnck pcre traceroute cpio gtk2 libwvstreams perl transfig cpp gtkam libxml php ttfprint cracklib gtkhtml libxml2 pilot tux crontabs gtkhtml2 libxslt pine umb cups guile lilo pinfo units curl gzip linc pkgconfig unix2dos cvs hdparm linuxdoc pmake unzip cyrus hesiod lockdev pnm2ppa up2date db4 hotplug logrotate popt urw desktop hpijs logwatch portmap usbutils dev htmlview lokkit postfix usermode dev86 httpd losetup ppp utempter dhclient hwbrowser LPRng procmail util dia hwcrypto lrzsz procps VFlib2 dialog hwdata lsof psmisc vim diffstat ImageMagick ltrace pspell vixie diffutils imlib lvm psutils vte docbook indent lynx pvm w3c dos2unix indexhtml m4 pygtk2 webalizer dosfstools info magicdev pyOpenSSL wget doxygen initscripts mailcap python which dump intltool mailx pyxf86config whois e2fsprogs iproute make PyXML WindowMaker ed iptables MAKEDEV qt wireless eel2 iputils man quota words eject irda maximum raidtools wvdial emacs isdn4k mc rcs Xaw3d eog jadetex memprof rdate xdelta esound jfsutils metacity rdist xfig ethereal kbd mingetty readline XFree86 ethtool kbdconfig minicom redhat Xft evolution kernel mkbootdisk reiserfs xinetd expat krb5 mkinitrd rhl xinitrc expect krbafs mktemp rhn xisdnload fam ksymoops mod_perl rhnlib xml fbset kudzu mod_python rhpl xmltex fetchmail lam mod_ssl rmt xmlto file lapack modutils rootfiles xpdf filesystem less mount rp xsane fileutils lesstif mouseconfig rpm xscreensaver findutils lftp mozilla rpm404 xsri finger lha mpage rsh Xtest firstboot libacl mrproject rsync yelp flex libaio mt samba yp fontconfig libart_lgpl mtools sane ypbind foomatic libattr mtr screen zip fortune libbonobo mutt scrollkeeper zlib freetype libbonoboui nautilus sed ftp libcap ncurses sendmail gail libcapplet0 nedit setserial gaim libelf net setup