Tomorrow some of us are meeting with Dirk Duellmann of CERN over the hardware requirements for setting up RAC. I went to a RAC administration class last week which dealt with the software installation, but not with the hardware behind it. I am pretty good at Oracle, but I am by no means a hardware specialist. My first inclination is to copy what CERN has done, albeit on a much smaller scale.
Very simply put, a Real Application Cluster is a group of nodes that share disks and exchange information held in their individual shared memories.
We need a minimum of two servers, and each server needs at least one host bus adapter. We need at least one disk array, a switch allowing both nodes to talk to the array, and cabling between the nodes and the switch and between the switch and the array. We need network cards/ports and cabling to directly connect the two machines, and also cards/ports and cabling for network traffic from outside the cluster.
We will need to install the following software from Oracle: Clusterware (Cluster Ready Services, AKA CRS), the Automatic Storage Management (ASM) software, and the Oracle database software. The nodes will need two disks: one for the installation of the OS and one for the Oracle software. We'll need to be able to keep at least two versions of the Oracle software products on disk.
Each node will have its own SGA (System Global Area), a shared memory allocation, and each will be running the background processes which, together with the SGA, make up an Oracle instance. We will need enough CPU power and memory on each node to support the above. The SGAs may need to be as much as 25% bigger than with a single-node instance.
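The per-node memory budget above can be sketched with some quick arithmetic. This is only a back-of-the-envelope sketch; the 25% uplift is the rule of thumb stated above, but the example SGA and OS/PGA figures are assumptions of mine, not measured numbers.

```python
# Rough per-node memory sizing for RAC. The ~25% SGA uplift over a
# single-instance SGA comes from the note above; the input numbers
# below are hypothetical placeholders.

def rac_sga_estimate(single_instance_sga_gb, uplift=0.25):
    """Estimated RAC SGA: single-instance SGA plus the ~25% uplift."""
    return single_instance_sga_gb * (1 + uplift)

def node_memory_needed(sga_gb, os_and_pga_gb):
    """Total node memory: SGA plus headroom for the OS, PGA, and background processes."""
    return sga_gb + os_and_pga_gb

sga = rac_sga_estimate(3.0)            # assume a 3 GB single-instance SGA
total = node_memory_needed(sga, 4.0)   # assume 4 GB for OS/PGA/processes
print(f"per-node SGA ~{sga:.2f} GB, node memory ~{total:.2f} GB")
```

Plugging in different SGA and headroom figures gives a quick feel for whether a candidate box has enough memory.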
We will need to place on raw disks: copies of the Oracle Cluster Registry, which stores information about the cluster; copies of the voting files, which determine whether a node is participating; and copies of the ASM database, which maintains information about the disk configurations done via ASM.
Also on the array, but maintained by ASM, we will need to establish two disk groups: one for current database operations and a flash recovery area (FRA) for quick recovery from backups. Oracle recommends that the FRA be two to three times the size of the current database area. Redo logs and undo tablespaces are needed for each instance.
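The FRA sizing rule above is easy to turn into numbers. A minimal sketch, using a hypothetical 500 GB data disk group (the multiplier range is from Oracle's recommendation as stated above; the data-area size is my assumption):

```python
def fra_size_range(data_area_gb):
    """Oracle's rule of thumb from the note: FRA two to three times the data area."""
    return (2 * data_area_gb, 3 * data_area_gb)

low, high = fra_size_range(500)   # hypothetical 500 GB data disk group
print(f"FRA between {low} and {high} GB")
```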
Dirk was able to answer most, but not all of the questions pertaining to the RAC setup at CERN. The following is from my notes on the meeting. My own comments are italicized.
For the individual nodes: Currently they are running on 2-CPU 32-bit Linux boxes with 4 GB of memory. The speed rating of the CPUs was not known. They want to move to 2-CPU quad-core 64-bit Linux boxes with 16 GB of memory.
The number of internal disks was not stated. Dirk did mention that different clusters provide backup for each other. We presently use internal disks to guard certain files against array failure; CERN is taking advantage of their multiple clusters to provide that protection. This probably is not going to be a choice for us. I would deduce they have two internal disks per node. We will need three.
Current SGA size is 3 GB. This seems a bit high to me, as it leaves only 1 GB per node for everything else. Perhaps they can do this by distributing the work over the eight nodes of their cluster. Dirk did state that they try to keep their CPUs at no more than 50% usage. I envision our processors working harder.
Interconnects: Interconnectivity is provided by gigabit ethernet. There was mention of some latency problems. The hot block problem (for example, multiple nodes inserting rows at a high rate, each insert forcing an update of a sequential-index block that must be copied from one SGA to another) is something that can doom RAC. CERN has a test cluster to ensure applications are well-behaved. They are rewritten if they are not. I'm not certain such freedom exists for GLAST.
Between the nodes and the disk arrays: Each node has four host bus adapters, two per fiber channel switch. Not much information was known about the switches, except that they are from QLogic. The four HBAs and two switches are employed for redundancy, not increased capacity.
Disk Arrays: There are two Sun storage area network arrays per cluster. There are 16(?) 300 GB 7500 rpm disks per node. The arrays have simple controllers. Oracle's ASM software is used for both striping and mirroring. The two arrays are mirrors of each other, allowing the system to continue if an entire array is lost. No additional logical volume manager is used, as currently only ASM can set up mirroring between separate physical arrays.
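Since ASM mirrors the two arrays against each other, usable capacity is half of raw. A minimal sketch, assuming the 16-disk count from my notes (which Dirk gave with a question mark) and two-way ASM mirroring:

```python
def usable_capacity_gb(arrays, disks_per_array, disk_gb, mirror_copies=2):
    """Raw capacity across all arrays divided by the mirror count
    (two-way ASM mirroring keeps two copies of every extent)."""
    raw = arrays * disks_per_array * disk_gb
    return raw / mirror_copies

# 2 arrays x 16 disks x 300 GB = 9600 GB raw, 4800 GB usable when mirrored
print(usable_capacity_gb(2, 16, 300))
```

The data disk group and the FRA then have to share that usable figure, which is worth remembering when applying the two-to-three-times FRA rule.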
Dirk did not have details on the placement and number of cluster registry, voting disks, or ASM database files. He is going to see if he can get me access to the internal documents which hold that information.
They are indeed using ASM to create data and flash recovery area disk groups. The disk groups are placed on the same physical spindles, with the FRA files placed on the slower inside tracks. This setup does make it possible to lose both current and backup information at once. They protect themselves against that by having it mirrored to the other array and also copied in some manner to arrays belonging to another cluster. The database files of concern here are the control files, the active redo logs, and the archived redo logs before they go to tape. I recommend a copy of these be written to internal disk.
Additional Nuggets: Where they have employed standby databases, they have been physical, not logical. They worry about the ability of logical standbys to keep up with the load. They have made no promises that outages for maintenance can be avoided, just that their frequency can be reduced.