Ian wrote up a straw-man architecture plan and some questions:

Ian's Proposed Architecture

Intro

Tomorrow some of us are meeting with Dirk Duellmann of CERN over the hardware requirements for setting up RAC.  I went to a RAC administration class last week which dealt with the software installation, but not with the hardware behind it.  I am pretty good at Oracle, but I am by no means a hardware specialist. My first inclination is to copy what CERN has done, albeit on a much smaller scale.

Very simply put, a Real Application Cluster is a group of nodes which share disks and exchange information held in their individual shared memory areas.

Assumptions

We need a minimum of two servers, and each server needs at least one host bus adapter. We need at least one disk array, a switch allowing both nodes to talk to the array, and cabling between the nodes and the switch and between the switch and the array. We need network cards/ports and cabling to directly connect the two machines, and also cards/ports and cabling for network traffic from outside the cluster.

Looking more closely at the individual nodes

We will need to install the following software from Oracle: Clusterware (Cluster Ready Services, AKA CRS), the Automatic Storage Management (ASM) software, and the Oracle database software. Each node will need two disks: one for the installation of the OS and one for the Oracle software. We'll need to be able to keep at least two versions of the Oracle software products on disk.

Each node will have its own SGA (System Global Area), a shared memory allocation, and each will be running the background processes which, together with the SGA, make up an Oracle instance. We will need the CPU power and memory on each node to support the above. The SGAs may need to be as much as 25% bigger than with a single-node instance.
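
For concreteness, here is a minimal sketch of how the per-instance SGA might be set from SQL*Plus (Oracle 10g parameter names; the 2.5 GB figure is only a placeholder, not a sizing recommendation):

    -- Size each instance's SGA, allowing roughly 25% over a single-instance SGA
    -- for RAC's extra cache-coordination overhead (placeholder value).
    ALTER SYSTEM SET sga_max_size = 2500M SCOPE=SPFILE SID='*';
    ALTER SYSTEM SET sga_target   = 2500M SCOPE=SPFILE SID='*';
    -- Verify the current allocation on every instance after restart.
    SELECT inst_id, name, ROUND(value/1024/1024) AS mb FROM gv$sga;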

On the disk arrays

We will need to place on raw disks: copies of the Oracle Cluster Registry (OCR), which stores information about the cluster; copies of the voting files, which determine whether a node is participating; and copies of the ASM database, which maintains information about the disk configurations done via ASM.

Also on the array, but maintained by ASM, we will need to establish two disk groups: one for current database operations and a flash recovery area (FRA) for quick recovery from backups. Oracle recommends that the FRA be two to three times the size of the current database area. Redo logs and undo tablespaces are needed for each instance.
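
As an illustration only, a sketch of creating the two disk groups in the ASM instance and pointing the database at the FRA (group names, raw device paths, and the 600 GB figure are assumptions, not decisions):

    -- In the ASM instance: create the two disk groups (device paths are hypothetical).
    CREATE DISKGROUP data EXTERNAL REDUNDANCY DISK '/dev/raw/raw1', '/dev/raw/raw2';
    CREATE DISKGROUP fra  EXTERNAL REDUNDANCY DISK '/dev/raw/raw3', '/dev/raw/raw4';

    -- In the database instances: size and locate the flash recovery area,
    -- keeping it two to three times the size of the data area per Oracle's advice.
    ALTER SYSTEM SET db_recovery_file_dest_size = 600G SCOPE=BOTH SID='*';
    ALTER SYSTEM SET db_recovery_file_dest      = '+FRA' SCOPE=BOTH SID='*';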

Ian's Questions for Dirk

For the individual nodes

  • How much memory and how many disks (size and number) does each node have?
  • How many CPUs?
    • What is their speed? 
    • How many cores?
  • What is the size of the individual SGAs?
  • How fast are the interconnects between nodes?
    • How many HBA's per node?
    • What is their throughput?

Between the nodes and the disk arrays

  • What type of switch do you have?
    • What is its capacity?

Disk Arrays

  • What types of arrays are you using? 
    • How fast are the disks?
    • What type are they?
    • What is their capacity?
  • Can you provide detail on where you are placing:
    • the OCR,
    • the voting disks,
    • and the ASM databases
    with respect to each other and the files maintained by ASM?
  • How are you maintaining multiple copies of the OCR, voting disks, and ASM databases?
  • Are you using ASM for both striping and mirroring?
  • Is there no LVM involved?
    • If so, why did you choose to go without one?
  • I noticed from your wiki that you are placing the flash recovery area on the slower inner tracks and current data on the outer ones. 
    • Other than RAID, what prevents the loss of a single disk from taking out not only the current data but also its backup in the FRA?

Dirk's Answers to Ian's Questions:

Dirk was able to answer most, but not all, of the questions pertaining to the RAC setup at CERN. The following is from my notes on the meeting. My own comments are italicized.

For the individual nodes:  Currently they are running on 2-CPU, 32-bit Linux boxes with 4 GB of memory. The clock speed of the CPUs was not known. They want to move to 2-CPU, quad-core, 64-bit Linux boxes with 16 GB of memory.

The number of internal disks was not stated. Dirk did mention that different clusters provide backup for each other. We presently use internal disks to guard certain files against array failure; CERN is taking advantage of their multiple clusters to provide that protection. This probably is not going to be a choice for us. I would deduce they have two internal disks per node. We will need three.

Current SGA size is 3 GB. This seems a bit high to me, as it leaves only 1 GB per node for everything else. Perhaps they can do this by distributing the work over the eight nodes of their cluster. Dirk did state that they try to keep their CPUs at no more than 50% usage. I envision our processors working harder.

Interconnects:  Interconnectivity is provided by gigabit Ethernet. There was mention of some latency problems. The hot block problem (for example, multiple nodes inserting rows at a high rate, forcing blocks of a sequentially keyed index to be copied back and forth between SGAs) is something which can doom RAC. CERN has a test cluster to ensure applications are well-behaved; they are rewritten if they are not. I'm not certain such freedom exists for GLAST.
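
Not from the meeting, but as a sketch of the kind of mitigation a test cluster might push an application toward (the table, column, and object names are made up):

    -- A large, NOORDER cache lets each instance draw its own range of key values
    -- instead of all instances fighting over the same right-hand index block.
    CREATE SEQUENCE run_id_seq CACHE 1000 NOORDER;
    -- A reverse-key index spreads consecutive keys across many leaf blocks,
    -- reducing the shipping of one hot block between SGAs.
    CREATE INDEX run_pk_idx ON run (run_id) REVERSE;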

Between the nodes and the disk arrays:  Each node has four host bus adapters, two per Fibre Channel switch. Not much information was known about the switches except that they are from QLogic. The four HBAs and two switches are employed for redundancy and not increased capacity.

Disk Arrays:  There are two Sun storage area network (SAN) arrays per cluster. There are 16 (?) 300 GB, 7500 rpm disks per node. The arrays have simple controllers. Oracle's ASM software is used for both striping and mirroring. The two arrays are mirrors of each other, allowing the system to continue if an entire array is lost. No additional logical volume manager is used, as currently only ASM can set up mirroring between separate physical arrays.
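
A hedged sketch of how ASM mirroring across two arrays is typically expressed (I do not know CERN's actual group names or device paths; these are illustrative):

    -- Each FAILGROUP corresponds to one physical array, so ASM keeps the mirrored
    -- copy of every extent on the other array (device paths are made up).
    CREATE DISKGROUP data NORMAL REDUNDANCY
      FAILGROUP array1 DISK '/dev/raw/raw1', '/dev/raw/raw2'
      FAILGROUP array2 DISK '/dev/raw/raw3', '/dev/raw/raw4';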

Dirk did not have details on the placement and number of the cluster registry, voting disk, or ASM database files. He is going to see if he can get me access to the internal documents which hold that information.

They are indeed using ASM to create a data disk group and a flash recovery area disk group. The disk groups are placed on the same physical spindles, with the FRA files being placed on the slower inside tracks. The setup does make it possible to lose both current and backup information. They protect themselves against that by having it mirrored to the other array and also copied in some manner to arrays belonging to another cluster. The database files of concern here are the control files, the active redo logs, and the archived redo logs before they go to tape. I recommend a copy of these be written to internal disk.
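
A sketch of what keeping an extra copy of those files on a node-internal disk could look like (this is my suggestion, not CERN's setup; the /u02 paths, the ASM file name, and the database name "glast" are placeholders):

    -- Multiplex the control file to a local (non-ASM) destination as well.
    ALTER SYSTEM SET control_files = '+DATA/glast/controlfile/current.261.1',
                                     '/u02/oradata/glast/control02.ctl'
      SCOPE=SPFILE SID='*';
    -- Keep a second archived-redo-log destination on internal disk until the tape copy is made.
    ALTER SYSTEM SET log_archive_dest_2 = 'LOCATION=/u02/arch/glast' SCOPE=BOTH SID='*';
    -- Add a second member on local disk to each online redo log group (group 1 shown).
    ALTER DATABASE ADD LOGFILE MEMBER '/u02/oradata/glast/redo01b.log' TO GROUP 1;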

Additional Nuggets:  Where they have employed standby databases, they have been physical, not logical. They worry about the ability of logical standbys to keep up with the load. They have made no promises that outages for maintenance can be avoided, just that their frequency can be reduced.
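
If we do build a standby, a quick way to quantify the "keeping up" concern is the Data Guard statistics view (available in recent Oracle releases); a hedged sketch:

    -- On the standby instance: report how far apply and transport are behind the primary.
    SELECT name, value, time_computed
      FROM v$dataguard_stats
     WHERE name IN ('apply lag', 'transport lag');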

 
