Issues with Production Grids. Tony Hey, Director of the UK e-Science Core Programme
NGS “Today”. Projects: e-Minerals, e-Materials, Orbital Dynamics of Galaxies, Bioinformatics (using BLAST), GEODISE project, UKQCD Singlet meson project, Census data analysis, MIAKT project, e-HTPX project, RealityGrid (chemistry). Users: Leeds, Oxford, UCL, Cardiff, Southampton, Imperial, Liverpool, Sheffield, Cambridge, Edinburgh, QUB, BBSRC, CCLRC. Interfaces: OGSI::Lite
NGS Hardware. Compute Cluster: •64 dual CPU Intel 3.06 GHz (1MB cache) nodes •2GB memory per node •2x 120GB IDE disks (1 boot, 1 data) •Gigabit network •Myrinet M3F-PCIXD-2 •Front end (as node) •Disk server (as node) with 2x Infortrend 2.1TB U16U SCSI Arrays (UltraStar 146Z10 disks) •PGI compilers •Intel Compilers, MKL •PBSPro •TotalView Debugger •RedHat ES 3.0. Data Cluster: •20 dual CPU Intel 3.06 GHz nodes •4GB memory per node •2x 120GB IDE disks (1 boot, 1 data) •Gigabit network •Myrinet M3F-PCIXD-2 •Front end (as node) •18TB Fibre SAN (Infortrend F16F 4.1TB Fibre Arrays, UltraStar 146Z10 disks) •PGI compilers •Intel Compilers, MKL •PBSPro •TotalView Debugger •Oracle 9i RAC •Oracle Application Server •RedHat ES 3.0
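As an illustration of how a user job might reach the compute cluster's PBSPro batch system, here is a minimal sketch that writes a small job script and submits it with qsub. Only the dual-CPU node layout comes from the slide above; the resource-request syntax, walltime, job name and MPI launch line are assumptions about a typical PBSPro setup of that era, not the NGS's actual configuration.

    # submit_ngs_job.py: sketch of submitting a parallel job to a PBSPro
    # batch system. Directives and launcher below are assumed, not site policy.
    import subprocess
    import tempfile

    PBS_SCRIPT = """#!/bin/bash
    #PBS -N md_test
    # Four of the dual-CPU nodes (resource syntax assumed):
    #PBS -l nodes=4:ppn=2
    #PBS -l walltime=02:00:00
    cd $PBS_O_WORKDIR
    # 8-way MPI run over the Myrinet interconnect (launcher name assumed):
    mpirun -np 8 ./my_md_code input.dat
    """

    def submit(script_text):
        """Write the job script to a temporary file and hand it to qsub."""
        with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
            f.write(script_text)
            path = f.name
        result = subprocess.run(["qsub", path], capture_output=True,
                                text=True, check=True)
        return result.stdout.strip()   # qsub prints the new job identifier

    if __name__ == "__main__":
        print("Submitted job:", submit(PBS_SCRIPT))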
NGS Software: Core Node Software Stack. Foundations: OGSA-DAI. Middleware: VDT 1.2, SRB 3, Globus GT3. Backend: Oracle 9i RAC. Libraries and tools: PGI and Intel compilers, TotalView Debugger, on RedHat Enterprise Linux 3.0. Example applications: NAG, DL_POLY, NCBI BLAST, MATLAB, RealityGrid
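As a concrete illustration of driving the Globus layer of this stack from a client machine, the sketch below creates a proxy credential and runs a trivial test command on a remote gatekeeper. grid-proxy-init and globus-job-run are standard Globus client tools, assumed here to be installed (for example via the VDT) and on the PATH; the gatekeeper host name is hypothetical.

    # run_on_ngs.py: sketch of a connectivity test through Globus pre-WS GRAM.
    # The host name is hypothetical; the two client commands are standard tools.
    import subprocess

    GATEKEEPER = "grid-compute.example.ac.uk"   # hypothetical NGS gatekeeper

    def main():
        # Create a short-lived proxy from the user's grid certificate
        # (prompts for the certificate pass phrase).
        subprocess.run(["grid-proxy-init"], check=True)

        # Run a trivial job on the remote resource as a sanity check.
        result = subprocess.run(["globus-job-run", GATEKEEPER, "/bin/hostname"],
                                capture_output=True, text=True, check=True)
        print("Remote node reports:", result.stdout.strip())

    if __name__ == "__main__":
        main()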
RealityGrid AHM Experiment • Measuring protein-peptide binding energies – ∆∆Gbind is vital, e.g. for understanding the fundamental physical processes at play at the molecular level and for designing new drugs. • Computing a peptide-protein binding energy traditionally takes weeks to months. • We have developed a grid-based method to accelerate this process. • We computed ∆∆Gbind for a ligand binding to the Src SH2 domain during the UK AHM, i.e. in less than 48 hours.
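The slides do not spell out how ∆∆Gbind is obtained, but the thermodynamic-integration plot shown later (dE/dλ against λ) is consistent with the standard relations below, included here as textbook background rather than as the project's stated protocol:

    \Delta G = \int_{0}^{1} \left\langle \frac{\partial E(\lambda)}{\partial \lambda} \right\rangle_{\lambda} d\lambda ,
    \qquad
    \Delta\Delta G_{\mathrm{bind}} = \Delta G_{\mathrm{bound}} - \Delta G_{\mathrm{free}}

where λ switches the system between the two end states of the alchemical transformation, and the two legs correspond to performing that transformation with the peptide bound to the protein and free in solution.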
Experiment Details • A Grid-based approach, using the RealityGrid steering library, enables us to launch, monitor, checkpoint and spawn multiple simulations • Each simulation is a parallel molecular dynamics simulation running on a supercomputer-class machine • At any given instant, we had up to nine simulations in progress (over 140 processors) on machines at 5 different sites, e.g. 1x TG-SDSC, 3x TG-NCSA, 3x NGS-Oxford, 1x NGS-Leeds, 1x NGS-RAL
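To make the launch / monitor / checkpoint / spawn cycle concrete, here is a self-contained toy sketch of that control loop. The real RealityGrid steering library is a C library with its own API; every name below (Simulation, launch, monitor, checkpoint_and_spawn) is a hypothetical stand-in that only mimics the shape of the workflow, not the real calls.

    # steering_workflow.py: toy model of the launch/monitor/checkpoint/spawn loop.
    # Nothing here uses the real RealityGrid steering API.
    import random
    import itertools
    from dataclasses import dataclass

    SITES = ["TG-SDSC", "TG-NCSA", "NGS-Oxford", "NGS-Leeds", "NGS-RAL"]
    MAX_CONCURRENT = 9          # up to nine simulations in progress at once
    _ids = itertools.count(1)

    @dataclass
    class Simulation:
        sim_id: int
        site: str
        steps_left: int

    def launch(site):
        """Pretend to start a parallel MD run at a remote site."""
        return Simulation(next(_ids), site, steps_left=random.randint(2, 5))

    def monitor(sim):
        """Advance the fake run by one steering interval; True when finished."""
        sim.steps_left -= 1
        return sim.steps_left <= 0

    def checkpoint_and_spawn(sim, site):
        """Pretend to checkpoint a run and start a new branch elsewhere."""
        print(f"checkpointing sim {sim.sim_id}, spawning branch at {site}")
        return launch(site)

    if __name__ == "__main__":
        active = [launch(site) for site in SITES[:3]]
        while active:
            for sim in list(active):
                if monitor(sim):
                    print(f"sim {sim.sim_id} at {sim.site} finished")
                    active.remove(sim)
                elif random.random() < 0.2 and len(active) < MAX_CONCURRENT:
                    active.append(checkpoint_and_spawn(sim, random.choice(SITES)))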
Experiment Details (2) • In all, 26 simulations were run over 48 hours; we simulated over 6.8 ns of classical molecular dynamics in this time • Real-time visualization and off-line analysis required bringing back data from simulations in progress • We used UKLight between UCL and the TeraGrid machines (SDSC, NCSA)
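At the command level, "bringing back data from simulations in progress" can be done with GridFTP's standard globus-url-copy client over a link such as UKLight; the sketch below is illustrative, with hypothetical host names and file paths, and assumes a valid grid proxy already exists.

    # fetch_trajectory.py: pull a partial trajectory file back from a remote
    # site with GridFTP for real-time visualization. Hosts/paths are hypothetical.
    import subprocess

    REMOTE = "gsiftp://gridftp.sdsc.example.edu/scratch/run07/frame_latest.dcd"
    LOCAL = "file:///data/ahm2004/run07/frame_latest.dcd"

    def fetch():
        # -p 4 asks for four parallel TCP streams, which helps on a long
        # high-bandwidth path such as UKLight.
        subprocess.run(["globus-url-copy", "-p", "4", REMOTE, LOCAL], check=True)

    if __name__ == "__main__":
        fetch()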
The e-Infrastructure (AHM 2004). [Diagram: UK NGS sites (Leeds, Oxford, RAL, UCL, Manchester) and US TeraGrid sites (SDSC, NCSA, PSC) linked via UKLight through network PoPs at Starlight (Chicago) and Netherlight (Amsterdam); steering clients ran on local laptops and a Manchester vncserver; legend distinguishes Computation, Steering clients, Network PoP and Service Registry. All sites connected by production network (not all shown).]
The scientific results … [Plot: thermodynamic integration curves, dE/dλ versus λ (λ from 0 to 1, dE/dλ from about -200 to 400).] Some simulations require extending, and more sophisticated analysis needs to be performed
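As a worked illustration of what such a thermodynamic-integration curve is used for, the sketch below integrates a set of averaged dE/dλ values over λ with the trapezoidal rule to obtain a free-energy difference; the numbers are placeholders shaped loosely like the plotted curve, not the experiment's data.

    # thermo_integration.py: integrate <dE/dlambda> over lambda (trapezoidal
    # rule) to get a free-energy difference. Values below are placeholders only.
    lambdas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    dE_dlambda = [310.0, 140.0, 20.0, -60.0, -120.0, -170.0]   # illustrative

    def trapezoid(xs, ys):
        """Plain trapezoidal quadrature of y(x) over the given grid."""
        return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
                   for i in range(len(xs) - 1))

    dG = trapezoid(lambdas, dE_dlambda)
    print(f"Delta G for this leg: {dG:.1f} (same units as dE/dlambda)")
    # Delta-Delta-G_bind is then the difference between two such legs, e.g. the
    # transformation carried out in the bound state and in the free state.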
… and the problems • Restarted the GridService container Wednesday evening • Numerous quota and permission issues, especially at TG-SDSC • NGS-Oxford was unreachable from Wednesday evening to Thursday morning • The steerer and launcher occasionally failed • We were unable to checkpoint two simulations • The batch queuing systems occasionally did not like our simulations • 5 simulations died of natural causes • Overall, up to six people were working on this calculation to solve these problems