Review of Recent CASTOR Database Problems at RAL. Gordon D. Brown, Rutherford Appleton Laboratory. 3D/WLCG Workshop, CERN, Geneva, 11th-14th November 2008.


1 Review of Recent CASTOR Database Problems at RAL. Gordon D. Brown, Rutherford Appleton Laboratory. 3D/WLCG Workshop, CERN, Geneva, 11th-14th November 2008

2 Overview Current setup Issues Lessons Learnt Monitoring Future

3

4 RAL CASTOR Architecture Our setup is for: –Atlas (Stager, SRM) –CMS (Stager, SRM) –LHCb (Stager, SRM) –General (SRM) –Name Server –DLF –Gen Stager –Repack

5 RAL CASTOR Architecture 12 nodes to use –Need production and test Options included: –Single instance (or small cluster) for each schema –One huge RAC –Combination of above Constraints –Licenses –Single points of failure (did lose all paths at one point) –Resources

6 RAL CASTOR Architecture Outcome –2 x 5 node production clusters –1 x 2 node test cluster
Node allocation:
neptune1: Atlas DLF, LHCb DLF
neptune2: Atlas SRM
neptune3: LHCb Stager
neptune4: LHCb SRM
neptune5: Atlas Stager
pluto1: Name server, CMS Stager
pluto2: CMS SRM
pluto3: Gen Stager
pluto4: Gen SRM, Repack
pluto5: CMS DLF, Gen DLF

7 RAL CASTOR Architecture Oracle Enterprise RAC –Production 10.2.0.4 –Test 10.2.0.3 –All clusters patched with July CPU Backups –RMAN to disk –Tape to Atlas Data Store Monitoring –Oracle Enterprise Manager –Nagios and ganglia on machines

8 Village of CASTOR, Cambridgeshire, UK

9 Issues – “crosstalk” Terminology –SQL executing in wrong schema Issue –14000 files lost on LHCb Evidence –Garbage collection on CASTOR –“Deleting local file which is no longer in the stager catalog” –Also in LHCb stager log: “No object found for id : 1517806678” This is in the Atlas files2delete table

10 Issues – “crosstalk” Suspicion –Not seen by Oracle in 10.2.0.3 –Redo logs inconclusive –Lots of areas with possible wrong config Disk server tnsnames entries IP address for VIPs on database servers Puppet config (on disk servers and central servers) Connection to wrong schema Outcome –Synchronisation is suspended –Haven’t recreated –Difficult for Oracle to analyse

11 Issues – core dumping Issue –ORA-600 raised intermittently on deletes from the id2type table –Happens twice a week on average Evidence –Seen on at least two stager schemas (and nodes) –Application and Oracle logs Outcome –Application recovers –SR open and RDA being performed

12 Issues – cursor invalidation Issue –Detected after getting DML partition lock (ORA-14403) Strangeness –Oracle say resolved in 10.2.0.4 (which we’re on!) –Action from Oracle “nothing to be done, error should never be returned to user” –Cannot recreate at will Outcome –SR open –Parameter to implement (needs instance restart)

13 Issues – constraint violations Issue –Violation of primary key constraint (ORA-00001) –Seen on Atlas Stager id2type table –Complicated Outcome –Implemented Eric’s code to trap error and log it to alert log (will be effective when existing Stager processes restarted)

14 Issues – Big IDs Issue –Huge numbers appearing in INSERT statements –Not from any sequences on the database –Complicated Example:
insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values ('8868517','1002');
insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values ('8868518','1008');
insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values ('58432730170283524000','1005');
insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values ('58432730307722478000','1002');
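One way to spot such values when mining redo logs is to compare each inserted ID against the 64-bit integer range. The sketch below is only an illustration, not the check used at RAL: it parses the four statements from the slide and flags IDs above the signed 64-bit maximum. The function names and the choice of threshold are assumptions made for this example.

```python
import re

# INSERT statements copied from the slide (quotes normalised to plain ASCII).
STATEMENTS = [
    """insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values ('8868517','1002');""",
    """insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values ('8868518','1008');""",
    """insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values ('58432730170283524000','1005');""",
    """insert into "SRMCMS"."ID2TYPE"("ID","TYPE") values ('58432730307722478000','1002');""",
]

MAX_SIGNED_64 = 2**63 - 1    # 9223372036854775807
MAX_UNSIGNED_64 = 2**64 - 1  # 18446744073709551615

# Matches the two quoted numeric values of a statement shaped like the ones above.
VALUES_RE = re.compile(r"values\s*\('(\d+)'\s*,\s*'(\d+)'\)", re.IGNORECASE)


def extract_ids(statements):
    """Return the ID (first value) of every statement that matches the pattern."""
    ids = []
    for statement in statements:
        match = VALUES_RE.search(statement)
        if match:
            ids.append(int(match.group(1)))
    return ids


def suspicious_ids(ids, limit=MAX_SIGNED_64):
    """Return the IDs that exceed the given limit (an assumed threshold)."""
    return [i for i in ids if i > limit]


ids = extract_ids(STATEMENTS)
flagged = suspicious_ids(ids)
for value in flagged:
    print(value, "exceeds unsigned 64-bit too:", value > MAX_UNSIGNED_64)
```

Both large IDs on the slide are greater than 2^64 - 1, so they cannot have been held exactly in a 64-bit integer on the client side. That narrows the search but does not by itself identify the cause.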

15 Issues – performance Issue 1 –Stale statistics appeared even though gathered –Noticed because of poor performance –Re-gathered, pool flushed and all fine Issue 2 –Well-used SQL query time degraded on Stager (by 300%) –New SQL Profile improved performance again –Due to stats on fluctuating tables? –Cluster waits on Atlas, high network I/O in Atlas/LHCb
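The question about fluctuating tables can be made concrete with the rule Oracle uses by default to mark statistics stale: a table counts as stale once the rows inserted, updated and deleted since the last gather exceed 10% of the row count recorded at gather time. The sketch below applies that rule to two tables. The function name and all row counts are hypothetical and are not taken from the RAL databases.

```python
def is_stale(rows_at_gather, inserts, updates, deletes, stale_percent=10.0):
    """Apply the default staleness rule: changed rows > stale_percent of rows.

    rows_at_gather is the row count recorded when statistics were last
    gathered; the other counts are changes made since then.
    """
    changed = inserts + updates + deletes
    if rows_at_gather == 0:
        # Any change to a table that was empty at gather time makes it stale.
        return changed > 0
    return changed * 100.0 > stale_percent * rows_at_gather


# Hypothetical fluctuating table: small at gather time, heavy churn afterwards.
churning = is_stale(rows_at_gather=1_000, inserts=5_000, updates=0, deletes=4_900)

# Hypothetical stable table: 2% of rows changed since the last gather.
stable = is_stale(rows_at_gather=1_000_000, inserts=10_000, updates=5_000, deletes=5_000)

print("fluctuating table stale:", churning)
print("stable table stale:", stable)
```

A table that empties and refills between nightly gathers can cross the threshold within minutes, which is one possible reading of "stale even though gathered".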

16 Issues – performance Issue 3 –CPU load increasing over 3-4 days –Bonny cleared up subrequest table –Shrank table and it was solved

17 CASTOR Oil Plant

18 Monitoring DB Load –Difficult to know if linked to requests/files –Tools showing CASTOR “load” would be useful –Is the application “good” at being on RAC? Oracle Services –Currently one “preferred” node and one “available” node for each schema –Stagers fail over to SRM for example –Are two nodes per Stager better?

19 Lessons Learnt 1 Machine configuration –Be careful with tnsnames –IP and VIP addresses need care –Hardware should be similar –Schema names are similar Database Administration –We can add/remove cluster node without downtime –Tuning, shrinking and profiles experience –Log miner skills

20 Lessons Learnt 2 Volume –Very high number of transactions –200GB of archive redo logs per day (DB on 80GB) –Recovery would be an issue? Image copies? –Need lots of space for log miner Space –Space needed for analysis (e.g. log miner) –More space needed for redo logs/backups
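The volume figures on this slide translate directly into disk requirements. As a rough sketch, the calculation below takes the 200GB/day and 80GB figures from the slide and estimates the space needed to keep archived redo logs on disk. The seven-day retention period is an assumption for illustration, not a RAL policy.

```python
def redo_to_db_ratio(redo_gb_per_day, db_size_gb):
    """Daily archived redo volume expressed as a multiple of the database size."""
    return redo_gb_per_day / db_size_gb


def archive_space_gb(redo_gb_per_day, retention_days):
    """Disk space needed to retain archived redo logs for the given number of days."""
    return redo_gb_per_day * retention_days


REDO_GB_PER_DAY = 200  # from the slide
DB_SIZE_GB = 80        # from the slide
RETENTION_DAYS = 7     # assumption, chosen only for this example

ratio = redo_to_db_ratio(REDO_GB_PER_DAY, DB_SIZE_GB)
weekly_space = archive_space_gb(REDO_GB_PER_DAY, RETENTION_DAYS)

print(f"Redo per day is {ratio:.1f}x the database size")
print(f"{RETENTION_DAYS} days of archived redo needs about {weekly_space} GB")
```

With redo at 2.5 times the database size per day, replaying logs after a restore involves far more data than the restore itself, which is why image copies are raised as an option on the slide.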

21 CASTOR River, Ontario, Canada

22 People DBAs –Team of four –Good to share skills and experience –Not enough knowledge of application –Pressure CASTOR team at RAL –Excellent communication with DBAs –Gained knowledge of databases –Difficult to know if database or application at fault

23 People CERN and other Tier-1s –Invaluable support –Good communication via email lists –Thanks! –More work together for future architecture –Wiki page appreciated Oracle –Metalink support has been very good

24 Next Steps Set-up –Moving to single instance for 2-3 weeks –Don’t change too much at once! –Difficult to rule out DB issues –Hardware resilience –Auditing? Overhead. Performance –Any more data to clean out? –Tune more SQL –More tests on failover –Backup/recovery –Proactivity

25 CASTOR star in Gemini (second brightest)

26 Questions for CERN/Tier-1s CASTOR Reporting Tools –Shaun has produced stats on SRM showing transactions –What do others use? –What would be useful? Monitoring –What do you monitor (DB and application)? –What’s important in the logs? –Any custom threshold alerts in OEM/lemon?

27 Questions for CERN/Tier-1s Database –Do you gather stats every night? Full? –Any other regular DB jobs? Shrinking? –Volume of transactions/redo logs? –CPU levels? –Plans for 11g? –Backups – full? Level 1? Validate every night? People –How many DBAs (working on CASTOR)? –DBAs’ knowledge of the application? –3D/CASTOR collaboration

28 Questions and (hopefully) Answers databaseservices@stfc.ac.uk

