Thursday 12 June 2008

CCRC Post Mortem workshop (day1 - PM)

Monitoring


Talk about how everything feeds into GridView, but conveniently skips over the lack of detail you normally find in GridView. Mentioned an ongoing effort for sites to get an integrated view of VO activity at their site. Too many pages to look at.

Once again 'nice' names for sites were discussed - we want the 'real' name a site is known by. Network monitoring tools were discussed (well, the lack of them!), as was the issue of a service being 'up' but not performing well.

Experiments quickly see when a site fails, but the site itself doesn't know. It would be nice to have a mashup of the types of jobs VOs are running at a site (production / user), etc.

Support


Bit rot on wiki pages - they have to be kept up to date (CMS).
Basic functionality tests (can a job run, can a file get copied) SHOULD be sent directly to sites (ATLAS).

Discussions - downtime announcements only go to VO managers by default, so users are not informed. The RSS feed of downtimes can be parsed (see the sketch below). How do sites handle updates? If they announce an SE downtime, jobs still run on the CE but fail because there is no storage. Do they take the entire site down,
or perhaps flag the CE as not in production?
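On that point about parsing the downtime RSS: here is a minimal Python sketch. The feed URL is hypothetical and the tags assume a plain RSS 2.0 layout; check your ROC's actual downtime feed (e.g. the GOCDB one) for the real schema before relying on it.

```python
# Minimal sketch: pull a downtimes RSS feed and print the entries, so they
# could be mailed to local users rather than just the VO managers.
# The URL is a placeholder and the tags assume standard RSS 2.0 <item>s.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.org/downtimes/rss"  # hypothetical feed location

def fetch_downtimes(url=FEED_URL):
    """Return (pubDate, title, link) tuples for each <item> in the feed."""
    with urllib.request.urlopen(url) as feed:
        tree = ET.parse(feed)
    return [(item.findtext("pubDate", ""),
             item.findtext("title", ""),
             item.findtext("link", ""))
            for item in tree.findall(".//item")]

if __name__ == "__main__":
    for pubdate, title, link in fetch_downtimes():
        print("%s | %s | %s" % (pubdate, title, link))
```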

"view from ROCs" - No input from the UKI-ROC - drowned out in email?
Procedures to follow when you get big (power) outages.
[As an aside: during the power cut at CERN, ScotGrid-Glasgow noticed within the hour that the proxy servers had gone down, thanks to a Nagios email alert sitting in the inbox.] Watch out for power cats :-). Perhaps we need an SMS-to-EGEE-Broadcast gateway if sending a broadcast is too hard for operator staff.

GGUS updates - alarm tickets are in progress (via email or portal). Being extended for the LHCOPN to use for tracking incidents. A vigorous discussion ensued about the number / role of people allowed to declare an experiment down (even if no alarms have yet been seen) in order to call out all the relevant people.


Tier2 report - had a phone call and missed part of the talk.

Tier1/PIC - had an unfeasibly large cache (~100TB) in front of T1D0 due to the T1D1 tests. CMS skimming jobs saturated the SE switches but the service stayed up.
Need for QA/testing in middleware. Need to do something with hot files - they cause high load on the AFS cell.

RAL - pretty much as expected. A few issues: low CMS efficiency, user tape recalls, OPN routing. Overall expected transfer rates were not known in advance. Yet another request for a site dashboard.

Should we force the experiments to allocate a week for a combined reconstruction session to check that we're ready?

Right - that's it for now. Off to generate metrics for the curry....

CCRC Post Mortem workshop (day1 - PM)

More from the CCRC post mortem meeting

Middleware


1) No change from the normal release procedures. FTM didn't seem to be picked up as a requirement by the Tier-1s (however RAL already had their own in place). The problem was that there wasn't a definitive software list, just the changes needed. CREAM is coming (slowly) and the SL5 WN perhaps by Sept 08.

SL5 plans - ATLAS would like to get software ready for the winter break (build on 4 and 5).
CMS - switch to SL5 in the winter shutdown; would like to have software before that for testing.
CERN - an lxbatch / lxplus test cluster will be available from September onwards (assuming hardware).

CCRC Post Mortem workshop (day1 - AM)

Some crib notes from the CCRC post mortem workshop - these should be read together with the agenda. Where I've commented, it's on the parts that I consider particularly useful.

Storage


1) CASTOR / StoRM
GPFS vs GridFTP block sizes - GPFS 1MB, SL3 GridFTP 64K, SL4 GridFTP 256K (a back-of-the-envelope comparison is sketched below)
Reduced gridftp timeouts from 3600s to 3000s
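To see why the block-size mismatch matters, a quick back-of-the-envelope calculation (the 1 GB transfer size is purely an assumed illustration, not a measured figure): 64 KB GridFTP buffers mean 16 partial writes landing in every 1 MB GPFS block.

```python
# Back-of-the-envelope sketch of the GPFS vs GridFTP block-size mismatch.
# The 1 GB file size is an assumption for illustration only.
FILE_SIZE = 1 * 1024**3          # 1 GB transfer
GPFS_BLOCK = 1 * 1024**2         # GPFS filesystem block size (1 MB)

for label, chunk in [("SL3 GridFTP", 64 * 1024),
                     ("SL4 GridFTP", 256 * 1024),
                     ("GPFS-aligned", 1024 * 1024)]:
    writes = FILE_SIZE // chunk              # total write calls for the file
    per_block = GPFS_BLOCK // chunk          # partial writes per GPFS block
    print("%-13s chunk=%4d KB  writes=%6d  writes per GPFS block=%2d"
          % (label, chunk // 1024, writes, per_block))
```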

CMS had problems with the GPFS s/w area - a latency issue [same as ECDF?] - partly due to a switch fault, but they migrated the s/w area to another filesystem.

RAL - General slowdown needed DB intervention. LHCb RFIO failures. Need to prioritise tape mounts - Prodn >>> users.


2) CASTOR development - Better testing planned and in progress. Also logging improved.

3) dCache (SARA) - 20 gridftp movers on the Thumpers cf. 6 on the Linux boxes. gsidcap only on the SRM node itself. Good tape metrics gathered and published on their wiki. HSM pools filled up due to orphaned files - removed from the pnfs namespace but not deleted (a generic way to spot these is sketched below). Caused by failed FTS transfers (timeout now increased).
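On the orphaned-file point: one generic way to spot such replicas is a set difference between a dump of the pool contents and a dump of the namespace. A minimal sketch, assuming you can produce both lists as plain text files of PNFS IDs, one per line; the file names and format are assumptions, not actual dCache commands.

```python
# Minimal sketch: find pool replicas whose entries no longer exist in the
# pnfs namespace. Assumes two plain-text dumps, one PNFS ID per line:
#   pool_replicas.txt  - IDs of files actually sitting on the HSM pools
#   pnfs_namespace.txt - IDs still registered in the namespace
# How you generate these dumps is site-specific and not shown here.

def load_ids(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

pool_ids = load_ids("pool_replicas.txt")
namespace_ids = load_ids("pnfs_namespace.txt")

orphans = pool_ids - namespace_ids   # on the pools but gone from the namespace
print("%d orphaned replicas" % len(orphans))
for pnfsid in sorted(orphans):
    print(pnfsid)                    # candidates for cleanup
```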

... coffee time :-)

4) DPM (GRIF/LAL) - mostly OK -- some polishing required (highlighted dpm-drain) and Greig's monitoring work.

Databases


1) 'Make 'em resilient' - have managed 99.96% availability (3.5h downtime/yr; the arithmetic is checked below).
They have upgraded to new hardware on the Oracle cluster @ Tier0.
Lessons learnt from the power cut - make sure your network kit is on a UPS, especially if you have machines you expect to stay up...
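For the record, the 3.5 h figure follows directly from the availability number; a one-line check (using 8760 h in a non-leap year):

```python
# 99.96% availability over a year is roughly 3.5 hours of downtime.
availability = 0.9996
hours_per_year = 365 * 24                 # 8760 h, ignoring leap years
downtime = (1 - availability) * hours_per_year
print("%.1f hours of downtime per year" % downtime)   # -> 3.5
```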

2) ATLAS - rely on a DB for reprocessing that is capable of ~1k concurrent sessions. Some nice replication tricks to (and from) remote sites such as the muon calibration centres.

3) SRM issues - went more or less according to plan. Issues found were understood and corrected. Otherwise it seemed to be within the 'normal' load range that they were used to.