Thursday 12 June 2008

CCRC Post Mortem workshop (day1 - AM)

Some crib notes from the CCRC postmortem workshop. These should be read together with the Agenda. Where I've commented, it's on parts that I consider particularly useful.

Storage


1) Castor / Storm
GPFS vs GridFTP block sizes - GPFS 1MB, SL3 GridFTP 64K, SL4 GridFTP 256K
Reduced gridftp timeouts from 3600s to 3000s
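
To put those block sizes side by side, here is a quick bit of arithmetic of my own (not from the talk). It assumes the quoted K and MB figures are binary units, and the names are just ones I made up for illustration.

```python
# Block sizes as quoted above, assuming binary units (1 MB = 1024 * 1024 bytes).
GPFS_BLOCK = 1024 * 1024
GRIDFTP_BLOCK = {"SL3": 64 * 1024, "SL4": 256 * 1024}


def blocks_per_gpfs_block(gridftp_block: int, gpfs_block: int = GPFS_BLOCK) -> int:
    """Number of GridFTP-sized blocks needed to cover one GPFS block."""
    return gpfs_block // gridftp_block


if __name__ == "__main__":
    for release, size in GRIDFTP_BLOCK.items():
        print(release, blocks_per_gpfs_block(size))
```

So under that assumption the SL4 GridFTP block is a quarter of a GPFS block, where the SL3 one was a sixteenth.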

CMS had problems with GPFS s/w area - latency issue [same as ECDF?] - partly due to a switch fault, but they migrated the s/w area to another filesystem

RAL - General slowdown needed DB intervention. LHCb RFIO failures. Need to prioritise tape mounts - Prodn >>> users.


2) CASTOR development - Better testing planned and in progress. Also logging improved.

3) dCache (SARA) - 20 gridftp movers on thumpers cf 6 on linux boxes. gsidcap only on SRM node itself. Good tape metrics gathered and published on their wiki. HSM pools filled up due to orphaned files - removed from pnfs namespace but not deleted. Caused by failed FTS transfers (timeout now increased)
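
The orphaned-file problem boils down to a set difference: anything still sitting on a pool whose ID is no longer in the namespace. A minimal sketch of that idea follows. It does not use any real dCache or pnfs API; the function name and the plain lists of IDs are hypothetical, and how you would actually dump those lists is site-specific.

```python
from typing import Iterable, List


def find_orphans(pool_ids: Iterable[str], namespace_ids: Iterable[str]) -> List[str]:
    """Return IDs present on the pool but missing from the namespace.

    Both inputs are assumed to be plain iterables of file IDs, obtained
    however the site dumps its pool contents and namespace listing.
    """
    return sorted(set(pool_ids) - set(namespace_ids))


if __name__ == "__main__":
    # Made-up IDs purely for illustration.
    on_pool = ["0001", "0002", "0003", "0004"]
    in_namespace = ["0001", "0003"]
    print(find_orphans(on_pool, in_namespace))
```

Anything this reports would still need checking before deletion, since a transfer in flight could look like an orphan for a while.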

... coffee time :-)

4) DPM (GRIF/LAL) - Mostly OK -- some polishing required (highlighted dpm-drain) and Greig's monitoring work.

Databases


1) 'make em resilient' - have managed 99.96% availability (3.5h downtime/yr)
They have upgraded to new hardware on the oracle cluster @ Tier0
Lessons learnt from the powercut - make sure your network stuff is UPSd esp if you have machines you expect to be up...
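
As a sanity check on the availability figure, the conversion from a percentage to hours of downtime per year is simple enough to sketch. This is my own arithmetic, assuming a 365-day year; the function name is made up.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    print(round(downtime_hours_per_year(99.96), 1))
```

For 99.96% this works out at about 3.5 hours, which matches the number quoted.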

2) ATLAS - rely on DB for reprocessing capable of ~1k concurrent sessions. Some nice replication tricks to (and from) remote sites such as the muon calibration centres

3) SRM issues - went more or less according to plan. Understood and corrected issues found. Otherwise - seemed to be within 'normal' load range that they were used to.
