Thursday 12 June 2008

CCRC Post Mortem workshop (day 1 - PM)

Monitoring


A talk about how everything feeds into GridView, though it conveniently skipped over the lack of detail you normally find in GridView. An ongoing effort was mentioned to give sites an integrated view of VO activity at their site - at the moment there are too many pages to look at.

Once again 'nice' names for sites were discussed - we want the 'real' name that a site is known by. Network monitoring tools were discussed (well, the lack of them!), as was the issue of a service being 'up' but not performing well.

Experiments quickly see when a site fails, but the sites themselves don't know it. It would be nice to have a mashup of the types of jobs VOs are running at a site (production / user) etc.

Support


Bit rot on wiki pages - they have to be kept up to date (CMS).
Basic functionality tests (can a job run / can a file be copied) SHOULD be sent directly to sites (ATLAS).

Discussions - downtime announcements only go to VO managers by default, so users are not informed. The RSS feed of downtimes can be parsed (see the sketch below). How do sites handle updates? If they announce an SE downtime, jobs still run on the CE but fail as there is no storage. Do they take the entire site down?
Perhaps flag the CE as not in production?
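
On the RSS point, here's a minimal sketch of how a site (or user) could watch a downtime feed for mentions of their own site. The feed URL and site name below are placeholders, not the real endpoints - substitute whatever your ROC/GOCDB actually publishes. It assumes the third-party feedparser Python package.

import feedparser  # third-party: pip install feedparser

# Placeholder URL - point this at the real downtime RSS feed your ROC/GOCDB publishes.
DOWNTIME_FEED = "https://example.org/downtimes/rss"

def downtimes_mentioning(site_name):
    """Return feed entries whose title or summary mention the given site."""
    feed = feedparser.parse(DOWNTIME_FEED)
    hits = []
    for entry in feed.entries:
        text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
        if site_name.lower() in text:
            hits.append(entry)
    return hits

if __name__ == "__main__":
    # Example site name - replace with your own.
    for entry in downtimes_mentioning("UKI-SCOTGRID-GLASGOW"):
        print(entry.get("title", ""), "-", entry.get("link", ""))

Something like this could feed a local page or mail users directly, rather than relying on the default VO-manager-only announcements.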

"view from ROCs" - No input from the UKI-ROC - drowned out in email?
Procedures to follow when you get big (power) outages.
[As an aside - during the power cut at CERN, ScotGrid-Glasgow noticed within the hour that the proxy servers had gone down, via a Nagios email alert sitting in the inbox.] Watch out for power cats :-). Perhaps we need an SMS-to-EGEE-Broadcast gateway if sending a broadcast is too hard for operator staff.

GGUS updates - alarm tickets are in progress (via email or portal), and GGUS is being extended so the LHCOPN can use it to track incidents. A vigorous discussion ensued about the number / role of people allowed to declare an experiment down (even if no alarms have yet been seen) in order to call out all the relevant people.


Tier2 report - had a phone call and missed part of the talk.

Tier1/PIC - had an unfeasibly large cache (~100TB) in front of T1D0 due to T1D1 tests. CMS skimming jobs saturated the SE switches, but the service stayed up.
Need for QA/testing in middleware. Need to do something about hot files - they cause high load on the AFS cell.

RAL - pretty much as expected. A few issues: low CMS efficiency, user tape recalls, OPN routing. Overall expected transfer rates were not known in advance. Yet another request for a site dashboard.

Should we push the experiments to allocate a week for a combined reconstruction session, to check that we're ready?

Right - that's it for now. Off to generate metrics for the curry...
