Friday 20 June 2008

CCRC post-mortem workshop - talk summaries

WLCG CCRC post-mortem workshop – summary
Agenda: http://indico.cern.ch/conferenceTimeTable.py?confId=23563

The first part of the workshop was on storage. It was generally observed that there was insufficient time for middleware testing.

CASTOR: Saw SRM ‘lock-ups’ – CASTOR network related. The move to SL4 is becoming urgent. There needs to be more pre-release testing. They want to prioritise tape recalls. [UK: RAL tape servers mentioned as “flaky”].

StoRM – monitoring is an issue. A better admin interface is needed, together with FAQs and improved use of logs.

dCache (based on SARA's experience) – many problems were spotted in CCRC (e.g. lack of movers). Wrong information was provided by the SRM. Crashing daemons were an issue. There were questions about the development roadmap: smaller changes were advertised than were actually implemented. Tape reading was not tested enough.

DPM (GRIF): The balance across file systems is not right – some areas are full while others are under-used, so rebalancing is needed. The main uncertainty seems to be the xrootd plugin (not tested yet). Advanced monitoring tools would be useful.
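As a rough illustration of how a site might spot that kind of imbalance, the sketch below parses the output of dpm-qryconf (the standard DPM admin query tool) and flags file systems that are nearly full. The output format matched here is an assumption and may need adjusting for a given DPM release.

    #!/usr/bin/env python
    # Sketch only: flag unevenly filled DPM file systems by parsing `dpm-qryconf`.
    # The line layout assumed below (text containing "CAPACITY ... FREE ...") is an
    # approximation -- check it against your DPM version before relying on it.
    import re
    import subprocess

    # Run the standard DPM admin query tool (on, or pointed at, the DPM head node).
    output = subprocess.Popen(["dpm-qryconf"],
                              stdout=subprocess.PIPE).communicate()[0].decode("utf-8", "replace")

    SIZE_UNITS = {"": 1, "K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12, "P": 1e15}

    def to_bytes(value, unit):
        """Convert e.g. ('2.50', 'T') into bytes."""
        return float(value) * SIZE_UNITS[unit]

    pattern = re.compile(r"CAPACITY\s+([\d.]+)([KMGTP]?)\s+FREE\s+([\d.]+)([KMGTP]?)")

    for line in output.splitlines():
        match = pattern.search(line)
        if not match:
            continue
        capacity = to_bytes(match.group(1), match.group(2))
        free = to_bytes(match.group(3), match.group(4))
        used_fraction = 1.0 - free / capacity if capacity else 0.0
        # Warn on anything more than 90% full so it can be drained or rebalanced.
        flag = "  <-- nearly full" if used_fraction > 0.9 else ""
        print("%-60s %5.1f%% used%s" % (line.strip()[:60], 100 * used_fraction, flag))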

Database (MG) – ATLAS 3D milestones met. Burst loads are seen when data reprocessing jobs start on an empty cluster.

Middleware – no special treatment in this area for CCRC. Numerous fixes were published – CE; dCache; DPM; FTS/M; VDT. The middleware baseline was defined early on. The main question is “are communications clear?” The EMT deals with short-term plans and the TMB with medium-term plans. Aside: are people aware of the Application Area Repository, which gives early access to new clients?

Monitoring:
- no easy way to show external people the status of an experiment. GridView is used for common transfer plots. Sites are still a bit disoriented in understanding their own performance.
- ATLAS – dashboard for DDM. Problems are generally raised in GGUS and the elog. It is still difficult to interpret errors and correlate information to diagnose whether something is a site or an experiment problem. Showing downtime on the dashboard would be useful, together with a heartbeat signal (to check sites in periods of low activity).
- CMS used PhEDEx and GridView. The SiteStatusBoard was most used. Commissioning of sites worked well. Better diagnosis of the failure reason from the applications is required.
- LHCb – data transfer and monitoring via DIRAC. Issue with UK certificates picked up in talk.
- GridView is limited to T1-T1 traffic. FTM has to be installed at all sites for this monitoring, but it was not a clear requirement.
- The core monitoring issue now relates to multiple information sources – which is the top-level view? The main site view should be via local fabric monitoring – this is progressing.
- Better views of the success of experiment workflows are needed. The situation is getting better for critical services.


Support

ALICE positive about ALARM feedback.

CMS: Use HyperNews and mail to reach sites. Sites need to be more pedantic in notifying ALL interventions on critical services (regardless of criticality).
LHCb: Speed of resolution still too dependent on local contacts. Problem still often seen first by VO.
ATLAS – use a ticket hierarchy. Shifters follow up with GGUS and in daily meetings. They note it is hard to follow site broadcast messages.

ROCs: Few responses before the meeting. There was a question about where (EGEE) ROCs sit in the WLCG process.

GGUS: Working on alarm mechanisms for LHC VOs – these feed into T1 alarm mail lists. Team tickets are coming – editable by the whole shift team. Tickets will soon have case types: incident, change request, documentation. An MoU area (to flag which agreements are at risk from a problem) and support for the LHCOPN are also being implemented.

Tier-2s: Initial tuning took time – inclusion of new sites, setting up FTS channels, etc. (see the channel-inspection sketch after this section).
Most T2 peak rates exceeded the nominal requirement. T2 traffic tended to be bursty. One issue is how to interpret no, or only short-lived, jobs at a site. Much time is spent establishing that everything is normal! Mailing lists contain too much information – the exchanges are often site specific. Direct VO contacts were found to be helpful.
What should T2s check on the VO monitoring pages? What are the VO activities?
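On the FTS channel side, a minimal sketch of the kind of check a T2 might script is shown below, assuming the gLite FTS command-line clients are available. The endpoint URL and the RALPP site tag are placeholders, and option/output details can differ between FTS releases, so treat it as illustrative rather than definitive.

    #!/usr/bin/env python
    # Sketch: list the channels configured on an FTS server and dump the settings
    # of those involving one site, using the gLite FTS CLI. Endpoint and site tag
    # below are placeholders; client options/output may vary between FTS releases.
    import subprocess

    # Placeholder FTS web-service endpoint -- substitute the real one for your region.
    FTS_ENDPOINT = "https://fts.example.org:8443/glite-data-transfer-fts/services/FileTransfer"
    SITE_TAG = "RALPP"   # hypothetical site label appearing in channel names

    def run(cmd):
        """Run a command and return its stdout as text."""
        return subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate()[0].decode("utf-8", "replace")

    # With no channel argument the client lists the channel names it knows about.
    channels = run(["glite-transfer-channel-list", "-s", FTS_ENDPOINT]).split()
    print("Channels defined: %d" % len(channels))

    # Asking for a specific channel prints its settings (state, streams, limits, ...).
    for channel in channels:
        if SITE_TAG in channel:
            print(run(["glite-transfer-channel-list", "-s", FTS_ENDPOINT, channel]))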

Tier-1s: BNL (ATLAS): Went well. A problem was found with the backup link but resolved quickly.
FNAL (CMS): Jobs were seen to be faster at CERN (this was I/O not CPU bound – SRM stress from authn/authz). The site was fully utilised. File families were used for tape. They request instructions on who to inform about irregularities. There is no clear recommended PhEDEx FTS download agent configuration.

PIC: Site updates were scheduled for after CCRC. The number of queued jobs sometimes led to excessive load on the CE; they wish to set a limit with GlueCEPolicyMaxWaitingJobs (see the sketch below). Directory creation for file families needs to be automated. Import rates were ok. There was an issue with some exports – FTS settings? Also a concern with a conditions DB file getting “hot”: ~400 jobs tried to access the same file and the pool hung as the 1 Gb internal link saturated. CMS skim jobs filled the farm and this led to the WN-SE network switches (designed for 1-2 MB/s per job) becoming saturated and a bottleneck – but the farm stayed running. The CMS read-ahead buffer is too large.
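On the GlueCEPolicyMaxWaitingJobs point, the published limit can at least be compared with the published waiting-job count in the information system. The sketch below assumes a reachable top-level BDII and the standard ldapsearch client; the BDII host and the CE name pattern are placeholders.

    #!/usr/bin/env python
    # Sketch: compare the waiting-job count a CE publishes with its advertised
    # GlueCEPolicyMaxWaitingJobs limit, by querying a top-level BDII with ldapsearch.
    # The BDII host and CE pattern below are placeholders.
    import subprocess

    BDII = "ldap://lcg-bdii.cern.ch:2170"     # assumed top-level BDII; use your own
    CE_PATTERN = "ce*.pic.es*"                # hypothetical CE match for the site

    query = [
        "ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid",
        "(GlueCEUniqueID=%s)" % CE_PATTERN,
        "GlueCEUniqueID", "GlueCEStateWaitingJobs", "GlueCEPolicyMaxWaitingJobs",
    ]
    output = subprocess.Popen(query, stdout=subprocess.PIPE).communicate()[0].decode("utf-8", "replace")

    def to_int(text):
        """Best-effort integer conversion (GLUE values are sometimes unset/odd)."""
        try:
            return int(text)
        except ValueError:
            return 0

    # LDIF entries come back separated by blank lines; collect attributes per CE.
    for entry in output.split("\n\n"):
        values = dict(line.split(": ", 1) for line in entry.splitlines() if ": " in line)
        if "GlueCEUniqueID" not in values:
            continue
        waiting = to_int(values.get("GlueCEStateWaitingJobs", "0"))
        limit = to_int(values.get("GlueCEPolicyMaxWaitingJobs", "0"))
        over = " <-- over the advertised limit" if limit and waiting > limit else ""
        print("%s waiting=%d max=%d%s" % (values["GlueCEUniqueID"], waiting, limit, over))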

IN2P3: The main issue was LHCb problems accessing dCache-managed data via gsidcap. The regional LFC for ATLAS failed.

SARA: Quality control of SRM releases is a problem.

RAL: The CE was overloaded after new WNs were installed. Poor CMS efficiency was seen – production work must pre-stage. Normal users were capped to give priority to production work. There was a problem with simultaneous deletes in CASTOR. CASTOR gridftp restarts led to some ATLAS transfer failures. Target data rates are no longer clear – a document with agreed targets is needed. Monitoring is needed which shows a site its contribution per experiment versus expectation.

T0: The power cut was “good” for debugging the alert procedure (how to check, etc.) and for finding a GC problem. SRM reliability remains an issue. More service decoupling is planned. The SRM DBs will move to Oracle. More tape load was expected in CCRC. Tuning of bulk transfers is still needed. Many repeat tape mounts were seen per VO. Power cut planning – run a test every 3 months. Changed to publish cores as physical CPUs in order to get more work. A pilot SLC4 gLite 3.1 WMS was used by CMS.

ATLAS: Fake load generator at T0. Double registration problem (DDM-LFC). NDGF is limited to 1 buffer in front of 1 tape drive (this limits rates). A T1-T1 issue from RAL was seen to INFN and IN2P3 (now resolved). Slight issue with FZK: “slightly low rate – not aggressive in FTS setting”. Global tuning of FTS parameters is needed. They want 0 internal retries in FTS at T1s (action). They would like logfiles exposed. The UK CA rpm not being updated on disk servers caused a problem for RAL. 1 transfer error on the full test. T1-T1 was impacted by the power cut. T1->T2s were ok and higher rates are possible. CASTOR problems seen: too many threads busy; nameserver overload; SRM failing to contact the Oracle backend. For dCache, PNFS overload was seen, plus a problematic upgrade. StoRM needs to focus on the interaction between gridftp and GPFS.

Sites need to instrument some monitoring of space token existence (action – see the sketch below). CNAF used 2 FTS servers (1 for tape and 1 for disk) – painful for ATLAS. Most problems in CCRC concerned storage. The power cuts highlighted that some procedures are missing in ATLAS.
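For the space-token action, one low-cost check is to look in the information system for the tokens an experiment expects at an SE. The sketch below assumes the common GLUE 1.3 convention of publishing the space token description as GlueVOInfoTag and that the GlueVOInfo entries carry a GlueChunkKey referencing the SE; the BDII host, SE name and token list are examples only.

    #!/usr/bin/env python
    # Sketch: check that the space tokens an experiment expects at an SE are still
    # published. Assumes space token descriptions appear as GlueVOInfoTag and that
    # GlueVOInfo entries are linked to the SE via GlueChunkKey (GLUE 1.3 convention);
    # hostnames and token names are examples only.
    import subprocess

    BDII = "ldap://lcg-bdii.cern.ch:2170"        # assumed top-level BDII
    SE_HOST = "srm-atlas.gridpp.rl.ac.uk"        # example SE to check
    EXPECTED_TOKENS = ["ATLASDATADISK", "ATLASMCDISK", "ATLASDATATAPE"]  # example list

    query = [
        "ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid",
        "(&(objectClass=GlueVOInfo)(GlueChunkKey=GlueSEUniqueID=%s))" % SE_HOST,
        "GlueVOInfoTag",
    ]
    output = subprocess.Popen(query, stdout=subprocess.PIPE).communicate()[0].decode("utf-8", "replace")

    # Gather every space token description the SE currently publishes.
    published = set(
        line.split(": ", 1)[1].strip()
        for line in output.splitlines()
        if line.startswith("GlueVOInfoTag:")
    )

    for token in EXPECTED_TOKENS:
        status = "OK" if token in published else "MISSING"
        print("%-16s %s" % (token, status))

A cron job that mails any MISSING lines to the site list would cover the action at minimal cost.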

LHCb: No checksum problems were seen in T0->T1 transfers. The issue with UK certificates hit RAL transfers. Submitted jobs done/created for RAL = 74% (average 67%). For reconstruction a large number of jobs failed to upload the rDST to the local SE (file registration procedures in the LFC). RAL success/created = 68%. At RAL the application was crashing with a segmentation fault in RFIO. [Is this now understood?] The fallback was to copy input data to the WN via gridFTP. Stripping highlighted major issues with LHCb bookkeeping. Data access issues were mainly for dCache.

CMS: Many issues with the t1transfer pool in real life. CAF – no bottlenecks. T0->T1 ok. Like others, impacted by the UK CA certificate problem. T1->T1: RAL ok. [Note: the graph seems more spread than for other T1s]. T1->T2 aggregates for its own region were similar to or lower than others (the number of user jobs was limited). For the UK, no single suitable dataset was found so 22 small datasets were used. [Slide 66: RHUL – no direct CMS support, some timeout errors. RALPP – lacked space for phase 2. QMUL – not tested. Bristol – host cert expired in round two so no additional transfers. Estonia – rate was good!] General: LAN overload at some dCache sites (~30 MB/s per job) – this is thought to relate to read_ahead being on by default, an issue for erratic reads. Prod+Anal at T2s: UK phase 1 was ok but the number of jobs was small. Phase 2 (chaotic job submission) and Phase 3 (local-scope DBS) results are not clear and work continues.

ALICE: Unable to load talk

Critical services: ATLAS: Uses the service map http://servicemap.cern.ch/ccrc08/servicemap.html. The ATLAS dashboard is the main information source. They were hit by kernel upgrades at CERN. Quattorized installation etc. works well.

ALICE: Highest rank services are VO-boxes and CASTOR-xrootd. Use http://alimonitor.cern.ch

CMS: Has a critical services Task Force. CMS critical services (with ranking): https://twiki.cern.ch/twiki/bin/view/CMS/SWIntCMSServices. The collaboration differentiates between services (to run) and tasks (to deliver).

LHCb: List is here: https://twiki.cern.ch/twiki/pub/LHCb/CCRC08/LHCb_critical_Services2.pdf


Proposed next workshop – 13-14 Nov. Other meetings to note: EGEE’08 and the pre-CHEP workshop.