Tuesday, 13 February 2007

camont not yet enabled on CEs

Thanks to Mona, the Imperial RBs are now set up to accept camont jobs: both the production RB (gfe01) and the testzone RB (gm02). Unfortunately, no CEs seem able to take the jobs yet:

$ edg-job-list-match hostname.jdl

Selected Virtual Organisation name (from JDL): camont.gridpp.ac.uk
Connecting to host gfe01.hep.ph.ic.ac.uk, port 7772

===================== edg-job-list-match failure ======================
No Computing Element matching your job requirements has been found!
======================================================================

Hopefully some good news soon!

Querying SAM

You can query the SAM results by installing lcg-sam-client from the CERN repository on a UI. Once it is installed and you have a valid proxy, you can run commands like this:

$ /opt/lcg/same/client/bin/same-query nodename voname servicestatus sitename=ScotGrid-Edinburgh serviceabbr=SE

You will also have to set SAM_SERVER_HOST to lcg-sam.cern.ch. There is a config file in /opt/lcg/same/client/etc as well as some basic documentation in /opt/lcg/same/client/docs.
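For scripted checks, the query above can be wrapped in a few lines of Python. This is only a sketch: it assumes SAM_SERVER_HOST is read from the environment (it may instead belong in the config file under /opt/lcg/same/client/etc), and the helper names build_query and run_query are made up for illustration. The same-query path and arguments are the ones from the command above.

```python
import os
import subprocess

SAME_QUERY = "/opt/lcg/same/client/bin/same-query"  # path from the command above


def build_query(sitename, serviceabbr,
                fields=("nodename", "voname", "servicestatus")):
    """Return the same-query argument list for one site and service type."""
    return [SAME_QUERY, *fields,
            f"sitename={sitename}", f"serviceabbr={serviceabbr}"]


def run_query(sitename, serviceabbr, server="lcg-sam.cern.ch"):
    """Run same-query and return its raw output (needs a valid proxy)."""
    # Assumption: the client picks up SAM_SERVER_HOST from the environment.
    env = dict(os.environ, SAM_SERVER_HOST=server)
    result = subprocess.run(build_query(sitename, serviceabbr), env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout


# Example (only works on a UI with lcg-sam-client installed and a proxy):
# print(run_query("ScotGrid-Edinburgh", "SE"))
```

Keeping the argument list in build_query makes it easy to loop over several sites or service types without retyping the command.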

From what I've seen so far, it doesn't look like it's possible to drill down and get details of the subtests; you only get the overall result.

ATLAS Tests Status

I still haven't managed to make replicas at Durham and Sheffield. The one at Liverpool was deleted by persons unknown but is now back.

There were serious problems with the IC RB on Thursday/Friday last week which caused the whole system to collapse. Now using RAL RB again.

Overall success is currently 74%. All sites are working pretty well except Bristol, Edinburgh, UCL Central and QMUL. QMUL had been broken since last week because /opt/edg/var/info/atlas/atlas.list got overwritten somehow. It worked again yesterday but is now broken again because lcg-info doesn't report anything (although ldap does).

Apart from that there are occasional failures or aborts of single jobs which need to be investigated sometime.

Monday, 12 February 2007

Replica disappeared

My replica at Liverpool has disappeared for no apparent reason. I tried to recreate it but get obscure errors:

java.rmi.RemoteException: SRM Authorization failed;

RB Problems

Since last Thursday (8/2) all my jobs were reported as "running - unavailable" and by the weekend my whole system was screwed up. I cleaned everything up and switched from the IC RB to the (2nd) RAL one. Everything seems to be OK again now. In case it is related: when I cancel a job now, I mark it as cancelled in my own log straight away rather than waiting for its status to come back as "Cancelled" the next time I query it. Previously, if the status didn't come back as "Cancelled" I tried to cancel the job again (and again, indefinitely), which may have caused problems since cancelling seems pretty broken.
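The cancel-once bookkeeping described above might look roughly like this in Python. It is a sketch, not the actual script: JobLog and the cancel_fn callback are hypothetical names, and cancel_fn stands in for whatever really issues the cancel (for example a wrapper around edg-job-cancel).

```python
class JobLog:
    """Local record of job states, updated without waiting for the RB."""

    def __init__(self, cancel_fn):
        self.cancel_fn = cancel_fn   # hypothetical callback that issues the cancel
        self.state = {}              # job id -> locally recorded state

    def cancel(self, job_id):
        """Request cancellation once and record it locally straight away."""
        if self.state.get(job_id) == "cancelled":
            return False             # already requested; do not retry
        try:
            self.cancel_fn(job_id)
        except Exception as err:
            # Cancelling is unreliable, so a failure is noted but not retried.
            print(f"cancel of {job_id} failed: {err}")
        self.state[job_id] = "cancelled"
        return True

    def is_cancelled(self, job_id):
        """True if the job is cancelled as far as the local log is concerned."""
        return self.state.get(job_id) == "cancelled"
```

The point of the design is that each job gets at most one cancel request, so a status that never comes back as "Cancelled" can no longer trigger an endless stream of retries against the RB.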