Saturday 22 December 2007

Running Reliable Grid Services?

Well, since it was Friday night, just before Xmas, I thought I'd do what every sad geek would do and fire off a batch of grid jobs to last over the shutdown.

However, the voms servers at CERN had other plans:

Contacting lcg-voms.cern.ch:15004 [/DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch] "dteam" Failed
Error: dteam: User unknown to this VO.
Trying next server for dteam.
Creating temporary proxy ...................................... Done
Contacting voms.cern.ch:15004 [/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch] "dteam" Failed
Error: Could not connect to socket. Connection refused
None of the contacted servers for dteam were capable of returning a valid AC for the user.
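
For the record, the request itself was nothing exotic - just the usual incantation, along these lines (the lifetime is only illustrative):

# The usual dteam proxy request - nothing fancy, it just needs the VOMS
# servers at CERN to be answering. The lifetime shown is illustrative.
voms-proxy-init -voms dteam -valid 96:00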


And I'm not the only one - this morning I saw GGUS tickets come in from ATLAS, CMS and LHCb with the same problem. Ho hum. Merry festive season to you too.

Wednesday 28 November 2007

-x

Been debugging apparently slow transfers using FTS (now a 2.0 application....) and discovered a slight change in the UK setup.

Previously I had just set the total number of files per channel, but I noticed that even with that set high, my dteam transfers were still only popping one file at a time off the list.

Turns out that the VO shares now have hard-coded limits, which are only visible with the -x flag, e.g.:
glite-transfer-channel-list -x STAR-UKISCOTGRIDECDF
Channel: STAR-UKISCOTGRIDECDF
Between: * and UKI-SCOTGRID-ECDF
State: Active
Contact: lcg-support@gridpp.rl.ac.uk
Bandwidth: 0
Nominal throughput: 0
Number of files: 9, streams: 1
TCP buffer size: default
Message: no reason was provided
Last modification by: /C=UK/O=eScience/OU=CLRC/L=RAL/CN=lcgfts.gridpp.rl.ac.uk/Email=tier1a-certificate@gridpp.rl.ac.uk
Last modification time: 2007-11-28 15:10:39
Number of VO shares: 6
VO 'alice' share is: 1 and is limited to 1 transfers
VO 'lhcb' share is: 1 and is limited to 1 transfers
VO 'dteam' share is: 5 and is limited to 5 transfers
VO 'atlas' share is: 1 and is limited to 1 transfers
VO 'cms' share is: 0 and is not limited
VO 'ops' share is: 1 and is limited to 1 transfers
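
A quick way to spot these limits across the board is to loop over the channels and grep out the share lines - a rough sketch, assuming a bare glite-transfer-channel-list returns one channel name per line:

# Dump the per-VO shares and limits for every channel the FTS knows about.
# Assumption: a bare glite-transfer-channel-list prints one channel name per line.
for chan in $(glite-transfer-channel-list); do
    echo "== $chan =="
    glite-transfer-channel-list -x "$chan" | grep "^VO '"
done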

Wednesday 7 November 2007

Virtualisation hassles

Got a shiny new Dell Optiplex 740 (AMD64 X2 goodness) that I wanted to run request-tracker on in a VM. Installed Ubuntu Gutsy fine, installed Xen fine, but it just wouldn't boot any domUs to completion. I even tried out KVM, seeing as it's new hardware with the 'svm' flag in /proc/cpuinfo. That failed miserably at the modprobe kvm_amd stage with "kvm: disabled by bios" - most odd, as there wasn't a BIOS option for virtualisation. However, after much hassle finding a bootable floppy, I upgraded the BIOS from 1.1.3 to 1.2.2 and lo - Xen 'Just Works'.

It's actually scarily easy to do - "xen-create-image --hostname whatever" and you're off :-)
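
In practice I feed it a few more options than that; roughly the following (hostname, IP and sizes are placeholders, and the exact flag spellings depend on your xen-tools version):

# Sketch of a xen-create-image invocation for a domU like the request-tracker one.
# All values are placeholders; check xen-create-image --help for your release.
xen-create-image --hostname=rt-vm --ip=192.168.1.50 \
    --dist=gutsy --memory=256Mb --size=8Gb --swap=512Mb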

More postings as I get to grips with this. Initial impressions are good.

Monday 22 October 2007

cfengine gotcha

Not strictly gridpp, but posting this here may save an administrator some sanity...

I've just spent the day battling against Ubuntu on a lab cluster. The server had been successfully upgraded from Feisty (7.04) to Gutsy (7.10), and cfengine refused to work. At all. It was only a minor cfengine upgrade too - 2.1.20 to 2.1.22.

cfagent -qv -d2 hung, and even with debugging on cfservd there was nothing obvious. Finally got a tip on IRC (freenode.net #cfengine) that it's likely a Berkeley DB upgrade that's the problem. Blew away all the .db files from state/ (and the __something.db that I found lying around in the cfengine dir too). Ta-da! Much stress reduced.
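
For anyone hitting the same thing, the fix boils down to something like this (paths and init script name are the Debian/Ubuntu ones - adjust for your build):

# Stop cfengine, clear out the Berkeley DB state files, restart.
# /var/lib/cfengine2 is the Debian/Ubuntu workdir; other builds use /var/cfengine.
/etc/init.d/cfengine2 stop
find /var/lib/cfengine2 -name '*.db' -delete
/etc/init.d/cfengine2 start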

Tuesday 9 October 2007

Feed Me!

At today's dteam meeting there was a discussion about informing the PMB of any significant changes to the infrastructure. As the SOP is to use the EGEE Broadcast tool on the CIC portal, it'd be nice if there were an RSS feed that we could parse, pulling the UKI-significant info into, say, the planet feed. Anyone care to add comments?

Sunday 30 September 2007

RAL Network performance

Prior to running any transfer tests, I normally check the Gridmon plots for that site, the relevant T2 and the RAL T1. Using the (non-default) options of "Metric data on same graph" and "New graph on new test dest" (and "new test src"), you generally get a good feel for the site's capacity and any 'slow' links associated with it. Sadly, the Tier-1 has recently been performing dreadfully WRT the other sites.
Example: transfers to Glasgow from Durham, Edinburgh and RAL:

Wednesday 11 July 2007

Q3 Transfer Tests

OK, new quarter and time to get down to some testing as I'm well overdue. First up - Lancaster. Apart from some user error at this end (typo in script) it went pretty smoothly and can cope with 25 files in flight happily:

Ganglia plot from the Glasgow end, showing the 10, 15, 20 and 25 file settings in the transfer channel:

I've also started the Oxford tests, but it seems much less happy - when transferring from RAL-T2, even at low numbers of files (5), the CPU load on t2se01 seems awfy, awfy high.


Hmm. Have mailed Pete, but something doesn't look happy...

Wednesday 4 July 2007

Corporate Makeover

Thanks to Neasan O'Neill, the Planet GridPP page now has a shiny new stylesheet to match the main GridPP site.

Wednesday 6 June 2007

planet.gridpp.ac.uk

I'd been trialling Planet, the RSS aggregator, for a few months now, and I'm fairly pleased with it. So much so that it's now official :-)

http://planet.gridpp.ac.uk should now show you, in a single place, all the various Tier-2, operations and storage blogs. If there are any others that people want adding, drop me a note.

Friday 1 June 2007

srm-advisory-delete

OK, I'll admit I occasionally rant about some things I consider "bad", but I'm afraid deleting files from grid storage is just Painful.

To delete the remaining 490 files (I nuked the first ten within the inner $i loop as a test of xargs), I ended up with this nasty, nasty hack:
for j in `seq -w 2 49 ` ; do for i in 0 1 2 3 4 5 6 7 8 9 ; do echo -n "srm://$DPM_HOST:8443/dpm/gla.scotgrid.ac.uk/home/dteam/rt2-gla-5-1/tfr000-file00$j$i " ; done ; done | xargs srm-advisory-delete

I mean, even altering it to a single `seq -w 20 499` loop is, well, just Bad and Wrong.
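
For completeness, the single-loop version (same path as above, SURLs batched through xargs so the command line stays a sane length) would be:

# Still a hack, just a less nested one - delete files 020 to 499 in batches of 50.
for n in $(seq -w 20 499); do
    echo "srm://$DPM_HOST:8443/dpm/gla.scotgrid.ac.uk/home/dteam/rt2-gla-5-1/tfr000-file00$n"
done | xargs -n 50 srm-advisory-delete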

/rant

Mind you, it worked, as demonstrated by Paul's MonAMI plot of dteam storage use:

Wednesday 30 May 2007

iperf between Gla-Bristol

In preparation for some serious testing of Bristol's StoRM implementation, Jon and I worked together to run some iperf tests so we know what our target is for FTS tests. Plot below - looks like we've got a 500Mb/s maximum.
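
The runs themselves are just stock iperf; something along these lines (hostname made up, duration and stream count illustrative):

# Bristol end (server):
iperf -s -p 5001

# Glasgow end (client): 60-second run, 4 parallel streams, reporting every 10s.
iperf -c storm.example.bris.ac.uk -p 5001 -t 60 -P 4 -i 10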

Wednesday 16 May 2007

Wouldn't it be cool if....

Someone could hack something like flickrvision into the realtime monitor. hmmmm.

Monday 14 May 2007

Transfer Channel weirdness

As I still haven't completed this quarter's round of transfer tests, I decided to set up all the scripts and prereqs for doing each T2. Ran my little Perl script to get the current (pre-tuned) values for the transfer channels and noticed that they were all set to 5. Nope, my script hadn't broken (no more than originally designed, anyway) - it looks like someone changed 'em. Why?

Site           From RAL  Star    To RAL
UKI-NORTHGRID-SHEF-HEP 5 5 5
UKI-LT2-IC-LESC 5 5 5
UKI-SOUTHGRID-BHAM-HEP 5 5 5
UKI-SOUTHGRID-OX-HEP 5 5 5
UKI-LT2-IC-HEP 5 5 5
UKI-SOUTHGRID-CAM-HEP 5 5 5
UKI-LT2-UCL-HEP 5 5 5
SCOTGRID-EDINBURGH 5 5 5
UKI-LT2-UCL-CENTRAL 5 5 5
UKI-NORTHGRID-MAN-HEP 5 5 5
UKI-SCOTGRID-GLASGOW 5 5 5
UKI-NORTHGRID-LANCS-HEP 5 5 5
UKI-SOUTHGRID-BRIS-HEP 5 5 5
UKI-LT2-BRUNEL 5 5 5
UKI-SCOTGRID-DURHAM 5 5 5
UKI-SOUTHGRID-RALPP 5 5 5
UKI-LT2-RHUL 5 5 5
UKI-LT2-QMUL 5 5 5
UKI-NORTHGRID-LIV-HEP 5 5 5
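
For reference, the script does nothing clever - in shell it would be roughly the following (only two sites shown, and the channel-name mangling, i.e. dropping the hyphens from the site name, is an assumption based on the STAR-UKISCOTGRIDECDF example above):

# Print the "Number of files" setting for the From-RAL, Star and To-RAL
# channels of each site. The channel naming convention is assumed, not checked.
for site in UKI-SCOTGRID-GLASGOW UKI-NORTHGRID-LANCS-HEP; do
    short=$(echo "$site" | tr -d '-')
    for chan in "RAL-$short" "STAR-$short" "$short-RAL"; do
        printf '%-28s %-26s ' "$site" "$chan"
        glite-transfer-channel-list "$chan" | grep 'Number of files'
    done
done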

Monday 30 April 2007

You're only supposed to blow the bloody doors off!

Apart from the Spring HEPiX 2007 meeting last week (which was excellent, BTW - I'll do a proper writeup soon), I have been pushing on with the transfer tests. Rather than risk stressing the RAL CASTOR service too much while we work on improving the individual T2 sites, Chris Brew (RAL T2) was very helpful and we've been using their dCache pool (lots of beefy disk servers and some spare capacity) to push out to the T2 sites.

I have now knocked up a simple "template"-based script for performing the tests, so that I can run several sites serially and automatically. This means I can set up a screen session to do all sites within a T2, then come back later and look at the logs. What we've discovered is that we can set the glite-transfer-channel settings much more aggressively than we have been.
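
Cranking a channel up is then a one-liner per channel, something like the following (the -f files flag and the channel name here are assumptions - check glite-transfer-channel-set --help before copying this):

# Bump the files-in-flight setting on a channel before a test run.
# Both the -f flag spelling and the channel name are assumptions - verify first.
glite-transfer-channel-set -f 25 RAL-UKISCOTGRIDGLASGOW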

Some Examples
* GLA to RALT2 1TB transferred in 3:09:29, Bandwidth = 703.63 Mb/s
* RALT2 to GLA 618G transferred in 2:11:48, Bandwidth = 625.13 Mb/s

As CMS were also doing tests from the T1, we managed to saturate the 1Gb/s connection at Imperial College. Whoops. However, we also set a new record for HEP throughput at that site (the previous record was ~800Mb/s, overnight).

Wednesday 28 March 2007

IGTF release 1.13

As you may have seen, there is a new IGTF release. In case they are still hiding the changelog from you, here is my annotated version:

* Added BG.ACAD CA accredited under the classic profile (BG)

This is Bulgaria. They are new - welcome on board.

* Added SWITCHaai SLCS and (classic) Root CA (CH)
NOTE: the SWITCHaai SLCS CA is included in the ca_policy_igtf-slcs bundle

Ah. This one is a biggie. As you may know, the Swiss are into Shibboleth, like we are (or will be) in the UK, except they are further ahead with deployment. This CA is one that enables users within the Swiss Shibboleth federation to generate certificates via a CA which is effectively a Shibboleth SP (Service Provider).

The SLCS (short lived credential service) flavour CAs are _very_ different from the CAs that we know and love. Basically the identity is being managed by institutional databases outside the CA. FNAL, which is trusted by LCG but not yet accredited, is one such example. The Switch one is different again because identities are being managed by several institutions, namely everyone in the Swiss Federation, and thus also everyone who will later join that federation. Moreover, the identity management at the institutions cannot in general be audited because it is private to themselves. But in the Swiss case DNs will have the institution's name embedded in them (except for one generic catch-all institution, the Virtual Home Organisation, which is operated by SWITCH (Swiss NREN) itself), so you can in principle accept certain institutions via the signing policy file. Nevertheless, although the internal identity management may be solid gold, for these reasons the assurance of the SLCS-profile CA is considered (slightly/somewhat) lower than the usual "classic" profile.

This is also why SLCS-profile CAs are in a separate bundle.

Here is a list of the participating institutions:
http://www.switch.ch/aai/participants/homeorgs.html

Incidentally, I was one of the reviewers of this CA. I know the operators personally and have no doubt they will do a good job, but of course I do not know the identity managers in the institutions.

Note that the federation is otherwise mostly used for access to e-learning resources which in some sense are "cheaper" than Grid resources, but it also gives access to a Microsoft software download page and an "Internet Remote Emulation Experiment Platform" which could be considered as "expensive" as Grid resources. So the id management bar is probably high enough for our purposes.

* Extended lifetime of CyGrid CA to 2013 based on same key pair (CY)

Yep. Will cause usual headaches with Firefox and other NSS-based browsers, but this will have a low impact. The old certificate doesn't expire till Feb 2008 so there is no urgency.

However, they have now set the keyUsage extension (correctly) which all CA certificates ought to have, and they have made the basicConstraints extension critical (which it should be). The old certificate was not standards (RFC and IGTF) compliant (but would have worked anyway).
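
If you want to check that for yourself on any CA certificate (the filename here is hypothetical), it's a one-liner:

# Show the keyUsage and basicConstraints extensions, and whether they are critical.
openssl x509 -in cygrid-ca.pem -noout -text | grep -A1 -E 'Key Usage|Basic Constraints'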

* Updated ArmeSFO CA root certificate following TACAR (AM)

This change looks minimal. What happened is that they updated the OID of their CP/CPS, and the new CA certificate contains this updated CP/CPS OID (note: this is almost always a bad idea).

More importantly for the rest of you, the CA's revocation URL has been updated to point to a mirror of the CRL, which hopefully will have higher availability than the default location. So the update to the .info file will give you the Armenian CRL with higher reliability. It matters less for the CA certificate itself, but there may be some software - Internet Explorer springs to mind - that checks the reference, so this will mean a shorter annoying wait before the software accepts stuff signed by it.

* Discontinued old (pre-2004) LIP CA (PT)

This is OK - the CA expired on 21 March, so there are no consequences of not removing it, except that it will no longer issue CRLs and so will eventually set off an alert. Best to clean it out.

* Extended lifetime of NorduGrid CA for 2 years (DK)

As above - NSS problems, but everything else OK. The old certificate expires on 12 May, so you need to upgrade before then.

* Added TERENA SCS CA hierarchy to the "worthless" area. Please note that the SCS CA has not been accredited yet (EU)

The purpose of this CA is to provide a pan-European CA which can issue host certificates. The main target is browser-facing hosts - the CA can then be distributed with the browser keystores, so normal users will not see warning popups. The UK e-Science CA will continue to provide host certificates for normal hosts, but eventually (once this CA has been approved) the two will co-exist.

Cheers
--jens

Friday 23 March 2007

Transfer Slowdown

This week (apart from the GridPP 18 meeting) I have been trying to complete the transfer tests, but am getting terrible rates out of RAL. The CERN network stats to RAL don't look too excessive:


But the iperf stats out of RAL have plummeted - the RAL-Lancaster plot below (and they even have a private lightpath) is typical.

Tuesday 13 March 2007

Sponsored by "No space left on device"

Dear root@Birmingham,
I broke your system

Well, I blame the file-transfer script - it uses your (long-lifetime) MyProxy credential for the transfers and the shorter VOMS proxy for the deletes. Result: one filled-up disk when my VOMS proxy expired. Seems to have recovered OK.

Friday 9 March 2007

Prettification

With a hat-tip to xkcd I decided to knock up an ugly script to convert this:
aelwell@ppepc62:~$ ls tra*06T*.log
transfer-2007-03-06T09-52-38.log transfer-2007-03-06T14-38-59.log transfer-2007-03-06T19-06-22.log

consisting of blocks like
Transfer: srm://ralsrma.rl.ac.uk:8443//castor/ads.rl.ac.uk/prod/grid/hep/disk1tape1/dteam/j/jkf/castorTest/1GBcanned000 to srm://svr018.gla.scotgrid.ac.uk:8443/dpm/gla.scotgrid.ac.uk/home/dteam/aetest10/tfr000-file00000
Size: 1000000000.0
FTS Duration: 188
FTS Retries: 0
FTS State: Done
FTS Reason: (null)
Local_Created: 1173174769.27
Local_Submit: 1173174769.3
Submitted: 1173174770.69
Active: 1173174832.97
Done: 1173175041.23
Delete_time: 1173175050.67

into the much prettier



As normal, it's available on http://www.gridpp.ac.uk/wiki/User:Andrew_elwell/Scripts
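
The guts of it are just pulling a couple of fields out of each block and turning size and duration into a rate; a stripped-down sketch (not the actual script) would be:

# Print per-transfer state and rate in Mb/s from the log blocks above.
# Size is in bytes, FTS Duration in seconds.
awk '/^Size:/ {size=$2} /^FTS Duration:/ {dur=$3}
     /^FTS State:/ {if (dur > 0) printf "%s  %.1f Mb/s\n", $3, size*8/dur/1e6}' \
    transfer-2007-03-06T09-52-38.log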

Thursday 8 March 2007

voms proxy issues

Another day, another pile of transfer tests. Or not. Problem renewing my voms-proxy.

voms.cern.ch bombed out with:
Error: Could not establish authenticated connection with the server.
GSS Major Status: Unexpected Gatekeeper or Service Name
GSS Minor Status Error Chain:

an unknown error occurred


So I raised a GGUS ticket (19457).

Old Logs Available

There is a new link on my ATLAS tests page "All Logs" which allows one to look at old log and output files. These can be filtered by institute. It's a bit flaky because the webserver isn't really up to it. Logs before February 8 have been lost.

Monday 5 March 2007

Transfer Tests Milestones

Started to work on the Q1 transfer tests (hey, I know I only have a month to go) - Stub on the Wiki at http://www.gridpp.ac.uk/wiki/2007-Q1_Transfer_Tests

Next step this afternoon is to try and get hold of the experiment dress rehearsal timetable and ensure we're not clashing.

Thursday 1 March 2007

{{wikify}}

Following Jeremy's posting to the gridpp-dteam list, gently reminding us all that the wiki was in need of a tidy, I decided to take a look at some stats:

There are currently 92 Uncategorized pages and there are 52 Categories.

I guess I'll start by categorising the uncategorised pages, then creating stubs for those that are presently the most requested (ie, we've created a hanging link) - see List of Wanted Pages

Friday 23 February 2007

SE-posix test

Here is a summary of a set of grid jobs that I ran yesterday to test posix access to site SEs. The job will eventually become the SAM test for this type of storage access. In summary it lcg-cr's a small test file to the SE, then reads it back again using a GFAL client, checks for consistency and then deletes the file using lcg-del.
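
In lcg-utils terms the sequence is roughly as follows - a hand-rolled sketch that substitutes lcg-cp for the GFAL read the real test uses, with the SE name taken from the list below and paths purely illustrative:

# Copy-and-register a small test file, read it back, compare, clean up.
dd if=/dev/urandom of=/tmp/se-test.dat bs=1k count=10
GUID=$(lcg-cr --vo dteam -d svr018.gla.scotgrid.ac.uk file:/tmp/se-test.dat)
lcg-cp --vo dteam "$GUID" file:/tmp/se-test.back
cmp /tmp/se-test.dat /tmp/se-test.back && echo Passed || echo Failed
lcg-del -a --vo dteam "$GUID"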

srm.epcc.ed.ac.uk Passed
svr018.gla.scotgrid.ac.uk Passed

fal-pygrid-20.lancs.ac.uk Failed
lcgse1.shef.ac.uk Passed

epgse1.ph.bham.ac.uk Passed
serv02.hep.phy.cam.ac.uk Passed
t2se01.physics.ox.ac.uk Passed
lcgse01.phy.bris.ac.uk Passed

gfe02.hep.ph.ic.ac.uk Failed (both IC-HEP and LeSC)
se01.esc.qmul.ac.uk Passed
dgc-grid-34.brunel.ac.uk Passed
se1.pp.rhul.ac.uk Passed
gw-3.ccc.ucl.ac.uk Passed

For the sites that are not listed, the jobs that I submitted either still claim to be running or are scheduled (it turns out that both Durham and RAL-PPD are in downtime). I've tested RAL before and it was failing due to problems with CASTOR.

IC-HEP failed due to there being no route to host during the GFAL read. Possibly the relevant ports are not open in the firewall (22125 for dcap and 22128 for gsidcap)?

IC-LeSC failed with a Connection timed out error during the GFAL read.

Lancaster failed as the file could not even be lcg-cr'd to the SE. There was a no such file or directory error.

I'll run the tests next week once people have had a chance to look at some of these issues. It would be good to include this test in Steve Lloyd's existing framework.

Monday 19 February 2007

ATLAS Tests Status

Replicas: Finally managed to make a replica at Durham (strange bug discovered). Only Sheffield now outstanding (not the same bug). My analysis jobs now run at Durham (most of the time).

There was a screw up on Friday when my proxy expired and this stopped everything till I sorted it out in the evening. I think I've deleted all the jobs from the record so they don't enter the statistics.

Current Situation (Monday afternoon): Overall 58% success. LeSC, QMUL, UCL CCC, Bham and Bristol seem to be broken. Lancs and RHUL seem to be full (my jobs queue till I kill them).

Tuesday 13 February 2007

camont not yet enabled on CEs

Thanks to Mona, the Imperial RBs are now set up to accept camont jobs (both the production RB (gfe01) and the testzone RB (gm02)). Unfortunately no CEs yet seem able to take the jobs:

"edg-job-list-match hostname.jdl

Selected Virtual Organisation name (from JDL): camont.gridpp.ac.uk Connecting to host gfe01.hep.ph.ic.ac.uk, port 7772

===================== edg-job-list-match failure ====================== No Computing Element matching your job requirements has been found! ======================================================================
"

Hopefully some good news soon!

Querying SAM

You can query the SAM results by installing lcg-sam-client from the CERN repository on a UI. Once installed and you have a proxy, it is possible to run commands like this:

$ /opt/lcg/same/client/bin/same-query nodename voname servicestatus sitename=ScotGrid-Edinburgh serviceabbr=SE

You will also have to set SAM_SERVER_HOST to lcg-sam.cern.ch. There is a config file in /opt/lcg/same/client/etc, as well as some basic documentation in /opt/lcg/same/client/docs.

From what I've seen so far it doesn't look like it's possible to drill down and get details of the subtests; you only get the overall result.
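
Putting that together into something loopable (assuming SAM_SERVER_HOST is picked up from the environment; the site list is just for illustration):

# Query the overall SE service status for a couple of sites.
export SAM_SERVER_HOST=lcg-sam.cern.ch
for site in ScotGrid-Edinburgh UKI-SCOTGRID-GLASGOW; do
    /opt/lcg/same/client/bin/same-query nodename voname servicestatus \
        sitename=$site serviceabbr=SE
done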

ATLAS Tests Status

I still haven't managed to make replicas at Durham and Sheffield. The one at Liverpool was deleted by persons unknown but is now back.

There were serious problems with the IC RB on Thursday/Friday last week which caused the whole system to collapse. Now using RAL RB again.

Overall success is currently 74%. All sites are working pretty well except Bristol, Edinburgh, UCL Central and QMUL. QMUL had been broken since last week because /opt/edg/var/info/atlas/atlas.list got overwritten somehow. It worked again yesterday but is now broken again because lcg-info doesn't report anything (although ldap does).

Apart from that there are occasional failures or aborts of single jobs which need to be investigated sometime.

Monday 12 February 2007

Replica disappeared

My replica at Liverpool has disappeared for no apparent reason. I tried to recreate it but get obscure errors:

java.rmi.RemoteException: SRM Authorization failed;
...

RB Problems

Since last Thursday (8/2) all my jobs were reported as "running - unavailable", and by the weekend my whole system was screwed up. I cleaned everything up and switched from the IC RB to the (2nd) RAL one. Everything seems to be OK again now. In case it has something to do with it: if I now cancel a job, I mark it as cancelled in my log and don't wait to see if the status is "Cancelled" the next time I check its status. Previously, if it didn't come back as "Cancelled" I tried to cancel it again (forever), which may have caused problems, since cancelling seems pretty broken.

Friday 9 February 2007

EU PMA Version 1.12 out

Version 1.12 of the EU PMA certificates is available. Since LCG in their infinite wisdom sometimes don't tell you the content, here's what's new, with my annotations:

* Extended life time of root certificate for SlovakGrid (SK)
* Extended life time of root certificate for PolishGrid (PL)

These will cause headaches for people with NSS-based browsers, like Moz and FF, but only for people with certs from these CAs.

* Fixed SHA-1 finger print for new SiGNET CA (SI)

Possibly also this one. Not sure which fingerprint was fixed here. Could just be the one in the .info file.

* Obsoleted Russian DataGrid CA also in RPM updates (RU)

No worries - no live certs.

* Add NECTEC GOC CA (TH)

This is Thailand - it was already accredited by AP Grid PMA. Probably mostly helpful for people using the GOC Wiki - if you import it into your browser.

* Added SWITCH Personal and Server 2007 CAs, removed 2005 CAs (CH)

No problem - just servers and stuff. SWITCH hierarchy is interesting in many ways.

* Changed CRL URL of the NAREGI CA from https to http (JP)

This is good. Serving CRLs over https is asking for some systems to deadlock.
To do https you need to check the CRL. To check the CRL you need to use https. Repeat.

Next version expected around March. Share and enjoy.

Thursday 8 February 2007

SE issues

We now have SE issues being discovered and reported in a number of ways. It would be useful to list the main ones and any resolutions that have been found so far.