NUC 11.01.2012
Attendance:
virtual meeting
News
- Happy New Year!
NAF status
The main subject of this virtual meeting is yesterday's intervention, which unfortunately is still ongoing today.
Let me summarize what was done and is still being done:
- The core routers in the main computing room were upgraded. This intervention unfortunately took longer than expected and led to a prolonged, complete service disruption affecting most of DESY, not only the NAF.
- Moving some Lustre file servers to another computing room was successful. One server had a defective mainboard; fortunately we had a spare one. One server has currently lost some LUNs, which affects CMS and ILC to some extent (some files are not accessible, df is not working). This is being investigated. The Lustre servers are currently connected via plain Gbit. Once the overall network has calmed down, we will switch to 10 Gbit. We hope that this can be done during normal operations, with a disruption of about 5 seconds per pool, which should not affect user jobs.
- The DDN maintenance itself was successful. However, the DDN dCache pools are connected to the IP network at 10 GE through a dedicated switch, and this switch has routing problems. These pools are therefore disabled, so a large number of files are currently unavailable.
- The Grid/NAF/dCache subnets have now been made available in both computing centers in Hamburg. This was partly successful (otherwise the Lustre servers would not be visible), but routing problems have occurred since. We were able to resolve some of them, but alas not all. To my current knowledge, the following is not working:
  - The Zeuthen part of the NAF is completely off. Some AFS volumes, Lustre and the WGS as well as dCache in Zeuthen are affected.
  - Login and AFS are only reachable from within DESY (Hamburg, to be precise).
  - Routing to LHCONE does not work: no VOMS certificates from CERN, no data transfers from/to LHCONE sites.
  - HH-dCache connectivity is, of course, affected as well (ILC is not affected).
- We decided to start the NAF yesterday evening despite all known and unknown problems. In the course of today, the network team will try to solve the remaining problems. We will continue to keep users informed about progress via email.
Given the current situation, I was not able to compile material for the other open questions. If you feel that any of them are urgent, I can do so for next week; otherwise we leave them open until the February NUC.
Existing Action Items
id | who | description | status/comments |
1007-2 | NAF | report on help desk tickets | |
1010-4 | NAF | statistics on multi-core/PROOF usage | |
1101-2 | NAF | naf_token for ubuntu&MacOSX | |
ATLAS report
CMS report
ILC/CALICE report
LHCb report
From LHCb there is not much news; there is only one issue which might come up in the coming days.
One user wants to submit multi-threaded jobs to the batch system and is asking whether there is a special queue for that.
AOB
New Action Items:
id | who | description | status/comments |
Next meeting
- Wednesday, February 8th 2012, 1 pm