NUC 11.04.2012
Attendance
Steve Aplin; Wolfgang Ehrenfeld; Andreas Gellrich; Kai Leffhalm; Shaojun Lu; Hartmut Stadie; Alexey Zhelezov; Yves Kemp
Agenda
https://indico.desy.de/conferenceDisplay.py?confId=5765
News
- discuss structure of meeting
- be finished at 14:00
NAF status
- operational issues:
- AFS: timeouts, slow reaction time, related to overloaded AFS volumes, please use AFS wisely
- Lustre: heavy load on one file server, slow data and metadata operations, might cause I/O errors
- tickets:
- errors from qstat due to large number of jobs in system; maybe using array jobs helps
- Lustre file corruption can happen when an overloaded server crashes, but there is no indication of it so far
- request for AFS backup restore: in this case backup did not work
- changing the certificate DN for autoproxy requires expert intervention (open ticket)
- CVMFS status:
- CMS: two Squid servers for CMS, new ini-script for CMSSW
- ATLAS: two Squid servers in HH, 1 in ZN, configuration added to nodes, tests run successfully, better performance seen than with AFS, can also get other ATLAS tools via CVMFS
- LHCb: looking into it, switch should be easy
- schedule of SL 5.8 unclear
Existing Action Items
id | who | description | status/comments |
1007-2 | NAF | report on help desk tickets | shown |
1010-4 | NAF | statistics on multi-core/PROOF usage | shown |
1102-2 | NAF | fix extensive directory listing on (CMS) dCache via gsiftp | closed |
1103-1 | experiments | test CVMFS, find configuration, communicate setting to NAF admins | |
ATLAS report
(see https://indico.desy.de/getFile.py/access?subContId=0&contribId=2&resId=minutes&materialId=minutes&confId=5765)
- slow reaction time to the Lustre problem on March 29th
- better understanding of AFS structure (servers, volume location) would help understanding AFS problems
- for schools, normal ATLAS NAF accounts should be used only by official ATLAS members
- more local (Lustre, SONAS) space needed for analysis of 2012 data
CMS report
- may want to impose a job limit on the 15-min queue
ILC/CALICE report
- problem with specifying strict memory limits for SGE jobs, sometimes different usage seen (maybe related to Java usage in gLite), will follow up
- problem with monitoring Lustre
LHCb report
AOB
- short discussion of plans for redesigned NAF (NAF 2.0):
- integrate NAF better into DESY infrastructure, while maintaining it as a platform for analysis within the Terascale alliance
New Action Items:
id | who | description | status/comments |
1204-1 | NAF | improve autoproxy handling of old certificate | |
1204-2 | NAF & experiments | rethink queue setups for 15-min queue, etc. | |
1204-3 | NAF | short report on current queue setup and usage for June NUC | |
1204-4 | NAF | show NAF 2.0 plans in May NUC | |
Next meeting