Minutes of NAF User Committee meeting from 10.2.2010
----------------------------------------------------

Present: Steve Aplin (ILC), Johan Blouw (LHCb), Wolfgang Ehrenfeld (ATLAS), Andreas Gellrich (NAF), Yves Kemp (IT), Angela Lucaci-Timoce (ILC), Jan Erik Sundermann (ATLAS), Alexey Zhelezov (LHCb)

Excused: Kai Leffhalm (NAF), Andreas Haupt (IT), Hartmut Stadie (CMS)

1. News from the chair:

No news.

2. Status report:

The status report was given by Yves; see the agenda for the report. A few highlights from the discussion are listed below:

- Multicore jobs
Accounting of multicore jobs is currently not done properly: only one slot is accounted for each job, regardless of how many slots the job uses. With proper accounting, multicore jobs amount to 0.8% of the walltime used from 1.1.2010 up to now. Multicore job monitoring is on the long-term to-do list for the NAF. If multicore jobs exceed 5% of the total walltime, proper accounting will be needed. To get a feeling for which experiments are using multicore jobs and how much, the NAF will report the number of multicore jobs per VO every month.

- Next downtime
General remark: The NAF operators are the experts in operating the NAF and should decide which systems are disabled for user access during a downtime; this is not the duty of the experiments. On the other hand, the dates of downtimes should be coordinated with the experiments. As a reminder, a quote from the November 2008 NUC meeting: 'Maintenance can be done every Thursday between 8am and 10am. It should be announced well in advance. Please note that this is an optional maintenance slot, which usually will not be used every week. Urgent things, e.g. fixing security issues, are not affected by this and will be fixed as soon as possible.'
For the next hardware intervention February 18th is envisioned. As batch queues need to be stopped, the announcement should be made one week before, i.e. by February 11th at the latest.

- Group profiles
The infrastructure for group profiles is set up. Please use them with care.
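
The multicore accounting correction discussed under the status report can be sketched as follows. The job records, slot counts and VO names below are hypothetical illustrations; the real numbers would come from the SGE accounting logs:

```python
# Sketch of corrected walltime accounting for multicore jobs.
# Each record is (vo, slots, walltime_hours) -- a hypothetical format,
# not the actual SGE accounting schema.

def multicore_share(jobs):
    """Fraction of total walltime consumed by multicore jobs,
    counting each job as slots * walltime rather than one slot per job."""
    total = sum(slots * hours for _, slots, hours in jobs)
    multi = sum(slots * hours for _, slots, hours in jobs if slots > 1)
    return multi / total if total else 0.0

jobs = [
    ("atlas", 1, 500.0),   # single-core jobs
    ("cms",   1, 300.0),
    ("ilc",   4, 2.0),     # one 4-slot job: counts as 8 core-hours
]
share = multicore_share(jobs)
# Per the minutes, proper accounting becomes necessary above 5%:
needs_accounting = share > 0.05
```

With the naive one-slot-per-job accounting the ILC job above would be undercounted by a factor of four, which is exactly the error the minutes describe.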
ATLAS is grateful that this feature was implemented.

- gLite 3.1/3.2
As a reminder, gLite 3.1 is 32-bit and built for SL4, but also runs on SL5; gLite 3.2 is mainly 64-bit and built for SL5, and will not run on SL4. ini can be used to set up either one. The NAF should inform all users and check that the documentation is up to date.

- AFS scratch
ATLAS asked for an estimate of when AFS scratch will be implemented for every user. This is one of the more important items for the experiments. The NAF should report to the NUC as soon as possible.

- SL5.5
As a reminder, it was agreed in one of the NUC meetings to prepare a test system for OS upgrades in advance and allow 1 to 2 weeks of testing by the experiments.

3. Action items:

The following action items were closed at this or the previous meeting:
NAF: change batch default memory limit to 1.5 GB
CMS: batch default memory limit of 1.5 GB
NAF: gLite 3.2 (ini)
NAF: different sysname for SL5 (offline between ILC and NUC)
all: quotas on /scratch
NAF: monitoring tools for /scratch
NAF: group profiles

- different sysname for SL5 and SL4
ILC has decided to move away from using sysname for detecting the build platform.

Open action items:
NAF: gLite 3.2 (user information and documentation)
all: document (software updates, downtimes, ...) and user information (newsletter, motd, news section on NAF web, ...)
NAF/ATLAS: CMT problem
all: NAF SL5 migration (running)
NAF: advice on user file storage (code development, small log files)
NAF: AFS scratch space creation
NAF: cleanup tool for /scratch
NAF: multicore batch job monitoring (needs SGE fix)

- documentation
The general action item on documentation will be discussed at the March NUC meeting.

- CMT problem
The NAF technical coordinators should organise a meeting with experts and ATLAS to decide on a strategy for solving this problem.

- SL5 migration
No negative feedback on the migration of the login host for the work group servers to SL5.
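
The gLite compatibility rules stated above (3.1 built for SL4 but also usable on SL5; 3.2 built for SL5 and not usable on SL4) can be encoded in a small sketch. The function and its return format are illustrative only, not an actual NAF or gLite tool:

```python
# Sketch encoding the gLite/Scientific Linux compatibility matrix
# from the minutes. Purely illustrative; the real selection on the
# NAF is done via the 'ini' setup tool.

def usable_glite_versions(os_release):
    """Return the gLite versions usable on a given SL release."""
    compat = {
        "SL4": ["3.1"],           # gLite 3.2 will not run on SL4
        "SL5": ["3.1", "3.2"],    # gLite 3.1 (32-bit) still runs on SL5
    }
    return compat.get(os_release, [])
```

On an SL5 work group server a user can therefore set up either flavour, while SL4 hosts are restricted to gLite 3.1.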
ILC does not need SL4 work group servers any more and requests 3 SL5 work group servers in total. The same applies to LHCb; they only need two SL5 work group servers. ATLAS will continue with 2 SL4 and 2 SL5 work group servers. CMS should give feedback.

- advice on user file storage
The experiments should read the documentation at http://naf.desy.de/general_naf_docu/naf_storage/ and give feedback on whether it addresses the action item.

- cleanup tool for /scratch
In the long term the NAF is investigating a solution to the problem, but for various reasons (security, stability, ...) they will not provide a solution for Lustre. ATLAS explicitly asked again for an interim working model that allows the ATLAS admins to delete data from Lustre. After some discussion it was agreed that the NAF should discuss this issue again and, if possible, present an interim solution at the March meeting. The item was renamed to 'deletion model for /scratch'.

4. AOB:

- Angela had two users complaining about long waiting times in the batch system. She reported this to naf-helpdesk@desy.de but never got a reply. The NAF should investigate why this ticket was not answered and check whether the problem with the batch system can still be understood. Experiment admins can also contact the NAF experts at naf@desy.de.

- ATLAS asked whether the new Blade Enclosure from DESY-ATLAS is already installed and working, which is the case. Further, ATLAS asked for a recalculation of the fair share of the batch system. This will be done after the next downtime, when the new CPUs from the CMS group at the University of Hamburg are also running. It was agreed with the CMS group from the University of Hamburg that these resources will be available to all CMS users. For the moment, the same applies to the ATLAS-DESY CPUs and the ATLAS group.

- ATLAS asked to be able to change the priority of an ATLAS user relative to other ATLAS users in the batch system. This is needed to give important work some additional computing power for a limited time.
After some discussion, it was agreed that the NAF will check whether this is possible with SGE.

- The next meeting is in one month. No face-to-face meeting at the DPG is foreseen.
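
The fair-share recalculation requested under AOB amounts to setting each group's target share proportional to the CPU resources it contributed. A minimal sketch, with hypothetical core counts standing in for the actual NAF numbers:

```python
# Sketch of a proportional fair-share recalculation.
# Core counts per group are hypothetical placeholders, not NAF data.

def fair_shares(contributed_cores):
    """Map group -> target batch share, proportional to contributed cores."""
    total = sum(contributed_cores.values())
    return {group: cores / total for group, cores in contributed_cores.items()}

cores = {"atlas": 400, "cms": 400, "ilc": 100, "lhcb": 100}  # hypothetical
shares = fair_shares(cores)
```

After the downtime, adding the new University of Hamburg CPUs to the CMS entry and rerunning such a calculation would yield the updated shares; how SGE maps these targets onto its share tree is for the NAF experts to configure.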