Minutes of NAF User Committee meeting from 10.2.2010
----------------------------------------------------

Present: Steve Aplin (ILC), Johan Blouw (LHCb), Wolfgang Ehrenfeld (ATLAS), Andreas Gellrich (NAF), Yves Kemp (IT), Angela Lucaci-Timoce (ILC), Jan Erik Sundermann (ATLAS), Alexey Zhelezov (LHCb)

Excused: Kai Leffhalm (NAF), Andreas Haupt (IT), Hartmut Stadie (CMS)

1. News from the chair:

No news.

2. Status report:

The status report was given by Yves; see the agenda for the report. A few highlights from the discussion are listed below:

- Multicore jobs
Accounting of multicore jobs is currently not done properly: only one slot is accounted for each job, regardless of how many slots the job uses. With proper accounting, multicore jobs amount to 0.8% of the walltime used from 1.1.2010 up to now. Multicore job monitoring is on the long-term to-do list for the NAF. If multicore jobs exceed 5% of the total walltime, proper accounting will be needed. To get a feeling for which experiments are using multicore jobs and how much, the NAF will report the number of multicore jobs per VO every month.

- Next downtime
General remark: The NAF operators are the experts in operating the NAF and should decide which systems are disabled for user access during a downtime; this is not the duty of the experiments. On the other hand, the dates of downtimes should be coordinated with the experiments. As a reminder, a quote from the November 2008 NUC meeting: 'Maintenance can be done every Thursday between 8am and 10am. It should be announced well in advance. Please note that this is an optional maintenance slot, which usually will not be used every week. Urgent things, e.g. fixing security issues, are not affected by this and will be fixed as soon as possible.'
For the next hardware intervention February 18th is envisioned. As batch queues need to be stopped, the announcement should be made one week before, i.e. by February 11th at the latest.

- Group profiles
The infrastructure for group profiles is set up. Please use them with care.
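
The multicore accounting correction discussed under the status report can be sketched as follows. The job records, slot counts and VO names below are hypothetical illustrations; the real numbers would come from the SGE accounting logs:

```python
# Sketch of corrected walltime accounting for multicore jobs.
# Each record is (vo, slots, walltime_hours) -- a hypothetical format,
# not the actual SGE accounting schema.

def multicore_share(jobs):
    """Fraction of total walltime consumed by multicore jobs,
    counting each job as slots * walltime rather than one slot per job."""
    total = sum(slots * hours for _, slots, hours in jobs)
    multi = sum(slots * hours for _, slots, hours in jobs if slots > 1)
    return multi / total if total else 0.0

jobs = [
    ("atlas", 1, 500.0),   # single-core jobs
    ("cms",   1, 300.0),
    ("ilc",   4, 2.0),     # one 4-slot job: counts as 8 core-hours
]
share = multicore_share(jobs)
# Per the minutes, proper accounting becomes necessary above 5%:
needs_accounting = share > 0.05
```

With the naive one-slot-per-job accounting the ILC job above would be undercounted by a factor of four, which is exactly the error the minutes describe.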
ATLAS is grateful that this feature was implemented.

- gLite 3.1/3.2
As a reminder, gLite 3.1 is 32-bit and built for SL4, but also runs on SL5; gLite 3.2 is mainly 64-bit and built for SL5, and will not run on SL4. ini can be used to set up either one. The NAF should inform all users and check that the documentation is up to date.

- AFS scratch
ATLAS asked for an estimate of when AFS scratch will be implemented for every user. This is one of the more important items for the experiments. The NAF should report to the NUC as soon as possible.

- SL5.5
As a reminder, it was agreed in one of the NUC meetings to prepare a test system for OS upgrades in advance and allow 1 to 2 weeks of testing by the experiments.

3. Action items:

The following action items were closed at this or the previous meeting:
NAF: change batch default memory limit to 1.5 GB
CMS: batch default memory limit of 1.5 GB
NAF: gLite 3.2 (ini)
NAF: different sysname for SL5 (offline between ILC and NUC)
all: quotas on /scratch
NAF: monitoring tools for /scratch
NAF: group profiles

- different sysname for SL5 and SL4
ILC has decided to move away from using sysname for detecting the build platform.

Open action items:
NAF: gLite 3.2 (user information and documentation)
all: document (software updates, downtimes, ...) and user information (newsletter, motd, news section on NAF web, ...)
NAF/ATLAS: CMT problem
all: NAF SL5 migration (running)
NAF: advice on user file storage (code development, small log files)
NAF: AFS scratch space creation
NAF: cleanup tool for /scratch
NAF: multicore batch job monitoring (needs SGE fix)

- documentation
The general action item on documentation will be discussed at the March NUC meeting.

- CMT problem
The NAF technical coordinators should organise a meeting with experts and ATLAS to decide on a strategy for solving this problem.

- SL5 migration
No negative feedback on the migration of the login host for the work group servers to SL5.
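
The gLite compatibility rules stated above (3.1 built for SL4 but also usable on SL5; 3.2 built for SL5 and not usable on SL4) can be encoded in a small sketch. The function and its return format are illustrative only, not an actual NAF or gLite tool:

```python
# Sketch encoding the gLite/Scientific Linux compatibility matrix
# from the minutes. Purely illustrative; the real selection on the
# NAF is done via the 'ini' setup tool.

def usable_glite_versions(os_release):
    """Return the gLite versions usable on a given SL release."""
    compat = {
        "SL4": ["3.1"],           # gLite 3.2 will not run on SL4
        "SL5": ["3.1", "3.2"],    # gLite 3.1 (32-bit) still runs on SL5
    }
    return compat.get(os_release, [])
```

On an SL5 work group server a user can therefore set up either flavour, while SL4 hosts are restricted to gLite 3.1.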
ILC does not need SL4 work group servers any more and requests 3 SL5 work group servers in total. The same applies to LHCb; they only need two SL5 work group servers. ATLAS will continue with 2 SL4 and 2 SL5 work group servers. CMS should give feedback.

- advice on user file storage
The experiments should read the documentation at http://naf.desy.de/general_naf_docu/naf_storage/ and give feedback on whether it addresses the action item.

- cleanup tool for /scratch
In the long term the NAF is investigating a solution to the problem, but for various reasons (security, stability, ...) they will not provide a solution for Lustre. ATLAS explicitly asked again for an interim working model that allows the ATLAS admins to delete data from Lustre. After some discussion it was agreed that the NAF should discuss this issue again and, if possible, present an interim solution at the March meeting. The item was renamed to 'deletion model for /scratch'.

4. AOB:

- Angela had two users complaining about long waiting times in the batch system. She reported this to naf-helpdesk@desy.de but never got a reply. The NAF should investigate why this ticket was not answered and check whether the problem with the batch system can still be understood. Experiment admins can also contact the NAF experts at naf@desy.de.

- ATLAS asked whether the new Blade Enclosure from DESY-ATLAS is already installed and working, which is the case. Further, ATLAS asked for a recalculation of the fair share of the batch system. This will be done after the next downtime, when the new CPUs from the CMS group at the University of Hamburg are also running. It was agreed with the CMS group from the University of Hamburg that these resources will be available to all CMS users. For the moment, the same applies to the ATLAS-DESY CPUs and the ATLAS group.

- ATLAS asked to be able to change the priority of an ATLAS user relative to other ATLAS users in the batch system. This is needed to give important work some additional computing power for a limited time.
After some discussion, it was agreed that the NAF will check whether this is possible with SGE.

- The next meeting is in one month. No face-to-face meeting at the DPG is foreseen.
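
The fair-share recalculation requested under AOB amounts to setting each group's target share proportional to the CPU resources it contributed. A minimal sketch, with hypothetical core counts standing in for the actual NAF numbers:

```python
# Sketch of a proportional fair-share recalculation.
# Core counts per group are hypothetical placeholders, not NAF data.

def fair_shares(contributed_cores):
    """Map group -> target batch share, proportional to contributed cores."""
    total = sum(contributed_cores.values())
    return {group: cores / total for group, cores in contributed_cores.items()}

cores = {"atlas": 400, "cms": 400, "ilc": 100, "lhcb": 100}  # hypothetical
shares = fair_shares(cores)
```

After the downtime, adding the new University of Hamburg CPUs to the CMS entry and rerunning such a calculation would yield the updated shares; how SGE maps these targets onto its share tree is for the NAF experts to configure.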