Minutes of NAF User Committee meeting from 11.11.2009 (face-to-face)
--------------------------------------------------------------------

Present: Steve Aplin (ILC), Johan Blouw (LHCb), Wolfgang Ehrenfeld
(ATLAS), Andreas Haupt (IT), Carsten Hof (CMS), Yves Kemp (IT),
Kai Leffhalm (NAF), Angela Lucaci-Timoce (ILC), Niels Meyer (ILC),
Hartmut Stadie (CMS), Jan Erik Sundermann (ATLAS),
Alexey Zhelezov (LHCb)

1. News from the chair:

   Andreas Nowack (Aachen) will replace Carsten Hof as one of the CMS
   representatives from December 2009 on. Jan Erik Sundermann will
   leave the NUC at the end of the year. A new ATLAS representative is
   currently being sought.

2. Report from ATLAS:

   ATLAS presented some use cases and open issues. See the agenda for
   the talk. Below are a few important points:

   - Automatic cleanup for part of /scratch is needed.
   - Some space for longer-term storage is needed on /scratch (needs
     quota or a separate instance). Or can it be moved to AFS? What are
     the limits there?
   - Can AFS user scratch be created automatically and mounted into
     the home directory?
   - The CMT problem is a major show stopper. This needs to be fixed
     immediately.

3. Report from CMS:

   CMS presented some use cases and feedback. See the agenda for the
   talk. Below are a few important points:

   - access to /scratch from the Grid, i.e. with Grid tools
   - larger /scratch
   - user/group quota on /scratch
   - better handling of AFS scratch (admin via registry, linked/mounted
     into home)
   - need for a native SL5 build of the Grid UI
   - 2 GB memory as default for batch jobs

4. Report from LHCb:

   LHCb presented their current requirements and the expected ones for
   2010. See the agenda for the talk. Below are a few important points:

   - dCache is used as main storage; /scratch is hardly used.
   - difficulties with dCache monitoring and dCache data management
   - slow network connection between NAF and Heidelberg

   For all issues LHCb is in contact with the NAF people from Zeuthen.

5. Report from ILC:

   ILC presented typical use cases and some feedback. See the agenda
   for the talk. Below are a few important points:

   - definition of /scratch
   - how to use /scratch best
   - problems with the autoproxy service and RFC proxy

6. NAF perspectives:

   Andreas Haupt reviewed the NAF batch system, which is based on Sun
   Grid Engine (SGE). See the agenda for the slides.

   Yves Kemp presented the NAF perspectives. See the agenda for the
   slides.

7. Discussion:

   Most of the issues raised by the experiments were discussed. Below
   are the summaries:

   - AFS scratch:
     The current practice for creating AFS scratch space for users is
     not very efficient. It should be done automatically, and the
     scratch space should be linked or mounted into the home directory.

   - Quota on /scratch:
     Most of the experiments would like to have user/group quota on
     /scratch, although the use cases differ between the experiments:
     ATLAS wants group quota to separate short- and long-term storage
     on /scratch. CMS would like to have half of their /scratch space
     for users with quotas. ILC wants user quotas. This will be
     discussed again at the next meeting.

   - Monitoring of /scratch usage:
     It is important to monitor the /scratch usage by user. At the
     moment the experiment admins do not always have full rights to do
     this. The NAF will provide tools for this.

   - Cleanup of /scratch:
     The main point about /scratch is that there is no backup! Data
     placement and deletion policies are defined by the experiments.
     ATLAS wants automatic cleanup of part of the /scratch space. As
     access rights are not provided to the experiments, the NAF will
     provide tools and run them in consultation with ATLAS.

   - Default memory per batch job:
     At the moment 1 GB is the default memory limit for batch jobs.
     Every user can and should adjust this. Memory oversubscription is
     not permitted by the batch system. Also, the swap partition is
     very small, as swapping will not happen due to the batch system
     policy. To ease this issue for the users, it was discussed whether
     the default can be raised to a more realistic value of 2 GB. At
     the moment most machines have 2 GB/core, or in other words 16 GB
     for an 8-core system. Effectively this comes to 1.8-1.9 GB/job,
     which means that 2 GB/job is not realistic at the moment. The NAF
     will check whether the default can be raised to 1.8-1.9 GB/job on
     a short time scale. On the other hand, some slots will not be
     usable if users require more than 2 GB/job. It is not clear
     whether sufficient monitoring of requested/used memory is
     available. The NAF will present its proposal for a long-term
     change at the next meeting.

   - Batch system and /scratch:
     It would be easier for the user to have a shell variable pointing
     to the user's /scratch directory. This can be set either centrally
     or by the experiments. The NAF should investigate whether group
     profiles are possible.

   - Multi-core batch jobs:
     More users are running multi-core batch jobs. More documentation
     and tips are needed on the NAF web pages. The NAF should
     investigate how to monitor the multi-core usage.

   - Grid UI 3.2 for SLD5:
     IT is working on this, and the SLD5 version will be available this
     year via ini. It is important that both versions 3.1 and 3.2 are
     available in parallel for some time.
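
Note on the default memory per batch job (item 7): the figures quoted in the discussion can be checked with a short calculation. The following is a minimal sketch, not a NAF or SGE tool. It assumes that the gap between 2 GB/core and the effective 1.8-1.9 GB/job is memory held back on each node (the minutes do not state the cause), and the function names and the reserved-memory values are hypothetical, chosen only to bracket the quoted range.

```python
# Illustrative arithmetic only; not part of any NAF or SGE tooling.
# Function names and the "reserved" values are assumptions for this sketch.


def memory_per_slot(total_gb: float, cores: int, reserved_gb: float) -> float:
    """Memory available per batch slot if reserved_gb is held back on the node."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    if not 0 <= reserved_gb <= total_gb:
        raise ValueError("reserved_gb must lie between 0 and total_gb")
    return (total_gb - reserved_gb) / cores


def usable_slots(total_gb: float, cores: int, reserved_gb: float,
                 request_gb: float) -> int:
    """Number of slots that can run at once if every job requests request_gb."""
    if request_gb <= 0:
        raise ValueError("request_gb must be positive")
    by_memory = int((total_gb - reserved_gb) // request_gb)
    return min(cores, by_memory)


TOTAL_GB = 16.0  # from the minutes: 16 GB per node
CORES = 8        # from the minutes: 8 cores per node

if __name__ == "__main__":
    # Assumed reserved memory per node (hypothetical values).
    for reserved in (0.0, 0.8, 1.6):
        per_slot = memory_per_slot(TOTAL_GB, CORES, reserved)
        print(f"reserved {reserved:.1f} GB -> {per_slot:.2f} GB per slot")

    # Effect of larger per-job requests, with 0.8 GB assumed reserved.
    for request in (2.0, 3.0):
        slots = usable_slots(TOTAL_GB, CORES, 0.8, request)
        print(f"{request:.1f} GB/job -> {slots} of {CORES} slots usable")
```

Under these assumptions, reserving between 0.8 and 1.6 GB per node reproduces the 1.8-1.9 GB/job range, and jobs requesting 2 GB or more leave at least one of the eight slots idle, which matches the concern about unusable slots.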