Accounting Working Meeting

Europe/Berlin
    • 10:00 - 11:00
      Round Table 1h
      Speakers: Manuel Giffels (Karlsruher Institut für Technologie (KIT)), Matthias Schnepf (CMS (CMS-Experiment)), Dr Max Fischer (KIT SCC), Michael Boehler, Dr Oliver Freyermuth (University of Bonn (DE)), Ralf Florian von Cube (Karlsruhe Institute of Technology (KIT))

      ROUND:

       

      WUP:

      • worked on comparing AUDITOR output with the numbers from the experiment
        • system cpu time not always reliable in AUDITOR (does HTCondor report this number properly?)
        • --> most likely missing numbers from extremely short-running jobs
      • Question: what is the current working model? Where do we want to set up AUDITOR and the reporting plugin (individual sites, or OBS)?
        • current setup tested for BO: reporting runs on site with OBS
      • spotted many open file descriptors when running containers with cvmfs
        • --> caching?
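
To put numbers on the open file descriptor observation, one option is to count the entries under the process's fd directory in procfs. This is a minimal sketch, assuming a Linux host; the function name `count_open_fds` and the `proc_root` parameter are hypothetical and not part of any tool discussed in the meeting.

```python
from pathlib import Path


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of open file descriptors of a process.

    Assumes the Linux procfs layout <proc_root>/<pid>/fd, where each
    directory entry corresponds to one open descriptor. Inspecting a
    process owned by another user usually needs elevated privileges.
    """
    fd_dir = Path(proc_root) / str(pid) / "fd"
    return sum(1 for _ in fd_dir.iterdir())
```

Calling `count_open_fds(pid)` for the container payload once with cvmfs and once with a downloaded image would give directly comparable numbers.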

       

      KA:

      • after some fixes, KA publishes accounting data for BO to APEL, but no numbers are published on the web portal
      • Max opened a GGUS ticket to understand why the BO accounting data is missing
      • PR pending: including integration test for HTCondor collector (spinning up containers for a Docker-based HTCondor cluster and running tests)
      • not yet included: request to add a default score that is published if a given node does not report its hepscore value
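
The requested default-score behaviour could look roughly like the following. This is only a sketch under assumptions: the name `effective_score`, the mapping of node names to scores, and the value of `DEFAULT_HEPSCORE` are all hypothetical placeholders, not what the plugin does or will do.

```python
# Hypothetical placeholder value; the actual default would have to be agreed on.
DEFAULT_HEPSCORE = 10.0


def effective_score(node_scores, node, default=DEFAULT_HEPSCORE):
    """Return the score reported for a node, or the default if none exists.

    node_scores maps node names to reported hepscore values; a missing key
    or a value of None both count as "not reported".
    """
    score = node_scores.get(node)
    if score is None:
        return default
    return score
```

Treating `None` the same as a missing entry keeps the fallback in one place, whatever way a node fails to report.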

       

      BO:

      • finalizing full node benchmarks
        • found some bugs in the benchmark suite, which were reported to the developers
        • cvmfs-based tests open a factor of 4-5 more files than downloading the containers and running them
        • comparison of HammerCloud hepscore23 with full node benchmarks shows a difference: hepscore23 = 21 vs hepscore23 = 14
          • this effect is driven by the largely heterogeneous jobs on a given HPC node vs full stress when running the benchmark

       

      FR:

      • PR of Florian on todo list -> now
      • update on items for release v 0.2.0 milestones:
        • AUDITOR_APEL_plugin: move code to AUDITOR repo - done
        • Add workflows for linting of python code enhancement - implemented
        • publish to PyPI - on hold until orga is established
        • Slurm collector: Add docs
        • Add workflow(s) for publishing the python parts of the repo
        • Integration test for HTCondor-Collector
        • Set log level via config file
        • adjust (color scheme on) web page
      • pyauditor integration tests fail randomly
        • rust io sync pyO3
        • test if dropping python 3.6 support might fix this issue
      • C/T auditor plugin crashes
        • updates were attempted on some records before they had been added to the AUDITOR db
        • records are created when a drone is in the AVAILABLE state
        • records are updated with the end time when the drone reaches the DOWN state (some drones never reach AVAILABLE, only DOWN) -> error handler required!
      • APEL plugin in repo requires python 3.7 or newer
        • in order to support centos7, provide plugin in docker container  
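
For the C/T plugin crash noted above (a drone reaching DOWN without ever being AVAILABLE), the required error handler might be sketched as follows. Everything here is an assumption for illustration: `RecordStore` is an in-memory stand-in for the AUDITOR db, and `handle_state_change` is a hypothetical name, not the plugin's actual code or the pyauditor API.

```python
import logging

logger = logging.getLogger("drone_records")


class RecordStore:
    """In-memory stand-in for the accounting database (hypothetical)."""

    def __init__(self):
        self.records = {}

    def add(self, record_id, start_time):
        self.records[record_id] = {"start_time": start_time, "stop_time": None}

    def update(self, record_id, stop_time):
        # Mirrors the failure mode: updating an unknown record raises.
        if record_id not in self.records:
            raise KeyError(record_id)
        self.records[record_id]["stop_time"] = stop_time


def handle_state_change(store, drone_id, state, timestamp):
    """Apply a drone state change; return True if the store was modified."""
    if state == "AVAILABLE":
        store.add(drone_id, timestamp)
        return True
    if state == "DOWN":
        try:
            store.update(drone_id, timestamp)
        except KeyError:
            # The drone never reached AVAILABLE, so there is no record to close.
            logger.warning("drone %s went DOWN without a record; skipping", drone_id)
            return False
        return True
    return False
```

Skipping the update is only one possible policy. Creating a zero-length record instead would also avoid the crash, and which behaviour is wanted is left open here.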
           

       

      Next Meetings: 

      • 7th August
      • 11th September