Speakers:
Manuel Giffels (Karlsruher Institut für Technologie (KIT)),
Matthias Schnepf (CMS (CMS-Experiment)),
Dr Max Fischer (KIT SCC),
Michael Boehler,
Dr Oliver Freyermuth (University of Bonn (DE)),
Ralf Florian von Cube (Karlsruhe Institute of Technology (KIT))
ROUND:
WUP:
- worked on comparison of AUDITOR output vs numbers from experiment
- system cpu time not always reliable in AUDITOR (does HTCondor report this number properly?)
- --> most likely missing numbers from extremely short running jobs
- Question: what is the current working model? Where do we want to run AUDITOR and the reporting plugin (at individual sites, or at OBS)?
- current setup tested for BO: reporting runs on site with OBS
- spotted many open file descriptors when running containers with cvmfs
KA:
- after some fixes, KA publishes accounting data for BO to APEL, but no numbers published on web portal
- Max opened a GGUS ticket to understand why the BO accounting data is missing
- PR pending: including integration test for HTCondor collector (spinning up containers for a docker based HTCondor cluster and running tests)
- not yet included: request for adding a default score, which is published if a given node does not report its hepscore value
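The requested default-score fallback could look roughly like the sketch below. It is only an illustration: the function name, the `reported` mapping, and the value of `DEFAULT_HEPSCORE` are assumptions and not part of AUDITOR or the APEL plugin.

```python
from typing import Mapping, Optional

# Hypothetical default; the actual value would have to be agreed on per site.
DEFAULT_HEPSCORE = 10.0


def effective_score(
    node: str,
    reported: Mapping[str, Optional[float]],
    default: float = DEFAULT_HEPSCORE,
) -> float:
    """Return the hepscore reported for `node`, or `default` if it is missing.

    A node counts as "not reporting" if it is absent from the mapping,
    maps to None, or maps to a non-positive value.
    """
    score = reported.get(node)
    if score is None or score <= 0:
        return default
    return score
```

With this approach the published number is always defined, at the cost of hiding which nodes used the fallback; logging those nodes separately would keep that visible.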
BO:
- finalizing full node benchmarks
- found some bugs in benchmark suite, which were reported to the developers
- cvmfs-based tests open a factor of 4-5 more files than downloading the containers and running them
- comparison of HammerCloud hepscore23 with full node benchmarks shows difference of hepscore23 = 21 vs hepscore23 = 14
- this effect is driven by largely heterogeneous jobs on a given HPC node vs full stress, when running benchmark
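For the open-file comparison between cvmfs-based and downloaded containers, one simple way to sample the number of open file descriptors of a process on Linux is to count the entries in `/proc/<pid>/fd`. This is a minimal sketch, assuming a Linux host with procfs and sufficient permissions to read the target process; it is not how the numbers above were obtained.

```python
import os


def count_open_fds(pid: int) -> int:
    """Count open file descriptors of process `pid` via procfs (Linux only).

    When `pid` is the calling process, the count includes the descriptor
    that is briefly opened to list the directory itself.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Sampling this periodically for the container runtime process in both setups would give comparable numbers for the factor quoted above.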
FR:
- PR of Florian on todo list -> now
- update on items for release v 0.2.0 milestones:
- AUDITOR_APEL_plugin: move code to AUDITOR repo - done
- Add workflows for linting of python code enhancement - implemented
- publish to PyPI - on hold until orga is established
- Slurm collector: Add docs
- Add workflow(s) for publishing the python parts of the repo
- Integration test for HTCondor-Collector
- Set log level via config file
- adjust (color schema on) web page
- pyauditor integration tests fail randomly
- rust io sync pyO3
- test if dropping python 3.6 support might fix this issue
- C/T auditor plugin crashes
- updates were attempted on some records before they had been added to the AUDITOR db
- records are created when drone is in AVAILABLE state
- records are updated with endtime, when drone reaches DOWN state (some drones never reach AVAILABLE, but DOWN) -> error handler required!
- APEL plugin in repo requires python 3.7 or newer
- in order to support centos7, provide plugin in docker container
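The error handler asked for under the C/T AUDITOR plugin item (drones that reach DOWN without ever reaching AVAILABLE) could be sketched as follows. `RecordStore` is an in-memory stand-in for the AUDITOR db, not the pyauditor API, and `handle_state` is a hypothetical name. Skipping the update is one possible policy; creating a zero-length record would be another.

```python
import logging
from datetime import datetime, timezone
from typing import Dict, Optional

log = logging.getLogger("drone_records")


class RecordStore:
    """In-memory stand-in for the AUDITOR db (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, Optional[datetime]]] = {}

    def add(self, record_id: str, start: datetime) -> None:
        self.records[record_id] = {"start": start, "stop": None}

    def update(self, record_id: str, stop: datetime) -> None:
        # Mirrors the failure mode: updating a record that was never added.
        if record_id not in self.records:
            raise KeyError(record_id)
        self.records[record_id]["stop"] = stop


def handle_state(store: RecordStore, drone_id: str, state: str, now: datetime) -> bool:
    """Apply a drone state change; return True if a record was written."""
    if state == "AVAILABLE":
        store.add(drone_id, now)
        return True
    if state == "DOWN":
        try:
            store.update(drone_id, now)
        except KeyError:
            # Drone went DOWN without ever being AVAILABLE: skip instead of crashing.
            log.warning("drone %s reached DOWN without a record; skipping", drone_id)
            return False
        return True
    return False  # other states are ignored in this sketch
```

For example, `handle_state(store, "drone-1", "DOWN", datetime.now(timezone.utc))` on an empty store returns `False` and leaves the store unchanged rather than raising.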
Next Meetings:
7th August
11th September