EAR 4.3
Reference Manual
Report plugins

EAR includes several report plugins that are used to send data to various services:

  • EARD/EARDBD: these plugins are used internally by EAR to send data between services, which in turn aggregate it and send it to the configured databases or other services.
  • MySQL/PostgreSQL: both plugins implement full EAR job and system accounting, each using the official C bindings of its database to send the data. For more information on the database structure, see the corresponding section.
  • Prometheus: this plugin exposes system monitoring data in OpenMetrics format, which is fully compatible with Prometheus. For information about how to compile and set it up, check the Prometheus section.
  • csv_ts: reports loop and application data to a .csv file. The structure is the same as eacct's CSV option (see `eacct`) with an added column for the timestamp.
  • EXAMON: sends application accounting and system metrics to EXAMON. For more information, see its dedicated section.
  • DCDB: sends application accounting and system metrics to DCDB. For more information, see its dedicated section.
  • sysfs: exposes system monitoring data through the file system. For more information, see its dedicated section.

Prometheus report plugin

Requirements

The Prometheus plugin has only one dependency, microhttpd. To compile the plugin, make sure that the library is in your LD_LIBRARY_PATH.

Installation

Currently, to compile and install the Prometheus plugin, run the following commands:

make FEAT_DB_PROMETHEUS=1
make FEAT_DB_PROMETHEUS=1 install

With that, the plugin will be correctly placed in the usual folder.

Configuration

Due to the way in which Prometheus works, this plugin is designed to be used by the EAR Daemons, although the EARDBD should not have many issues running it too.

To have it running in the daemons, simply add it to the corresponding line in the [configuration file](Configuration).

EARDReportPlugins=eardbd.so:prometheus.so

This will expose the metrics on each node on a small HTTP server. You can access them normally through a browser at port 9011 (fixed for now).

In Prometheus, simply add the nodes you want to scrape in prometheus.yml with the port 9011. Make sure that the scrape interval is equal to or shorter than the insertion time (NodeDaemonPowermonFreq in ear.conf), since metrics only stay on the page for that duration.
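The scrape setup described above can be sketched as a small Python helper that prints a `scrape_configs` fragment to merge into prometheus.yml. This is only an illustration: the job name `ear`, the node names, and the function itself are hypothetical, and the check assumes that NodeDaemonPowermonFreq is expressed in seconds.

```python
def build_scrape_config(nodes, scrape_interval_s, powermon_freq_s,
                        port=9011, job_name="ear"):
    """Return a prometheus.yml scrape_configs fragment for the given nodes.

    job_name and the node names are illustrative; port 9011 is the fixed
    port of the EAR Prometheus plugin.
    """
    # Metrics only stay on the page for one insertion period, so a longer
    # scrape interval would miss samples.
    if scrape_interval_s > powermon_freq_s:
        raise ValueError("scrape interval must not exceed NodeDaemonPowermonFreq")
    targets = ", ".join(f"'{node}:{port}'" for node in nodes)
    lines = [
        "scrape_configs:",
        f"  - job_name: '{job_name}'",
        f"    scrape_interval: {scrape_interval_s}s",
        "    static_configs:",
        f"      - targets: [{targets}]",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical node names and a 60 s insertion time.
    print(build_scrape_config(["node1", "node2"],
                              scrape_interval_s=30, powermon_freq_s=60))
```

Rejecting a scrape interval longer than the insertion time keeps the constraint visible at configuration time rather than showing up later as gaps in the scraped series.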

Examon

ExaMon (Exascale Monitoring) is a lightweight monitoring framework for supporting accurate monitoring of power/energy/thermal and architectural parameters in distributed and large-scale high-performance computing installations.

Compilation and installation

To compile the EXAMON plugin you need a functioning EXAMON installation.

Modify the main Makefile and set FEAT_EXAMON=1. In src/report/Makefile, update EXAMON_BASE with the path to the current EXAMON installation. Finally, set an examon.conf file somewhere on your installation, and modify src/report/examon.c (line 83, variable `char* conffile = "/hpc/opt/ear/etc/ear/examon.conf"`) to point to the new examon.conf file.

The file should look like this:

[MQTT]
brokerHost = hostip
brokerPort = 1883
topic = org/bsc
qos = 0
data_topic_string = plugin/ear/chnl/data
cmd_topic_string = plugin/ear/chnl/cmd

Where hostip is the actual IP address of the node.
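Before pointing the plugin at a new examon.conf, it can help to check the file for missing keys. The sketch below does that with Python's standard `configparser`, on the assumption that the file is plain INI syntax as shown above. The helper name, the set of required keys, and the example address 192.0.2.10 are illustrative; the plugin's own parser may be stricter or more lenient.

```python
import configparser

# Example content following the layout shown above; the IP is a placeholder.
EXAMPLE_CONF = """\
[MQTT]
brokerHost = 192.0.2.10
brokerPort = 1883
topic = org/bsc
qos = 0
data_topic_string = plugin/ear/chnl/data
cmd_topic_string = plugin/ear/chnl/cmd
"""

REQUIRED_KEYS = ("brokerHost", "brokerPort", "topic", "qos",
                 "data_topic_string", "cmd_topic_string")


def check_examon_conf(text):
    """Parse examon.conf text and return the main MQTT settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("MQTT"):
        raise ValueError("missing [MQTT] section")
    mqtt = parser["MQTT"]
    missing = [key for key in REQUIRED_KEYS if key not in mqtt]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    port = mqtt.getint("brokerPort")
    qos = mqtt.getint("qos")
    if not 0 < port < 65536:
        raise ValueError("brokerPort out of range")
    if qos not in (0, 1, 2):  # MQTT defines QoS levels 0, 1 and 2
        raise ValueError("qos must be 0, 1 or 2")
    return {"host": mqtt["brokerHost"], "port": port,
            "qos": qos, "topic": mqtt["topic"]}


if __name__ == "__main__":
    print(check_examon_conf(EXAMPLE_CONF))
```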

Once that is set up, you can compile EAR normally and the plugin will be installed in the lib/plugins/report folder inside EAR's installation. To activate it, set it as one of the values in the EARDReportPlugins of ear.conf and restart the EARD.

The plugin is designed to be used locally in each node (EARD level) together with EXAMON's data broker.

DCDB

The Data Center Data Base (DCDB) is a modular, continuous, and holistic monitoring framework targeted at HPC environments.

This plugin implements the functions to report periodic metrics, report loops, and report events.

When the DCDB plugin is loaded, the EAR data collected for each report type is stored in shared memory. The DCDB EAR sensor (a report plugin implemented on the DCDB side) reads that shared memory to collect the data and pushes it into the database using MQTT messages.

Compilation and configuration

This plugin is automatically installed with the default EAR installation. To activate it, set it as one of the values in the EARDReportPlugins of ear.conf and restart the EARD.

The plugin is designed to be used locally in each node (EARD level) with the DCDB collect agent.
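The activation step, adding a plugin to the colon-separated EARDReportPlugins value in ear.conf, can be sketched as a small text transformation. The helper below is hypothetical and operates on the file content as a string. It uses prometheus.so as the example because that is the only plugin file name given in this section; the file name of the DCDB plugin should be taken from the lib/plugins/report folder of the installation.

```python
def enable_report_plugin(conf_text, plugin, key="EARDReportPlugins"):
    """Return conf_text with plugin present in the colon-separated key list."""
    out, found = [], False
    for line in conf_text.splitlines():
        name, sep, value = line.partition("=")
        # Only touch an active "key=value" line; commented lines do not match.
        if sep and name.strip() == key:
            found = True
            plugins = [p for p in value.strip().split(":") if p]
            if plugin not in plugins:
                plugins.append(plugin)
            line = f"{key}={':'.join(plugins)}"
        out.append(line)
    if not found:
        # No existing entry: add a new line with just this plugin.
        out.append(f"{key}={plugin}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(enable_report_plugin("EARDReportPlugins=eardbd.so\n", "prometheus.so"))
```

The EARD still has to be restarted after the file is changed, as noted above.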

Sysfs Report Plugin

This is a new report plugin that writes the data collected by EAR into files. A single file is generated per metric, per jobID and stepID, per node, per island, per cluster. Only the most recently collected values are stored: every time the report runs, it overwrites the previous data with the current values.

Namespace Format

The below schema has been followed to create the metric files:

/root_directory/cluster/island/nodename/avg/metricFile
/root_directory/cluster/island/nodename/current/metricFile
/root_directory/cluster/island/jobs/jobID/stepID/nodename/avg/metricFile
/root_directory/cluster/island/jobs/jobID/stepID/nodename/current/metricFile

The root_directory is the default path where all the created metric files are generated.

The cluster, island and nodename will be replaced by the cluster name, island number, and node information, respectively.

metricFile will be replaced by the name of the metrics collected by EAR.
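The schema above can be expressed as a short path-building helper. This is a sketch for scripts that consume the files, not the plugin's own code: the function name and its parameters are hypothetical, and `/root_directory` stands in for whatever root path the installation uses.

```python
from pathlib import PurePosixPath


def metric_path(root, cluster, island, nodename, kind, metric_file,
                job_id=None, step_id=None):
    """Build a metric file path following the namespace schema above.

    kind is "avg" or "current"; job_id and step_id select the per-job tree.
    """
    if kind not in ("avg", "current"):
        raise ValueError("kind must be 'avg' or 'current'")
    base = PurePosixPath(root) / cluster / str(island)
    if job_id is not None:
        if step_id is None:
            raise ValueError("step_id is required when job_id is given")
        base = base / "jobs" / str(job_id) / str(step_id)
    elif step_id is not None:
        raise ValueError("step_id given without job_id")
    return base / nodename / kind / metric_file


if __name__ == "__main__":
    # Node-level file, then a job-level file (all names are placeholders).
    print(metric_path("/root_directory", "cluster", 0, "node1",
                      "current", "dc_power_watt"))
    print(metric_path("/root_directory", "cluster", 0, "node1",
                      "avg", "app_sig_mem_gbs", job_id=1234, step_id=0))
```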

Metric File Naming Format

The naming format used for the metric files follows the standard sysfs interface format. The commonly used file naming schema is:

<type>_<component>_<metric-name>_<unit>

Numbering is added to some metric file names when the component has more than one instance, such as FLOPS counters or GPU data.

Examples of some generated metric files:

  • dc_power_watt
  • app_sig_pck_power_watt
  • app_sig_mem_gbs
  • app_sig_flops_6
  • avg_imc_freq_KHz

Metrics reported

The following are the reported values for each type of metric recorded by EAR:

  • report_periodic_metrics
    • Average values
      • The frequency and temperature values are calculated by summing the values of all periods from the moment the report was loaded up to the current period, and dividing the sum by the total number of periods.
      • The energy value is the accumulated value of all periods from the moment the report was loaded up to the current one.
      • The path to those metric files is built as: /root_directory/cluster/island/nodename/avg/metricFile
    • Current values
      • Represent the current collected EAR metric per period.
      • The path to those metric files is built as: /root_directory/cluster/island/nodename/current/metricFile
  • report_loops
    • Current values
      • Represent the current collected EAR metric per loop.
      • The path to those metric files is built as: /root_directory/cluster/island/jobs/jobID/stepID/nodename/current/metricFile
  • report_applications
    • Current values
      • Represent the current collected EAR metric per application.
      • The path to those metric files is built as: /root_directory/cluster/island/jobs/jobID/stepID/nodename/avg/metricFile
  • report_events
    • Current values
      • Represent the current collected EAR metric per event.
      • The path to those metric files is built as: /root_directory/cluster/island/jobs/jobID/stepID/nodename/current/metricFile
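The arithmetic behind the average values of report_periodic_metrics can be illustrated with a few lines of Python. This only restates the description above (frequency and temperature are averaged over all periods, energy is accumulated); the class and field names are hypothetical and do not mirror the plugin's source.

```python
class PeriodicAverages:
    """Running totals since the report was loaded."""

    def __init__(self):
        self.periods = 0
        self.freq_sum = 0.0
        self.temp_sum = 0.0
        self.energy_total = 0.0

    def add_period(self, avg_freq, temp, energy):
        """Record the values collected for one period."""
        self.periods += 1
        self.freq_sum += avg_freq
        self.temp_sum += temp
        self.energy_total += energy

    def averages(self):
        """Return the values that would be written to the avg files."""
        if self.periods == 0:
            raise ValueError("no periods recorded yet")
        return {
            # Sum over all periods divided by the number of periods.
            "avg_freq": self.freq_sum / self.periods,
            "temp": self.temp_sum / self.periods,
            # Energy is accumulated, not averaged.
            "energy": self.energy_total,
        }


if __name__ == "__main__":
    acc = PeriodicAverages()
    acc.add_period(avg_freq=2000000, temp=50, energy=100)
    acc.add_period(avg_freq=2400000, temp=60, energy=150)
    print(acc.averages())
```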

Note: If the cluster contains GPUs, both report_loops and report_applications will generate additional per-GPU schema files, which contain all the collected data for each GPU, with the paths below:
  ◦ /root_directory/cluster/island/jobs/jobID/stepID/nodename/current/GPU-ID/metricFile
  ◦ /root_directory/cluster/island/jobs/jobID/stepID/nodename/avg/GPU-ID/metricFile
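A consumer of these files might read one `current` or `avg` directory into a dictionary, descending into any per-GPU subdirectories. The sketch below assumes each metric file holds a short text value, which this section does not specify. The function name is hypothetical, and the demo builds a throwaway directory with made-up values rather than reading real plugin output.

```python
import os
import tempfile


def read_metric_dir(directory):
    """Map metric file names to their text content.

    Files inside subdirectories (for example a GPU-ID folder) are keyed
    as "<subdirectory>/<metricFile>".
    """
    values = {}
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if os.path.isdir(path):
            for name, text in read_metric_dir(path).items():
                values[f"{entry}/{name}"] = text
        else:
            with open(path, encoding="utf-8") as handle:
                values[entry] = handle.read().strip()
    return values


if __name__ == "__main__":
    # Throwaway layout with made-up values, for demonstration only.
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "0"))  # hypothetical GPU-ID folder
        with open(os.path.join(root, "dc_power_watt"), "w") as f:
            f.write("250\n")
        with open(os.path.join(root, "0", "dc_power_watt"), "w") as f:
            f.write("75\n")
        print(read_metric_dir(root))
```

Because the plugin overwrites each file on every report, a reader like this only ever sees the latest values; keeping a history would require polling and storing the results elsewhere.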