EAR 4.2.1 Reference Manual
EAR offers a set of environment variables that let users tune or request some of EAR's features. They must be exported before the job is submitted, e.g., in the batch script.
The current EAR version supports the SLURM, PBS and OAR batch schedulers. On SLURM systems, the scheduler may filter out environment variables not prefixed with SLURM_ (this happens when the batch script is submitted purging all environment variables in order to work in a clean environment). For that reason, the first design of EAR environment variables used names of the form SLURM_<variable_name>.
Now that EAR supports other batch schedulers, and in order to keep environment variable names coherent, the environment variables below must be prefixed with the name of the scheduler used on the system where the job is submitted, plus an underscore. For example, on SLURM systems the environment variable presented as EAR_LOADER_APPLICATION must be exported as SLURM_EAR_LOADER_APPLICATION in the submission batch script. On a system with OAR installed, this variable would be exported as OAR_EAR_LOADER_APPLICATION. This design may only have a real effect on SLURM systems, but it makes it easier for the development team to support multiple batch schedulers.
All examples showing the usage of the environment variables below assume a system using SLURM.
Tells the EAR Loader to load the EAR Library for a specific application that does not follow any of the programming models currently supported by EAR (for example, a sequential application). Your system must have the non-MPI version of the Library installed (ask your system administrator).
The value of the environment variable must coincide with the job name of the application you want to launch with EAR. If you don't provide a job name, the EAR Loader will compare the value against the executable name. For example:
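A minimal batch-script sketch (the job name my_app and the binary are placeholders, not taken from the original manual):

```bash
#!/bin/bash
#SBATCH --job-name=my_app
#SBATCH --nodes=1

# Make the EAR Loader load the (non-MPI) EAR Library for this application.
# The value must match the job name (or, if no job name is given, the executable name).
export SLURM_EAR_LOADER_APPLICATION=my_app

srun ./my_app
```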
See the Use cases section to read more information about how to run jobs with EAR.
Forces the loading of a specific MPI flavour of the EAR Library. This is needed, for example, when you want to load the EAR Library for Python + MPI applications, where the Loader is not able to detect the MPI implementation the application is going to use. Accepted values are either intel or open mpi. The following example runs the TensorFlow 1 benchmarks for several convolutional neural networks with EAR. They can be downloaded from the TensorFlow benchmarks repository.
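The original example is not reproduced here; the sketch below assumes the variable discussed in this section is named EAR_LOAD_MPI_VERSION and that the benchmarks have been cloned into the working directory:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4

# Assumed variable name: force the Intel MPI flavour of the EAR Library,
# since the Loader cannot detect the MPI implementation of Python + MPI codes.
export SLURM_EAR_LOAD_MPI_VERSION="intel"

# Path inside the cloned TensorFlow benchmarks repository (placeholder).
cd benchmarks/scripts/tf_cnn_benchmarks
srun python tf_cnn_benchmarks.py --model=resnet50 --batch_size=64
```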
See the Use cases section to read more information about how to run jobs with EAR.
Specify a report plug-in to be loaded. The value must be a shared object file located either at $EAR_INSTALL_PATH/lib/plugins/report or at the path from where the job was launched. Alternatively, you can provide the full (absolute or relative) path of the report plug-in.
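For instance, assuming the variable discussed here is named EAR_REPORT_ADD and using a hypothetical plug-in file my_report.so:

```bash
# Assumed variable name: load an extra report plug-in, searched for in
# $EAR_INSTALL_PATH/lib/plugins/report or in the directory the job was launched from.
export SLURM_EAR_REPORT_ADD=my_report.so
```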
Specify a path where EAR creates a file (one per node involved in a job) to print EAR Library messages. This is useful when you run a job on multiple nodes, as EAR verbose information from each of them can result in lots of messages mixed at stderr (EAR's default message channel). In addition, some applications print information to both stdout and stderr, so a user may want to keep EAR information separate.
If the path does not exist, EAR will create it. Generated file names follow the format earl_log.<node_rank>.<local_rank>.<job_step>.<job_id>, where node_rank is an integer set by EAR from 0 to n_nodes - 1 and indicates which node the information belongs to. local_rank is an arbitrary rank, set by EAR, of a process within the node (from 0 to n_processes_in_node - 1). It indicates which process is printing messages to the file, and it will always be the first one indexed, i.e., 0. Finally, job_step and job_id identify the job execution the messages were generated from.
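A sketch of a two-node job using this feature, assuming the variable discussed here is named EARL_VERBOSE_PATH and that the EAR SLURM plug-in flag --ear-verbose is available; ear_logs_dir_name is a placeholder directory:

```bash
#!/bin/bash
#SBATCH --job-name=my_app
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32

# Assumed variable name: write EAR Library messages to per-node files inside
# ear_logs_dir_name instead of mixing them all at stderr.
export SLURM_EARL_VERBOSE_PATH=ear_logs_dir_name

# Enable EARL verbosity so there are messages to log (plug-in flag assumed).
srun --ear-verbose=1 ./my_app
```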
After the above job example completes, the directory from where the application was submitted will contain a directory called ear_logs_dir_name with two files (one for each node), called earl_logs.0.0.<job_step>.<job_id> and earl_logs.1.0.<job_step>.<job_id>, respectively.
Set a GPU frequency (in kHz) to be fixed while your job is running. The same frequency is set for all GPUs used by the job.
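For instance, assuming the variable discussed here is named EAR_GPU_DEF_FREQ, fixing all the job's GPUs at 1.44 GHz would look like this:

```bash
# Assumed variable name: fix the frequency (in kHz) of all GPUs used by the job.
export SLURM_EAR_GPU_DEF_FREQ=1440000
```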
Indicates whether the job will run on a node exclusively (non-zero value). EAR will reduce the CPU frequency of the cores not used by the job. This feature exploits a very simple power-saving opportunity.
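Assuming the variable discussed here is named EAR_JOB_EXCLUSIVE_MODE, enabling it is just:

```bash
# Assumed variable name: a non-zero value tells EAR the node is used exclusively,
# so the CPU frequency of unused cores can be lowered.
export SLURM_EAR_JOB_EXCLUSIVE_MODE=1
```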
EARL offers the possibility to control the Integrated Memory Controller (IMC) frequency on Intel(R) architectures and the Infinity Fabric (IF) frequency on AMD architectures. On this page we use the term uncore to refer to both of them. The environment variables related to uncore control cover policy-specific settings as well as the possibility for a user to fix the uncore frequency during an entire job.
Enables/disables EAR's eUFS feature. Type ear-info to see whether eUFS is enabled by default.
You can control eUFS' maximum permitted time penalty by exporting EAR_POLICY_IMC_TH, a float indicating the threshold value that prevents the policy from lowering the uncore frequency too much, which could otherwise lead to a considerable performance penalty.
The example below enables eUFS with a penalty threshold of 3.5%:
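A minimal sketch, assuming the enabling variable is named EAR_SET_IMCFREQ and that the threshold is expressed as a fraction (3.5% → 0.035):

```bash
export SLURM_EAR_SET_IMCFREQ=1        # enable eUFS (assumed variable name)
export SLURM_EAR_POLICY_IMC_TH=0.035  # maximum permitted time penalty of 3.5%
```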
Set the maximum and minimum values (in kHz) between which the uncore frequency should stay. Two variables were designed because Intel(R) architectures allow setting a frequency range that limits their internal UFS mechanism. If you set the two variables to different values, the minimum one will be set.
The example below shows a job execution fixing the uncore frequency at 2.0 GHz:
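A sketch assuming the limits are controlled through variables named EAR_MAX_IMCFREQ and EAR_MIN_IMCFREQ (values in kHz):

```bash
# Assumed variable names: setting both limits to the same value fixes
# the uncore frequency, here at 2.0 GHz (2000000 kHz).
export SLURM_EAR_MAX_IMCFREQ=2000000
export SLURM_EAR_MIN_IMCFREQ=2000000

srun ./my_app
```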
By default, EAR policies try to set the best CPU (and uncore, if enabled) frequency according to node-grain metrics. This behaviour can be changed by telling EAR to detect and deal with unbalanced workloads, i.e., workloads where there is no equity between processes regarding their MPI/computational activity.
When EAR detects such behaviour, policies slightly modify their CPU frequency selection by setting a different frequency for each process' cores according to how far the process is from the critical path. Please contact ear-support@bsc.es if you want more details about how it works.
A correct CPU binding is required to get the most benefit from this feature. Check the documentation of your application's programming model/vendor/flavour or your system's batch scheduler.
Enables/Disables EAR's Load Balance strategy in energy policies. Type ear-info
to see whether this feature is enabled by default.
The load unbalance detection algorithm is based on POP-CoE's Load Balance Efficiency metric, which is computed as the ratio between the average useful computation time (across all processes) and the maximum useful computation time (also across all processes). By default (if EAR_LOAD_BALANCE is enabled), a node load balance efficiency below 0.8 will trigger EAR's Load Balancing algorithm. This threshold value can be modified by setting the EAR_LOAD_BALANCE_TH
environment variable. For example, if you want EAR to be more sensitive to your application's load unbalance, so that per-process CPU frequency selection is triggered more easily, you can increase the load balance threshold:
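A sketch of raising the threshold (the value 0.9 is only illustrative):

```bash
# Trigger EAR's Load Balancing algorithm whenever the node load balance
# efficiency drops below 0.9 instead of the default 0.8.
export SLURM_EAR_LOAD_BALANCE_TH=0.9
```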
Since version 4.2, EAR supports interaction with Intel(R) Speed Select Technology (Intel(R) SST), which lets the user have more fine-grained control over the per-CPU Turbo frequency. This feature gives users more control over the performance (and power consumption) of the CPUs running their applications and jobs. It is available on selected SKUs of Intel(R) Xeon(R) Scalable processors. For more information about Intel(R) SST, useful links to the official documentation are listed below:
EAR offers two environment variables that let you specify a list of priorities (CLOS) in two different ways. The first one sets a CLOS for each task involved in the job; the second one sets a list of priorities per CPU involved in the job. Values must be within the range of available CLOS that Intel(R) SST provides.
If either of the two supported environment variables is set, EAR will set up all of its internals transparently, provided the architecture supports it, and will restore the original configuration when the job ends. If Intel(R) SST is not supported, the variables have no effect. If you enable EARL verbosity you will see the mapping of the CLOS set for each CPU in the node. Note that a -1 value means that no change was made on that specific CPU.
A list specifying the CLOS to be set for the CPUs assigned to each task. This variable is useful because you can configure your application transparently without worrying about the affinity mask the scheduler assigns to your tasks. You can use this variable when you know (or can guess) the workload of your application's tasks and you want to tune it by manually setting different Turbo priorities. Note that you still need to ensure that different tasks do not share CPUs.
For example, imagine you want to submit a job that runs an MPI application with 16 tasks, each one pinned to a single core, on a two-socket Intel(R) Xeon(R) Platinum 8352Y with 32 cores per socket and Hyper-threading enabled, i.e., each task will run on two CPUs and 32 of the 128 CPUs will be allocated to this application. Below is a (simplified) batch script that submits this example:
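The original script is not reproduced here. The sketch below assumes the task-level variable is named EAR_PRIO_TASKS (comma-separated list) and that the EAR SLURM plug-in flags --ear-cpufreq and --ear-verbose are available; the turbo frequency value is a placeholder for this SKU:

```bash
#!/bin/bash
#SBATCH --job-name=sst_example
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=2   # Hyper-threading: each core exposes two CPUs

# Assumed variable name: one CLOS value per task, in task-rank order.
# Tasks 0-3 -> CLOS 0, 4-7 -> CLOS 1, 8-11 -> CLOS 2, 12-15 -> CLOS 3.
export SLURM_EAR_PRIO_TASKS=0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3

# Bind each task to one core, request the turbo frequency (kHz, placeholder value)
# and enable EAR verbosity so the per-CPU CLOS mapping is printed.
srun --cpu-bind=core --ear-cpufreq=2201000 --ear-verbose=1 ./my_mpi_app
```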
The above script sets CLOS 0 to tasks 0 to 3, CLOS 1 to tasks 4 to 7, CLOS 2 to tasks 8 to 11 and CLOS 3 to tasks 12 to 15. The srun command binds each task to one core (through the --cpu-bind flag), sets the turbo frequency and enables EAR verbosity. Below is the output message shown by the batch scheduler (i.e., SLURM):
We can see here that SLURM spread the tasks across the two sockets of the node, e.g., task 0 runs on CPUs 0 and 64, and task 1 runs on CPUs 32 and 96. The output below shows how EAR sets and reports the CLOS list per CPU in the node. Following the same example, you can see that CPUs 0, 64, 32 and 96 have priority/CLOS 0. Note that the CPUs not involved in the job show a -1.
A list of priorities that should have the same length as the number of CPUs your job is using. This configuration lets you set the CPUs' CLOS at a lower level: the n-th priority value of the list sets the priority of the n-th CPU your job is using.
This way of configuring priorities requires the user to know exactly the affinity of the job's tasks before launching the application, so it becomes harder to use if your goal is the same as what you can achieve with the previous environment variable: task-focused CLOS setting. But it is more flexible when the user has full control over the affinity set for the application, because you can discriminate between different CPUs assigned to the same task. Moreover, this is the only way to set different priorities on different threads in non-MPI applications.
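As an illustration, assuming this variable is named EAR_PRIO_CPUS and that the job uses 8 CPUs:

```bash
# Assumed variable name: the n-th value sets the CLOS of the n-th CPU used by the job.
export SLURM_EAR_PRIO_CPUS=0,0,1,1,2,2,3,3
```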
For both [Load Balancing](load-balancing) and Intel(R) SST support, EAR uses the processes' affinity mask read at the beginning of the job. If you are working with an application that changes (or may change) the affinity mask of its tasks, this can lead to a misconfiguration not detected by EAR. To avoid any unexpected problem, we highly recommend exporting the EAR_NO_AFFINITY_MASK environment variable (even if you are not planning to use any of the mentioned features).
Use this variable to generate two files at the end of the job execution containing global, per-process MPI information. You must specify the prefix (optionally with a path) of the file names. One file ([path/]prefix.ear_mpi_stats.full_nodename.csv) will contain a summary of per-process MPI throughput, while the other one ([path/]prefix.ear_mpi_calls_stats.full_nodename.csv) will contain more fine-grained information about the different MPI call types. Here is an example:
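A sketch consistent with the description below, assuming the variable discussed here is named EAR_GET_MPI_STATS; the directory and prefix are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=mpi_job_name
#SBATCH --nodes=2

# Assumed variable name: prefix (optionally with a path) for the two per-node
# MPI statistics files generated at the end of the job.
mkdir -p ${SLURM_JOB_ID}-mpi_stats   # make sure the output directory exists
export SLURM_EAR_GET_MPI_STATS=${SLURM_JOB_ID}-mpi_stats/mpi_job_name

srun ./my_mpi_app
```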
At the end of the job, two files will be created at the directory named *<job_id>-mpi_stats* located in the same directory where the application was submitted. They will be named mpi_job_name.ear_mpi_stats.full_nodename.csv and mpi_job_name.ear_mpi_calls_stats.full_nodename.csv. File pairs will be created for each node involved in the job.
Take into account that each process appends its own MPI statistics to the files. This behavior does not guarantee that the header will be on the first line, as only one process writes it. You must move it to the top of each file manually before reading the files with the tool you use to visualize and work with CSV data, e.g., a spreadsheet, or an R or Python package.
The table below shows the fields available in the ear_mpi_stats file:
Field | Description |
---|---|
mrank | EAR's internal node ID used to identify the node. |
lrank | EAR's internal rank ID used to identify the process. |
total_mpi_calls | The total number of MPI calls. |
exec_time | The execution time, in microseconds. |
mpi_time | The time spent in MPI calls, in microseconds. |
perc_mpi_time | The percentage of total execution time (i.e., exec_time) spent in MPI calls. |
The table below shows the fields available in the ear_mpi_calls_stats file:
Field | Description |
---|---|
Master | EAR's internal node ID used to identify the node. |
Rank | EAR's internal rank ID used to identify the process. |
Total MPI calls | The total number of MPI calls. |
MPI_time/Exec_time | The ratio between time spent in MPI calls and the total execution time. |
Exec_time | The execution time, in microseconds. |
Sync_time | Time spent (in microseconds) in blocking synchronization calls, i.e., MPI_Wait, MPI_Waitall, MPI_Waitany, MPI_Waitsome and MPI_Barrier. |
Block_time | Time spent in blocking calls, i.e., MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Barrier, MPI_Bcast, MPI_Bsend, MPI_Cart_create, MPI_Gather, MPI_Gatherv, MPI_Recv, MPI_Reduce, MPI_Reduce_scatter, MPI_Rsend, MPI_Scan, MPI_Scatter, MPI_Scatterv, MPI_Send, MPI_Sendrecv, MPI_Sendrecv_replace, MPI_Ssend and all Wait calls of Sync_time field. |
Collec_time | Time spent in blocking collective calls, i.e., MPI_Allreduce, MPI_Reduce and MPI_Reduce_scatter. |
Total MPI sync calls | Total number of synchronization calls. |
Total blocking calls | Total number of blocking calls. |
Total collective calls | Total number of collective calls. |
Gather | Total number of blocking Gather calls, i.e., MPI_Allgather, MPI_Allgatherv, MPI_Gather and MPI_Gatherv. |
Reduce | Total number of blocking Reduce calls, i.e., MPI_Allreduce, MPI_Reduce and MPI_Reduce_scatter. |
All2all | Total number of blocking All2all calls, i.e., MPI_Alltoall and MPI_Alltoallv. |
Barrier | Total number of blocking Barrier calls, i.e., MPI_Barrier. |
Bcast | Total number of blocking Bcast calls, i.e., MPI_Bcast. |
Send | Total number of blocking Send calls, i.e., MPI_Bsend, MPI_Rsend, MPI_Send and MPI_Ssend. |
Comm | Total number of blocking Comm calls, i.e., MPI_Cart_create. |
Receive | Total number of blocking Receive calls, i.e., MPI_Recv. |
Scan | Total number of blocking Scan calls, i.e., MPI_Scan. |
Scatter | Total number of blocking Scatter calls, i.e., MPI_Scatter and MPI_Scatterv. |
SendRecv | Total number of blocking SendRecv calls, i.e., MPI_Sendrecv, MPI_Sendrecv_replace. |
Wait | Total number of blocking Wait calls, i.e., all MPI_Wait calls. |
t_Gather | Time (in microseconds) spent in blocking Gather calls. |
t_Reduce | Time (in microseconds) spent in blocking Reduce calls. |
t_All2all | Time (in microseconds) spent in blocking All2all calls. |
t_Barrier | Time (in microseconds) spent in blocking Barrier calls. |
t_Bcast | Time (in microseconds) spent in blocking Bcast calls. |
t_Send | Time (in microseconds) spent in blocking Send calls. |
t_Comm | Time (in microseconds) spent in blocking Comm calls. |
t_Receive | Time (in microseconds) spent in blocking Receive calls. |
t_Scan | Time (in microseconds) spent in blocking Scan calls. |
t_Scatter | Time (in microseconds) spent in blocking Scatter calls. |
t_SendRecv | Time (in microseconds) spent in blocking SendRecv calls. |
t_Wait | Time (in microseconds) spent in blocking Wait calls. |
EAR offers the chance to generate Paraver traces to visualize runtime metrics with the Paraver tool. Paraver is a visualization tool developed by CEPBA-Tools team and currently maintained by the Barcelona Supercomputing Center’s tools team.
The EAR trace generation mechanism was designed to support different trace generation plug-ins, although the Paraver trace plug-in is the only one supported for now. You must set the value of this variable to tracer_paraver.so to load the tracer. This shared object comes with the official EAR distribution and is located at $EAR_INSTALL_PATH/lib/plugins/tracer. Then you need to set the EAR_TRACE_PATH variable (see below) to specify the destination path of the generated Paraver traces.
Specify the path where you want to store the trace files generated by the EAR Library. The path must already exist; otherwise, the Paraver tracer plug-in won't be loaded.
Here is an example of the usage of the environment variables explained above:
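A sketch assuming the tracer plug-in variable is named EAR_TRACE_PLUGIN; EAR_TRACE_PATH is the variable described above:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8

# Assumed variable name: load the Paraver tracer plug-in shipped with EAR.
export SLURM_EAR_TRACE_PLUGIN=tracer_paraver.so

# The destination path must already exist, otherwise the plug-in is not loaded.
mkdir -p $PWD/paraver_traces
export SLURM_EAR_TRACE_PATH=$PWD/paraver_traces

srun ./my_mpi_app
```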
Use this variable (i.e., export SLURM_REPORT_EARL_EVENTS=1) to make EARL send internal events to the [Database](EAR-Database). These events are useful to get more information about the Library's behaviour, such as when DynAIS is turned off, the computational phase EAR estimates the application is in, or the status of the applied policy. You can query job-specific events through eacct -j <JobID> -x, and you will get a table with all the reported events:
Field name | Description |
---|---|
Event_ID | Internal ID of the event stored at the Database. |
Timestamp | The time when the event was created, in yyyy-mm-dd hh:mm:ss format. |
Event_type | The kind of event. Possible event types are explained below. |
Job_id | The JobID of the event. |
Value | The value stored with the event. Categorical event values are explained below. |
node_id | The node from where the event was reported. |
Below are listed all the event types you can get when requesting job events. For categorical event values, the (value, category) mapping is explained.
The above event types may be useful only for advanced users. Please contact ear-support@bsc.es if you want to know more about EARL internals.