Access GPU RAM in TensorFlow 2.0 | by Simone Appella | Feb, 2024

It is essential to constantly monitor the available GPU memory, both while allocating variables and during training.

Many machine learning frameworks for GPU computation (e.g. PyTorch, TensorFlow) are used for a variety of AI-related tasks.

If one or more GPUs are the main tool for the task, it is fundamental to know the hardware specifications.

NVIDIA System Management Interface

The NVIDIA System Management Interface (nvidia-smi) is a command line tool used to monitor the NVIDIA GPU devices.

This is a snapshot of the output of the nvidia-smi command from a Linux terminal:

Information such as the driver version, the CUDA version and the available memory can be easily accessed through that command.
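The same memory figures can also be read from a script instead of the terminal. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names parse_memory_csv and query_gpu_memory are hypothetical and only serve the illustration.

```python
import subprocess


def parse_memory_csv(csv_text):
    """Parse lines of 'used, total' (in MiB) into a list of (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return [(used_mib, total_mib), ...], one tuple per GPU."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_csv(output)
```

Calling query_gpu_memory() on a machine with an NVIDIA driver installed returns one (used, total) pair per GPU; parse_memory_csv can be tried on its own with a string such as "1024, 16384".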

Unfortunately, TensorFlow reserves all the GPU memory by default, so nvidia-smi reports the memory reserved by the process rather than the memory actually in use, which makes it useless for monitoring.

Access real-time GPU memory usage in Python in Tensorflow 2.0

When running a Python script with TensorFlow 2.0, two issues arise:

  • All the variables are allocated to the GPU device with the lowest number.
  • By default, TensorFlow allocates all the available GPU memory when a variable is instantiated.
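The first point can be checked, and overridden, with an explicit device scope. This is a small sketch rather than a recommended setup: it places a variable on the last visible GPU if there is one and falls back to the CPU otherwise, then prints where the variable ended up. The variable names are chosen for illustration.

```python
import tensorflow as tf

gpus = tf.config.list_logical_devices("GPU")
# Pick the last GPU if one exists, otherwise fall back to the CPU.
target = gpus[-1].name if gpus else "/CPU:0"

with tf.device(target):
    v = tf.Variable(tf.zeros([256, 256]))

# The full device string shows where the variable actually lives.
print("requested:", target, "placed on:", v.device)
```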

The function tf.config.experimental.get_memory_info can be used to read the memory actually in use. It returns a dictionary with two keys:

  • current: The current memory used by the device, in bytes.
  • peak: The peak memory used by the device across the run of the program, in bytes. This is the value of interest, as it shows the memory actually used rather than the allocated memory visible from nvidia-smi.
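To measure the peak of one specific piece of code rather than of the whole run, the peak counter can be reset just before that code executes. The sketch below assumes a recent TensorFlow 2.x release that provides tf.config.experimental.reset_memory_stats, and assumes the device is named GPU:0; peak_memory_mib is a hypothetical helper.

```python
import tensorflow as tf

DEVICE = "GPU:0"  # assumed device name
MIB = 1024 ** 2


def peak_memory_mib(fn, device=DEVICE):
    """Run fn() and return the peak device memory (MiB) seen during the call.

    Returns None on machines without a GPU, where GPU:0 does not exist.
    """
    if not tf.config.list_physical_devices("GPU"):
        fn()
        return None
    # Resetting the statistics sets the peak back to the current usage.
    tf.config.experimental.reset_memory_stats(device)
    fn()
    return tf.config.experimental.get_memory_info(device)["peak"] / MIB


peak = peak_memory_mib(
    lambda: tf.matmul(tf.ones([2048, 2048]), tf.ones([2048, 2048]))
)
print("peak MiB during matmul:", peak)
```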

In older versions of TensorFlow, tf.config.experimental.get_memory_usage('DEVICE_NAME') was the only available function to return the used memory (with no option for determining the peak memory).

import tensorflow as tf

physical_devices = tf.config.list_physical_devices('GPU')

# use GPU:0 as device name to get memory_info
memory_info = tf.config.experimental.get_memory_info('GPU:0')

Memory settings applied to these physical devices can be used to limit the maximum memory usage while executing a script.
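Two settings are commonly used for this: memory growth, which makes TensorFlow allocate on demand, and a hard per-device memory limit. The sketch below shows one possible way to apply either of them; configure_gpu_memory and the 2048 MB limit are assumptions made for the example, not values from this article. The two options cannot be combined on the same device, and both must be applied before the GPU is first used.

```python
import tensorflow as tf


def configure_gpu_memory(memory_limit_mb=None):
    """Apply a memory policy to every visible GPU and return how many were found."""
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        if memory_limit_mb is None:
            # Allocate GPU memory on demand instead of reserving it all up front.
            tf.config.experimental.set_memory_growth(gpu, True)
        else:
            # Expose one logical device capped at memory_limit_mb megabytes.
            tf.config.set_logical_device_configuration(
                gpu,
                [tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit_mb)],
            )
    return len(gpus)


# Call this before any tensor or variable is placed on a GPU.
num_gpus = configure_gpu_memory(memory_limit_mb=2048)
print("configured GPUs:", num_gpus)
```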

How can we monitor the GPU usage inside a function compiled into a computational graph (that is, decorated with tf.function)? According to the official documentation, we can call tf.config.run_functions_eagerly(True) to make all invocations of tf.function run eagerly instead of running as a traced graph function. This can be useful for debugging. As the code now runs line by line, print messages or pdb breakpoints can be added at any line to monitor the inputs and outputs of each TensorFlow operation. However, this should be avoided in production because it significantly slows down execution.
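As a rough illustration of this debugging workflow, the sketch below prints from inside a decorated function while eager execution is forced, then restores graph execution. The function dense_step and the tensor shapes are made up for the example, and the memory readout assumes the device is named GPU:0.

```python
import tensorflow as tf


@tf.function
def dense_step(x, w):
    y = tf.matmul(x, w)
    # With eager execution forced, this Python code runs on every call,
    # so memory can be inspected between individual operations.
    if tf.executing_eagerly() and tf.config.list_physical_devices("GPU"):
        info = tf.config.experimental.get_memory_info("GPU:0")
        print("after matmul:", info["current"], "bytes in use")
    return tf.reduce_sum(y)


tf.config.run_functions_eagerly(True)   # debugging only: slows execution down
result = dense_step(tf.ones([4, 8]), tf.ones([8, 2]))
tf.config.run_functions_eagerly(False)  # restore traced graph execution

print("result:", float(result))
```

Without the call to run_functions_eagerly(True), the Python code inside dense_step would only run while the function is being traced, not on each invocation.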
