PyTorch vs. TensorFlow for Transformer-Based NLP Applications | by Mathieu Lemay


Like many things in the AI sphere, the opportunity lies in how fast you can change and adapt for improved performance. BERT and its derivatives have most definitely established a new baseline. It is large and in charge. (In fact, we’ve recently had so many BERT-based projects launch at the same time that we needed company-wide training just to make sure everyone had the same programming style.)

Another one of our companies recently went through a few headaches with its TensorFlow-based models that, hopefully, you’ll get to learn from. Below are some of the lessons we took away from that project.

Shopping for models sometimes feels like browsing a marketplace. Photo by NICE GUYS on Pexels.com.

If you want to use models whose publications are hot off the press, you’ll still be going through GitHub. Otherwise, you can go straight to transformer model repository hubs, such as HuggingFace, TensorFlow Hub, and PyTorch Hub.

When BERT first came out, it was a bit clunky to get up and running. That’s largely moot now that HuggingFace has pushed to consolidate transformer models into a single library. Since most (almost all) models are readily retrievable from HuggingFace, the first and primary source for anything transformers, there are far fewer questions these days around model availability.

However, there are still cases of models being available only on proprietary repositories. For example, the Universal Sentence Encoder by Google still seems to be available only on TensorFlow Hub. (At the time of its release, this was one of the best word and sentence embedding models out there, so this was an issue, but it has since been superseded by the likes of MPNet and Sentence-T5.)
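Pulling the Universal Sentence Encoder, for instance, means going through tensorflow_hub rather than transformers. A minimal sketch, using the commonly published /4 module:

```python
import tensorflow_hub as hub

# The Universal Sentence Encoder is distributed through TensorFlow Hub,
# not HuggingFace, so using it ties you to the TensorFlow stack.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentence_embeddings = embed(["A sample sentence.", "Another one."])
print(sentence_embeddings.shape)  # (2, 512)
```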

At the time of writing, there were 2,669 TensorFlow models on HuggingFace, compared to a whopping 31,939 PyTorch models. This is mainly due to newer models being published as PyTorch models first; there is an academic preference for PyTorch, albeit not a universal one.
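If you want to check the current split yourself, the hub tags every model by framework, and those tags can be queried. A rough sketch with the huggingface_hub client (counting this way enumerates every model, so it is slow, and the numbers move daily):

```python
from huggingface_hub import HfApi

api = HfApi()

# Models are tagged "pytorch" or "tf" on the hub; counting the tags
# gives a rough framework split.
pytorch_count = sum(1 for _ in api.list_models(filter="pytorch"))
tensorflow_count = sum(1 for _ in api.list_models(filter="tf"))
print(f"PyTorch: {pytorch_count}, TensorFlow: {tensorflow_count}")
```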

Takeaway: There are more models for PyTorch, but the main ones are available on both frameworks.

Pure, unadulterated firepower. Photo by Nana Dua on Pexels.com

It’s no surprise that these leviathanic models have tremendous compute requirements, and GPUs will be involved at various points in both the training and inference cycles. Additionally, you’re probably using these models as part of an NLP/document intelligence pipeline, with other libraries fighting for GPU space during pre-processing or custom classifiers.

Thankfully, many popular libraries already use TensorFlow or PyTorch in their backend, so playing nice with other models *should* be easy. SpaCy and Flair, for example, two popular NLP libraries, run primarily* on PyTorch (1, 2).

*Note: SpaCy uses Thinc for interchangeability between frameworks, but we noticed more stability, native support, and reliability if we stuck with the base PyTorch models.

It’s much easier to share a GPU between custom BERT models and library-specific models within a single framework. If you can share a GPU, then deployment costs go down. (More on this later in “Quantization”.) In an ideal deployment, there are sufficient resources for every library to be scaled effectively; in reality, the compute-versus-cost constraints kick in very quickly.

If you’re running a multi-step deployment (let’s say document intelligence), then you’ll have some functions that benefit from moving to the GPU, such as sentencizing and classification.
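As a rough illustration (the model names here are placeholders, not what we run in production), spaCy can claim the GPU through Thinc/PyTorch while a custom HuggingFace classifier sits on the same device:

```python
import spacy
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Let spaCy run its pipeline on the GPU via Thinc/PyTorch, if one is present.
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")  # placeholder pipeline

# A custom BERT classifier sharing the same card.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
classifier = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").to(device)

doc = nlp("First sentence. Second sentence for the sentencizer.")
for sent in doc.sents:
    inputs = tokenizer(sent.text, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = classifier(**inputs).logits
```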

PyTorch allocates GPU memory incrementally and usually reserves only what a given model actually needs. From their CUDA semantics documentation:

PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as if used in nvidia-smi. You can use memory_allocated() and max_memory_allocated() to monitor memory occupied by tensors, and use memory_reserved() and max_memory_reserved() to monitor the total amount of memory managed by the caching allocator. Calling empty_cache() releases all unused cached memory from PyTorch so that those can be used by other GPU applications. However, the occupied GPU memory by tensors will not be freed so it can not increase the amount of GPU memory available for PyTorch.
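In practice, this means you can watch exactly how much of the card PyTorch is holding on to. A small sketch using the calls named above (the Linear layer is just a stand-in for a real model):

```python
import torch

device = torch.device("cuda")
model = torch.nn.Linear(1024, 1024).to(device)  # stand-in for a real model

print(torch.cuda.memory_allocated(device))  # bytes currently held by tensors
print(torch.cuda.memory_reserved(device))   # bytes held by the caching allocator

# Drop the model and release cached blocks so other GPU processes can use them.
del model
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved(device))
```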

Compare this with TensorFlow, which by default takes over the GPU’s memory completely; you have to opt in to incremental allocation explicitly with tf.config.experimental.set_memory_growth():

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation. To limit TensorFlow to a specific set of GPUs, use the tf.config.set_visible_devices method.
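A minimal sketch of that configuration, pinning the process to one GPU and enabling incremental growth (this has to run before any tensors are placed on the device):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Only expose the first GPU to this process.
    tf.config.set_visible_devices(gpus[0], "GPU")
    # Grow memory usage as needed instead of claiming the whole card up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)
```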

Takeaway: Both frameworks have multi-model deployment capabilities on a single GPU, but TensorFlow’s memory is slightly less well managed. Use caution.

Photo by Pixabay on Pexels.com.

Quantization primarily involves converting Float32 weights to Int8 or UInt8 (and sometimes 16-bit) representations, reducing both the model size and the number of bits required to complete a single computation; it is a well-accepted model compression technique. This is analogous to the pixelation and color loss of compressed images. It also has to account for the distribution of weights, with both TensorFlow and PyTorch supporting static (fixed-range) and dynamic-range quantization in their general model generation pipelines.

The main reason quantization is a worthwhile step in model performance optimization is that the cost of lost speed (increased latency) typically outweighs the cost of lost quality (such as a drop in F1). Another way of putting it: “good now is better than better later”.

We’ve anecdotally seen average F1-score drops of 0.005 after post-training quantization (as opposed to 0.03–0.05 for in-training quantization), an acceptable drop in quality for most of our clients and our main applications, especially if this meant running on much cheaper infrastructure and within a reasonable time frame.

An example: considering the volume of text that we analyze in our AuditMap application, most of the risk insights we identify are valuable because of the speed at which we’re able to retrieve them, signaling to our auditor and risk manager clients what their risk landscape actually looks like. Most of our models’ F1-scores fall between 0.85 and 0.95, completely acceptable for decision support based on analysis at scale.

These models do need to train and (usually) run on GPUs to be effective. However, if we wanted to run these models on CPU only, we would need to move away from a Float32 representation to an int8 or uint8 one to run within an acceptable time frame. From my experiments and the examples I could find, I’ll limit the scope of my observations to the following:

I have not been able to find a simple or direct mechanism to quantize TensorFlow-based HuggingFace models.
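The closest general-purpose route I’m aware of goes through the TFLite converter. A sketch, assuming a Keras-style model from HuggingFace (getting a full transformer graph through the converter cleanly is exactly where things get complicated):

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Hypothetical starting point: a TensorFlow BERT classifier from HuggingFace.
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Dynamic-range quantization: weights become int8, activations stay float.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()  # transformer graphs often fight back here

with open("bert_dynamic_quant.tflite", "wb") as f:
    f.write(tflite_model)
```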

Compare this with PyTorch:

A quick example I wrote of dynamic quantization in PyTorch.
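In essence, it comes down to a single call to torch.quantization.quantize_dynamic. A minimal sketch along those lines, assuming a HuggingFace BERT classifier as the starting point (the model name is a placeholder):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder for a fine-tuned BERT classifier.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

# Dynamic quantization: nn.Linear weights are converted to int8,
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model runs on CPU and is several times smaller on disk.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Quantization keeps CPU inference affordable.", return_tensors="pt")
with torch.no_grad():
    logits = quantized_model(**inputs).logits
```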

Takeaway: Quantization in PyTorch is a single line of code, ready to be deployed to CPU machines. TensorFlow is… less streamlined.


