Optimizing and Accelerating AI Inference with the TensorRT Container from NVIDIA NGC
Natural language processing (NLP) is one of the most challenging tasks for AI because it must understand context, phonemes, and accents to turn human speech into text. Building this AI workflow starts with training a model that can understand and process spoken language to text.
BERT is one of the best models for this task. Instead of building a state-of-the-art model like BERT from scratch, you can fine-tune a pretrained BERT model for your specific use case and pair it with NVIDIA Triton Inference Server. There are two BERT-based models available:
- BERT-Base, with 12 layers, 12 attention heads, and 110 million parameters
- BERT-Large, with 24 layers, 16 attention heads, and 340 million parameters
Many of these parameters are sparse, and the large parameter count lowers inference throughput. This post uses BERT inference as an example to show how to use the TensorRT container from NVIDIA NGC and improve inference performance with your AI models.
Prerequisites
This post uses the following resources:
- The TensorFlow container for GPU-accelerated training
- A system with up to eight NVIDIA GPUs, such as DGX-1
  - Other NVIDIA GPUs can be used, but the training time varies with the number and type of GPU.
  - GPU-based instances are available on all major cloud service providers.
- NVIDIA Docker
- The latest CUDA driver
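Before pulling the assets, you can optionally verify that Docker can access the GPUs. This is a minimal sketch; the CUDA base image tag is an assumption and any recent CUDA image works:
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi   # should list the GPUs on the system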
Get the assets from NGC
Before you can start the BERT optimization process, you must obtain a few assets from NGC:
- A fine-tuned BERT-large model
- Model scripts for running inference with the fine-tuned model, in TensorFlow
Fine-tuned BERT-Large model
If you followed our previous post, Jump-start AI Training with NGC Pretrained Models On-Premises and in the Cloud, you’ll see that we are using the same fine-tuned model for optimization.
If you didn’t get a chance to fine-tune your own model, make a directory and download the pretrained model files. You have several download options.
Option 1: Download from the command line using the following commands. In the terminal, use wget to download the fine-tuned model:
mkdir bert_model && cd bert_model
wget https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v1_1_large_fp16_384/versions/2/files/bert_config.json
wget https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v1_1_large_fp16_384/versions/2/files/model.ckpt-5474.data-00000-of-00001
wget https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v1_1_large_fp16_384/versions/2/files/model.ckpt-5474.index
wget https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v1_1_large_fp16_384/versions/2/files/model.ckpt-5474.meta
wget https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v1_1_large_fp16_384/versions/2/files/vocab.txt
Option 2: Download from the NGC website.
- In your browser, navigate to the model repo page.
- In the top right corner, choose Download.
- After the zip file finishes downloading, unzip the files.
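Option 3: If you have the NGC CLI installed and configured, you can also download the model with the ngc registry model download-version command (the same command explored later in this post). The model string below is inferred from the wget URLs above:
ngc registry model download-version "nvidia/bert_tf_v1_1_large_fp16_384:2"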
Refer to the directory where the fine-tuned model is saved as $MODEL_DIR. It can be the model that you saved from our previous post, or the model that you just downloaded.
When you are in this directory, export it:
export MODEL_DIR=$PWD
cd ..
Model scripts for running inference with the fine-tuned model
Use the following scripts to measure the performance of BERT inference in TensorFlow. To download the model scripts:
- In your browser, navigate to the model scripts page.
- At the top right, choose Download.

Figure 1. BERT inference model in TensorFlow from NGC.
Alternatively, the model script can be downloaded using git from the NVIDIA Deep Learning Examples on GitHub:
mkdir bert_tf && cd bert_tf
git clone https://github.com/NVIDIA/DeepLearningExamples.git
You are doing TensorFlow inference from the BERT directory. Whether you downloaded using the NGC webpage or GitHub, refer to this directory moving forward as $BERT_DIR.
Export this directory as follows:
export BERT_DIR=$PWD'/DeepLearningExamples/TensorFlow/LanguageModeling/BERT/'
cd ..
Before cloning the TensorRT GitHub repo, run the following command:
mkdir bert_trt && cd bert_trt
To get the scripts required for converting the BERT TensorFlow model to TensorRT and running inference with it, follow the steps in Downloading the TensorRT Components. Make sure that the directory locations are correct:
- $MODEL_DIR—Location of the BERT model checkpoint files.
- $BERT_DIR—Location of the BERT TF scripts.
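As a rough sketch of those steps (the branch name and the $TRT_RELEASE location are assumptions; follow the linked instructions for the authoritative commands), cloning the TensorRT open-source repo and setting the environment variables used later looks like the following:
git clone -b release/7.1 https://github.com/NVIDIA/TensorRT.git   # assumption: branch matching TensorRT 7.1
cd TensorRT && git submodule update --init --recursive
export TRT_SOURCE=`pwd`
export TRT_RELEASE=<path to the extracted TensorRT 7.1 GA build>   # assumption: downloaded separately from developer.nvidia.com
cd ..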
TensorFlow performance evaluation
In this section, you build, run, and evaluate the performance of BERT in TensorFlow.
Set up and run a Docker container
Build the Docker container by running the following command:
docker build $BERT_DIR -t bert
Launch the BERT container, with two mounted volumes:
- One volume for the BERT model scripts code repo, mounted to /workspace/bert.
- One volume for the fine-tuned model that you either fine-tuned yourself or downloaded from NGC, mounted to /finetuned-model-bert.
docker run --gpus all -it \
-v $BERT_DIR:/workspace/bert \
-v $MODEL_DIR:/finetuned-model-bert \
bert
Prepare the dataset
You are evaluating the BERT model using the SQuAD dataset. For more information, see SQuAD1.1: The Stanford Question Answering Dataset.
export BERT_PREP_WORKING_DIR="/workspace/bert/data"
python3 /workspace/bert/data/bertPrep.py --action download --dataset squad
If the line import PubMedTextFormatting gives any errors in the bertPrep.py script, comment that line out, as you don’t need the PubMed dataset in this example.
This script downloads two folders in $BERT_PREP_WORKING_DIR/download/squad/: v2.0/ and v1.1/. For this post, use v1.1/.
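You can optionally confirm the download; this sketch assumes the standard SQuAD v1.1 file names:
ls $BERT_PREP_WORKING_DIR/download/squad/v1.1/
# expect to see dev-v1.1.json, which this post uses for evaluation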
Run evaluations with the TensorFlow model
Inside the container, navigate to the BERT workspace that contains the model scripts:
cd /workspace/bert/
You can run inference with a fine-tuned model in TensorFlow using scripts/run_squad.sh. For more information, see Jump-start AI Training with NGC Pretrained Models On-Premises and in the Cloud.
There are two modifications to this script. First, set it to prediction-only mode:
- --do_train=False
- --do_predict=True
When you manually set --do_train=False in run_squad.sh, the training-related parameters that you pass to run_squad.sh are no longer relevant in this scenario.
Second, comment out the following block starting at line number 27:
#if [ "$bert_model" = "large" ] ; then
# export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
#else
# export BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
#fi
Because you can get vocab.txt and bert_config.json from the mounted directory /finetuned-model-bert, you do not need this block of code.
Now, export BERT_DIR inside the container:
export BERT_DIR=/finetuned-model-bert
After making the modifications, issue the following command:
bash scripts/run_squad.sh 1 5e-6 fp16 true 1 384 128 large 1.1 /finetuned-model-bert/model.ckpt-<num>
Substitute the correct checkpoint number for <num>. When inference completes, the script reports the average throughput:
INFO:tensorflow:Throughput Average (sentences/sec) = 106.56
We observed that inference speed is 106.56 sentences per second for running inference directly in TensorFlow on a system powered with a single NVIDIA T4 GPU. Performance may differ depending on the number of GPUs and the architecture of the GPUs.
This is good performance, but could it be better? Investigate by using the scripts in /workspace/bert/trt/ to convert the TF model into TensorRT 7.1, then run inference on the TensorRT BERT model engine. For that process, switch over to the TensorRT repo and build a Docker image to launch.
Issue the following command:
exit
TensorRT performance evaluation
In the following section, you build, run, and evaluate the performance of BERT in TensorRT. Before proceeding, make sure that you have downloaded and set up the TensorRT GitHub repo.
Set up a Docker container
In this step, you build and launch the Docker image from Dockerfile for TensorRT.
On your host machine, navigate to the TensorRT directory:
cd TensorRT
The script docker/build.sh builds the TensorRT Docker container:
./docker/build.sh --file docker/ubuntu.Dockerfile --tag tensorrt-ubuntu --os 18.04 --cuda 11.0
After the container is built, you must launch it by executing the docker/launch.sh script. However, before launching the container, modify docker/launch.sh to add -v $MODEL_DIR:/finetuned-model-bert and -v $BERT_DIR/data/download/squad/v1.1:/data/squad to docker_args, to pass in your fine-tuned model and the SQuAD dataset, respectively.
The docker_args at line 49 should look like the following code:
docker_args="$extra_args -v $MODEL_DIR:/finetuned-model-bert -v $BERT_DIR/data/download/squad/v1.1:/data/squad -v $arg_trtrelease:/tensorrt -v $arg_trtsource:/workspace/TensorRT -it $arg_imagename:latest"
Now build and launch the Docker image locally:
./docker/launch.sh --tag tensorrt-ubuntu --gpus all --release $TRT_RELEASE --source $TRT_SOURCE
When you are in the container, you must build the TensorRT plugins:
cd $TRT_SOURCE
export LD_LIBRARY_PATH=`pwd`/build/out:$LD_LIBRARY_PATH:/tensorrt/lib
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_RELEASE/lib -DTRT_OUT_DIR=`pwd`/out
make -j$(nproc)
pip3 install /tensorrt/python/tensorrt-7.1*-cp36-none-linux_x86_64.whl
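At this point, you can optionally sanity-check the setup. This is a minimal sketch, assuming the wheel installed cleanly and the plugin library was written to build/out:
python3 -c "import tensorrt as trt; print(trt.__version__)"   # should print a 7.1.x version
ls `pwd`/out/libnvinfer_plugin*                               # the plugins built above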
Now you are ready to build the BERT TensorRT engine.
Build the TensorRT engine
Make a directory to store the TensorRT engine:
mkdir -p /workspace/TensorRT/engines
Optionally, explore /workspace/TensorRT/demo/BERT/scripts/download_model.sh to see how you can use the ngc registry model download-version command to download models from NGC.
Run the builder.py script, noting the following values:
- Path to the TensorFlow model: /finetuned-model-bert/model.ckpt-<num>
- Output path for the engine to be built
- Batch size 1
- Sequence length 384
- Precision fp16
- Checkpoint path /finetuned-model-bert
cd /workspace/TensorRT/demo/BERT
python3 builder.py -m /finetuned-model-bert/model.ckpt-5474 -o /workspace/TensorRT/engines/bert_large_384.engine -b 1 -s 384 --fp16 -c /finetuned-model-bert/
Make sure that you provide the correct checkpoint model. The script takes ~1-2 mins to build the TensorRT engine.
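Optionally, confirm that the serialized engine file was written; the exact size varies with the GPU and precision used:
ls -lh /workspace/TensorRT/engines/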
Run the TensorRT inference
Now run the built TensorRT inference engine on 2K samples from the SQuAD v1.1 evaluation dataset. To run the inference and get the throughput numbers, replace the code from line number 222 to line number 228 in inference.py, as shown in the following code block.
Be mindful of indentation. If the prompt asks for a password while you are installing vim in the container, use the password nvidia.
if squad_examples:
    # Track per-example throughput (sentences/sec) and the predictions
    eval_time_l = []
    all_predictions = collections.OrderedDict()
    for example_index, example in enumerate(squad_examples):
        print("Processing example {} of {}".format(example_index+1, len(squad_examples)), end="\r")
        features = question_features(example.doc_tokens, example.question_text)
        eval_time_elapsed, prediction, nbest_json = inference(features, example.doc_tokens)
        # eval_time_elapsed is the latency for one example; invert it to get sentences/sec
        eval_time_l.append(1.0/eval_time_elapsed)
        all_predictions[example.id] = prediction
        # Stop after 2,000 samples
        if example_index+1 == 2000:
            break
    print("Throughput Average (sentences/sec) = ", np.mean(eval_time_l))
Now run the inference:
CUDA_VISIBLE_DEVICES=0 python3 inference.py -e /workspace/TensorRT/engines/bert_large_384.engine -b 1 -s 384 -sq /data/squad/dev-v1.1.json -v /finetuned-model-bert/vocab.txt
Throughput Average (sentences/sec) = 136.59
We observed an inference speed of 136.59 sentences per second when running inference with TensorRT 7.1 on a system powered by a single NVIDIA T4 GPU. Performance may differ depending on the number and architecture of the GPUs, where the data is stored, and other factors. However, you’ll always observe a performance boost due to model optimization using TensorRT.
Figure 2 shows that the TensorRT BERT engine gives an average throughput of 136.59 sentences/sec, compared to 106.56 sentences/sec for the BERT model in TensorFlow. This is roughly a 28% boost in throughput (136.59 / 106.56 ≈ 1.28).

Figure 2. Performance gained when running BERT in TensorRT over TensorFlow.
Summary
Pull the TensorRT container from NGC to quickly and easily performance-tune your models in all major frameworks, create novel low-latency inference applications, and deliver the best quality of service (QoS) to your customers.