Original source: https://github.com/Kaixhin/grokking-pytorch

PyTorch is a flexible deep learning framework that allows automatic differentiation through dynamic neural networks (i.e., networks that utilise dynamic control flow like if statements and while loops). It supports GPU acceleration, distributed training, various optimisations, and plenty more neat features. These are some notes on how I think about using PyTorch; they don't encompass all parts of the library or every best practice, but may be helpful to others.

Neural networks are a subclass of computation graphs. Computation graphs receive input data, and data is routed to and possibly transformed by nodes which perform processing on the data. In deep learning, the neurons (nodes) in neural networks typically transform data with parameters and differentiable functions, such that the parameters can be optimised to minimise a loss via gradient descent. More broadly, the functions can be stochastic, and the structure of the graph can be dynamic. So while neural networks may be a good fit for dataflow programming, PyTorch's API has instead centred around imperative programming, which is a more common way of thinking about programs. This makes it easier to read code and reason about complex programs, without necessarily sacrificing much performance; PyTorch is actually pretty fast, with plenty of optimisations that you can safely forget about as an end user (but you can dig in if you really want to).
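
As a minimal sketch of what "dynamic" means here (the function, parameter value and threshold are made up for illustration), the snippet below lets an ordinary Python while loop decide how many operations end up in the graph, and autograd differentiates whichever graph was actually built:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)  # a single trainable parameter

def forward(x):
    y = w * x
    # Data-dependent control flow: keep doubling until the value reaches 10.
    while y.item() < 10:
        y = y * 2
    return y

out_small = forward(torch.tensor(1.0))  # 3 -> 6 -> 12: doubled twice
out_small.backward()
grad_small = w.grad.item()              # d(4 * w * x)/dw = 4 * x = 4.0

w.grad.zero_()
out_large = forward(torch.tensor(5.0))  # 15: the loop body never runs
out_large.backward()
grad_large = w.grad.item()              # d(w * x)/dw = x = 5.0
```

The two calls run the same Python function but build graphs of different depth, which is why the gradients differ.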

The rest of this document, based on the official MNIST example, is about grokking PyTorch, and should only be looked at after the official beginner tutorials. For readability, the code is presented in chunks interspersed with comments, and hence not separated into different functions/files as it would normally be for clean, modular code.

Imports

import argparse
import os
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

These are pretty standard imports, with the exception of the torchvision modules that are used for computer vision problems in particular.

Setup

parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                    help='input batch size for training (default: 64)')
parser.add_argument('--epochs', type=int, default=10, metavar='N',
                    help='number of epochs to train (default: 10)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                    help='learning rate (default: 0.01)')
parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                    help='SGD momentum (default: 0.5)')
parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disables CUDA training')
parser.add_argument('--seed', type=int, default=1, metavar='S',
                    help='random seed (default: 1)')
parser.add_argument('--save-interval', type=int, default=10, metavar='N',
                    help='how many batches to wait before checkpointing')
parser.add_argument('--resume', action='store_true', default=False,
                    help='resume training from checkpoint')
args = parser.parse_args()

use_cuda = torch.cuda.is_available() and not args.no_cuda
device = torch.device('cuda' if use_cuda else 'cpu')
torch.manual_seed(args.seed)
if use_cuda:
    torch.cuda.manual_seed(args.seed)

argparse is a standard way of dealing with command-line arguments in Python.

A good way to write device-agnostic code (benefitting from GPU acceleration when available but falling back to CPU when not) is to pick and save the appropriate torch.device, which can be used to determine where tensors should be stored. See the official docs for more tips on device-agnostic code. The PyTorch way is to put device placement under the control of the user, which may seem a nuisance for simple examples, but makes it much easier to work out where tensors are - which is useful for a) debugging and b) making efficient use of devices manually.
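
A minimal sketch of the device-agnostic pattern, assuming nothing beyond torch itself (the tensor shapes are arbitrary): the device is chosen once and every tensor is either created on it or moved to it.

```python
import torch

# Pick the device once; everything below works unchanged on CPU or GPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create a tensor directly on the chosen device...
x = torch.randn(8, 3, device=device)
# ...or move an existing tensor there.
y = torch.ones(8, 3).to(device)
# New tensors can also inherit the device (and dtype) of an existing one.
z = torch.zeros_like(x)

total = (x + y + z).sum()  # all operands live on the same device
```

Mixing tensors from different devices in one operation raises an error, which is usually the first hint that a .to(device) call is missing.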

For repeatable experiments, it is necessary to set random seeds for anything that uses random number generation (including random or numpy if those are used too). Note that cuDNN can use nondeterministic algorithms, and it can be disabled using torch.backends.cudnn.enabled = False.
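
One way to keep the seeding in a single place is a small helper; set_seed below is a hypothetical name, not a PyTorch function, and a call to numpy.random.seed would be added in the same way if NumPy is used.

```python
import random
import torch

def set_seed(seed):
    """Seed Python's and PyTorch's generators (hypothetical helper)."""
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(1)
first = (random.random(), torch.rand(3))
set_seed(1)
second = (random.random(), torch.rand(3))  # same draws as the first run
```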

Data

data_path = os.path.join(os.path.expanduser('~'), '.torch', 'datasets', 'mnist')
train_data = datasets.MNIST(data_path, train=True, download=True,
                            transform=transforms.Compose([
                                transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))]))
test_data = datasets.MNIST(data_path, train=False,
                           transform=transforms.Compose([
                               transforms.ToTensor(),
                               transforms.Normalize((0.1307,), (0.3081,))]))

train_loader = DataLoader(train_data, batch_size=args.batch_size,
                          shuffle=True, num_workers=4, pin_memory=True)
test_loader = DataLoader(test_data, batch_size=args.batch_size,
                         num_workers=4, pin_memory=True)

Since torchvision models get stored under ~/.torch/models/, I like to store torchvision datasets under ~/.torch/datasets. This is my own convention, but makes it easier if you have lots of projects that depend on MNIST, CIFAR-10 etc. In general it's worth keeping datasets separately to code if you end up reusing several datasets.

torchvision.transforms contains lots of handy transformations for single images, such as cropping and normalisation.

DataLoader contains many options, but beyond batch_size and shuffle, num_workers and pin_memory are worth knowing for efficiency. num_workers > 0 uses subprocesses to asynchronously load data, rather than making the main process block on this. The typical use-case is when loading data (e.g. images) from disk and maybe transforming them too - this can be done in parallel with the network processing the data. You will want to tune the number of workers to a) minimise the number of workers and hence CPU and RAM usage (each worker loads a separate batch, not individual samples within a batch) and b) minimise the time the network is waiting for data. pin_memory uses pinned RAM to speed up RAM to GPU transfers (and does nothing for CPU-only code). (Note: on Windows, num_workers may need to be left at its default of 0 to avoid errors. shuffle=True reshuffles the data at every epoch (default: False); it is generally set to True for training and left at the default False for testing.)
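
A small sketch of these options on synthetic data (the tensors are a random stand-in for MNIST, and num_workers=0 is chosen only so the snippet runs anywhere):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 100 fake images with integer labels.
images = torch.randn(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))
dataset = TensorDataset(images, labels)

# num_workers=0 loads in the main process; raise it once loading is a bottleneck.
# pin_memory only matters when batches are later copied to a GPU.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0,
                    pin_memory=torch.cuda.is_available())

batch_shapes = [tuple(data.shape) for data, target in loader]
```

With 100 samples and a batch size of 32 the loader yields four batches, the last one holding the remaining 4 samples.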

Model

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net().to(device)
optimiser = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

if args.resume:
    model.load_state_dict(torch.load('model.pth'))
    optimiser.load_state_dict(torch.load('optimiser.pth'))

Network initialisation typically includes member variables, layers which contain trainable parameters, and maybe separate trainable parameters and non-trainable buffers. The forward pass then uses these in conjunction with functions from F that are purely functional (don't contain parameters). Some people prefer to have completely functional networks (e.g., keeping parameters separately and using F.conv2d instead of nn.Conv2d) or networks completely made of layers (e.g., nn.ReLU instead of F.relu).
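
The distinction between layers, separate parameters and buffers can be sketched with a toy module (Scaler and its attribute names are hypothetical, chosen only for this illustration):

```python
import torch
from torch import nn
from torch.nn import functional as F

class Scaler(nn.Module):
    """Hypothetical module with a layer, a bare parameter and a buffer."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)                        # layer with trainable parameters
        self.scale = nn.Parameter(torch.ones(1))         # separate trainable parameter
        self.register_buffer('offset', torch.zeros(2))   # non-trainable buffer

    def forward(self, x):
        # F.relu holds no parameters, so it needs no entry in __init__.
        return F.relu(self.fc(x) * self.scale + self.offset)

net = Scaler()
param_names = sorted(name for name, _ in net.named_parameters())
buffer_names = [name for name, _ in net.named_buffers()]
output = net(torch.randn(3, 4))
```

Only the entries in param_names are handed to an optimiser via net.parameters(); the buffer is still saved in the state dict and moved by .to(device).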

.to(device) is a convenient way of sending the network's parameters (and buffers) to GPU if device is set to GPU, doing nothing otherwise (when device is set to CPU). It's important to transfer the network parameters to the appropriate device before passing them to the optimiser, otherwise the optimiser will not be keeping track of the parameters properly!
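
The ordering can be sketched as follows, using a small nn.Linear as a stand-in network; the final check simply confirms that the optimiser holds the very same parameter objects as the model.

```python
import torch
from torch import nn, optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(4, 2).to(device)                  # move first...
optimiser = optim.SGD(model.parameters(), lr=0.1)   # ...then hand the parameters over

optimised = optimiser.param_groups[0]['params']
same_objects = all(p is q for p, q in zip(model.parameters(), optimised))
```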

Both neural networks (nn.Module) and optimisers (optim.Optimizer) have the ability to save and load their internal state, and .load_state_dict(state_dict) is the recommended way to do so - you'll want to reload the state of both to resume training from previously saved state dictionaries. Saving the entire object can be error prone. If you have saved tensors on GPU and want to load them on CPU or another GPU, the easiest way is to directly load them onto CPU using the map_location option, e.g., torch.load('model.pth', map_location='cpu').
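
A minimal round trip through state dictionaries might look like the sketch below. It uses a small nn.Linear stand-in and a temporary directory so that it is self-contained; the file names mirror those used above.

```python
import os
import tempfile
import torch
from torch import nn, optim

model = nn.Linear(4, 2)
optimiser = optim.SGD(model.parameters(), lr=0.1, momentum=0.5)

# Take one step so the optimiser has some internal state worth saving.
model(torch.randn(3, 4)).sum().backward()
optimiser.step()

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, 'model.pth')
    optim_path = os.path.join(tmp, 'optimiser.pth')
    torch.save(model.state_dict(), model_path)
    torch.save(optimiser.state_dict(), optim_path)

    # Fresh objects, as after restarting the script.
    restored = nn.Linear(4, 2)
    restored_optim = optim.SGD(restored.parameters(), lr=0.1, momentum=0.5)
    # map_location='cpu' makes the load work even if the tensors were saved on GPU.
    restored.load_state_dict(torch.load(model_path, map_location='cpu'))
    restored_optim.load_state_dict(torch.load(optim_path, map_location='cpu'))
```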

Some points of note not shown here are that the forward pass can make use of control flow, e.g., a member variable or even the data itself can determine the execution of an if statement. It is also perfectly valid to print tensors in the middle, making debugging much easier. Finally, the forward pass can make use of multiple arguments. A short snippet (not tied to any sensible idea) to illustrate this is below:

def forward(self, x, hx, drop=False):
    hx2 = self.rnn(x, hx)
    print(hx.mean().item(), hx.var().item())
    # .max is a method, so it must be called before .item()
    if hx.max().item() > 10 or (self.can_drop and drop):
        return hx
    else:
        return hx2

Training

model.train()
train_losses = []

for i, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimiser.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    train_losses.append(loss.item())
    optimiser.step()

    if i % 10 == 0:
        print(i, loss.item())
        torch.save(model.state_dict(), 'model.pth')
        torch.save(optimiser.state_dict(), 'optimiser.pth')
        torch.save(train_losses, 'train_losses.pth')

Network modules are by default set to training mode - which impacts the way some modules work, most noticeably dropout and batch normalisation. It's best to set this manually anyway with .train(), which propagates the training flag down all children modules.

Before collecting a new set of gradients with loss.backward() (which performs backpropagation) and applying the parameter update with optimiser.step(), it's necessary to manually zero the gradients of the parameters being optimised with optimiser.zero_grad(). By default, PyTorch accumulates gradients, which is very handy when you don't have enough resources to calculate all the gradients you need in one go.
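
The accumulation behaviour can be checked with a small sketch (the model, data and the split into two half-batches are arbitrary choices for illustration): summing the loss over a full batch gives the same gradient as calling backward() on each half in turn without zeroing in between.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)
model = nn.Linear(5, 1)
data, target = torch.randn(8, 5), torch.randn(8, 1)

# Gradient from the full batch in one go.
model.zero_grad()
F.mse_loss(model(data), target, reduction='sum').backward()
full_grad = model.weight.grad.clone()

# The same gradient, accumulated over two half-batches.
model.zero_grad()
for chunk, chunk_target in zip(data.chunk(2), target.chunk(2)):
    F.mse_loss(model(chunk), chunk_target, reduction='sum').backward()
accumulated_grad = model.weight.grad.clone()
```

Forgetting the zero_grad() call between optimisation steps would silently add each new gradient to the previous ones in the same way.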

PyTorch uses a tape-based automatic gradient (autograd) system - it collects which operations were done on tensors in order, and then replays them backwards to do reverse-mode differentiation. This is why it is super flexible and allows arbitrary computation graphs. If none of the tensors require gradients (you'd have to set requires_grad=True when constructing a tensor for this) then no graph is stored! However, networks tend to have parameters that require gradients, so any computation done from the output of a network will be stored in the graph. So if you want to store data resulting from this, you'll need to manually disable gradients or, more commonly, store it as a Python number (via .item() on a PyTorch scalar) or numpy array. See the official docs for more on autograd.
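
A short sketch of when a graph is recorded and how .item() sidesteps it (the tensors are arbitrary):

```python
import torch

a = torch.ones(3)                       # requires_grad defaults to False
b = torch.ones(3, requires_grad=True)

no_graph = (a * 2).sum()   # nothing requires gradients, so nothing is recorded
tracked = (b * 2).sum()    # carries a grad_fn linking it back to b

# A plain Python float holds no reference to the graph, so it is safe to log.
loss_value = tracked.item()
```

Appending tracked itself to a list would keep its whole graph alive, whereas appending loss_value stores only a number.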

One way to cut the computation graph is to use .detach(), which you may use when passing on a hidden state when training RNNs with truncated backpropagation-through-time. It's also handy when differentiating a loss where one component is the output of another network, but this other network shouldn't be optimised with respect to the loss - examples include training a discriminator from a generator's outputs in GAN training, or training the policy of an actor-critic algorithm using the value function as a baseline (e.g. A2C). Another technique for preventing gradient calculations that is efficient in GAN training (training the generator from the discriminator) and typical in fine-tuning is to loop through a network's parameters and set param.requires_grad = False.
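
Both techniques can be sketched with two tiny linear layers standing in for a generator and a discriminator (the sizes and the summed output used as a "loss" are placeholders, not a real GAN objective):

```python
import torch
from torch import nn

generator = nn.Linear(2, 2)
discriminator = nn.Linear(2, 1)

fake = generator(torch.randn(4, 2))

# Discriminator update: detach so that nothing flows back into the generator.
discriminator(fake.detach()).sum().backward()
generator_untouched = generator.weight.grad is None

# Generator update: freeze the discriminator's parameters instead.
for param in discriminator.parameters():
    param.requires_grad = False
discriminator.zero_grad()
discriminator(fake).sum().backward()
generator_has_grad = generator.weight.grad is not None
```

In the second pass gradients still flow through the discriminator to reach the generator, but none are computed for the discriminator's own weights.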

Apart from logging results in the console/in a log file, it's important to checkpoint model parameters (and optimiser state) just in case. You can also use torch.save() to save normal Python objects, but other standard choices include the built-in pickle.

Testing

model.eval()
test_loss, correct = 0, 0

with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        # Sum the per-sample losses; reduction='sum' replaces the deprecated size_average=False
        test_loss += F.nll_loss(output, target, reduction='sum').item()
        pred = output.argmax(1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()

test_loss /= len(test_data)
acc = correct / len(test_data)
print(acc, test_loss)

As the counterpart to .train() earlier, networks should explicitly be set to evaluation mode using .eval().
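
The effect of the two modes can be seen on a toy network containing dropout (the layer sizes are arbitrary):

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

net.train()
flags_train = [m.training for m in net.modules()]  # the container and both children
net.eval()
flags_eval = [m.training for m in net.modules()]

# In evaluation mode dropout is the identity, so repeated calls agree.
with torch.no_grad():
    same_in_eval = torch.equal(net(x), net(x))
```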

As mentioned previously, the computation graph would normally be made when using a network. By using the no_grad context manager via with torch.no_grad() this is prevented from happening.
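
A quick sketch of the difference, with a small linear layer as a stand-in network:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
x = torch.randn(3, 4)

with_graph = model(x)            # graph recorded, since the parameters require gradients
with torch.no_grad():
    without_graph = model(x)     # same values, but no graph is stored
```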

Extra

This is an extra section just to add a few useful asides.

Memory problems? Check the official docs for tips.

CUDA errors? They are a pain to debug, and are usually a logic problem that would come up with a more intelligible error message on CPU. It's best to be able to easily switch between CPU and GPU if you are planning on using the GPU. A more general development tip is to be able to set up your code so that it's possible to run through all of the logic quickly to check it before launching a proper job - examples would be preparing a small/synthetic dataset, running one train + test epoch, etc. If it is a CUDA error, or you really can't switch to CPU, setting CUDA_LAUNCH_BLOCKING=1 will make CUDA kernel launches synchronous and as a result provide better error messages.

A note for torch.multiprocessing, or even just running multiple PyTorch scripts at once. Because PyTorch uses multithreaded BLAS libraries to speed up linear algebra computations on CPU, it'll typically use several cores. If you want to run several things at once, with multiprocessing or several scripts, it may be useful to manually reduce these by setting the environment variable OMP_NUM_THREADS to 1 or another small number - this reduces the chance of CPU thrashing. The official docs have some other notes for multiprocessing in particular.
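
As an in-script alternative to the environment variable, torch.set_num_threads limits the number of threads PyTorch uses for intra-op parallelism on CPU. This is a sketch of that option rather than a replacement for OMP_NUM_THREADS, which affects any OpenMP-based library in the process.

```python
import torch

# Limit PyTorch's CPU intra-op parallelism for this process to one thread.
torch.set_num_threads(1)
n_threads = torch.get_num_threads()
```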

Additions From Official Docs

A typical training procedure for a neural network is as follows:

-Define the neural network that has some learnable parameters (or weights)

-Iterate over a dataset of inputs

-Process input through the network

-Compute the loss (how far is the output from being correct)

-Propagate gradients back into the network’s parameters

-Update the weights of the network, typically using a simple update rule: weight = weight - learning_rate * gradient
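
The steps above can be sketched end to end on a toy problem. The data (points on the line y = 2x + 1), the learning rate and the number of epochs are made up for illustration, and the update rule is written out by hand instead of using an optimiser.

```python
import torch

# Toy regression data: y = 2x + 1.
inputs = torch.linspace(-1, 1, 20).unsqueeze(1)
targets = 2 * inputs + 1

# "Network" with learnable parameters: a single linear unit.
weight = torch.zeros(1, 1, requires_grad=True)
bias = torch.zeros(1, requires_grad=True)
learning_rate = 0.1

for epoch in range(200):
    prediction = inputs @ weight + bias             # process input through the network
    loss = ((prediction - targets) ** 2).mean()     # compute the loss
    loss.backward()                                 # propagate gradients back
    with torch.no_grad():                           # update: w = w - lr * grad
        weight -= learning_rate * weight.grad
        bias -= learning_rate * bias.grad
    weight.grad.zero_()
    bias.grad.zero_()
```

After training, weight and bias should be close to 2 and 1 respectively; torch.optim packages the same update (and fancier ones) behind optimizer.step().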

Dataset class

torch.utils.data.Dataset is an abstract class representing a dataset. Your custom dataset should inherit Dataset and override the following methods:

__len__ so that len(dataset) returns the size of the dataset.

__getitem__ to support indexing, such that dataset[i] can be used to get the i-th sample.
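
A minimal custom dataset might look like the sketch below; SquaresDataset is a hypothetical example whose samples are computed on the fly rather than loaded from disk.

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Hypothetical dataset whose i-th sample is the pair (i, i squared)."""
    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        x = torch.tensor(float(index))
        return x, x ** 2

dataset = SquaresDataset(10)
sample_x, sample_y = dataset[3]
```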

Transforms

Rescale: to scale the image

RandomCrop: to crop from image randomly. This is data augmentation.

ToTensor: to convert the numpy images to torch images (we need to swap axes).

Compose transforms

torchvision.transforms.Compose chains several transforms into a single callable. For example, to rescale the shorter side of the image to 256 and then randomly crop a square of size 224 from it, compose the Rescale and RandomCrop transforms:

composed = transforms.Compose([Rescale(256),
                               RandomCrop(224)])

DataLoader

torch.utils.data.DataLoader is an iterable which provides all these features: batching the data, shuffling the data, and loading the data in parallel using multiprocessing workers. The parameters used below should be clear. One parameter of interest is collate_fn, which specifies exactly how the samples are to be batched. However, the default collate function should work fine for most use cases.
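
One case where the default collate function is not enough is a dataset of variable-length sequences, which cannot be stacked into a single tensor. The sketch below (pad_collate is a hypothetical helper, and the sequences are dummy data) pads each batch to its longest sequence:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Dummy variable-length sequences; a plain list works as a map-style dataset.
sequences = [torch.ones(n) for n in (3, 1, 4, 2)]

def pad_collate(batch):
    """Zero-pad the sequences in a batch to equal length and return their lengths."""
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True)  # shape: (batch, longest)
    return padded, lengths

loader = DataLoader(sequences, batch_size=2, collate_fn=pad_collate)
batches = list(loader)
```

Returning the original lengths alongside the padded tensor lets later code mask out or ignore the padding.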

torchvision

The torchvision package provides some common datasets and transforms. You might not even have to write custom classes. One of the more generic datasets available in torchvision is ImageFolder.

import torch
from torchvision import transforms, datasets

data_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),  # replaces the deprecated RandomSizedCrop
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
hymenoptera_dataset = datasets.ImageFolder(root='hymenoptera_data/train',
                                           transform=data_transform)
dataset_loader = torch.utils.data.DataLoader(hymenoptera_dataset,
                                             batch_size=4, shuffle=True,
                                             num_workers=4)

Custom nn Modules

Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing nn.Module and defining a forward which receives input Tensors and produces output Tensors using other modules or other autograd operations on Tensors.

import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')  # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # optimizer
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
