原文地址:https://github.com/Kaixhin/grokking-pytorch

PyTorch is a flexible deep learning framework that allows automatic differentiation(自动求导) through dynamic neural networks (i.e., networks that utilise dynamic control flow like if statements and while loops). It supports GPU acceleration, distributed training, various optimisations, and plenty more neat features. These are some notes on how I think about using PyTorch, and don't encompass all parts of the library or every best practice, but may be helpful to others.

Neural networks are a subclass of computation graphs(计算图). Computation graphs receive input data, and data is routed to and possibly transformed by nodes which perform processing on the data. In deep learning, the neurons (nodes) in neural networks typically transform data with parameters and differentiable functions, such that the parameters can be optimised to minimise a loss via gradient descent(梯度下降). More broadly, the functions can be stochastic, and the structure of the graph can be dynamic. So while neural networks may be a good fit for dataflow programming, PyTorch's API has instead centred around imperative programming, which is a more common way for thinking about programs. This makes it easier to read code and reason about complex programs, without necessarily sacrificing much performance; PyTorch is actually pretty fast, with plenty of optimisations that you can safely forget about as an end user (but you can dig in if you really want to).

The rest of this document, based on the official MNIST example, is about grokking PyTorch, and should only be looked at after the official beginner tutorials. For readability, the code is presented in chunks interspersed with comments, and hence not separated into different functions/files as it would normally be for clean, modular code.

Imports

import argparse
import os
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

These are pretty standard imports, with the exception of the torchvision modules that are used for computer vision problems in particular.

Setup

parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--batch-size', type=int, default=64, metavar='N',
help='input batch size for training (default: 64)')
parser.add_argument('--epochs', type=int, default=10, metavar='N',
help='number of epochs to train (default: 10)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
help='learning rate (default: 0.01)')
parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
help='SGD momentum (default: 0.5)')
parser.add_argument('--no-cuda', action='store_true', default=False,
help='disables CUDA training')
parser.add_argument('--seed', type=int, default=1, metavar='S',
help='random seed (default: 1)')
parser.add_argument('--save-interval', type=int, default=10, metavar='N',
help='how many batches to wait before checkpointing')
parser.add_argument('--resume', action='store_true', default=False,
help='resume training from checkpoint')
args = parser.parse_args() use_cuda = torch.cuda.is_available() and not args.no_cuda
device = torch.device('cuda' if use_cuda else 'cpu')
torch.manual_seed(args.seed)
if use_cuda:
torch.cuda.manual_seed(args.seed)

argparse is a standard way of dealing with command-line arguments in Python.

A good way to write device-agnostic code (benefitting from GPU acceleration when available but falling back to CPU when not) is to pick and save the appropriate torch.device, which can be used to determine where tensors should be stored. See the official docs for more tips on device-agnostic code. The PyTorch way is to put device placement under the control of the user, which may seem a nuisance for simple examples, but makes it much easier to work out where tensors are - which is useful for a) debugging and b) making efficient use of devices manually.

For repeatable experiments, it is necessary to set random seeds for anything that uses random number generation (including random or numpy if those are used too). Note that cuDNN uses nondeterministic algorithms, and it can be disabled using torch.backends.cudnn.enabled = False.

Data

data_path = os.path.join(os.path.expanduser('~'), '.torch', 'datasets', 'mnist')
train_data = datasets.MNIST(data_path, train=True, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))]))
test_data = datasets.MNIST(data_path, train=False, transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))])) train_loader = DataLoader(train_data, batch_size=args.batch_size,
shuffle=True, num_workers=4, pin_memory=True)
test_loader = DataLoader(test_data, batch_size=args.batch_size,
num_workers=4, pin_memory=True)

Since torchvision models get stored under ~/.torch/models/, I like to store torchvision datasets under ~/.torch/datasets. This is my own convention, but makes it easier if you have lots of projects that depend on MNIST, CIFAR-10 etc. In general it's worth keeping datasets separately to code if you end up reusing several datasets.

torchvision.transforms contains lots of handy transformations for single images, such as cropping and normalisation.

DataLoader contains many options, but beyond batch_size and shuffle, num_workers and pin_memory are worth knowing for efficiency. num_workers > 0 uses subprocesses to asynchronously load data, rather than making the main process block on this. The typical use-case is when loading data (e.g. images) from disk and maybe transforming them too - this can be done in parallel with the network processing the data. You will want to tune the amount to a) minimise the number of workers and hence CPU and RAM usage (each worker loads a separate batch, not individual samples within a batch) b) minimise the time the network is waiting for data. pin_memory uses pinned RAM to speed up RAM to GPU transfers (and does nothing for CPU-only code).(在Windows上num_workers应取默认值0,否则会出错;shuffle:set to true have the data reshuffled(洗牌) at every epoch(default:False)一般在train的时候设置为true,test时取默认值False)

Model

class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
self.conv2_drop = nn.Dropout2d()
self.fc1 = nn.Linear(320, 50)
self.fc2 = nn.Linear(50, 10) def forward(self, x):
x = F.relu(F.max_pool2d(self.conv1(x), 2))
x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
x = x.view(-1, 320)
x = F.relu(self.fc1(x))
x = self.fc2(x)
return F.log_softmax(x, dim=1) model = Net().to(device)
optimiser = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum) if args.resume:
model.load_state_dict(torch.load('model.pth'))
optimiser.load_state_dict(torch.load('optimiser.pth'))

Network initialisation typically includes member variables, layers which contain trainable parameters, and maybe separate trainable parameters and non-trainable buffers. The forward pass then uses these in conjunction with functions from F that are purely functional (don't contain parameters). Some people prefer to have completely functional networks (e.g., keeping parameters separately and using F.conv2d instead of nn.Conv2d) or networks completely made of layers (e.g., nn.ReLU instead of F.relu).

.to(device) is a convenient way of sending the device parameters (and buffers) to GPU if device is set to GPU, doing nothing otherwise (when device is set to CPU). It's important to transfer the network parameters to the appropriate device before passing them to the optimiser, otherwise the optimiser will not be keeping track of the parameters properly!

Both neural networks (nn.Module) and optimisers (optim.Optimizer) have the ability to save and load their internal state, and .load_state_dict(state_dict) is the recommended way to do so - you'll want to reload the state of both to resume training from previously saved state dictionaries. Saving the entire object can be error prone. If you have saved tensors on GPU and want to load them on CPU or another GPU, the easiest way is to directly load them onto CPU using the map_location option, e.g., torch.load('model.pth', map_location='cpu').

Some points of note not shown here are that the forward pass can make use of control flow (控制流)(e.g., a member variable or even the data itself can determine the execution of an if statement. It is also perfectly valid to print tensors in the middle, making debugging much easier. Finally, the forward pass can make use of multiple arguments(多个参数). A short snippet (not tied to any sensible idea) to illustrate this is below:

def forward(self, x, hx, drop=False):
hx2 = self.rnn(x, hx)
print(hx.mean().item(), hx.var().item())
if hx.max.item() > 10 or self.can_drop and drop:
return hx
else:
return hx2

Training

model.train()
train_losses = [] for i, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
optimiser.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
train_losses.append(loss.item())
optimiser.step() if i % 10 == 0:
print(i, loss.item())
torch.save(model.state_dict(), 'model.pth')
torch.save(optimiser.state_dict(), 'optimiser.pth')
torch.save(train_losses, 'train_losses.pth')

Network modules are by default set to training mode - which impacts the way some modules work, most noticeably dropout and batch normalisation. It's best to set this manually anyway with .train(), which propagates the training flag down all children modules.

Before collecting a new set of gradients with loss.backward() and doing backpropagation with optimiser.step(), it's necessary to manually zero the gradients of the parameters being optimised with optimiser.zero_grad(). By default, PyTorch accumulates gradients, which is very handy when you don't have enough resources to calculate all the gradients you need in one go.

PyTorch uses a tape-based automatic gradient (autograd) system - it collects which operations were done on tensors in order, and then replays them backwards to do reverse-mode differentiation. This is why it is super flexible and allows arbitrary computation graphs. If none of the tensors require gradients (you'd have to set requires_grad=True when constructing a tensor for this) then no graph is stored! However, networks tend to have parameters that require gradients, so any computation done from the output of a network will be stored in the graph. So if you want to store data resulting from this, you'll need to manually disable gradients or, more commonly, store it as a Python number (via .item() on a PyTorch scalar) or numpy array. See the official docs for more on autograd.

One way to cut the computation graph is to use .detach(), which you may use when passing on a hidden state when training RNNs with truncated backpropagation-through-time. It's also handy when differentiating a loss where one component is the output of another network, but this other network shouldn't be optimised with respect to the loss - examples include training a discriminator from a generator's outputs in GAN training, or training the policy of an actor-critic algorithm using the value function as a baseline (e.g. A2C). Another technique for preventing gradient calculations that is efficient in GAN training (training the generator from the discriminator) and typical in fine-tuning is to loop through a networks parameters and set param.requires_grad = False.

Apart from logging results in the console/in a log file, it's important to checkpoint model parameters (and optimiser state) just in case. You can also use torch.save() to save normal Python objects, but other standard choices include the built-in pickle.

Testing

model.eval()
test_loss, correct = 0, 0 with torch.no_grad():
for data, target in test_loader:
data, target = data.to(device), target.to(device)
output = model(data)
test_loss += F.nll_loss(output, target, size_average=False).item()
pred = output.argmax(1, keepdim=True)
correct += pred.eq(target.view_as(pred)).sum().item() test_loss /= len(test_data)
acc = correct / len(test_data)
print(acc, test_loss)

In response to .train() earlier, networks should explicitly be set to evaluation mode using .eval().

As mentioned previously, the computation graph would normally be made when using a network. By using the no_grad context manager via with torch.no_grad() this is prevented from happening.

Extra

This is an extra section just to add a few useful asides.

Memory problems? Check the official docs for tips.

CUDA errors? They are a pain to debug, and are usually a logic problem that would come up with a more intelligible error message on CPU. It's best to be able to easily switch between CPU and GPU if you are planning on using the GPU. A more general development tip is to be able to set up your code so that it's possible to run through all of the logic quickly to check it before launching a proper job - examples would be preparing a small/synthetic dataset, running one train + test epoch, etc. If it is a CUDA error, or you really can't switch to CPU, setting CUDA_LAUNCH_BLOCKING=1 will make CUDA kernel launches synchronous and as a result provide better error messages.

A note for torch.multiprocessing, or even just running multiple PyTorch scripts at once. Because PyTorch uses multithreaded BLAS libraries to speed up linear algebra computations on CPU, it'll typically use several cores. If you want to run several things at once, with multiprocessing or several scripts, it may be useful to manually reduce these by setting the environment variable OMP_NUM_THREADS to 1 or another small number - this reduces the chance of CPU thrashing. The official docs have some other notes for multiprocessing in particular.

Additions From Official Docs

A typical training procedure for a neural network is as follows:

-Define the neural network that has some learnable parameters(or weights)

-iterate over a dataset of inputs

-Process input through the network

-Compute the loss (how far is the output from being correct)

-Propagate gradients back into the network’s parameters

-Update the weights of the network, typically using a simple update rule:weight=weight-learning_rate*gradient

Dataset class

torch.utils.data.Dataset is an abstract class representing a dataset. Your custom dataset should inherit Dataset and override the following methods:

__len__ so that len(dataset) returns the size of the dataset.

__getitem__ to support the indexing such that dataset[i] can be used to get ith sample.

Transforms

Rescale: to scale the image

RandomCrop: to crop from image randomly. This is data augmentation.

ToTensor: to convert the numpy images to torch images (we need to swap axes).

Compose transforms

torchvision.transforms.Compose

composed = transforms.Compose([Rescale(256),
RandomCrop(224)])

rescale the shorter side of the image to 256 and then randomly crop a square of size 224 from it.To compose Rescale and RandomCrop transforms

DataLoader

torch.utils.data.DataLoader is an iterator which provides all these features(Batching the data;Shuffling the data;Load the data in parallel using multiprocessing workers). Parameters used below should be clear. One parameter of interest is collate_fn. You can specify how exactly the samples need to be batched using collate_fn. However, default collate should work fine for most use cases.

torchvision

torchvision package provides some common datasets and transforms. You might not even have to write custom classes. One of the more generic datasets available in torchvision is ImageFolder.

import torch
from torchvision import transforms, datasets data_transform = transforms.Compose([
transforms.RandomSizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
hymenoptera_dataset = datasets.ImageFolder(root='hymenoptera_data/train',
transform=data_transform)
dataset_loader = torch.utils.data.DataLoader(hymenoptera_dataset,
batch_size=4, shuffle=True,
num_workers=4)

Custom nn Modules

Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing nn.Module and defining a forward which receives input Tensors and produces output Tensors using other modules or other autograd operations on Tensors.

import torch

class TwoLayerNet(torch.nn.Module):
def __init__(self, D_in, H, D_out):
"""
In the constructor we instantiate two nn.Linear modules and assign them as
member variables.
"""
super(TwoLayerNet, self).__init__()
self.linear1 = torch.nn.Linear(D_in, H)
self.linear2 = torch.nn.Linear(H, D_out) def forward(self, x):
"""
In the forward function we accept a Tensor of input data and we must return
a Tensor of output data. We can use Modules defined in the constructor as
well as arbitrary operators on Tensors.
"""
h_relu = self.linear1(x).clamp(min=0)
y_pred = self.linear2(h_relu)
return y_pred # N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10 # Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out) # Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out) # Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum') #loss function
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4) #Optimizer
for t in range(500):
# Forward pass: Compute predicted y by passing x to the model
y_pred = model(x) # Compute and print loss
loss = criterion(y_pred, y)
print(t, loss.item()) # Zero gradients, perform a backward pass, and update the weights.
optimizer.zero_grad()
loss.backward()
optimizer.step()

Grokking PyTorch的更多相关文章

  1. Ubutnu16.04安装pytorch

    1.下载Anaconda3 首先需要去Anaconda官网下载最新版本Anaconda3(https://www.continuum.io/downloads),我下载是是带有python3.6的An ...

  2. 解决运行pytorch程序多线程问题

    当我使用pycharm运行  (https://github.com/Joyce94/cnn-text-classification-pytorch )  pytorch程序的时候,在Linux服务器 ...

  3. 基于pytorch实现word2vec

    一.介绍 word2vec是Google于2013年推出的开源的获取词向量word2vec的工具包.它包括了一组用于word embedding的模型,这些模型通常都是用浅层(两层)神经网络训练词向量 ...

  4. 基于pytorch的CNN、LSTM神经网络模型调参小结

    (Demo) 这是最近两个月来的一个小总结,实现的demo已经上传github,里面包含了CNN.LSTM.BiLSTM.GRU以及CNN与LSTM.BiLSTM的结合还有多层多通道CNN.LSTM. ...

  5. pytorch实现VAE

    一.VAE的具体结构 二.VAE的pytorch实现 1加载并规范化MNIST import相关类: from __future__ import print_function import argp ...

  6. PyTorch教程之Training a classifier

    我们已经了解了如何定义神经网络,计算损失并对网络的权重进行更新. 接下来的问题就是: 一.What about data? 通常处理图像.文本.音频或视频数据时,可以使用标准的python包将数据加载 ...

  7. PyTorch教程之Neural Networks

    我们可以通过torch.nn package构建神经网络. 现在我们已经了解了autograd,nn基于autograd来定义模型并对他们有所区分. 一个 nn.Module模块由如下部分构成:若干层 ...

  8. PyTorch教程之Autograd

    在PyTorch中,autograd是所有神经网络的核心内容,为Tensor所有操作提供自动求导方法. 它是一个按运行方式定义的框架,这意味着backprop是由代码的运行方式定义的. 一.Varia ...

  9. Linux安装pytorch的具体过程以及其中出现问题的解决办法

    1.安装Anaconda 安装步骤参考了官网的说明:https://docs.anaconda.com/anaconda/install/linux.html 具体步骤如下: 首先,在官网下载地址 h ...

随机推荐

  1. Ntp配置文件详解

    1. http://www.ine.com/resources/ntp-authentication.htm 2. http://blog.chinaunix.net/uid-773723-id-16 ...

  2. php如何实现把多平台文件中所有的行合成一行?

    php如何实现把多平台文件中所有的行合成一行? 一.总结 1.str_replace中的数组替换:str_replace(array("/r","/n",&qu ...

  3. HelloWorld RabbitMQ

    RabbitMQ入门-从HelloWorld开始 从读者的反馈谈RabbitMQ 昨天发完<RabbitMQ入门-初识RabbitMQ>,我陆陆续续收到一些反馈.鉴于部分读者希望结合实例来 ...

  4. 数据结构与算法——常用高级数据结构及其Java实现

    前文 数据结构与算法--常用数据结构及其Java实现 总结了基本的数据结构,类似的,本文准备总结一下一些常见的高级的数据结构及其常见算法和对应的Java实现以及应用场景,务求理论与实践一步到位. 跳跃 ...

  5. 神经进化学的简介和一个简单的CPPN(Compositional Pattern Producing Networks)DEMO

    近期迷上神经进化(Neuroevolution)这个方向,感觉是Deep Learning之后的一个非常不错的研究领域. 该领域的一个主导就是仿照人的遗传机制来进化网络參数与结构.注意,连网络结构都能 ...

  6. Web开发标配--开发人员工具-F12

    喜欢从业的专注,七分学习的态度. 360浏览器-开发工具 谷歌-开发工具 IE-开发工具 Web开发中最最烦琐的莫过于调整CSS和JS,而最方便最高效的方式就是利用浏览器的开发工具调整.CSS可以实时 ...

  7. 三次握手、四次握手、backlog

    TCP:三次握手.四次握手.backlog及其他   TCP是什么 首先看一下OSI七层模型: 然后数据从应用层发下来,会在每一层都加上头部信息进行封装,然后再发送到数据接收端,这个基本的流程中每个数 ...

  8. BZOJ 2286 消耗战 - 虚树 + 树型dp

    传送门 题目大意: 每次给出k个特殊点,回答将这些特殊点与根节点断开至少需要多少代价. 题目分析: 虚树入门 + 树型dp: 刚刚学习完虚树(好文),就来这道入门题签个到. 虚树就是将树中的一些关键点 ...

  9. 使用RpcLite构建SOA/Web服务

    使用RpcLite构建SOA/Web服务 SOA框架系列 1. 使用RpcLite构建SOA/Web服务 提到Web服务最先想到的就是WebService此外常用的还有WCF.ServiceStack ...

  10. GANs(生成对抗网络)初步

    Image Completion with Deep Learning in TensorFlow 1. 基本思路 首先定义一个简单的.常见的概率分布,将其表示为 pz,不妨将其作为 [-1, 1] ...