Grokking PyTorch
Original source: https://github.com/Kaixhin/grokking-pytorch
PyTorch is a flexible deep learning framework that allows automatic differentiation through dynamic neural networks (i.e., networks that utilise dynamic control flow, like if statements and while loops). It supports GPU acceleration, distributed training, various optimisations, and plenty more neat features. These are some notes on how I think about using PyTorch, and don't encompass all parts of the library or every best practice, but may be helpful to others.
Neural networks are a subclass of computation graphs. Computation graphs receive input data, and data is routed to and possibly transformed by nodes which perform processing on the data. In deep learning, the neurons (nodes) in neural networks typically transform data with parameters and differentiable functions, such that the parameters can be optimised to minimise a loss via gradient descent. More broadly, the functions can be stochastic, and the structure of the graph can be dynamic. So while neural networks may be a good fit for dataflow programming, PyTorch's API has instead centred around imperative programming, which is a more common way of thinking about programs. This makes it easier to read code and reason about complex programs, without necessarily sacrificing much performance; PyTorch is actually pretty fast, with plenty of optimisations that you can safely forget about as an end user (but you can dig in if you really want to).
The rest of this document, based on the official MNIST example, is about grokking PyTorch, and should only be looked at after the official beginner tutorials. For readability, the code is presented in chunks interspersed with comments, and hence not separated into different functions/files as it would normally be for clean, modular code.
Imports
import argparse
import os
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
These are pretty standard imports, with the exception of the torchvision modules that are used for computer vision problems in particular.
Setup
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                    help='input batch size for training (default: 64)')
parser.add_argument('--epochs', type=int, default=10, metavar='N',
                    help='number of epochs to train (default: 10)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                    help='learning rate (default: 0.01)')
parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                    help='SGD momentum (default: 0.5)')
parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disables CUDA training')
parser.add_argument('--seed', type=int, default=1, metavar='S',
                    help='random seed (default: 1)')
parser.add_argument('--save-interval', type=int, default=10, metavar='N',
                    help='how many batches to wait before checkpointing')
parser.add_argument('--resume', action='store_true', default=False,
                    help='resume training from checkpoint')
args = parser.parse_args()
use_cuda = torch.cuda.is_available() and not args.no_cuda
device = torch.device('cuda' if use_cuda else 'cpu')
torch.manual_seed(args.seed)
if use_cuda:
    torch.cuda.manual_seed(args.seed)
argparse is a standard way of dealing with command-line arguments in Python.
A good way to write device-agnostic code (benefitting from GPU acceleration when available but falling back to CPU when not) is to pick and save the appropriate torch.device, which can be used to determine where tensors should be stored. See the official docs for more tips on device-agnostic code. The PyTorch way is to put device placement under the control of the user, which may seem a nuisance for simple examples, but makes it much easier to work out where tensors are - which is useful for a) debugging and b) making efficient use of devices manually.
For repeatable experiments, it is necessary to set random seeds for anything that uses random number generation (including random or numpy if those are used too). Note that cuDNN uses nondeterministic algorithms, and it can be disabled using torch.backends.cudnn.enabled = False.
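For instance, a sketch of seeding other common sources of randomness, assuming random and numpy are also in use:

import random
import numpy as np

random.seed(args.seed)
np.random.seed(args.seed)
torch.backends.cudnn.enabled = False  # optional: trades cuDNN speed for determinism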
Data
data_path = os.path.join(os.path.expanduser('~'), '.torch', 'datasets', 'mnist')
train_data = datasets.MNIST(data_path, train=True, download=True,
                            transform=transforms.Compose([
                                transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))]))
test_data = datasets.MNIST(data_path, train=False,
                           transform=transforms.Compose([
                               transforms.ToTensor(),
                               transforms.Normalize((0.1307,), (0.3081,))]))
train_loader = DataLoader(train_data, batch_size=args.batch_size,
                          shuffle=True, num_workers=4, pin_memory=True)
test_loader = DataLoader(test_data, batch_size=args.batch_size,
                         num_workers=4, pin_memory=True)
Since torchvision models get stored under ~/.torch/models/, I like to store torchvision datasets under ~/.torch/datasets. This is my own convention, but it makes things easier if you have lots of projects that depend on MNIST, CIFAR-10, etc. In general it's worth keeping datasets separate from code if you end up reusing several datasets.
torchvision.transforms contains lots of handy transformations for single images, such as cropping and normalisation.
DataLoader contains many options, but beyond batch_size and shuffle, num_workers and pin_memory are worth knowing for efficiency. Setting num_workers > 0 uses subprocesses to load data asynchronously, rather than making the main process block on this. The typical use case is loading data (e.g., images) from disk, perhaps with transformations applied too; this can happen in parallel with the network processing previous data. You will want to tune the number of workers to a) minimise CPU and RAM usage (each worker loads a separate batch, not individual samples within a batch) and b) minimise the time the network spends waiting for data. pin_memory uses pinned (page-locked) RAM to speed up RAM-to-GPU transfers (and does nothing for CPU-only code). Note that on Windows, num_workers should generally be left at its default of 0, or errors may occur. shuffle=True reshuffles the data at every epoch (the default is False); it is typically set to True for training and left at False for testing. A small sketch follows.
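For example, a sketch of choosing num_workers per platform (the conditional on os.name and the worker count of 4 are illustrative, not from the original example):

import os

train_loader = DataLoader(train_data, batch_size=args.batch_size, shuffle=True,
                          num_workers=0 if os.name == 'nt' else 4,  # single-process loading on Windows
                          pin_memory=use_cuda)  # pinned memory only pays off when a GPU is in use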
Model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net().to(device)
optimiser = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

if args.resume:
    model.load_state_dict(torch.load('model.pth'))
    optimiser.load_state_dict(torch.load('optimiser.pth'))
Network initialisation typically includes member variables, layers which contain trainable parameters, and maybe separate trainable parameters and non-trainable buffers. The forward pass then uses these in conjunction with functions from F that are purely functional (don't contain parameters). Some people prefer to have completely functional networks (e.g., keeping parameters separately and using F.conv2d instead of nn.Conv2d) or networks completely made of layers (e.g., nn.ReLU instead of F.relu).
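As an illustration, a sketch of the same architecture written purely with layers (not part of the original example; assumes a PyTorch version that provides nn.Flatten):

layers_model = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
    nn.Conv2d(10, 20, kernel_size=5), nn.Dropout2d(), nn.MaxPool2d(2), nn.ReLU(),
    nn.Flatten(),  # replaces x.view(-1, 320)
    nn.Linear(320, 50), nn.ReLU(),
    nn.Linear(50, 10), nn.LogSoftmax(dim=1))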
.to(device) is a convenient way of sending the network's parameters (and buffers) to GPU if device is set to GPU, and does nothing otherwise (when device is set to CPU). It's important to transfer the network parameters to the appropriate device before passing them to the optimiser; otherwise the optimiser will not keep track of the parameters properly!
Both neural networks (nn.Module) and optimisers (optim.Optimizer) have the ability to save and load their internal state, and .load_state_dict(state_dict) is the recommended way to do so - you'll want to reload the state of both to resume training from previously saved state dictionaries. Saving the entire object can be error prone. If you have saved tensors on GPU and want to load them on CPU or another GPU, the easiest way is to directly load them onto CPU using the map_location option, e.g., torch.load('model.pth', map_location='cpu').
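For instance, a checkpoint saved on GPU can be loaded on a CPU-only machine like so (reusing the file name from above):

state = torch.load('model.pth', map_location='cpu')
model.load_state_dict(state)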
Some points of note not shown here: the forward pass can make use of control flow, e.g., a member variable or even the data itself can determine the execution of an if statement. It is also perfectly valid to print tensors in the middle of the forward pass, making debugging much easier. Finally, the forward pass can take multiple arguments. A short snippet (not tied to any sensible idea) to illustrate this is below:
def forward(self, x, hx, drop=False):
    hx2 = self.rnn(x, hx)
    print(hx.mean().item(), hx.var().item())
    if hx.max().item() > 10 or self.can_drop and drop:
        return hx
    else:
        return hx2
Training
model.train()
train_losses = []

for i, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimiser.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    train_losses.append(loss.item())
    optimiser.step()

    if i % 10 == 0:
        print(i, loss.item())
        torch.save(model.state_dict(), 'model.pth')
        torch.save(optimiser.state_dict(), 'optimiser.pth')
        torch.save(train_losses, 'train_losses.pth')
Network modules are by default set to training mode - which impacts the way some modules work, most noticeably dropout and batch normalisation. It's best to set this manually anyway with .train(), which propagates the training flag down all children modules.
Before computing a new set of gradients with loss.backward() and applying the update with optimiser.step(), it's necessary to manually zero the gradients of the parameters being optimised with optimiser.zero_grad(). By default, PyTorch accumulates gradients, which is very handy when you don't have enough resources to calculate all the gradients you need in one go.
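A sketch of exploiting this accumulation to simulate a larger effective batch (accum_steps is an illustrative name, not part of the original example):

accum_steps = 4  # effective batch size becomes accum_steps * args.batch_size
optimiser.zero_grad()
for i, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    loss = F.nll_loss(model(data), target) / accum_steps  # scale so gradient magnitudes stay comparable
    loss.backward()  # gradients accumulate across iterations
    if (i + 1) % accum_steps == 0:
        optimiser.step()
        optimiser.zero_grad()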
PyTorch uses a tape-based automatic gradient (autograd) system - it collects which operations were done on tensors in order, and then replays them backwards to do reverse-mode differentiation. This is why it is super flexible and allows arbitrary computation graphs. If none of the tensors require gradients (you'd have to set requires_grad=True when constructing a tensor for this) then no graph is stored! However, networks tend to have parameters that require gradients, so any computation done from the output of a network will be stored in the graph. So if you want to store data resulting from this, you'll need to manually disable gradients or, more commonly, store it as a Python number (via .item() on a PyTorch scalar) or numpy array. See the official docs for more on autograd.
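A minimal illustration of the tape in action, with numbers that are easy to check by hand:

x = torch.ones(2, requires_grad=True)
y = (3 * x ** 2).sum()  # operations on x are recorded on the tape
y.backward()            # replayed in reverse to compute dy/dx = 6x
print(x.grad)           # tensor([6., 6.])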
One way to cut the computation graph is to use .detach(), which you may use when passing on a hidden state when training RNNs with truncated backpropagation-through-time. It's also handy when differentiating a loss where one component is the output of another network, but this other network shouldn't be optimised with respect to the loss; examples include training a discriminator from a generator's outputs in GAN training, or training the policy of an actor-critic algorithm using the value function as a baseline (e.g. A2C). Another technique for preventing gradient calculations, which is efficient in GAN training (training the generator from the discriminator) and typical in fine-tuning, is to loop through a network's parameters and set param.requires_grad = False.
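A sketch of the second technique (discriminator stands in for whichever network should be frozen):

for param in discriminator.parameters():
    param.requires_grad = False  # no gradients will be computed for these parameters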
Apart from logging results in the console/in a log file, it's important to checkpoint model parameters (and optimiser state) just in case. You can also use torch.save() to save normal Python objects, but other standard choices include the built-in pickle.
Testing
model.eval()
test_loss, correct = 0, 0

with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum, rather than average, over the batch
        pred = output.argmax(1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()

test_loss /= len(test_data)
acc = correct / len(test_data)
print(acc, test_loss)
In response to .train() earlier, networks should explicitly be set to evaluation mode using .eval().
As mentioned previously, a computation graph is normally constructed when a network is used. Wrapping the code in the torch.no_grad() context manager prevents this from happening, saving memory and time during evaluation.
Extra
This is an extra section just to add a few useful asides.
Memory problems? Check the official docs for tips.
CUDA errors? They are a pain to debug, and are usually a logic problem that would come up with a more intelligible error message on CPU. It's best to be able to easily switch between CPU and GPU if you are planning on using the GPU. A more general development tip is to be able to set up your code so that it's possible to run through all of the logic quickly to check it before launching a proper job - examples would be preparing a small/synthetic dataset, running one train + test epoch, etc. If it is a CUDA error, or you really can't switch to CPU, setting CUDA_LAUNCH_BLOCKING=1 will make CUDA kernel launches synchronous and as a result provide better error messages.
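The variable is usually exported in the shell, but as a sketch it can also be set from within Python, as long as this runs before the first CUDA call:

import os

# read when CUDA is initialised, so set it before any CUDA operation
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'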
A note for torch.multiprocessing, or even just running multiple PyTorch scripts at once. Because PyTorch uses multithreaded BLAS libraries to speed up linear algebra computations on CPU, it'll typically use several cores. If you want to run several things at once, with multiprocessing or several scripts, it may be useful to manually reduce these by setting the environment variable OMP_NUM_THREADS to 1 or another small number - this reduces the chance of CPU thrashing. The official docs have some other notes for multiprocessing in particular.
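An in-process alternative (related to, but not an exact equivalent of, the environment variable) is to limit PyTorch's own intra-op parallelism:

import torch

torch.set_num_threads(1)  # restrict CPU ops in this process to a single thread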
Additions From Official Docs
A typical training procedure for a neural network is as follows:
- Define the neural network that has some learnable parameters (or weights)
- Iterate over a dataset of inputs
- Process input through the network
- Compute the loss (how far the output is from being correct)
- Propagate gradients back into the network's parameters
- Update the weights of the network, typically using a simple update rule: weight = weight - learning_rate * gradient
Dataset class
torch.utils.data.Dataset is an abstract class representing a dataset. Your custom dataset should inherit Dataset and override the following methods (a minimal sketch follows):
__len__, so that len(dataset) returns the size of the dataset.
__getitem__, to support indexing such that dataset[i] returns the ith sample.
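A sketch of a custom dataset over in-memory tensors (the class and attribute names are illustrative):

from torch.utils.data import Dataset


class TensorPairDataset(Dataset):
    def __init__(self, inputs, targets):
        self.inputs, self.targets = inputs, targets

    def __len__(self):
        return len(self.inputs)  # size of the dataset

    def __getitem__(self, i):
        return self.inputs[i], self.targets[i]  # the ith (input, target) pair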
Transforms
Rescale: to scale the image.
RandomCrop: to crop from the image randomly. This is data augmentation.
ToTensor: to convert the numpy images to torch images (we need to swap axes, as sketched below).
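A minimal sketch of that axis swap (image is a placeholder H x W x C numpy array):

import numpy as np
import torch

image = np.random.rand(28, 28, 3)                      # placeholder H x W x C image
tensor = torch.from_numpy(image.transpose((2, 0, 1)))  # torch images are C x H x W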
Compose transforms
torchvision.transforms.Compose
composed = transforms.Compose([Rescale(256),
RandomCrop(224)])
This composes the Rescale and RandomCrop transforms: rescale the shorter side of the image to 256, then randomly crop a square of size 224 from it.
DataLoader
torch.utils.data.DataLoader is an iterator which provides all of these features: batching the data, shuffling the data, and loading the data in parallel using multiprocessing workers. The parameters used in the example below should be clear. One parameter of interest is collate_fn, which lets you specify exactly how samples are batched; the default collate works fine for most use cases. A sketch of a custom collate_fn follows.
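As an illustration, a hypothetical pad_collate that pads variable-length 1-D samples to the longest in the batch (the function name and toy samples are made up for this sketch):

import torch
from torch.utils.data import DataLoader


def pad_collate(batch):
    inputs, targets = zip(*batch)
    max_len = max(x.size(0) for x in inputs)
    padded = torch.zeros(len(inputs), max_len)
    for i, x in enumerate(inputs):
        padded[i, :x.size(0)] = x  # left-align each sample, zero-pad the rest
    return padded, torch.tensor(targets)

samples = [(torch.ones(3), 0), (torch.ones(5), 1)]  # toy variable-length dataset
loader = DataLoader(samples, batch_size=2, collate_fn=pad_collate)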
torchvision
The torchvision package provides some common datasets and transforms, so you might not even have to write custom classes. One of the more generic datasets available in torchvision is ImageFolder.
import torch
from torchvision import transforms, datasets

data_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
hymenoptera_dataset = datasets.ImageFolder(root='hymenoptera_data/train',
                                           transform=data_transform)
dataset_loader = torch.utils.data.DataLoader(hymenoptera_dataset,
                                             batch_size=4, shuffle=True,
                                             num_workers=4)
Custom nn Modules
Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing nn.Module and defining a forward which receives input Tensors and produces output Tensors using other modules or other autograd operations on Tensors.
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them
        as member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must
        return a Tensor of output data. We can use Modules defined in the
        constructor as well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')  # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # optimizer

for t in range(500):
    # Forward pass: compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()