nanoporetech/nanonet
| nanonet | Remove extraneous currennt call. | 4 months ago | |
| sample_data | Update example data to contain raw | a year ago | |
| .gitignore | Get 2d caller to running on windows | a year ago | |
| COPYRIGHT | Add open source licensing | 10 months ago | |
| LICENSE.md | Add open source licensing | 10 months ago | |
| MANIFEST.in | deb packaging, datafiles still missing | a year ago | |
| Makefile | Dispense with repack_deb.sh | a year ago | |
| README.md | Revert "Resolves #19 Add simulation" | 7 months ago | |
| TRAIN_INSTALL.md | Add prerequisites to training. | 4 months ago | |
| discover_tests.py | Unittests for nn.py | a year ago | |
| nanoccompiler.py | Support Mingw-w64 compiler on Windows (#11) | a year ago | |
| nanonetcall.spec | Add pyinstaller spec | a year ago | |
| setup.py | Fix #51 (#52) | 3 months ago | |
| stdeb.cfg | Dispense with repack_deb.sh | a year ago |
README.md
(c)2016: Oxford Nanopore Technologies, Limited
Nanonet
Nanonet provides recurrent neural network basecalling for Oxford Nanopore MinION data. It represents the first generation of such a basecaller from Oxford Nanopore Technologies, and is provided as a technology demonstrator. Nanonet is provided unsupported by Oxford Nanopore Technologies, see LICENSE.md for more information.
For training networks, Nanonet leverages currennt to run recurrent neural networks. Currennt is generally run with GPUs to aid performance but can be run in a CPU only environment. The basecaller does not require currennt, and is written in pure python with minimal requirements.
Installation
Nanonet contains implementations of both 1D and 2D basecalling, with OpenCL versions of each of these. By default, using the instructions in this section only the canonical 1D basecalling library will be compiled; OpenCL acceleration of 1D basecalling and any 2D basecalling support will not be configured. See later sections of this documented for setting up these components.
Requirements
Nanonet is a python-based commandline suite consisting of several programs for performing basecalling with recurrent neural networks. It has been developed using 64-bit Python 2.7 available from python.org.
In addition to Python the only required dependencies are h5py and numpy. These can be downloaded and installed/compiled automatically, though installing them from your system's package repository is generally preferable in the first instance. For Windows Christophe Golke maintains a repository of compiled python wheels at:
http://www.lfd.uci.edu/~gohlke/pythonlibs/
For OSX homebrew is easiest:
brew tap homebrew/science
brew install hdf5
Easy Windows Installation
Python wheels for nanonet are available from the github page under the releases tab:
https://github.com/nanoporetech/nanonet-dev/releases
to install these run the following from a command prompt:
pip install nanonet-2.0.0-cp27-cp27m-win_amd64.whl
The wheel currently enables CPU implementataions of both 1D and 2D calling, and additionally the OpenCL implementation of 1D calling. To enable the latter please refer to the OpenCL section below. Wheels for other platforms may be made available in the future.
Setup
The basecalling component of nanonet should install quite trivially using the standard python mechanism on most platforms:
python setup.py install
The basecaller contains small amounts of C code for performing event detection. A C compiler is therefore required. Under Linux and OSX, one is likely installed already, for Windows one can download and install the Microsoft Visual C++ Compiler for Python 2.7 from:
https://www.microsoft.com/en-gb/download/details.aspx?id=44266
Most of these documentation assume you are using this compiler on Windows. The section Compiling with MinGW-w64explains how to use GCC based compilation on Windows.
Optional Watcher Component
Nanonet contains an optional component to watch a filesystem for read files as they are produced by MinKnow. This feature is not installed by default. To install it run
pip install -e .[watcher]
from the source directory, or simply
pip install watchdog
for any location. This will allow use of the --watch option of the basecaller.
Peforming basecalling
Nanonet provides a single program for basecalling Oxford Nanopore Technolgies' reads from .fast5 files. The output is to stdout as fasta formatted sequences.
If you followed the instructions above the program nanonetcall should be on your path. It requires a single argument:
nanonetcall {input_folder} > {output.fasta}
To test your installation several .fast5 files are provided in the sample_data folder of the source distribution, as a concrete example the command:
nanonetcall --jobs 5 --chemistry r9 sample_data > basecalls.fa
will produced output along the following lines:
Basecalled 5 reads (25747 bases, 49340 events) in 36.6668331623s (wall time)
Profiling
---------
Feature generation: 1.08410215378
Load network: 0.0128238201141
Run network: 15.6343309879 (1.5990450771 kb/s, 3.13412835112 kev/s)
Decoding: 19.8838129044 (1.25730412574 kb/s, 2.46431608644 kev/s)
Filesystem watching
Nanonet has the ability to watch a filesystem as reads are produced. This behaviour is enabled with the --watch option:
nanonetcall input_folder --watch 600 > basecalls.fa
where the option value is a timeout in seconds, when no new reads are seen for this time nanonetcall will exit.
Input files
nanonetcall operates from single-read .fast5 files as output by MinKnow. These should contain raw data; the event detection step and segmentation into template and complement sections will be performed by nanonet.
Using multiple CPUs
By default nanonetcall will use a maximum of one CPU. This can be altered through use of the --jobs option. In using this option be aware that higher numbers will lead to increased memory usage.
2D Basecalling
The 2D basecalling library has a dependency on the boost C++ library. For this reason it is not compiled by default. If you know that you have a working boost installation you can simply use the following to enable 2D basecalling:
python setup.py install with2d
Boost can be installed on most Linux systems via the system package manager. For example on Ubuntu it should be sufficient to install the boost-python package:
sudo apt-get install libboost-python1.54-dev libboost-python1.54.0
You may elect to use a different version of boost if you wish. On OSX boost can again use homebrew:
brew install boost --with-python
brew install boost-python
On Windows the simplest method to install boost is to obtain a precompiled version from sourceforge:
https://sourceforge.net/projects/boost/files/boost-binaries/
nanonet has been tested with version 1.55.0 on windows. It is important to use the version compiled with the same version of the Microsoft Visual C compiler you are using. If you followed the instructions above to install the Microsoft Visual C++ Compiler for Python 2.7 you should download the package labelled msvc-9.0-64.exe, i.e. the file available here:
https://sourceforge.net/projects/boost/files/boost-binaries/1.55.0/boost_1_55_0-msvc-9.0-64.exe/download
The above installer will by default install boost to c:\local\boost_1_55_0. If you change this path you will also need to edit the setup.py file in nanonet. To run the 2D basecaller you will need also to add the library location to your Pathenvironment variable. In Windows Powershell this can be done with:
$env:Path += ";c:\local\boost_1_55_0\lib64-msvc-9.0"
Once you have installed boost on your OS, the 2D basecalling components can be compiled and set up with:
python setup.py install with2d
Performing 2D basecalling currently requires use of a distinct program from the pure 1D basecaller. The interface of this program is much the same as nanonetcall, for example a basic use would simply require:
nanonet2d --chemistry r9 sample_data calls
The second option here specifies a prefix for output fasta files; three files will be created: one each for template, complement and 2D basecalls.
OpenCL Support
Nanonet contains OpenCL accelerated versions of both 1D and 2D basecalling. Currently the implementations of these use different mechanisms for creating the OpenCL kernels: 1D basecalling acceleration is driven through pyopencl whilst 2D acceleration directly interfaces with OpenCL libraries.
Configuring a working OpenCL environment depends heavily on your OS and device you wish to target. If you are unfamiliar with how to do this you are recommended to start with the pyopencl documentation:
https://wiki.tiker.net/PyOpenCL
1D Acceleration
Once you have the pyopencl examples working you should be able to run the accelerated version of 1D basecalling:
nanonetcall --chemistry r9 --platforms <VENDOR:DEVICE:1> --exc_opencl sample_data
where VENDOR and DEVICE will be machine dependent. To see available devices you can examine the output of:
nanonetcall --list_platforms
The --exc_opencl option above instructs nanonet to use only the OpenCL device(s) listed on the command line. Without this option, CPU resources will also be used. You may wish to experiment with this option and the --jobs option to achieve optimal throughput.
2D Acceleration
To enable OpenCL acceration on 2D basecalling it is necessary to compile additional components during the setup of nanonet by using the opencl2d argument:
python setup.py install opencl2d
With a working OpenCL runtime and development environment 2D basecalling can be accelerated by simply adding and option on the commandline:
nanonet2d --opencl_2d --chemistry r9 sample_data calls
The program will automatically choose an OpenCL device to use, giving preference to GPU devices over CPU ones. It is not currently possible to use OpenCL acceleration for the 1D basecalling necessary for performing a 2D basecall.
Compiling with MinGW-w64
The Microsoft Visual C++ Compiler for Python 2.7 compiler is fairly old and can produce less than efficient code compared to more recent compilers. The following describes how one may instead compile nanonet using MinGW. This process is not recommended for those unfamiliar with compilation toolchains. We use specific, but not the latest, versions of MinGW and Boost in what follows. Be aware that these instructions may not work with later versions of these tools.
First download the following build of MinGW-w64:
https://sourceforge.net/projects/mingwbuilds/files/host-windows/releases/4.8.1/64-bit/threads-posix/seh/x64-4.8.1-release-posix-seh-rev5.7z/download
together with Boost 1.55.0:
https://sourceforge.net/projects/boost/files/boost/1.55.0/boost_1_55_0.7z/download
The files are packaged with 7zip, which you should install if you haven't previously. Extract both of the packages to:
C:\local\
Open Windows Powershell and run the following to build the boost-python library:
$env:Path += ';C:\local\mingw64\bin'
cd c:\local\boost_1_55_0\
.\bootstrap.bat gcc
.\b2.exe toolset=gcc address-model=64 link=shared define=MS_WIN64 --with-python stage
The distutils module of Python 2.7 supports using MinGW as a compiler, although unfortunately it is likely to cause .dlls to be incorrectly linked when using MinGW-w64. For this reason and to streamline the build process, nanonet patches the class distutils.cygwinccompiler.Mingw32CCompiler with its own version and manipulates some distutils internals in order to force use of this class when a users elects to use MinGW-w64. All this means is that to compile and setup an in-place (development) install of nanonet run:
python setup.py develop --user with2d mingw
If you have and existing MinGW-w64 setup you can use the environment variables BOOST_ROOT, BOOST_LIB, and BOOST_PYTHON to specify respectively the location of you boost install, relative location of the boost libraries, and the name of your boost-python library. For example the defaults are:
BOOST_ROOT = "c:\local\boost_1_55_0\"
BOOST_LIB = "stage\lib"
BOOST_PYTON = "boost_python-mgw48-mt-1_55"
Benchmarks
The results below are indicative and provided for reference only. The performance of nanonet should not be assumed to be equal to that of a production implementation.
AWS g2.2xlarge Ubuntu 16.04 Xeon E5-2670 @ 2.60GHz
- numpy is linked to a multithreaded blas, it appears best to set the number of threads to 1.
- 1D performance scales nicely with number of CPU processes
- 1D GPU performance is rough 4.5X for single concurrent kernel or approaching 6.5X for ten concurrent kernels.
- 2D GPU performance is 7X.
1D basecalling
$ nanonetcall more_sample_data/ > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 39.9055719376s (wall time)
Run network: 23.59 ( 1.278 kb/s, 2.653 kev/s)
Decoding: 14.53 ( 2.076 kb/s, 4.308 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 30.2339270115s (wall time)
Run network: 13.78 ( 2.189 kb/s, 4.544 kev/s)
Decoding: 14.74 ( 2.046 kb/s, 4.247 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ --jobs 4 > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 7.72924900055s (wall time)
Run network: 13.84 ( 2.180 kb/s, 4.525 kev/s)
Decoding: 15.00 ( 2.011 kb/s, 4.174 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ --platforms nvidia:0:1 --exc_opencl > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 8.73325586319s (wall time)
Run network: 3.83 ( 7.868 kb/s, 16.331 kev/s)
Decoding: 1.54 (19.623 kb/s, 40.730 kev/s)
2D basecalling
$ OMP_NUM_THREADS=1 nanonet2d more_sample_data calls
Processed 20 reads in 214.572195053s (wall time)
Template Run network: 14.01 ( 2.153 kb/s, 4.469 kev/s)
Template Decoding: 15.76 ( 1.914 kb/s, 3.973 kev/s)
2D calling: 156.85 ( 0.188 kb/s)
$
$ OMP_NUM_THREADS=1 nanonet2d more_sample_data calls --opencl_2d
Processed 20 reads in 78.750191927s (wall time)
Template Run network: 13.92 ( 2.167 kb/s, 4.498 kev/s)
Template Decoding: 15.50 ( 1.946 kb/s, 4.038 kev/s)
2D calling: 21.72 ( 1.355 kb/s)
Intel NUC5i7RYH Ubuntu 14.04 i7-5557U CPU @ 3.10GHz
- Again CPU performance better with single threaded blas
- 1D CPU marginally better than AWS machine, though relative speed of network and coding altered.
- Two physical cores: performance with 4 processes not significantly faster than 2.
- 1D Iris 6100 roughly 3.5X over CPU core. Concurrent kernel execution is not handled well.
- 2D GPU 5X over CPU.
1D basecalling
$ nanonetcall more_sample_data/ > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 34.7250430584s (wall time)
Run network: 16.35 ( 1.845 kb/s, 3.829 kev/s)
Decoding: 16.26 ( 1.855 kb/s, 3.851 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 28.1761329174s (wall time)
Run network: 10.19 ( 2.960 kb/s, 6.143 kev/s)
Decoding: 16.13 ( 1.870 kb/s, 3.881 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ --jobs 2 >/dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 14.3247690201s (wall time)
Run network: 10.38 ( 2.907 kb/s, 6.033 kev/s)
Decoding: 16.11 ( 1.872 kb/s, 3.885 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ --jobs 4 >/dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 12.9280588627s (wall time)
Run network: 18.68 ( 1.615 kb/s, 3.352 kev/s)
Decoding: 29.16 ( 1.034 kb/s, 2.147 kev/s)
$
$ OMP_NUM_THREADS=1 nanonetcall more_sample_data/ --platforms intel:0:1 --exc_opencl > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 10.3367981911s (wall time)
Run network: 5.54 ( 5.444 kb/s, 11.299 kev/s)
Decoding: 2.88 (10.474 kb/s, 21.739 kev/s)
2D basecalling
$ OMP_NUM_THREADS=1 nanonet2d more_sample_data calls
Processed 20 reads in 180.179665089s (wall time)
Template Run network: 10.15 ( 2.970 kb/s, 6.165 kev/s)
Template Decoding: 16.59 ( 1.818 kb/s, 3.773 kev/s)
2D calling: 127.52 ( 0.231 kb/s)
$
$ OMP_NUM_THREADS=1 nanonet2d more_sample_data calls --opencl_2d > /dev/null
Processed 20 reads in 80.0777461529s (wall time)
Template Run network: 10.15 ( 2.973 kb/s, 6.170 kev/s)
Template Decoding: 16.73 ( 1.803 kb/s, 3.742 kev/s)
2D calling: 26.19 ( 1.124 kb/s)
E5-2650 v4 @ 2.20GHz (x2, 24 cores total) GTX 1080 Ubuntu 14.04
- Timings likely a little off as GPU used by other processes
- Utilise very little (<200Mb) of GPU memory
1D basecalling
$ nanonetcall more_sample_data/ --platforms nvidia:0:20 --exc_opencl > /dev/null
Basecalled 20 reads (30160 bases, 62600 events) in 3.88349485397s (wall time)
Run network: 1.32 (22.829 kb/s, 47.384 kev/s)
Decoding: 0.91 (33.268 kb/s, 69.051 kev/s)
2D basecalling
$ OMP_NUM_THREADS=1 nanonet2d more_sample_data calls --opencl_2d
Processed 20 reads in 56.9682049751s (wall time)
Template Run network: 10.10 ( 2.986 kb/s, 6.199 kev/s)
Template Decoding: 11.60 ( 2.601 kb/s, 5.398 kev/s)
2D calling: 13.30 ( 2.214 kb/s)
$
$ OMP_NUM_THREADS=1 nanonet2d more_sample_data calls --opencl_2d --jobs 4
Processed 20 reads in 15.9661970139s (wall time)
Template Run network: 10.19 ( 2.959 kb/s, 6.141 kev/s)
Template Decoding: 11.74 ( 2.570 kb/s, 5.334 kev/s)
2D calling: 15.96 ( 1.844 kb/s)
Training a network
The package provides also an interface to currennt for training networks from .fast5 files. The type of neural networks implemented by currennt require labelled input data. Nanonet does not provide a method for labelling data. It does however provide a hook to create labelled data. To install currennt please refer to the full installation instructions in TRAIN_INSTALL.md.
To run the trainer we specify training data, and validation data. The former will be used to train the network whilst the latter is used to check that the network does not become overtrained to the training data.
nanonettrain --train <training_data> --val <validation_data> \
--output <output_model_prefix> --model <input_model_spec>
An example input model can be found in nanonet/data/default_model.tmpl. The only other consideration is that the size of the first ("input") layer of the network must correspond to the feature vectors created bynanonet.features.events_to_features. The nanonettrain program will try to enforce these considerations. In contructing models one should assign the input layer a size of <n_features> and the final two layers as <n_states>, as in the example.
Training is an intensive process, even on a GPU expect it to take hours not minutes. It is not recommended to attempt training models without GPU support.
Trouble Shooting
If you performed a user install (setup.py install --user) the nanonetcall program may not be on your path. This is because the location into which setuptools installs programs for users is not often a default item in the user's path. On OSX the location is typically:
/Users/<username>/Library/Python/2.7/bin/
whilst on Ubuntu it is:
~/.local/bin
If nanonet complains that it cannot locate the currennt executable you will need to set the CURRENNT environment variable to the location of the executable.
nanoporetech/nanonet的更多相关文章
- AI佳作解读系列(四)——数据增强篇
前言 在深度学习的应用过程中,数据的重要性不言而喻.继上篇介绍了数据合成(个人认为其在某种程度上可被看成一种数据增强方法)这个主题后,本篇聚焦于数据增强来介绍几篇杰作! (1)NanoNets : H ...
随机推荐
- 【Luogu】P3786萃香抱西瓜(状压DP)
题目链接 水题,数据范围提示得太明显了吧,不用动脑子都能知道是状压. 不过还是有坑(当然更可能是我脑子有坑) f[i][j][k][l]表示当前是第i秒,萃香在(j,k),已经抱到的西瓜状态是l的最少 ...
- 论蛋疼的调戏matrix67的首页
唔,初学js,然后拿matrix67的首页玩玩23333 console 里输入以下这句话 for(var i = 1; i <= 400; i++){var td = document.get ...
- NOIP2017赛前模拟(5):总结
题目: 1.刮刮卡 已知n(n<=1000000)张刮刮卡按顺序排列,刮开可以获得B元现金和B个积分,购买刮刮卡需要A元,某人若按照顺序刮开的话··当B的总和小于A时便会停止刮卡(即花出去的钱多 ...
- 【bzoj5110】Yazid的新生舞会
这里是 $THUWC$ 选拔时间 模拟赛的时候犯 $SB$ 了,写了所有的部分分,然后直接跑过了 $4$ 个大样例(一个大样例是一个特殊情况)…… 我还以为我非常叼,部分分都写对了,于是就不管了…… ...
- NIO系列1:框架拆解
最近一年用NIO写了不少网络程序,也研究了一些开源NIO网络框架netty.mina等,总结了一下NIO的架构特点. 无论是netty还是mina它们都在java原生NIO的基础上进行了完善的封装,虽 ...
- 表单编码 appliation/x-www-form-urlencoded 与 multipart/form-data 的区别
当表单使用POST方法时,表单数据提交到服务器端之前有两种编码类型可供选择.默认编码类型为 application/x-www-form-urlencoded,此时所有非字母数字类型的字符都需要转换为 ...
- Codeforces989E. A Trance of Nightfall
$n \leq 200$个平面上的点,$q \leq 200$次询问:重复操作$m \leq 10000$次,到达点$x$的概率最大是多少.操作:一开始选点$P$,不一定要是给定点,可以是平面上任一点 ...
- c#.net分类上升达人~~~呵呵。。。
原文发布时间为:2008-11-11 -- 来源于本人的百度文章 [由搬家工具导入] 觉得自己蛮无聊的~~~~~~~~~(>_<)~~~~
- canvas之webgl的打开方式
OK,一个完整的dome是从无到有的.这篇文章将记录我的webgl之旅.不定时更新. 首先,新建一个canvas元素. 然后获取到它. var e = document.ElementById('ca ...
- 【kindeditor】kindeditor的使用
官网:http://kindeditor.net/demo.php 效果: