In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline

Introduction to Porekit-Python

Disclaimer

Porekit is the result of my personal interest in nanopore sequencing. I’m not affiliated with Oxford Nanopore Technologies or with any MAP participant. This means a lot of the factual information presented in this notebook may be wrong.

General philosophy and goal of Porekit-Python

This library is meant to provide tools for interactively exploring nanopore data inside the Jupyter notebook, and for writing simple scripts or more complex software dealing with nanopore data. Therefore a lot of attention has been given to making interactive use easy and painless, and to keeping the code in the background flexible and exposed to library users.

What Oxford Nanopore Data looks like

The MinION sequencer is attached to a laptop running MinKnow. This program connects directly to the MinION device and tells it what to do. Optionally, third party software can connect to an API inside the primary software to remote control the sequencing process. That is not covered here, though.

In a nutshell, nanopore sequencing works by dragging a DNA molecule through a tiny pore in a membrane. As the DNA passes, the voltage difference between the two sides of the membrane changes, depending on the electrochemical properties of the passing nucleotides. This means that, at the core of nanopore data, there is a timeseries of voltage measurements, which is called the “squiggle”.
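To make the idea concrete, here is a fabricated stand-in for a squiggle (all numbers are invented, not real nanopore measurements): a 1-D NumPy array in which each nucleotide-scale event holds one mean level for a handful of samples, with measurement noise on top.

```python
import numpy as np

# Hypothetical squiggle: real ones live inside FAST5 files; this one is
# synthesized. Each "event" contributes a constant level for a few samples.
rng = np.random.default_rng(42)
n_events = 50                              # pretend 50 nucleotide-scale events
levels = rng.normal(60, 10, n_events)      # one mean level per event (made up)
samples_per_event = 20
squiggle = (np.repeat(levels, samples_per_event)
            + rng.normal(0, 1.5, n_events * samples_per_event))
print(squiggle.shape)  # (1000,)
```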

The process to convert the squiggle into a sequence of DNA letters is called base calling. The current MinKnow software uploads the squiggle to Metrichor servers, which perform the base calling, and send the result back to the user’s computer.

The result of a sequencing run is a collection of FAST5 files, each containing data on one molecule of DNA which passed through one of currently 512 channels in the flowcell. These files are stored on disk, usually in one directory per run. A convention seems to be to name each file with a unique and descriptive string:

In [4]:
!ls /home/andi/nanopore/GenomeRU2/downloads/pass/ | tail -n 10
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file64_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file67_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file6_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file74_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file86_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file90_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file92_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file95_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file97_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file9_strand.fast5

These files belong to data published by Quick et al. http://europepmc.org/abstract/MED/25386338;jsessionid=ijHIHUVXlcpxeTzVUihz.0

Gathering Metadata

The following snippet extracts metadata from all of my downloaded nanopore data, searching directories recursively.

In [6]:
import porekit
everything = porekit.gather_metadata("/home/andi/nanopore/")

The result is a Pandas DataFrame object, which is too big to comfortably view in its entirety, but still comparatively “small data”. Here is a subset of it:

In [7]:
everything[['asic_id', 'channel_number', 'template_length', 'complement_length']].head()
Out[7]:
asic_id channel_number template_length complement_length
filename
PLSP57501_default_sample_id_2124_1_ch186_file3_strand.fast5 3442015863 186 NaN NaN
PLSP57501_default_sample_id_2124_1_ch324_file15_strand.fast5 3442015863 324 NaN NaN
PLSP57501_default_sample_id_2124_1_ch141_file3_strand.fast5 3442015863 141 NaN NaN
PLSP57501_default_sample_id_2124_1_ch103_file2_strand.fast5 3442015863 103 NaN NaN
PLSP57501_default_sample_id_2124_1_ch386_file13_strand.fast5 3442015863 386 NaN NaN

All of the columns available:

In [8]:
everything.columns
Out[8]:
Index(['absolute_filename', 'format', 'run_id', 'asic_id', 'version_name',
       'device_id', 'flow_cell_id', 'asic_temp', 'heatsink_temp',
       'channel_number', 'channel_range', 'channel_sampling_rate',
       'channel_digitisation', 'channel_offset', 'has_basecalling',
       'basecall_timestamp', 'basecall_version', 'basecall_name',
       'has_template', 'template_length', 'has_complement',
       'complement_length', 'has_2D', '2D_length', 'read_start_time',
       'read_duration', 'read_end_time'],
      dtype='object')

The file names are used as an index because they are assumed to be unique and descriptive. Even when the absolute/physical location of the dataset changes, data or analytics keyed on these filenames remain useful.
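Because the frame is indexed by filename, a single read’s metadata can be pulled out with `.loc`, and the index survives filtering. A minimal sketch on a made-up frame (column names follow porekit’s output above; filenames and values are invented):

```python
import pandas as pd
import numpy as np

# Tiny stand-in for the `everything` frame: filenames as index, invented values.
meta = pd.DataFrame({
    'channel_number': [186, 324, 141],
    'template_length': [np.nan, 8021.0, 4517.0],
}, index=['run1_ch186_file3_strand.fast5',
          'run1_ch324_file15_strand.fast5',
          'run1_ch141_file3_strand.fast5'])

# Look up one read by its filename...
row = meta.loc['run1_ch324_file15_strand.fast5']
print(row.template_length)  # 8021.0

# ...and filter while keeping the filename index intact, e.g. reads that
# actually have a template basecall:
called = meta[meta.template_length.notnull()]
print(list(called.index))
```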

The philosophy of porekit is to gather the metadata once and then store it in a different format. This makes it easier to analyse the metadata or to use it in another context, for example together with alignment data.

The following will store the metadata in an HDF5 File:

In [10]:
everything.to_hdf("everything.h5", "meta")
/home/andi/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1096: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block3_values] [items->['absolute_filename', 'format', 'run_id', 'asic_id', 'version_name', 'device_id', 'flow_cell_id', 'basecall_version', 'basecall_name', 'has_template', 'has_complement', 'has_2D']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

Grouping by Device, ASIC and Run Ids

In [13]:
g = everything.groupby(['device_id', 'asic_id', 'run_id'])
In [14]:
df = g.template_length.agg([lambda v: len(v), np.mean, np.max])
df.columns = ['Count', 'Mean template length', 'Max template_length']
df
Out[14]:
Count Mean template length Max template_length
device_id asic_id run_id
MN02178 32748 a96198c51610f2df5381b3da4378dba90cb4635b 8107.0 1036.574442 3028.0
MN02297 217752539 aeec32a5b6567efde499475723d41e2370444ad1 160992.0 4999.441396 116519.0
MN15179 152762368 8da1985a781386de16342edd83dc2fb8c7a0236a 57830.0 2174.129721 383927.0
d636d617db78c8db8c911f58b869d32395cff3e8 65648.0 NaN NaN
3442015863 4918830b4e8663a6ecbaefd9abffa68fe92a0736 4137.0 NaN NaN
4244960115 05fb4df1edb44cce039fbc7f609ee0eb4614229e 20032.0 856.956434 60364.0
c91c9d31084bb5e2f0bf8f14064c8be2445eeb0e 26047.0 1330.689296 91786.0
MN16528 3571011476 69ed1ea007613207aa5a162012197a9ffe806e9f 68.0 NaN NaN

As you can see, I have downloaded several nanopore datasets from ENA. These are mostly incomplete sets, since I was more interested in the variety of the data than in its completeness. You can easily use wget to download a tarball from ENA, then extract the partial download. The last file will be truncated, but the rest is usable.