AYB - Advanced Base Calling for Next Generation Sequencing Machines

About


AYB is a base caller for the Illumina Genome Analyzer, using an explicit statistical model of how errors occur during sequencing to produce more accurate reads from the raw intensity data.

In particular, AYB deals with three sources of error:

  • Cross-talk: There is overlap in the excitation spectra of the fluorophores used to label the nucleotides, leading to light emission being detected under several combinations of lasers and filters ("channels"). This effect is especially noticeable for the fluorophores used to mark adenine and guanine, each of which is bright in two channels.
  • Phasing: As the number of cycles increases, the signal starts to blur as the cluster loses synchronicity: random failure of nucleotides to incorporate, or failure of the blocking element to prevent incorporation of more than one nucleotide, means that individual strands lag or lead, and the signal detected at each cycle is a mixture of several positions along the read.
  • Contamination: Non-sequence contamination in the flow cell, microscopic particles of dust for example, gets illuminated by the lasers and might be detected instead of sequence. Such contamination is generally abnormally bright compared to the surrounding sequence and so does not conform to what AYB expects; the quality score for the called base is automatically down-weighted as a result.
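
The first two effects can be pictured with a small forward simulation. The following is a minimal sketch, not AYB's implementation: it assumes that the noise-free intensities for one cluster can be written as the product of a 4 x 4 cross-talk matrix, a 4 x n base indicator matrix and an n x n phasing matrix. The cross-talk values, the function names and the constant lag and lead rates are all illustrative assumptions.

```python
BASES = "ACGT"

# Hypothetical cross-talk matrix, CROSSTALK[channel][base]. The values are
# made up: base A also shows in the C channel, base G also in the T channel.
CROSSTALK = [
    [1.0, 0.2, 0.0, 0.0],  # A channel
    [0.7, 1.0, 0.0, 0.0],  # C channel
    [0.0, 0.0, 1.0, 0.1],  # G channel
    [0.0, 0.0, 0.6, 1.0],  # T channel
]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def sequence_matrix(seq):
    """4 x n indicator matrix: entry is 1.0 where the read has that base."""
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def phasing_matrix(n_cycles, lag, lead):
    """P[pos][cycle]: fraction of strands emitting position pos at a cycle.

    Assumes a constant-rate model (an illustration only): in every cycle a
    strand advances 0 bases with probability lag, 2 bases with probability
    lead, and 1 base otherwise.
    """
    stay = 1.0 - lag - lead
    # dist[k] is the fraction of strands that have incorporated k bases.
    dist = [0.0] * (n_cycles + 1)
    dist[0] = 1.0
    P = [[0.0] * n_cycles for _ in range(n_cycles)]
    for cycle in range(n_cycles):
        new = [0.0] * len(dist)
        for k, frac in enumerate(dist):
            for step, prob in ((0, lag), (1, stay), (2, lead)):
                if k + step < len(new):  # strands past the read end are dropped
                    new[k + step] += frac * prob
        dist = new
        for pos in range(n_cycles):
            P[pos][cycle] = dist[pos + 1]
    return P


def simulate_intensities(seq, crosstalk, phasing):
    """Noise-free 4 x n intensities under the assumed model M * S * P."""
    return matmul(matmul(crosstalk, sequence_matrix(seq)), phasing)


if __name__ == "__main__":
    intensities = simulate_intensities("ACGT", CROSSTALK,
                                       phasing_matrix(4, 0.1, 0.02))
    for base, row in zip(BASES, intensities):
        print(base, [round(x, 3) for x in row])
```

With both rates set to zero the phasing matrix is the identity and each cycle shows only the cross-talk pattern of its own base; with non-zero rates the diagonal shrinks cycle by cycle and neighbouring positions leak into the signal.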

In contrast to other base-calling approaches, AYB uses a general model of phasing estimated directly from the data rather than assuming that it occurs at a constant rate for all cycles. Dealing with phasing in this manner means that the base calls made by AYB at the end of each read tend to be more accurate than those of other methods, making greater read lengths feasible and increasing the number of the highest quality reads: AYB returns 2.8 times as many perfect reads as other base callers for 100 cycle data (with smaller gains for shorter reads).

By default AYB performs per-tile analysis, estimating phasing and cross-talk separately for every tile. This level of analysis is more processor intensive than the Illumina analysis pipeline but can be efficiently split between machines: an entire 8 lane run of 45 cycle data (95 million clusters) can be analysed within an hour on a modern eight-core server, as could 2 million clusters of much longer 101 cycle data. In addition, AYB offers two options to reduce the total computational burden. Fixing the cross-talk matrix across tiles, at a value previously estimated by either AYB or the Illumina pipeline, allows phasing to be solved analytically in each iteration and so speeds up estimation considerably. Alternatively, a Bustard-like approach can be used, estimating the cross-talk and phasing from a few tiles and then holding them fixed while calling bases for the remaining tiles.

Download

AYB is freely available under the GNU General Public Licence version 3 (see www.gnu.org for further information). A copy of the licence is provided with the software.

AYB Version II Source code

Latest version of AYB.

Build instructions for Version II are in the README file.

The Version II AYB Manual contains user information including program options.

AYB with generalised phasing model

This version of AYB is the one on which Massingham and Goldman (2012) is based.

Original AYBc Source code

Original pre-release version of AYB with older phasing model. Historic interest only.

Recalibration tool (suitable for all versions of AYB)

CIFTools

The ciftools package for manipulating CIF format intensities may also be useful.

News

20 December 2012
Support for compressed output
Add samplename to output
26 August 2012
Support for reading coordinates from run folder
Format of output read names now more in keeping with those from Illumina pipeline
31 May 2012
AYB Version 2.11
Thin missing data and general tidy up.
25 April 2012
AYB Version 2.10
Performance improvements including thin option. Memory leak fixed.
04 April 2012
AYB Version 2.09
Bug fixes to improve handling of certain patterns of missing data.
29 February 2012
AYB paper published.
Massingham and Goldman (2012) All Your Base: a fast and accurate probabilistic approach to base calling Genome Biology 13:R13
21 February 2012
AYB Version 2.08
Option to use spike-in data to improve base calling and calibrate qualities.
16 December 2011
AYB Version 2.07
Option to run with multiple threads (with OpenMP).
01 December 2011
AYB Version 2.06
Implement improved quality scoring and robustness as in AYBg
26 October 2011
AYBg update
Improved quality scoring and robustness fixes
Basis for revised manuscript
18 October 2011
AYB Version 2.05
Implement improved modelling algorithm as in AYBg
14 Sept 2011
AYBg compilation fixes on Linux, reported by Yves Wetzels
Turn on openmp support, compile optimised by default
Remove dependency on Fortran compiler (Mac + Linux)
1 Sept 2011
AYBg update
Addition of quality calculation and calibration. No changes to base-call accuracy
29 July 2011
AYBg released
Much improved accuracy over previous AYB versions due to generalised phasing model. Basis for manuscript.
22 July 2011
AYB Version 2.04
Memory use reduction and changes to sim runfile format (version 5)
10 May 2011
AYB Version 2.03
Automated module and system testing and modelling refactored (no function change)
17 Feb 2011
AYB Version 2.02
Quality calibration table now contains values to use
08 Feb 2011
New release of recalibration tool produces values to use
21 Jan 2011
AYB Version 2.01
Cif from run-folder and quality calibration table
07 Dec 2010
First release of AYB Version II
28 Nov 2010
Performance improvements
Improved estimation of phasing
07 Oct 2010
Translation into C
21 May 2009
Initial release

Examples

AYB intensities

will process the cif intensity file in one block using 5 iterations and output a fastq file, with input and output both in the current directory and log messages sent to stderr.

AYB -b R76R76 -i cifdir -o outputdir s_3_1301

will process a 76 base paired-end run from the file s_3_1301.cif stored in the directory cifdir. Output will be stored in outputdir.

AYB -i runfolder -b R8R108R108 -r L1T1301-2301

will process a 108 base paired-end run, with an additional 8 base index between the pairs, from a run folder. All the tiles between 1301 and 2301 will be processed from lane 1.
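
Going only by the two examples above, the -b argument appears to be a string of R<n> blocks, one per read, each giving a number of cycles. The helper below is a hypothetical sketch for checking such a string before launching a run. It is not part of AYB, the name parse_read_blocks is invented here, and it handles only the R blocks shown on this page; the Version II manual is the reference for the full syntax.

```python
import re


def parse_read_blocks(blockstring):
    """Return the cycle count of each R<n> block, in the order given.

    Assumption: the string consists solely of blocks of the form R<n>,
    as in the examples on this page. Anything else raises ValueError.
    """
    if not re.fullmatch(r"(?:R\d+)+", blockstring):
        raise ValueError("unsupported block string: %r" % blockstring)
    return [int(n) for n in re.findall(r"R(\d+)", blockstring)]


if __name__ == "__main__":
    for spec in ("R76R76", "R8R108R108"):
        blocks = parse_read_blocks(spec)
        print(spec, blocks, "total cycles:", sum(blocks))
```

Summing the block lengths gives the number of cycles the intensity files need to cover, which is a cheap sanity check to make before starting a long analysis.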

Data sets

PhiX
76 cycle control lane (27 tiles). Sanger Institute.
B. pertussis
76 cycle paired-end data from a problematic run (100 tiles). Sanger Institute.
HiSeq
101 cycle paired-end data from a HiSeq machine with PhiX spike-in. Illumina corp.
Ibis Test
51 cycle test set of data distributed with the Ibis base-caller
NA19240/BGI (archive)
45 cycle paired-end data from BGI (part of 1KGP, pilot 2, individual NA19240)
NA19240/Illumina (archive)
51 cycle paired-end data from Illumina (part of 1KGP, pilot 2, individual NA19240)

Paper

AYB paper (Genome Biology open access)
All Your Base: a fast and accurate probabilistic approach to base calling. T. Massingham and N. Goldman (2012) Genome Biology 13:R13
Figures
Fig 1. Comparison of error rates.
Fig 2. Frequency of errors for B. pertussis data.
Fig 3. Quality calibration comparison between AYB and Ibis.
Supplementary
Fitting a block tridiagonal information matrix by ML
Supplementary (old)
Rapid estimation of M, P and N
Basecalls
Basecalls for data sets in manuscript

Contact

Please direct any queries to ayb@ebi.ac.uk
