This repository has been archived by the owner on Mar 16, 2022. It is now read-only.

Compatible Software

tkerelska edited this page Mar 21, 2019 · 84 revisions

The following software packages are known to be compatible with PacBio® data, in addition to PacBio's own SMRT® Analysis suite. All packages are believed to be open source or freely available for non-commercial use. See the individual project sites for up-to-date license information. A separate page lists commercial software.

Know of any other open source software for PacBio data? Email us.

Software categories:

De novo assembly

Detailed information on Large Genome Assembly with PacBio Long Reads is published here

  • Falcon: An experimental diploid assembler, tested on ~100 Mb genomes
  • Canu: Canu is a fork of the Celera Assembler designed for high-noise single-molecule sequencing
  • wtdbg2: A fuzzy Bruijn graph approach to long noisy reads assembly
  • MHAP: A reference implementation of a probabilistic sequence overlapping algorithm, designed to efficiently detect all overlaps between noisy long-read sequences. It estimates Jaccard similarity by compressing sequences to representative fingerprints composed of min-mers (minimum k-mers).
  • HGAP: hierarchical genome assembler for PacBio long reads only. Bundled in SMRT Analysis since v1.4
  • HBAR-DTK: Hierarchical-Based AssembleR Development ToolKit, recommended for advanced users only
  • ALLORA: a long read assembler for PacBio long reads alone. Available only in SMRT Analysis. Since v1.0.
  • Celera® Assembler: Celera® Assembler 8.1 now offers a way to directly assemble subreads
  • Sprai: A preassembly-based assembler that aims to generate longer contigs
  • PBcR self-correction: A mode within PBcR (aka pacBioToCA) to do self-correction in the same style as HGAP. Celera® Assembler 8.2 uses the MHAP algorithm for faster overlap calculation during the self-correction phase.
  • pacBioToCA + Celera® Assembler: A scalable hybrid assembly to combine PacBio long reads with Illumina®, 454, Sanger, Ion Torrent or CCS. Bundled in SMRT Analysis from v1.3.3
  • ECTools: A set of tools for hybrid assembly. It uses contigs instead of short reads for correction.
  • SPAdes: True hybrid assembler, PacBio with Illumina or Ion Torrent; small(er) genomes only
  • Cerulean: Cerulean is a hybrid assembler. It starts with an assembly graph from ABySS and extends contigs by resolving bubbles in the graph using PacBio long reads. It has been run successfully on genomes <100 Mb.
  • dbg2olc: dbg2olc is a hybrid assembler that uses Illumina contigs as anchors to build an overlap graph with PacBio reads, allowing very fast performance
  • ALLPATHS-LG: hybrid assembler for PacBio long reads plus Illumina mate pairs plus Illumina jumping libraries
  • AHA: A hybrid assembler to scaffold existing contigs and fill gaps. Available only in SMRT Analysis. Since v1.0
  • PBJelly 2: Gap filling and scaffolding for large genomes
  • MIRA: de novo assembler
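
The min-mer fingerprint idea mentioned in the MHAP entry above can be illustrated with a small MinHash sketch. This is a minimal illustration of the general technique, not MHAP's actual implementation: the function names, the use of `blake2b` as the hash family, and the default values of `k` and `num_hashes` are all assumptions made for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Hash a k-mer under a given seed (illustrative hash family, not MHAP's)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=16, num_hashes=128):
    """Fingerprint a sequence: for each seeded hash, keep the minimum value."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of matching min-values."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, k):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)
```

Two reads that overlap share many k-mers, so their sketches agree in many positions; comparing fixed-size sketches is much cheaper than comparing full k-mer sets, which is what makes all-versus-all overlap detection tractable. More hash functions give a tighter estimate at the cost of larger fingerprints.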

Structural Variant Calling

  • Sniffles: Calls all types of structural variants using evidence from split-read alignments, high-mismatch regions, and coverage analysis.

  • SMRT-SV: Calls insertions, deletions, and inversions using a local assembly approach.

RNA Analysis

  • IsoCon: for targeted Iso-Seq only. IsoCon is a tool for deriving finished transcripts from Iso-Seq reads. Input is a set of full-length non-chimeric reads in FASTA format and the CCS base call values as a BAM file. The output is a set of predicted transcripts.

  • Cupcake: accompanying scripts for official Iso-Seq1, 2, and 3 output analysis.

  • TAMA: suite of downstream analysis scripts, including collapsing and merging transcript data. See TAMA wiki for more details.

  • SQANTI: an Iso-Seq QC and analysis tool that takes long-read output from Iso-Seq, IDP, TAPIS, etc., and combines it with short reads, a reference genome, and annotations to give a comprehensive description of the dataset. preprint

  • TAPPAS: for isoform analysis and visualization, to be used after data has been cleaned up with SQANTI.

  • lncRNA Discovery Pipeline: Python scripts for using two ncRNA classifiers (CPAT and PLEK) for discovering long ncRNAs in Iso-Seq data.

  • ANGEL: Python library for doing both error-free and error-tolerant Open Reading Frame prediction

  • Cogent: Genome Reconstruction using Iso-Seq data only, without a reference genome.

  • SpliceMap-LSC-IDP pipeline: developed by Kin Fai Au's lab, a hybrid (long + short read) error correction and quantification software for transcriptome data.

  • IDP-fusion: a fusion detection tool using both long and short reads (hybrid).

Reference-based alignment

  • bwa-sw: Burrows-Wheeler aligner with Smith-Waterman

Consensus and variant calling

  • GATK: Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping
  • DeepVariant: DeepVariant is an analysis pipeline developed by Google that uses a deep neural network to call genetic variants from next-generation DNA sequencing data
  • LoFreq: Low-frequency variant caller. It is recommended to switch off BAQ computation with -B. Calls all known mutations in the HBV amplicon dataset without false positives, starting from 1.15% allele frequency (AF).
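
As a rough sketch of the LoFreq recommendation above, the following helper builds a `lofreq call` command line with BAQ switched off via `-B`. The helper name and the file names in the usage note are hypothetical; check the options against the LoFreq version you have installed before running it.

```python
def lofreq_call_command(bam, reference, out_vcf, disable_baq=True):
    """Build an argument list for `lofreq call` (hypothetical helper).

    -f gives the reference FASTA, -o the output VCF, and -B switches off
    BAQ computation, as recommended for PacBio data on this page.
    """
    cmd = ["lofreq", "call", "-f", reference, "-o", out_vcf]
    if disable_baq:
        cmd.append("-B")
    cmd.append(bam)  # the input BAM goes last
    return cmd
```

The result can be passed to `subprocess.run(cmd, check=True)`, for example with `lofreq_call_command("aligned.bam", "ref.fasta", "variants.vcf")`, where all three file names are placeholders.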

Epigenetic base modifications and methylation

Genome Browsers

  • IGV: Integrative Genomics Viewer from the Broad Institute
  • SMRT View: PacBio's Genome Browser for SMRT Sequencing data. Explore and interact with Resequencing, De novo, Base Modification and Identification, Motif Analysis, cDNA, Single Molecule and Barcoding experiment results
  • Tablet: Next Generation Sequence Assembly Visualization