Amino acid based de Bruijn graph algorithm for identifying complete coding genes from metagenomic and metatranscriptomic short reads.


PubMed ID: 30657979


Liu J, Lian Q, Chen Y, Qi J

Nucleic Acids Res. Jan 2019. doi: 10.1093/nar/gkz017

COMMENT: This article presents MetaPA, a new assembler for metagenomic and metatranscriptomic sequences that focuses on recovering the set of proteins present in each sample. Unlike other assembly methods, which work on nucleotide sequences, this method assembles amino acid sequences. It cannot distinguish sequences from different strains that share the same protein sequence but differ at the nucleotide level, so MetaPA can be seen as oriented to assembling the pan-proteome of a metagenomic or metatranscriptomic sample. This avoids many of the problems posed by assembling similar nucleotide sequences and yields a representative global pan-proteome of each sample, which provides a useful functional profile. The drawback is that highly similar orthologous sequences from different strains or species can be collapsed into a single protein.
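To illustrate why working in amino acid space merges strain-level nucleotide variants, here is a minimal Python sketch. It is not taken from MetaPA; the two "strain" sequences and the function name `translate` are made up for illustration, and only the standard genetic code is assumed. Two reads that differ at synonymous codon positions translate to the same peptide, so a protein-level assembler sees them as identical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate a nucleotide string in frame 0, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt) - 2, 3))


# Hypothetical fragments from two strains: different codons, same protein.
strain_a = "ATGGCTAAAGGT"
strain_b = "ATGGCCAAGGGA"

print(translate(strain_a))  # MAKG
print(translate(strain_b))  # MAKG
```

A nucleotide assembler would treat `strain_a` and `strain_b` as distinct sequences (they differ at three positions), whereas at the peptide level they are indistinguishable.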


To provide an efficient assembler of amino acid sequences for metagenomics and metatranscriptomics projects that use short-read NGS technologies.


Here we present a protein assembler (MetaPA) that is based on de Bruijn graph searching in oligopeptide spaces and can be applied to both metagenomic and metatranscriptomic sequencing data.

MetaPA adopts a de Bruijn graph based strategy and depends on multiple graphs constructed from k-mers of different lengths (up to 24 amino acids). Iterating over these graphs helps correct sequencing errors and reduce false ORF predictions, and leads to simpler de Bruijn graphs, which yield more completely assembled proteins. In addition, published microbial protein sequences, if available, can be introduced to guide the assembly of metagenomic sequences.
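The general idea of a de Bruijn graph over amino acid k-mers can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not MetaPA's actual implementation: it uses a single k rather than iterating over several k-mer lengths, and the function names (`build_graph`, `extend`), the `min_count` threshold, and the example peptides are all hypothetical. Nodes are (k-1)-mers, each observed k-mer adds an edge, and rare k-mers can be dropped as a crude stand-in for error correction.

```python
from collections import Counter, defaultdict


def build_graph(peptides, k, min_count=1):
    """Build a de Bruijn graph whose nodes are (k-1)-mers of amino acids.

    k-mers seen fewer than min_count times are discarded (a simple,
    assumed filter for likely sequencing errors).
    """
    counts = Counter(p[i:i + k] for p in peptides for i in range(len(p) - k + 1))
    graph = defaultdict(set)
    for kmer, n in counts.items():
        if n >= min_count:
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend(graph, start):
    """Follow the path from start while each node has exactly one successor."""
    path, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle
            break
        path += nxt[-1]
        seen.add(nxt)
        node = nxt
    return path


# Hypothetical overlapping peptide fragments, e.g. from translated short reads.
peptides = ["MKTAYIAK", "TAYIAKQR", "IAKQRQIS"]
graph = build_graph(peptides, k=4)
print(extend(graph, "MKT"))  # MKTAYIAKQRQIS
```

Raising `min_count` removes k-mers supported by too few reads, so an isolated substitution error in one fragment no longer creates a branch in the graph. Larger k values reduce spurious branching at the cost of needing longer overlaps, which is one motivation for combining graphs built with several k-mer lengths.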

We compared the performance of MetaPA with that of six other assemblers: SFA-SPA, which is also an amino acid based assembler, and five nucleotide based approaches, namely IDBA-UD, MEGAHIT, metaSPAdes, MetaVelvet and SOAPdenovo2.


The source code of MetaPA is available at


Raquel Tobes