ECuADOR—Easy Curation of Angiosperm Duplicated Organellar Regions, a tool for cleaning and curating plastomes assembled from next generation sequencing pipelines
Introduction
With the rapid increase in availability of genomic resources offered by Next-Generation Sequencing (NGS) and the availability of free online genomic databases, efficient and standardized metadata curation approaches have become increasingly critical for the post-processing stages of biological data. Especially in organelle-based studies using circular chloroplast genome datasets, the assembly of the main structural regions in random order and orientation represents a major limitation in our ability to easily generate “ready-to-align” datasets for phylogenetic reconstruction, at both small and large taxonomic scales. In addition, current practices discard the most variable regions of the genomes to facilitate the alignment of the remaining coding regions. Nevertheless, no software is currently available to perform curation to such a degree, through simple detection, organization and positioning of the main plastome regions, making it a time-consuming and error-prone process. Here we introduce a fast and user friendly software ECuADOR, a Perl script specifically designed to automate the detection and reorganization of newly assembled plastomes obtained from any source available (NGS, sanger sequencing or assembler output).
Methodology
ECuADOR uses a sliding-window approach to detect long repeated sequences in draft sequences, which then identifies the inverted repeat regions (IRs), even in case of artifactual breaks or sequencing errors and automates the rearrangement of the sequence to the widely used LSC–Irb–SSC–IRa order. This facilitates rapid post-editing steps such as creation of genome alignments, detection of variable regions, SNP detection and phylogenomic analyses.
Validation and testing
ECuADOR was successfully tested on plant families throughout the angiosperm phylogeny by curating 161 chloroplast datasets. ECuADOR first identified and reordered the central regions (LSC–Irb–SSC–IRa) for each dataset and then produced a new annotation for the chloroplast sequences. The process took less than 20 min with a maximum memory requirement of 150 MB and an accuracy of over 99%. ECuADOR is the sole de novo one-step recognition and re-ordination tool that provides facilitation in the post-processing analysis of the extra nuclear genomes from NGS data.
The manuscript describing the software is available as Open Access resource in PeerJ here:
– Carrion, A.D., Hinsinger, D.D., Strijk, J.S.* (2020). ECuADOR – Easy Curation of Angiosperm Duplicated Organellar Regions, a tool for cleaning and curating plastomes obtained from next generation sequencing pipelines. PeerJ: 8:e8699. HERE
The software is available on GitHub and updates as well as expansions to its applications will be released HERE