De novo assembly of elite inbred line PH207 provides insights into genomic and transcriptomic diversity in maize (Zea mays L.).
Candice N. Hirscha, Cory D. Hirsch, Alex B. Brohammer, Megan J. Bowman, Ilya Soifer, Omer Barad, Doron Shem-Tov, Kobi Baruch, Fei Lu, Alvaro G. Hernandez, Christopher J. Fields, Chris L. Wright, Klaus Koehler, Nathan M. Springer, Edward Buckler, C. Robin Buell, Natalia de Leon, Shawn M. Kaeppler, Kevin L. Childs, Mark A. Mikel
Sequencing technologies: Whole-genome shotgun sequencing using paired-end, mate-pair, and TruSeq synthetic long-reads Sequencing method: Illumina HiSeq Chemistry
Assembly methods: Reads pre-processing and error correction. For all PH207 genomic libraries, with the exception of TruSeq synthetic long-reads, PCR duplicates were removed using FastUniq software (Xu et al., 2012). The Illumina HiSeq2000 adaptor AGATCGGAAGAGC was removed, and reads were error corrected using the Corrector_HA module of SOAPdenovo (using kmer size 23 and cutoff of 6) (Luo et al., 2012). For the PCR-free library (MiSeq stitched reads), following adaptor truncation, overlapping reads were merged using FLASH (Magoc and Salzberg, 2011) with a minimal required overlap of 10 bp to create the stitched reads. Processing of MP reads consisted of filtering out putative false mate-pairs by searching for the Nextera linker (10 nucleotides of CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG) sequence on either end of the MP. Mate pairs for which the linker was not found were sorted into a separate file for restricted scaffolding application. Mate-pairs that did not hit the linker were used only in support of links found with the filtered MPs, but were not used to create links independently. The TruSeq synthetic long-read assembly pipeline application was processed by Illumina (Illumina IGN FastTrack Long Reads version 41; Illumina, San Diego, CA) to create synthetic long-read builds that represent long contiguous template fragments.
PH207 De Novo Assembly. In the first step of assembly, SOAPdenovo v1.05 (Luo et al., 2012) was used to construct a De Bruijn graph of contigs from the single end reads of the PE library using very conservative settings (no bubble merge and no repeat masking, but removing low coverage kmers and edges). A kmer of size 63 bp was optimal for De Bruijn graph construction. To scaffold the contigs of the De Bruijn graph, non-repetitive contigs within the graph were identified and assembled into scaffolds based on mapping information of the single end reads, followed by application of synthetic long-reads to resolve long repeats in a similar way. The mapping of the single read ends (mapping without gaps) and long-read builds facilitated scaffolding by linking contigs mapping to the same read.
Scaffolding was facilitated using a directed graph containing scaffolds longer than 200 bp as nodes, and edges were based on the PE and MP links as vertices. Erroneous connections were filtered out to generate unconnected sub-graphs that were ordered into scaffolds. PE reads were used to find reliable paths in the graph for additional repeat resolving. This was accomplished through searching the De Bruijn graph for a unique path connecting pairs of reads mapping to two different scaffolds. At this phase of scaffolding, scaffolds had an average size of thousands of bases and almost no gaps. The scaffolds were then furthered ordered and linked using the MP libraries, estimating gaps between the contigs according to the distance of MP links. Linking scaffolds with MP reads required confirmation of at least three filtered MPs or at least one filtered MP with supporting confirmation from two or more filter failed MPs where the Nextera adaptor was not found. Scaffolds shorter than 200 bp were masked and links between non-repetitive contigs mapping to the same scaffolds were united, generating a directed scaffold graph. In agreement with previous reports, a significant amount of erroneous MPs was observed. Since these pairs link between long non-branched components in the scaffold graph, the scaffolding procedure identified the non-branched components and filtered out the rare connections between them. Construction of pseudomolecules: Further ordering of scaffolds was achieved through scaffold alignment to the B73 reference genome version 2 using BWA-SW version 0.6.1 (Li and Durbin, 2010) requiring alignment quality of at least one. The most probable genomic location was assigned to the ordered scaffold based on the alignments of the unordered scaffolds that generated it. The ordered scaffolds were then placed into pseudomolecules to maximize synteny between the PH207 and B73 genomes. Finally, small scaffolds that mapped inside the unfilled gap of ordered scaffolds replaced these gaps if MP links existed between them. TruSeq synthetic long reads were aligned to scaffolds generated without usage of the TruSeq synthetic long-reads. Scaffolds were identified that had a significant alignment of greater than 30 kb to two different locations greater than 10 Mb apart on the B73 reference genome version 2. For those scaffolds where the block aligning to the alternative locations was between two blocks that aligned to the same location, it was assumed to be a translocation between the two genomes, and it was considered unlikely there were two misassemblies at the same region. If a block aligned to one location followed by a block aligning to another distant genomic location, it was considered a misassembly. Suspicious scaffolds were further resolved using a previously defined genotyping-by-sequencing anchor tag pipeline (Lu et al., 2015), and those flagged by both criteria were manually reviewed for misassembly.
To assess the completeness of the genome assembly, PH207 genomic paired-end reads were cleaned using Cutadapt version 1.8.1 (Martin, 2011) requiring a minimum length of 70 and a minimum quality of 10, and aligned to the PH207 genome assembly using Bowtie version 2.2.4 (Langmead and Salzberg, 2012) with default parameters as single end reads to determine the percentage of reads that could map to the assembly. Variants were called using Samtools mpileup (Li et al., 2009) with a minimum depth of three and a maximum depth of 200. Only variants exceeding a quality score of 20 were retained. To assess the completeness of the gene space in the assemblies, PH207 RNAseq reads were aligned to the PH207 genome assembly and B73 RNAseq reads were aligned to the B73 version 2 genome assembly (Schnable et al., 2009). Reads were cleaned using Cutadapt version 1.8.1 (Martin, 2011) requiring a minimum length of 75 nt and a minimum quality of 20. All reads were trimmed to 75 nt using fastx_trimmer within the FASTX toolkit version 0.0.14 (http://hannonlab.cshl.edu/fastx_toolkit/index.html) for consistency between the B73 and PH207 reads. Reads were aligned using Bowtie2 version 2.2.4 (Langmead and Salzberg, 2012) and TopHat2 version 2.0.12 (Kim et al., 2013) with the maximum number of multihits set to 20, minimum intron size of 10 bp, and maximum intron size of 60,000 bp. Additionally, the Core Eukaryotic Genes Mapping Approach pipeline version 2.4.010312 (Parra et al., 2007) was run using default parameters. Comparison of the B73 and PH207 genome assemblies was completed using the NUCmer program within MUMmer version 3.23 (Kurtz et al., 2004). The B73 v2 assembly was used for the comparison and the minimum length of a cluster match was set to 5,000 for the genome-wide plot and 250 for the regional view of chromosome 1.
Whole-genome shotgun sequencing using paired-end, mate-pair, and TruSeq synthetic long-reads
Complete Genome; 230x coverage
Total scaff length
N50 scaff length
N50 contig length
Total sequence length represented by scaffolds.
The length of scaffold which takes the sum length (summing from longest to shortest scaffold) past 50% of the total assembly size.
The length of contig which takes the sum length (summing from longest to shortest contig) past 50% of the total assembly size.
A contig is a contiguous consensus sequence that is
derived from a collection of overlapping reads.
A scaffold is set of a ordered and orientated contigs
that are linked to one another by mate pairs of sequencing reads.
Kevin Childs lab, Michigan State University
The MAKER genome annotation pipeline was used to generate gene annotations (Campbell et al., 2014a; Campbell et al., 2014b; Law et al., 2015). The PH207 genome assembly was first masked using REPEATMASKER and a maize custom repeat library (Smit et al., 2013-2015; Law et al., 2015). Six PH207 transcript assemblies (leaf blade, root cortical parenchyma, root stele, germinating kernel, root tip, and whole seedling) generated using Trinity (Grabherr et al., 2011), the predicted O. sativa proteome and UniProtKB/Swiss-Prot plant proteins (minus maize sp. proteins) were used as evidence during each stage of the MAKER pipeline (Kawahara et al., 2013; UniProt Consortium, 2014). Initially, a Hidden Markov Model (HMM) was trained for the SNAP ab initio gene prediction program using high-quality transcript assembly alignments as gene proxies (Korf, 2004). MAKER was run using the initial SNAP HMM, and these gene predictions were used to train SNAP a second time. MAKER was run again using the second SNAP HMM to make gene predictions, and these gene models were used to train an HMM for the Augustus gene prediction program (Stanke and Waack, 2003). MAKER was run a final time using both the second SNAP HMM and the Augustus HMM. The genes identified by MAKER included models with and without transcript or protein alignment evidence. The predicted proteins from the MAKER gene set were analyzed with hmmscan to identify Pfam domains (Eddy, 2011; Finn et al., 2014). A high-quality gene set was created using all gene predictions that were supported by transcript or protein evidence or that coded for a protein with a Pfam domain.