Segmental duplications: organization and impact within the current human genome project assembly

J A Bailey et al. Genome Res. 2001 Jun.

Abstract

Segmental duplications play fundamental roles in both genomic disease and gene evolution. To understand their organization within the human genome, we have developed the computational tools and methods necessary to detect identity between long stretches of genomic sequence despite the presence of high-copy repeats and large insertion-deletions. Here we present our analysis of the most recent genome assembly (January 2001), in which we focus on the global organization of these segments and the role they play in the whole-genome assembly process. Initially, we considered only large recent duplication events that fell well below levels of draft sequencing error (alignments 90%-98% similar and ≥1 kb in length). Duplications (90%-98%; ≥1 kb) comprise 3.6% of all human sequence. These duplications show clustering and up to 10-fold enrichment within pericentromeric and subtelomeric regions. In terms of assembly, duplicated sequences were found to be over-represented in unordered and unassigned contigs, indicating that duplicated sequences are difficult to assign to their proper position. To assess coverage of these regions within the genome, we selected BACs containing interchromosomal duplications and characterized their duplication pattern by FISH. Only 47% (106/224) of chromosomes positive by FISH had a corresponding chromosomal position by comparison. We present data that indicate that this is attributable to misassembly, misassignment, and/or decreased sequencing coverage within duplicated regions. Surprisingly, if we consider putative duplications of >98% identity, we identify 10.6% (286 Mb) of the current assembly as paralogous. The majority of these alignments, we believe, represent unmerged overlaps within unique regions. Taken together, the above data indicate that segmental duplications represent a significant impediment to accurate human genome assembly, requiring the development of specialized techniques to finish these exceptional regions of the genome. The identification and characterization of these highly duplicated regions represents an important step in the complete sequencing of a human reference genome.

Figures

Figure 1

Detection Method. The method combines DNA sequence analysis software and a suite of Perl scripts that are optimized for the detection of large, highly similar duplications. Briefly, the genome assembly (2.6 Gb) is broken into tractable 400-kb segments. For each segment, common repeats (blue) are identified with RepeatMasker. Repetitive sequence is then removed (“fuguized”), leaving putatively unique DNA. All fuguized pieces are then compared by BLAST. Repeats internal to an individual 400-kb segment are detected with BLASTZ. Relaxed affine gap parameters are used, allowing gaps up to 1 kb in size to be traversed. Fuguized pairwise alignments (>0.87 similarity and >500 aligned bp) have their common repeats reinserted, and the alignment ends then undergo heuristic trimming, which refines alignment end points that may lie within common repetitive sequence. The program ALIGN generates optimal global alignments from which final alignment statistics are calculated. Global alignments with >1000 aligned bases and >90% identity were selected in this analysis.
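The fuguization step in the caption can be illustrated with a minimal Python sketch. It is not the authors' Perl implementation. It assumes the RepeatMasker output is soft-masked (repeats in lowercase), which the caption does not state, and the function names `fuguize`, `to_original_span`, and `passes_final_filter` are hypothetical. The sketch removes masked bases, keeps a coordinate map so that an alignment found on fuguized sequence can be projected back onto the original sequence (where the repeats are "reinserted"), and applies the final thresholds quoted in the caption.

```python
from typing import List, Tuple


def fuguize(masked_seq: str) -> Tuple[str, List[int]]:
    """Drop soft-masked (lowercase) bases.

    Returns the remaining sequence and, for every kept base, its 0-based
    position in the original sequence. Assumption: repeats are lowercase.
    """
    kept: List[str] = []
    positions: List[int] = []
    for i, base in enumerate(masked_seq):
        if base.isupper():
            kept.append(base)
            positions.append(i)
    return "".join(kept), positions


def to_original_span(positions: List[int], start: int, end: int) -> Tuple[int, int]:
    """Map a half-open interval [start, end) on the fuguized sequence back to
    a half-open interval on the original sequence, repeats included."""
    if not 0 <= start < end <= len(positions):
        raise ValueError("interval lies outside the fuguized sequence")
    return positions[start], positions[end - 1] + 1


def passes_final_filter(aligned_bases: int, identity: float) -> bool:
    """Final thresholds from the caption: >1000 aligned bases, >90% identity."""
    return aligned_bases > 1000 and identity > 0.90


if __name__ == "__main__":
    unique, pos = fuguize("ACGTacgtaTTGG")
    print(unique)                           # unique sequence with the repeat removed
    print(to_original_span(pos, 2, 6))      # span on the original sequence
    print(passes_final_filter(1500, 0.93))
```

In this toy input, an alignment covering fuguized positions 2 to 6 maps back to original positions 2 to 11, so the projected span crosses the removed repeat. The heuristic end trimming and the global realignment with ALIGN are not modeled here.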

Figure 2

Example of pericentromeric duplication using fuguization method. (A) A graphical view of the output for our method as displayed in the program PARASIGHT (J.A. Bailey, unpubl.). Compared to miropeats (B; Parsons 1995), all of the positions of similarity have been captured as continuous large alignments (C). An example of a large insertion-deletion in an alignment (D) demonstrates the ability of fuguization to traverse such regions, returning larger, more meaningful alignments. Lower thresholds (>500 aligned bases; >90% identity) were used for this test case compared to our genome analysis.

Figure 3

Genome-wide view of segmental duplications. The positions of alignments are depicted in red for each of the 24 chromosomes. Panels separate alignments on the basis of similarity: (A) 90%–98% identity and (B) 98%–100% identity. Purple bars depict centromeric gaps as well as the p-arms of acrocentric chromosomes (13, 14, 15, 21, and 22). Because of scale constraints, only alignments >5 kb are visible. Views were generated with the program PARASIGHT (J.A. Bailey, unpubl.), a graphical pairwise alignment viewer.

Figure 4

Distribution of highly homologous duplications (>98% identity). A histogram showing the sum of aligned bases for different bins of percent identity. Colors denote interchromosomal alignments (red) and intrachromosomal alignments, which may be further subdivided into intercontig (light blue) or intracontig (dark blue) “duplications.”
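The tabulation behind a histogram like this one can be sketched in Python. This is an illustration under stated assumptions, not the authors' code: the `Alignment` record, its field names, the category labels, and the 0.5-point bin width are all hypothetical. Only the three categories and the >98% identity cutoff come from the caption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Alignment:
    """Hypothetical pairwise alignment record (field names are assumptions)."""
    chrom_a: str
    contig_a: str
    chrom_b: str
    contig_b: str
    aligned_bases: int
    identity_pct: float  # percent identity, 0-100


def category(aln: Alignment) -> str:
    """Classify an alignment into the three groups named in the caption."""
    if aln.chrom_a != aln.chrom_b:
        return "interchromosomal"
    if aln.contig_a != aln.contig_b:
        return "intrachromosomal_intercontig"
    return "intrachromosomal_intracontig"


def identity_histogram(
    alignments: Iterable[Alignment],
    min_identity_pct: float = 98.0,
    bin_width_pct: float = 0.5,  # assumed bin width, not given in the caption
) -> Dict[Tuple[float, str], int]:
    """Sum aligned bases per (identity bin start, category) above the cutoff."""
    totals: Dict[Tuple[float, str], int] = defaultdict(int)
    for aln in alignments:
        if aln.identity_pct <= min_identity_pct:
            continue  # keep only alignments with >98% identity
        bin_start = int(aln.identity_pct // bin_width_pct) * bin_width_pct
        totals[(bin_start, category(aln))] += aln.aligned_bases
    return dict(totals)


if __name__ == "__main__":
    demo = [
        Alignment("chr1", "ctgA", "chr7", "ctgB", 12000, 99.2),
        Alignment("chr1", "ctgA", "chr1", "ctgC", 5000, 99.4),
        Alignment("chr1", "ctgA", "chr1", "ctgA", 3000, 98.7),
    ]
    for key, bases in sorted(identity_histogram(demo).items()):
        print(key, bases)
```

Summing aligned bases rather than counting alignments matches the caption's "sum of aligned bases" and keeps many short alignments from dominating a bin.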

Figure 5

Integration of segmental duplications into assembly. The two pie charts divide the assembly contigs into ordered contigs and unordered (random and unlocated) contigs. Random contigs have chromosomal assignment but no specific position in the chromosome. Unlocated contigs have no chromosome position. Duplicated sequence represents 3% and 25% of the sequence in the ordered and unordered bins, respectively.

References

    1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
    2. Amann J, Valentine M, Kidd VJ, Lahti JM. Localization of chi1-related helicase genes to human chromosome regions 12p11 and 12p13: Similarity between parts of these genes and conserved human telomeric-associated DNA. Genomics. 1996;32:260–265.
    3. Amos-Landgraf JM, Ji Y, Gottlieb W, Depinet T, Wandstrat AE, Cassidy SB, Driscoll DJ, Rogan PK, Schwartz S, Nicholls RD. Chromosome breakage in the Prader-Willi and Angelman syndromes involves recombination between large, transcribed repeats at proximal and distal breakpoints. Am J Hum Genet. 1999;65:370–386.
    4. Bentley DR, Deloukas P, Dunham A, French L, Gregory SG, Humphray SJ, Mungall AJ, Ross MT, Carter NP, Dunham I, et al. The physical maps for sequencing human chromosomes 1, 6, 9, 10, 13, 20, and X. Nature. 2001;409:942–943.
    5. Brenner S, Elgar G, Sandford R, Macrae A, Venkatesh B, Aparicio S. Characterization of the pufferfish (Fugu) genome as a compact model vertebrate genome. Nature. 1993;366:265–268.
