3D DNA structure helps assemble genomes

Although DNA sequencing technologies have advanced immensely in recent years, a typical published genome sequence may be split into as many as 100,000 DNA fragments, because full assembly of chromosomes is extremely difficult. We show, for the first time, how a simple genome-wide measurement of the 3D structural properties of DNA can be used to assemble and complete genome sequences in a high-throughput manner.

HFSP Long-Term Fellow Noam Kaplan and colleagues
authored on Mon, 27 January 2014

Over the last decade, the meteoric advancement of high-throughput DNA sequencing technology seems to have turned the sequencing of a new complex genome from a grandiose multi-billion dollar project into an almost routine task. However, in marked contrast with the growing number of available genome sequences, the quality of published assemblies of complex genomes is in fact decreasing. This surprising trend arises because most current sequencing technologies read only short DNA sequences, while complex genomes tend to be highly repetitive, which makes assembling these short reads into a complete genome sequence extremely difficult. As a result, current genome publications typically report fragmented genomes of up to 100,000 separate, unordered DNA sequences rather than full-length chromosomes, and further assembly requires laborious, low-throughput methods. Even in the human genome, in spite of the massive effort invested in its completion, approximately 30 Mb of euchromatic DNA remains unassembled. High-throughput sequencing and genome assembly technology have thus reached a point at which increasing the number of short reads does not significantly improve the quality of genome assemblies.

Hi-C is a simple experimental technique that uses high-throughput DNA sequencing to measure pairwise spatial interaction frequencies between chromatin segments across the whole genome. Interestingly, in addition to species-specific and cell-type-specific chromatin interactions, all Hi-C experiments show a canonical interaction pattern. This pattern is so prominent that it is usually normalized out of the data in order to reveal more subtle interactions. It is thought to reflect the random path of the DNA, and it follows a general trend: the closer two loci are in the genomic sequence, the higher their probability of interacting in 3D.
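To make the idea of "normalizing out" the canonical pattern concrete, here is a minimal Python sketch of one common approach, dividing each observed contact by the average contact at the same bin separation. The function names and the toy matrix are illustrative assumptions, not part of the published work, and real Hi-C pipelines apply further corrections.

```python
def diagonal_means(matrix):
    """Mean contact count at each bin separation (offset from the diagonal)."""
    n = len(matrix)
    means = []
    for offset in range(n):
        values = [matrix[i][i + offset] for i in range(n - offset)]
        means.append(sum(values) / len(values))
    return means


def observed_over_expected(matrix):
    """Divide each contact by the mean contact at the same separation."""
    n = len(matrix)
    expected = diagonal_means(matrix)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            e = expected[abs(i - j)]
            result[i][j] = matrix[i][j] / e if e > 0 else 0.0
    return result


# Toy symmetric contact matrix (made-up numbers): contacts decay with
# separation, plus one hypothetical specific interaction between bins 1 and 4.
n_bins = 6
toy = [[100.0 / (1 + abs(i - j)) for j in range(n_bins)] for i in range(n_bins)]
toy[1][4] = toy[4][1] = 75.0

normalized = observed_over_expected(toy)
print(diagonal_means(toy))  # average contact at each separation
print(normalized[1][4])     # the specific interaction stands out after normalization
```

After normalization, entries that simply follow the distance trend sit near 1, while the specific interaction rises above it.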

Figure: 3D interactions are used to bridge gaps in the genomic sequence and assemble a genome.

How can Hi-C data be used to aid genome assembly? The solution is simple. We know that the genomic distance between two loci affects their 3D interaction frequency, so we can apply this principle in reverse: the 3D interaction frequency implies the genomic distance. Thus, we can use a Hi-C experiment to measure 3D interaction frequencies between all genomic DNA fragments, convert this interaction matrix into a matrix of estimated genomic distances, and consolidate these distances into estimated genomic positions using mathematical tools, without requiring any overlap between the DNA fragments.
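The three steps just described (interaction matrix, estimated distances, estimated positions) can be illustrated with a minimal Python sketch. This is not the published method. It assumes a noise-free power-law decay, f = scale * d^(-alpha), with made-up parameters, and it swaps in classical multidimensional scaling, computed by power iteration, as the "mathematical tool". All function names and the toy fragment positions are hypothetical.

```python
import math
import random


def contacts_to_distances(contacts, alpha=1.0, scale=1.0):
    """Invert the ASSUMED decay law f = scale * d**(-alpha) to estimate d."""
    n = len(contacts)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if contacts[i][j] <= 0:
                raise ValueError("this sketch needs a positive contact frequency")
            dist[i][j] = (contacts[i][j] / scale) ** (-1.0 / alpha)
    return dist


def positions_from_distances(dist, iterations=200, seed=0):
    """1D classical multidimensional scaling using power iteration.

    Expects a symmetric distance matrix. The result is defined only up to
    a shift and a mirror flip.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        eigenvalue = norm  # approaches the dominant eigenvalue's magnitude
    return [math.sqrt(eigenvalue) * x for x in v]


# Toy example: six fragments listed in arbitrary order (positions in Mb, made up).
true_positions = [4.0, 0.0, 9.0, 2.5, 7.0, 1.0]
n = len(true_positions)
contacts = [[0.0 if i == j else 1.0 / abs(true_positions[i] - true_positions[j])
             for j in range(n)] for i in range(n)]

estimated = positions_from_distances(contacts_to_distances(contacts))
order = sorted(range(n), key=lambda k: estimated[k])
print(order)  # fragment indices along the chromosome, possibly mirrored
```

Because only pairwise distances are used, no sequence overlap between fragments is needed. Real Hi-C counts are noisy, so a statistical fit would be needed in place of this exact inversion.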

To test our approach, we used the human genome to simulate highly fragmented genome scenarios in which genomic DNA fragments are separated by huge gaps on the order of 1 Mbp. We then used Hi-C data to estimate the genomic positions of the DNA fragments and compared the predicted positions with the actual positions. Our results indicate that our approach can robustly assemble entire chromosomes with a relatively small error rate. Finally, we applied our method to a set of as yet unplaced DNA fragments from the human genome. We predicted the positions of 65 unplaced fragments and show that our predictions are consistent with those of other methods.
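One simple way to compare predicted and actual positions is to count how many fragment pairs end up in the wrong relative order. The Python sketch below is an illustrative metric chosen for this example, not necessarily the error measure used in the study. The function name and the toy numbers are hypothetical. Because a predicted chromosome may be mirrored, the sketch scores both orientations and keeps the better one.

```python
def ordering_error_rate(true_positions, predicted_positions):
    """Fraction of fragment pairs placed in the wrong relative order.

    Both orientations of the prediction are scored and the better one is
    reported. Pairs predicted at identical positions count as errors.
    """
    n = len(true_positions)
    concordant = discordant = ties = 0
    for i in range(n):
        for j in range(i + 1, n):
            t = true_positions[i] - true_positions[j]
            p = predicted_positions[i] - predicted_positions[j]
            if t == 0:
                continue  # no defined true order for this pair
            if p == 0:
                ties += 1
            elif (t > 0) == (p > 0):
                concordant += 1
            else:
                discordant += 1
    total = concordant + discordant + ties
    if total == 0:
        return 0.0
    return (min(concordant, discordant) + ties) / total


# Made-up example: four fragments, with two neighbours swapped in the prediction.
actual = [0.0, 1.0, 2.0, 3.0]
predicted = [0.0, 2.0, 1.0, 3.0]
print(ordering_error_rate(actual, predicted))
```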

Our approach is the first high-throughput method based on short-read sequencing that can achieve full, chromosome-scale genome assemblies. Importantly, the method can in principle bridge gaps of any size, and it is simple, robust, scalable and applicable to any species.


High-throughput genome scaffolding from in vivo DNA interaction frequency. Kaplan N, Dekker J. (2013). Nature Biotechnology, 31:1143-1147.
