The Human Genome Project (HGP) was the project of 'determining the sequence of nucleotide base pairs that make up human DNA, and of identifying and mapping all of the genes of the human genome'. It was declared complete in 2003, meaning that 99% of the euchromatic human genome had been sequenced with 99.99% accuracy.

Are the datasets provided by HGP still accurate, or as accurate as was claimed in 2003?

Given the older techniques available at the time, or for any other reason (such as findings from newer research), is it possible that the datasets are not as accurate as originally expected?

    It should be noted that the reason why a technology is replaced by a newer one is not always because the old one was not accurate. Sometimes the newer technology can be a bit less accurate on some aspects, but so much cheaper that it is deemed worth switching. The sequencing and assembly techniques used to generate the data at the basis of the first release of the human genome may have been quite accurate. That said, using newer techniques to improve on this basis will of course make newer versions better. – bli May 18 '17 at 08:43

2 Answers


The HGP developed the first "reference" human genome: a genome that other genomes could be compared to. It was actually a composite of multiple human genome sequences.

The standard human reference genome is actually continually updated with major and minor revisions, a bit like software. The latest major version is called GRCh38, was released in 2013, and has since had a number of minor updates.

Are the datasets provided by HGP still accurate?

Yes, in a sense, but we certainly have better information now. One way to measure the quality of an assembly is to count its gaps: sequences that could not be resolved (this often occurs because of repetitive sequences). The initial release from the HGP had hundreds of thousands of gaps. The newest reference genome has fewer than 500.
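To make the gap metric concrete, here is a minimal sketch of how one might count gaps in an assembly. It assumes gaps are encoded as runs of `N` in a FASTA file, which is the usual convention for released assemblies. The function name `count_gaps` and the `min_len` parameter are illustrative, not part of any standard tool, and the numbers it reports may differ from officially published gap counts, which are defined in the assembly's own metadata.

```python
import re
from typing import Dict, Iterable


def _count_runs(seq: str, min_len: int) -> int:
    """Count runs of N (case-insensitive) that are at least min_len long."""
    return sum(1 for run in re.findall(r"N+", seq.upper()) if len(run) >= min_len)


def count_gaps(fasta_lines: Iterable[str], min_len: int = 1) -> Dict[str, int]:
    """Return the number of N-runs per sequence in FASTA-formatted lines.

    Each sequence is joined in memory before counting, so a run of N that
    spans a line break is counted once. This is a sketch, not a
    memory-efficient implementation for whole chromosomes.
    """
    gaps: Dict[str, int] = {}
    name = None
    chunks = []
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                gaps[name] = _count_runs("".join(chunks), min_len)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    # Close the final record.
    if name is not None:
        gaps[name] = _count_runs("".join(chunks), min_len)
    return gaps


# Toy input; real use would pass an open file handle instead.
example = [
    ">chr1 toy sequence",
    "ACGTNNNNAC",
    "GTNNACGT",
    ">chr2",
    "ACNN",
    "NNGT",
]
print(count_gaps(example))
print(count_gaps(example, min_len=3))
```

For a real assembly, you could pass an open file handle (for example `open("genome.fa")`, a hypothetical path) in place of the list. Raising `min_len` ignores very short runs of `N`, which may be ambiguous base calls rather than true assembly gaps.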


While the quality of the reference human assembly keeps improving, there are still misassemblies in it. A common problem is that recent segmental duplications are occasionally collapsed into a single sequence in the reference. Another issue is that the centromeric sequences in the reference are computationally generated and probably differ from the real sequences. Issues like these often complicate data analyses.

You should also be aware that each human has a different genome. A large region that has one copy in the reference genome may have two copies in a specific sample. While this is not really a problem with the reference, such copy-number changes have the same effect as reference errors and can mess up your pipelines.

There is still room for improvement in the human reference genome. In some regions, the CHM1 and CHM13 PacBio assemblies are better than the current reference genome at the larger scale, while Illumina population data can produce a better consensus at the single-base level. The GRC (Genome Reference Consortium) continues to release new patches to the latest assembly.
