I have a set of BAM files that were aligned to the NCBI GRCh37 human genome reference (with chromosome names such as NC_000001.10), but I want to analyze them using a BED file that has the UCSC hg19 chromosome names (e.g. chr1). I want to use bedtools to pull out all the on-target and off-target reads.

  1. Are NCBI and UCSC directly comparable? Or do I need to re-align the BAM/lift-over the BED to the UCSC reference?
  2. Should I convert the BED file or the BAM file? Everyone here uses the UCSC chromosome names/positions so I'll need to convert the eventual files to UCSC anyway.

3 Answers


You're the second person I have ever seen using NCBI "chromosome names" (they're more like supercontig IDs). Normally I would point you to a resource providing mappings between chromosome names, but since no one has added NCBI names (yet, maybe I'll add them now) you're currently out of luck there.

Anyway, the quickest way to do what you want is to samtools view -H foo.bam > header to get the BAM header and then change each NCBI "chromosome name" to its corresponding UCSC chromosome name. DO NOT REORDER THE LINES! You can then use samtools reheader and be done.

Why, you might ask, would this work? The answer is that chromosome/contig names in BAM files aren't stored in each alignment. Rather, the names are stored in a list in the header and each alignment just contains the integer index into that list (read group IDs are similar, for what it's worth). This also leads to the warning above against reordering entries, since that's a VERY convenient way to start swapping alignments between chromosomes.
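The header edit described above can be sketched in Python. This is a minimal illustration rather than a tested pipeline: the function name `rename_sq_lines` and the two-entry mapping are made up for the example, and a real mapping would have to cover every contig in the BAM. It rewrites only the SN: field of each @SQ line and never touches line order, which is the property that keeps the per-alignment integer indices pointing at the right sequence.

```python
def rename_sq_lines(header_text, name_map):
    """Rename the SN: field on @SQ lines, leaving line order untouched."""
    out = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            fields = line.split("\t")
            for i, field in enumerate(fields):
                if field.startswith("SN:"):
                    old = field[3:]
                    # Names missing from the map are kept unchanged.
                    fields[i] = "SN:" + name_map.get(old, old)
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical two-entry mapping, for illustration only.
example_map = {"NC_000001.10": "chr1", "NC_000002.11": "chr2"}

# Stand-in for the text produced by: samtools view -H foo.bam > header
example_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:NC_000001.10\tLN:249250621\n"
    "@SQ\tSN:NC_000002.11\tLN:243199373\n"
)

new_header = rename_sq_lines(example_header, example_map)
print(new_header, end="")
```

In practice you would read the header file written by samtools view -H, write the result to a new file, and pass that file to samtools reheader along with the original BAM.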

As an aside, you'd be well served switching to Gencode or Ensembl chromosome names; they're rather more coherent than the something_random mess that's present in hg19 from UCSC.

Update: Because I'm nice, here is the conversion between NCBI and UCSC. Note that if you have any alignments to patches, there is simply no UCSC equivalent for them. That's one of the many reasons not to use UCSC (avoid their annotations too).

Devon Ryan
  • How well does this work? Have you run any benchmarks? I ask because I tried to convert various bed files between hg and GRC genomes and the three tools I used all gave very different results. This sort of mapping really should be a simple thing but it appears not to be that straightforward at all. – terdon May 18 '17 at 10:52
  • For cases where it's just a change of name (most cases) there's nothing to benchmark. For cases where you additionally have position changes then you'd need a different resource (namely, liftOver or crossmap). – Devon Ryan May 18 '17 at 11:10
  • Yeah, it's tools like liftOver and crossmap I've used and found issues with. I was expecting this to be a solved problem but each of the three tools I used gave different results, unfortunately. Which makes me leery of using the results. – terdon May 18 '17 at 11:19
  • The results are deterministic, they should be the same regardless of tool provided you use the same settings. – Devon Ryan May 18 '17 at 11:20
  • You'd think so, yes. But they're not. I'd be happy to chat about this (this is one of the things I need to sort out at work) if you like, and I'd be very pleased to learn I was just doing it wrong, but my preliminary tests and various threads I've read online (see here for example) suggest it isn't as simple as you'd think. – terdon May 18 '17 at 11:26
  • I suspect this will come down to deficiencies in the chain files, but I'd like to hear about that. – Devon Ryan May 18 '17 at 11:40

The "right" solution would be realignment, but that's expensive and most of us would not go that route. My preferred solution would be to convert the bed file, as opposed to the bam. Here's why:

1) Reheadering the bam means that you may have reads aligned to contigs without a corresponding entry in UCSC (see Devon's list for the mappings). This is a problem because:

  • Some of those reads would likely have been mapped elsewhere if a reference without those contigs was used.
  • I'm not even sure what happens to those reads after reheadering - I guess they would need to be marked as unmapped? Lots of potential for screwiness there.

2) It seems cleaner to convert the bed file from UCSC->NCBI, where you are guaranteed that every entry has a "home". Then, after you pull your info from the bam, you can always convert chromosome names back if you need to.
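A rough sketch of that BED conversion in Python follows. The function name `rename_bed_chroms`, the one-entry mapping, and the example intervals are all hypothetical; the only assumptions about the file are that it is tab-separated with the chromosome name in the first column, as in standard BED. Entries whose chromosome has no mapping are collected separately instead of being silently dropped, so you can check what was left out.

```python
def rename_bed_chroms(bed_lines, name_map):
    """Return (renamed_lines, skipped_lines) for an iterable of BED lines."""
    renamed, skipped = [], []
    for line in bed_lines:
        # Pass header-style and blank lines through unchanged.
        if line.startswith(("track", "browser", "#")) or not line.strip():
            renamed.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] in name_map:
            fields[0] = name_map[fields[0]]
            renamed.append("\t".join(fields) + "\n")
        else:
            # No NCBI name known for this chromosome: keep it aside.
            skipped.append(line)
    return renamed, skipped


# Hypothetical mapping and intervals, for illustration only.
ucsc_to_ncbi = {"chr1": "NC_000001.10"}
bed = [
    "chr1\t100\t200\ttargetA\n",
    "chrHypothetical\t5\t50\ttargetB\n",
]

renamed, skipped = rename_bed_chroms(bed, ucsc_to_ncbi)
print("".join(renamed), end="")
print("skipped:", len(skipped))
```

Swapping the keys and values of the mapping gives the reverse direction, for converting results back to UCSC names afterwards.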

  • Alignments that end up losing their chromosome end up unmapped, though they don't get the SAM unmapped flag (rather, they end up represented as being on the * chromosome). But yeah, samtools reheader should be used with extreme caution. – Devon Ryan May 17 '17 at 19:02
  • Several tools will throw a wobbly if you pass reads on * that are not marked as unmapped. pysam is one. I know this from bitter (bug-fixing) experience. – Ian Sudbery Aug 08 '17 at 11:03

Just to point out that if you want to follow @Devon Ryan's answer for a different organism/assembly that is not in his very useful linked resource, you can download the mappings between NCBI names, UCSC contig names and chromosome numbers from https://www.ncbi.nlm.nih.gov/assembly.

Go to the site and search for your assembly name. At the bottom of the page is a box called "Global assembly definition" containing a link titled "Download full sequence report".

The downloaded file contains a table with:

  • Chromosome numbers in "Sequence-Name"/"Assigned-molecule"
  • NCBI names in "RefSeq-Accn"
  • UCSC contig name in "UCSC-style-name"
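
A sketch of turning that report into a name mapping in Python is below. It rests on a few assumptions that you should check against your own download: that the file is tab-separated, that the last line starting with "#" before the data holds the column names, and that missing values are written as "na". The function name `parse_assembly_report` is made up, the example text is abbreviated to a few columns, and the patch accession in it is a placeholder rather than a real record. Column names are matched case-insensitively.

```python
def parse_assembly_report(text, src_col="RefSeq-Accn", dst_col="UCSC-style-name"):
    """Build a {src_col value: dst_col value} dict from report text."""
    header = None
    mapping = {}
    src_key, dst_key = src_col.lower(), dst_col.lower()
    for line in text.splitlines():
        if line.startswith("#"):
            # Assumption: the last comment line before the data
            # is the tab-separated list of column names.
            header = [h.lower() for h in line.lstrip("# ").split("\t")]
            continue
        if not line.strip() or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        src = row.get(src_key, "na")
        dst = row.get(dst_key, "na")
        # Skip sequences that lack a name on either side.
        if src != "na" and dst != "na":
            mapping[src] = dst
    return mapping


# Abbreviated, partly hypothetical report text for illustration.
example_report = (
    "# Assembly name:  GRCh37\n"
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\tRefSeq-Accn\tUCSC-style-name\n"
    "1\tassembled-molecule\t1\tNC_000001.10\tchr1\n"
    "EXAMPLE_PATCH\tfix-patch\t1\tNW_000000000.1\tna\n"
)

ncbi_to_ucsc = parse_assembly_report(example_report)
print(ncbi_to_ucsc)
```

Sequences with no UCSC-style name are left out of the dict, so it is worth comparing its size against the number of @SQ lines in your BAM header before reheadering.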
Ian Sudbery