Most Popular

1500 questions
191
votes
4 answers

Why does the SARS-Cov2 coronavirus genome end in aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (33 a's)?

The SARS-Cov2 coronavirus's genome was released, and is now available on GenBank. Looking at it... 1 attaaaggtt tataccttcc caggtaacaa accaaccaac tttcgatctc ttgtagatct 61 gttctctaaa cgaactttaa aatctgtgtg gctgtcactc ggctgcatgc ttagtgcact …
Rebecca J. Stones
  • 1,725
  • 2
  • 9
  • 11
51
votes
9 answers

What's the most efficient file format for the storage of DNA sequences?

I'd like to learn which format is most commonly used for storing the full human genome sequence (4 letters without a quality score) and why. I assume that storing it in plain-text format would be very inefficient. I expect a binary format would be…
kenorb
  • 1,293
  • 1
  • 12
  • 15
48
votes
6 answers

Feature annotation: RefSeq vs Ensembl vs Gencode, what's the difference?

What are the actual differences between different annotation databases? My lab, for reasons still unknown to me, prefers Ensembl annotations (we're working with transcript/exon expression estimation), while some software ships with RefSeq…
Plasma
  • 583
  • 1
  • 5
  • 6
41
votes
4 answers

What is the difference between FASTA, FASTQ, and SAM file formats?

I'd like to learn the differences between three common formats: FASTA, FASTQ, and SAM. How are they different? Are there any benefits to using one over another? Based on the Wikipedia pages, I can't tell the differences between them.
kenorb
  • 1,293
  • 1
  • 12
  • 15
35
votes
2 answers

Why do some assemblers require an odd-length kmer for the construction of de Bruijn graphs?

Why do some assemblers, like SOAPdenovo2 or Velvet, require an odd-length k-mer size for the construction of the de Bruijn graph, while other assemblers, like ABySS, are fine with even-length k-mers?
Kamil S Jaron
  • 5,542
  • 2
  • 25
  • 59
34
votes
4 answers

Why does the FASTA sequence for coronavirus look like DNA, not RNA?

I'm looking at a genome sequence for 2019-nCoV on NCBI. The FASTA sequence looks like this: >MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/2020, complete…
jameshfisher
  • 443
  • 4
  • 7
33
votes
3 answers

Uppercase vs lowercase letters in reference genome

I am using a reference genome for mm10 mouse downloaded from NCBI, and would like to understand in greater detail the difference between lowercase and uppercase letters, which make up roughly equal parts of the genome. I understand that N is used…
Scott Gigante
  • 2,133
  • 1
  • 13
  • 32
27
votes
7 answers

Read length distribution from FASTA file

I have a single ~10GB FASTA file generated from an Oxford Nanopore Technologies' MinION run, with >1M reads of mean length ~8Kb. How can I quickly and efficiently calculate the distribution of read lengths? A naive approach would be to read the…
Scott Gigante
  • 2,133
  • 1
  • 13
  • 32
27
votes
4 answers

Why sequence the human genome at 30x coverage?

A bit of a historical question on a number, 30 times coverage, that's become so familiar in the field: why do we sequence the human genome at 30x coverage? My question has two parts: Who came up with the 30x value and why? Does the value need to be…
719016
  • 2,324
  • 13
  • 19
27
votes
8 answers

How to version the code and the data during the analysis?

I am currently looking for a system which will allow me to version both the code and the data in my research. I think my way of analyzing data is not uncommon, and this will be useful for many people doing bioinformatics and aiming for the…
Iakov Davydov
  • 2,695
  • 1
  • 13
  • 34
25
votes
5 answers

What happens if a major bug is discovered in a bioinformatic package that has been used in published literature?

Yesterday I was debugging some things in R trying to get a popular Flow Cytometry tool to work on our data. After a few hours of digging into the package I discovered that our data was hitting an edge case, and it seems like the algorithm wouldn't…
Nic Barker
  • 351
  • 3
  • 6
24
votes
4 answers

Tools for simulating Oxford Nanopore reads

Are there any free open source software tools available for simulating Oxford Nanopore reads?
Daniel Standage
  • 5,080
  • 15
  • 50
23
votes
4 answers

What Ensembl genome version should I use for alignments? (e.g. toplevel.fa vs. primary_assembly.fa)

When you look at all the genome files available from Ensembl, you are presented with a bunch of options. Which one is the best to use/download? You have a combination of choices. First part options: dna_sm - Repeats soft-masked (converts repeat…
story
  • 1,573
  • 1
  • 8
  • 15
23
votes
4 answers

Are there any rolling hash functions that can hash a DNA sequence and its reverse complement to the same value?

A common bioinformatics task is to decompose a DNA sequence into its constituent k-mers and compute a hash value for each k-mer. Rolling hash functions are an appealing solution for this task, since they can be computed very quickly. A rolling hash…
Daniel Standage
  • 5,080
  • 15
  • 50
22
votes
3 answers

What is the actual cause of excessive zeroes in single cell RNA-seq data? Is it PCR?

First, sorry if I am missing something basic: I am a programmer recently turned bioinformatician, so I still don't know a lot of stuff. This is a cross-post of a Biostars question; I hope that's not bad form. While it is obvious that scRNA-seq data…
Martin Modrák
  • 403
  • 2
  • 10