Highest Voted Questions - Bioinformatics Stack Exchange

191

votes

4 answers

Why does the SARS-Cov2 coronavirus genome end in aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (33 a's)?

The SARS-Cov2 coronavirus's genome was released, and is now available on Genbank. Looking at it... 1 attaaaggtt tataccttcc caggtaacaa accaaccaac tttcgatctc ttgtagatct 61 gttctctaaa cgaactttaa aatctgtgtg gctgtcactc ggctgcatgc ttagtgcact …

asked Jan 25 '20 at 00:55

Rebecca J. Stones

1,725
2
9
11

51

votes

9 answers

What's the most efficient file format for the storage of DNA sequences?

I'd like to learn which format is most commonly used for storing the full human genome sequence (4 letters without a quality score) and why. I assume that storing it in plain-text format would be very inefficient. I expect a binary format would be…

asked May 16 '17 at 18:01

kenorb

1,293
1
12
15
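
As a rough illustration of the binary storage this question has in mind, the sketch below packs four bases into each byte using two bits per base. The `pack` and `unpack` names, the bit layout, and the restriction to A/C/G/T (no N or other ambiguity codes) are all assumptions made for this example; it is not any established file format.

```python
# Hypothetical 2-bit codec: names and bit layout are illustrative only.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        # Base i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


if __name__ == "__main__":
    packed = pack("ACGTAC")
    print(len(packed), unpack(packed, 6))
```

The sequence length has to be stored separately, because the last byte may be only partly filled.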

48

votes

6 answers

Feature annotation: RefSeq vs Ensembl vs Gencode, what's the difference?

What are the actual differences between different annotation databases? My lab, for reasons still unknown to me, prefers Ensembl annotations (we're working with transcript/exon expression estimation), while some software ships with RefSeq…

asked May 16 '17 at 19:24

Plasma

583
1
5
6

41

votes

4 answers

What is the difference between FASTA, FASTQ, and SAM file formats?

I'd like to learn the differences between three common formats: FASTA, FASTQ and SAM. How are they different? Are there any benefits to using one over another? Based on the Wikipedia pages, I can't tell the differences between them.

asked May 16 '17 at 18:37

kenorb

1,293
1
12
15

35

votes

2 answers

Why do some assemblers require an odd-length kmer for the construction of de Bruijn graphs?

Why do some assemblers like SOAPdenovo2 or Velvet require an odd-length k-mer size for the construction of de Bruijn graph, while some other assemblers like ABySS are fine with even-length k-mers?

asked May 19 '17 at 18:34

Kamil S Jaron

5,542
2
25
59
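
One property that bears on this question: a k-mer of odd length can never equal its own reverse complement, because the middle base would have to be its own complement, whereas even-length k-mers can. The sketch below checks this by brute force for small k. It only illustrates that property, under the assumption that it is relevant here; it does not claim to be the reason any particular assembler gives. The function names are made up for this example.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of an A/C/G/T string."""
    return kmer.translate(COMPLEMENT)[::-1]


def count_self_reverse_complement(k: int) -> int:
    """Count k-mers that are identical to their own reverse complement."""
    count = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            count += 1
    return count


if __name__ == "__main__":
    for k in (2, 3, 4, 5):
        print(k, count_self_reverse_complement(k))
```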

34

votes

4 answers

Why does the FASTA sequence for coronavirus look like DNA, not RNA?

I'm looking at a genome sequence for 2019-nCoV on NCBI. The FASTA sequence looks like this: >MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/2020, complete…

asked Feb 09 '20 at 17:13

jameshfisher

443
4
7

33

votes

3 answers

Uppercase vs lowercase letters in reference genome

I am using a reference genome for mm10 mouse downloaded from NCBI, and would like to understand in greater detail the difference between lowercase and uppercase letters, which make up roughly equal parts of the genome. I understand that N is used…

asked May 24 '17 at 03:26

Scott Gigante

2,133
1
13
32

27

votes

7 answers

Read length distribution from FASTA file

I have a single ~10GB FASTA file generated from an Oxford Nanopore Technologies' MinION run, with >1M reads of mean length ~8Kb. How can I quickly and efficiently calculate the distribution of read lengths? A naive approach would be to read the…

asked May 17 '17 at 04:38

Scott Gigante

2,133
1
13
32
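
For the read-length question above, one straightforward approach is to stream through the file once and tally sequence lengths without holding any reads in memory. This is a minimal sketch rather than the fastest option; `read_lengths` and `length_histogram` are hypothetical names, and it assumes a plain, uncompressed FASTA file.

```python
import io
from collections import Counter


def read_lengths(handle):
    """Yield the sequence length of each FASTA record in a text handle."""
    length = None  # None until the first header line has been seen
    for line in handle:
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            # Sequences may wrap over several lines, so keep adding.
            length += len(line.strip())
    if length is not None:
        yield length


def length_histogram(handle):
    """Map each read length to the number of reads with that length."""
    return Counter(read_lengths(handle))


if __name__ == "__main__":
    demo = io.StringIO(">r1\nACGT\nAC\n>r2\nGGG\n>r3\nACGTAC\n")
    print(sorted(length_histogram(demo).items()))
```

For a real file, pass an open file object (for example `open(path)`) instead of the `io.StringIO` demo.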

27

votes

4 answers

Why sequence the human genome at 30x coverage?

A bit of a historical question on a number, 30 times coverage, that's become so familiar in the field: why do we sequence the human genome at 30x coverage? My question has two parts: Who came up with the 30x value and why? Does the value need to be…

asked Aug 04 '17 at 15:10

719016

2,324
13
19

27

votes

8 answers

How to version the code and the data during the analysis?

I am currently looking for a system which will allow me to version both the code and the data in my research. I think my way of analyzing data is not uncommon, and this will be useful for many people doing bioinformatics and aiming for the…

asked May 18 '17 at 09:27

Iakov Davydov

2,695
1
13
34

25

votes

5 answers

What happens if a major bug is discovered in a bioinformatic package that has been used in published literature?

Yesterday I was debugging some things in R trying to get a popular Flow Cytometry tool to work on our data. After a few hours of digging into the package I discovered that our data was hitting an edge case, and it seems like the algorithm wouldn't…

asked Nov 14 '17 at 00:26

Nic Barker

351
3
6

24

votes

4 answers

Tools for simulating Oxford Nanopore reads

Are there any free open source software tools available for simulating Oxford Nanopore reads?

asked May 22 '17 at 18:19

Daniel Standage

5,080
15
50

23

votes

4 answers

What Ensembl genome version should I use for alignments? (e.g. toplevel.fa vs. primary_assembly.fa)

When you look at all the genome files available from Ensembl, you are presented with a bunch of options. Which one is the best to use/download? You have a combination of choices. First part options: dna_sm - Repeats soft-masked (converts repeat…

asked Jun 07 '17 at 13:23

story

1,573
1
8
15

23

votes

4 answers

Are there any rolling hash functions that can hash a DNA sequence and its reverse complement to the same value?

A common bioinformatics task is to decompose a DNA sequence into its constituent k-mers and compute a hash value for each k-mer. Rolling hash functions are an appealing solution for this task, since they can be computed very quickly. A rolling hash…

asked May 16 '17 at 19:15

Daniel Standage

5,080
15
50
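
One simple way to get strand-independent values, in the spirit of the question above, is to keep two rolling 2-bit encodings (the k-mer and its reverse complement) and take the smaller one as the "canonical" code. The sketch below does this, assuming an A/C/G/T-only input; the names are hypothetical. Note that the result is an exact integer encoding rather than a mixing hash, so a conventional hash function could be applied to it afterwards if needed.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # complement of code c is 3 - c
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmer_codes(seq: str, k: int) -> list:
    """Return one strand-independent integer per k-mer, updated in O(1) per base."""
    mask = (1 << (2 * k)) - 1
    shift = 2 * (k - 1)
    fwd = rev = 0
    codes = []
    for i, base in enumerate(seq.upper()):
        c = CODES[base]
        # Forward strand: drop the oldest base, append the new one.
        fwd = ((fwd << 2) | c) & mask
        # Reverse strand: the complement of the new base becomes the first base.
        rev = (rev >> 2) | ((3 - c) << shift)
        if i >= k - 1:
            codes.append(min(fwd, rev))
    return codes


if __name__ == "__main__":
    seq = "ACGTTGCATG"
    forward = canonical_kmer_codes(seq, 3)
    backward = canonical_kmer_codes(reverse_complement(seq), 3)
    print(forward == backward[::-1])
```

Each k-mer and its reverse complement receive the same code, so scanning the reverse-complemented sequence produces the same values in reverse order.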

22

votes

3 answers

What is the actual cause of excessive zeroes in single cell RNA-seq data? Is it PCR?

First, sorry if I am missing something basic - I am a programmer recently turned bioinformatician, so I still don't know a lot of stuff. This is a cross-post of a Biostars question; I hope that's not bad form. While it is obvious that scRNA-seq data…

asked Oct 20 '17 at 08:50

Martin Modrák

403
2
10
