Most Popular

1500 questions
21
votes
10 answers

What is the fastest way to calculate the number of unknown nucleotides in FASTA / FASTQ files?

I used to work with publicly available genomic references, where basic statistics are usually available and, if they are not, you have to compute them only once, so there is no reason to worry about performance. Recently I started a sequencing project…
Kamil S Jaron
  • 5,542
  • 2
  • 25
  • 59
21
votes
8 answers

What is the fastest way to get the reverse complement of a DNA sequence in python?

I am writing a Python script that requires a reverse complement function to be called on DNA strings of length 1 through around length 30. Line profiling programs indicate that my functions spend a lot of time getting the reverse complements, so I…
conchoecia
  • 3,141
  • 2
  • 16
  • 40
21
votes
8 answers

Since every human has a different DNA (different combinations of C, G, A, T) what does it mean to have the genome done?

I'm confused about the difference between genome and DNA. Is it correct to say that the same type of bacteria has the same DNA? But my understanding is that it is not correct to say that the same type of human has the same DNA, since every human has…
CCCCoder2
  • 319
  • 2
  • 3
20
votes
12 answers

Random access on a FASTQ file

I would like to select a random record from a large set of n unaligned sequencing reads in log(n) time complexity (big O notation) or less. A record is defined as the equivalent of four lines in FASTQ format. The records do not fit in RAM and would…
winni2k
  • 2,266
  • 11
  • 28
20
votes
2 answers

Single-sample vs. joint genotyping

I am trying to understand the benefits of joint genotyping and would be grateful if someone could provide an argument (ideally mathematically) that would clearly demonstrate the benefit of joint vs. single-sample genotyping. This is what I've…
llevar
  • 303
  • 3
  • 5
20
votes
4 answers

How can I downsample a BAM file while keeping both reads in pairs?

I know how to downsample a BAM file to lower coverage. I know I can randomly select lines in SAM, but this procedure can't guarantee that the two reads in a pair are always sampled at the same time. Is there a way to downsample BAM while keeping pairing…
medbe
  • 847
  • 1
  • 7
  • 9
20
votes
3 answers

How exactly is "effective length" used in FPKM calculated?

According to this famous blog post, the effective transcript length is: $\tilde{l}_i = l_i - \mu$ where $l_i$ is the length of transcript and $\mu$ is the average fragment length. However, typically fragment length is about 300bp. What if when the…
user172818
  • 6,515
  • 2
  • 13
  • 29
20
votes
1 answer

The state, limitations and comparisons of large variant stores

Background: We're increasingly needing some way of storing lots of variant data associated with lots of subjects: think clinical trials and hospital patients, looking for disease-causing or relevant genes. A thousand subjects is where we'd start,…
agapow
  • 788
  • 3
  • 11
19
votes
2 answers

Confirm success or failure of RNA-Seq normalization

I am working with a set of (bulk) RNA-Seq data collected across multiple runs, run at different times of the year. I have normalized my data using library size / quantile / RUV normalization, and would like to check (quantitatively and/or…
Scott Gigante
  • 2,133
  • 1
  • 13
  • 32
19
votes
2 answers

Obtaining uniquely mapped reads from BWA mem alignment

This is based on a question from betsy.s.collins on BioStars. The original post can be found here. Does anyone have any suggestions for other tags or filtering steps on BWA-generated BAM files that can be used so reads only map to one location? One…
gringer
  • 14,012
  • 5
  • 23
  • 79
19
votes
2 answers

Accuracy of the original human DNA datasets sequenced by Human Genome Project?

The Human Genome Project was the project of 'determining the sequence of nucleotide base pairs that make up human DNA, and of identifying and mapping all of the genes of the human genome'. It was declared complete in 2003, i.e. 99% of the…
kenorb
  • 1,293
  • 1
  • 12
  • 15
19
votes
6 answers

Are there any RepBase alternatives for genome-wide repeat element annotations?

I’m using the RepBase libraries in conjunction with RepeatMasker to get genome-wide repeat element annotations, in particular for transposable elements. This works well enough, and seems to be the de facto standard in the field. However, there are…
Konrad Rudolph
  • 4,845
  • 14
  • 45
19
votes
2 answers

How can I extract normalized read count values from DESeq2 results?

The results obtained by running the results command from DESeq2 contain a "baseMean" column, which I assume is the mean across samples of the normalized counts for a given gene. How can I access the normalized counts proper? I tried the following…
bli
  • 3,130
  • 2
  • 15
  • 36
19
votes
3 answers

How to deal with heterozygosity during polishing of genome assembly based on long reads?

All the long-read sequencing platforms are based on single-molecule sequencing, which causes higher per-base error rates. For this reason, a polishing step was added to genome assembly pipelines - mapping raw reads back to the assembly and correcting…
Kamil S Jaron
  • 5,542
  • 2
  • 25
  • 59
19
votes
2 answers

Understanding DESeq2 design, contrast and results

I have a set of high-throughput experiments with 2 genotypes ("WT" and "prg1") and 3 treatments ("RT", "HS30" and "HS30RT120"), and there are 2 replicates for each of the genotype x treatment combinations. The read counts for the genes are summarized…
bli
  • 3,130
  • 2
  • 15
  • 36