Introduce errors in reference transcripts according to external dataset error model

Question

I would like to modify some reference transcripts from Ensembl (D. melanogaster) to introduce a controlled rate of random errors in the sequences. The idea would be to introduce random base substitutions in these sequences, no indels for now, because I would like to keep the transcript sequence length as it is in the reference.

The error rate per transcript will be determined from an error profile computed on an external set of RNA-seq reads (e.g., generated with an ONT MinION).

The aim is to establish a rough benchmark of how well aligners perform on spliced transcripts (RNA-to-genome alignment), i.e. transcripts with more than one exon.

Any idea of which software would be best for this purpose?

gringer · Accepted Answer · 2017-08-22T19:52:49.440

Do any of the answers for this question help? Karel Brinda has mentioned a few read simulators in the answer to that question, and has a thesis with more information.

Excluding INDEL errors doesn't sound like a good idea. Length can still be preserved when INDELs are included; it just needs an adjustment at the end of the sequence. Note that if you're trying to model nanopore reads, what you're really modelling is the base-caller, rather than the sequencer. I mention this in more detail in my answer.
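As a rough illustration of the "adjustment at the end" idea, here is a minimal Python sketch. The function name `mutate_keep_length`, the three rate parameters, and the choice to pad with random bases (rather than, say, extra reference sequence) are all assumptions made for this example, not something taken from an existing tool.

```python
import random

BASES = "ACGT"

def mutate_keep_length(seq, sub_rate, ins_rate, del_rate, rng):
    """Introduce substitutions, insertions and deletions, then restore
    the original length by trimming or padding the 3' end.

    Hypothetical helper: the rates are per-base probabilities and are
    assumed to sum to at most 1.
    """
    out = []
    for base in seq:
        r = rng.random()
        if r < del_rate:
            continue  # deletion: skip this reference base
        if r < del_rate + ins_rate:
            out.append(rng.choice(BASES))  # insertion before the base
            out.append(base)
        elif r < del_rate + ins_rate + sub_rate:
            # substitution: pick any base different from the original
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    mutated = "".join(out)
    # Adjustment at the end of the sequence: trim the excess, or pad
    # with random bases (an assumption) if deletions shortened it.
    if len(mutated) > len(seq):
        mutated = mutated[:len(seq)]
    else:
        missing = len(seq) - len(mutated)
        mutated += "".join(rng.choice(BASES) for _ in range(missing))
    return mutated

rng = random.Random(42)  # fixed seed so the run is reproducible
original = "ACGT" * 50
print(len(mutate_keep_length(original, 0.05, 0.02, 0.02, rng)) == len(original))
```

Note that the trimmed or padded tail no longer corresponds to the reference, so the last few bases behave like a soft-clipped region in any downstream alignment.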

In most cases where errors are modelled, I find it better to use publicly-available data instead. Especially for nanopore data, there are unmodelled systematic errors in the base-callers and sequencer that can't be simulated using any programs (because they are unmodelled). The following paper would be a good place to start for cDNA sequences, which looks at single-cell data from mouse (C57Bl/6) B1a cells:

http://www.biorxiv.org/content/early/2017/04/13/126847

Illumina and ONT reads for that study can be found in SRA under accession number SRP082530.

I don't know of any recent D. melanogaster studies that have been done using nanopore. There's always the option of spending $1000 on a purchase of a MinION with an RNA starter kit to do the study yourself. Here's an older targeted gene study, but bear in mind that it was using an R7.3 flow cell, so error rates will be much higher than what is currently available:

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0777-z

user172818 · Answer 2 · 2017-08-25T12:23:33.937

This preprint uses pbsim to simulate ONT RNA-seq reads for fruit fly. It is probably worth reading if you want to do the same thing.

You should include INDEL errors. Those are what make RNA-seq alignment challenging. For benchmarking purposes, adding INDELs does not increase the complexity at all: you can parse the splice junctions on the reference from the CIGAR string and compare them to the annotation. You don't need to worry about the base-level alignment.
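To make the junction comparison concrete, here is a small Python sketch that extracts introns from the `N` operations of a CIGAR string. The helper names `junctions_from_cigar` and `junction_precision` are made up for this example, and the annotated introns are assumed to have been derived beforehand from the exon coordinates of the annotation (intron start = exon end + 1, intron end = next exon start - 1).

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def junctions_from_cigar(pos, cigar):
    """Return introns as (start, end) pairs, 1-based and inclusive,
    for an alignment starting at reference position `pos`."""
    ref = pos
    junctions = []
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":  # skipped region on the reference = intron
            junctions.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume the reference
            ref += length
    return junctions

def junction_precision(found, annotated):
    """Fraction of reported junctions that are present in the annotation."""
    found = set(found)
    if not found:
        return 0.0
    return len(found & set(annotated)) / len(found)

introns = junctions_from_cigar(100, "50M200N30M")
print(introns)  # -> [(150, 349)]
print(junction_precision(introns, {(150, 349), (500, 650)}))  # -> 1.0
```

Because insertions and deletions only shift positions inside exons, INDEL errors in the reads do not change how the junctions are scored, which is the point made above.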

In addition, there are public real ONT data (SRA accession SRP082530) for SIRV spike-in controls and mouse B cells. You don't actually need simulation.

PS: just noticed that you are an author of the first preprint I cited. I would use real data for evaluation.

score 1 · Answer 3 · answered Aug 22 '17 at 17:39

It sounds like what you're really looking for is a read simulator. A cursory search turns up NanoSim, which is designed to simulate reads from a MinION. This has the benefit of at least having been used in some of the published literature, which is always a nice sign.

You may also find this review article on read simulators useful. It doesn't specifically mention NanoSim, but it should prove to be a useful review of the general concepts anyway if you need to read up on them.

Devon Ryan

Hmm, not really what I am looking for, because the simulation tool uses the model built in the previous step to produce in silico reads for a given reference genome. What I need is, first, to compute the error rate from the experimental reads (this can be done by parsing an alignment file in SAM format) and, second, to substitute as many nucleotides in the reference as needed to reach the average error rate of real ONT reads. With NanoSim I have the impression that they generate entirely de novo 'reads' directly from the genome. – aechchiki Aug 22 '17 at 17:47
@AminaEchchiki If you don't want to use one of their pre-trained models, you can have it train on real data, and it will then output reads with the appropriate error profile. Of course, since it uses LAST, whichever aligner is most similar to LAST will perform best in your benchmarks. – Devon Ryan Aug 22 '17 at 17:55
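The two-step approach described in the first comment could look roughly like the Python sketch below. It is only an illustration: `error_rate_from_sam_line` and `substitute` are hypothetical helper names, and the error rate is approximated as the `NM` tag divided by the number of alignment columns, which assumes the aligner writes `NM` and also counts INDELs, so it overestimates the pure substitution rate.

```python
import random
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def error_rate_from_sam_line(line):
    """Approximate error rate of one SAM record as NM / alignment columns.

    Returns None for unmapped records or records without an NM tag.
    """
    fields = line.rstrip("\n").split("\t")
    cigar = fields[5]
    nm = None
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            nm = int(tag[5:])
    if nm is None or cigar == "*":
        return None
    # Alignment columns: matches/mismatches, insertions and deletions.
    columns = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MID=X")
    return nm / columns if columns else None

def substitute(seq, rate, rng):
    """Substitute each upper-case A/C/G/T with probability `rate`.

    The sequence length never changes; other characters are left as-is.
    """
    bases = "ACGT"
    out = []
    for base in seq:
        if base in bases and rng.random() < rate:
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)

sam = "read1\t0\t2L\t100\t60\t8M2I10M\t*\t0\t0\tACGT\t*\tNM:i:4"
rate = error_rate_from_sam_line(sam)  # 4 edits over 20 columns
print(rate)  # -> 0.2
print(substitute("ACGTACGTACGT", rate, random.Random(1)))
```

In practice the per-record rates would be averaged over all primary alignments before being applied to the reference transcripts, and parsing the `MD` tag would be needed to separate substitutions from INDELs.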

score 1 · Answer 4 · answered Aug 24 '17 at 13:36

The fastq-sim executable in the DNemulator package can modify a set of input sequences in FASTA format according to an external set of quality scores taken from a FASTQ file.
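For readers who want to see the general idea of quality-driven errors, here is a minimal Python sketch. It is not fastq-sim's implementation or interface; the function names are invented for this example, and it assumes Phred+33 encoded qualities, where a score Q corresponds to an error probability of 10^(-Q/10).

```python
import random

def phred_to_prob(qchar, offset=33):
    """Convert one FASTQ quality character to an error probability
    (assumes Phred+33 encoding by default)."""
    return 10 ** (-(ord(qchar) - offset) / 10)

def mutate_by_quality(seq, qual, rng):
    """Substitute each base with the error probability implied by the
    quality character at the same position.

    Hypothetical helper: `qual` must be at least as long as `seq`.
    """
    if len(qual) < len(seq):
        raise ValueError("quality string is shorter than the sequence")
    bases = "ACGT"
    out = []
    for base, q in zip(seq, qual):
        if rng.random() < phred_to_prob(q):
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(7)
print(phred_to_prob("I"))  # Q40, i.e. about 1 error in 10,000 bases
print(mutate_by_quality("ACGTACGT", "!!!!IIII", rng))  # first four bases at Q0
```

This sketch only introduces substitutions, so the sequence length is preserved, which matches the constraint in the question.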

aechchiki
