Remove/delete sequences by ID from multifasta

Question

I have a fasta file like this:

>Id1
ATCCTT
>Id2
ATTTTCCC
>Id3
TTTCCCCAAAA
>Id4
CCCTTTAAA

I want to delete sequences that have the following IDs.

Id2
Id3

The IDs are in a .txt file, and the text file will be used to match and delete those sequences.

My output should be a fasta file like this

>Id1
ATCCTT
>Id4
CCCTTTAAA

But I want it with awk and/or sed and/or bash (no python or perl).
How can I do it?

Exactly: use the right tool for the job. Unless you can make some very strong assumptions about your input files, use a proper FASTA parser, not some hacked-together cruft. — Konrad Rudolph, Mar 23 '18 at 09:11
Use whatever tool you want to use if it gets your work done before a deadline. A lot of parsers are slow as mud as compared with Unix tools, but I admit they take a lot of drudgery out of understanding how things work. — Alex Reynolds, Mar 26 '18 at 05:00

score 7 · Answer 1 · answered Mar 21 '18 at 02:08

7

Suppose you keep sequence names in ids.txt and sequences in seq.fa:

awk 'BEGIN{while((getline<"ids.txt")>0)l[">"$1]=1}/^>/{f=!l[$1]}f' seq.fa

answered Mar 21 '18 at 02:08

user172818

6,515
2
13
29

score 7 · Answer 2 · edited Jan 04 '24 at 21:22

Just today I wrote a script to do exactly this using Biopython. I also add a warning If any the headers I wanted to filter was not present in the original fasta. So the python script filter_fasta_by_list_of_headers.py is :

#!/usr/bin/env python3
from Bio import SeqIO
import sys
ffile = SeqIO.parse(sys.argv[1], "fasta")
header_set = set(line.strip() for line in open(sys.argv[2]))
for seq_record in ffile:
    try:
        header_set.remove(seq_record.name)
    except KeyError:
        print(seq_record.format("fasta"))
        continue
if len(header_set) != 0:
    print(len(header_set),'of the headers from list were not identified in the input fasta file.', file=sys.stderr)

and usage :

filter_fasta_by_list_of_headers.py input.fasta list_of_scf_to_filter > filtered.fasta

P.S. it's quite easy to turn over the script to extract the sequences from the list (just the print line would have to move after line header_set.remove(seq_record.name)

I made a modification to use in and benchmarked it on normal insect genome assembly - 1Gbp, 300,000 scaffolds and I gave it lists of 5,000 and 100,000 scaffolds to filter. Regardless of the size of the list or algorithm used it took 29-30s. In other words, if LBYL is faster than EAFP, it's on a scale that totally does not matter.

Ah +1 for showing my intuitions to be wrong, one other thought, it should be faster to write all the records at once rather than one by one, is something like this faster? https://pastebin.com/n7RUgW5x Sorry if it seems like micro-optimization, I'm just curious — Chris_Rands, Mar 26 '18 at 16:08

Alex Reynolds · Answer 3 · 2018-03-20T21:49:46.863

6

If you want to learn how to do things with command-line tools, you can linearize the FASTA with awk, pipe to grep to filter for items of interest named in patterns.txt, then pipe to tr to delinearize:

$ awk '{ if ((NR>1)&&($0~/^>/)) { printf("\n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("\t%s", $0); } }' in.fa | grep -Ff patterns.txt - | tr "\t" "\n" > out.fa

This will work on multiline FASTA.

edited Mar 20 '18 at 21:49

answered Mar 20 '18 at 20:03

Alex Reynolds

3,135
11
27

Excellent solution, except it will keep the sequence specified in patterns.txt, not remove them – Federico Giorgi Apr 16 '20 at 04:52
Good point. I must have missed that part of the question. One can use grep -v to remove elements that match specified patterns. – Alex Reynolds Apr 16 '20 at 17:42

terdon · Answer 4 · 2018-03-23T08:58:29.123

3

Using the FastaToTbl and TblToFasta scripts I have posted before, you can do:

FastaToTbl file.fa | grep -vwf ids.txt | TblToFasta > file.2.fa

edited Mar 23 '18 at 08:58

answered Mar 20 '18 at 19:02

terdon

10,071
5
22
48

score 2 · Answer 5 · 2018-03-20T19:17:13.720

2

Presume grep is OK if sed, awk are. The -A(n) and -B(n) flags give (n) lines post match. Presuming all your fasta sequence will be one line is a bit dangerous, but it works for your example. First get those IDs to be removed (in rmid.txt), then inverse grep against the initial fasta.

grep -A1 -f rmid.txt fasta.fa > rmfile.fa
grep -v -f rmfile.fa fasta.fa

The real answer is to use a script that defines a delimiter other than newline "\n", then parse for IDs, so better to use one of the languages you don't want to use...

edited Mar 20 '18 at 19:17

answered Mar 20 '18 at 18:41

This assumes only one line of sequence per ID which is not likely true for fasta files (but is often true of fastq). – terdon Mar 20 '18 at 18:59
The problem with greping Id1 is substrings, this will also capture Id10 etc. you could add --word-regexp flag, but this can often really hurt performance; awk probably a better tool – Chris_Rands Mar 20 '18 at 19:02
Yes, both good points, example is still completed as per request. @terdon I specified line number as an issue, @Chris_Rands unlikely that fasta IDs would be formed to allow that kind of issue but point appreciated. FWIW, for manipulating fasta I would not use grep, awk, sed, there are plenty of good packages based around scripting languages, I would favour Bioperl. I am a lowly newcomer so couldn't comment this to OP. – Mar 20 '18 at 19:15
Ah, so you did. Sorry, I missed the disclaimer and only saw the command. However, note that having having IDs be substrings of other IDs is absolutely common. For instance, the standard human genome fasta sequence has both chr1 and chr11. That said, your answer has a much more serious issue: each line of rmfile.fa will be passed to grep as a pattern to search for, so any line containing that will be removed. This means that if sequence id3 is TAT, you will remove all lines matching TAT irrespective of their ID. – terdon Mar 20 '18 at 19:21

score 1 · Answer 6 · answered Sep 24 '20 at 15:27

This is what samtools faidx is intended for.

It needs to be called twice.

First, to create a FASTA index (*.fai file):
```
samtools faidx input.fasta
```
Secondly, to filter the FASTA file with the help of the index:
```
samtools faidx -o output.fasta input.fasta ids…
```

ids… are the IDs to retain, as individual arguments separated by space. If you’ve got a file blocklist.txt with IDs you want to discard (one per line), you first need to invert this, after having created the index (using Bash syntax):¹

remove_ids=($(awk '{print $1}' input.fasta.fai | grep -v -f blocklist.txt))

… and then we can invoke the command as

samtools faidx -o output.fasta input.fasta "${remove_ids[@]}"

What’s nice about samtools faidx is that it also works with BGZF-compressed FASTA data — i.e. fa.gz files (but only if they were compressed using bgzip, which ships with Samtools, and the index must be generated on the compressed file). However, it will always write FASTA. If you want gzipped FASTA output, you need to manually stream the result into bgzip:


samtools faidx input.fasta.gz "${remove_ids[@]}" | bgzip -c >output.fasta.gz

¹ _{Careful, this syntax is generally unsafe for reading a command output into a variable; we can use it here only because we assume that we want to generate a space-separated list of identifiers; the generally safe way is a lot more convoluted.}

score 1 · Answer 7 · answered Dec 08 '21 at 06:46

1

Shortest way...

awk '!/Id2|Id3/' RS=">" ORS=">" fasta.fa

!= Not match the pattern Note: Sequence doesn't need to be in a single line!

answered Dec 08 '21 at 06:46

Aswin S Soman

11
2

Rajan Raju · Answer 8 · 2021-09-18T08:14:24.903

Firstly, you have to install biopython using the cmd pip install biopython.

Suppose you have two files id.txt and sequence.fasta. The solution file name is seq_rm.py which contains the following codes.

import argparse
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
if name == 'main':
    parser = argparse.ArgumentParser()
    parser.add_argument('--id_file', required=True,
                        help='file path of removable ids')
    parser.add_argument('--input', required=True,
                        help='path of fasta sequence')
    parser.add_argument('--output', required=True,
                        help='path of selected fasta sequence')
args = parser.parse_args()

with open(args.id_file, mode='r', encoding='utf-8') as file:
    ids = set(file.read().splitlines())

records = [SeqRecord(record.seq, id=record.id, description='', name='') for record in SeqIO.parse(args.input, 'fasta')
           if record.id not in ids]
SeqIO.write(records, args.output, &quot;fasta&quot;)

Run cmd:

python3 seq_rm.py --id_file id.txt --input sequence.fasta --output selected.fasta

It will generate an output file named selected.fasta.

score 0 · Answer 9 · answered Sep 14 '21 at 12:33

0

One way using awk without getline:

awk 'FNR==NR { a[$0]; next } /^>/ { f = !(substr($0, 2) in a) }f' ids.txt seqs.fa

answered Sep 14 '21 at 12:33

Steve

3,099
1
4
12

score 0 · Answer 10 · answered Jan 05 '24 at 01:22

I know you do not want a Python solution but nevertheless I would recommend it like others. Here is one way to do it using Biopython:

from Bio import SeqIO

ids_file = 'ids.txt'
input_f  = 'input_file.fasta'
output_f = 'result.fasta'

my_dictionary = {}
with open(ids_file, 'r') as f:
    for line in f:
        my_dictionary[line[:-1]]='value'

with open(input_f, 'r') as input_handle, open(output_f, 'w') as output_handle:
    for seq_record in SeqIO.parse(input_handle, "fasta"):
        if not seq_record.id in my_dictionary:
            SeqIO.write(seq_record, output_handle, "fasta")

Remove/delete sequences by ID from multifasta

10 Answers10

Linked