9

I have a fasta file like this:

>Id1
ATCCTT
>Id2
ATTTTCCC
>Id3
TTTCCCCAAAA
>Id4
CCCTTTAAA

I want to delete sequences that have the following IDs.

Id2
Id3

The IDs are in a .txt file, and the text file will be used to match and delete those sequences.

My output should be a fasta file like this

>Id1
ATCCTT
>Id4
CCCTTTAAA

But I want it with awk and/or sed and/or bash (no python or perl).
How can I do it?

gringer
  • 14,012
  • 5
  • 23
  • 79
andresito
  • 385
  • 1
  • 3
  • 9
  • 3
    why the language restrictions? – Chris_Rands Mar 20 '18 at 19:03
  • 1
    Exactly: use the right tool for the job. Unless you can make some very strong assumptions about your input files, use a proper FASTA parser, not some hacked-together cruft. – Konrad Rudolph Mar 23 '18 at 09:11
  • Use whatever tool you want to use if it gets your work done before a deadline. A lot of parsers are slow as mud as compared with Unix tools, but I admit they take a lot of drudgery out of understanding how things work. – Alex Reynolds Mar 26 '18 at 05:00

10 Answers10

7

Suppose you keep sequence names in ids.txt and sequences in seq.fa:

awk 'BEGIN{while((getline<"ids.txt")>0)l[">"$1]=1}/^>/{f=!l[$1]}f' seq.fa
user172818
  • 6,515
  • 2
  • 13
  • 29
7

Just today I wrote a script to do exactly this using Biopython. I also add a warning If any the headers I wanted to filter was not present in the original fasta. So the python script filter_fasta_by_list_of_headers.py is :

#!/usr/bin/env python3

from Bio import SeqIO import sys

ffile = SeqIO.parse(sys.argv[1], "fasta") header_set = set(line.strip() for line in open(sys.argv[2]))

for seq_record in ffile: try: header_set.remove(seq_record.name) except KeyError: print(seq_record.format("fasta")) continue if len(header_set) != 0: print(len(header_set),'of the headers from list were not identified in the input fasta file.', file=sys.stderr)

and usage :

filter_fasta_by_list_of_headers.py input.fasta list_of_scf_to_filter > filtered.fasta

P.S. it's quite easy to turn over the script to extract the sequences from the list (just the print line would have to move after line header_set.remove(seq_record.name)

I made a modification to use in and benchmarked it on normal insect genome assembly - 1Gbp, 300,000 scaffolds and I gave it lists of 5,000 and 100,000 scaffolds to filter. Regardless of the size of the list or algorithm used it took 29-30s. In other words, if LBYL is faster than EAFP, it's on a scale that totally does not matter.

gringer
  • 14,012
  • 5
  • 23
  • 79
Kamil S Jaron
  • 5,542
  • 2
  • 25
  • 59
  • @KamilSJaron Nice! – Konrad Rudolph Mar 23 '18 at 13:30
  • Ah +1 for showing my intuitions to be wrong, one other thought, it should be faster to write all the records at once rather than one by one, is something like this faster? https://pastebin.com/n7RUgW5x Sorry if it seems like micro-optimization, I'm just curious – Chris_Rands Mar 26 '18 at 16:08
6

If you want to learn how to do things with command-line tools, you can linearize the FASTA with awk, pipe to grep to filter for items of interest named in patterns.txt, then pipe to tr to delinearize:

$ awk '{ if ((NR>1)&&($0~/^>/)) { printf("\n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("\t%s", $0); } }' in.fa | grep -Ff patterns.txt - | tr "\t" "\n" > out.fa

This will work on multiline FASTA.

Alex Reynolds
  • 3,135
  • 11
  • 27
3

Using the FastaToTbl and TblToFasta scripts I have posted before, you can do:

FastaToTbl file.fa | grep -vwf ids.txt | TblToFasta > file.2.fa
terdon
  • 10,071
  • 5
  • 22
  • 48
2

Presume grep is OK if sed, awk are. The -A(n) and -B(n) flags give (n) lines post match. Presuming all your fasta sequence will be one line is a bit dangerous, but it works for your example. First get those IDs to be removed (in rmid.txt), then inverse grep against the initial fasta.

grep -A1 -f rmid.txt fasta.fa > rmfile.fa
grep -v -f rmfile.fa fasta.fa

The real answer is to use a script that defines a delimiter other than newline "\n", then parse for IDs, so better to use one of the languages you don't want to use...

  • This assumes only one line of sequence per ID which is not likely true for fasta files (but is often true of fastq). – terdon Mar 20 '18 at 18:59
  • The problem with greping Id1 is substrings, this will also capture Id10 etc. you could add --word-regexp flag, but this can often really hurt performance; awk probably a better tool – Chris_Rands Mar 20 '18 at 19:02
  • Yes, both good points, example is still completed as per request. @terdon I specified line number as an issue, @Chris_Rands unlikely that fasta IDs would be formed to allow that kind of issue but point appreciated. FWIW, for manipulating fasta I would not use grep, awk, sed, there are plenty of good packages based around scripting languages, I would favour Bioperl. I am a lowly newcomer so couldn't comment this to OP. –  Mar 20 '18 at 19:15
  • Ah, so you did. Sorry, I missed the disclaimer and only saw the command. However, note that having having IDs be substrings of other IDs is absolutely common. For instance, the standard human genome fasta sequence has both chr1 and chr11. That said, your answer has a much more serious issue: each line of rmfile.fa will be passed to grep as a pattern to search for, so any line containing that will be removed. This means that if sequence id3 is TAT, you will remove all lines matching TAT irrespective of their ID. – terdon Mar 20 '18 at 19:21
1

This is what samtools faidx is intended for.

It needs to be called twice.

  1. First, to create a FASTA index (*.fai file):

    samtools faidx input.fasta
    
  2. Secondly, to filter the FASTA file with the help of the index:

    samtools faidx -o output.fasta input.fasta ids…
    

ids… are the IDs to retain, as individual arguments separated by space. If you’ve got a file blocklist.txt with IDs you want to discard (one per line), you first need to invert this, after having created the index (using Bash syntax):1

remove_ids=($(awk '{print $1}' input.fasta.fai | grep -v -f blocklist.txt))

… and then we can invoke the command as

samtools faidx -o output.fasta input.fasta "${remove_ids[@]}"

What’s nice about samtools faidx is that it also works with BGZF-compressed FASTA data — i.e. fa.gz files (but only if they were compressed using bgzip, which ships with Samtools, and the index must be generated on the compressed file). However, it will always write FASTA. If you want gzipped FASTA output, you need to manually stream the result into bgzip:


samtools faidx input.fasta.gz "${remove_ids[@]}" | bgzip -c >output.fasta.gz

1 Careful, this syntax is generally unsafe for reading a command output into a variable; we can use it here only because we assume that we want to generate a space-separated list of identifiers; the generally safe way is a lot more convoluted.

Konrad Rudolph
  • 4,845
  • 14
  • 45
1

Shortest way...

awk '!/Id2|Id3/' RS=">" ORS=">" fasta.fa

!= Not match the pattern Note: Sequence doesn't need to be in a single line!

0

Firstly, you have to install biopython using the cmd pip install biopython.

Suppose you have two files id.txt and sequence.fasta. The solution file name is seq_rm.py which contains the following codes.

import argparse
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

if name == 'main': parser = argparse.ArgumentParser() parser.add_argument('--id_file', required=True, help='file path of removable ids') parser.add_argument('--input', required=True, help='path of fasta sequence') parser.add_argument('--output', required=True, help='path of selected fasta sequence')

args = parser.parse_args()

with open(args.id_file, mode='r', encoding='utf-8') as file:
    ids = set(file.read().splitlines())

records = [SeqRecord(record.seq, id=record.id, description='', name='') for record in SeqIO.parse(args.input, 'fasta')
           if record.id not in ids]
SeqIO.write(records, args.output, &quot;fasta&quot;)

Run cmd:

python3 seq_rm.py --id_file id.txt --input sequence.fasta --output selected.fasta

It will generate an output file named selected.fasta.

Rajan Raju
  • 11
  • 1
0

One way using without getline:

awk 'FNR==NR { a[$0]; next } /^>/ { f = !(substr($0, 2) in a) }f' ids.txt seqs.fa
Steve
  • 3,099
  • 1
  • 4
  • 12
0

I know you do not want a Python solution but nevertheless I would recommend it like others. Here is one way to do it using Biopython:

from Bio import SeqIO

ids_file = 'ids.txt'
input_f  = 'input_file.fasta'
output_f = 'result.fasta'

my_dictionary = {}
with open(ids_file, 'r') as f:
    for line in f:
        my_dictionary[line[:-1]]='value'

with open(input_f, 'r') as input_handle, open(output_f, 'w') as output_handle:
    for seq_record in SeqIO.parse(input_handle, "fasta"):
        if not seq_record.id in my_dictionary:
            SeqIO.write(seq_record, output_handle, "fasta")
Supertech
  • 606
  • 2
  • 10