I have a FASTA file that contains sequence reads, and a file of sequence IDs that need to be removed from the FASTA file. I have done this before, but the IDs now contain a space, so my code no longer works, and I don't know how to change it since I'm still learning. The FASTA file looks like this:
>m64041_200717_231916/74/12723_24868 id=30
CCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCA
>m64041_200717_231916/77/1941_50622 id=3115
TGAGCCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCATGAG
>m64041_200717_231916/105/20691_65844 id=488
AGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCAAGACTTTAT
>m64041_200717_231916/108/17414_67048 id=4956
TGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCACACTTG
>m64041_200717_231916/162/0_6615 id=857
CAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCACACTTG
The ID list file looks like this:
m64041_200717_231916/74/12723_24868 id=30
m64041_200717_231916/108/17414_67048 id=4956
m64041_200717_231916/105/20691_65844 id=488
I need to remove these sequences from the FASTA file. I have a script that I used earlier, but it is not working, I think because the IDs contain a space. The previous code is:
# use as follows
# ./filter_seq.bash <fastafile> <list_of_ids_you_want_to_remove>
# ./filter_seq.bash all.fasta bacterial_IDlist
grep ">" $1 | sed 's/>//g' | sort | uniq > all_del
sort $2 | uniq > query_del
comm -23 all_del query_del | sort | uniq > filtered_del
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' filtered_del $1 > filtered.fasta
rm *_del
I would be grateful if someone can help me.
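One possible way around the space problem is to compare the whole header line (everything after ">") against the ID list, instead of only the part before the first space. The perl one-liner above captures the ID with \S+, which stops at the first space, so it extracts "m64041_200717_231916/74/12723_24868" while the list contains "m64041_200717_231916/74/12723_24868 id=30", and the two never match. The following is a minimal sketch using awk, not a drop-in replacement for the original script. The file names demo.fasta, demo_ids.txt and filtered.fasta are made up for the example, and the sequences are shortened. It assumes the ID list is non-empty, has exactly one ID per line, and has no trailing whitespace or Windows line endings.

```shell
#!/usr/bin/env bash
# Sketch: drop FASTA records whose full header line appears in an ID list.
# File names here are illustrative, not taken from the original script.
set -euo pipefail

# Small demo input (shortened sequences) so the sketch runs on its own.
cat > demo.fasta <<'EOF'
>m64041_200717_231916/74/12723_24868 id=30
CCGCACCTCC
>m64041_200717_231916/77/1941_50622 id=3115
TGAGCCGCAC
>m64041_200717_231916/105/20691_65844 id=488
AGGCCACTAC
EOF

cat > demo_ids.txt <<'EOF'
m64041_200717_231916/74/12723_24868 id=30
m64041_200717_231916/105/20691_65844 id=488
EOF

# First file (NR == FNR): remember every ID line, spaces included.
# Second file: at each header, decide whether the record is kept;
# the sequence lines that follow inherit that decision.
awk '
    NR == FNR { drop[$0] = 1; next }
    /^>/      { keep = !(substr($0, 2) in drop) }
    keep
' demo_ids.txt demo.fasta > filtered.fasta

cat filtered.fasta
```

Because the decision is carried from the header to the lines after it, this should also work for FASTA files where a sequence is wrapped over several lines.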
sed -i 's, ,_,g' in.fa will replace the spaces with underscores. Spaces in any kind of genetic data file are bad news and will cause downstream errors, so removing them will make further analysis much easier. – user438383 Feb 04 '21 at 18:49
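The suggestion in the comment can be sketched as follows. This is only an illustration under a few assumptions: the file names (demo2.fasta, demo2_ids.txt and the .nospace outputs) are invented, the sequences are shortened, and the output is written to new files rather than edited in place, since the in-place option of sed behaves differently between GNU and BSD/macOS versions. The ID list needs the same replacement as the FASTA headers, otherwise the two files no longer agree.

```shell
#!/usr/bin/env bash
# Sketch: replace spaces with underscores in FASTA headers and in the ID list.
# File names here are illustrative only.
set -euo pipefail

# Small demo input (shortened sequences) so the sketch runs on its own.
cat > demo2.fasta <<'EOF'
>m64041_200717_231916/74/12723_24868 id=30
CCGCACCTCC
>m64041_200717_231916/162/0_6615 id=857
CAGACTTTAT
EOF

cat > demo2_ids.txt <<'EOF'
m64041_200717_231916/74/12723_24868 id=30
EOF

# Restrict the substitution to header lines so sequence lines are never touched.
sed '/^>/ s/ /_/g' demo2.fasta > demo2.nospace.fasta

# Every line of the ID list is an ID, so no address is needed here.
sed 's/ /_/g' demo2_ids.txt > demo2_ids.nospace.txt

cat demo2.nospace.fasta demo2_ids.nospace.txt
```

With the spaces gone, a pattern such as \S+ captures the complete ID again, so the original filter script should be able to run on the two converted files unchanged.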