I have a strange protein FASTA file. There are several entries for the same gene, and for each gene I need to extract the entry with the longest sequence (greatest width).
>lcl|NW_017095468.1_prot_XP_017786561.1_8 [gene=LOC108569215] [db_xref=GeneID:108569215] [protein=NADH dehydrogenase [ubiquinone] 1 alpha subcomplex subunit 9, mitochondrial] [protein_id=XP_017786561.1] [location=complement(join(88030..88986,89044..89243,89412..89448))] [gbkey=CDS]
MAAVIFTGAQLLKQQSGLVGIAYIRVNNYSSDAKVYNLASLKRGTGGRSSFNGIVATVFGSTGFIGRYVANKLGKIGTQL
ILPYRGNNYETMRLKLCGDLGQVLFQPYYLKDLESIDKALKYSNVVINLVGRDWETKNFSFQDVHVKGARDLARAAKKAG
VEKFIHLSALNCDGPNEAIFSRQGSKFLSSKWEGEQAVLEEFPEATIFRPSDVYGQEDRFLRYYAHNWRRQGQYMPMWKN
GEATIKQPVHVSDLAAGIVAAIKDPEAAGKVYQAVGPRRYQLNELVDWFYRVMRKDSEWGYKRYDMKYDMFFKLKVSLTQ
KFSPAFPVGNLHWERLEREFVTDRVNQAIPTLEDLGVNLRRMEDQVPWELKPFTYGLYFGTDSEEPVVEPKPPKYVS
>lcl|NW_017095468.1_prot_XP_017770525.1_9 [gene=LOC108557919] [db_xref=GeneID:108557919] [protein=spondin-1 isoform X1] [protein_id=XP_017770525.1] [location=join(95471..95648,106560..106774,106834..107025,107093..107299,107562..107816,107872..108474,108526..108799,108885..109058,109127..109237,109573..109812,109882..110114)] [gbkey=CDS]
MRLKVAFLWLVSSISWIGEALRCDRTPEGTFSPRTRADGRFVIEVSGNPDTYVPGEQYNIFLRSNGEYQAKNKFKDFLLL
VEHEPSEKILGEVHNPSVGTLQLLGDMLMKFSEKCRNAVMQTNSLPKSEVQVLWVAPPSGSGCVAIRATVVESKEFWYTD
DGPLSKILCEEVQENEDTQPNMLRQCCACDEAKYEVTFEGLWSRNTHPKDFPSNGWLTRFSDIIGASHTFDYTFWNYGEI
ASNGLRQLAENGNTRMLESELKAKSEHIRTIIKARGISYPNITGKTFAVFRVDKRHHLMSLVSMIDPSPDWIVGVSGLEL
CLRNCSWVESRVLNLYPWDAGTDDGPTYISANQPSMPPHPIRRIKSNSPNDPRSPFYDPTGTEMKPLARLYLSRQRLYEK
NCVAQVDVSEEDGGVEGDKCEMEEWSEWSKCTVTCGRGFKYKQRAYKNPASNFVCNKPLTKRASCVAILEHCSNQQRPQE
ADPSCSLTGWGNWSSCTAPCGPGWKTRSRRYKNRKAAKRCAAGNENPEPLEQNLECMERECGPSDRRPLQESKECEGRAW
SNWSPCSSTCGKGIKVRRRMAYRSLWGRSPARYNRGLFDTEDTSRDDDDDGSDEDPCMNLDEKVECINDDVPVCEDTVDN
SAVVCGFPRDEGGCMSNVDRWYFDVIKGNCDIFSYSGCQGNKNNFKTLERCENVCDSYKKELLANRTAYKRQLGVTVSGV
LSYNLHHMQNDDADNCVPGSQTRQDQDKKIIQEPIGEVVDCQMSEWTNWSGCNATCGRGFSTKHRFIRVHPSNGGKRCPQ
KILKKRKCKIPCPGDYTKRDPMLPTWGTANSLEHVQIDCVMTGWSAWSPCSRSCGPNAVQQRTRGILLPPSGRGEPCLHR
TEERPCSLLACPE
>lcl|NW_017095468.1_prot_XP_017771299.1_10 [gene=LOC108557919] [db_xref=GeneID:108557919] [protein=spondin-1 isoform X1] [protein_id=XP_017771299.1] [location=join(95471..95648,106560..106774,106834..107025,107093..107299,107562..107816,107872..108474,108526..108799,108885..109058,109127..109237,109573..109812,109882..110114)] [gbkey=CDS]
MRLKVAFLWLVSSISWIGEALRCDRTPEGTFSPRTRADGRFVIEVSGNPDTYVPGEQYNIFLRSNGEYQAKNKFKDFLLL
VEHEPSEKILGEVHNPSVGTLQLLGDMLMKFSEKCRNAVMQTNSLPKSEVQVLWVAPPSGSGCVAIRATVVESKEFWYTD
DGPLSKILCEEVQENEDTQPNMLRQCCACDEAKYEVTFEGLWSRNTHPKDFPSNGWLTRFSDIIGASHTFDYTFWNYGEI
ASNGLRQLAENGNTRMLESELKAKSEHIRTIIKARGISYPNITGKTFAVFRVDKRHHLMSLVSMIDPSPDWIVGVSGLEL
CLRNCSWVESRVLNLYPWDAGTDDGPTYISANQPSMPPHPIRRIKSNSPNDPRSPFYDPTGTEMKPLARLYLSRQRLYEK
NCVAQVDVSEEDGGVEGDKCEMEEWSEWSKCTVTCGRGFKYKQRAYKNPASNFVCNKPLTKRASCVAILEHCSNQQRPQE
ADPSCSLTGWGNWSSCTAPCGPGWKTRSRRYKNRKAAKRCAAGNENPEPLEQNLECMERECGPSDRRPLQESKECEGRAW
SNWSPCSSTCGKGIKVRRRMAYRSLWGRSPARYNRGLFDTEDTSRDDDDDGSDEDPCMNLDEKVECINDDVPVCEDTVDN
SAVVCGFPRDEGGCMSNVDRWYFDVIKGNCDIFSYSGCQGNKNNFKTLERCENVCDSYKKELLANRTAYKRQLGVTVSGV
LSYNLHHMQNDDADNCVPGSQTRQDQDKKIIQEPIGEVVDCQMSEWTNWSGCNATCGRGFSTKHRFIRVHPSNGGKRCPQ
KILKKRKCKIPCPGDYTKRDPMLPTWGTANSLEHVQIDCVMTGWSAWSPCSRSCGPNAVQQRTRGILLPPSGRGEPCLHR
TEERPCSLLACPE
>lcl|NW_017095468.1_prot_XP_017772069.1_11 [gene=LOC108557919] [db_xref=GeneID:108557919] [protein=spondin-1 isoform X2] [protein_id=XP_017772069.1] [location=join(95471..95648,106560..106774,106834..107025,107093..107299,107562..107816,107872..108474,108526..108799,109546..109812,109882..110114)] [gbkey=CDS]
MRLKVAFLWLVSSISWIGEALRCDRTPEGTFSPRTRADGRFVIEVSGNPDTYVPGEQYNIFLRSNGEYQAKNKFKDFLLL
VEHEPSEKILGEVHNPSVGTLQLLGDMLMKFSEKCRNAVMQTNSLPKSEVQVLWVAPPSGSGCVAIRATVVESKEFWYTD
DGPLSKILCEEVQENEDTQPNMLRQCCACDEAKYEVTFEGLWSRNTHPKDFPSNGWLTRFSDIIGASHTFDYTFWNYGEI
ASNGLRQLAENGNTRMLESELKAKSEHIRTIIKARGISYPNITGKTFAVFRVDKRHHLMSLVSMIDPSPDWIVGVSGLEL
CLRNCSWVESRVLNLYPWDAGTDDGPTYISANQPSMPPHPIRRIKSNSPNDPRSPFYDPTGTEMKPLARLYLSRQRLYEK
NCVAQVDVSEEDGGVEGDKCEMEEWSEWSKCTVTCGRGFKYKQRAYKNPASNFVCNKPLTKRASCVAILEHCSNQQRPQE
ADPSCSLTGWGNWSSCTAPCGPGWKTRSRRYKNRKAAKRCAAGNENPEPLEQNLECMERECGPSDRRPLQESKECEGRAW
SNWSPCSSTCGKGIKVRRRMAYRSLWGRSPARYNRGLFDTEDTSRDDDDDGSDEDPCMNLDEKVECINDDVPVCEDTVDN
SVTENYFGRIVPGSQTRQDQDKKIIQEPIGEVVDCQMSEWTNWSGCNATCGRGFSTKHRFIRVHPSNGGKRCPQKILKKR
KCKIPCPGDYTKRDPMLPTWGTANSLEHVQIDCVMTGWSAWSPCSRSCGPNAVQQRTRGILLPPSGRGEPCLHRTEERPC
SLLACPE
>lcl|NW_017095468.1_prot_XP_017772815.1_12 [gene=LOC108557919] [db_xref=GeneID:108557919] [protein=spondin-1 isoform X3] [protein_id=XP_017772815.1] [location=join(95471..95648,106560..106774,106834..107025,107093..107299,107562..107816,107872..108474,108526..108799,109573..109812,109882..110114)] [gbkey=CDS]
MRLKVAFLWLVSSISWIGEALRCDRTPEGTFSPRTRADGRFVIEVSGNPDTYVPGEQYNIFLRSNGEYQAKNKFKDFLLL
VEHEPSEKILGEVHNPSVGTLQLLGDMLMKFSEKCRNAVMQTNSLPKSEVQVLWVAPPSGSGCVAIRATVVESKEFWYTD
I've tried using Biostrings, but I'm unfamiliar with how to index this type of file. Any help is appreciated.
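One way to approach the task described above is to group records by the `gene=` tag in the header and keep the longest sequence per group. The following is a minimal Python sketch, not a definitive solution: it assumes every header sits on a single line and contains a `[gene=...]` tag, and the function names (`read_fasta`, `longest_per_gene`) and the toy records are hypothetical, made up for illustration. When two entries have the same length, the first one encountered is kept.

```python
import re

# Assumption: the gene name appears in the header as "[gene=SOMETHING]".
GENE_RE = re.compile(r"\[gene=([^\]]+)\]")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(records):
    """Map each gene name to the (header, sequence) with the longest sequence."""
    best = {}
    for header, seq in records:
        match = GENE_RE.search(header)
        # Fall back to the whole header if no gene tag is present.
        key = match.group(1) if match else header
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return best


# Toy input (made up, much shorter than real records).
demo = """>a [gene=LOC1] [protein_id=XP_1]
MAAV
IFT
>b [gene=LOC2] [protein_id=XP_2]
MRLK
>c [gene=LOC2] [protein_id=XP_3]
MRLKVAF
""".splitlines()

best = longest_per_gene(read_fasta(demo))
for header, seq in best.values():
    print(f">{header}")
    print(seq)
```

For a real file, the same functions could be fed an open file handle instead of `demo`, since `read_fasta` only needs an iterable of lines.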
XP_123), for the gene=LOC string? All of the above? Will the entire header line be identical or only parts of it? – terdon Jul 09 '19 at 17:08
XP bits are RefSeq protein accessions. These can be different for different protein products of the same gene. It looks like the solutions will need to look at either the gene= or the GeneID: values. – terdon Jul 09 '19 at 19:30
exonerate on using all proteins as queries. But yeah, picking the longest might make sense if you don't have a decent machine. Also, if my answer isn't working for you, you probably have ID lines that are different from what you've shown here. My solution just takes the 2nd field of the line ([gene=LOCNNNN] in your example); if that changes, it will not work. – terdon Jul 11 '19 at 08:24
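To illustrate the caveat in the last comment, here is a small Python sketch contrasting a positional key (the 2nd whitespace-separated field) with a key pulled out by name (the `GeneID:` value). It is only an illustration under assumptions: the helper names are hypothetical, the first header is shortened from the example in the question, and the second header is an invented reordering used to show where the positional approach breaks.

```python
import re


def key_by_position(header):
    """Return the 2nd whitespace-separated field, or None if it is missing."""
    fields = header.split()
    return fields[1] if len(fields) > 1 else None


def key_by_gene_id(header):
    """Return the digits after 'GeneID:', wherever they occur in the header."""
    match = re.search(r"GeneID:(\d+)", header)
    return match.group(1) if match else None


# Header shortened from the question's example.
as_posted = (">lcl|NW_017095468.1_prot_XP_017772069.1_11 "
             "[gene=LOC108557919] [db_xref=GeneID:108557919]")

# Hypothetical header with the same tags in a different order.
reordered = (">lcl|NW_017095468.1_prot_XP_017772069.1_11 "
             "[db_xref=GeneID:108557919] [gene=LOC108557919]")

for header in (as_posted, reordered):
    print(key_by_position(header), key_by_gene_id(header))
```

The positional key changes when the tag order changes, while the name-based key stays the same, which is why matching on `gene=` or `GeneID:` is the more robust choice if header layouts vary.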