Convert ncbi-blast+ output format (blast.out) to Fasta format

Ambu Vijayan
Mar 5, 2022

Today I had the privilege of helping my mentor convert a blast.out file to a simple FASTA file. All he needed was the subject sequence. I first tried Blast_Formatter, but it no longer works. So I went looking for another solution to this simple problem, and I found one that relies on plain text manipulation.

Notepad++

Notepad++ is a free source code editor and Notepad replacement that supports several languages and runs in the MS Windows environment.

This means it installs straight away on Windows, but it is not natively available on Linux.

Wine

Wine is a compatibility layer capable of running Windows applications on several POSIX-compliant operating systems, such as Linux, macOS, & BSD. Instead of simulating internal Windows logic like a virtual machine or emulator, Wine translates Windows API calls into POSIX calls on-the-fly, eliminating the performance and memory penalties of other methods and allowing you to cleanly integrate Windows applications into your desktop.

Basically, you can run Windows applications on Linux with Wine.

>1. Installing wine

sudo apt install wine

>2. Installing Notepad++

Download the latest version of Notepad++ from the link above and install the exe file using Wine.

Open a terminal in the folder where the exe file was downloaded.

wine npp.x.x.x.installer.x64.exe

The version and architecture (32-bit or 64-bit) may be different for you, so change the file name accordingly. Now install it just as you would on Windows, and the launcher icon will appear among your Linux programs.

Now we are ready to edit the blast.out file.

Let me show you an example.

This was the file I received, and I want only the sequence from the Sbjct lines, in FASTA format.

Open the file in Notepad++ via File > Open

In the Open dialog, browse the folder tree to reach Downloads or whichever folder on your Linux system holds the file. It is a bit confusing if you go through My Computer.

#1. Select and delete everything above and below the alignment section, so that only the alignment remains.

#2. (*Optional) Now remove the “|” symbols between the lines using the Find and Replace option. Press Control+F to open the Find and Replace dialogue box, select the second tab (Replace), paste the symbol in the Find What box, leave the Replace With box blank and select “Replace All”.
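If you prefer a script over the dialogue box, the same replacement can be sketched in Python. This is only an illustration: the alignment fragment below is made up, and the variable names are my own.

```python
# Hypothetical alignment fragment; the sequences are invented for illustration.
alignment = "\n".join([
    "Query  1    ATGCGTAC  8",
    "            ||| ||||",
    "Sbjct  101  ATGAGTAC  108",
])

# Same idea as Find What: "|" with an empty Replace With box, then Replace All.
without_pipes = alignment.replace("|", "")

print(without_pipes)
```

The Query and Sbjct lines are untouched. Only the match line between them loses its symbols and is left as a line of spaces.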

You will end up with a text file like this.

#3. Now let's remove the blank lines and the unwanted Query sequences.

Press Control+F to open the Find and Replace dialogue box and select the Mark tab.

Type “Sbjct” in the Find What box, check the Bookmark Line box and select Mark All. Close the dialogue box.

Now go to the “Search” menu at the top > Bookmark > Remove Unmarked Lines.

You will find that all the lines without the word “Sbjct” have been removed.
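The bookmark trick boils down to keeping only the lines that contain “Sbjct”. Here is a minimal Python sketch of that filter. The function name `keep_sbjct_lines` and the sample text are hypothetical, not part of any BLAST tool.

```python
def keep_sbjct_lines(text: str) -> list[str]:
    """Return only the lines that contain the word 'Sbjct'."""
    return [line for line in text.splitlines() if "Sbjct" in line]


# Made-up alignment text with Query lines, match lines and a blank line.
sample = "\n".join([
    "Query  1    ATGCGTAC  8",
    "            ||| ||||",
    "Sbjct  101  ATGAGTAC  108",
    "",
    "Query  9    GGCTA  13",
    "            |||||",
    "Sbjct  109  GGCTA  113",
])

sbjct_lines = keep_sbjct_lines(sample)
for line in sbjct_lines:
    print(line)
```

Blank lines, Query lines and the “|” match lines all drop out in one pass, which is why step #2 is optional.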

#4. Now let's remove all the numbers.

Press Control+F to open the Find and Replace dialogue box, select the second tab (Replace), paste [0-9]+ in the Find What box, leave the Replace With box blank, change the Search Mode to Regular Expression and select “Replace All”.
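The same regular expression works unchanged in Python's `re` module. A small sketch on one made-up Sbjct line:

```python
import re

# Hypothetical Sbjct line with start and end coordinates.
line = "Sbjct  101  ATGAGTAC  108"

# [0-9]+ matches every run of digits; replacing with "" deletes them.
no_digits = re.sub(r"[0-9]+", "", line)

print(repr(no_digits))
```

The coordinates disappear but the spaces around them stay, which is why the trimming step further down is still needed.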

#5. Delete the “Sbjct” text using the Find and Replace dialogue box, again leaving the Replace With box blank.

#6. Delete unwanted spaces at the beginning and end of all lines by using Edit > Blank Operations > Trim Leading and Trailing Space.
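In a script, steps #5 and #6 are one line each. A rough sketch, assuming the digits have already been removed from the (made-up) lines:

```python
# Hypothetical lines as they might look after the numbers were removed.
lines = ["Sbjct    ATGAGTAC  ", "Sbjct    GGCTA  "]

# Remove the "Sbjct" label, then trim leading and trailing whitespace.
cleaned = [line.replace("Sbjct", "").strip() for line in lines]

print(cleaned)
```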

#7. Merge all lines into one line by selecting all the text (Ctrl+A) and going to Edit > Line Operations > Join Lines in the menu, or by pressing Ctrl+J.

#8. Make the FASTA by adding a header line that starts with > (followed by a name for the sequence) at the top of the file, above the sequence.
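For anyone who would rather script the whole procedure, the eight steps can be sketched as one small Python function. This is not an official BLAST converter. It assumes the text contains the alignment of a single hit, as in the manual procedure, and the function name `sbjct_to_fasta` and the default header `subject_sequence` are placeholders of my own.

```python
import re


def sbjct_to_fasta(alignment_text: str, header: str = "subject_sequence") -> str:
    """Build a FASTA record from the Sbjct lines of a pairwise alignment."""
    pieces = []
    for line in alignment_text.splitlines():
        if "Sbjct" not in line:                    # step 3: drop other lines
            continue
        line = re.sub(r"[0-9]+", "", line)         # step 4: remove numbers
        line = line.replace("Sbjct", "").strip()   # steps 5 and 6
        pieces.append(line)
    sequence = "".join(pieces)                     # step 7: join the lines
    return f">{header}\n{sequence}\n"              # step 8: add the header


# Made-up alignment text for illustration only.
sample = "\n".join([
    "Query  1    ATGCGTAC  8",
    "            ||| ||||",
    "Sbjct  101  ATGAGTAC  108",
    "",
    "Query  9    GGCTA  13",
    "            |||||",
    "Sbjct  109  GGCTA  113",
])

fasta = sbjct_to_fasta(sample)
print(fasta)
```

The pieces are joined with nothing between them, so the sequence comes out as one unbroken line. Any gap characters from the alignment are still kept at this point.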

WARNING!!! Make sure to look for characters other than A, T, G and C (such as the gap character “-”) that were introduced into the sequence by the alignment step of the BLAST process.

Please refer to the article “Effective Automated Feature Construction and Selection for Classification of Biological Sequences” by Uday Kamath, Kenneth De Jong and Amarda Shehu to learn more about this.

Thank you. I hope this helped you.

*#2 is optional because #3 clears the special characters too, but it is kept here to show you how to handle special characters when they are not on a separate line.
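A quick way to run this check is to count every character that is not A, T, G or C. A minimal sketch, where the function name `unexpected_characters` and the example sequence are hypothetical:

```python
from collections import Counter


def unexpected_characters(sequence: str) -> Counter:
    """Count characters in the sequence that are not A, T, G or C."""
    return Counter(ch for ch in sequence.upper() if ch not in "ATGC")


# Made-up sequence containing one gap and one ambiguous base.
report = unexpected_characters("ATG-AGTNC")
print(report)
```

An empty counter means the sequence holds only the four bases. Anything else tells you which characters to inspect before using the file.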

Ambu Vijayan

Bioinformatics Analyst, Statistical Consultant, R-Programmer