NCBI Blast

Taking a set of unknown sequence files and annotating them

Screen Recording

For the first task you will take an unknown multi-fasta file and annotate it using blast. You are welcome to do this in terminal, Rstudio, or jupyter. My recommendation, and how I will demonstrate is using Rmarkdown. Once you have have your project structured, we will download software, databases, a fasta file and run the code.

Assignment

Annotate a provided fasta file with GO slim terms. Your code should be fully reproducible.

In your code directory create a file.

01-blast.Rmd

Tip

Rmarkdown is a good option as you can use markdown, add pictures and more!

Downloading software

NCBI Blast Software is at

https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

Tip

It is best to decide on central location on computer where software will be downloaded.

https://github.com/RobertsLab/code/blob/master/09-blast.ipynb

cd /Applications/bioinfo/
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.13.0+-x64-macosx.tar.gz
tar -xf ncbi-blast-2.13.0+-x64-macosx.tar.gz

ls /Applications/bioinfo/

```{bash}
/Applications/bioinfo/ncbi-blast-2.13.0+/bin/blastx -h
```

Make Blast Database

see https://www.uniprot.org/downloads

https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz
gunzip -k uniprot_sprot_r2023_01.fasta.gz
ls ../data

/Applications/bioinfo/ncbi-blast-2.13.0+/bin/makeblastdb \
-in ../data/uniprot_sprot_r2023_01.fasta \
-dbtype prot \
-out ../blastdb/uniprot_sprot_r2023_01

Get the query sequence

curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \
-k \
> ../data/Ab_4denovo_CLC6_a.fa

head ../data/Ab_4denovo_CLC6_a.fa
echo "How many sequences are there?"
grep -c ">" ../data/Ab_4denovo_CLC6_a.fa

>solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_1
ACACCCCACCCCAACGCACCCTCACCCCCACCCCAACAATCCATGATTGAATACTTCATC
TATCCAAGACAAACTCCTCCTACAATCCATGATAGAATTCCTCCAAAAATAATTTCACAC
TGAAACTCCGGTATCCGAGTTATTTTGTTCCCAGTAAAATGGCATCAACAAAAGTAGGTC
TGGATTAACGAACCAATGTTGCTGCGTAATATCCCATTGACATATCTTGTCGATTCCTAC
CAGGATCCGGACTGACGAGATTTCACTGTACGTTTATGCAAGTCATTTCCATATATAAAA
TTGGATCTTATTTGCACAGTTAAATGTCTCTATGCTTATTTATAAATCAATGCCCGTAAG
CTCCTAATATTTCTCTTTTCGTCCGACGAGCAAACAGTGAGTTTACTGTGGCCTTCAGCA
AAAGTATTGATGTTGTAAATCTCAGTTGTGATTGAACAATTTGCCTCACTAGAAGTAGCC
TTC
How many sequences are there?
5490

Run Blast

/Applications/bioinfo/ncbi-blast-2.13.0+/bin/blastx \
-query ../data/Ab_4denovo_CLC6_a.fa \
-db ../blastdb/uniprot_sprot_r2023_01 \
-out ../output/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 20 \
-max_target_seqs 1 \
-outfmt 6

head -2 ../output/Ab_4-uniprot_blastx.tab
wc -l ../output/Ab_4-uniprot_blastx.tab

solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3    sp|O42248|GBLP_DANRE    82.456  171 30  0   1   513 35  205 2.81e-103   301
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5    sp|Q08013|SSRG_RAT  75.385  65  16  0   3   197 121 185 1.40e-28    104
     765 ../output/Ab_4-uniprot_blastx.tab

Need to convert sp|Q08013|SSRG_RAT to get accession number out.

Getting more information

https://www.uniprot.org/uniprotkb

```{bash}
curl -O -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29"
```

Joining blast table with annotation table

At this point we have a blast output table and annotation table both with a Uniprot accession number. Thus we can join the two tables and be able to get more functional information about thet genes.

```{bash}
head -2 ../output/Ab_4-uniprot_blastx.tab
wc -l ../output/Ab_4-uniprot_blastx.tab
```

```{bash}
tr '|' '\t' < ../output/Ab_4-uniprot_blastx.tab | head -2
```

```{bash}
tr '|' '\t' < ../output/Ab_4-uniprot_blastx.tab \
> ../output/Ab_4-uniprot_blastx_sep.tab
```

```{bash}
head -2 ../data/uniprot_table_r2023_01.tab
wc -l ../data/uniprot_table_r2023_01.tab
```

```{r}
library(tidyverse)
library("kableExtra")
```

bltabl <- read.csv("../output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)

#spgo <- read.csv("../data/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE)

kbl(
head(
  left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab"))
)
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>%
  write_delim("../output/blast_annot_go.tab", delim = '\t')

--- title: "NCBI Blast" subtitle: "Taking a set of unknown sequence files and annotating them" format: html: code-fold: false code-tools: true code-copy: true highlight-style: github code-overflow: wrap --- ## Screen Recording [![](http://gannet.fish.washington.edu/seashell/snaps/Monosnap_Recording_Information_-_Zoom_2025-04-01_19-12-07.png){fig-align="left"}](https://washington.zoom.us/rec/share/r2_bDwXOIWDMG7_1WDr5PErr2RrwAnV6U-WG3giCxGRT-kL6fq7574awH38F4nw.2ayImTAVVNW1Er7G?startTime=1743696993000) For the first task you will take an unknown multi-fasta file and annotate it using blast. You are welcome to do this in terminal, Rstudio, or jupyter. My recommendation, and how I will demonstrate is using Rmarkdown. Once you have have your project structured, we will download software, databases, a fasta file and run the code. ::: callout-important ## Assignment Annotate a provided fasta file with GO slim terms. Your code should be fully reproducible. ::: In your code directory create a file. `01-blast.Rmd` ::: callout-tip Rmarkdown is a good option as you can use markdown, add pictures and more! ::: # Downloading software NCBI Blast Software is at <https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/> ::: callout-tip It is best to decide on central location on computer where software will be downloaded. ::: https://github.com/RobertsLab/code/blob/master/09-blast.ipynb ``` {.bash .code-overflow-wrap} cd /Applications/bioinfo/ curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.13.0+-x64-macosx.tar.gz tar -xf ncbi-blast-2.13.0+-x64-macosx.tar.gz ``` ``` bash ls /Applications/bioinfo/ ``` ```{bash} /Applications/bioinfo/ncbi-blast-2.13.0+/bin/blastx -h ``` # Make Blast Database see <https://www.uniprot.org/downloads> https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz ``` bash cd ../data curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz gunzip -k uniprot_sprot_r2023_01.fasta.gz ls ../data ``` ``` bash /Applications/bioinfo/ncbi-blast-2.13.0+/bin/makeblastdb \ -in ../data/uniprot_sprot_r2023_01.fasta \ -dbtype prot \ -out ../blastdb/uniprot_sprot_r2023_01 ``` # Get the query sequence ``` bash curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \ -k \ > ../data/Ab_4denovo_CLC6_a.fa ``` ```{bash} head ../data/Ab_4denovo_CLC6_a.fa echo "How many sequences are there?" grep -c ">" ../data/Ab_4denovo_CLC6_a.fa ``` # Run Blast ``` bash /Applications/bioinfo/ncbi-blast-2.13.0+/bin/blastx \ -query ../data/Ab_4denovo_CLC6_a.fa \ -db ../blastdb/uniprot_sprot_r2023_01 \ -out ../output/Ab_4-uniprot_blastx.tab \ -evalue 1E-20 \ -num_threads 20 \ -max_target_seqs 1 \ -outfmt 6 ``` ```{bash} head -2 ../output/Ab_4-uniprot_blastx.tab wc -l ../output/Ab_4-uniprot_blastx.tab ``` Need to convert `sp|Q08013|SSRG_RAT` to get accession number out. # Getting more information ```{r schemat, echo = FALSE, out.width = "70%", fig.align = "center"} knitr::include_graphics("../img/uniprot.png") ``` https://www.uniprot.org/uniprotkb ```{r api, echo = FALSE, out.width = "70%", fig.align = "center"} knitr::include_graphics("../img/uniprot-api.png") ``` ```` ```{bash} curl -O -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29" ``` ```` ## Joining blast table with annotation table At this point we have a blast output table and annotation table both with a Uniprot accession number. Thus we can join the two tables and be able to get more functional information about thet genes. ```{bash} head -2 ../output/Ab_4-uniprot_blastx.tab wc -l ../output/Ab_4-uniprot_blastx.tab ``` ```{bash} tr '|' '\t' < ../output/Ab_4-uniprot_blastx.tab | head -2 ``` ```{bash} tr '|' '\t' < ../output/Ab_4-uniprot_blastx.tab \ > ../output/Ab_4-uniprot_blastx_sep.tab ``` ```{bash} head -2 ../data/uniprot_table_r2023_01.tab wc -l ../data/uniprot_table_r2023_01.tab ``` ```{r} library(tidyverse) library("kableExtra") ``` ```{r} bltabl <- read.csv("../output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE) ``` ```{r, cache=TRUE} #spgo <- read.csv("../data/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE) ``` ```{r, echo=FALSE} #str(spgo) ``` ```{r, eval = FALSE} kbl( head( left_join(bltabl, spgo, by = c("V3" = "Entry")) %>% select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) ) ) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) ``` ```{r, eval = FALSE} left_join(bltabl, spgo, by = c("V3" = "Entry")) %>% select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>% write_delim("../output/blast_annot_go.tab", delim = '\t') ```