NCBI Blast

Taking a set of unknown sequence files and annotating them

For the first task you will take an unknown multi-fasta file and annotate it using blast. You are welcome to do this in terminal, Rstudio, or jupyter. My recommendation, and how I will demonstrate is using Rmarkdown. Once you have have your project structured, we will download software, databases, a fasta file and run the code.

Assignment

Annotate a provided fasta file with GO slim terms. Your code should be fully reproducible.

In your code directory create a file.

01-blast.Rmd

Tip

Rmarkdown is a good option as you can use markdown, add pictures and more!

Downloading software

NCBI Blast Software is at

https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

Tip

It is best to decide on central location on computer where software will be downloaded.

https://github.com/RobertsLab/code/blob/master/09-blast.ipynb

cd /Applications/bioinfo/
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.13.0+-x64-macosx.tar.gz
tar -xf ncbi-blast-2.13.0+-x64-macosx.tar.gz
ls /Applications/bioinfo/
```{bash}
/Applications/bioinfo/ncbi-blast-2.13.0+/bin/blastx -h
```

Make Blast Database

see https://www.uniprot.org/downloads

https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

cd ../data
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gz
gunzip -k uniprot_sprot_r2023_01.fasta.gz
ls ../data
/Applications/bioinfo/ncbi-blast-2.13.0+/bin/makeblastdb \
-in ../data/uniprot_sprot_r2023_01.fasta \
-dbtype prot \
-out ../blastdb/uniprot_sprot_r2023_01

Get the query sequence

curl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \
-k \
> ../data/Ab_4denovo_CLC6_a.fa
head ../data/Ab_4denovo_CLC6_a.fa
echo "How many sequences are there?"
grep -c ">" ../data/Ab_4denovo_CLC6_a.fa
>solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_1
ACACCCCACCCCAACGCACCCTCACCCCCACCCCAACAATCCATGATTGAATACTTCATC
TATCCAAGACAAACTCCTCCTACAATCCATGATAGAATTCCTCCAAAAATAATTTCACAC
TGAAACTCCGGTATCCGAGTTATTTTGTTCCCAGTAAAATGGCATCAACAAAAGTAGGTC
TGGATTAACGAACCAATGTTGCTGCGTAATATCCCATTGACATATCTTGTCGATTCCTAC
CAGGATCCGGACTGACGAGATTTCACTGTACGTTTATGCAAGTCATTTCCATATATAAAA
TTGGATCTTATTTGCACAGTTAAATGTCTCTATGCTTATTTATAAATCAATGCCCGTAAG
CTCCTAATATTTCTCTTTTCGTCCGACGAGCAAACAGTGAGTTTACTGTGGCCTTCAGCA
AAAGTATTGATGTTGTAAATCTCAGTTGTGATTGAACAATTTGCCTCACTAGAAGTAGCC
TTC
How many sequences are there?
5490

Run Blast

/Applications/bioinfo/ncbi-blast-2.13.0+/bin/blastx \
-query ../data/Ab_4denovo_CLC6_a.fa \
-db ../blastdb/uniprot_sprot_r2023_01 \
-out ../output/Ab_4-uniprot_blastx.tab \
-evalue 1E-20 \
-num_threads 20 \
-max_target_seqs 1 \
-outfmt 6
head -2 ../output/Ab_4-uniprot_blastx.tab
wc -l ../output/Ab_4-uniprot_blastx.tab
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_3    sp|O42248|GBLP_DANRE    82.456  171 30  0   1   513 35  205 2.81e-103   301
solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed_contig_5    sp|Q08013|SSRG_RAT  75.385  65  16  0   3   197 121 185 1.40e-28    104
     765 ../output/Ab_4-uniprot_blastx.tab

Need to convert sp|Q08013|SSRG_RAT to get accession number out.

Getting more information

https://www.uniprot.org/uniprotkb

https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29
```{bash}
curl -O "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/search?query=reviewed:true+AND+organism_id:9606"
```
```{bash}
curl -O -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29"
```

Joining blast table with annotation table

At this point we have a blast output table and annotation table both with a Uniprot accession number. Thus we can join the two tables and be able to get more functional information about thet genes.

```{bash}
head -2 ../output/Ab_4-uniprot_blastx.tab
wc -l ../output/Ab_4-uniprot_blastx.tab
```
```{bash}
tr '|' '\t' < ../output/Ab_4-uniprot_blastx.tab | head -2
```
```{bash}
tr '|' '\t' < ../output/Ab_4-uniprot_blastx.tab \
> ../output/Ab_4-uniprot_blastx_sep.tab
```
```{bash}
head -2 ../data/uniprot_table_r2023_01.tab
wc -l ../data/uniprot_table_r2023_01.tab
```
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("kableExtra")

Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows
bltabl <- read.csv("../output/Ab_4-uniprot_blastx_sep.tab", sep = '\t', header = FALSE)
spgo <- read.csv("../data/uniprot_table_r2023_01.tab", sep = '\t', header = TRUE)
str(spgo)
'data.frame':   569213 obs. of  17 variables:
 $ Entry                             : chr  "A0A023I7E1" "A0A024B7W1" "A0A024SC78" "A0A024SH76" ...
 $ Reviewed                          : chr  "reviewed" "reviewed" "reviewed" "reviewed" ...
 $ Entry.Name                        : chr  "ENG1_RHIMI" "POLG_ZIKVF" "CUTI1_HYPJR" "GUX2_HYPJR" ...
 $ Protein.names                     : chr  "Glucan endo-1,3-beta-D-glucosidase 1 (Endo-1,3-beta-glucanase 1) (EC 3.2.1.39) (Laminarinase) (RmLam81A)" "Genome polyprotein [Cleaved into: Capsid protein C (Capsid protein) (Core protein); Protein prM (Precursor memb"| __truncated__ "Cutinase (EC 3.1.1.74)" "Exoglucanase 2 (EC 3.2.1.91) (1,4-beta-cellobiohydrolase) (Cellobiohydrolase 6A) (Cel6A) (Exocellobiohydrolase "| __truncated__ ...
 $ Gene.Names                        : chr  "ENG1 LAM81A" "" "M419DRAFT_76732" "cbh2 M419DRAFT_122470" ...
 $ Organism                          : chr  "Rhizomucor miehei" "Zika virus (isolate ZIKV/Human/French Polynesia/10087PF/2013) (ZIKV)" "Hypocrea jecorina (strain ATCC 56765 / BCRC 32924 / NRRL 11460 / Rut C-30) (Trichoderma reesei)" "Hypocrea jecorina (strain ATCC 56765 / BCRC 32924 / NRRL 11460 / Rut C-30) (Trichoderma reesei)" ...
 $ Length                            : int  796 3423 248 471 478 693 333 742 2442 455 ...
 $ Gene.Ontology..molecular.function.: chr  "glucan endo-1,3-beta-D-glucosidase activity [GO:0042973]; glucan endo-1,3-beta-glucanase activity, C-3 substitu"| __truncated__ "4 iron, 4 sulfur cluster binding [GO:0051539]; ATP binding [GO:0005524]; ATP hydrolysis activity [GO:0016887]; "| __truncated__ "cutinase activity [GO:0050525]" "cellulose 1,4-beta-cellobiosidase activity [GO:0016162]; cellulose binding [GO:0030248]" ...
 $ Gene.Ontology..GO.                : chr  "extracellular region [GO:0005576]; glucan endo-1,3-beta-D-glucosidase activity [GO:0042973]; glucan endo-1,3-be"| __truncated__ "extracellular region [GO:0005576]; host cell endoplasmic reticulum membrane [GO:0044167]; host cell nucleus [GO"| __truncated__ "extracellular region [GO:0005576]; cutinase activity [GO:0050525]" "extracellular region [GO:0005576]; cellulose 1,4-beta-cellobiosidase activity [GO:0016162]; cellulose binding ["| __truncated__ ...
 $ Gene.Ontology..biological.process.: chr  "cell wall organization [GO:0071555]; polysaccharide catabolic process [GO:0000272]" "clathrin-dependent endocytosis of virus by host cell [GO:0075512]; fusion of virus membrane with host endosome "| __truncated__ "" "cellulose catabolic process [GO:0030245]" ...
 $ Gene.Ontology..cellular.component.: chr  "extracellular region [GO:0005576]" "extracellular region [GO:0005576]; host cell endoplasmic reticulum membrane [GO:0044167]; host cell nucleus [GO"| __truncated__ "extracellular region [GO:0005576]" "extracellular region [GO:0005576]" ...
 $ Gene.Ontology.IDs                 : chr  "GO:0000272; GO:0005576; GO:0042973; GO:0052861; GO:0052862; GO:0071555" "GO:0003724; GO:0003725; GO:0003968; GO:0004252; GO:0004482; GO:0004483; GO:0005198; GO:0005524; GO:0005525; GO:"| __truncated__ "GO:0005576; GO:0050525" "GO:0005576; GO:0016162; GO:0030245; GO:0030248" ...
 $ Interacts.with                    : chr  "" "" "" "" ...
 $ EC.number                         : chr  "3.2.1.39" "2.1.1.56; 2.1.1.57; 2.7.7.48; 3.4.21.91; 3.6.1.15; 3.6.4.13" "3.1.1.74" "3.2.1.91" ...
 $ Reactome                          : chr  "" "" "" "" ...
 $ UniPathway                        : chr  "" "" "" "" ...
 $ InterPro                          : chr  "IPR005200;IPR040720;IPR040451;" "IPR011492;IPR043502;IPR000069;IPR038302;IPR013755;IPR001122;IPR037172;IPR027287;IPR026470;IPR038345;IPR001157;I"| __truncated__ "IPR029058;IPR000675;IPR043580;IPR043579;IPR011150;" "IPR016288;IPR036434;IPR035971;IPR000254;IPR001524;" ...
kbl(
head(
  left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab"))
)
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
V1 V3 V13 Protein.names Organism Gene.Ontology..biological.process. Gene.Ontology.IDs
Ab_contig_3 O42248 0 Guanine nucleotide-binding protein subunit beta-2-like 1 (Receptor of activated protein kinase C) (RACK) Danio rerio (Zebrafish) (Brachydanio rerio) angiogenesis [GO:0001525]; convergent extension involved in gastrulation [GO:0060027]; negative regulation of Wnt signaling pathway [GO:0030178]; positive regulation of gastrulation [GO:2000543]; positive regulation of protein phosphorylation [GO:0001934]; regulation of cell division [GO:0051302]; regulation of establishment of cell polarity [GO:2000114]; regulation of protein localization [GO:0032880]; rescue of stalled ribosome [GO:0072344] GO:0001525; GO:0001934; GO:0005080; GO:0005634; GO:0005737; GO:0005829; GO:0005840; GO:0030178; GO:0032880; GO:0043022; GO:0045182; GO:0051302; GO:0060027; GO:0072344; GO:1990904; GO:2000114; GO:2000543
Ab_contig_5 Q08013 0 Translocon-associated protein subunit gamma (TRAP-gamma) (Signal sequence receptor subunit gamma) (SSR-gamma) Rattus norvegicus (Rat) SRP-dependent cotranslational protein targeting to membrane [GO:0006614] GO:0005783; GO:0005784; GO:0006614
Ab_contig_6 P12234 0 Phosphate carrier protein, mitochondrial (Phosphate transport protein) (PTP) (Solute carrier family 25 member 3) Bos taurus (Bovine) mitochondrial phosphate ion transmembrane transport [GO:1990547]; phosphate ion transmembrane transport [GO:0035435] GO:0005315; GO:0005739; GO:0005743; GO:0015293; GO:0035435; GO:0044877; GO:1990547
Ab_contig_9 Q41629 0 ADP,ATP carrier protein 1, mitochondrial (ADP/ATP translocase 1) (Adenine nucleotide translocator 1) (ANT 1) Triticum aestivum (Wheat) mitochondrial ADP transmembrane transport [GO:0140021]; mitochondrial ATP transmembrane transport [GO:1990544] GO:0005471; GO:0005743; GO:0140021; GO:1990544
Ab_contig_13 Q32NG4 0 Glutamine amidotransferase-like class 1 domain-containing protein 1 (Parkinson disease 7 domain-containing protein 1) Xenopus laevis (African clawed frog) GO:0005576
Ab_contig_23 Q9GNE2 0 60S ribosomal protein L23 (AeRpL17A) (L17A) Aedes aegypti (Yellowfever mosquito) (Culex aegypti) translation [GO:0006412] GO:0003735; GO:0005840; GO:0006412; GO:1990904
left_join(bltabl, spgo,  by = c("V3" = "Entry")) %>%
  select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>% mutate(V1 = str_replace_all(V1, 
            pattern = "solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement = "Ab")) %>%
  write_delim("../output/blast_annot_go.tab", delim = '\t')