Taking a set of unknown sequence files and annotating them
For the first task you will take an unknown multi-fasta file and annotate it using blast. You are welcome to do this in terminal, Rstudio, or jupyter. My recommendation, and how I will demonstrate is using Rmarkdown. Once you have have your project structured, we will download software, databases, a fasta file and run the code.
Assignment
Annotate a provided fasta file with GO slim terms. Your code should be fully reproducible.
In your code directory create a file.
01-blast.Rmd
Tip
Rmarkdown is a good option as you can use markdown, add pictures and more!
cd /Applications/bioinfo/curl-O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.13.0+-x64-macosx.tar.gztar-xf ncbi-blast-2.13.0+-x64-macosx.tar.gz
At this point we have a blast output table and annotation table both with a Uniprot accession number. Thus we can join the two tables and be able to get more functional information about thet genes.
Guanine nucleotide-binding protein subunit beta-2-like 1 (Receptor of activated protein kinase C) (RACK)
Danio rerio (Zebrafish) (Brachydanio rerio)
angiogenesis [GO:0001525]; convergent extension involved in gastrulation [GO:0060027]; negative regulation of Wnt signaling pathway [GO:0030178]; positive regulation of gastrulation [GO:2000543]; positive regulation of protein phosphorylation [GO:0001934]; regulation of cell division [GO:0051302]; regulation of establishment of cell polarity [GO:2000114]; regulation of protein localization [GO:0032880]; rescue of stalled ribosome [GO:0072344]
---title: "NCBI Blast"subtitle: "Taking a set of unknown sequence files and annotating them"format: html: code-fold: false code-tools: true code-copy: true highlight-style: github code-overflow: wrap---For the first task you will take an unknown multi-fasta file and annotate it using blast. You are welcome to do this in terminal, Rstudio, or jupyter. My recommendation, and how I will demonstrate is using Rmarkdown. Once you have have your project structured, we will download software, databases, a fasta file and run the code.::: callout-important## AssignmentAnnotate a provided fasta file with GO slim terms. Your code should be fully reproducible.:::In your code directory create a file.`01-blast.Rmd`::: callout-tipRmarkdown is a good option as you can use markdown, add pictures and more!:::# Downloading softwareNCBI Blast Software is at<https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/>::: callout-tipIt is best to decide on central location on computer where software will be downloaded.:::https://github.com/RobertsLab/code/blob/master/09-blast.ipynb``` {.bash .code-overflow-wrap}cd /Applications/bioinfo/curl-O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.13.0+-x64-macosx.tar.gztar-xf ncbi-blast-2.13.0+-x64-macosx.tar.gz`````` bashls /Applications/bioinfo/``````{bash}/Applications/bioinfo/ncbi-blast-2.13.0+/bin/blastx-h```# Make Blast Databasesee <https://www.uniprot.org/downloads>https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz``` bashcd ../datacurl-O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gzmv uniprot_sprot.fasta.gz uniprot_sprot_r2023_01.fasta.gzgunzip-k uniprot_sprot_r2023_01.fasta.gzls ../data`````` bash/Applications/bioinfo/ncbi-blast-2.13.0+/bin/makeblastdb\-in ../data/uniprot_sprot_r2023_01.fasta \-dbtype prot \-out ../blastdb/uniprot_sprot_r2023_01```# Get the query sequence``` bashcurl https://eagle.fish.washington.edu/cnidarian/Ab_4denovo_CLC6_a.fa \-k \> ../data/Ab_4denovo_CLC6_a.fa``````{bash}head ../data/Ab_4denovo_CLC6_a.faecho"How many sequences are there?"grep-c">" ../data/Ab_4denovo_CLC6_a.fa```# Run Blast``` bash/Applications/bioinfo/ncbi-blast-2.13.0+/bin/blastx\-query ../data/Ab_4denovo_CLC6_a.fa \-db ../blastdb/uniprot_sprot_r2023_01 \-out ../output/Ab_4-uniprot_blastx.tab \-evalue 1E-20 \-num_threads 20 \-max_target_seqs 1 \-outfmt 6``````{bash}head-2 ../output/Ab_4-uniprot_blastx.tabwc-l ../output/Ab_4-uniprot_blastx.tab```Need to convert `sp|Q08013|SSRG_RAT` to get accession number out.# Getting more information```{r schemat, echo = FALSE, out.width = "70%", fig.align = "center"}knitr::include_graphics("../img/uniprot.png")```https://www.uniprot.org/uniprotkb```{r api, echo = FALSE, out.width = "70%", fig.align = "center"}knitr::include_graphics("../img/uniprot-api.png")``` https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29``` {{bash}}curl -O "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/search?query=reviewed:true+AND+organism_id:9606"`````` {{bash}}curl -O -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession%2Creviewed%2Cid%2Cprotein_name%2Cgene_names%2Corganism_name%2Clength%2Cgo_f%2Cgo%2Cgo_p%2Cgo_c%2Cgo_id%2Ccc_interaction%2Cec%2Cxref_reactome%2Cxref_unipathway%2Cxref_interpro&format=tsv&query=%28%2A%29%20AND%20%28reviewed%3Atrue%29"```## Joining blast table with annotation tableAt this point we have a blast output table and annotation table both with a Uniprot accession number. Thus we can join the two tables and be able to get more functional information about thet genes.```{bash}head-2 ../output/Ab_4-uniprot_blastx.tabwc-l ../output/Ab_4-uniprot_blastx.tab``````{bash}tr'|''\t'< ../output/Ab_4-uniprot_blastx.tab |head-2``````{bash}tr'|''\t'< ../output/Ab_4-uniprot_blastx.tab \> ../output/Ab_4-uniprot_blastx_sep.tab``````{bash}head-2 ../data/uniprot_table_r2023_01.tabwc-l ../data/uniprot_table_r2023_01.tab``````{r}library(tidyverse)library("kableExtra")``````{r}bltabl <-read.csv("../output/Ab_4-uniprot_blastx_sep.tab", sep ='\t', header =FALSE)``````{r, cache=TRUE}spgo <-read.csv("../data/uniprot_table_r2023_01.tab", sep ='\t', header =TRUE)``````{r}str(spgo)``````{r}kbl(head(left_join(bltabl, spgo, by =c("V3"="Entry")) %>%select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>%mutate(V1 =str_replace_all(V1, pattern ="solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement ="Ab")))) %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"))``````{r}left_join(bltabl, spgo, by =c("V3"="Entry")) %>%select(V1, V3, V13, Protein.names, Organism, Gene.Ontology..biological.process., Gene.Ontology.IDs) %>%mutate(V1 =str_replace_all(V1, pattern ="solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SE_trimmed", replacement ="Ab")) %>%write_delim("../output/blast_annot_go.tab", delim ='\t')```