FROGER Day 02

Annotating Mcap Proteins

The genome version at http://cyanophora.rutgers.edu/montipora/ has proteins without annotations. Will give those a whirl.

pic

First I had to clean up hardwrap

%%bash
awk '!/^>/ { printf "%s", $0; n = "\n" }
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' /Volumes/web/tmp/Mcap.protein.fasta  > /Volumes/web/tmp/Mcap.protein.fa

Then blastp

#!/bin/bash
## Job Name
#SBATCH --job-name=blastp
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/sr320/030620-Mcap-annotation


# Load Python Mox module for Python module availability

module load intel-python3_2017




source /gscratch/srlab/programs/scripts/paths.sh



/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin/blastp \
-query /gscratch/srlab/sr320/data/froger/Mcap/Mcap.protein.fa \
-db /gscratch/srlab/blastdbs/UniProtKB_20190109/uniprot_sprot.fasta \
-num_threads 28 \
-outfmt 6 \
-max_target_seqs 1 \
-evalue 1e-05 \
-out /gscratch/scrubbed/sr320/030620-Mcap-annotation/Mcap_blastp_sp.tab

Blast is complete.

blast: https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/030620-Mcap-annotation/Mcap_blastp_sp.tab

Associated GO: https://raw.githubusercontent.com/sr320/nb-2020/master/M_capitata/analyses/uniprot-yourlist_M20200306E5.tab

(needs joining)

One day could be more programatic - EVEN build into mox job.

Comparing Mcap genomes

Quick and dirty mapping of raw BS reads to both Mcap genomes.

“Other” one is at https://www.ncbi.nlm.nih.gov/assembly/GCA_006542545.1/

Need to do a BS conversion …

Have that queued as job05.sh

${bismark_dir}/bismark_genome_preparation \
--verbose \
--parallel 28 \
--path_to_aligner ${bowtie2_dir} \
/gscratch/srlab/sr320/data/froger/Mcap_Genome_02/

Still working in /gscratch/scrubbed/sr320/030520-fr01 next up will be running raw reads through both this and Mcap01 …

as per

#!/bin/bash
## Job Name
#SBATCH --job-name=Mcomp
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=100G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/sr320/030520-fr01/

# Looking for mapping efficiency

# Directories and programs
bismark_dir="/gscratch/srlab/programs/Bismark-0.21.0"
bowtie2_dir="/gscratch/srlab/programs/bowtie2-2.3.4.1-linux-x86_64/"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"

reads_dir="/gscratch/scrubbed/sr320/froger-raw/00_fastq/"
genome_folder="/gscratch/srlab/sr320/data/froger/Mcap_Genome/"

source /gscratch/srlab/programs/scripts/paths.sh


# ${bismark_dir}/bismark_genome_preparation \
# --verbose \
# --parallel 28 \
# --path_to_aligner ${bowtie2_dir} \
# ${genome_folder}



find ${reads_dir}*_R1_001.fastq.gz \
| xargs basename -s _R1_001.fastq.gz | xargs -I{} ${bismark_dir}/bismark \
--path_to_bowtie ${bowtie2_dir} \
-genome ${genome_folder} \
-p 4 \
-u 10000 \
--non_directional \
-1 ${reads_dir}{}_R1_001.fastq.gz \
-2 ${reads_dir}{}_R2_001.fastq.gz \
-o Mcap_test06-rutgers



genome_folder="/gscratch/srlab/sr320/data/froger/Mcap_Genome_02/"




find ${reads_dir}*_R1_001.fastq.gz \
| xargs basename -s _R1_001.fastq.gz | xargs -I{} ${bismark_dir}/bismark \
--path_to_bowtie ${bowtie2_dir} \
-genome ${genome_folder} \
-p 4 \
-u 10000 \
--non_directional \
-1 ${reads_dir}{}_R1_001.fastq.gz \
-2 ${reads_dir}{}_R2_001.fastq.gz \
-o Mcap_test06-NCBI

The “Rutgers” genome has slightly higher mappping but oddly the MBD libraries have about 1/2 of the mapping efficiency. Not sure why.

Looking at trimmed file mapping

#!/bin/bash
## Job Name
#SBATCH --job-name=t-trim
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=100G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=sr320@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/sr320/030520-fr01/

# Looking for mapping efficiency

# Directories and programs
bismark_dir="/gscratch/srlab/programs/Bismark-0.21.0"
bowtie2_dir="/gscratch/srlab/programs/bowtie2-2.3.4.1-linux-x86_64/"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools"

reads_dir="/gscratch/scrubbed/samwhite/outputs/20200305_methcompare_fastp_trimming/"
genome_folder="/gscratch/srlab/sr320/data/froger/Mcap_Genome/"

source /gscratch/srlab/programs/scripts/paths.sh


# ${bismark_dir}/bismark_genome_preparation \
# --verbose \
# --parallel 28 \
# --path_to_aligner ${bowtie2_dir} \
# ${genome_folder}



find ${reads_dir}*2020*_R1_001.fastq.gz \
| xargs basename -s _R1_001.fastq.gz | xargs -I{} ${bismark_dir}/bismark \
--path_to_bowtie ${bowtie2_dir} \
-genome ${genome_folder} \
-p 4 \
-u 10000 \
--non_directional \
-1 ${reads_dir}{}_R1_001.fastq.gz \
-2 ${reads_dir}{}_R2_001.fastq.gz \
-o Mcap_test10



genome_folder="/gscratch/srlab/sr320/data/froger/Pact_Genome/"



find ${reads_dir}*2020*_R1_001.fastq.gz \
| xargs basename -s _R1_001.fastq.gz | xargs -I{} ${bismark_dir}/bismark \
--path_to_bowtie ${bowtie2_dir} \
-genome ${genome_folder} \
-p 4 \
-u 10000 \
--non_directional \
-1 ${reads_dir}{}_R1_001.fastq.gz \
-2 ${reads_dir}{}_R2_001.fastq.gz \
-o Pact_test10


genome_folder="/gscratch/srlab/sr320/data/froger/Pdam_Genome/"



find ${reads_dir}*2020*_R1_001.fastq.gz \
| xargs basename -s _R1_001.fastq.gz | xargs -I{} ${bismark_dir}/bismark \
--path_to_bowtie ${bowtie2_dir} \
-genome ${genome_folder} \
-p 4 \
-u 10000 \
--non_directional \
-1 ${reads_dir}{}_R1_001.fastq.gz \
-2 ${reads_dir}{}_R2_001.fastq.gz \
-o Pdam_test10


genome_folder="/gscratch/srlab/sr320/data/lambda/"



find ${reads_dir}*2020*_R1_001.fastq.gz \
| xargs basename -s _R1_001.fastq.gz | xargs -I{} ${bismark_dir}/bismark \
--path_to_bowtie ${bowtie2_dir} \
-genome ${genome_folder} \
-p 4 \
-u 10000 \
--non_directional \
-1 ${reads_dir}{}_R1_001.fastq.gz \
-2 ${reads_dir}{}_R2_001.fastq.gz \
-o lambda_test10

What we see is a variation of mapping efficiency based on method with RRBS about 50% and MBD about 15%. Will run with min_score.


Big Run with min_score

Written on March 6, 2020