ROL-PI Day one

Más citas, por favor
methylation
e5
Author
Affiliation

Steven Roberts

Published

May 24, 2025

Day one. Much of the time was spent on completing the Epigenetic Simulation paper.

Some coding was also completed.

Epigenetic Machinery Expression

For Apul, joined relevant BLAST matches to expression data.

https://github.com/urol-e5/timeseries_molecular/blob/main/D-Apul/code/29-Apul-epimachine-exp.Rmd
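A minimal sketch of that join, assuming hypothetical file and column names (an outfmt-6 style BLAST table and an expression matrix with a gene_id column plus one column per sample); the actual code is in the Rmd linked above.

library(tidyverse)

# Hypothetical BLAST table (outfmt 6) of epigenetic machinery hits; the subject ID
# is treated here as the Apul gene ID (file and column names are placeholders).
blast_hits <- read_tsv("epimachinery-blast-hits.tab",
                       col_names = c("query", "gene_id", "pident", "length",
                                     "mismatch", "gapopen", "qstart", "qend",
                                     "sstart", "send", "evalue", "bitscore"))

# Hypothetical gene-level expression matrix: gene_id plus one count column per sample.
expr <- read_csv("gene-expression-counts.csv")

# Keep the best hit per gene (lowest e-value), then attach expression values.
epi_expr <- blast_hits %>%
  group_by(gene_id) %>%
  slice_min(evalue, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  inner_join(expr, by = "gene_id")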

Select Gene for Expression Plot

(The Rmd includes an interactive Shiny app for selecting a gene to plot; Shiny applications are not rendered in this static document.)
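As a static stand-in for the Shiny selector, a sketch that plots one gene's expression across samples from the joined table above; the gene ID is a placeholder.

# Pick one gene (placeholder ID) and reshape its counts to long format for plotting.
gene_of_interest <- "FUN_000001"

expr %>%
  filter(gene_id == gene_of_interest) %>%
  pivot_longer(-gene_id, names_to = "sample", values_to = "count") %>%
  ggplot(aes(x = sample, y = count)) +
  geom_col() +
  labs(title = gene_of_interest, x = "Sample", y = "Expression (counts)") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))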

CpG Methylation

Also looked at CpG methylation in the time-series data.

Previously generated 10x bedgraphs for Apul (code 15.5):

cd ../output/15.5-Apul-bismark/

# Keep CpGs with >= 10x total coverage (columns 5 + 6 are methylated + unmethylated
# counts) and write chrom, start, end, % methylation to a per-sample 10x bedgraph.
for f in *merged_CpG_evidence.cov
do
  STEM=$(basename "${f}" _R1_001.fastp-trim_bismark_bt2.CpG_report.merged_CpG_evidence.cov)
  awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4}}' "${f}" \
  > "${STEM}"_10x.bedgraph
done
ntLink_0    25585   25587   0.000000
ntLink_0    25624   25626   0.000000
ntLink_0    25777   25779   0.000000
ntLink_0    25779   25781   0.000000
ntLink_0    25788   25790   0.000000
ntLink_0    25803   25805   0.000000
ntLink_0    25811   25813   0.000000
ntLink_0    25830   25832   0.000000
ntLink_0    90500   90502   0.000000
ntLink_0    90540   90542   0.000000
ntLink_0    90562   90564   0.000000
ntLink_0    90722   90724   0.000000
ntLink_0    94077   94079   0.000000
ntLink_0    94090   94092   0.000000
ntLink_0    94095   94097   0.000000
ntLink_0    94098   94100   0.000000
ntLink_1    2597    2599    0.000000
ntLink_1    2613    2615    0.000000
ntLink_1    2626    2628    0.000000
ntLink_1    4093    4095    0.000000
ntLink_1    4151    4153    0.000000
ntLink_1    4740    4742    10.000000
ntLink_1    4831    4833    0.000000
# Build a CpG ID from scaffold and start position, keeping the % methylation value.
for file in ../output/15.5-Apul-bismark/*10x.bedgraph; do
    awk '{print "CpG_"$1"_"$2, $4}' "$file" > "${file%.bedgraph}_processed.txt"
done
CpG_ntLink_0_25692 0.000000
CpG_ntLink_0_25699 0.000000
CpG_ntLink_0_80003 58.333333
CpG_ntLink_0_90500 0.000000
CpG_ntLink_0_90540 0.000000
CpG_ntLink_0_90562 0.000000
CpG_ntLink_0_90633 0.000000
CpG_ntLink_0_90648 0.000000
CpG_ntLink_0_90653 15.384615
CpG_ntLink_0_90671 0.000000
CpG_ntLink_0_94077 0.000000
CpG_ntLink_0_94090 0.000000
CpG_ntLink_0_94095 0.000000
CpG_ntLink_0_94098 0.000000
CpG_ntLink_1_4093 0.000000

From here, Kathleen filtered the data as follows.

WGBS data

# Pull processed files from Gannet.
# Note: Unfortunately we can't use the `cache` feature to make this process more time efficient, as it doesn't support long vectors.

# Packages assumed below: rvest (read_html/html_nodes/html_attr) and the tidyverse
# (readr, dplyr, tidyr, purrr); load them if not already loaded earlier in the Rmd.
library(rvest)
library(tidyverse)

# Define the base URL
base_url <- "https://gannet.fish.washington.edu/seashell/bu-github/timeseries_molecular/D-Apul/output/15.5-Apul-bismark/"

# Read the HTML page
page <- read_html(base_url)

# Extract links to files
file_links <- page %>%
  html_nodes("a") %>%
  html_attr("href")

# Filter for files ending in "processed.txt"
processed_files <- file_links[grepl("processed\\.txt$", file_links)]

# Create full URLs
file_urls <- paste0(base_url, processed_files)

# Function to read a file from URL
read_processed_file <- function(url) {
  read_table(url, col_types = cols(.default = "c"))  # Read as character to avoid parsing issues
}

# Import all processed files into a list
processed_data <- lapply(file_urls, read_processed_file)

# Name the list elements by file name
names(processed_data) <- processed_files

# Print structure of imported data
str(processed_data)

# Rename the columns: "CpG" for the first column and the sample (file name) for the second

processed_data <- Map(function(df, filename) {
  colnames(df) <- c("CpG", filename)  # Rename columns
  return(df)
}, processed_data, names(processed_data))  # Use stored file names

#merge files together by "CpG"
merged_data <- purrr::reduce(processed_data, full_join, by = "CpG")

# Print structure of final merged data
str(merged_data)

Replace any NA with 0.

# Convert all columns (except "CpG") to numeric and replace NAs with 0
merged_data <- merged_data %>%
  mutate(across(-CpG, as.numeric)) %>%  # Convert all except CpG to numeric
  mutate(across(-CpG, ~ replace_na(.x, 0)))  # Replace NA with 0 in numeric columns

Filter data sets

Only keep CpGs that have a non-zero value in all samples.

filtered_wgbs <- merged_data %>% filter(if_all(-CpG, ~ .x > 0))

# Ensure it's formatted as a data frame
filtered_wgbs <- as.data.frame(filtered_wgbs)
# Only keep the sample information in the column name. 
colnames(filtered_wgbs) <- gsub("^(.*?)_.*$", "\\1", colnames(filtered_wgbs))
# Set CpG IDs to rownames
rownames(filtered_wgbs) <- filtered_wgbs$CpG
filtered_wgbs <- filtered_wgbs %>% select(-CpG)

nrow(merged_data)
nrow(filtered_wgbs)

We had 12,093,025 CpGs before filtering and only 507 after filtering. This makes sense because most CpGs are not methylated in every sample.

Save the filtered set to make code reruns/knitting quicker.

write.csv(filtered_wgbs, "../output/22.3-Apul-multiomic-machine-learning-byTP/filtered-WGBS-CpG-counts.csv")

Need to revisit this filtering; one possible relaxation is sketched below.
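One option, sketched assuming merged_data is still in memory: require non-zero methylation in some fraction of samples rather than in every sample (the 75% cutoff is arbitrary and only illustrates the idea).

# Keep CpGs with non-zero methylation in at least 75% of samples instead of all of them.
min_frac <- 0.75
n_samples <- ncol(merged_data) - 1   # all columns except "CpG"

filtered_relaxed <- merged_data %>%
  filter(rowSums(across(-CpG, ~ .x > 0)) >= min_frac * n_samples)

nrow(filtered_relaxed)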


Ptua CpG Methylation

Shelly ran the alignment with nf-core/methylseq:

# run pipeline
nextflow run nf-core/methylseq \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
--input /gscratch/scrubbed/strigg/analyses/20250422_methylseq/samplesheet.csv \
--outdir /gscratch/scrubbed/strigg/analyses/20250422_methylseq \
--fasta /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
--em_seq \
-resume \
-with-report nf_report.html \
-with-trace \
-with-timeline nf_timeline.html \
--skip_trimming \
--nomeseq 

Results

- Multiqc report: [https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/multiqc/bismark/multiqc_report.html](https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/multiqc/bismark/multiqc_report.html)
- Bismark summary report: [https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/bismark/summary/bismark_summary_report.html](https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/bismark/summary/bismark_summary_report.html)
- Pipeline report: [https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/nf_report.html](https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/nf_report.html)
- Pipeline timeline: [https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/nf_timeline.html](https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/nf_timeline.html)
- Counts matrices: [https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/bismark/methylation_calls/methylation_coverage/](https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/bismark/methylation_calls/methylation_coverage/)
    - <sample_name>.fastp-trim_bismark_bt2_pe.deduplicated.bismark.cov.gz
- Deduplicated sorted bam files: [https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/bismark/deduplicated/](https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/bismark/deduplicated/) 
    - <sample_name>.deduplicated.sorted.bam 
- Other bismark output: [https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/bismark/](https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/bismark/)
# Run coverage2cytosine on each deduplicated coverage file (24 jobs in parallel),
# merging strand-specific calls into per-CpG evidence with zero-based coordinates.
find ../data/dedup-cov/methylation_calls/methylation_coverage/*deduplicated.bismark.cov.gz | \
xargs -n 1 basename -s R1_001.fastp-trim_bismark_bt2_pe.deduplicated.bismark.cov.gz | \
parallel -j 24 /home/shared/Bismark-0.24.0/coverage2cytosine \
--genome_folder ../data/bs \
-o ../output/05-Ptua-bismark-CG/{} \
--merge_CpG \
--zero_based \
../data/dedup-cov/methylation_calls/methylation_coverage/{}R1_001.fastp-trim_bismark_bt2_pe.deduplicated.bismark.cov.gz
cd ../output/05-Ptua-bismark-CG/

# Keep CpGs with >= 5x total coverage (columns 5 + 6 are methylated + unmethylated
# counts); note the output suffix reuses the _10x naming from the Apul workflow.
for f in *merged_CpG_evidence.cov
do
  STEM=$(basename "${f}" _.CpG_report.merged_CpG_evidence.cov)
  awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4}}' "${f}" \
  > "${STEM}"_10x.bedgraph
done
# Build a CpG ID from scaffold and start position, keeping the % methylation value.
for file in ../output/05-Ptua-bismark-CG/*10x.bedgraph; do
    awk '{print "CpG_"$1"_"$2, $4}' "$file" > "${file%.bedgraph}_processed.txt"
done
#pull processed files from Gannet 
# Note: Unfortunately we can't use the `cache` feature to make this process more time efficient, as it doesn't support long vectors

# Define the base URL
base_url <- "https://gannet.fish.washington.edu/v1_web/owlshell/bu-github/timeseries_molecular/F-Ptua/output/05-Ptua-bismark-CG/"

# Read the HTML page
page <- read_html(base_url)

# Extract links to files
file_links <- page %>%
  html_nodes("a") %>%
  html_attr("href")

# Filter for files ending in "processed.txt"
processed_files <- file_links[grepl("processed\\.txt$", file_links)]

# Create full URLs
file_urls <- paste0(base_url, processed_files)

# Function to read a file from URL
read_processed_file <- function(url) {
  read_table(url, col_types = cols(.default = "c"))  # Read as character to avoid parsing issues
}

# Import all processed files into a list
processed_data <- lapply(file_urls, read_processed_file)

# Name the list elements by file name
names(processed_data) <- processed_files

# Print structure of imported data
str(processed_data)

# Rename the columns: "CpG" for the first column and the sample (file name) for the second

processed_data <- Map(function(df, filename) {
  colnames(df) <- c("CpG", filename)  # Rename columns
  return(df)
}, processed_data, names(processed_data))  # Use stored file names

#merge files together by "CpG"
merged_data <- purrr::reduce(processed_data, full_join, by = "CpG")

# Print structure of final merged data
str(merged_data)

Replace any NA with 0.

# Convert all columns (except "CpG") to numeric and replace NAs with 0
merged_data <- merged_data %>%
  mutate(across(-CpG, as.numeric)) %>%  # Convert all except CpG to numeric
  mutate(across(-CpG, ~ replace_na(.x, 0)))  # Replace NA with 0 in numeric columns

Filter data sets

Only keep CpGs that have a non-zero value in all samples.

filtered_wgbs <- merged_data %>% filter(if_all(-CpG, ~ .x > 0))

# Ensure it's formatted as a data frame
filtered_wgbs <- as.data.frame(filtered_wgbs)
# Only keep the sample information in the column name. 
colnames(filtered_wgbs) <- gsub("^(.*?)_.*$", "\\1", colnames(filtered_wgbs))
# Set CpG IDs to rownames
rownames(filtered_wgbs) <- filtered_wgbs$CpG
filtered_wgbs <- filtered_wgbs %>% select(-CpG)

nrow(merged_data)
nrow(filtered_wgbs)
[1] 9111633
[1] 137292

We had 9,111,633 CpGs before filtering and only 137,292 after filtering.

write.csv(filtered_wgbs, "../output/05-Ptua-bismark-CG/filtered-WGBS-CpG-counts.csv")

Why is there such a discrepancy across species in the number of filtered CpGs?
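Part of it may simply be the coverage cutoff (the Apul bedgraphs required >= 10x while the Ptua loop above used >= 5x). As a quick check on the samples themselves, a sketch assuming both merged matrices are still available (renamed merged_apul and merged_ptua here for clarity): compare, per sample, the fraction of CpGs with non-zero methylation, since a single sparsely methylated sample is enough to drop a CpG under the all-samples filter.

# Fraction of CpGs with non-zero methylation in each sample; samples with low
# fractions limit how many CpGs survive the "non-zero in all samples" filter.
nonzero_fraction <- function(merged) {
  merged %>%
    summarise(across(-CpG, ~ mean(.x > 0))) %>%
    pivot_longer(everything(), names_to = "sample", values_to = "frac_nonzero") %>%
    arrange(frac_nonzero)
}

nonzero_fraction(merged_apul)   # Apul merged matrix (placeholder object name)
nonzero_fraction(merged_ptua)   # Ptua merged matrix (placeholder object name)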