# Pull processed files from Gannet
# Note: Unfortunately we can't use the `cache` feature to make this process more time efficient, as it doesn't support long vectors

# Packages used below (may already be loaded elsewhere in this notebook)
library(rvest)
library(readr)
library(dplyr)

# Define the base URL
base_url <- "https://gannet.fish.washington.edu/seashell/bu-github/timeseries_molecular/D-Apul/output/15.5-Apul-bismark/"

# Read the HTML page
page <- read_html(base_url)

# Extract links to files
file_links <- page %>%
  html_nodes("a") %>%
  html_attr("href")

# Filter for files ending in "processed.txt"
processed_files <- file_links[grepl("processed\\.txt$", file_links)]

# Create full URLs
file_urls <- paste0(base_url, processed_files)

# Function to read a file from URL
read_processed_file <- function(url) {
  read_table(url, col_types = cols(.default = "c")) # Read as character to avoid parsing issues
}

# Import all processed files into a list
processed_data <- lapply(file_urls, read_processed_file)

# Name the list elements by file name
names(processed_data) <- processed_files

# Print structure of imported data
str(processed_data)

# Add a header row: "CpG" for the first column and the sample (file name) for the second column
processed_data <- Map(function(df, filename) {
  colnames(df) <- c("CpG", filename) # Rename columns
  return(df)
}, processed_data, names(processed_data)) # Use stored file names

# Merge files together by "CpG"
merged_data <- purrr::reduce(processed_data, full_join, by = "CpG")

# Print structure of final merged data
str(merged_data)

Day two. Much of the time was spent on the Epigenetic Simulation, then moving on to the POISE model.
Did some slide edits…
And coding to figure out what caused the lack of CpG loci in Apul (TL;DR: one bad sample).
Troubleshooting Apul
Went back to the processed .txt loci files (from the 10x bedgraphs).
Found that 225-T1 had a much smaller line count; we tracked down the MultiQC report on the Bismark run and found very poor alignment for that sample (a quick way to screen for this is sketched below).
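As an illustration of that line-count screen, here is a minimal sketch assuming the `file_urls` and `processed_files` objects from the chunk above (the exact commands we actually ran may have differed):

# Rough check: count rows in each processed.txt file to spot an undersized sample
# (this reads each file over HTTP, so it can be slow for large files)
line_counts <- sapply(file_urls, function(u) length(readLines(u)))
names(line_counts) <- processed_files
sort(line_counts) # an undersized sample like 225-T1 appears first in the ascending sort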
To remedy this, we did the merge as before, then changed things up:
# Remove 225-T1
merged_data <- merged_data[, !(names(merged_data) %in% "ACR-225-TP1_10x_processed.txt")]

# Convert all columns (except "CpG") to numeric
merged_data <- merged_data %>%
  mutate(across(-CpG, as.numeric))

Only keep CpGs that have a non-NA value in all samples.

filtered_wgbs <- na.omit(merged_data)

# Ensure it's formatted as a data frame
filtered_wgbs <- as.data.frame(filtered_wgbs)

# Only keep the sample information in the column name.
colnames(filtered_wgbs) <- gsub("^(.*?)_.*$", "\\1", colnames(filtered_wgbs))

# Set CpG IDs to rownames
rownames(filtered_wgbs) <- filtered_wgbs$CpG
filtered_wgbs <- filtered_wgbs %>% select(-CpG)

nrow(merged_data)
nrow(filtered_wgbs)
We had 12,093,025 CpGs before filtering and 96,785 after filtering.
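As an optional sanity check (not part of the workflow above), counting the non-NA loci per sample column in `merged_data` shows which samples most constrain the set of CpGs that survives `na.omit()`:

# Non-NA (covered) loci per sample column, excluding the CpG ID column
colSums(!is.na(merged_data[, setdiff(names(merged_data), "CpG")]))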
How about Ptua?
First, let's go back to the processed .txt loci files to see if there are any outliers (see the helper sketched below).
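One way to run that screen is a small helper that repeats the link-scrape and line-count steps for a given Gannet directory; the Ptua output directory URL isn't given here, so `ptua_base_url` is left as a placeholder:

# Hypothetical helper: list the *processed.txt files under a Gannet directory and count their rows
count_processed_lines <- function(dir_url) {
  links <- read_html(dir_url) %>%
    html_nodes("a") %>%
    html_attr("href")
  files <- links[grepl("processed\\.txt$", links)]
  counts <- sapply(paste0(dir_url, files), function(u) length(readLines(u)))
  names(counts) <- files
  sort(counts) # smallest (potential outlier) files first
}

# ptua_base_url <- "..."  # placeholder: the Ptua bismark output directory on Gannet
# count_processed_lines(ptua_base_url)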
Missing data downloads - https://github.com/urol-e5/timeseries_molecular/issues/34