ROL-PI Day two

More CpGs, please
Categories: methylation, e5
Author: Steven Roberts

Published: May 24, 2025

Day two. Much of the time was spent on the Epigenetic Simulation, then moving on to the POISE model.

Did some slide edits…

And did some coding to figure out what caused the lack of CpG loci in Apul (TL;DR: one bad sample).

Troubleshooting Apul

Went back to the processed txt loci files (generated from the 10x bedgraphs).

Found that 225-T1 had a much smaller line count; tracking it down in the MultiQC report for Bismark revealed very poor alignment.
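The undersized file can also be flagged programmatically before merging. A minimal sketch, assuming `rvest` is available and using the same Gannet directory listing as the merge code below, compares line counts across the processed files:

```r
# Sketch: flag undersized processed files by line count before merging.
# Assumes the same Gannet directory listing used in the merge code below.
library(rvest)

base_url <- "https://gannet.fish.washington.edu/seashell/bu-github/timeseries_molecular/D-Apul/output/15.5-Apul-bismark/"
links <- read_html(base_url) %>% html_nodes("a") %>% html_attr("href")
files <- links[grepl("processed\\.txt$", links)]

# Count lines in each file; a sample with far fewer loci than the rest
# stands out (here, 225-T1)
line_counts <- sapply(paste0(base_url, files), function(u) {
  length(readLines(u, warn = FALSE))
})
names(line_counts) <- files
sort(line_counts)[1:3]
```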

To remedy this, redid the merge as before.

```r
# Pull processed files from Gannet
# Note: unfortunately we can't use the `cache` feature to make this process
# more time efficient, as it doesn't support long vectors

library(rvest)
library(readr)
library(dplyr)
library(purrr)

# Define the base URL
base_url <- "https://gannet.fish.washington.edu/seashell/bu-github/timeseries_molecular/D-Apul/output/15.5-Apul-bismark/"

# Read the HTML page
page <- read_html(base_url)

# Extract links to files
file_links <- page %>%
  html_nodes("a") %>%
  html_attr("href")

# Filter for files ending in "processed.txt"
processed_files <- file_links[grepl("processed\\.txt$", file_links)]

# Create full URLs
file_urls <- paste0(base_url, processed_files)

# Function to read a file from a URL
read_processed_file <- function(url) {
  read_table(url, col_types = cols(.default = "c"))  # Read as character to avoid parsing issues
}

# Import all processed files into a list
processed_data <- lapply(file_urls, read_processed_file)

# Name the list elements by file name
names(processed_data) <- processed_files

# Print structure of imported data
str(processed_data)

# Add a header row: "CpG" for the first column, and the file name
# as the second column's name
processed_data <- Map(function(df, filename) {
  colnames(df) <- c("CpG", filename)  # Rename columns
  return(df)
}, processed_data, names(processed_data))  # Use stored file names

# Merge files together by "CpG"
merged_data <- purrr::reduce(processed_data, full_join, by = "CpG")

# Print structure of final merged data
str(merged_data)
```

Then changed things up by dropping the bad sample:

```r
# Remove 225-T1
merged_data <- merged_data[, !(names(merged_data) %in% "ACR-225-TP1_10x_processed.txt")]

# Convert all columns (except "CpG") to numeric
merged_data <- merged_data %>%
  mutate(across(-CpG, as.numeric))
```

Only keep CpGs that have a non-NA value in all samples:

```r
filtered_wgbs <- na.omit(merged_data)

# Ensure it's formatted as a data frame
filtered_wgbs <- as.data.frame(filtered_wgbs)

# Only keep the sample information in the column name
colnames(filtered_wgbs) <- gsub("^(.*?)_.*$", "\\1", colnames(filtered_wgbs))

# Set CpG IDs to rownames
rownames(filtered_wgbs) <- filtered_wgbs$CpG
filtered_wgbs <- filtered_wgbs %>% select(-CpG)

nrow(merged_data)
nrow(filtered_wgbs)
```

We had 12,093,025 CpGs before filtering and 96,785 after filtering.


How about Ptua?

First, let's go back to the processed txt loci files to see if there are any outliers.
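The same line-count check used for Apul can be pointed at the Ptua output. A sketch, where the base URL is a placeholder to be replaced with the actual Ptua bismark output directory on Gannet:

```r
# Sketch: repeat the outlier check for Ptua.
# NOTE: the base URL below is a placeholder — substitute the actual
# Ptua bismark output directory on Gannet.
library(rvest)

base_url <- "https://gannet.fish.washington.edu/.../Ptua-bismark/"  # placeholder
links <- read_html(base_url) %>% html_nodes("a") %>% html_attr("href")
files <- links[grepl("processed\\.txt$", links)]

line_counts <- sapply(paste0(base_url, files), function(u) {
  length(readLines(u, warn = FALSE))
})
names(line_counts) <- files
sort(line_counts)  # an unusually small count flags a problem sample
```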

Missing data downloads - https://github.com/urol-e5/timeseries_molecular/issues/34