# Pull processed files from Gannet
# Note: Unfortunately we can't use the `cache` feature to make this process more time efficient, as it doesn't support long vectors

# Packages used below (may already be loaded elsewhere in this notebook)
library(rvest)
library(readr)
library(dplyr)

# Define the base URL
base_url <- "https://gannet.fish.washington.edu/seashell/bu-github/timeseries_molecular/D-Apul/output/15.5-Apul-bismark/"

# Read the HTML page
page <- read_html(base_url)

# Extract links to files
file_links <- page %>%
  html_nodes("a") %>%
  html_attr("href")

# Filter for files ending in "processed.txt"
processed_files <- file_links[grepl("processed\\.txt$", file_links)]

# Create full URLs
file_urls <- paste0(base_url, processed_files)

# Function to read a file from URL
read_processed_file <- function(url) {
  read_table(url, col_types = cols(.default = "c")) # Read as character to avoid parsing issues
}

# Import all processed files into a list
processed_data <- lapply(file_urls, read_processed_file)

# Name the list elements by file name
names(processed_data) <- processed_files

# Print structure of imported data
str(processed_data)

# Add a header row: "CpG" for the first column and the sample (file name) for the second column
processed_data <- Map(function(df, filename) {
  colnames(df) <- c("CpG", filename) # Rename columns
  return(df)
}, processed_data, names(processed_data)) # Use stored file names

# Merge files together by "CpG"
merged_data <- purrr::reduce(processed_data, full_join, by = "CpG")

# Print structure of final merged data
str(merged_data)

Day two. Much of the time was spent on the Epigenetic Simulation, then moving on to the POISE model.
Did some slide edits…
And coding to figure out what caused the lack of CpG loci in Apul (TL;DR: one bad sample).
Troubleshooting Apul
Went back to the processed .txt loci files (from the 10x bedgraphs).
Found that 225-T1 had a much smaller line count; we tracked down the MultiQC report on the Bismark run and found very poor alignment for that sample (a quick way to screen for this is sketched below).
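As an illustration of that line-count screen, here is a minimal sketch assuming the `file_urls` and `processed_files` objects from the chunk above (the exact commands we actually ran may have differed):

# Rough check: count rows in each processed.txt file to spot an undersized sample
# (this reads each file over HTTP, so it can be slow for large files)
line_counts <- sapply(file_urls, function(u) length(readLines(u)))
names(line_counts) <- processed_files
sort(line_counts) # an undersized sample like 225-T1 appears first in the ascending sort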
To remedy this, we did the merge as before, then changed things up:
# Remove 225-T1
merged_data <- merged_data[, !(names(merged_data) %in% "ACR-225-TP1_10x_processed.txt")]

# Convert all columns (except "CpG") to numeric
merged_data <- merged_data %>%
  mutate(across(-CpG, as.numeric))

Only keep CpGs that have a non-NA value in all samples.

filtered_wgbs <- na.omit(merged_data)

# Ensure it's formatted as a data frame
filtered_wgbs <- as.data.frame(filtered_wgbs)

# Only keep the sample information in the column name.
colnames(filtered_wgbs) <- gsub("^(.*?)_.*$", "\\1", colnames(filtered_wgbs))

# Set CpG IDs to rownames
rownames(filtered_wgbs) <- filtered_wgbs$CpG
filtered_wgbs <- filtered_wgbs %>% select(-CpG)

nrow(merged_data)
nrow(filtered_wgbs)
We had 12,093,025 CpGs before filtering and 96,785 after filtering.
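As an optional sanity check (not part of the workflow above), counting the non-NA loci per sample column in `merged_data` shows which samples most constrain the set of CpGs that survives `na.omit()`:

# Non-NA (covered) loci per sample column, excluding the CpG ID column
colSums(!is.na(merged_data[, setdiff(names(merged_data), "CpG")]))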
How about Ptua?
First, let's go back to the processed .txt loci files to see if there are any outliers (see the helper sketched below).
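One way to run that screen is a small helper that repeats the link-scrape and line-count steps for a given Gannet directory; the Ptua output directory URL isn't given here, so `ptua_base_url` is left as a placeholder:

# Hypothetical helper: list the *processed.txt files under a Gannet directory and count their rows
count_processed_lines <- function(dir_url) {
  links <- read_html(dir_url) %>%
    html_nodes("a") %>%
    html_attr("href")
  files <- links[grepl("processed\\.txt$", links)]
  counts <- sapply(paste0(dir_url, files), function(u) length(readLines(u)))
  names(counts) <- files
  sort(counts) # smallest (potential outlier) files first
}

# ptua_base_url <- "..."  # placeholder: the Ptua bismark output directory on Gannet
# count_processed_lines(ptua_base_url)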
Missing data downloads - https://github.com/urol-e5/timeseries_molecular/issues/34