Loops

Multiple Files

This is a follow-up to stringr.

The same colleague wants to scale up their analysis to work on data from the PATRIC bacterial phytogenomic database. They want to know the GC content of all of the bacteria in the database and have started working initially with just a handful of archaea.

Help out by downloading the data and looping over the files to determine the GC content for each file. Once downloaded, either unzip the .zip file manually or use the unzip function.

Read the .fasta files in the unzipped folder into R using the ShortRead package in Bioconductor. Start by installing Bioconductor with the following code (this may take a while, so be patient):

source("https://bioconductor.org/biocLite.R")
biocLite("ShortRead")

Each fasta file contains one long sequence of DNA for an archaea species. The following code loads one sequence file, where seq is the variable name for the data file:

library(ShortRead)
reads <- readFasta("archaea-dna/A-saccharovorans.fasta")
seq <- sread(reads)

Use list.files(), with full.names set to true, to generate a list of the names of all the sequence files. Then create a for loop that uses the above seq code to read in each sequence file. When this works, add to the loop a function that calculates GC content of each file. You can use the GC content function you wrote for stringr, but you will have to capitalize the “g” and “c” strings in the str_count functions because the sequences in these files are upper case. Once this is working, create an empty data frame to store the output and fill it with values from the for loop, one column with file names and the second column with GC contents.

[click here for output]

Data Science for SAFS

Loops

Multiple Files