#### Learning Objectives

Following this assignment students should be able to:

• practice basic syntax and usage of `for` loops
• use `for` loops to automate function operations
• understand how to decompose complex problems

• Topics

• Iteration
• Problem decomposition

Loops

1. #### -- Use and Modify with Apply --

This is a followup to Use and Modify.

1. Write a function that takes length as an argument to get an estimate of mass values for the dinosaur Theropoda. Use the equation `mass <- 0.73 * length ** 3.63`, where 0.73 is a and 3.63 is b. Then use `sapply()` and the following vector of lengths to get a vector of estimated masses.

`theropoda_lengths <- c(17.8013631070471, 20.3764452071665, 14.0743486294308, 25.65782386974, 26.0952008049675, 20.3111541103134, 17.5663244372533, 11.2563431277577, 20.081903202614, 18.6071626441984, 18.0991894513166, 23.0659685685892, 20.5798853467837, 25.6179254233558, 24.3714331573996, 26.2847248252537, 25.4753783544473, 20.4642089867304, 16.0738256364701, 20.3494171706583, 19.854399305869, 17.7889814608919, 14.8016421998303, 19.6840911485379, 19.4685885050906, 24.4807784966691, 13.3359960054899, 21.5065994598917, 18.4640304608411, 19.5861532398676, 27.084751999756, 18.9609366301798, 22.4829168046521, 11.7325716149514, 18.3758846100456, 15.537504851634, 13.4848751773738, 7.68561192214935, 25.5963348603783, 16.588285389794)`

2. Rewrite the function with a and b as arguments, then use `mapply()` to determine the mass estimates for the dinosaurs using the following vectors of a and b values for each dinosaur.

`a_values <- c(0.759, 0.751, 0.74, 0.746, 0.759, 0.751, 0.749, 0.751, 0.738, 0.768, 0.736, 0.749, 0.746, 0.744, 0.749, 0.751, 0.744, 0.754, 0.774, 0.751, 0.763, 0.749, 0.741, 0.754, 0.746, 0.755, 0.764, 0.758, 0.76, 0.748, 0.745, 0.756, 0.739, 0.733, 0.757, 0.747, 0.741, 0.752, 0.752, 0.748)`

`b_values <- c(3.627, 3.633, 3.626, 3.633, 3.627, 3.629, 3.632, 3.628, 3.633, 3.627, 3.621, 3.63, 3.631, 3.632, 3.628, 3.626, 3.639, 3.626, 3.635, 3.629, 3.642, 3.632, 3.633, 3.629, 3.62, 3.619, 3.638, 3.627, 3.621, 3.628, 3.628, 3.635, 3.624, 3.621, 3.621, 3.632, 3.627, 3.624, 3.634, 3.621)`

2. #### -- Crown Volume Calculation --

The UHURU experiment in Kenya has conducted a survey of Acacia and other tree species in ungulate exclosure treatments. Data for the tree data is available here in a tab delimited (`"\t"`) format. Each of the individuals surveyed were measured for tree height (`HEIGHT`) and canopy size in two directions (`AXIS_1` and `AXIS_2`). Read these data in using the following code:

``````tree_data <- read.csv("http://www.esapubs.org/archive/ecol/E095/064/TREE_SURVEYS.txt",
sep = '\t',
"NA", "?", "3.3."),
stringsAsFactors = FALSE)
``````

You want to estimate the crown volumes for the different species and have developed equations for species in the Acacia genus:

``````volume = 0.16 * HEIGHT^0.8 * pi * AXIS_1 * AXIS_2
``````

and the Balanites genus:

``````volume = 1.2 * HEIGHT^0.26 * pi * AXIS_1 * AXIS_2
``````

For all other genera you’ll use a general equation developed for trees:

``````volume = 0.5 * HEIGHT^0.6 * pi * AXIS_1 * AXIS_2
``````
1. Write a function called `tree_volume_calc` that calculates the canopy volume for the Acacia species in the dataset. To do so, use an if statement in combination with the `str_detect()` function from the `stringr` R package. The code `str_detect(SPECIES, "Acacia")` will return `TRUE` if the string stored in this variable contains the word “Acacia” and `FALSE` if it does not. This function will have to take the following arguments as input: SPECIES, HEIGHT, AXIS_1, AXIS_2. Then run the following line:

`tree_volume_calc("Acacia_brevispica", 2.2, 3.5, 1.12)`

2. Expand this function to additionally calculate canopy volumes for other types of trees in this dataset by adding if/else statements and including the volume equations for the Balanites genus and other genera. Then run the following lines:

`tree_volume_calc("Balanites", 2.2, 3.5, 1.12)` `tree_volume_calc("Croton", 2.2, 3.5, 1.12)`

3. Now run the completed tree canopy function on the entire tree_data dataframe using `mapply()`. Save the output of the `mapply()` as an object, then turn this object into a dataframe.

3. #### -- Basic Vector --

The following code gets just the genus from strings that are scientific names and capitalizes the first letter of the genus name. The `str_extract` and `str_to_title` functions are from the `stringr` package, which is great for working with strings.

``````waterbird <- "cygnus olor"
waterbird_genus <- str_extract(waterbird, "\\w+")
waterbird_genus_cap <- str_to_title(waterbird_genus)
print(waterbird_genus_cap)
``````

Modify the code to loop over a vector of multiple scientific names and print the capitalized genus names to the screen one line at a time, using the following vector:

``````waterbirds <- c("cygnus olor", "aix sponsa", "anas acuta")
``````
4. #### -- Basic Index --

The following code prints the first five whole numbers one line at a time:

``````for (i in 1:5){
print(i)
}
``````

Modify this code to accomplish each of the following tasks:

1. Multiply each of the numbers 1 through 5 by 3, then print the result.
2. Use the vector `odds <- seq(1, 9, by = 2)` and a loop to square these numbers and print them one at a time. Choose a descriptive variable to identify the index values, such as `for (odd in odds)`.
3. Instead of printing the values produced by this for loop, store them in a new vector named `output`. The following code provides a template for doing this:

``````output <- c()
for (odd in _____) {
_____ <- _____^2
output <- c(output, odds_squared)
}
``````
5. #### -- stringr --

A colleague has produced a file with one DNA sequence on each line. Download the file. Load it into R with`read.csv()` and name the data frame `sequences`.

Your colleague wants to calculate the GC content of each DNA sequence (i.e., the percentage of bases that are either G or C) and knows just a little R. They sent you the following code, which will calculate the GC content for a single sequence using the `stringr` package:

``````sequence <- "attggc"
num_g <- str_count(sequence, "g")
num_c <- str_count(sequence, "c")
gc_content <- (num_g + num_c) / str_length(sequence) * 100
``````
1. Convert the last three lines of this code into a function to calculate the GC content of a DNA sequence. Name that function `get_gc_content`.

2. Use a `for` loop and your function to calculate the GC content of each sequence and store the results in a new data frame using an index. The following code will help you create this data frame:

``````# create an empty data frame with one row for each sequence
gc_contents <- data.frame(gc_content = numeric(nrow(_______)))

# loop over sequences using an index for the row and
# store the output in the new data frame
for (i in 1:nrow(__________)){
________[i,] <- get_gc_content(sequences\$____[____])
}
``````
6. #### -- Multiple Files --

This is a follow-up to stringr.

The same colleague wants to scale up their analysis to work on data from the PATRIC bacterial phytogenomic database. They want to know the GC content of all of the bacteria in the database and have started working initially with just a handful of archaea.

Help out by downloading the data and looping over the files to determine the GC content for each file. Once downloaded, either unzip the .zip file manually or use the `unzip` function.

Read the .fasta files in the unzipped folder into R using the ShortRead package in Bioconductor. Start by installing Bioconductor with the following code (this may take a while, so be patient):

``````source("https://bioconductor.org/biocLite.R")
``````

Each fasta file contains one long sequence of DNA for an archaea species. The following code loads one sequence file, where `seq` is the variable name for the data file:

``````library(ShortRead)
``````

Use `list.files()`, with `full.names` set to true, to generate a list of the names of all the sequence files. Then create a for loop that uses the above `seq` code to read in each sequence file. When this works, add to the loop a function that calculates GC content of each file. You can use the GC content function you wrote for stringr, but you will have to capitalize the “g” and “c” strings in the `str_count` functions because the sequences in these files are upper case. Once this is working, create an empty data frame to store the output and fill it with values from the for loop, one column with file names and the second column with GC contents.

7. #### -- Species Occurrences Elevation Histogram --

A colleague of yours is working on a project on deer mice (Peromyscus maniculatus) and is interested in what elevations these mice tend to occupy in the continental United States. You offer to help them out by getting some coordinates for specimens of this species and looking up the elevation of these coordinates.

1. Get deer mouse occurrences from GBIF, the Global Biodiversity Information Facility, using the `spocc` R package, which is designed to retrieve species occurrence data from various openly available data resources. Use the following code to do so:

`````` mouse_df = occ(query = "Peromyscus maniculatus",
from = "gbif")
mouse_df = data.frame(mouse_df\$gbif\$data)
``````
2. With `dplyr`, rename the second and third columns of this dataset to `longitude` and `latitude`, and include only those specimens with a `basisOfRecord` that is `PRESERVED_SPECIMEN`.

3. The `raster` package comes with some datasets, including one of global elevations, that can be retrieved with the `getData` function as follows:

`````` elevation = getData("alt", country = "US")
elevation = elevation[[1]]
``````
4. Turn the occurrences dataframe into a spatial dataframe, making sure that its projection matches that of the elevation dataset.

5. Extract the elevation values for all of the deer mouse occurrences and plot a histogram of them.

6. Write a function that creates a vector of elevation values for a given species name. This function should retrieve the data and put it into a dataframe, as in part 1. To limit how long this takes, add in the argument `limit = 50` in the `occ` line. Then rename the latitude and longitude columns and remove NA rows from these columns using the following code:

`````` colnames(mouse_df)[2] <- "longitude"
colnames(mouse_df)[3] <- "latitude"
mouse_df = mouse_df %>% filter(!is.na(longitude) & !is.na(latitude))
``````

Turn the dataframe into a spatial object and extract the elevations for the points in the object. Test that your function works by putting in `"Peromyscus maniculatus"` as the argument.

7. Run this function on the following vector of mouse species names using either a loop to get a vector or an apply function to get a list of elevations for 50 occurrences of each of these 5 species.

`````` mouse_species = c("Peromyscus maniculatus", "Peromyscus leucopus", "Peromyscus eremicus", "Peromyscus merriami", "Peromyscus boylii")
``````