Learning Objectives

Following this assignment students should be able to:

  • write basic and nested functions
  • understand the search patterns described by regular expressions
  • debug and test R scripts to ensure proper functioning and correct output

Reading

Lecture Notes


Exercises

  1. -- Species Area Relationship --

    The species-area relationship characterizes the relationship between the number of species observed at a site and the area being sampled. This relationship is used widely in ecology and conservation biology for tasks such as estimating the location of biodiversity hotspots to prioritize for conservation.

    Unfortunately there is no consensus on the form of the equation that best describes the species-area relationship. This means that any estimate of species richness depends on the choice of model. Most of the models have roughly equivalent statistical support, and we are going to be making predictions for regions where there are no data, so we can’t determine the best model statistically. Instead we are going to take a consensus approach, where we estimate the species richness using all of the models and then use the average prediction as our best estimate.

    We are going to deal with five models today (which is already kind of a lot), but according to some authors there are as many as 20 reasonable models for the species-area relationship, so we’ll want to make our code easily extensible. The five models we will work with are those defined by Dengler and Oldeland (2010).

    • Power: S = b0 * A^b1
    • Power-quadratic: S = 10^(b0 + b1 * log(A) + b2 * log(A)^2)
    • Logarithmic: S = b0 + b1 * log(A)
    • Michaelis-Menten: S = b0 * A / (b1 + A)
    • Lomolino: S = b0 / (1 + b1^log(b2/A))

    All logarithms are base 10. The parameters for each model are available below, along with the areas at which we wish to predict species richness. Each sublist contains the parameters for one model in the order given above. All models contain b0 and b1, but only the Power-quadratic and Lomolino models contain the third parameter b2.

    sar_parameters <- list(c(20.81, 0.1896), c(1.35, 0.1524, 0.0081),
                           c(14.36, 21.16), c(85.91, 42.57),
                           c(1082.45, 1.59, 390000000))
    
    areas <- c(1, 5.2, 10.95, 152.3, 597.6, 820, 989.8, 1232.5, 15061)
    

    These can be copied and pasted into your code. Alternatively, if you’re looking for a more realistic challenge, you can import the related csv files for the parameters and the areas directly from the web. Extracting the data you need from a standard csv import will be a little challenging, but you’ll learn a lot (and you can always solve the main problem first and then go back and solve the import step later, which may well be what an experienced programmer would do in this situation).

    Write a script that calculates the richness predicted by each model for each area, and exports the results to a csv file with the first column containing the area for the prediction and the second column containing the mean predicted richness for that area. To make this easily extensible you will want to write a function that defines each of the different species-area models (5 functions total) and then use higher order functions to call those functions.
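    One way this structure could be sketched is shown below. The function and variable names are placeholders, not requirements of the assignment; the exponents follow the model forms from Dengler and Oldeland (2010), with base-10 logarithms throughout.

```r
# Parameters and areas as given in the assignment
sar_parameters <- list(c(20.81, 0.1896), c(1.35, 0.1524, 0.0081),
                       c(14.36, 21.16), c(85.91, 42.57),
                       c(1082.45, 1.59, 390000000))
areas <- c(1, 5.2, 10.95, 152.3, 597.6, 820, 989.8, 1232.5, 15061)

# One function per model; each takes a vector of areas and a parameter vector
power <- function(a, p) p[1] * a ^ p[2]
power_quad <- function(a, p) 10 ^ (p[1] + p[2] * log10(a) + p[3] * log10(a) ^ 2)
logarithmic <- function(a, p) p[1] + p[2] * log10(a)
michaelis_menten <- function(a, p) p[1] * a / (p[2] + a)
lomolino <- function(a, p) p[1] / (1 + p[2] ^ log10(p[3] / a))

# Keeping the models in a list makes adding a sixth model a one-line change
models <- list(power, power_quad, logarithmic, michaelis_menten, lomolino)

# Higher-order call: apply every model to every area (9 x 5 matrix),
# then average across models for each area
predictions <- mapply(function(model, params) model(areas, params),
                      models, sar_parameters)
results <- data.frame(area = areas, mean_richness = rowMeans(predictions))
# write.csv(results, "species_area_predictions.csv", row.names = FALSE)
```

    Because the models live in a list alongside their parameter vectors, extending the script to 20 models only means writing the new model functions and appending them to the two lists.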

    [click here for output]
  2. -- Find the Rodents --

    There are many diverse sources of biological data in the modern world, but not all of them are nicely cultivated into well structured data. For a project you’re working on you need a list of all of the rodent species in the world. You find a great list of rodents by following the first link on a Google search for List of Rodents.

    Unfortunately it’s in HTML, and scraping HTML is generally considered to be a bit of a nightmare. Fortunately it’s a Wikipedia article, and Wikipedia has a nice feature that lets us see the source code (or wiki markup) that is used to build the HTML. This is the same as what we would see if we clicked on the Edit tab of the article, but accessible as a simple text file. This can be done using the general url:

    http://en.wikipedia.org/w/index.php?title=PAGETITLE&action=raw

    where PAGETITLE is replaced with the actual title of the page.
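    In R, building the url could look like the following sketch (the page title here is an assumption about which article your search turns up):

```r
# Assumed page title; substitute whatever article your search finds
page_title <- "List_of_rodents"
url <- paste0("http://en.wikipedia.org/w/index.php?title=",
              page_title, "&action=raw")
# markup <- readLines(url)  # downloads the raw wiki markup as text
```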

    Download the wiki markup and write a short script using regular expressions that extracts all of the Latin binomials from the page. Export this list as a text file, one species per line, with the genus and species separated by a comma.
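    The extraction step might be sketched as follows. The exact pattern will depend on how species names are marked up in the article; the pattern below is only an assumption that binomials appear as wiki links containing a capitalized genus followed by a lower-case species epithet, and the sample markup stands in for the downloaded page.

```r
# Two lines of sample wiki markup standing in for the downloaded page
markup <- "*''[[Rattus norvegicus]]''\n*''[[Mus musculus]]''"

# Match [[Genus species]]: a capitalized word, a space, a lowercase word
matches <- regmatches(markup,
                      gregexpr("\\[\\[[A-Z][a-z]+ [a-z]+\\]\\]", markup))[[1]]
binomials <- gsub("\\[\\[|\\]\\]", "", matches)  # strip the brackets
species_lines <- sub(" ", ",", binomials)        # "Genus,species"
# writeLines(species_lines, "rodents.txt")       # one species per line
```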

    [click here for output]
  3. -- Debug the Code --

    This is a follow-up to Data Management Review.

    While you were out of town doing field work over the summer, Dr. Granger hired another student, Gregory Goyle, to help her modify your code so that it did something a bit different from the original. The new code was intended to include more size classes and to output the average GC content for each size class to a csv file, rather than the individual-level data. Unfortunately Greg hasn’t learned an important lesson about programming: it’s almost always better to work with existing code than to try to rewrite it from scratch. He figured it would be easier to just start over than to try to understand what you’d already done. Sadly, Greg isn’t quite the programmer you are, and so he didn’t actually finish the project before having to stop and focus on his course work now that school is back in session (and boy does he need to focus). So, he’s committed the current version of his code to your repository. It has all of the parts in place, but isn’t exactly… well… working just yet.

    You don’t want to make the same mistake Greg did, and besides, your computer crashed over the summer and you weren’t using version control (it’s OK, it’s not your fault), so you’ll need to work with Greg’s code, such as it is. Find the bugs in the code and fix them. You’ll need to both read the code and use a debugger to understand what’s going on and fix the problems. Get the code cleaned up at least to the point where it actually executes. You’re welcome to find and fix/improve other issues as well, but you’ll also be writing tests later to help you track down the tricky problems, so the really important thing at this point is to get the code running so that you can actually run the tests.

    Download the code to get started.

    [click here for output]
  4. -- Test the Code --

    This is a follow-up to Debug the Code.

    Write tests for your ‘Debug the Code’ code for the following cases and save them in a file called Tests-R.R. Make corrections/improvements to the ‘Debug the Code’ code so that all of your tests pass.

    test_that("get_gc_content() works",

    1. on a sequence represented by an upper case string.
    2. on a sequence represented by a lower case string.
    3. on a sequence represented by a mixed case string.
    4. on a sequence represented by a multiline string.
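    These four cases might be sketched with testthat as below. The get_gc_content() implementation here is a hypothetical stand-in for your debugged code, included only so the tests run on their own; the expected value assumes a four-base sequence containing two G/C bases.

```r
library(testthat)

# Hypothetical stand-in for the debugged function, for illustration only:
# GC content as the fraction of G and C bases in a sequence string
get_gc_content <- function(sequence) {
  sequence <- toupper(gsub("[\r\n]", "", sequence))  # normalize case, drop newlines
  bases <- strsplit(sequence, "")[[1]]
  sum(bases %in% c("G", "C")) / length(bases)
}

test_that("get_gc_content() works", {
  expect_equal(get_gc_content("ATGC"), 0.5)    # upper case
  expect_equal(get_gc_content("atgc"), 0.5)    # lower case
  expect_equal(get_gc_content("AtGc"), 0.5)    # mixed case
  expect_equal(get_gc_content("AT\nGC"), 0.5)  # multiline
})
```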

    In an email accompanying your “updated” code, Dr. Granger indicated that the specifications for the earlength size classes were:

    1. extralarge: earlength >= 15
    2. large: 10 <= earlength < 15
    3. medium: 8 <= earlength < 10
    4. small: earlength < 8

    test_that("get_size_class() works",

    1. for each case when the numbers are in range,
    2. for the edge cases of 8, 10, and 15,
    3. but the function fails if non-numerical values are input (e.g., a string from a header row that didn’t get removed).
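    These size-class tests might be sketched as follows. The get_size_class() implementation is again a hypothetical stand-in, written only to match Dr. Granger’s specification so that the sketch is self-contained; your tests should run against the debugged code instead.

```r
library(testthat)

# Hypothetical stand-in matching the earlength specification above
get_size_class <- function(earlength) {
  if (!is.numeric(earlength)) stop("earlength must be numeric")
  if (earlength >= 15) "extralarge"
  else if (earlength >= 10) "large"
  else if (earlength >= 8) "medium"
  else "small"
}

test_that("get_size_class() works", {
  expect_equal(get_size_class(20), "extralarge")  # in-range values
  expect_equal(get_size_class(12), "large")
  expect_equal(get_size_class(9), "medium")
  expect_equal(get_size_class(5), "small")
  expect_equal(get_size_class(15), "extralarge")  # edge cases
  expect_equal(get_size_class(10), "large")
  expect_equal(get_size_class(8), "medium")
  expect_error(get_size_class("earlength"))       # non-numeric input fails
})
```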

    [click here for output]