Learning Objectives

Following this assignment students should be able to:

  • connect to a remote database and execute simple queries
  • integrate database and R workflow
  • export output data from R to database
  • tidy data table with redundant fields or overfilled cells


Lecture Notes

dplyr with Databases


  1. -- Source and Query --

    This is a follow up to the Basic Queries filtering problem.

    It is clear Dr. Undómiel appreciates your skill working with large databases and she seems to expect you will maintain your benevolence. (Such is a fair expectation of a true wizard). This time though, she’s looking for some extra detail in her queries. She’s curious if desert rodents are dimorphic in size.

    1. Download a new copy of the Portal database.
    2. Connect to portal_mammals.sqlite as object portaldb using dplyr.
    3. Start by reminding yourself about which tables are in the database using src_tbls()
    4. Then remind yourself of the fields in the surveys and plots tables using the list subset operator $ops$vars.
    5. Select and print out the average hind foot length and average weight of:
      • all Dipodomys spectabilis individuals on the control plots
      • male D. spectabilis on the control plots
      • female D. spectabilis on the control plots
    [click here for output]
  2. -- Manipulate Query --

    This is a follow-up to Source and Query.

    Dr. Undómiel agrees with you that the difference in male and female D. spectabilis hind foot length and weight seems pretty small, but wants to make a more detailed comparison. She wants you to find the male and female hind foot length and weight for all species of rodent on all of the plots (not just the controls) and quantitatively define the size differences among species.

    1. Produce a data frame with species_id, sex, avg_hindfoot_length, and avg_weight for each species. Your data frame should have two rows for each species, one row for each sex.

      You can solve this problem with dplyr in a variety of ways including writing a query or using data manipulation verbs to group and select the data. You could also decide to use for loops or apply statements. Take whichever approach you like best.

    2. Write a function that determines if the absolute difference in average male and female size is less than the standard deviation of sizes for all individuals (abs(male - female) <= stdev).

    3. Manipulate the data so that you have a local data frame that has average male and female hindfoot_length and weight and the standard deviations in a single row for each species.

    4. Use transmute() and an ifelse() with your function to take each species hindfoot_length and weight from your local data frame and make a new data frame as that labels the results of your simple calculation as "SAME" or "DIFF".

      You may find that you get an Error: non-numeric argument to binary operator. The missing size data causes mean() to return results as a character. Remove the missing data from your query or re-class the results as.numeric() to make your calculation.

    [click here for output]
  3. -- Copy to Database --

    Dr. Undómiel has decided to focus on the change in size of a few target rodent species over the course of the experiment(1977-2002). She has chosen Dipodymys spectabilis, Onychomys torridus, Perymiscus erimicus, Chaetodipus penicillatus.

    Write a script that uses dplyr to:

    1. Connect to the portal_mammals.sqlite.
    2. Generate a data frame with year, species_id, and the average weight per year (avg_weight) for each target species. You may find the %in% c() construction useful for your filter().
    3. Use copy_to() to include your new data frame in portal_mammals.sqlite. Call it something informative so that Dr. Undómiel can find it easily. Make sure it remains after the connection is terminated using temporary = FALSE.
    [click here for output]
  4. -- NEON Mammals --

    The National Ecological Observatory Network has entered into the construction phase of development and they are already making their data available! NEON collects ecological and environmental data for representative regions of the United States at local to continental scales, including, of course!, small mammal box trapping. We’ve retrieved NEON’s existing small mammal data from Ordway-Swisher Biological Station [NEON Data Use Policy].

    1. Create a SQLite database called ordway_mammals.sqlite.
    2. Download the three data tables (capture, plots, traps) and import them into the SQLite database. Familiarize yourself with the tables’ structure.
    3. Write a query to determine the total number of traps that have been disturbed at each plot. Plot a histogram of the results.
    4. Determine the average hind foot length and weight of each species collected for each National Landcover Database class (nlcdClass).
    5. Plot the average weight of all species with weight measurements from the woody wetlands.
    6. Create a scatter plot of the relationship between the total number of disturbed traps and the average weight of Podimus floridanus for each sampling event (eventID).
    [click here for output] [click here for output] [click here for output] [click here for output]
  5. -- Tree Biomass --

    Estimating the total amount of biomass (the total mass of all individuals) in forests is important for understanding the global carbon budget and how the earth will respond to increases in carbon dioxide emissions. Measuring the mass of whole trees is a major effort and requires destructive harvest of the tree. Fortunately, we can estimate the mass of a tree based on its diameter.

    There are lots of equations for estimating the mass of a tree from its diameter, but one good option is the equation:

    Mass = 0.124 * Diameter2.53

    where Mass is measured in kg of dry above-ground biomass and Diameter is in cm DBH (Brown 1997).

    We’re going to estimate the total tree biomass for trees in a 96 hectare area of the Western Ghats in India. The raw data is available on Ecological Archives. Unfortunately, the data is stored in a poor database structure and using all of the tree stems would be difficult without first tidying up the data. You can have a look at the metadata to get familiar with the data structure.

    1. Use tidyr to gather() the raw data into rows for each measured stem.
    2. Write a function that takes a vector of tree diameters as an argument and
      returns a vector of tree masses.
    3. Stems are measured in girth (or circumference) rather than diameter. Write a function that takes a vector of circumferences as an argument and returns a vector of diameters (circumference = pi * diameter).
    4. Use the two functions you’ve written to estimate the total biomass (i.e., the sum of the masses) of trees in this dataset and print the result to the screen.
    5. separate() the SpCode into GenusCode and SpEpCode and estimate the total biomass per genus in a table. Technically the four letter code doesn’t uniquely identify all of the genera in the dataset, but we’ll assume it does for the purpose of this exercise.
    [click here for output]