Learning Objectives

Following this assignment students should be able to:

  • understand basic built-in and stringr functions
  • manipulate strings for data analysis

Exercises

  1. -- Print Strings --

    1. Print the following: Post hoc ergo propter hoc

    2. Print the following with no quotes: What’s up with scientists using all of this snooty latin?

    3. Print the following with no quotes and an extra blank line (see ?cat): Darwin’s “On the origin of species” is a seminal work in biology.

    4. Assign x <- 3, then use paste() to insert the value of x at the appropriate location in the statement: Then shalt thou count to x, no more, no less.

    [click here for output]
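
The difference between print(), cat(), and paste() can be sketched as follows. This is a minimal illustration, assuming placeholder strings (none of them are the strings from the exercise) and made-up variable names.

```r
# Placeholder strings, not the ones from the exercise.
msg <- "Carpe diem"

print(msg)                       # print() shows the surrounding quotes
print(noquote(msg))              # noquote() drops the quotes when printing
cat(msg, "\n\n", sep = "")       # cat() never shows quotes; "\n\n" adds a blank line

# Double quotes inside a double-quoted string need to be escaped.
quoted <- "She called it \"a good start\"."
cat(quoted, "\n", sep = "")

# paste() joins its arguments with a space by default and converts numbers to text.
n <- 5
sentence <- paste("There are", n, "samples in the tray.")
print(sentence)
```

Whether to use print(noquote()) or cat() is mostly a matter of taste; cat() gives finer control over line breaks.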
  2. -- stringr Functions --

    Use the character functions from the package stringr to print the following strings.

    1. "atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc". Do this by duplicating “atgc” 15 times.
    2. " Thank goodness it's Friday" without the leading white space (i.e., without the spaces before "Thank").
    3. "gcagtctgaggattccaccttctacctgggagagaggacatactatatcgcagcagtggaggtggaatgg" with all of the occurrences of "a" replaced with "A".
    4. Print the length of this DNA sequence "gccgatgtacatggaatatacttttcaggaaacacatatctgtggagagg".
    5. Print the number of "a"s in "gccgatgtacatggaatatacttttcaggaaacacatatctgtggagagg".
    6. Print the first 20 positions of this DNA sequence "gccgatgtacatggaatatacttttcaggaaacacatatctgtggagagg".
    7. Print the last 10 positions of this DNA sequence "gccgatgtacatggaatatacttttcaggaaacacatatctgtggagagg".
    [click here for output]
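
The stringr functions that fit these tasks can be sketched on short placeholder strings. This is one possible choice of functions, assuming the stringr package is installed; the example strings and variable names are invented for illustration.

```r
library(stringr)

motif <- str_dup("ga", 3)                               # repeat a string 3 times
trimmed <- str_trim("   hello world", side = "left")    # drop leading white space only
swapped <- str_replace_all("banana", "a", "A")          # replace every match

short_seq <- "ttagcattag"                   # placeholder sequence
n_bases <- str_length(short_seq)            # number of characters
n_t <- str_count(short_seq, "t")            # number of matches of "t"
first_four <- str_sub(short_seq, 1, 4)      # positions 1 through 4
last_three <- str_sub(short_seq, -3, -1)    # negative positions count from the end

print(c(motif, trimmed, swapped, first_four, last_three))
print(c(n_bases, n_t))
```

Negative indices in str_sub() are what make "the last N positions" possible without first computing the length.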
  3. -- Strings and Math --

    The length of an organism is typically strongly correlated with its body mass. This is useful because it allows us to estimate the mass of an organism even if we only know its length. This relationship generally takes the form

    Mass (kg) = a * Length (m)^b

    where the parameters a and b vary among groups. Write a script that prompts the user for the following pieces of information:

    1. genus name
    2. species name
    3. the length of the species

    and then estimates the mass of the organism using the equation above. The script should paste the result as:

    Genus species is length meters long and weighs approximately mass kg.

    where Genus, species, length, and mass are replaced with the appropriate values. As is standard practice, the first letter (and only the first letter) of the genus name should be capitalized, and the species name should appear in all lowercase letters in the output, regardless of how the user typed it.

    An allometric approach is regularly used to estimate the mass of dinosaurs since we cannot typically weigh something that is only preserved as bones. I’ll be testing your script using the length of a Spinosaurus (Spinosaurus aegyptiacus), which is 16 m long based on its reassembled skeleton. So, use the values of a and b for Theropoda (the appropriate dinosaur clade): a has been estimated as 0.73 and b has been estimated as 3.63 (Seebacher 2001). Spinosaurus is a predator that is bigger, and therefore, by definition, cooler, than that stupid Tyrannosaurus that everyone likes so much.

    [click here for output]
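
The calculation and the name formatting can be sketched separately from the prompting. The helper names estimate_mass() and format_binomial() are hypothetical, the inputs are arbitrary placeholders rather than the Spinosaurus test case, and readline() is only shown in comments because it needs an interactive session.

```r
library(stringr)

# Hypothetical helper: mass (kg) from length (m) for group-specific a and b.
estimate_mass <- function(length_m, a, b) {
  a * length_m^b
}

# Hypothetical helper: capitalize only the first letter of the genus and
# lower-case the species, whatever case the user typed.
format_binomial <- function(genus, species) {
  paste(str_to_title(genus), str_to_lower(species))
}

# In an interactive session the inputs could come from readline(), e.g.
#   genus <- readline("Genus: ")
#   length_m <- as.numeric(readline("Length (m): "))
# Fixed placeholder inputs keep this sketch reproducible.
genus <- "tYRANNOSAURUS"
species <- "REX"
length_m <- 12      # arbitrary placeholder length
a <- 0.73           # Theropoda values quoted in the exercise text
b <- 3.63

mass_kg <- estimate_mass(length_m, a, b)
result <- paste(format_binomial(genus, species), "is", length_m,
                "meters long and weighs approximately", round(mass_kg), "kg.")
print(result)
```

Note that readline() returns a character string, so the length has to be converted with as.numeric() before it can be used in the equation.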
  4. -- Long Strings --

    For the DNA sequence below, determine the following properties and print them to the screen. You can copy and paste the sequence into your code. It is a lot longer than what you can see on the screen, but if you select the whole thing and paste it into R you will see what it looks like:

    dna="ttcacctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgtgtgtctagctaagatgtattattctgctgtggatcccactaaagatatattcactgggcttattgggccaatgaaaatatgcaagaaaggaagtttacatgcaaatgggagacagaaagatgtagacaaggaattctatttgtttcctacagtatttgatgagaatgagagtttactcctggaagataatattagaatgtttacaactgcacctgatcaggtggataaggaagatgaagactttcaggaatctaataaaatgcactccatgaatggattcatgtatgggaatcagccgggtctcactatgtgcaaaggagattcggtcgtgtggtacttattcagcgccggaaatgaggccgatgtacatggaatatacttttcaggaaacacatatctgtggagaggagaacggagagacacagcaaacctcttccctcaaacaagtcttacgctccacatgtggcctgacacagaggggacttttaatgttgaatgccttacaactgatcattacacaggcggcatgaagcaaaaatatactgtgaaccaatgcaggcggcagtctgaggattccaccttctacctgggagagaggacatactatatcgcagcagtggaggtggaatgggattattccccacaaagggagtgggattaggagctgcatcatttacaagagcagaatgtttcaaatgcatttttagataagggagagttttacataggctcaaagtacaagaaagttgtgtatcggcagtatactgatagcacattccgtgttccagtggagagaaaagctgaagaagaacatctgggaattctaggtccacaacttcatgcagatgttggagacaaagtcaaaattatctttaaaaacatggccacaaggccctactcaatacatgcccatggggtacaaacagagagttctacagttactccaacattaccaggtaaactctcacttacgtatggaaaatcccagaaagatctggagctggaacagaggattctgcttgtattccatgggcttattattcaactgtggatcaagttaaggacctctacagtggattaattggccccctgattgtttgtcgaagaccttacttgaaagtattcaatcccagaaggaagctggaatttgcccttctgtttctagtttttgatgagaatgaatcttggtacttagatgacaacatcaaaacatactctgatcaccccgagaaagtaaacaaagatgatgaggaattcatagaaagcaataaaatgcatgctattaatggaagaatgtttggaaacct"

    1. How long is the sequence?
    2. How many occurrences of "gagg" are there in the sequence?
    3. What is the starting position of the first occurrence of "atta"?
    4. What is the GC content of the sequence? The GC content is the percentage of bases that are either G or C (as a percentage of total base pairs). Paste the result as “The GC content of this sequence is XX.XX%” where XX.XX is the actual GC content.
    [click here for output]
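
The four properties can be sketched on a much shorter placeholder sequence. This assumes stringr is installed; short_dna and the other variable names are invented for illustration.

```r
library(stringr)

short_dna <- "attagaggccgagg"    # placeholder, far shorter than the exercise sequence

seq_length <- str_length(short_dna)
n_gagg <- str_count(short_dna, "gagg")                     # counts non-overlapping matches
first_atta <- str_locate(short_dna, "atta")[1, "start"]    # start of the first match

n_gc <- str_count(short_dna, "[gc]")                       # regex class: "g" or "c"
gc_percent <- 100 * n_gc / seq_length

# "%.2f" rounds to two decimals; "%%" prints a literal percent sign.
gc_message <- sprintf("The GC content of this sequence is %.2f%%", gc_percent)
print(gc_message)
```

str_locate() returns a matrix with "start" and "end" columns for the first match only; str_locate_all() would return every match.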
  5. -- Strings from Data --

    A colleague has produced a file with one DNA sequence on each line. Download the file and load it into R using read.csv(). The file has no header (header = FALSE) and the values are separated by white space (sep = "").

    Calculate the GC content of each sequence. The GC content is the percentage of bases that are either G or C (as a percentage of total base pairs). Print each GC content in order to the screen (in %).

    [click here for output]
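
Reading a headerless, white-space-separated file can be sketched without the real download by passing a string through the text argument. The sequences here are placeholders, and with the actual file the path would be given as the first argument instead.

```r
# Stand-in for the downloaded file: one placeholder sequence per line.
file_text <- "attggc\nccgg\natat\n"

# header = FALSE: the first line is data, not column names.
# sep = "": split on any white space.
# stringsAsFactors = FALSE keeps the sequences as character strings.
sequences <- read.csv(text = file_text, header = FALSE, sep = "",
                      stringsAsFactors = FALSE)

print(class(sequences))    # read.csv() returns a data frame
print(names(sequences))    # unnamed columns are called V1, V2, ...

# A single sequence is pulled out by row and column position.
first_seq <- sequences[1, 1]
print(first_seq)
```

Because the result is a data frame rather than a plain vector, each sequence has to be extracted (for example with sequences[1, 1] or sequences$V1[1]) before string functions are applied to it.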
  6. -- String Data --

    This is a follow-up to Strings from Data.

    A colleague has produced a file with one DNA sequence on each line. Download the file and load it into R using read.csv(). The file has no header.

    Write a function to calculate GC content. GC content is the percentage of bases that are either G or C as a percentage of total base pairs. Your function should take a DNA sequence as input and return the GC content of that sequence. Print the result for each sequence.

    Before we knew about functions we had to take each DNA sequence one at a time and then rewrite or copy-paste the same code to analyze each one. Isn’t this better?

    You may have noticed that the for loop prints the results differently. This is because read.csv() imports the data as a data.frame, unlike the numeric vector in the previous exercise.

    [click here for output]
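
One way to wrap the calculation in a function is sketched below. The name get_gc_content() is hypothetical, the lower-casing step is an added assumption so that upper-case input is also handled, and the test strings are placeholders.

```r
library(stringr)

# Hypothetical function: takes one DNA sequence, returns its GC content in percent.
get_gc_content <- function(sequence) {
  sequence <- str_to_lower(sequence)       # treat "G" and "g" the same
  n_gc <- str_count(sequence, "[gc]")      # number of g or c bases
  100 * n_gc / str_length(sequence)        # last expression is the return value
}

print(get_gc_content("attggc"))
print(get_gc_content("CCGG"))
```

Returning the number rather than printing it inside the function leaves the caller free to print, store, or format the result.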
  7. -- Improve Your Code --

    This is a follow-up to String Data.

    A colleague has produced a file with one DNA sequence on each line. So far you’ve been manually extracting each DNA sequence and calculating its GC content, which has worked OK with five sequences, but isn’t going to work very well when the sequencer really gets going and you have to handle hundreds to thousands of sequences.

    Use a for loop and your function from String Data to calculate the GC content of each sequence and print them out. The function should work on a single sequence at a time and the for loop should repeatedly call the function and print out the result.

    [click here for output]
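
The loop-plus-function pattern can be sketched as follows. The function gc_content() and the small data frame are stand-ins (the data frame plays the role of what read.csv() would return), so the names and values are assumptions for illustration only.

```r
library(stringr)

# Stand-in for the function from the previous exercise: works on one sequence.
gc_content <- function(sequence) {
  100 * str_count(sequence, "[GCgc]") / str_length(sequence)
}

# Placeholder data frame in the shape read.csv() produces for a one-column file.
sequences <- data.frame(V1 = c("attggc", "ccgg", "atat"),
                        stringsAsFactors = FALSE)

# Loop over row positions, call the function once per sequence, print each result.
gc_values <- numeric(nrow(sequences))
for (i in seq_len(nrow(sequences))) {
  gc_values[i] <- gc_content(sequences$V1[i])
  print(gc_values[i])
}
```

Storing the results in gc_values is optional for this exercise, but it makes the values available after the loop finishes. The same loop runs unchanged whether the file holds five sequences or five thousand.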
  8. -- Split Strings --

    You have a data file with a single "taxonomy" column in it. This column contains the family, genus, and species for a single taxonomic group. You need to figure out how to split that information into separate values for family, genus, and species. To solve the basic problem take a single example string, "Ornithorhynchidae Ornithorhynchus anatinus", split it into three separate strings using a stringr command, and then print the family, genus, and species, each on a separate line.

    [click here for output]
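
Splitting on a space can be sketched with str_split(), which is one stringr command that fits this task. The taxonomy string below is a placeholder, not the one from the exercise.

```r
library(stringr)

taxonomy <- "Felidae Panthera leo"    # placeholder string

# str_split() returns a list with one character vector per input string,
# so [[1]] extracts the pieces for this single string.
parts <- str_split(taxonomy, " ")[[1]]

family <- parts[1]
genus <- parts[2]
species <- parts[3]

# sep = "\n" puts each value on its own line.
cat(family, genus, species, sep = "\n")
```

When the same split is later applied to a whole column, str_split_fixed() may be more convenient because it returns a matrix with one row per input string.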