Remember the five basic rules of database structure

  1. Order doesn’t matter
  2. No duplicate rows
  3. Every cell contains one value
  4. One column per type of information
  5. No redundant information

Restructure tables with messy data

How to restructure to keep no duplicate rows and one value per cell

tidyr

library(tidyr)
genes_wide = data.frame(
  name = c("A", "B", "C"),
  a = c("16-Y", "25-N", "13-Y"),
  t = c("1-N", "12-Y", "31-Y")
)
> genes_wide
  name    a    t
1    A 16-Y  1-N
2    B 25-N 12-Y
3    C 13-Y 31-Y

Ask students,

  • “What makes genes_wide messy?”
  • “What are the variables in genes_wide?”

tidyr helps restructure messy data

genes_long = genes_wide %>%
  gather(base, base_counts, a:t)
> genes_long
  name base base_counts
1    A    a        16-Y
2    B    a        25-N
3    C    a        13-Y
4    A    t         1-N
5    B    t        12-Y
6    C    t        31-Y
genes = genes_long %>%
  separate(base_counts, c("counts", "sequence"), sep = "-")
> genes
  name base counts sequence
1    A    a     16        Y
2    B    a     25        N
3    C    a     13        Y
4    A    t      1        N
5    B    t     12        Y
6    C    t     31        Y