Genomic Ranges
Coordinate Systems
Screen Recording
Reading
Textbook: Working with Range Data A crash course in Genomic Ranges 263-269
Working with Range Data with BEDtools 329-337
Objectives
Understand genomic ranges, what they are, and what they are useful for.
Genomic Ranges:
Genomic ranges are a way to represent regions of a genome, typically the start and end of specific features like genes, exons, or regulatory regions.
A genomic range is typically represented by three pieces of information:
Chromosome: The chromosome on which the feature is located.
Start position: The position on the chromosome where the feature begins.
End position: The position on the chromosome where the feature ends.
It’s important to note that the start and end positions can refer to the genomic coordinates (see below), but it can also be as simple as the number of base pairs from the start of the chromosome.
Coordinate Systems:
In genomics, there are two main types of coordinate systems:
1-based coordinate system: In this system, the first base of a sequence is numbered as 1. This system is commonly used in most biological research, including genome annotations in databases such as GenBank and Ensembl.
0-based coordinate system: In this system, the first base of a sequence is numbered as 0. This system is commonly used in some computational tools and languages, like Python and in the BED file format widely used in bioinformatics.
The distinction between 1-based and 0-based is important because it can lead to off-by-one errors if not accounted for.
GenomicRanges Package:
In the R programming language, the GenomicRanges
package is often used for handling genomic ranges. It provides a convenient and consistent way to handle and manipulate genomic ranges. Some of the functionalities include:
Creating genomic ranges
Finding overlaps between genomic ranges
Finding the nearest genomic range
Shifting or narrowing genomic ranges
Here’s a basic example of how you can create a GenomicRange object in R:
# Loading the library
library(GenomicRanges)
# Creating a GenomicRanges object
<- GRanges(seqnames = "chr1",
gr ranges = IRanges(start = c(100, 200), end = c(150, 250)))
# Printing the object
print(gr)
Understanding genomic ranges and coordinate systems is crucial when dealing with any kind of genomic data, as it allows precise representation and manipulation of genomic features. Be sure to always check the documentation of any tool or database you’re using to understand what kind of coordinate system it uses, and always be consistent in your own work.
BEDTools:
BEDTools is a suite of utilities for comparing and manipulating genomic features in various formats such as BED, GFF/GTF, VCF, and BAM. These tools are incredibly powerful for handling genomic intervals and are often used in bioinformatics pipelines. Some of its features include:
Comparing genomic features
Manipulating genomic features
Counting genomic features
Creating coverage tracks
Here’s a basic rundown of some of the most commonly used BEDTools subcommands:
intersect
bedtools intersect
is used to find overlapping regions between two BED files. For example, you could use it to find genes (from a genes.BED file) that overlap with regulatory regions (from a regulatory_regions.BED file):bedtools intersect -a genes.BED -b regulatory_regions.BED
merge
bedtools merge
combines overlapping intervals into a single interval. For example, if you have a BED file with multiple overlapping regions, you can merge them into single, continuous regions:bedtools merge -i input.BED
sort
bedtools sort
sorts a BED file by chromosome and then by start position. This is often necessary before using other BEDTools commands, as many require sorted input:bedtools sort -i input.BED
closest
bedtools closest
finds the closest non-overlapping interval. For example, for each gene (in genes.BED), you could find the closest regulatory region (in regulatory_regions.BED):bedtools closest -a genes.BED -b regulatory_regions.BED
coverage
bedtools coverage
calculates the coverage of one set of intervals over another. For example, you could calculate the coverage of sequencing reads (in reads.BED) over genes (in genes.BED):bedtools coverage -a genes.BED -b reads.BED
Lastly, BEDTools is flexible and powerful, but it can be complex. Always check the documentation and test your commands on small, controlled datasets to make sure they’re doing what you expect.