countGenomicOverlaps {GenomicRanges}R Documentation

Count Read Hits in Genomic Features

Description

Count read hits per exon or transcript and resolve multi-hit reads.

Usage

  ## S4 method for signature 'GRangesList,GRangesList'
countGenomicOverlaps(
    query, subject, 
    type = c("any", "start", "end", "within", "equal"),
    resolution = c("none", "divide", "uniqueDisjoint"),
    ignore.strand = FALSE, splitreads = TRUE, ...) 

Arguments

query

A GRangesList, or a GRanges of genomic features. These are the annotations that define the genomic regions and will often be the result of calling "exonsBy" or "transcriptsBy" on a TranscriptDb object. If a GRangesList is provided, each top level of the list represents a "super" such as a gene and each row is a "sub" such as an exon or transcript. When query is a GRanges all rows are considered to be of the same level (e.g., all genes, all exons or all transcripts).

subject

A GRangesList, GRanges, or GappedAlignments representing the data (e.g., reads).

List structures as the subject are used to represent reads with multiple parts (i.e., gaps in the CIGAR). When a GappedAlignments is provided it is coerced to a GRangesList object. If any of the reads in the GappedAlignments have gaps, the corresponding GRangesList will have multiple elements for that top level list. When subject is a GRanges, it is assumed that all reads are simple and do not have multiple parts.

type

See findOverlaps in the IRanges package for a description of this argument.

resolution

A character(1) string of "none", "divide", or "uniqueDisjoint". These rule sets are used to distribute read hits when multiple queries are hit by the same subject.

  • "none" : No conflict resolution is performed. All subjects that hit more than 1 query are dropped.

  • "divide" : The hit from a single subject is divided equally among all queries that were hit. If a subject hit 4 queries each query is assigned 1/4 of a hit.

  • "uniqueDisjoint" : Queries hit by a common subject are partitioned into disjoint intervals. Any regions that are shared between the queries are discarded. If the read overlaps one of these remaining unique disjoint regions the hit is assigned to that feature. If the read overlaps both or none of the regions, no hit is assigned. Therefore, unlike the divide option, uniqueDisjoint does not resolve multi-hit conflict in all situations.

ignore.strand

A logical value indicating if strand should be considered when matching.

splitreads

A logical value indicating if split reads should be included.

...

Additional arguments, perhaps used by methods defined on this generic.

Details

The countGenomicOverlaps methods use the findOverlaps function in conjunction with a resolution method to identify overlaps and resolve subjects (reads) that match multiple queries (annotation regions). The usual type argument of findOverlaps is used to specify the type of overlap. The resolution argument is used to select a method to resolve the conflict when a subject hits more than 1 query. Here the term ‘hit’ means an overlap identified by findOverlaps.

The primary difference in the handling of split reads vs simple reads (i.e., no gap in the CIGAR) is the portion of the read hit each split read fragment has to contribute. All reads, whether simple or split, have an overall value of 1 to contribute to a query they hit. In the case of the split reads, this value is further divided by the number of fragments in the read. For example, if a split read has 3 fragments (i.e., two gaps in the CIGAR) each fragment has a value of 1/3 to contribute to the query they hit. As with the simple reads, depending upon the resolution chosen the value may be divided, fully assigned or discarded.

More detailed examples can be found in the countGenomicOverlaps vignette.

Value

A vector of counts

Author(s)

Valerie Obenchain and Martin Morgan

Examples

## Not run: 
rng1 <- function(s, w)
GRanges(seq="chr1", IRanges(s, width=w), strand="+")

rng2 <- function(s, w)
GRanges(seq="chr2", IRanges(s, width=w), strand="+")

query <- GRangesList(A=rng1(1000, 500),
                    B=rng2(2000, 900),
                    C=rng1(c(3000, 3600), c(500, 300)),
                    D=rng2(c(7000, 7500), c(600, 300)),
                    E1=rng1(4000, 500), E2=rng1(c(4300, 4500), c(400, 400)),
                    F=rng2(3000, 500),
                    G=rng1(c(5000, 5600), c(500, 300)),
                    H1=rng1(6000, 500), H2=rng1(6600, 400))

subj <- GRangesList(a=rng1(1400, 500),
                     b=rng2(2700, 100),
                     c=rng1(3400, 300),
                     d=rng2(7100, 600),
                     e=rng1(4200, 500),
                     f=rng2(c(3100, 3300), 50),
                     g=rng1(c(5400, 5600), 50),
                     h=rng1(c(6400, 6600), 50))

## Overlap type = "any"
none <- countGenomicOverlaps(query, subj, 
                             type="any", resolution="none")
divide <- countGenomicOverlaps(query, subj, 
                               type="any", resolution="divide")
uniqueDisjoint <- countGenomicOverlaps(query, subj, type="any", 
                                       resolution="uniqueDisjoint")
data.frame(none = none, 
           divide = divide,
           uniqDisj = uniqueDisjoint)


## Split read with 4 fragments :
splitreads <- GRangesList(c(rng1(c(3000, 3200, 4000), 100), rng1(5400, 300)))
## Unlist both the splitreads and the query to see 
## - read fragments 1 and 2 both hit query 3
## - read fragment 3 hits query 7
## - read fragment 4 hits query 11 and 12 
findOverlaps(unlist(query), unlist(splitreads))

## Use countGenomicOverlaps to avoid double counting.
## Because this read has 4 parts each part contributes a count of 1/4.
## When resolution="none" only reads that hit a single region are counted. 
split_none <- countGenomicOverlaps(query, splitreads, type="any",
                                   resolution="none")
## When resolution="divide" all reads are counted by dividing their count 
## evenly between the regions they hit. Region 3 of the query was hit
## by two reads each contributing a count of 1/4. Region 7 was hit
## by one read contributing a count of 1/4. Regions 11 and 12 were both
## hit by the same read resulting in having to share (i.e., "divide") the 
## single 1/4 hit read 4 had to contribute.
split_divide <- countGenomicOverlaps(query, splitreads, 
                                     type="any", resolution="divide")

data.frame(none = split_none,
           divide = split_divide)

## End(Not run)

[Package GenomicRanges version 1.6.7 Index]