extractTranscriptsFromGenome {GenomicFeatures}R Documentation

Tools for extracting transcript sequences

Description

extractTranscriptsFromGenome extracts the transcript sequences from a BSgenome data package using the transcript information (exon boundaries) stored in a TranscriptDb or GRangesList object.

extractTranscripts extracts a set of transcripts from a single DNA sequence.

Related utilities:

transcriptWidths to get the lengths of the transcripts (called the "widths" in this context) based on the boundaries of their exons.

transcriptLocs2refLocs converts transcript-based locations into reference-based (aka chromosome-based or genomic) locations.

Usage

extractTranscriptsFromGenome(genome, txdb, use.names=TRUE)

extractTranscripts(x,
        exonStarts=list(), exonEnds=list(), strand=character(0),
        reorder.exons.on.minus.strand=FALSE)

## Related utilities:

transcriptWidths(exonStarts=list(), exonEnds=list())

transcriptLocs2refLocs(tlocs,
        exonStarts=list(), exonEnds=list(), strand=character(0),
        reorder.exons.on.minus.strand=FALSE)

Arguments

genome

A BSgenome object. See the available.genomes function in the BSgenome package for how to install a genome.

txdb

A TranscriptDb object or a GRangesList object.

use.names

TRUE or FALSE. Ignored if txdb is not a TranscriptDb object. If TRUE (the default), the returned sequences are named with the transcript names. If FALSE, they are named with the transcript internal ids. Note that, unlike the transcript internal ids, the transcript names are not guaranteed to be unique or even defined (they could be all NAs). A warning is issued when this happens.

x

A DNAString or MaskedDNAString object.

exonStarts, exonEnds

The starts and ends of the exons, respectively.

Each argument can be a list of integer vectors, an IntegerList object, or a character vector where each element is a comma-separated list of integers. In addition, the lists represented by exonStarts and exonEnds must have the same shape i.e. have the same lengths and have elements of the same lengths. The length of exonStarts and exonEnds is the number of transcripts.

strand

A character vector of the same length as exonStarts and exonEnds specifying the strand ("+" or "-") from which the transcript is coming.

reorder.exons.on.minus.strand

TRUE or FALSE. Should the order of exons for transcripts coming from the minus strand be reversed?

tlocs

A list of integer vectors of the same length as exonStarts and exonEnds. Each element in tlocs must contain transcript-based locations.

Value

For extractTranscriptsFromGenome: A named DNAStringSet object with one element per transcript. When txdb is a GRangesList object, elements in the output align with elements in the input (txdb), and they have the same names.

For extractTranscripts: A DNAStringSet object with one element per transcript.

For transcriptWidths: An integer vector with one element per transcript.

For transcriptLocs2refLocs: A list of integer vectors of the same shape as tlocs.

Author(s)

H. Pages

See Also

available.genomes, TranscriptDb-class, GRangesList-class, DNAStringSet-class

Examples

  library(BSgenome.Hsapiens.UCSC.hg18)  # load the genome

  ## ---------------------------------------------------------------------
  ## A. USING extractTranscriptsFromGenome() WITH A TranscriptDb OBJECT
  ## ---------------------------------------------------------------------
  txdb_file <- system.file("extdata", "UCSC_knownGene_sample.sqlite",
                           package="GenomicFeatures")
  txdb <- loadFeatures(txdb_file)
  txseqs <- extractTranscriptsFromGenome(Hsapiens, txdb)
  txseqs

  ## ---------------------------------------------------------------------
  ## B. USING extractTranscriptsFromGenome() WITH A GRangesList OBJECT
  ## ---------------------------------------------------------------------

  ## A GRangesList object containing exons grouped by transcripts gives
  ## the same result as above:
  exbytx <- exonsBy(txdb, by="tx", use.names=TRUE)
  txseqs2 <- extractTranscriptsFromGenome(Hsapiens, exbytx)
  ## A sanity check:
  stopifnot(identical(unname(sapply(width(exbytx), sum)), width(txseqs2)))

  ## CDSs grouped by transcripts (this extracts only the translated parts
  ## of the transcripts):
  cds <- extractTranscriptsFromGenome(Hsapiens, cdsBy(txdb))

  ## ---------------------------------------------------------------------
  ## C. GOING FROM TRANSCRIPT-BASED TO REFERENCE-BASED LOCATIONS
  ## ---------------------------------------------------------------------
  ## Get the reference-based locations of the first 4 (5' end)
  ## and last 4 (3' end) nucleotides in each transcript:
  tlocs <- lapply(width(txseqs2), function(w) c(1:4, (w-3):w))
  tx_strand <- sapply(strand(exbytx), runValue)
  ## Note that, because of how we made them, 'tlocs', 'start(exbytx)',
  ## 'end(exbytx)' and 'tx_strand' have the same length, and, for any
  ## valid positional index, elements at this position are corresponding
  ## to each other. This is how transcriptLocs2refLocs() expects them
  ## to be!
  rlocs <- transcriptLocs2refLocs(tlocs, start(exbytx), end(exbytx),
               tx_strand, reorder.exons.on.minus.strand=TRUE)

  ## ---------------------------------------------------------------------
  ## D. EXTRACTING WORM TRANSCRIPTS ZC101.3 AND F37B1.1
  ## ---------------------------------------------------------------------

  ## Transcript ZC101.3 (is on + strand):
  ##   Exons starts/ends relative to transcript:
  rstarts1 <- c(1, 488, 654, 996, 1365, 1712, 2163, 2453)
  rends1 <- c(137, 578, 889, 1277, 1662, 1870, 2410, 2561)
  ##   Exons starts/ends relative to chromosome:
  starts1 <- 14678410 + rstarts1
  ends1 <- 14678410 + rends1

  ## Transcript F37B1.1 (is on - strand):
  ##   Exons starts/ends relative to transcript:
  rstarts2 <- c(1, 325)
  rends2 <- c(139, 815)
  ##   Exons starts/ends relative to chromosome:
  starts2 <- 13611188 - rends2
  ends2 <- 13611188 - rstarts2

  exon_starts <- list(as.integer(starts1), as.integer(starts2))
  exon_ends <- list(as.integer(ends1), as.integer(ends2))

  library(BSgenome.Celegans.UCSC.ce2)
  ## Both transcripts are on chrII:
  chrII <- Celegans$chrII
  transcripts <- extractTranscripts(chrII,
                                    exonStarts=exon_starts,
                                    exonEnds=exon_ends,
                                    strand=c("+","-"))

  ## Same as 'width(transcripts)':
  transcriptWidths(exonStarts=exon_starts, exonEnds=exon_ends)

  transcriptLocs2refLocs(list(c(1:6, 135:140, 1555:1560),
                              c(1:6, 137:142, 625:630)),
                         exonStarts=exon_starts,
                         exonEnds=exon_ends,
                         strand=c("+","-"))

  ## A sanity check:
  ref_locs <- transcriptLocs2refLocs(list(1:1560, 1:630),
                                     exonStarts=exon_starts,
                                     exonEnds=exon_ends,
                                     strand=c("+","-"))
  stopifnot(chrII[ref_locs[[1]]] == transcripts[[1]])
  stopifnot(complement(chrII)[ref_locs[[2]]] == transcripts[[2]])

[Package GenomicFeatures version 1.6.8 Index]