Sequence conservation, relative isoform frequencies, and nonsense-mediated decay in evolutionarily conserved alternative splicing

Baek and Green. 10.1073/pnas.0506139102.

Supporting Information

Files in this Data Supplement:

Supporting Text
Supporting Figure 7
Supporting Figure 8
Supporting Figure 9
Supporting Figure 10
Supporting Figure 11
Supporting Figure 12
Supporting Table 3
Supporting Table 4
Supporting Table 5
Supporting Table 6
Supporting Table 7
Supporting Table 8




Supporting Figure 7

Fig. 7. Identification of orthologous alternative splicing (AS) cases conserved between human and mouse.





Supporting Figure 8

Fig. 8. Examples of major alternative splicing categories: frame-preserving (A), non-frame-preserving non-NMD (nonsense-mediated mRNA decay) (B), exon exclusion isoform NMD (C), exon inclusion isoform NMD (D), single-exon skipping (E), multiple-exon skipping (F), and mutually exclusive single-exon skipping (G) cases. Gene and coding sequence structures are represented by colored boxes (exons) and arrowed lines (introns). Light green, dark green, red, and blue structures represent RefSeq gene, RefSeq coding sequence, aligned cDNA, and aligned EST structures, respectively. The termination codon and final exon junction are marked by red and black rectangles, respectively.





Supporting Figure 9

Fig. 9. Sequence conservation in exonic and splice site 20-bp (5 bp in exon and 15 bp in intron) windows (A) and effective number of conserved nucleotides in exonic synonymous sites and in the intronic region adjacent to the splice site window (B) as a function of splicing pattern.





Supporting Figure 10

Fig. 10. Scatterplots of effective number of conserved nucleotides (Nc) (A), sequence conservation in splice site 20-bp window (B), average exon size (C), and combined (donor plus acceptor) splice site score (D) against exclusion rate, by predicted nonsense-mediated mRNA decay (NMD) status in constitutively spliced (CS) and single-exon skipping alternatively spliced (AS) exons. Blue and red circled areas represent regions likely enriched for misclassified exons. All variables are the mouse-human averages.





Supporting Figure 11

Fig. 11. Histogram of LL(E) scores in constitutively spliced (CS) and alternatively spliced (AS) (single-exon skipping) exons computed by seven-parameter model (filter A in Table 5).





Supporting Figure 12

Fig. 12. Histogram of alternatively spliced (AS) exons by exclusion rate and nonsense-mediated mRNA decay (NMD) status. Frequencies are relative to the number of AS (single-exon skipping) exons in each AS group. The two NMD categories display contrasting patterns, consistent with loss of the NMD-inducing isoform. Frame-preserving AS exons show a relatively flat distribution, except for a small bump at lower exclusion rates which may reflect misclassified constitutively spliced exons.





Table 3. Exon counts by mouse and human splicing status

| Human \ Mouse | Constitutively spliced | Frame preserving | Non-frame-preserving non-NMD | Exclusion isoform NMD | Inclusion isoform NMD | Both isoforms NMD |
|---|---|---|---|---|---|---|
| Constitutively spliced | 14,368 | 216 | 47 | 179 | 9 | 11 |
| Frame preserving | 639 | 774 | 10 | 1 | 24 | 1 |
| Non-frame-preserving non-NMD | 188 | 12 | 176 | 10 | 10 | 0 |
| Exclusion isoform NMD | 509 | 1 | 16 | 233 | 18 | 13 |
| Inclusion isoform NMD | 75 | 31 | 7 | 23 | 89 | 4 |
| Both isoforms NMD | 42 | 4 | 4 | 21 | 10 | 2 |

All counts in alternative splicing cases are unfiltered single-exon skipping events. Cases with agreeing status (on the diagonal) significantly outnumber disagreeing ones, except in the first row and column, which presumably contain many cases of aberrant events or of insufficient data to detect alternative isoforms. Also, numbers below the diagonal are generally larger than those above, consistent with the fact that there are more human than mouse EST/cDNA data and that larger datasets should contain more aberrant events.
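The two observations in the note can be checked directly from the counts. The sketch below is only an illustrative tally, not part of the original analysis; the names `COUNTS` and `diagonal_summary` are hypothetical, and the matrix simply copies the table in its row and column order.

```python
# Table 3 counts, in the table's own category order
# (constitutively spliced first, then the five AS categories).
COUNTS = [
    [14368, 216, 47, 179, 9, 11],
    [639, 774, 10, 1, 24, 1],
    [188, 12, 176, 10, 10, 0],
    [509, 1, 16, 233, 18, 13],
    [75, 31, 7, 23, 89, 4],
    [42, 4, 4, 21, 10, 2],
]


def diagonal_summary(m):
    """Summarize agreement between row status and column status."""
    n = len(m)
    # Totals strictly below and strictly above the diagonal.
    below = sum(m[i][j] for i in range(n) for j in range(i))
    above = sum(m[i][j] for i in range(n) for j in range(i + 1, n))
    # Restrict to AS-by-AS cells by dropping the first row and column.
    as_agree = sum(m[i][i] for i in range(1, n))
    as_disagree = sum(
        m[i][j] for i in range(1, n) for j in range(1, n) if i != j
    )
    return {
        "below": below,
        "above": above,
        "as_agree": as_agree,
        "as_disagree": as_disagree,
    }


print(diagonal_summary(COUNTS))
```

The agreeing AS-by-AS total equals the size of the AS training data set reported for the present study in Table 6.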





Table 5. Filtering methods used in analyses

Filtering parameters (table columns; the per-filter entries are blank in this text version): SS Score, SS Cons., Intron Nc, Intron Cons., Exon Cons., Exon Size, Exon Size Divisibility by 3

| Filter ID | Figures for which the filters are used |
|---|---|
| A | |
| B | 4 (splice site score) and 5 (sequence conservation in splice site) |
| C | 5 (intronic Nc and 20 bp conservation) |
| D | 2, 4 (exon size), and 3 (synonymous exonic Nc) |
| E | 6 |
| F | 3 (intronic Nc) |

All of the filters eliminate 29-32% of alternatively spliced (AS) exons and 11-15% of constitutively spliced (CS) exons. For AS exons, the filters were developed by using only single-exon skipping cases. CS exons with LL(E) ≥ 0, or AS exons with LL(E) < 0, were eliminated from the analysis supporting the specified figure. SS, splice site.





Table 6. Comparison of methods to discriminate constitutively (CS) and alternatively (AS) spliced exons

 

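| | Present study | G. Yeo et al. (1) | G. Dror et al. (2) | R. Sorek et al. (3) | D. Philipps et al. (4) |
|---|---|---|---|---|---|
| Compared organisms | Homo sapiens and Mus musculus | Homo sapiens and Mus musculus | Homo sapiens and Mus musculus | Homo sapiens and Mus musculus | Drosophila melanogaster and Drosophila pseudoobscura |
| Main purpose | To discriminate potentially misclassified CS and AS (single-exon skipping) events | To predict single-exon skipping events | To predict single-exon skipping events | To predict single-exon skipping events | To predict various types of alternatively spliced exons |
| NMD status observed | Yes | No | No | No | No |
| Size of AS training data set | 1,274 | 241 | 243 | 243 | 592 highly conserved exon pairs |
| Size of CS training data set | 14,368 | ~5,000 | 1,753 | 1,753 | 592 highly conserved exon pairs |
| Splicing pattern required to be conserved between two compared species in training data set | Yes | Yes | Yes [but only 149 of 243 AS cases have both isoforms in mouse, whereas the remaining 94 cases have only exon exclusion isoforms (5)] | Yes [but only 149 of 243 AS cases have both isoforms in mouse, whereas the remaining 94 cases have only exon exclusion isoforms (5)] | No |
| Classification method | Nonparametric log likelihood | Regularized least-square | Support vector machine | Simple thresholding | Simple thresholding |
| Discriminating features | | | | | |
| Exonic PI | Yes | Yes | Yes | Yes | Yes |
| Exon size | Yes | Yes | Yes | Yes | No |
| Exon size divisibility by 3 | Yes | No | Yes | Yes | No |
| Intronic PI | Yes | Yes | Yes | Yes | Yes |
| Size of highly conserved intronic fragment | Yes | No | Yes | No | No |
| Splice site PI | Yes | No (but included as a part of exonic and intronic sequences) | No (but included as a part of exonic and intronic sequences) | No (but included as a part of exonic and intronic sequences) | No (but included as a part of exonic and intronic sequences) |
| Splice site score | Yes | Yes | Donor site only | No | No |
| Overrepresented motifs | No | 4- and 5-mers | 3-mers | No | No |
| Other features | None | Intron size | Intensity of polypyrimidine tract | None | None |
| Experimental verification | No | Yes | Yes | No | Yes |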

PI, percent identity; NMD, nonsense-mediated mRNA decay.

1. Yeo, G. W., Van Nostrand, E., Holste, D., Poggio, T. & Burge, C. B. (2005) Proc. Natl. Acad. Sci. USA 102, 2850-2855.

2. Dror, G., Sorek, R. & Shamir, R. (2005) Bioinformatics 21, 897-901.

3. Sorek, R., Shemesh, R., Cohen, Y., Basechess, O., Ast, G. & Shamir, R. (2004) Genome Res. 14, 1617-1623.

4. Philipps, D. L., Park, J. W. & Graveley, B. R. (2004) RNA 10, 1838-1844.

5. Sorek, R. & Ast, G. (2003) Genome Res. 13, 1631-1637.





Table 7. Correlations of various characteristics with exclusion rate in frame-preserving single-exon skipping alternatively spliced exons

| Characteristic | Frame-preserving single-exon skipping cases (n=774), correlation (P) | Filtered frame-preserving single-exon skipping cases, correlation (P) | Filter* |
|---|---|---|---|
| PI in acceptor site | 0.266 (<10⁻¹⁰) | 0.052 (0.21) | B (n=591) |
| PI in donor site | 0.255 (<10⁻¹⁰) | 0.104 (0.011) | B (n=591) |
| Combined PI in splice site | 0.303 (<10⁻¹⁰) | 0.094 (0.022) | B (n=591) |
| Acceptor site score | -0.237 (<10⁻¹⁰) | -0.191 (2.8×10⁻⁶) | B (n=591) |
| Donor site score | -0.147 (4.3×10⁻⁵) | -0.106 (9.6×10⁻³) | B (n=591) |
| Combined splice site score | -0.284 (<10⁻¹⁰) | -0.229 (1.9×10⁻⁸) | B (n=591) |
| Upstream intron Nc | 0.334 (<10⁻¹⁰) | 0.128 (1.7×10⁻³) | C (n=595) |
| Downstream intron Nc | 0.328 (<10⁻¹⁰) | 0.203 (5.4×10⁻⁷) | C (n=595) |
| Combined intron Nc | 0.370 (<10⁻¹⁰) | 0.194 (1.8×10⁻⁶) | C (n=595) |
| Combined intron 20 bp PI | 0.382 (<10⁻¹⁰) | 0.179 (1.1×10⁻⁵) | C (n=595) |
| Exon PI | 0.238 (<10⁻¹⁰) | 0.055 (0.20) | D (n=560) |
| Exon size | -0.345 (<10⁻¹⁰) | -0.325 (<10⁻¹⁰) | D (n=560) |

Note that the direction of the trends is such that the unfiltered data tend to show stronger correlations, because of the inclusion of misclassified constitutively spliced exons. PI, percent identity. *See Table 5 for filter identification.





Table 8. Correlations of various characteristics with exon size in constitutively spliced (CS) exons

| Characteristic | Conserved CS exons (n=14,368), correlation (P) | Filtered CS exons, correlation (P) | Filter* |
|---|---|---|---|
| PI in acceptor site | -0.071 (<10⁻¹⁰) | -0.055 (1.4×10⁻⁹) | E (n=12,257) |
| PI in donor site | -0.057 (<10⁻¹⁰) | -0.041 (5.1×10⁻⁶) | E (n=12,257) |
| Combined PI in splice site | -0.083 (<10⁻¹⁰) | -0.064 (<10⁻¹⁰) | E (n=12,257) |
| Acceptor site score | -0.075 (<10⁻¹⁰) | -0.069 (<10⁻¹⁰) | E (n=12,257) |
| Donor site score | -0.036 (1.3×10⁻⁵) | -0.042 (3.4×10⁻⁶) | E (n=12,257) |
| Combined splice site score | -0.077 (<10⁻¹⁰) | -0.077 (<10⁻¹⁰) | E (n=12,257) |
| Upstream intron Nc | -0.031 (2.5×10⁻⁴) | -0.013 (0.15) | F (n=12,478) |
| Downstream intron Nc | -0.039 (2.9×10⁻⁶) | -0.018 (0.042) | F (n=12,478) |
| Combined intron Nc | -0.039 (3.0×10⁻⁶) | -0.016 (0.074) | F (n=12,478) |
| Combined intron 20 bp PI | -0.056 (<10⁻¹⁰) | -0.032 (3.1×10⁻⁴) | F (n=12,478) |
| Exon PI | -0.117 (<10⁻¹⁰) | -0.100 (<10⁻¹⁰) | D (n=12,197) |

PI, percent identity. *See Table 5 for filter identification.





Supporting Text

cDNA and EST alignments. We used blat version 29 (1) to align 21,382 RefSeq Release 6 (2) and 16,876 H-Invitational Release 1.8 (3) human near-full-length protein-coding cDNAs against NCBI Build 35 of the human genome (4), and 17,076 RefSeq Release 6 (2) and 22,049 FANTOM2 DB 2.00 (5) mouse cDNAs against Build 33 of the mouse genome (6) to find the corresponding exon/intron structures. We eliminated those cDNA alignments that did not span at least 97% of the transcript or its full coding sequence, were less than 97% identical to the genome sequence, were not uniquely placed (had another alignment match for which the percent identity differed by less than 1.0), had only one exon, or lacked correctly annotated translation initiation or termination codons. Transcripts that were in the same genomic orientation and had at least 100 bp of exonic overlap were considered to represent the same gene. We eliminated ESTs that were from the RAGE library, did not overlap a reference cDNA alignment, were <95% identical, had only one exon, or were not uniquely placed. We also eliminated ESTs or cDNAs that included exons upstream or downstream of the matching reference cDNA; or for which more than 20% of the implied introns did not satisfy the U2 spliceosomal consensus GT-AG or GC-AG.
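The cDNA elimination criteria above can be written as a single predicate. The following is a minimal sketch, not the pipeline actually used: the record type `CdnaAlignment`, its field names, and the function `passes_cdna_filters` are hypothetical, and the coverage rule is read here as "spans at least 97% of the transcript or spans the full coding sequence", which is one interpretation of the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CdnaAlignment:
    """Hypothetical summary of one cDNA-to-genome alignment."""
    transcript_coverage: float        # fraction of the transcript aligned, 0..1
    spans_full_cds: bool              # alignment covers the whole coding sequence
    identity: float                   # percent identity to the genome, 0..100
    runner_up_identity: Optional[float]  # best competing placement, if any
    exon_count: int
    has_annotated_start_and_stop: bool


def passes_cdna_filters(aln: CdnaAlignment) -> bool:
    """Return True if the alignment survives every elimination rule."""
    # Coverage: assumed to pass on either 97% of transcript or full CDS.
    if aln.transcript_coverage < 0.97 and not aln.spans_full_cds:
        return False
    # Identity to the genome sequence must be at least 97%.
    if aln.identity < 97.0:
        return False
    # Unique placement: a competing match within 1.0 identity point disqualifies.
    if (aln.runner_up_identity is not None
            and aln.identity - aln.runner_up_identity < 1.0):
        return False
    # Single-exon alignments are discarded.
    if aln.exon_count < 2:
        return False
    # Initiation and termination codons must be correctly annotated.
    return aln.has_annotated_start_and_stop
```

The EST rules follow the same pattern with a 95% identity threshold and the additional library, overlap, and splice-consensus checks.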

Identification of orthologous alternative splicing (AS) exons. For each EST or cDNA that was not itself full length as judged by comparison to the reference cDNA, we inferred a putative reconstructed full-length structure by combining it with the upstream and downstream structure from the corresponding reference cDNA. The reconstructed coding sequence was then conceptually translated to identify the reading frame and the location of the stop codon. If the latter was ≥50 bp upstream of the final exon-exon junction, the corresponding transcript was designated nonsense-mediated mRNA decay (NMD)-inducing.
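The 50-bp rule can be illustrated with a short sketch. The function names `first_stop_end` and `is_nmd_inducing` are hypothetical, coordinates are 0-based positions on the spliced transcript, and measuring the distance from the end of the stop codon to the final junction is an assumption; the text does not state the exact reference point.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_end(mrna, cds_start):
    """Conceptually translate from cds_start and return the transcript
    coordinate just past the first in-frame stop codon, or None."""
    for i in range(cds_start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i + 3
    return None


def is_nmd_inducing(exon_lengths, stop_end, min_distance=50):
    """Apply the 50-bp rule to a spliced transcript.

    exon_lengths: lengths of the exons in transcript order.
    stop_end: transcript coordinate just past the stop codon (assumed
              reference point for the distance).
    """
    if len(exon_lengths) < 2:
        return False  # no exon-exon junction at all
    # Position of the final exon-exon junction on the spliced transcript.
    final_junction = sum(exon_lengths[:-1])
    return final_junction - stop_end >= min_distance
```

For example, with three 100-bp exons the final junction lies at position 200, so a stop codon ending at position 150 satisfies the rule whereas one ending at position 151 does not.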

Although in some cases the reconstructed structure may be incorrect owing to undetected additional alternative splices, alternative transcription starts, or polyadenylation sites in the upstream or downstream region, in most cases there is a single EST or cDNA spanning the entire region required to support relevant inferences of frame or NMD status. For example, 95% (342 of 361) of the putative NMD-inducing isoforms in Table 1 have at least one EST or cDNA that simultaneously includes the frame-altering alternative splice, the termination codon, and an exon-exon junction ≥50 bp downstream from the termination codon on a single transcript, whereas 75% (272 of 361) have such evidence in both species. Moreover, in most putative NMD cases a premature termination codon (PTC) satisfying the 50-bp rule is present in each reading frame, implying that assignment of NMD status does not depend on correct inference of the upstream structure.

Instances of exon-skipping AS were identified by comparing the genomic coordinates of pairs of reconstructed cDNAs from the same gene. To be classified as exon-skipping, we required that there be a pair of transcripts, one including and one excluding the "cassette" (skipped exon or exons), and both using the same donor site in the exon upstream of the cassette and the same acceptor site in the exon downstream of the cassette. Exons for which all EST and cDNA alignments imply the same genomic coordinates for both splice sites were classified as CS. The blat alignment sometimes spuriously merges short initial or final segments of ESTs to the adjacent exon; such exons are omitted from the present analysis because they are classified neither as CS nor as exon-skipping AS variants. In AS cases, when there were multiple transcript structures satisfying the above criteria but differing in the splice sites used (in either the cassette or flanking exons), we in general selected the structure supported by the largest number of transcripts; for the NMD categories, we chose where possible a structure having direct evidence for a ≥50-bp PTC within a single transcript (choosing the most highly expressed when there was more than one such structure).
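The pairwise cassette criterion can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: transcripts are represented as ordered lists of half-open `(start, end)` exon intervals on the same strand, and the helper names `introns` and `skipped_cassettes` are hypothetical.

```python
def introns(exons):
    """Return (donor, acceptor) coordinate pairs implied by consecutive exons."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def skipped_cassettes(including, excluding):
    """Find cassettes present in `including` and skipped by `excluding`.

    A cassette is reported when one intron of `excluding` uses the same
    donor site as an intron of `including` and the same acceptor site as a
    later intron of `including`, so that at least one exon lies in between.
    """
    inc_introns = introns(including)
    cassettes = []
    for donor, acceptor in introns(excluding):
        starts = [i for i, (d, _) in enumerate(inc_introns) if d == donor]
        ends = [i for i, (_, a) in enumerate(inc_introns) if a == acceptor]
        if starts and ends and ends[0] > starts[0]:
            first, last = starts[0], ends[0]
            # Exons strictly between the shared donor and shared acceptor.
            cassettes.append(tuple(including[first + 1:last + 1]))
    return cassettes
```

A cassette of length one corresponds to single-exon skipping and a longer cassette to multiple-exon skipping; two transcripts with identical splice sites yield no cassette.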

We used the University of California, Santa Cruz, BLASTZ alignments of the mouse and human genomes to identify orthologous exons having identical predicted status (CS, or frame-preserving, non-frame-preserving non-NMD, or NMD-inducing AS) in both species. The translated amino acid sequences of orthologous exons were aligned by using the Smith-Waterman algorithm (7) with the BLOSUM80 score matrix (8), except when there were frame-shifting indels, in which case the BLASTZ alignment was used instead. Orthologous AS exon pairs not agreeing in NMD or frame-preservation status were disregarded. These cases constituted a small minority of all AS pairs and are tabulated in Table 3.

Nc (effective number of conserved nucleotides) for synonymous sites in exons. (Notation as in Effective Number of Conserved Nucleotides in Materials and Methods.) For 4-fold degenerate sites, $r_o$ is the fraction of 4-fold degenerate sites that are identical. For 2-fold sites, it is necessary to count separately the number of (synonymous) transition differences $ti_2$, the number of (nonsynonymous) transversion differences $tv_2$, and the number of positions that are identical $i_2$. Note that $i_2$ overestimates the "true" number of identical positions, because some identical positions represent cases where a transversion occurred but was selected against because the amino acid change is deleterious. We estimate the latter as follows: in neutrally evolving DNA, transversions occur at roughly half the rate of transitions, so the "true" number of transversions (had all of them been observed) is estimated to be $0.5\,ti_2$. Of these, $tv_2$ were observed, whereas the other $0.5\,ti_2 - tv_2$ were not. We therefore correct $i_2$ by subtracting from it the unobserved transversions, obtaining $i'_2 = i_2 - (0.5\,ti_2 - tv_2)$. Then $r_o$ for the 2-fold degenerate sites is just $i'_2 / l_2$, where $l_2$ is the total number of 2-fold degenerate sites. For both 2-fold and 4-fold degenerate sites, we again set $r_c$ to 1.0, and $r_n$ is measured in noncoding DNA within the 2-Mbp region surrounding the exon in question. From these values, we can compute Nc separately for 2- and 4-fold sites and then add them to get the synonymous Nc for the exon.
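As a worked illustration of the correction for 2-fold degenerate sites, the sketch below computes the observed identity fraction from the four counts named in the text. The function names are hypothetical, no clamping is applied when the observed transversions exceed the estimate (the text does not say how that case is handled), and the final step from these fractions to Nc is left out because it relies on the formula given in Materials and Methods.

```python
def fourfold_ro(identical, total):
    """Observed identity fraction at 4-fold degenerate sites."""
    return identical / total


def twofold_ro(i2, ti2, tv2, l2):
    """Corrected observed identity fraction at 2-fold degenerate sites.

    i2:  identical positions
    ti2: (synonymous) transition differences
    tv2: (nonsynonymous) transversion differences actually observed
    l2:  total number of 2-fold degenerate sites
    """
    # Transversions expected under neutrality at half the transition rate.
    expected_tv = 0.5 * ti2
    # Those that were presumably removed by selection and so look identical.
    unobserved_tv = expected_tv - tv2
    corrected_i2 = i2 - unobserved_tv
    return corrected_i2 / l2


# Example: 100 sites with 80 identical, 16 transitions, 4 transversions.
print(twofold_ro(i2=80, ti2=16, tv2=4, l2=100))
```

In this example 8 transversions are expected, 4 were observed, and the remaining 4 are subtracted from the 80 identical positions.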

Filters to Remove Potentially Misclassified Exons. Our data set of conserved AS and CS exons likely contains some functionally CS exons that are misclassified as AS exons because of aberrant splicing events that have occurred in both mouse and human, and some AS exons that are misclassified as CS exons because there are insufficient data to reveal both isoforms. We attempt to discriminate true CS from AS exons (single-exon skipping cases) by using seven characteristics for which they are known to differ: sum of 5' and 3' splice site scores; combined PI (percent identity between mouse and human sequence) in 5' and 3' splice site 20-bp windows; combined PI in upstream and downstream intronic 20-bp regions immediately adjacent to the splice site windows; exonic PI; combined effective number of conserved nucleotides in upstream and downstream introns; exon size; and a binary variable indicating whether the exon size is a multiple of three. Our method constructs a nonparametric approximate log likelihood LL(E) (defined for each exon E) and is motivated by the Neyman-Pearson lemma (9), which implies that log-likelihood ratios provide optimal discrimination. For a given $i = 1, \ldots, 7$ and an exon E, let $s_i(E)$ represent the value of the i-th characteristic for E. We define the approximate log likelihood $LL_i(E)$ as $\log_2\bigl(f_i^{AS}(I(E)) / f_i^{CS}(I(E))\bigr)$, where $I(E)$ is an interval on the real line containing $s_i(E)$, and $f_i^{*}(I)$ represents the proportion of exons of type * (AS or CS) in our data set having scores $s_i$ that fall within $I$. The interval $I(E)$ is chosen to be as small as possible, subject to the requirement that the standard error in the estimate of $LL_i(E)$ must not exceed a specified threshold. We exclude E itself from the calculation of $f_i^{AS}(I(E))$ and $f_i^{CS}(I(E))$ to reduce bias. An overall log likelihood LL(E) is then computed by summing the $LL_i(E)$. (Note that for this sum to be a legitimate log likelihood, the indicated characteristics would have to be independent of each other, which is not strictly true. Hence LL(E) is not guaranteed to be an optimal discriminator.)
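The construction of LL(E) can be sketched as below. Several details are assumptions and not taken from the text: the interval is grown symmetrically around the exon's score, the standard error is a delta-method approximation for the log of a ratio of two proportions, and `max_se` is a placeholder threshold. The function names are hypothetical, and the score lists passed in are expected to exclude the exon being scored.

```python
import math


def ll_feature(x, as_scores, cs_scores, max_se):
    """Approximate log2(f_AS(I) / f_CS(I)) over the smallest symmetric
    interval I around x whose estimated standard error is <= max_se."""
    n_as, n_cs = len(as_scores), len(cs_scores)
    # Candidate half-widths: distance from x to each observed score.
    widths = sorted({abs(s - x) for s in list(as_scores) + list(cs_scores)})
    for w in widths:
        k_as = sum(1 for s in as_scores if abs(s - x) <= w)
        k_cs = sum(1 for s in cs_scores if abs(s - x) <= w)
        if k_as == 0 or k_cs == 0:
            continue  # ratio undefined, keep widening
        # Assumed delta-method variance of ln(p_AS / p_CS).
        var = max(0.0, 1 / k_as - 1 / n_as + 1 / k_cs - 1 / n_cs)
        if math.sqrt(var) / math.log(2) <= max_se:
            return math.log2((k_as * n_cs) / (k_cs * n_as))
    return 0.0  # no usable interval: treat the feature as uninformative


def ll_total(scores, as_table, cs_table, max_se):
    """Sum the per-characteristic terms to obtain LL(E)."""
    return sum(
        ll_feature(x, as_table[i], cs_table[i], max_se)
        for i, x in enumerate(scores)
    )


def keep_exon(label, ll):
    """Filtering rule: drop CS exons with LL >= 0 and AS exons with LL < 0."""
    return ll < 0 if label == "CS" else ll >= 0
```

Because the per-characteristic terms are simply summed, the sketch inherits the independence caveat stated in the note above.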

1. Kent, W. J. (2002) Genome Res. 12, 656-664.

2. Pruitt, K. D., Katz, K. S., Sicotte, H. & Maglott, D. R. (2000) Trends Genet. 16, 44-47.

3. Imanishi, T., Itoh, T., Suzuki, Y., O’Donovan, C., Fukuchi, S., Koyanagi, K. O., Barrero, R. A., Tamura, T., Yamaguchi-Kabata, Y., Tanino, M., et al. (2004) PLoS Biol. 2, E162.

4. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Nature 409, 860-921.

5. Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., Saito, R., Suzuki, H., et al. (2002) Nature 420, 563-573.

6. Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. (2002) Nature 420, 520-562.

7. Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197.

8. Henikoff, S. & Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA 89, 10915-10919.

9. Neyman, J. & Pearson, E. (1933) Philos. Trans. R. Soc. London Ser. A 231, 289-337.

This Article

  1. PNAS September 6, 2005 vol. 102 no. 36 12813-12818