Smoothed PhyloCSF Tracks

PhyloCSF Track Hub

Description

These tracks show evolutionary protein-coding potential as determined by PhyloCSF [1] to help identify conserved, functional, protein-coding regions of genomes. PhyloCSF (Phylogenetic Codon Substitution Frequencies) examines evolutionary signatures characteristic of alignments of conserved coding regions, such as the high frequencies of synonymous codon substitutions and conservative amino acid substitutions, and the low frequencies of other missense and nonsense substitutions. PhyloCSF provides more information than conservation of the amino acid sequence, because it distinguishes the different codons that code for the same amino acid. One of PhyloCSF's main current applications is to help distinguish protein-coding and non-coding RNAs represented among novel transcript models obtained from high-throughput transcriptome sequencing. More information on PhyloCSF can be found on the PhyloCSF wiki.

The Raw PhyloCSF tracks show the PhyloCSF score for each codon in each of the 6 reading frames (3 on each strand). Regions in which most codons have a score greater than 0 are likely to be protein-coding in that frame. No score is shown when the relative branch length is less than 0.1 (see PhyloCSF Power).

The Smoothed PhyloCSF tracks show the scores smoothed using a hidden Markov model (HMM). No score is shown when the relative branch length is less than 0.1 (see PhyloCSF Power).

The PhyloCSF Regions tracks show the regions that are protein-coding in the most likely path through the HMM. They resemble predicted exons, except that splice sites, start codons, and stop codons are not considered, so the boundaries are approximate. The gray scale indicates the maximum log-odds, according to the HMM, that any codon in the region is coding.
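The relationship between the per-codon HMM output and a Regions track entry can be sketched as follows. This is a minimal illustration, not the code used to build the tracks: the function name `coding_regions`, the half-open codon-index coordinates, and the toy values are all assumptions.

```python
from itertools import groupby

def coding_regions(path_is_coding, log_odds):
    """Collapse a per-codon coding/non-coding path into regions.

    Returns (start, end, max_log_odds) tuples in half-open codon indices
    (a hypothetical convention chosen for this sketch).
    """
    regions = []
    index = 0
    for is_coding, run in groupby(path_is_coding):
        length = len(list(run))
        if is_coding:
            # The shading of a region reflects its highest per-codon log-odds.
            regions.append((index, index + length, max(log_odds[index:index + length])))
        index += length
    return regions

# Toy input: True marks codons in the coding state of the most likely path.
path = [False, False, True, True, True, False, True, True]
odds = [-3.0, -1.0, 2.0, 5.0, 4.0, -2.0, 1.0, 3.0]
regions = coding_regions(path, odds)
print(regions)  # [(2, 5, 5.0), (6, 8, 3.0)]
```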

The PhyloCSF Power track shows the branch length score at each codon, i.e., the ratio of the phylogenetic branch length of the species present in the local alignment to the total branch length of all species in the full genome alignment. It is an indication of the statistical power available to PhyloCSF. Codons with branch length score less than 0.1 have been excluded altogether (from all tracks) because PhyloCSF does not have sufficient power to get a meaningful score at these codons. Codons with branch length score greater than 0.1 but much less than 1 should be considered as having low confidence.
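As a rough illustration of the branch length score, the sketch below computes the ratio for a toy three-species tree. The tree, its branch lengths, and the function names are hypothetical. The sketch also assumes that "the branch length of the species present" means the total length of the smallest subtree connecting those species; the tracks themselves may compute it differently. Only the 0.1 cut-off comes from the description above.

```python
MIN_POWER = 0.1  # codons below this relative branch length are excluded from all tracks

def spanned_length(parent, length, species):
    """Total length of the edges in the smallest subtree connecting `species`.

    parent: maps each non-root node to its parent.
    length: maps each non-root node to the length of the edge above it.
    """
    species = set(species)
    below = dict.fromkeys(length, 0)  # selected leaves found under each edge
    for leaf in species:
        node = leaf
        while node in parent:
            below[node] += 1
            node = parent[node]
    # An edge belongs to the connecting subtree only if it separates some
    # selected leaves from the others.
    return sum(length[n] for n, k in below.items() if 0 < k < len(species))

def relative_branch_length(parent, length, all_species, present):
    total = spanned_length(parent, length, all_species)
    return spanned_length(parent, length, present) / total if total else 0.0

# Hypothetical tree: ((human:0.1, chimp:0.1)anc1:0.2, mouse:0.5)root
parent = {"human": "anc1", "chimp": "anc1", "anc1": "root", "mouse": "root"}
length = {"human": 0.1, "chimp": 0.1, "anc1": 0.2, "mouse": 0.5}
species = ["human", "chimp", "mouse"]

for present in (["human", "chimp"], ["human", "mouse"], ["human"]):
    power = relative_branch_length(parent, length, species, present)
    status = "scored" if power >= MIN_POWER else "excluded"
    print(present, round(power, 3), status)
```

In this toy tree an alignment containing only human and chimp keeps about 22% of the total branch length, so it would be scored but with low confidence, while an alignment containing only the reference species has a ratio of 0 and would be excluded.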

The Splice Prediction tracks show canonical splice predictions for each strand. This can be useful for predicting where novel exons start and end. The bars are on the first or last base of the hypothetical intron, i.e., the G of GT or AG. Green bars indicate splice donors and red bars indicate acceptors, so an intron would extend from a green bar to a red bar. Taller bars are more likely to be true splice sites.
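To make the bar positions concrete, the sketch below locates canonical GT and AG dinucleotides on a plus-strand sequence and pairs them into hypothetical introns running from a donor bar to an acceptor bar. It uses 0-based coordinates and an invented toy sequence, and it does not attempt to score the sites, so it says nothing about bar height; the function names are assumptions made for this example.

```python
def canonical_sites(seq):
    """Return (donors, acceptors) as 0-based positions on the plus strand.

    A donor bar sits on the G of GT (first base of the intron);
    an acceptor bar sits on the G of AG (last base of the intron).
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i + 1 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors

def candidate_introns(donors, acceptors):
    """Pair each donor with every downstream acceptor.

    GT...AG spans at least 4 bases, so the acceptor G must lie at least
    3 positions past the donor G.
    """
    return [(d, a) for d in donors for a in acceptors if a >= d + 3]

seq = "CCGGTAAGTCCCTTTCAGGC"  # toy sequence, not taken from any genome
donors, acceptors = canonical_sites(seq)
introns = candidate_introns(donors, acceptors)
print(donors, acceptors)  # [3, 7] [7, 17]
print(introns)            # [(3, 7), (3, 17), (7, 17)]
```

For the minus strand the same search would be applied to the reverse complement, with coordinates mapped back to the reference.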

The PhyloCSF Novel tracks (a.k.a. PhyloCSF Candidate Coding Region or PCCR tracks) show regions that could be protein-coding but are not currently annotated as protein-coding exons or pseudogenes in a specified gene set. Green regions are on the plus strand and red regions are on the minus strand. These regions have been ranked using a support vector machine (SVM): regions most likely to be real novel coding or pseudogene regions have low rank (darker shading), and regions more likely to be non-coding false positives have high rank (lighter shading). The rank is listed next to the region.

Click on a PCCR to bring up its details page. The page includes information useful for deciding whether the region is protein-coding, including a link to CodAlignView [2], which shows the alignment of that region, color-coded to indicate protein-coding signatures.

Caveats

Some annotated protein-coding regions get a score less than 0 (for example, around 10% in human). This can happen for various reasons. For example, the region could be coding in the reference species but not in other species, or the alignment might not represent a true orthology relationship between the species.
Protein-coding regions will often have a positive score on the reverse strand in the frame in which the third codon positions match up (the "antisense" frame), though the score will usually be higher on the correct strand.
Pseudogenes will often get positive scores even though they are not protein-coding. One reason is that PhyloCSF measures coding potential on the whole species tree, so a unitary pseudogene that was coding in the common ancestor but is no longer protein-coding in the reference species will often still have a positive score. Another reason is that the aligner might align a duplicated pseudogene to the orthologs of its parent. The CACOFONy score in the details page can help identify this situation; CACOFONy, short for Coding Alignment COllision FractiON, is the fraction of the alignment of the region that is identical to the alignment of some annotated coding region.

Methods

PhyloCSF was run with the "fixed" strategy on every codon in every frame on each strand in the assembly.

The alignments and PhyloCSF parameters for the various assemblies are as follows:

Assembly	PhyloCSF Parameters	Alignments
hg19	29mammals	29-mammals subset of the 46-vertebrates hg19 alignment
hg38	58mammals	58-mammals subset of the 100-vertebrates hg38 alignment
mm10	29mammals	29-mammals subset of the 60-vertebrates mm10 alignment
mm39	58mammals	30-placental-mammals subset of the 35-vertebrates mm39 alignment
dm6	23flies	23-Drosophila subset of the 27-insect dm6 alignment
ce11	8worms	8-Caenorhabditis subset of the 26-worm ce11 alignment
galGal4	49birds	49 sauropsids alignment (unpublished)
galGal6	53birds	53-bird subset of the 77-vertebrates galGal6 alignment
wuhCor1 (SARS-CoV-2)	29mammals	44 Sarbecovirus genomes

The scores were smoothed using a hidden Markov model (HMM) with four states, one representing coding regions and three representing non-coding regions. The emission of each codon is its PhyloCSF score. The ratio of the emission probabilities under the coding and non-coding models is computed from the PhyloCSF score, which represents the log-likelihood ratio of the alignment under those two models. The three non-coding states have the same emission probabilities but different transition probabilities (apart from remaining in the same state, they can transition only to coding) to better capture the multimodal distribution of gaps between same-frame coding exons. These transition probabilities represent the best approximation of this gap distribution as a mixture of three exponential distributions, computed using Expectation Maximization.
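The mixture-model step can be illustrated with a small Expectation Maximization routine for a mixture of exponential distributions. This is a generic sketch under stated assumptions, not the procedure used to build the tracks: the gap lengths are synthetic, the function name `fit_exponential_mixture` is hypothetical, and the initialization and iteration count are arbitrary choices.

```python
import math
import random

def fit_exponential_mixture(xs, k=3, iters=100):
    """Fit weights and rates of a k-component exponential mixture by EM."""
    n = len(xs)
    mean = sum(xs) / n
    weights = [1.0 / k] * k
    # Deterministic start: component means spread geometrically around the sample mean.
    rates = [1.0 / (mean * 10.0 ** (j - (k - 1) / 2)) for j in range(k)]
    for _ in range(iters):
        resp_sum = [0.0] * k  # sum of responsibilities per component
        resp_x = [0.0] * k    # responsibility-weighted sum of x per component
        for x in xs:
            # E-step, in log space to avoid underflow for long gaps.
            logp = [math.log(w) + math.log(r) - r * x for w, r in zip(weights, rates)]
            m = max(logp)
            p = [math.exp(v - m) for v in logp]
            z = sum(p)
            for j in range(k):
                g = p[j] / z
                resp_sum[j] += g
                resp_x[j] += g * x
        # M-step; the guards keep a vanishing component from breaking the logs.
        weights = [max(s / n, 1e-12) for s in resp_sum]
        rates = [s / sx if sx > 0 else r for s, sx, r in zip(resp_sum, resp_x, rates)]
    return weights, rates

# Synthetic gap lengths drawn from three exponentials with very different means.
rng = random.Random(0)
gaps = [rng.expovariate(1.0 / m) for m in (30.0, 3000.0, 300000.0) for _ in range(300)]
weights, rates = fit_exponential_mixture(gaps)
for w, r in zip(weights, rates):
    print(f"weight {w:.3f}  mean gap {1.0 / r:.1f}")
```

In an HMM with one non-coding state per component, each fitted rate would correspond to the probability of leaving that state, and each weight to the probability of entering it from the coding state.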

The HMM defines a probability that each codon is coding, based on the PhyloCSF scores of that codon and nearby codons on the same strand in the same frame, without taking into account start codons, stop codons, or potential splice sites. PhyloCSF+1 shows the log-odds that codons in frame 1 on the '+' strand are in the coding state according to the HMM, and similarly for strand '-' and frames 2 and 3. The PhyloCSF Regions tracks show the genomic intervals that are in the protein-coding state in the most likely path through the HMM.
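The two HMM outputs described above, the per-codon log-odds and the most likely path, can be sketched for a single frame as follows. Everything numeric here is made up for illustration: the transition and start probabilities, the toy scores, and the function names are assumptions. The conversion from score to natural-log likelihood ratio assumes scores in decibans. Because only the ratio of emission probabilities matters, the sketch sets the non-coding log emission to 0 and the coding log emission to the log-likelihood ratio.

```python
import math

NEG_INF = float("-inf")
DECIBAN = math.log(10.0) / 10.0  # assumed unit conversion to natural log

def log_transitions(stay_coding, nc_weights, nc_exit):
    """Log transition matrix; state 0 is coding, states 1.. are non-coding."""
    n = 1 + len(nc_weights)
    trans = [[NEG_INF] * n for _ in range(n)]
    trans[0][0] = math.log(stay_coding)
    for k, (w, q) in enumerate(zip(nc_weights, nc_exit), start=1):
        trans[0][k] = math.log((1.0 - stay_coding) * w)  # coding -> k-th gap state
        trans[k][k] = math.log(1.0 - q)                  # stay in the same gap state
        trans[k][0] = math.log(q)                        # leave the gap: only to coding
    return trans

def emit(state, llr):
    return llr if state == 0 else 0.0

def logsumexp(values):
    m = max(values)
    return m if m == NEG_INF else m + math.log(sum(math.exp(v - m) for v in values))

def viterbi_is_coding(llrs, trans, log_start):
    """Most likely state path, reported as True where the state is coding."""
    n = len(trans)
    best = [log_start[s] + emit(s, llrs[0]) for s in range(n)]
    pointers = []
    for x in llrs[1:]:
        ptr = [max(range(n), key=lambda i, j=j: best[i] + trans[i][j]) for j in range(n)]
        best = [best[ptr[j]] + trans[ptr[j]][j] + emit(j, x) for j in range(n)]
        pointers.append(ptr)
    state = max(range(n), key=lambda s: best[s])
    path = [state]
    for ptr in reversed(pointers):
        state = ptr[state]
        path.append(state)
    return [s == 0 for s in reversed(path)]

def posterior_log_odds(llrs, trans, log_start):
    """Forward-backward log-odds that each codon is in the coding state."""
    n, length = len(trans), len(llrs)
    fwd = [[log_start[s] + emit(s, llrs[0]) for s in range(n)]]
    for x in llrs[1:]:
        prev = fwd[-1]
        fwd.append([logsumexp([prev[i] + trans[i][j] for i in range(n)]) + emit(j, x)
                    for j in range(n)])
    bwd = [[0.0] * n for _ in range(length)]
    for t in range(length - 2, -1, -1):
        bwd[t] = [logsumexp([trans[i][j] + emit(j, llrs[t + 1]) + bwd[t + 1][j]
                             for j in range(n)]) for i in range(n)]
    return [f[0] + b[0] - logsumexp([f[s] + b[s] for s in range(1, n)])
            for f, b in zip(fwd, bwd)]

scores = [-5.0] * 6 + [8.0] * 8 + [-5.0] * 6  # toy scores for one frame
llrs = [s * DECIBAN for s in scores]
trans = log_transitions(0.98, [0.5, 0.3, 0.2], [0.1, 0.01, 0.001])
log_start = [math.log(p) for p in (0.01, 0.33, 0.33, 0.33)]
is_coding = viterbi_is_coding(llrs, trans, log_start)
log_odds = posterior_log_odds(llrs, trans, log_start)
print("".join("C" if c else "." for c in is_coding))  # ......CCCCCCCC......
```

The log-odds values correspond to what a smoothed track would display for this frame, and the runs of `C` correspond to the intervals a Regions track would report.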

Splice sites were predicted using the maximum entropy method [3].

The PhyloCSF Novel (PCCR) tracks were created as follows. All regions from the PhyloCSF Regions tracks were compared to protein-coding and pseudogene annotations from the specified gene set. Regions contained in annotated pseudogene regions, or in protein-coding regions in the same frame or the antisense frame, were eliminated. If part of a region was contained in an annotated region, the region was trimmed to the unannotated portion. Regions shorter than nine codons were eliminated. PhyloCSF scores for each region were recomputed using the "mle" strategy, except for hg19, galGal4, mm10 version 5, and GENCODE hg38 versions 25 and earlier. Regions that were more likely to be antisense to a novel region than to be novel themselves were identified using a three-feature SVM and excluded, the features being the PhyloCSF score, the difference between the PhyloCSF scores on the two strands, and the length of the region. Ranks were assigned using a four-feature SVM trained to distinguish regions that are contained in coding annotations from ones that are not, the features being the three features mentioned previously and the relative branch length of the species in the local alignment of the region.
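The trimming and length filter can be sketched with simple interval arithmetic. This is a simplified illustration rather than the pipeline itself: coordinates are half-open codon indices within one frame, the annotations are assumed to be already restricted to the relevant frames, and keeping every leftover piece as a separate region is an assumption the description does not settle. The function names are hypothetical; only the nine-codon minimum comes from the text.

```python
MIN_CODONS = 9  # regions shorter than this are eliminated

def subtract_intervals(region, annotated):
    """Return the parts of `region` not covered by any annotated interval."""
    start, end = region
    pieces = []
    cursor = start
    for a_start, a_end in sorted(annotated):
        if a_end <= cursor or a_start >= end:
            continue  # no overlap with the part of the region still uncovered
        if a_start > cursor:
            pieces.append((cursor, a_start))
        cursor = max(cursor, a_end)
    if cursor < end:
        pieces.append((cursor, end))
    return pieces

def novel_pieces(region, annotated, min_codons=MIN_CODONS):
    """Trim a region to its unannotated portions and drop short leftovers."""
    return [(s, e) for s, e in subtract_intervals(region, annotated)
            if e - s >= min_codons]

# Partly annotated region: two unannotated pieces of 15 codons each survive.
print(novel_pieces((100, 140), [(90, 105), (120, 125)]))  # [(105, 120), (125, 140)]
# Leftovers of 5 and 6 codons are too short.
print(novel_pieces((0, 20), [(5, 14)]))                   # []
# Fully annotated region is eliminated.
print(novel_pieces((10, 30), [(0, 50)]))                  # []
```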

Reference [4] includes a more detailed description of these algorithms, and Python code implementing them.

Credits

Questions should be directed to Irwin Jungreis.

Citing the PhyloCSF Tracks

If you use the PhyloCSF browser tracks, please cite Mudge et al. 2019 [4].

If you use the tracks for wuhCor1/SARS-CoV-2, please also cite Jungreis et al. 2021 [5].

References

[1] Lin MF, Jungreis I, Kellis M (2011). PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions. Bioinformatics 27(13), i275-i282. doi:10.1093/bioinformatics/btr209

[2] Jungreis I, Lin MF, Chan CS, Kellis M (2016). CodAlignView: The Codon Alignment Viewer [Internet]. Available from: http://data.broadinstitute.org/compbio1/cav.php

[3] Yeo G, Burge CB (2004). Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. Journal of Computational Biology 11(2-3), 377–394. doi:10.1089/1066527041410418

[4] Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright J, Kay M, Davidson C, Fitzgerald S, Seal R, Tweedie S, He L, Waterhouse RM, Li Y, Bruford E, Choudhary J, Frankish A, Kellis M (2019). Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Research gr-246462. doi:10.1101/gr.246462.118

[5] Jungreis I, Sealfon R, Kellis M (2021). SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nature Communications 12(1), 1-20. doi:10.1038/s41467-021-22905-7