UbPred: predictor of protein ubiquitination sites


Usage

Input

The input to UbPred is a protein sequence in FASTA format, which can be pasted into the text box or uploaded as a file. Only the 20 conventional amino acid symbols are supported; entering any of the ambiguous or nonstandard symbols (B, J, O, U, X, Z) will produce an error. The sequence must be at least 25 residues long and must contain at least one lysine (K).
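The input rules above can be checked locally before a sequence is submitted. The following is a minimal sketch, not the server's own validation code: the function names `parse_single_fasta` and `check_sequence` are hypothetical, and it assumes that a header line is required and that lowercase residues are accepted and converted to uppercase.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 conventional amino acids
MIN_LENGTH = 25  # minimum sequence length stated in the usage notes


def parse_single_fasta(text):
    """Return (header, sequence) from FASTA text holding exactly one record."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        # Assumption: a header line is required.
        raise ValueError("input must start with a FASTA header line ('>')")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("only one sequence may be submitted at a time")
    # Assumption: lowercase input is tolerated and uppercased.
    return lines[0][1:], "".join(lines[1:]).upper()


def check_sequence(seq):
    """Raise ValueError if seq breaks one of the stated input rules."""
    bad = sorted(set(seq) - VALID_RESIDUES)
    if bad:
        raise ValueError("unsupported symbols: " + "".join(bad))
    if len(seq) < MIN_LENGTH:
        raise ValueError("sequence must be at least 25 residues long")
    if "K" not in seq:
        raise ValueError("sequence must contain at least one lysine (K)")
    return seq
```

For example, `check_sequence(parse_single_fasta(text)[1])` returns the cleaned sequence if it passes all three checks and raises `ValueError` otherwise.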

Note (short): Because of limited computational resources, we allow each user to run a prediction on only one sequence at a time.

Note (longer): UbPred uses evolutionary sequence features, which are extracted from a PSSM profile created by PSI-BLAST. If the PSSM for your sequence is already stored in our database, prediction results are displayed without delay. If the PSSM has not been generated previously, we have to run PSI-BLAST, which may take up to 45 minutes for some sequences, depending on the sequence length. In this case the prediction is sent by e-mail, and the wait time also depends on the number of requests already in the prediction queue. To keep the queue from growing so large that the wait becomes intolerable for everyone, we limit the number of prediction requests that any individual can have in the queue at the same time. After a prediction has completed, you can submit another request, which will be added to the end of the queue.

We also provide stand-alone versions of the predictor (Linux and Windows) that you can download and install on your workstation.

Output

The output consists of 3 columns:

  1. position
  2. ubiquitination score
  3. ubiquitination annotation (yes for a positive result, no for a negative result).

Predictions are made for all lysine residues of a query sequence. Only lysines with a score ≥ 0.62 are predicted to be ubiquitinated.

Depending on the actual predictor score, we label ubiquitination predictions as low, medium or high confidence. Sensitivity and specificity estimates for different predictor scores are given in the following table:

Label              Score range        Sensitivity  Specificity
Low confidence     0.62 ≤ s ≤ 0.69    0.464        0.903
Medium confidence  0.69 ≤ s ≤ 0.84    0.346        0.950
High confidence    0.84 ≤ s ≤ 1.00    0.197        0.989
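The mapping from score to annotation and confidence label can be sketched as follows. The score ranges in the table share their endpoints (0.69 and 0.84), so this sketch assumes that a score lying exactly on a boundary falls into the higher-confidence class; that convention, along with the names `confidence_label` and `annotate`, is an assumption and is not stated by UbPred.

```python
THRESHOLD = 0.62  # scores at or above this value are annotated "yes"


def confidence_label(score):
    """Return 'low', 'medium' or 'high', or None for a negative prediction."""
    if score < THRESHOLD:
        return None
    if score < 0.69:  # assumption: 0.69 itself counts as medium
        return "low"
    if score < 0.84:  # assumption: 0.84 itself counts as high
        return "medium"
    return "high"


def annotate(predictions):
    """Turn (position, score) pairs into (position, score, yes/no, label) rows."""
    rows = []
    for position, score in predictions:
        label = confidence_label(score)
        rows.append((position, score, "yes" if label else "no", label))
    return rows
```

Calling `annotate([(5, 0.9), (12, 0.3)])` yields one high-confidence "yes" row for position 5 and one "no" row for position 12.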

Datasets

Positive examples of ubiquitination sites were extracted from two large-scale proteomics studies (Peng, et al., 2003; Hitchcock, et al., 2003), our own experiments, and an ad hoc literature search. These lysine ubiquitination sites were present in 201 proteins from S. cerevisiae. From these proteins, we extracted 272 ubiquitinated (positive) fragments, each containing up to 12 residues on each side of the central lysine residue. The set of 4651 nonubiquitinated (negative) fragments was extracted from 124 mitochondrial matrix proteins.
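Fragment extraction as described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the function name `lysine_fragments` is hypothetical, positions are reported 1-based, and fragments near a terminus are simply truncated (hence "up to" 12 flanking residues).

```python
FLANK = 12  # maximum number of residues kept on each side of the lysine


def lysine_fragments(seq, flank=FLANK):
    """Yield (position, fragment) for every lysine in seq.

    position is 1-based (an assumption); the fragment is shorter than
    2 * flank + 1 residues when the lysine lies near either terminus.
    """
    for i, residue in enumerate(seq):
        if residue == "K":
            start = max(0, i - flank)
            yield i + 1, seq[start:i + flank + 1]
```

A lysine in the middle of a long sequence produces a 25-residue fragment with the lysine at its centre; a lysine at position 6 produces a fragment with only 5 upstream residues.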

To obtain a nonredundant dataset, no two fragments, whether within the positive set, within the negative set, or across the two sets, were allowed to share 40% sequence identity. When a positive and a negative fragment were similar, the negative fragment was always removed because its label is less reliable. The 40% sequence identity cutoff lies well below the levels that provide accurate functional inference by homology transfer (Rost, et al., 2003), which allows us to consider our dataset nonredundant. The resulting datasets contained 265 positive and 4431 negative fragments.
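One way to implement this kind of filtering is a greedy pass that processes positives first, so that a negative fragment is the one dropped whenever it resembles a positive. The sketch below is not the procedure used to build the UbPred datasets: the `identity` function is a deliberately simple stand-in (ungapped, left-aligned comparison), the greedy order is an assumption, and treating the 40% cutoff as inclusive is also an assumption.

```python
def identity(a, b):
    """Fraction of matching positions in an ungapped, left-aligned comparison.

    Stand-in measure for illustration only; the real identity computation
    is not specified in the text.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


def remove_redundancy(positives, negatives, cutoff=0.40, ident=identity):
    """Greedily keep fragments whose identity to every kept one is below cutoff."""
    kept_pos = []
    for frag in positives:
        if all(ident(frag, k) < cutoff for k in kept_pos):
            kept_pos.append(frag)
    kept_neg = []
    for frag in negatives:
        # Positives were kept first, so a negative similar to any positive
        # is the one that gets dropped.
        if all(ident(frag, k) < cutoff for k in kept_pos + kept_neg):
            kept_neg.append(frag)
    return kept_pos, kept_neg
```

A different identity measure, for example one based on a pairwise alignment, can be passed through the `ident` parameter without changing the filtering logic.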

Datasets can be downloaded from here.

Predictor Evaluation

To evaluate UbPred, we used a 100-fold cross-validation strategy. We measured accuracy at the per-residue level by estimating sensitivity (sn) and specificity (sp). Sensitivity is the percentage of ubiquitinated (positive) sites that are predicted to be positive, while specificity is the percentage of non-ubiquitinated (negative) sites that are predicted to be negative (Hastie, et al., 2001). In addition to sn and sp, we report accuracy on a balanced sample (acc), defined as the average of sn and sp, and the area under the ROC curve (AUC); both measures are essentially unaffected by the disparity in class sizes. The ROC curve plots sn against 1 – sp and in our case was estimated by varying the costs of the positive and negative examples during training. The AUCs were estimated using the trapezoid rule.
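The measures defined above can be computed as in the following sketch. It takes the ROC operating points as given rather than reproducing the cost-varying training procedure, the function names are hypothetical, and adding the end points (0, 0) and (1, 1) to the curve before integrating is an assumption.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sn, sp) from parallel sequences of true and predicted classes."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_accuracy(sn, sp):
    """Accuracy on a balanced sample: the average of sn and sp."""
    return (sn + sp) / 2


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoid rule.

    points holds (1 - sp, sn) pairs; the end points (0, 0) and (1, 1)
    are added before integrating (an assumption of this sketch).
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

Because sn and sp are each computed within one class, their average does not change when the negative class is much larger than the positive class, which is the case for the datasets described here.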

References

  1. Hastie, T., Tibshirani, R., and Friedman, J. H. (2001). The elements of statistical learning: data mining, inference, and prediction. New York, NY, Springer Verlag.
  2. Hitchcock, A. L., Auld, K., Gygi, S. P., and Silver, P. A. (2003). A subset of membrane-associated proteins is ubiquitinated in response to mutations in the endoplasmic reticulum degradation machinery. Proc Natl Acad Sci USA, 100(22), 12735-40.
  3. Peng, J., Schwartz, D., Elias, J. E., Thoreen, C. C., Cheng, D., Marsischky, G., Roelofs, J., Finley, D., and Gygi, S. P. (2003). A proteomics approach to understanding protein ubiquitination. Nat Biotechnol, 21(8), 921-6.
  4. Rost, B., Liu, J., Nair, R., Wrzeszczynski, K. O., and Ofran, Y. (2003). Automatic prediction of protein function. Cell Mol Life Sci, 60, 2637-50.