
Balanced ROC analysis (BAROC) protocol for the evaluation of protein similarities

Bibliographic Details
Published in: Journal of biochemical and biophysical methods, 2008-04, Vol. 70 (6), p. 1210-1214
Main Authors: Busa-Fekete, Róbert, Kertész-Farkas, Attila, Kocsor, András, Pongor, Sándor
Format: Article
Language: English
Description
Summary: Identification of problematic protein classes (domain types, protein families) that are difficult to predict from sequence is a key issue in genome annotation. ROC (Receiver Operating Characteristic) analysis is routinely used for the evaluation of protein similarities; however, its results (the area under the curve, or AUC, values) are differentially biased for protein classes that differ greatly in size. We show that the bias can be compensated for by adjusting the length of the top list in a class-dependent fashion, so that the number of negatives within the top list is equal (or proportional) to the size of the positive class. Using this balanced protocol, the problematic classes can be identified by their AUC values, or by a scatter diagram in which the AUC values are plotted against the positive/negative ratio of the top list. With likelihood-ratio scoring (Kaján et al., Bioinformatics, 22, 2865–2869, 2007), the bias caused by class imbalance can be further decreased.
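The class-dependent cutoff described in the summary can be illustrated with a short sketch. This is not the authors' implementation: the function name `balanced_auc`, the `ratio` parameter, and the ROC_n-style normalisation (area divided by the number of negatives kept times the size of the positive class) are assumptions made for illustration, and tied scores are not treated specially.

```python
from math import ceil


def balanced_auc(scores, labels, ratio=1.0):
    """Truncated AUC with a class-dependent top-list length (illustrative sketch).

    scores: similarity scores, higher means more similar to the query class.
    labels: truthy for members of the positive class, falsy otherwise.
    ratio:  hypothetical knob; the top list ends once ratio * (positive class
            size) negatives have been seen.

    Returns (auc, pos_neg_ratio), where pos_neg_ratio is the positive/negative
    ratio inside the top list.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    # Number of negatives allowed in the top list, tied to the class size.
    max_neg = min(n_neg, max(1, ceil(ratio * n_pos)))

    # Rank all items by decreasing score (ties keep their input order).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    pos_seen = 0
    neg_seen = 0
    area = 0
    for _, is_pos in ranked:
        if is_pos:
            pos_seen += 1
        else:
            neg_seen += 1
            # Each negative adds the number of positives ranked above it.
            area += pos_seen
            if neg_seen == max_neg:
                break  # end of the top list

    # Assumed normalisation: 1.0 when every positive outranks every negative.
    return area / (max_neg * n_pos), pos_seen / neg_seen


# Toy example: 2 positives, 4 negatives; the top list stops at 2 negatives.
auc, pn_ratio = balanced_auc([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 0, 1, 0, 0, 0])
print(auc, pn_ratio)
```

Plotting `auc` against `pn_ratio` for each protein class would give a scatter diagram of the kind the summary mentions; the exact definitions used in the article may differ from this sketch.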
ISSN: 0165-022X, 1872-857X
DOI: 10.1016/j.jbbm.2007.06.003