Centroid estimation in discrete high-dimensional spaces with applications in biology

Luis E. Carvalho and Charles E. Lawrence*

Division of Applied Mathematics, Brown University, 182 George Street, Providence, RI 02912

Communicated by David Mumford, Brown University, Providence, RI, December 28, 2007 (received for review May 24, 2007)

Abstract

Maximum likelihood estimators and other direct optimization-based estimators dominated statistical estimation and prediction for decades. Yet the principled foundations supporting their dominance do not apply to the discrete high-dimensional inference problems of the 21st century. As is well known, statistical decision theory shows that maximum likelihood and related estimators use data only to identify the single most probable solution. Accordingly, unless this one solution so dominates the immense ensemble of all solutions that its probability is near one, there is no principled reason to expect such an estimator to be representative of the posterior-weighted ensemble of solutions, and thus to represent the inferences drawn from the data. We employ statistical decision theory to find more representative estimators, centroid estimators, in a general high-dimensional discrete setting by using a family of loss functions with penalties that increase with the number of differences in components. We show that centroid estimates are obtained by maximizing the marginal probabilities of the solution components, both for unconstrained ensembles and for an important class of problems whose ensembles contain exclusivity constraints, including sequence alignment and the prediction of RNA secondary structure. Three genomics examples show that these estimators substantially improve predictions of ground-truth reference sets.
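
The contrast between the most probable solution and a centroid can be illustrated with a toy case. The sketch below is not taken from the paper: it assumes the simplest member of the loss family, plain Hamming loss, and an unconstrained ensemble of binary vectors, for which minimizing expected loss reduces to keeping each component whose marginal probability exceeds 1/2. The posterior values and all function names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical posterior over 3-component binary solutions (illustrative numbers only).
posterior = {
    (0, 0, 0): 0.30,  # single most probable solution, isolated from the rest
    (1, 1, 1): 0.28,
    (1, 1, 0): 0.24,
    (1, 0, 1): 0.18,
}


def hamming(a, b):
    """Number of components in which two solutions differ."""
    return sum(x != y for x, y in zip(a, b))


def expected_loss(estimate, posterior):
    """Posterior expected Hamming loss of a candidate estimate."""
    return sum(p * hamming(estimate, s) for s, p in posterior.items())


def map_estimate(posterior):
    """Most probable single solution."""
    return max(posterior, key=posterior.get)


def centroid_estimate(posterior):
    """Assumed unconstrained case: keep each component whose marginal exceeds 1/2."""
    n = len(next(iter(posterior)))
    marginals = [
        sum(p for s, p in posterior.items() if s[i] == 1) for i in range(n)
    ]
    return tuple(int(m > 0.5) for m in marginals)


map_sol = map_estimate(posterior)
centroid_sol = centroid_estimate(posterior)

# Exhaustive check over all 2^3 candidates, feasible only for toy sizes.
brute_force = min(
    product((0, 1), repeat=3), key=lambda e: expected_loss(e, posterior)
)

print("most probable:", map_sol, "expected loss", round(expected_loss(map_sol, posterior), 2))
print("centroid:     ", centroid_sol, "expected loss", round(expected_loss(centroid_sol, posterior), 2))
print("brute force:  ", brute_force)
```

In this toy ensemble the most probable solution carries only 30% of the posterior mass and sits far from the remaining 70%, so the marginal-based estimate has the lower expected loss. The exclusivity-constrained problems mentioned in the abstract (alignment, RNA secondary structure) need additional machinery that this sketch does not attempt.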

Footnotes

  • *To whom correspondence should be addressed. E-mail: Charles_Lawrence{at}brown.edu
  • Author contributions: L.E.C. and C.E.L. designed research; L.E.C. and C.E.L. performed research; L.E.C. analyzed data; and L.E.C. and C.E.L. wrote the paper.
  • The authors declare no conflict of interest.
  • This article contains supporting information online at www.pnas.org/cgi/content/full/0712329105/DC1.
  • Freely available online through the PNAS open access option.
