CavSimBase: A Database for Large Scale Comparison of Protein Binding Sites
Abstract: CavBase is a database containing information about the three-dimensional geometry and the physicochemical properties of putative protein binding sites. Analyzing CavBase data typically involves computing the similarity of pairs of binding sites. In contrast to sequence alignment, however, the structural comparison of protein binding sites is computationally challenging, making large-scale studies difficult or even infeasible. One way to overcome this obstacle is to precompute pairwise similarities in an all-against-all comparison and to make these similarities subsequently accessible to data analysis methods. Once computed, pairwise similarities can also be used to equip CavBase with a neighborhood structure. Taking advantage of this structure, methods for problems such as similarity retrieval can be implemented efficiently. In this paper, we tackle the problem of performing an all-against-all comparison on CavBase, which comprises more than 200,000 protein cavities, by means of parallel computation and cloud computing techniques. We present the conceptual design and technical realization of a large-scale study to create a similarity database called CavSimBase. We illustrate how CavSimBase is constructed and accessed, and how it is used to answer biological questions through data analysis and similarity retrieval.
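To make the idea of a precomputed all-against-all comparison concrete, the following is a minimal sketch and not the CavSimBase implementation. It assumes a toy setting in which each cavity is reduced to a set of feature labels and the Jaccard index stands in for the actual structural comparison; the names `cavities`, `toy_similarity`, `all_against_all`, and `neighbors` are hypothetical. A local thread pool stands in for the parallel and cloud infrastructure, and the stored similarities are then reused for a simple nearest-neighbor lookup.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Hypothetical toy data: each cavity is reduced to a set of feature labels.
# Real binding sites carry 3D geometry and physicochemical properties.
cavities = {
    "cav_a": {"donor", "acceptor", "aromatic"},
    "cav_b": {"donor", "acceptor"},
    "cav_c": {"aliphatic", "aromatic"},
    "cav_d": {"aliphatic"},
}


def toy_similarity(pair):
    """Jaccard index, an assumed stand-in for the structural comparison."""
    a, b = pair
    fa, fb = cavities[a], cavities[b]
    return a, b, len(fa & fb) / len(fa | fb)


def all_against_all(ids, workers=4):
    """Compute every unordered pair once and store the result."""
    pairs = list(combinations(sorted(ids), 2))
    # Threads keep the sketch self-contained; an expensive comparison
    # would be distributed over processes or machines instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return {(a, b): s for a, b, s in pool.map(toy_similarity, pairs)}


def neighbors(sim, query, k=2):
    """Return the k most similar cavities using only stored similarities."""
    scored = [
        (s, b if a == query else a)
        for (a, b), s in sim.items()
        if query in (a, b)
    ]
    # Highest similarity first; ties broken by name for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [name for _, name in scored[:k]]


similarities = all_against_all(cavities)
print(neighbors(similarities, "cav_a"))
```

Because the comparison is symmetric in this sketch, only n(n-1)/2 pairs are evaluated, and a retrieval query afterwards needs no further comparisons, only lookups in the stored table.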