Abstract:
A heterogeneous data ensemble approach for the classification of Saccharomyces
cerevisiae proteins under ‘mitochondrion organization’
Proteins play a central role in keeping a cell healthy and functioning well. An
important group of proteins is the subset of mitochondrial proteins that engage in the
assembly, arrangement and disassembly of the mitochondrion. Several of them have
been found to cause human diseases. Hence, annotating proteins under the ‘mitochondrion
organization’ biological process is vital for identifying disease-causing factors
and for designing therapeutics. As manual annotation requires costly and laborious in
vitro methods, in silico function prediction is now preferred. Recent studies highlight
the importance of incorporating data from various biological aspects to formulate
a strong functional context for classification. In addition, many approaches in the literature
employ ensemble classifiers to attain higher prediction accuracy. However, an
approach that combines accurate classification, effective use of biological data, and
determination of the significance of each biological data type is still needed. This study presents an assessment
of a heterogeneous data ensemble to classify Saccharomyces cerevisiae proteins under
‘mitochondrion organization’. The ensemble consists of nine Euclidean distance-based
nearest neighbour models and three affinity-based neighbourhood models; it utilizes
sequences, protein domains, peptide chain properties, gene expression, secondary structure
and interactions. The base models were trained on annotations from the Gene
Ontology, as well as from a publicly available benchmark gold dataset. They show
a substantial level of disagreement, suggesting that they can be effective in collective decision
making. Six combination schemes were evaluated for fusing the base model outputs. A
genetic algorithm-weighted ensemble gives the largest improvement over the
best-performing base classifier, with an average area under the Receiver Operating
Characteristic curve of 92.52%. Moreover, it can determine the biological
importance of each data type. Overall, the proposed heterogeneous data ensemble
identifies eight disease-related proteins in a strong sense and one disease-related
protein in a moderate sense.
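
The idea of a genetic-algorithm-weighted ensemble can be illustrated with a minimal sketch. The abstract does not specify the combination rule, the fitness function or the genetic operators, so everything below is an assumption made for illustration: base model outputs are fused by a weighted average, fitness is the area under the ROC curve, and the genetic algorithm uses elitism, uniform crossover and Gaussian mutation. The function names and the toy scores are hypothetical and are not taken from the study.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fuse(weights, base_scores):
    """Weighted average of the base model scores for each protein (assumed rule)."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, row)) / total for row in base_scores]

def fitness(weights, base_scores, labels):
    """Fitness of a weight vector: AUC of the fused scores (assumed criterion)."""
    return auc(labels, fuse(weights, base_scores))

def evolve_weights(base_scores, labels, pop_size=20, generations=30, seed=0):
    """Search for ensemble weights with a small, illustrative genetic algorithm."""
    rng = random.Random(seed)
    n_models = len(base_scores[0])
    pop = [[rng.uniform(0.01, 1.0) for _ in range(n_models)] for _ in range(pop_size)]
    # Include uniform weights so the result is never worse than plain averaging.
    pop[0] = [1.0] * n_models
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, base_scores, labels), reverse=True)
        elite = pop[: pop_size // 2]          # elitism: keep the better half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]    # uniform crossover
            i = rng.randrange(n_models)
            child[i] = max(0.01, child[i] + rng.gauss(0, 0.1))  # Gaussian mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda w: fitness(w, base_scores, labels))

# Invented toy data: one row per protein, one column per base model.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [
    [0.9, 0.4, 0.6],
    [0.8, 0.7, 0.3],
    [0.4, 0.6, 0.7],
    [0.3, 0.5, 0.8],
    [0.2, 0.6, 0.4],
    [0.5, 0.2, 0.5],
]

best_weights = evolve_weights(toy_scores, toy_labels)
print("weights:", best_weights)
print("fused AUC:", fitness(best_weights, toy_scores, toy_labels))
```

Under these assumptions, the learned weights also give a rough reading of how much each base model, and hence each data type, contributes to the fused decision. In practice the weights would be fitted on training folds and evaluated on held-out proteins, which this sketch omits.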