A Heterogeneous data ensemble approach for protein function prediction under mitochondrion organization

dc.contributor.advisor: Perera, AS
dc.contributor.author: Sumanaweera, DN
dc.date.accept: 2016
dc.date.accessioned: 2017-02-15T06:10:32Z
dc.date.available: 2017-02-15T06:10:32Z
dc.description.abstract: A heterogeneous data ensemble approach for the classification of Saccharomyces cerevisiae proteins under ‘mitochondrion organization’. Proteins play the central role in keeping a cell healthy and functioning well. An important group is the subset of mitochondrial proteins that engage in the assembly, arrangement and disassembly of the mitochondrion. Several of them have been identified as causes of human diseases. Hence, annotating proteins under the ‘mitochondrion organization’ biological process is vital for identifying disease-causing factors and for designing therapeutics. As manual annotation requires costly and laborious in vitro methods, in silico function prediction is now preferred. Recent studies identify the importance of incorporating data from various biological aspects to formulate a strong functional context for classification. In addition, many approaches in the literature employ ensemble classifiers to attain higher prediction accuracy. However, an insightful approach to accurate classification, biological data utilization, and determination of the significance of each biological data type is still needed. This study presents an assessment of a heterogeneous data ensemble to classify Saccharomyces cerevisiae proteins under ‘mitochondrion organization’. The ensemble consists of nine Euclidean-distance-based nearest neighbour models and three affinity-based neighbourhood models; it utilizes sequences, protein domains, peptide chain properties, gene expression, secondary structure and interactions. The base models were trained on annotations from the Gene Ontology, as well as on a publicly available benchmark gold dataset. They show a substantial level of disagreement, which suggests that they are effective in collective decision making. Six combination schemes were evaluated for fusing the base model outputs. An ensemble weighted by a genetic algorithm gives the highest improvement over the best-performing base classifier, with an average area under the Receiver Operating Characteristic curve of 92.52%. Moreover, it is capable of determining the biological importance of each data type. Overall, the proposed heterogeneous data ensemble identifies eight disease-related proteins in a strong sense and one disease-related protein in a moderate sense. [en_US]
dc.identifier.accno: TH3257 [en_US]
dc.identifier.degree: MSc (Major Component Research) [en_US]
dc.identifier.department: Department of Computer Science & Engineering [en_US]
dc.identifier.faculty: Engineering [en_US]
dc.identifier.uri: http://dl.lib.mrt.ac.lk/handle/123/12395
dc.language.iso: en [en_US]
dc.subject: yeast [en_US]
dc.subject: proteins
dc.subject: mitochondrion
dc.subject: weighted ensemble
dc.subject: data heterogeneity
dc.title: A Heterogeneous data ensemble approach for protein function prediction under mitochondrion organization [en_US]
dc.type: Thesis-Full-text [en_US]

Files

Original bundle

Name: TH3257-1.pdf
Size: 2.66 MB
Format: Adobe Portable Document Format
Description: Pre-text

Name: TH3257-2.pdf
Size: 10.18 MB
Format: Adobe Portable Document Format
Description: Post-text

Name: TH3257.pdf
Size: 38.72 MB
Format: Adobe Portable Document Format
Description: Full-thesis