Department of Computer Science: MSc Thesis Presentations

Lectures and seminars

Mika Sorvoja will present their MSc thesis on Thursday, 19 September 2024, at 9:00 in meeting room B337 of the Computer Science building.

When

19.9.2024 9:00 – 9:30 (UTC +3)

Where

Computer Science building

meeting room B337

Event language(s)

English

Projective feature selection on molecular fingerprint data

Author: Mika Sorvoja
Supervisor: Juho Rousu
Advisors: Riikka Huusari, Sandor Szedmak

Abstract: This thesis deals with molecular fingerprints and anti-cancer drug response data. It studies whether feature selection can identify the features that contribute most to the response that anti-cancer drugs elicit. Two algorithms are mainly used and compared: a basic linear feature ranking algorithm based on the Pearson correlation coefficient and a novel kernel-based projective feature selection algorithm. The success of feature selection is evaluated using kernel alignment and regression. The former examines how well feature selection maintains similarity between data points, while the latter predicts drug response values from the fingerprint data. In the regression problem, several learners are compared: the two main ones are linear regression and random forest regression, but support vector regression and two slightly different multilayer perceptrons are also applied to the task. The results show that the feature selection algorithm relying on projection operators is superior to the simple linear one: it finds features that contribute both to higher kernel alignment and to more accurate predictions of drug response values. Among the regressors, random forest regression produces the most accurate predictions on average, while linear regression yields the least accurate ones. However, none of the algorithms used achieves satisfactory prediction accuracy. As for the different molecular fingerprints, substructure keys-based fingerprints appear to produce slightly more accurate predictions on average than topological or circular fingerprints. In conclusion, the results of this thesis show that predicting drug response values from molecular fingerprint data is a challenging task. However, with a suitable feature selection algorithm, it is possible to improve learning speed while maintaining almost as high prediction accuracy as with the full feature set. Further studies might find more suitable learning algorithms that produce more satisfactory results.
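
For readers unfamiliar with the baseline and the evaluation measure named in the abstract, the following is a minimal illustrative sketch, not the thesis code. It ranks features by the absolute Pearson correlation with the response and then computes the kernel alignment between a linear kernel on all features and one on the selected features. The toy fingerprint matrix, the response values, the choice of a linear kernel, and all function names are assumptions made for illustration. The projective feature selection algorithm is not shown, since this announcement does not describe it.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(X, y):
    """Feature indices sorted by decreasing |Pearson r| with the response y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def select_columns(X, indices):
    """Keep only the given feature columns."""
    return [[row[j] for j in indices] for row in X]


def linear_kernel(X):
    """Gram matrix of pairwise dot products between data points."""
    return [[sum(a * b for a, b in zip(u, v)) for v in X] for u in X]


def kernel_alignment(K1, K2):
    """Frobenius inner product of K1 and K2 divided by their Frobenius norms."""
    inner = sum(a * b for r1, r2 in zip(K1, K2) for a, b in zip(r1, r2))
    norm1 = math.sqrt(sum(a * a for row in K1 for a in row))
    norm2 = math.sqrt(sum(a * a for row in K2 for a in row))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return inner / (norm1 * norm2)


# Hypothetical toy data: 5 molecules, 4 binary fingerprint bits, one response each.
X = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]
y = [0.9, 0.8, 0.2, 0.1, 0.7]

ranking = rank_features(X, y)
top_features = ranking[:2]  # keep the two highest-ranked bits
K_full = linear_kernel(X)
K_selected = linear_kernel(select_columns(X, top_features))
alignment = kernel_alignment(K_full, K_selected)
print(ranking, round(alignment, 3))
```

An alignment close to 1 would indicate that the selected features preserve most of the pairwise similarity structure of the full fingerprint, which is the property the abstract says kernel alignment is used to examine.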

Updated: 10.9.2024
Published: 10.9.2024