Lorentz JÄNTSCHI (lori) works ?id=255
- [id] => 255
- [recorddate] => 2016:08:25:11:10:48
- [lastupdate] => 2016:08:25:11:10:48
- [type] => conference
- [place] => Ribno, Slovenia
- [subject] => biology - biostatistics; chemistry - computational; informatics - models implementation; mathematics - modeling; mathematics - statistics
- [relatedworks] =>
- 3 (low):
- Linear regression modeling and validation strategies for structure-activity relationships, ?id=275
- [file] => ?f=255
- [mime] => application/pdf
- [size] => 588446
- [pubname] => Applied Statistics 2011, September 25-28, 2011
- [pubinfo] => Statistical Society of Slovenia
- [pubkey] => ISBN 978-961-92487-7-5
- [workinfo] => Oral presentation, 28th September (12:10-12:30), Abstracts book, p. 74
- [year] => 2011
- [title] => Is simple randomization of compounds in training and test set as good as other methods used in quantitative structure-activity experiments?
- [authors] => Sorana D. BOLBOACĂ, Lorentz JÄNTSCHI
- [abstract] =>
The present research aimed to assess whether simple random sampling is a proper method for splitting a set of compounds into training and test sets.
Four sets of compounds were included in the analysis: 1) a set of 83 drug-like compounds with blood-brain barrier permeation; 2) a set of 18 sulfanilamide derivatives with carbonic anhydrase II
isozyme inhibitory activity; 3) a set of 34 taxoids with inhibitory activity on cell growth; and 4) a set of 25 triphenylacrylonitriles with affinity for the estrogen receptor. A qSAR experiment was carried
out using the Molecular Descriptors Family on Vertices to compute structural descriptors, and multiple linear regression models were identified. Each set of compounds was split into training and test sets using a
simple randomization approach. The reliability of the randomization was tested using generalized cluster analysis with the K-means algorithm (Statistica 8; Euclidean distance and maximization of the
initial distance with respect to the cluster centers, using 10-fold cross-validation).
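The simple randomization step described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `simple_random_split`, the fixed seed, and the use of integer compound identifiers are all assumptions, and the abstract does not say which software performed the split. The sizes 83 and 55 are the total and training counts the abstract reports for the first set.

```python
import random


def simple_random_split(compound_ids, n_train, seed=0):
    """Split compounds into (training, test) by simple random sampling
    without replacement. Hypothetical helper, for illustration only."""
    compound_ids = list(compound_ids)
    if not 0 < n_train < len(compound_ids):
        raise ValueError("n_train must leave at least one compound in each set")
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    rng.shuffle(compound_ids)  # every ordering is equally likely
    return compound_ids[:n_train], compound_ids[n_train:]


# First set in the abstract: 83 compounds, 55 of them assigned to training.
train, test = simple_random_split(range(83), n_train=55)
print(len(train), len(test))
```

Shuffling once and cutting at `n_train` gives every compound the same chance of landing in the training set, which is what distinguishes simple randomization from activity-ranked or cluster-based splitting schemes.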
The following numbers of molecules were included in the training:test sets: 55:28 for the 1st set; 12:6 for the 2nd set; 23:11 for the 3rd set; and 19:6 for the 4th set. The experimental data in both the training and test sets
proved to be normally distributed (Anderson-Darling and Kolmogorov-Smirnov statistics with p-value < 0.05). The proper number of clusters identified using the observed activity and the identified descriptors
varied from 3 (1st set) to 6 (4th set). With some exceptions (clusters with just one compound), the clusters proved to contain compounds from both the training and test sets. The descriptors and the observed activity proved to contribute significantly to the clustering (p < 0.001).
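The check that clusters contain compounds from both subsets can be sketched as below. This is an illustration under assumptions: the cluster labels are taken as given (the study obtained them with K-means in Statistica 8, which is not reproduced here), the helper names `cluster_composition` and `unmixed_clusters` are hypothetical, and the toy labels are invented, not data from the study.

```python
from collections import defaultdict


def cluster_composition(cluster_labels, train_ids):
    """Count training and test compounds in each cluster.

    cluster_labels: dict mapping compound id -> cluster label
    train_ids: ids in the training set; every other id counts as test
    """
    train_ids = set(train_ids)
    counts = defaultdict(lambda: {"train": 0, "test": 0})
    for compound_id, label in cluster_labels.items():
        subset = "train" if compound_id in train_ids else "test"
        counts[label][subset] += 1
    return dict(counts)


def unmixed_clusters(counts):
    """Return labels of clusters that lack training or test compounds."""
    return sorted(label for label, c in counts.items()
                  if c["train"] == 0 or c["test"] == 0)


# Invented toy data: six compounds in three clusters, four used for training.
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
counts = cluster_composition(labels, train_ids=[0, 1, 3, 5])
print(counts)
print(unmixed_clusters(counts))
```

In the toy data only the single-compound cluster `C` is unmixed. A cluster of one compound can never hold members of both subsets, which is the kind of exception the abstract mentions.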
Simple randomization proved to be a proper method for splitting the set of compounds into training and test subsets.
- [keywords] => randomization; training vs. test; leave-one-out cross-validation
- [acknowledgment] => To the Erasmus office for an Erasmus Staff Mobility grant for one of the authors (SDB).