11th Electronic Computational Chemistry Conference (ECCC) book of abstracts located here
http://eccc.monmouth.edu/cgi-bin/discus/discus.cgi#29
Data Mining on Structure-Activity/Property Relationships Models
Sorana Daniela BOLBOACĂ, Lorentz JÄNTSCHI
"Iuliu Haţieganu" Medicine and Pharmacy University, Technical University, Cluj-Napoca, Romania



Keywords: Knowledge-Discovery in Database (KDD), Cluster analysis, Structure-Activity/Property Relationships (SAR/SPR), Molecular Descriptors Family (MDF)

Abstract

   The Molecular Descriptors Family (MDF) approach to structure-activity/property relationships was applied in order to identify the link between the structure of compounds and their activity or property. Fifty-five classes of properties or activities of different sets of compounds were investigated. Univariate and multivariate linear regression models using molecular descriptors as variables were identified. The models with estimation and prediction abilities, together with their associated characteristics, were stored in a database. A data mining analysis using classification and clustering was applied to the resulting database in order to search for and extract useful information. The methodology applied and the results obtained are presented.

Introduction

   Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, and/or clustering. The term has been defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [1], being considered as the science of extracting useful information from large data sets or databases [2].
   Data mining techniques are used to search for consistent patterns and/or systematic relationships between variables in business [3], the evaluation of web-based educational programs [4], computer science [5], chemistry [6], engineering [7], medicine [8], and all other domains where a large amount of data must be analyzed.
   A new method of quantitative structure-activity/property relationships called MDF SAR/SPR (molecular descriptors family on the structure-activity/property relationships) was introduced by Jäntschi in 2004 [9] and reviewed in 2005 [10]. Since then, samples of compounds with different properties or activities have been investigated and analyzed. Results have been reported for several properties (retention chromatography index [9], relative response factor [11], molar refraction [12], octanol/water partition coefficient [13-15]) and activities (insecticidal activity [16], herbicidal activity [17], antioxidant efficacy [18], inhibition activity [19-21], toxicity [22,23], antituberculotic activity [24], and antimalarial activity [25]). In addition, the overall results from the use of the molecular descriptors family on structure property/activity relationships have also been published [26].
   The best performing models in terms of correlation coefficients and cross-validation scores were collected in a database. Data mining techniques were applied to this body of information in order to identify consistent patterns and/or relationships between the variables of the MDF SAR/SPR models.

Material

   Fifty-five sets of compounds were included in the analysis. The set abbreviation, the activity or property of interest, and the class of compounds are presented in Table 1.

Table 1. Characteristics of the sets included in the analysis
No | Abbreviation | Activity / Property | Compounds
1 | DevMTOp00 | LC50/EC50 - fertilization of sea urchin | ordnance
2 | DevMTOp01 | LC50/EC50 - embryological development of sea urchin |
3 | DevMTOp02 | LC50/EC50 - germination of sea urchin |
4 | DevMTOp03 | LC50/EC50 - zoospore germination of green macroalgae |
5 | DevMTOp04 | LC50/EC50 - germling length of green macroalgae |
6 | DevMTOp05 | LC50/EC50 - germling cell number of green macroalgae |
7 | DevMTOp06 | LC50/EC50 - survival and reproductive success of polychaete |
8 | DevMTOp07 | LC50/EC50 - redfish larvae survival |
9 | DevMTOp08 | LC50/EC50 - juveniles survival of opossum shrimp |
10 | DevMTOp09 | NOEC - fertilization of sea urchin |
11 | DevMTOp10 | NOEC - embryological development of sea urchin |
12 | DevMTOp11 | NOEC - germination of sea urchin |
13 | DevMTOp12 | NOEC - germling length and cell number of green macroalgae |
14 | DevMTOp14 | NOEC - survival and reproductive success of green macroalgae |
15 | DevMTOp15 | NOEC - survival and reproductive success of polychaete |
16 | DevMTOp16 | NOEC - redfish larvae survival |
17 | DevMTOp17 | NOEC - juveniles survival of opossum shrimp |
18 | DevMTOp18 | LOEC - fertilization of sea urchin |
19 | DevMTOp19 | LOEC - embryological development of sea urchin |
20 | DevMTOp20 | LOEC - germination of sea urchin |
21 | DevMTOp21 | LOEC - germling length and cell number of green macroalgae |
22 | DevMTOp22 | LOEC - survival and reproductive success of green macroalgae |
23 | DevMTOp23 | LOEC - survival and reproductive success of polychaete |
24 | DevMTOp24 | LOEC - redfish larvae survival |
25 | DevMTOp25 | LOEC - juveniles survival of opossum shrimp |
26 | DHFR | inhibition activity | 2,4-Diamino-5-(substituted-benzyl)pyrimidines
27 | Dipeptides | | dipeptides
28 | RRC433_lbr | toxicity | para substituted phenols
29 | RRC433_pka | relative toxicity |
30 | Ta395 | cytotoxicity | quinolines
31 | Tox395 | mutagenicity |
32 | 19654 | antiallergic activity | substituted N 4-methoxyphenyl benzamides
33 | 22583 | anti-HIV-1 potencies | HEPTA and TIBO derivatives
34 | 26449 | antituberculotic activity | polyhydroxyxanthones
35 | 3300 | growth inhibition activity | taxoids
36 | 41521 | insecticidal activity | neonicotinoids
37 | 52344 | antioxidant efficacy | 3-indolyl derivatives
38 | 52730 | toxicity | alkyl metal compounds
39 | 23110 | | benzene derivatives
40 | 23158 | | mono-substituted nitrobenzenes
41 | 23167 | | polychlorinated organic compounds
42 | 40846_1 | inhibition activity on carbonic anhydrase I | substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides
43 | 40846_2 | inhibition activity on carbonic anhydrase II |
44 | 40846_4 | inhibition activity on carbonic anhydrase IV |
45 | Triazines | herbicidal activity | substituted triazines
46 | 23159e | octanol/water partition coefficients | polychlorinated biphenyls
47 | 33504 | boiling point | alkanes
48 | 36638 | water activated carbon adsorption | organic compounds
49 | IChr_10 | retention chromatography index | organophosphorus herbicides
50 | MR_10 | molar refraction | cyclic organophosphorus
51 | PCB_rrf | relative response factor | polychlorinated biphenyls
52 | PCB_lkow | octanol/water partition coefficient |
53 | PCB_rrt | relative retention time |
54 | RRC433_lkow | octanol/water partition coefficient | para substituted phenols
55 | 31572 | | volatile organic compound
LC50 = lethal concentration to 50% of the test organisms
EC50 = effective concentration to 50% of the test organisms
NOEC = no observed effect concentration
LOEC = lowest observed effect concentration

   Univariate and multivariate models were obtained by applying the MDF SAR/SPR methodology to the samples of compounds; the models were stored in a database. The molecular descriptors are the variables used by the models. The characters used in the names of the molecular descriptors are presented in Table 2. The significance of each character was previously reported [23].

Table 2. Characters in molecular descriptors name
Position | Characters
First | I-i-A-a-L-l
Second | m-M-n-N-S-P-s-A-a-B-b-G-g-F-f-H-h-I-i
Third | m-M-D-P
Fourth | R-r-M-m-D-d
Fifth | D-d-O-o-P-p-Q-q-J-j-K-k-L-l-V-E-W-w-F-f-S-s-T-t
Sixth | C-H-M-E-G-Q
Seventh | g-t
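
   As a minimal sketch of how the seven-character naming scheme in Table 2 can be checked programmatically, the Python fragment below splits a descriptor name into its positions and validates each character against the table. The function name `split_descriptor` and the dictionary layout are illustrative assumptions, not part of the MDF software.

```python
# Allowed characters per position, transcribed from Table 2.
ALLOWED = {
    1: "IiAaLl",
    2: "mMnNSPsAaBbGgFfHhIi",
    3: "mMDP",
    4: "RrMmDd",
    5: "DdOoPpQqJjKkLlVEWwFfSsTt",
    6: "CHMEGQ",
    7: "gt",
}


def split_descriptor(name):
    """Return {position: character} for a 7-character MDF descriptor name.

    Raises ValueError if the length is wrong or a character is not
    listed for its position in Table 2. (Hypothetical helper.)
    """
    if len(name) != len(ALLOWED):
        raise ValueError(f"expected 7 characters, got {len(name)}: {name!r}")
    parts = {}
    for position, char in enumerate(name, start=1):
        if char not in ALLOWED[position]:
            raise ValueError(f"{char!r} is not valid at position {position}")
        parts[position] = char
    return parts


print(split_descriptor("imDrkQt"))
```

   Keeping the alphabet of each position in one table makes it straightforward to reject malformed names before any counting or clustering is attempted.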

Method

   The MDF SAR/SPR database was queried and the information of interest was obtained by using a series of PHP programs. The SPSS software was used to summarize and analyze the data. The 95% confidence intervals were computed by using dedicated software based on the binomial distribution hypothesis [27].
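
   The dedicated software [27] is not reproduced here. As a rough illustration of a binomial-based 95% confidence interval for a count out of a total, the Python sketch below uses the Wilson score interval; this choice of method and the function name `wilson_count_interval` are assumptions made for the example and may differ from the procedure used by the authors.

```python
from math import sqrt
from statistics import NormalDist


def wilson_count_interval(successes, total, confidence=0.95):
    """Wilson score interval for a binomial proportion, scaled to counts.

    This is one common binomial-based interval; it is an assumption here,
    not necessarily the method implemented in the software cited as [27].
    """
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (center - half) * total, (center + half) * total


# Example with counts quoted in the Results section: 156 of 195 models.
low, high = wilson_count_interval(156, 195)
print(round(low), round(high))
```

   For 156 of 195 the rounded bounds come out as 144 and 166, which is in line with the interval quoted in the Results, although that agreement does not establish which method the original software uses.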
   Two-step cluster analysis and hierarchical cluster analysis were used to search for patterns, where appropriate. The two-step cluster analysis was used to search for patterns across all models. This technique was chosen for two specific features: automatic selection of the best number of clusters, and the ability to build cluster models based simultaneously on categorical and continuous variables. The hierarchical cluster method was used to identify similarities among the best performing MDF SAR/SPR models; it was chosen because it is a well-documented method that is easy to implement and produces dendrograms, tree-like structures that illustrate the relationships between the entries.
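
   The SPSS settings (distance measure, linkage rule, standardization) are not stated in the text. Purely as an illustration of how hierarchical clustering joins entries step by step, the Python sketch below implements agglomerative clustering with Euclidean distance and average linkage; both choices, the function name `agglomerate`, and the toy data are assumptions.

```python
from itertools import combinations
from math import dist


def agglomerate(points):
    """Average-linkage agglomerative clustering on labelled points.

    points: dict mapping a label to a tuple of numeric features.
    Returns the list of merges as (members_a, members_b, distance),
    in the order in which clusters are joined.
    """
    def linkage(pair):
        a, b = pair
        # Mean Euclidean distance over all cross-cluster pairs.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [(label,) for label in points]  # every item starts alone
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical (r2, number of descriptors, sample size) triples; these are
# placeholders, not values taken from the MDF SAR/SPR database.
toy = {
    "set_A": (0.98, 2, 10),
    "set_B": (0.97, 2, 12),
    "set_C": (0.95, 4, 60),
    "set_D": (0.99, 5, 209),
}
for left, right, d in agglomerate(toy):
    print(left, "+", right, f"at distance {d:.2f}")
```

   The ordered list of merges is the information a dendrogram draws. Note that with unscaled features the variable with the largest range (here the sample size) dominates the distances, so standardizing the variables first is a common design choice.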

Results

   Fifty-five sets were included in the analysis, totaling one hundred and ninety-five models. One hundred and fifty-six models were for activity estimation and prediction (95%CI [144 - 166]) and thirty-eight models for property estimation and prediction (95%CI [28 - 50]).
   Seventy-three models showed estimation and prediction abilities for an activity (95%CI [64 - 80]) and nineteen models (95%CI [12 - 27]) for a property. The number of MDF SAR models varied from two to eleven (for set no. 40, Table 1), and the number of MDF SPR models from two to eight (for set no. 48, Table 1). The statistical characteristics of all models, and of the best performing models (those with the squared correlation coefficient and cross-validation score closest to one), are presented in Table 3.

Table 3. Statistical characteristics of the MDF SAR/SPR models
 |  | nv | Mean [95%CI] | Median | Min | Max | StDev
All models
Activity | r2 | 156 | 0.9023 [0.8783 - 0.9263] | 0.9489 | 0.0122 | 1.0000 | 0.1514
 | v |  | 2 [2 - 2] | 2 | 1 | 5 | 1.1003
 | nsample |  | 28 [24 - 31] | 23 | 5 | 69 | 21.468
Property | r2 | 38 | 0.8698 [0.8077 - 0.9319] | 0.9772 | 0.1208 | 1.0000 | 0.1889
 | v |  | 4 [2 - 6] | 2 | 1 | 24 | 6.0663
 | nsample |  | 77 [48 - 105] | 24 | 10 | 209 | 86.220
Best performing models
Activity | r2 | 45 | 0.9807 [0.9714 - 0.9900] | 0.9992 | 0.9037 | 1.0000 | 0.0310
 | v |  | 3 [2 - 3] | 2 | 2 | 5 | 1.0288
 | nsample |  | 19 [13 - 24] | 8 | 5 | 69 | 17.945
Property | r2 | 10 | 0.9572 [0.8993 - 1.0000] | 0.9883 | 0.7368 | 1.0000 | 0.0808
 | v |  | 3 [2 - 4] | 2.5 | 2 | 6 | 1.3703
 | nsample |  | 80 [16 - 144] | 27 | 10 | 209 | 90.120
r2 = squared correlation coefficient; v = number of descriptors used in models;
nsample = sample size; nv = number of valid samples; 95% CI = 95% confidence interval;
Min = minimum; Max = maximum; StDev = standard deviation
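
   To make the two ranking criteria concrete, the Python sketch below computes the squared correlation coefficient and a leave-one-out cross-validation score for a univariate linear model. The exact definition of the cross-validation score in the MDF methodology is not given here, so the common form 1 - PRESS/SS_total is assumed; the function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(xs, ys):
    """Squared correlation between observed and fitted values."""
    slope, intercept = fit_line(xs, ys)
    my = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def loo_q_squared(xs, ys):
    """Leave-one-out score, assumed here as 1 - PRESS / total sum of squares."""
    my = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit the model without observation i, then predict it.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1 - press / sum((y - my) ** 2 for y in ys)


# Hypothetical descriptor values and measured property, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
print(f"r2 = {r_squared(x, y):.4f}, q2 = {loo_q_squared(x, y):.4f}")
```

   A model whose leave-one-out score stays close to its squared correlation coefficient predicts held-out compounds almost as well as it fits them, which is the sense in which both values being close to one marks a well performing model.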

   The MDF SAR/SPR models stored in the database used two hundred and eighty-four molecular descriptors. About sixty-nine percent of them (one hundred and ninety-six descriptors, 95%CI [180 - 211]) were used by just one model. The distribution of the remaining descriptors across the MDF SAR/SPR models was:
  • Two descriptors were used by six models (imDrkQt, and lPMDVQg)
  • Four by five models (ASPrVQg, IiMMWHt, IMPrkQg, and iSMMWHg)
  • Sixteen descriptors were used by four models (AHMMVQg, aHPMwQt, aIDmjQg, iAMrVQg, iBMmwHg, iHDdFHg, iHMMtHg, IiDrQHg, iIPmWHt, ImmRDCg, imMrFHt, inDmwHg, INPRJQg, inPRlQg, isMdTHg, iSMmEQt)
  • Twenty one descriptors were used by three models (ABDmtQg, ASMmVQt, AsPmVQt, aSPRtQg, IADRSHg, IBPMWQt, iGPrfHt, iIMdLGg, iIMdTMg, iImrKHt, InMdTHg, isDRTCg, isDRtHg, ismRSEg, iSPRtQg, lfDdOQg, LHDmjQg, lIDrFEg, lIMdLGg, liMDWHg, LsDMpQg)
  • Forty-five descriptors were used by two models (ABmrtQg, AHDmEQg, aHMmjQt, AiMrKQt, AIPmVQt, AiPmVQt, aIPMwQt, anDRJQt, aSMMjQg, iAPmEQg, ibDMFHt, IbMmjHg, IBMrkGg, IBMRQCg, IbPdPHg, iFmRFMt, iFPMECg, IHDRKEg, iHMMTQt, IIDDKGg, IiMMSGg, imDdSCg, ImDmEEt, IMDMtQt, ImDrFEt, iMMMjQg, IMmrKQg, imMrtCg, inMRkQt, InPdJQg, inPRjQt, isDDkGg, IsMRKQg, ISPdlMg, IsPdOQg, lFDMwEt, lfDMWHt, lFMMKQg, LHDROQg, LIDmjQg, lImrKHt, lmMrsGg, lNPmfQt, LSPmEQg, LsPrDQt).
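
   A quick consistency check of the counts listed above, written as a short Python sketch (the variable names are illustrative): the per-usage counts should add up to the two hundred and eighty-four descriptors reported.

```python
# Number of descriptors (value) used by a given number of models (key),
# as stated in the text above.
usage = {1: 196, 2: 45, 3: 21, 4: 16, 5: 4, 6: 2}

total_descriptors = sum(usage.values())
share_single_use = usage[1] / total_descriptors

print(total_descriptors)           # distinct descriptors in the database
print(f"{share_single_use:.1%}")   # share used by exactly one model
```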
       One hundred and forty-seven descriptors were used in the best performing models. The correspondence between the use of descriptors in all models and in the best performing models is presented in Table 4.

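    Table 4. Descriptors in all models versus best performing models
    Descriptors: all models | Descriptors: best models | Total
    1 | 1 | 89
    1 Total |  | 89
    2 | 1 | 24
    2 | 2 | 2
    2 | 3 | 1
    2 Total |  | 27
    3 | 1 | 11
    3 | 3 | 1
    3 Total |  | 12
    4 | 1 | 9
    4 | 2 | 4
    4 Total |  | 13
    5 | 1 | 3
    5 | 2 | 1
    5 Total |  | 4
    6 | 1 | 2
    6 Total |  | 2
    Total |  | 147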

       The partial squared correlation coefficient (the squared correlation coefficient between each descriptor in the model and the property or activity of interest) varied across all models from 0.0001 to 0.9995, with an average of 0.3645. For the best performing models, the partial squared correlation coefficients varied from 0.0001 to 0.9794, with an average of 0.2959. The average values of the partial squared correlation coefficients for all models and for the best performing model, by activity or property of interest, are summarized in Table 5. Moreover, the descriptors that obtained the greatest partial squared correlation coefficients are not always found in the best performing model.
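
       Following the definition given above, the partial squared correlation coefficient of each descriptor and its average over a model can be sketched in Python as follows. The function name `squared_correlation`, the descriptor labels and all numeric values are hypothetical placeholders, not data from the MDF SAR/SPR database.

```python
def squared_correlation(xs, ys):
    """Squared Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical model with two descriptors; the values are placeholders.
activity = [0.5, 1.0, 1.4, 2.1, 2.4]
descriptors = {
    "descriptor_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "descriptor_2": [2.0, 1.0, 4.0, 3.0, 5.0],
}

# One partial r2 per descriptor, then the average contribution for the model.
partial_r2 = {name: squared_correlation(values, activity)
              for name, values in descriptors.items()}
average = sum(partial_r2.values()) / len(partial_r2)
print(partial_r2, average)
```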

    Table 5. The average contribution of the descriptors to the model
    MDF SARs
    Set abb. | Avgr2-best | Avgr2-all
    DevMTOp00 | 0.8673 | 0.9113
    DevMTOp01 | 0.6632 | 0.7753
    DevMTOp02 | 0.4144 | 0.5866
    DevMTOp03 | 0.0398 | 0.3232
    DevMTOp04 | 0.2221 | 0.4454
    DevMTOp05 | 0.1355 | 0.3823
    DevMTOp06 | 0.3040 | 0.5251
    DevMTOp07 | 0.4579 | 0.6160
    DevMTOp08 | 0.3384 | 0.5284
    DevMTOp09 | 0.4035 | 0.5883
    DevMTOp10 | 0.4169 | 0.5941
    DevMTOp11 | 0.1692 | 0.4368
    DevMTOp12 | 0.0214 | 0.3060
    DevMTOp14 | 0.1092 | 0.3786
    DevMTOp15 | 0.1100 | 0.3905
    DevMTOp16 | 0.2451 | 0.4669
    DevMTOp17 | 0.1447 | 0.3694
    DevMTOp18 | 0.5083 | 0.6717
    DevMTOp19 | 0.2888 | 0.5032
    DevMTOp20 | 0.1391 | 0.3846
    DevMTOp21 | 0.0721 | 0.3492
    DevMTOp22 | 0.1946 | 0.4475
    DevMTOp23 | 0.1430 | 0.4033
    DevMTOp24 | 0.4997 | 0.6464
    DevMTOp25 | 0.0441 | 0.3559
    DHFR | 0.1482 | 0.1680
    Dipeptides | 0.5145 | 0.4603
    RRC433_lbr | 0.1612 | 0.2329
    RRC433_pka | 0.2623 | 0.2144
    Ta395 | 0.1027 | 0.1002
    Tox395 | 0.2053 | 0.2712
    19654 | 0.1360 | 0.3286
    22583 | 0.2288 | 0.1908
    26449 | 0.3874 | 0.5332
    3300 | 0.2408 | 0.2761
    41521 | 0.2407 | 0.4365
    52344 | 0.5083 | 0.4243
    52730 | 0.5806 | 0.7092
    23110 | 0.1298 | 0.2106
    23158 | 0.3011 | 0.2719
    23167 | 0.3546 | 0.3636
    40846_1 | 0.3264 | 0.4271
    40846_2 | 0.1319 | 0.2170
    40846_4 | 0.2529 | 0.2621
    Triazines | 0.4323 | 0.4613
    Min | 0.0214 | 0.1002
    Max | 0.8673 | 0.9113
    Average | 0.2800 | 0.4210

    MDF SPRs
    Set abb. | Avgr2-best | Avgr2-all
    23159 | 0.0089 | 0.1685
    31572 | 0.2274 | 0.2581
    33504 | 0.5297 | 0.6416
    36638 | 0.2880 | 0.3051
    IChr10 | 0.5998 | 0.4005
    MR10 | 0.8971 | 0.9075
    PCB_lkow | 0.2268 | 0.3327
    PCB_rrf | 0.2712 | 0.2843
    PCB_rrt | 0.4687 | 0.7021
    RRC433_lkow | 0.2308 | 0.3011
    Min | 0.0089 | 0.1685
    Max | 0.8971 | 0.9075
    Average | 0.3748 | 0.4302

    Avgr2-best = the average of the partial squared correlation coefficient on best performing model;
    Avgr2-all = the average of the partial squared correlation coefficient on all models

       Summarizing the characters included in the descriptor names, it can be observed that, with a single exception, all characters for the first, third, fourth, fifth, sixth and seventh letters appear in the descriptor names of all MDF SAR/SPR models. The same observation is valid for the best performing models. Three characters out of nineteen for the second letter (a, g and h, see Table 2) did not appear in any model. In order to apply cluster analysis techniques, the frequencies of the characters in the models of each set were transformed into qualitative variables (yes/no). The results obtained by performing the two-step cluster analysis on all models as well as on the best performing models are summarized in Table 6 (DescL = the letter in the descriptor name, Ch = character, Best model = the model that obtained the greatest squared correlation coefficient and leave-one-out cross-validation score). Table 6 also includes the absolute frequency of appearance of each character in the descriptor names and the attribute importance in the cluster (‡ = significant importance in cluster at a significance level of 5%).
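
       The yes/no transformation described above can be sketched in Python as follows. This is an illustrative assumption about the data layout, not the authors' PHP code: each set is taken to hold a list of models, and each model a list of seven-character descriptor names; `character_indicators` is a hypothetical helper.

```python
def character_indicators(models_by_set):
    """Map each set to the (position, character) pairs present in its models.

    models_by_set: dict of set name -> list of models, each model being a
    list of 7-character descriptor names.
    Returns: dict of set name -> set of (position, character) tuples.
    A pair being present corresponds to "yes"; absent corresponds to "no".
    """
    present = {}
    for set_name, models in models_by_set.items():
        pairs = set()
        for model in models:
            for descriptor in model:
                # Positions are counted from 1, as in Table 2.
                pairs.update(enumerate(descriptor, start=1))
        present[set_name] = pairs
    return present


# The descriptor names appear earlier in the text, but their grouping into
# sets and models here is hypothetical and only serves the illustration.
example = {
    "set_X": [["imDrkQt", "lPMDVQg"], ["ASPrVQg"]],
    "set_Y": [["IiMMWHt"]],
}
flags = character_indicators(example)
print((1, "i") in flags["set_X"])   # 'i' in the first position of set_X?
print((6, "H") in flags["set_X"])   # 'H' in the sixth position of set_X?
```

       Each (position, character) pair then becomes one categorical yes/no variable per set, which is the kind of input a clustering method that accepts categorical variables can work with.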

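    Table 6. Two-step cluster analysis: results
    DescL | Ch | All models: Cluster 1 (41) | All models: Cluster 2 (14) | All models: Total | Best model
    1st letter | I | 25 | 13 | 38 | 31
     | i | 30 | 14‡ | 44 | 38
     | A | 7 | 4 | 11 | 7
     | a | 10 | 3 | 13 | 5
     | L | 13 | 4 | 17 | 8
     | l | 28 | 10 | 38 | 31
    2nd letter | m | 10 | 7 | 17 | 9
     | M | 3 | 4 | 7 | 7
     | n | 12 | 7 | 19 | 13
     | N | 7 | 1 | 8 | 5
     | S | 11 | 8 | 19 | 12
     | P | 5 | 1 | 6 | 5
     | s | 19 | 7 | 26 | 18
     | A | 14 | 5 | 19 | 13
     | B | 6 | 7‡ | 13 | 9
     | b | 2 | 6‡ | 8 | 6
     | G | 7 | 2 | 9 | 8
     | F | 3 | 7‡ | 10 | 4
     | f | 2 | 1 | 3 | 2
     | H | 14 | 9 | 23 | 16
     | I | 17 | 8 | 25 | 11
     | i | 3 | 7‡ | 10 | 4
    3rd letter | m | 13 | 8 | 21 | 10
     | M | 29 | 14‡ | 43 | 36
     | D | 31 | 13 | 44 | 34
     | P | 31 | 11 | 42 | 34
    4th letter | R | 22 | 10 | 32 | 23
     | r | 26 | 13 | 39 | 32
     | M | 11‡ | 14‡ | 25 | 20
     | m | 28 | 13 | 41 | 25
     | D | 12 | 8 | 20 | 15
     | d | 10 | 10‡ | 20 | 14
    5th letter | D | 7 | 2 | 9 | 4
     | d | 4 | 2 | 6 | 3
     | O | 6 | 0 | 6 | 5
     | o | 3 | 2 | 5 | 2
     | P | 3 | 3 | 6 | 4
     | p | 5 | 2 | 7 | 4
     | Q | 1 | 3‡ | 4 | 3
     | q | 6 | 1 | 7 | 6
     | J | 7 | 6 | 13 | 6
     | j | 9 | 5 | 14 | 6
     | K | 3 | 7‡ | 10 | 5
     | k | 10 | 8‡ | 18 | 13
     | L | 7 | 2 | 9 | 6
     | l | 4 | 2 | 6 | 5
     | V | 8 | 6 | 14 | 10
     | E | 5‡ | 9‡ | 14 | 9
     | W | 1 | 4‡ | 5 | 5
     | w | 9 | 7 | 16 | 8
     | F | 4‡ | 10‡ | 14 | 7
     | f | 9 | 2 | 11 | 5
     | S | 7 | 5 | 12 | 8
     | s | 6 | 6 | 12 | 5
     | T | 6 | 6 | 12 | 9
     | t | 10 | 7 | 17 | 8
    6th letter | C | 10 | 7 | 17 | 6
     | H | 9‡ | 14‡ | 23 | 20
     | M | 17 | 7 | 24 | 16
     | E | 10 | 5 | 15 | 12
     | G | 12 | 8 | 20 | 11
     | Q | 40 | 14 | 54 | 44
    7th letter | g | 40 | 14 | 54 | 41
     | t | 31 | 13 | 44 | 51
    DescL = the letter in the descriptor name
    Ch = character
    Best model = the model that obtained the greatest squared correlation coefficient and cross-validation leave-one-out score
    ‡ = significant importance in cluster at a significance level of 5%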

       The hierarchical cluster technique was applied in order to analyze the best performing models. The icicle plot is presented in Figure 1 and the associated dendrogram in Figure 2.

    Figure 1. Best performing MDF SAR/SPR models analysis: icicle plot

    Figure 2. Best performing MDF SAR/SPR models analysis: dendrogram


Discussion

       Searching the MDF SAR/SPR models for patterns revealed important information for characterizing the activity/property of compound classes through the molecular descriptors family methodology.
       As can be observed from Table 3, when all models are considered, the average squared correlation coefficient obtained by the MDF SARs is greater than the value obtained by the MDF SPRs, while the number of variables is smaller for the MDF SARs than for the MDF SPRs. When the best performing models are analyzed, the average squared correlation coefficient obtained by the MDF SAR models is very close to the one obtained by the MDF SPR models, and the average number of descriptors is the same.
       Only forty-five percent of the molecular descriptors used by a single model in the complete sample of models were found in the best performing models (see Table 4). Sixty percent of the molecular descriptors used by two models in the whole sample were also found in the best performing models. Fifty-seven percent of those used by three models, and about eighty-one percent of those used by four models, were also found in the best performing models. All molecular descriptors used by five and by six models in the whole sample were also used in the best performing models (see Table 4). These observations support the stability and consistency of the MDF SAR/SPR method in identifying the molecular descriptors that capture the strongest relationships between the structure of compounds and the associated activity or property.
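
       The percentages quoted above follow directly from the descriptor counts given in the Results and the totals in Table 4. A short Python sketch of the arithmetic (the variable names are illustrative):

```python
# Descriptors used by k models over all models (from the Results text)
# and how many of them reappear in best performing models (Table 4 totals).
used_in_all = {1: 196, 2: 45, 3: 21, 4: 16, 5: 4, 6: 2}
found_in_best = {1: 89, 2: 27, 3: 12, 4: 13, 5: 4, 6: 2}

retention = {k: found_in_best[k] / used_in_all[k] for k in used_in_all}
for k, share in retention.items():
    print(f"used by {k} model(s): {share:.1%} retained")
```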
       Analyzing the data presented in Table 5, it can be observed that the average, minimum and maximum values of the average contribution of descriptors are smaller for the best performing models than for all models. This observation leads to the conclusion that the best performing models are obtained by a combination of descriptors, and that the molecular descriptors with a partial correlation coefficient closest to one are not always found in the best performing model.
       Two clusters were obtained by applying the two-step cluster analysis technique to all models, showing that some similarities exist between MDF models. The first cluster comprised forty-one sets of compounds, while the second comprised fourteen sets. Four characters had significant importance in the first cluster obtained on all models (see Table 6):

  • Character M (the overlapping descriptors interaction on the maximal fragments) in the fourth position of the descriptor name
  • Characters E (interaction descriptor of the second atom property divided by the distance between the atoms) and F (interaction descriptor of the squared first atom property divided by the squared distance between atoms) in the fifth position of the descriptor name
  • Character H (number of directly bonded hydrogens as atomic property) in the sixth position of the descriptor name
       In the second cluster, the one that comprises fourteen sets of compounds, fourteen characters proved to have significant importance in clustering:
  • Character i (the inverse linearization procedure applied in global molecular descriptor generation) in the first position of the descriptor name
  • Characters B (average mean by atom), b (average mean by bond), F (geometric mean by atom), and i (harmonic mean by bond) in the second position of the descriptor name (the cumulative method of fragmentation properties)
  • Character M (the maximal fragments criteria) in the third position of the descriptor name
  • Characters M (the overlapping descriptors interaction on the maximal fragments) and d (the overlapping descriptors interaction obtained by treating descriptors as Cartesian vectors) in the fourth position of the descriptor name
  • Characters Q (the squared product between first and second atoms properties), K (the product between the first and second atoms properties and the distance between them), k (the inverse of K), E (interaction descriptor of the second atom property divided by the distance between the atoms), and W (the square of the first atom property divided by the distance between two atoms) in the fifth position of the descriptor name
  • Character H (number of directly bonded hydrogens as atomic property) in the sixth position of the descriptor name.
       On the sample of best performing MDF SAR/SPR models, the two-step cluster analysis was not able to identify any clusters. This could be explained by the absence of similarities among the descriptor characters used by the best performing models. The characters met most frequently in the descriptor names of the best performing models were:
  • i character for the first position of the descriptor name (the inverse linearization procedure applied in global molecular descriptor generation)
  • s character for the second position of the descriptor name (the product between the first and second atoms properties divided by the distance raised to the power of three)
  • M character for the third position of the descriptor name (the maximal fragments criteria)
  • r character for the fourth position of the descriptor name (the overlapping descriptors interaction obtained by treating descriptors as scalars and computing the resultant relative to the conventional origin)
  • k character for the fifth position of the descriptor name (the inverse of the product between the first and second atoms properties and the distance between them)
  • Q character for the sixth position of the descriptor name (semi-empirical Extended Hückel model, Single Point approach as atomic property)
  • t character for the seventh position of the descriptor name (molecular topology)
       Taking into account the above information, it can be concluded that no similarities or patterns could be identified in the MDF SAR/SPR models, even if the results of the analysis of all models suggest otherwise. Note that the analysis of all MDF SAR/SPR models included, for each set of compounds, the univariate models, which in most cases performed weakly in terms of estimation and prediction abilities.
       The similarities between the quantitative variables of the best performing models were analyzed with the hierarchical cluster technique. Looking at the icicle plot (Figure 1), one can follow what happens at each clustering step. At the starting step (not represented on the icicle plot, Figure 1), each set of compounds was a cluster unto itself (the number of clusters at the starting point being equal to fifty-five). Starting with the first step, the sets were ordered in the icicle plot according to their combination into clusters. The 15:DevMTOp15 set is linked first with the 12:DevMTOp11 set, followed by the 24:DevMTOp24 set, and so on until all the clusters are formed. From the dendrogram (see Figure 2) it can be observed that three clusters are formed at small distances: one comprising forty-seven sets, and two others comprising five and three sets, respectively. The differences between the three clusters lie in the sample size and the number of descriptors used by the model. In the cluster of forty-seven sets, the sample sizes varied from five to forty and the number of molecular descriptors from two to three. In the cluster of five sets, the sample sizes varied from fifty-seven to seventy-three and the number of descriptors from two to five, while in the cluster of three sets the number of compounds was two hundred and nine and the number of variables varied from two to six. At a short distance, two clusters are linked together (the one comprising forty-seven sets and the one comprising five sets). All the clusters are linked together at the maximum possible distance.
       The research reached its goal of searching for patterns in the MDF SAR/SPR models. The results showed that, on the studied sets of compounds, the MDF SAR/SPR method identified models that are unique for each set, due to the complex information obtained from the structure of the compounds. Based on these results, the MDF SAR/SPR method will be updated by analyzing the usefulness of the three characters from the second position of the descriptor name that were not identified in any model. The MDF SAR/SPR database will be developed further by analyzing and including more sets of compounds. Data mining techniques applied to larger sets of compounds could reveal important information for the characterization of activities or properties of compounds based on information obtained from their structure.

Conclusion

       The data mining techniques applied to the MDF SAR/SPR models revealed that no classification of the characters used in the descriptor names, and thus of their construction, is possible. This result supports the ability of the MDF SAR/SPR method to identify those structural characteristics of compounds that are linked with the activity or property of interest.
       Hierarchical cluster analysis is a useful technique for identifying similarities between MDF SAR/SPR models with respect to the quantitative variables, in our case the squared correlation coefficient, the number of descriptors used by the models and the sample sizes.
       Data mining techniques applied to larger sets of compounds analyzed with the MDF SAR/SPR method could reveal important information for the characterization of activities or properties of compounds based on information obtained from their structure.

References

    1. W. Frawley, G. Piatetsky-Shapiro, C. Matheus. "Knowledge Discovery in Databases: An Overview". AI Magazine, Fall 1992, pp. 213-228.
    2. D. Hand, H. Mannila, P. Smyth. "Principles of Data Mining". MIT Press, Cambridge, MA, 2001.
    3. Y.-L. Chen, J.-M. Chen, C.-W. Tung. "A data mining approach for retail knowledge discovery with consideration of the effect of shelf-space adjacency on sales". Decision Support Systems 2007, 42(3), pp. 1503-1520.
    4. C. Romero, S. Ventura. "Educational data mining: A survey from 1995 to 2005". Expert Systems with Applications 2007, 33(1), pp. 135-146.
    5. A.J.T. Lee, R.-W. Hong, W.-M. Ko, W.-K. Tsao, H.-H. Lin. "Mining spatial association rules in image databases". Information Sciences 2007, 177(7), pp. 1593-1608.
    6. U. Maran, S. Sild, I. Kahn, K. Takkis. "Mining of the chemical information in GRID environment". Future Generation Computer Systems 2007, 23(1), pp. 76-83.
    7. Q. Yang, J. Yin, C. Ling, R. Pan. "Extracting actionable knowledge from decision trees". IEEE Transactions on Knowledge and Data Engineering 2007, 19(1), pp. 43-55.
    8. T. Imamura, S. Matsumoto, Y. Kanagawa, B. Tajima, S. Matsuya, M. Furue, H. Oyama. "A technique for identifying three diagnostic findings using association analysis". Medical and Biological Engineering and Computing 2007, 45(1), pp. 51-59.
    9. L. Jäntschi. "MDF - A New QSPR/QSAR Molecular Descriptors Family". Leonardo Journal of Sciences 2004, Issue 4, pp. 68-85.
    10. L. Jäntschi. "Molecular Descriptors Family on Structure Activity Relationships 1. Review of the Methodology". Leonardo Electronic Journal of Practices and Technologies 2005, Issue 6, pp. 76-98.
    11. L. Jäntschi. "QSPR on Estimating of Polychlorinated Biphenyls Relative Response Factor using Molecular Descriptors Family". Leonardo Electronic Journal of Practices and Technologies 2004, 5, pp. 67-84.
    12. L. Jäntschi, S. Bolboacă. "Molecular Descriptors Family on Structure Activity Relationships 4. Molar Refraction of Cyclic Organophosphorus Compounds". Leonardo Electronic Journal of Practices and Technologies 2005, 7, pp. 55-102.
    13. L. Jäntschi, S. Bolboacă. "Molecular Descriptors Family on Structure Activity Relationships 6. Octanol-Water Partition Coefficient of Polychlorinated Biphenyls". Leonardo Electronic Journal of Practices and Technologies 2006, 8, pp. 71-86.
    14. L. Jäntschi. "Delphi Client - Server Implementation of Multiple Linear Regression Findings: a QSAR/QSPR Application". Applied Medical Informatics 2004, 15, pp. 48-55.
    15. L. Jäntschi, S.D. Bolboacă. "Modeling the Octanol-Water Partition Coefficient of Substituted Phenols by the Use of Structure Information". International Journal of Quantum Chemistry. In Press, Published Online: 3 Jan 2007
    16. S. Bolboacă, L. Jäntschi. "Molecular Descriptors Family on Structure Activity Relationships 2. Insecticidal Activity of Neonicotinoid Compounds". Leonardo Journal of Sciences 2005, 6, pp. 78-85.
    17. S. Bolboacă, L. Jäntschi. "Molecular Descriptors Family on Structure-Activity Relationships: Modeling Herbicidal Activity of Substituted Triazines Class". Bulletin of University of Agricultural Sciences and Veterinary Medicine - Agriculture 2006, 62, pp. 35-40.
    18. S. Bolboacă, C. Filip, S. Tigan, L. Jäntschi. "Antioxidant Efficacy of 3-Indolyl Derivates by Complex Information Integration". Clujul Medical 2006, Issue LXXIX(2), pp. 204-209.
    19. L. Jäntschi, M.L. Unguresan, S.D. Bolboacă. "Integration of Complex Structural Information in Modeling of Inhibition Activity on Carbonic Anhydrase II of Substituted Disulfonamides". Applied Medical Informatics 2005, 17(3, 4), pp. 12-21.
    20. L. Jäntschi, S. Bolboacă. "Modelling the Inhibitory Activity on Carbonic Anhydrase IV of Substituted Thiadiazole- and Thiadiazoline- Disulfonamides: Integration of Structure Information". Electronic Journal of Biomedicine 2006, 2, pp. 22-33.
    21. S. Bolboacă, S. Tigan, L. Jäntschi. "Molecular Descriptors Family on Structure-Activity Relationships on anti-HIV-1 Potencies of HEPTA and TIBO Derivatives". Proceedings of the European Federation for Medical Informatics Special Topic Conference, April 6-8, 2006, pp. 222-226.
    22. S.D. Bolboacă, L. Jäntschi. "Modeling of Structure-Toxicity Relationship of Alkyl Metal Compounds by Integration of Complex Structural Information". Terapeutics, Pharmacology and Clinical Toxicology 2006, X(1), pp. 110-114.
    23. L. Jäntschi, S. Bolboacă. "Molecular Descriptors Family on QSAR Modeling of Quinoline-based Compounds Biological Activities". The 10th Electronic Computational Chemistry Conference. April 2005, http://eccc.monmouth.edu
    24. S. Bolboacă, L. Jäntschi. "Molecular Descriptors Family on Structure Activity Relationships 3. Antituberculotic Activity of some Polyhydroxyxanthones". Leonardo Journal of Sciences 2005, 7, pp. 58-64.
    25. L. Jäntschi, S. Bolboacă. "Molecular Descriptors Family on Structure Activity Relationships 5. Antimalarial Activity of 2,4-Diamino-6-Quinazoline Sulfonamide Derivates". Leonardo Journal of Sciences 2006, 8, pp. 77-88.
    26. L. Jäntschi, S. Bolboacă. "Results from the Use of Molecular Descriptors Family on Structure Property/Activity Relationships". International Journal of Molecular Sciences 2007, 8, pp. 189-203.
    27. Binomial Distribution, © L 2007. Available from: URL: http://l.academicdirect.org/Statistics/binomial_distribution/

    Used software: PHP [php.net], FreePascal [freepascal.org], MySQL [mysql.com], SPSS [spss.com].
    Online resources: ECCC10#4 [presentation #4 at ECCC10]; L.AcademicDirect [Library from AcademicDirect].
    Acknowledgment: MEC/UEFISCSU Romania, Grant ET.46/2006.
    Contact: sorana@j.academicdirect.ro, lori@j.academicdirect.org.