The choice of similarity coefficient used in clustering can have a great impact on the resulting classification. There is therefore a need to study and understand these coefficients better, so that the right choice can be made for specific situations. Many studies have been carried out without an apparent reason for the choice of similarity coefficient or clustering method; however, a particular similarity coefficient combined with different clustering methods may give different results. The Dice and Jaccard similarity coefficients have been reported to give very similar results with respect to dendrogram structure, despite the fact that Jaccard is metric while Dice is believed to be non-metric. On the other hand, the simple matching coefficient, which takes into consideration the negative co-occurrences (shared absences) of the individuals being compared, is known to give a different structure. In this study, these three coefficients were employed in cluster analysis (CA) using five clustering methods, namely the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), the Weighted Pair Group Method with Arithmetic Mean (WPGMA), complete linkage, single linkage and Neighbour-Joining (NJ), for simulated and experimental binary data sets. The consensus fork index (CFI) values used to compare the dendrograms showed varying levels of similarity for all the CA methods. The NJ and single linkage methods seemed to give the lowest values. The single linkage method is therefore not suggested as an appropriate method, because of its tendency to produce many singletons in classifications. In all of the data sets, it was observed that high correlation does not necessarily imply similarity in the topology of a tree; care should therefore be taken in its interpretation. The cophenetic correlation with the original distances suggests that the UPGMA method gives consistent results with respect to grouping, irrespective of the similarity coefficient.
However, the combination of the Jaccard coefficient and the UPGMA method was observed to give a higher cophenetic correlation value for all data sets. We therefore recommend the use of the UPGMA method with the Jaccard coefficient because of its consistency. The multidimensional scaling (MDS) and principal component analysis (PCA) results confirmed most of the groupings of the isolates seen in the dendrograms. The pairwise comparison, which measures the similarity of two individuals, and the clustering method, which measures the similarity of groups, may both have a big impact on the results of classification. There is therefore a need to select these two options carefully, depending on the data and the purpose of the research.
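As an illustration of how the three coefficients differ, the following is a minimal sketch in plain Python, not the code used in the study. It assumes the usual textbook definitions in terms of the counts a (shared presences), b and c (mismatches) and d (shared absences), takes 1 minus the similarity as the dissimilarity, and uses three made-up binary profiles `x`, `y` and `z`. The function names are hypothetical.

```python
def counts(x, y):
    """Return (a, b, c, d) for two equal-length sequences of 0/1 values."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # shared presences
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)  # present in x only
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)  # present in y only
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)  # shared absences
    return a, b, c, d


def jaccard(x, y):
    a, b, c, _ = counts(x, y)
    # Assumption: two all-zero profiles are treated as identical.
    return a / (a + b + c) if (a + b + c) else 1.0


def dice(x, y):
    a, b, c, _ = counts(x, y)
    # Same all-zero convention as above (an assumption of this sketch).
    return 2 * a / (2 * a + b + c) if (a + b + c) else 1.0


def simple_matching(x, y):
    a, b, c, d = counts(x, y)
    return (a + d) / (a + b + c + d)  # shared absences count as agreement


def triangle_holds(sim, x, y, z):
    """Check d(x, y) <= d(x, z) + d(z, y) with d = 1 - similarity."""
    return (1 - sim(x, y)) <= (1 - sim(x, z)) + (1 - sim(z, y))


# Made-up profiles, chosen only to expose the differences.
x = [1, 0, 0, 0]
y = [0, 1, 0, 0]
z = [1, 1, 0, 0]

if __name__ == "__main__":
    for name, sim in [("Jaccard", jaccard), ("Dice", dice),
                      ("Simple matching", simple_matching)]:
        print(name, sim(x, y), sim(x, z), triangle_holds(sim, x, y, z))
```

For these profiles the Dice dissimilarity violates the triangle inequality while the Jaccard dissimilarity does not, and the simple matching coefficient rates `x` and `y` as half similar purely on the strength of their shared absences, whereas Jaccard and Dice rate them as completely dissimilar.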