weka.clusterers
Class EM

java.lang.Object
  extended by weka.clusterers.AbstractClusterer
      extended by weka.clusterers.AbstractDensityBasedClusterer
          extended by weka.clusterers.RandomizableDensityBasedClusterer
              extended by weka.clusterers.EM
All Implemented Interfaces:
java.io.Serializable, java.lang.Cloneable, Clusterer, DensityBasedClusterer, NumberOfClustersRequestable, CapabilitiesHandler, OptionHandler, Randomizable, RevisionHandler, WeightedInstancesHandler

public class EM
extends RandomizableDensityBasedClusterer
implements NumberOfClustersRequestable, WeightedInstancesHandler

Simple EM (expectation maximisation) class.

EM assigns a probability distribution to each instance which indicates the probability of it belonging to each of the clusters. EM can decide how many clusters to create by cross-validation, or you may specify a priori how many clusters to generate.

The cross-validation performed to determine the number of clusters is done in the following steps:
1. The number of clusters is set to 1.
2. The training set is split randomly into 10 folds.
3. EM is performed 10 times using the 10 folds in the usual cross-validation way.
4. The log-likelihood is averaged over all 10 results.
5. If the log-likelihood has increased, the number of clusters is increased by 1 and the program continues at step 2.

The number of folds is fixed at 10 as long as the training set contains at least 10 instances. If it contains fewer than 10, the number of folds is set equal to the number of instances.
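
The selection procedure above can be sketched in plain Java. This is an illustration of the described steps only, not Weka's source: the class name, the cvLogLikelihood callback (standing in for "run EM on each fold and average the log-likelihood") and the maxClusters safety cap are all assumptions introduced for the example.

```java
import java.util.function.IntToDoubleFunction;

/** Sketch of the cluster-count selection loop (hypothetical helper, not part of Weka). */
final class ClusterCountSelection {

    /** Ten folds, unless the training set has fewer than ten instances. */
    static int numFolds(int numInstances) {
        return Math.min(10, numInstances);
    }

    /**
     * Starts with one cluster and adds clusters while the averaged
     * cross-validated log-likelihood keeps increasing.
     *
     * @param cvLogLikelihood hypothetical callback returning the log-likelihood
     *                        averaged over all folds for a given cluster count
     * @param maxClusters     assumed upper bound so the loop always terminates
     * @return the last cluster count whose score was an improvement
     */
    static int selectNumClusters(IntToDoubleFunction cvLogLikelihood, int maxClusters) {
        int k = 1;
        double best = cvLogLikelihood.applyAsDouble(k);
        while (k < maxClusters) {
            double next = cvLogLikelihood.applyAsDouble(k + 1);
            if (next <= best) {
                break; // no increase: keep the previous cluster count
            }
            best = next;
            k++;
        }
        return k;
    }
}
```

For instance, ClusterCountSelection.numFolds(4) gives 4 folds, while any training set of 10 or more instances gives 10.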

Valid options are:

 -N <num>
  number of clusters. If omitted or -1 specified, then 
  cross validation is used to select the number of clusters.
 -I <num>
  max iterations.
  (default 100)
 -V
  verbose.
 -M <num>
  minimum allowable standard deviation for normal density
  computation
  (default 1e-6)
 -O
  Display model in old format (good when there are many clusters)
 
 -S <num>
  Random number seed.
  (default 100)

Version:
$Revision: 1.44 $
Author:
Mark Hall (mhall@cs.waikato.ac.nz), Eibe Frank (eibe@cs.waikato.ac.nz)
See Also:
Serialized Form

Constructor Summary
EM()
          Constructor.
 
Method Summary
 void buildClusterer(Instances data)
          Generates a clusterer.
 double[] clusterPriors()
          Returns the cluster priors.
 java.lang.String debugTipText()
          Returns the tip text for this property
 java.lang.String displayModelInOldFormatTipText()
          Returns the tip text for this property
 Capabilities getCapabilities()
          Returns default capabilities of the clusterer (i.e., the ones of SimpleKMeans).
 double[][][] getClusterModelsNumericAtts()
          Return the normal distributions for the cluster models
 double[] getClusterPriors()
          Return the priors for the clusters
 boolean getDebug()
          Get debug mode
 boolean getDisplayModelInOldFormat()
          Get whether to display model output in the old, original format.
 int getMaxIterations()
          Get the maximum number of iterations
 double getMinStdDev()
          Get the minimum allowable standard deviation.
 int getNumClusters()
          Get the number of clusters
 java.lang.String[] getOptions()
          Gets the current settings of EM.
 java.lang.String getRevision()
          Returns the revision string.
 java.lang.String globalInfo()
          Returns a string describing this clusterer
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
 double[] logDensityPerClusterForInstance(Instance inst)
          Computes the log of the conditional density (per cluster) for a given instance.
static void main(java.lang.String[] argv)
          Main method for testing this class.
 java.lang.String maxIterationsTipText()
          Returns the tip text for this property
 java.lang.String minStdDevTipText()
          Returns the tip text for this property
 int numberOfClusters()
          Returns the number of clusters.
 java.lang.String numClustersTipText()
          Returns the tip text for this property
 void setDebug(boolean v)
          Set debug mode - verbose output
 void setDisplayModelInOldFormat(boolean d)
          Set whether to display model output in the old, original format.
 void setMaxIterations(int i)
          Set the maximum number of iterations to perform
 void setMinStdDev(double m)
          Set the minimum value for standard deviation when calculating normal density.
 void setMinStdDevPerAtt(double[] m)
           
 void setNumClusters(int n)
          Set the number of clusters (-1 to select by CV).
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 java.lang.String toString()
          Outputs the generated clusters into a string.
 
Methods inherited from class weka.clusterers.RandomizableDensityBasedClusterer
getSeed, seedTipText, setSeed
 
Methods inherited from class weka.clusterers.AbstractDensityBasedClusterer
distributionForInstance, logDensityForInstance, logJointDensitiesForInstance, makeCopies
 
Methods inherited from class weka.clusterers.AbstractClusterer
clusterInstance, forName, makeCopies, makeCopy
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface weka.clusterers.Clusterer
clusterInstance
 

Constructor Detail

EM

public EM()
Constructor.

Method Detail

globalInfo

public java.lang.String globalInfo()
Returns a string describing this clusterer

Returns:
a description of the clusterer suitable for displaying in the explorer/experimenter gui

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Overrides:
listOptions in class RandomizableDensityBasedClusterer
Returns:
an enumeration of all the available options.

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options.

Valid options are:

 -N <num>
  number of clusters. If omitted or -1 specified, then 
  cross validation is used to select the number of clusters.
 -I <num>
  max iterations.
  (default 100)
 -V
  verbose.
 -M <num>
  minimum allowable standard deviation for normal density
  computation
  (default 1e-6)
 -O
  Display model in old format (good when there are many clusters)
 
 -S <num>
  Random number seed.
  (default 100)

Specified by:
setOptions in interface OptionHandler
Overrides:
setOptions in class RandomizableDensityBasedClusterer
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

displayModelInOldFormatTipText

public java.lang.String displayModelInOldFormatTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setDisplayModelInOldFormat

public void setDisplayModelInOldFormat(boolean d)
Set whether to display model output in the old, original format.

Parameters:
d - true if model output is to be shown in the old format

getDisplayModelInOldFormat

public boolean getDisplayModelInOldFormat()
Get whether to display model output in the old, original format.

Returns:
true if model output is to be shown in the old format

minStdDevTipText

public java.lang.String minStdDevTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setMinStdDev

public void setMinStdDev(double m)
Set the minimum value for standard deviation when calculating normal density. Increasing this value can help prevent arithmetic overflow resulting from multiplying large densities (arising from small standard deviations) when there are many singleton or near singleton values.

Parameters:
m - minimum value for standard deviation
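
To illustrate why a floor on the standard deviation matters, the following sketch evaluates a normal log density with the standard deviation clamped to a minimum. It is a generic illustration rather than Weka's implementation; the class and method names are hypothetical. With the floor in place, the peak density can never exceed 1 / (minStdDev * sqrt(2 * pi)), even for a cluster whose values are all identical.

```java
/** Sketch of a normal log density with a floored standard deviation (hypothetical helper). */
final class MinStdDevSketch {

    /**
     * Log of the normal density at x. The standard deviation is replaced by
     * minStdDev whenever it falls below that floor.
     */
    static double logNormalDensity(double x, double mean, double stdDev, double minStdDev) {
        double sd = Math.max(stdDev, minStdDev); // apply the floor
        double z = (x - mean) / sd;
        return -0.5 * z * z - Math.log(sd) - 0.5 * Math.log(2.0 * Math.PI);
    }
}
```

A singleton cluster (stdDev of 0) evaluated at its own mean stays finite under any positive floor, and a larger floor yields a smaller peak value.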

setMinStdDevPerAtt

public void setMinStdDevPerAtt(double[] m)

getMinStdDev

public double getMinStdDev()
Get the minimum allowable standard deviation.

Returns:
the minimum allowable standard deviation

numClustersTipText

public java.lang.String numClustersTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setNumClusters

public void setNumClusters(int n)
                    throws java.lang.Exception
Set the number of clusters (-1 to select by CV).

Specified by:
setNumClusters in interface NumberOfClustersRequestable
Parameters:
n - the number of clusters
Throws:
java.lang.Exception - if n is 0

getNumClusters

public int getNumClusters()
Get the number of clusters

Returns:
the number of clusters.

maxIterationsTipText

public java.lang.String maxIterationsTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setMaxIterations

public void setMaxIterations(int i)
                      throws java.lang.Exception
Set the maximum number of iterations to perform

Parameters:
i - the number of iterations
Throws:
java.lang.Exception - if i is less than 1

getMaxIterations

public int getMaxIterations()
Get the maximum number of iterations

Returns:
the number of iterations

debugTipText

public java.lang.String debugTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setDebug

public void setDebug(boolean v)
Set debug mode - verbose output

Parameters:
v - true for verbose output

getDebug

public boolean getDebug()
Get debug mode

Returns:
true if debug mode is set

getOptions

public java.lang.String[] getOptions()
Gets the current settings of EM.

Specified by:
getOptions in interface OptionHandler
Overrides:
getOptions in class RandomizableDensityBasedClusterer
Returns:
an array of strings suitable for passing to setOptions()

getClusterModelsNumericAtts

public double[][][] getClusterModelsNumericAtts()
Return the normal distributions for the cluster models

Returns:
a double[][][] value

getClusterPriors

public double[] getClusterPriors()
Return the priors for the clusters

Returns:
a double[] value

toString

public java.lang.String toString()
Outputs the generated clusters into a string.

Overrides:
toString in class java.lang.Object
Returns:
the clusterer in string representation

numberOfClusters

public int numberOfClusters()
                     throws java.lang.Exception
Returns the number of clusters.

Specified by:
numberOfClusters in interface Clusterer
Specified by:
numberOfClusters in class AbstractClusterer
Returns:
the number of clusters generated for a training dataset.
Throws:
java.lang.Exception - if number of clusters could not be returned successfully

getCapabilities

public Capabilities getCapabilities()
Returns default capabilities of the clusterer (i.e., the ones of SimpleKMeans).

Specified by:
getCapabilities in interface Clusterer
Specified by:
getCapabilities in interface CapabilitiesHandler
Overrides:
getCapabilities in class AbstractClusterer
Returns:
the capabilities of this clusterer
See Also:
Capabilities

buildClusterer

public void buildClusterer(Instances data)
                    throws java.lang.Exception
Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.

Specified by:
buildClusterer in interface Clusterer
Specified by:
buildClusterer in class AbstractClusterer
Parameters:
data - set of instances serving as training data
Throws:
java.lang.Exception - if the clusterer has not been generated successfully

clusterPriors

public double[] clusterPriors()
Returns the cluster priors.

Specified by:
clusterPriors in interface DensityBasedClusterer
Specified by:
clusterPriors in class AbstractDensityBasedClusterer
Returns:
the cluster priors
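
In a mixture model, the prior of a cluster is its share of the total (weighted) membership mass. The sketch below shows that computation in plain Java under the assumption that per-instance membership probabilities and instance weights are already available. It is not taken from Weka's source, and the names are hypothetical.

```java
/** Sketch of estimating cluster priors from soft memberships (hypothetical helper). */
final class ClusterPriorsSketch {

    /**
     * @param memberships     memberships[i][j] is the probability that instance i
     *                        belongs to cluster j
     * @param instanceWeights instanceWeights[i] is the weight of instance i
     * @return priors that sum to 1, one entry per cluster
     */
    static double[] priors(double[][] memberships, double[] instanceWeights) {
        int numClusters = memberships[0].length;
        double[] priors = new double[numClusters];
        double total = 0.0;
        for (int i = 0; i < memberships.length; i++) {
            for (int j = 0; j < numClusters; j++) {
                double mass = instanceWeights[i] * memberships[i][j];
                priors[j] += mass;
                total += mass;
            }
        }
        for (int j = 0; j < numClusters; j++) {
            priors[j] /= total; // normalize so the priors sum to 1
        }
        return priors;
    }
}
```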

logDensityPerClusterForInstance

public double[] logDensityPerClusterForInstance(Instance inst)
                                         throws java.lang.Exception
Computes the log of the conditional density (per cluster) for a given instance.

Specified by:
logDensityPerClusterForInstance in interface DensityBasedClusterer
Specified by:
logDensityPerClusterForInstance in class AbstractDensityBasedClusterer
Parameters:
inst - the instance to compute the density for
Returns:
an array containing the estimated log densities, one per cluster
Throws:
java.lang.Exception - if the density could not be computed successfully
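
Per-cluster log densities can be combined with the cluster priors to obtain the probability of an instance belonging to each cluster. The sketch below shows one numerically stable way to do so (subtracting the maximum before exponentiating). It assumes strictly positive priors, uses hypothetical names, and is not a copy of how Weka computes distributionForInstance.

```java
/** Sketch of turning per-cluster log densities into membership probabilities (hypothetical helper). */
final class MembershipSketch {

    /**
     * @param logDensityPerCluster log density of the instance under each cluster
     * @param priors               cluster priors, assumed to be strictly positive
     * @return membership probabilities that sum to 1
     */
    static double[] membership(double[] logDensityPerCluster, double[] priors) {
        int numClusters = priors.length;
        double[] logJoint = new double[numClusters];
        double max = Double.NEGATIVE_INFINITY;
        for (int j = 0; j < numClusters; j++) {
            logJoint[j] = logDensityPerCluster[j] + Math.log(priors[j]);
            max = Math.max(max, logJoint[j]);
        }
        double[] result = new double[numClusters];
        double sum = 0.0;
        for (int j = 0; j < numClusters; j++) {
            result[j] = Math.exp(logJoint[j] - max); // shift avoids underflow
            sum += result[j];
        }
        for (int j = 0; j < numClusters; j++) {
            result[j] /= sum;
        }
        return result;
    }
}
```

Working in log space matters here: exponentiating a log density such as -1000 directly would underflow to zero for every cluster, whereas the shifted version still recovers the correct ratios.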

getRevision

public java.lang.String getRevision()
Returns the revision string.

Specified by:
getRevision in interface RevisionHandler
Returns:
the revision

main

public static void main(java.lang.String[] argv)
Main method for testing this class.

Parameters:
argv - should contain the following arguments:

-t training file [-T test file] [-N number of clusters] [-S random seed]