Friday, March 30, 2012

PredictCaseLikelihood

I'm working with the cluster analysis algorithm (EM) in SQL 2005. I have tried to find documentation on the function PredictCaseLikelihood without luck. Is there any reference on how this function is defined?

Here's an excerpt from my book, Data Mining with SQL Server 2005:

PredictCaseLikelihood

PredictCaseLikelihood returns a measure from 0 to 1 that indicates how likely an input case is to exist considering the model learned by the algorithm. This measure is very good for use in anomaly detection as it quickly and easily tells you if new data is similar to any data seen before. This function operates in two modes, normalized and nonnormalized.

In the nonnormalized mode, the value of the measure is the raw probability of the case, that is, the product of the probabilities of each of the attributes in the case. For instance, if the probability of Home Ownership = ‘Yes’ is 40% and the probability of Occupation = ‘Craftsmen’ is 10%, then the probability of the case is 40% x 10% = 4%.
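The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the product rule described in the excerpt, not the Analysis Services implementation; the function name `raw_case_likelihood` and the variable names are hypothetical.

```python
from math import prod


def raw_case_likelihood(attribute_probabilities):
    """Multiply per-attribute probabilities into a single case probability.

    Hypothetical helper: it mirrors the product rule described in the text,
    not the internals of PredictCaseLikelihood.
    """
    return prod(attribute_probabilities)


# Probabilities taken from the example in the text.
home_ownership_yes = 0.40
occupation_craftsmen = 0.10

likelihood = raw_case_likelihood([home_ownership_yes, occupation_craftsmen])
print(f"{likelihood:.2%}")
```

Each extra attribute multiplies in another factor that is at most 1, so the product can only stay the same or shrink as attributes are added.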

Nonnormalized likelihoods can be useful, but due to the nature of the probabilities, as you increase the number of attributes in a case the probability of the case becomes increasingly small. Additionally, as a user, you cannot tell whether a 4% probability for a certain combination of attributes is a good thing or a bad thing.

The normalized likelihood divides the probability of the case as provided by the model by the probability computed without the model using raw statistics. This provides a “lift” number that is normalized between 0 and 1 using the formula (lift)/(lift + 1). Cases with likelihood values greater than 0.5 have positive lift and are more likely than random to occur; cases with values less than 0.5 have negative lift and are less likely than random to occur.
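As a rough sketch of the normalization step, the following Python applies the (lift)/(lift + 1) formula from the excerpt. The function name and the baseline probabilities are made up for illustration; how Analysis Services actually computes the raw-statistics baseline is not shown here.

```python
def normalized_case_likelihood(model_probability, baseline_probability):
    """Map the lift (model probability / baseline probability) into (0, 1).

    Hypothetical helper that follows the formula lift / (lift + 1)
    quoted in the text.
    """
    lift = model_probability / baseline_probability
    return lift / (lift + 1)


# Illustrative values only: the model gives the case a 4% probability.
twice_as_likely = normalized_case_likelihood(0.04, 0.02)  # lift = 2
same_as_random = normalized_case_likelihood(0.04, 0.04)   # lift = 1
half_as_likely = normalized_case_likelihood(0.04, 0.08)   # lift = 0.5

print(round(twice_as_likely, 3), round(same_as_random, 3), round(half_as_likely, 3))
```

A lift of exactly 1, where the model and the raw statistics agree, lands on 0.5, which is why 0.5 serves as the dividing line between more likely and less likely than random.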

For continuous attributes, the probability distribution is used for this computation.

This query returns the normalized case likelihood for each case in the input set.

SELECT t.id, PredictCaseLikelihood()
FROM CustomerClusters
NATURAL PREDICTION JOIN <Input Set> AS t

This query returns the nonnormalized case likelihood for each case in the input set.

SELECT t.id, PredictCaseLikelihood(NONNORMALIZED)
FROM CustomerClusters
NATURAL PREDICTION JOIN <Input Set> AS t
