Friday, March 30, 2012

Predict Probability in Decision Trees

Hello,

I installed the Bike Buyer example and I am learning the DMX language. I wrote the following query (using Microsoft Decision Trees):

SELECT
T.[Last Name],
[Bike Buyer],
PredictProbability(Predict([Bike Buyer])) AS [Probability]
From
[v Target Mail]
PREDICTION JOIN
OPENQUERY
(....... And so on..)

The result surprises me: in the result table, all the probabilities are equal.

Bike Buyer Probability
1 0.99994590500919611
0 0.99994590500919611
0 0.99994590500919611
0 0.99994590500919611
0 0.99994590500919611
1 0.99994590500919611

and so on.

Now I am wondering what PredictProbability means. I thought it was the probability that the prediction is correct, but all the probabilities are the same even though the inputs differ. Can somebody tell me what PredictProbability means, or am I using it wrong?

Thanks in advance,

Joris Valkonet

This is an interesting query. I would write it as "PredictProbability([Bike Buyer])"; however, the two forms are semantically the same. Another thing to try is PredictNodeId([Bike Buyer]) to see exactly which node is used for the prediction, and then check that node's distribution.

It's possible that PredictProbability(Predict([Bike Buyer])) is exposing a bug and you should use the simpler syntax above.

HTH

-Jamie

|||

Thank you for your reply.
I modified my query to PredictProbability([bike buyer]) and this made no difference. The result is the same.

Then I checked the PredictNodeId([Bike Buyer]) and this is the result:

FirstName  LastName   bike buyer  Expression
Abby       Malhotra   1           000000003
Abby       Prasad     0           000000003
Abby       Rodriguez  0           000000003
Abby       Srini      0           000000003
Abigail    Brown      0           000000003
Abigail    Bryant     1           000000003
Abigail    Davis      0           000000003
Abigail    Flores     0           000000003

The node ID is the root node of the tree, and the probability of the root node is indeed 0.99994590500919611. So if I understand this correctly, the root node of the decision tree is used for all probability predictions. Is this correct?

(Maybe my query is incorrect, so the whole DMX query is below.)

SELECT
t.[FirstName],
t.[LastName],
[bike buyer],
PredictNodeId([bike buyer]) AS [Predicted NodeId]
From
[v Target Mail]
PREDICTION JOIN
OPENQUERY([Adventure Works DW],
'SELECT
[Gender],
[FirstName],
[MiddleName],
[LastName],
[BirthDate],
[MaritalStatus],
[EmailAddress],
[YearlyIncome],
[TotalChildren],
[NumberChildrenAtHome],
[HouseOwnerFlag],
[NumberCarsOwned],
[AddressLine1],
[AddressLine2],
[Phone]
FROM
[dbo].[ProspectiveBuyer]
') AS t
ON
[v Target Mail].[First Name] = t.[FirstName] AND
[v Target Mail].[Middle Name] = t.[MiddleName] AND
[v Target Mail].[Last Name] = t.[LastName] AND
[v Target Mail].[Birth Date] = t.[BirthDate] AND
[v Target Mail].[Marital Status] = t.[MaritalStatus] AND
[v Target Mail].[Gender] = t.[Gender] AND
[v Target Mail].[Email Address] = t.[EmailAddress] AND
[v Target Mail].[Yearly Income] = t.[YearlyIncome] AND
[v Target Mail].[Total Children] = t.[TotalChildren] AND
[v Target Mail].[Number Children At Home] = t.[NumberChildrenAtHome] AND
[v Target Mail].[House Owner Flag] = t.[HouseOwnerFlag] AND
[v Target Mail].[Number Cars Owned] = t.[NumberCarsOwned] AND
[v Target Mail].[Address Line1] = t.[AddressLine1] AND
[v Target Mail].[Address Line2] = t.[AddressLine2] AND
[v Target Mail].[Phone] = t.[Phone]

|||

Something weird is happening here. Predict([Bike Buyer]), which is equivalent to saying just [Bike Buyer], returns the most probable value for the prediction. That means that whatever node the case lands in, it will return the highest-probability state for that node. I can't see how you are getting different prediction results from the same node.

However, something else struck me as odd. The predict probability of 0.999... doesn't seem right. In fact, the only time I have ever seen anything like this is when all the data at the node was missing and the actual non-missing values received only the Bayesian prior. By default, prediction never returns missing (if you are asking for a value, it assumes you want an actual value). If you do a Predict([Bike Buyer], INCLUDE_MISSING), you will get predictions that include the missing value. If the training data contained no observations of the states 1 and 0, they would have only their prior probabilities, which would be equal, and the choice of a state would be arbitrary.

What confuses me is that PredictProbability should behave the same way.

Can you do a SELECT FLATTENED NODE_DISTRIBUTION FROM [v Target Mail].CONTENT and post the results?
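If the full content rowset turns out to be unwieldy, a filtered version of the same query may be enough. This is only a sketch: it assumes the value 000000003 returned by PredictNodeId above is the NODE_UNIQUE_NAME of the node in question, and the extra NODE_CAPTION and NODE_SUPPORT columns are included just for orientation.

```dmx
-- Sketch: return the distribution for a single node only.
-- Assumption: '000000003' is the node ID taken from the PredictNodeId output above.
SELECT FLATTENED
    NODE_UNIQUE_NAME,
    NODE_CAPTION,
    NODE_SUPPORT,
    NODE_DISTRIBUTION
FROM [v Target Mail].CONTENT
WHERE NODE_UNIQUE_NAME = '000000003'
```

Restricting the query to one node keeps the result to a handful of rows, which is usually easier to read and to post.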

Thanks

|||

Thanks for your analysis. I tried to post the whole result of the query "SELECT FLATTENED NODE_DISTRIBUTION FROM [v Target Mail].CONTENT", but then an unknown error occurred, I think because the result table is too big. So I will give just the first rows of the result.

Thanks....

ATTRIBUTE_NAME | ATTRIBUTE_VALUE | SUPPORT | PROBABILITY | VARIANCE | VALUETYPE
Bike Buyer | Missing | 0 | 0.000432432 | 0 | 1
Bike Buyer | 0.494048907 | 18484 | 0.999567568 | 0.249964584 | 3
Birth Date | 5.45E-06 | 0 | 0 | 16892175 | 7
Birth Date | 0.080783333 | 0 | 0 | 0 | 8
Birth Date | 22673.85945 | 0 | 0 | 16892175 | 9
Date First Purchase | -0.000941648 | 0 | 0 | 76769.40019 | 7
Date First Purchase | 3242.190525 | 0 | 0 | 0 | 8
Date First Purchase | 37852.25763 | 0 | 0 | 76769.40019 | 9
Number Cars Owned | -0.079859089 | 0 | 0 | 1.295870198 | 7
Number Cars Owned | 301.5988757 | 0 | 0 | 0 | 8
Number Cars Owned | 1.502705042 | 0 | 0 | 1.295870198 | 9
Number Children At Home | -0.015796323 | 0 | 0 | 2.318367003 | 7
Number Children At Home | 9.530872281 | 0 | 0 | 0 | 8
Number Children At Home | 1.004057563 | 0 | 0 | 2.318367003 | 9
Total Children | -0.003992138 | 0 | 0 | 2.599718694 | 7
Total Children | 2.60E+01 | 0 | 0 | 0 | 8
Total Children | 1.844351872 | 0 | 0 | 2.599718694 | 9
Yearly Income | 2.02E-06 | 0 | 0 | 1042319181 | 7
Yearly Income | 116.9361238 | 0 | 0 | 0 | 8
Yearly Income | 57305.77797 | 0 | 0 | 1042319181 | 9
(blank) | 36.04143269 | 0 | 0 | 0.166743652 | 11
Bike Buyer | Missing | 0 | 0.000205761 | 0.00E+00 | 1
Bike Buyer | 1 | 4858 | 0.999794239 | 0 | 3
(blank) | 1 | 0 | 0 | 5.14E-07 | 11
Bike Buyer | Missing | 0 | 0.000536481 | 0 | 1
Bike Buyer | 0.507518797 | 1862 | 0.999463519 | 0.249943468 | 3
Date First Purchase | -0.010374704 | 0 | 0 | 1160.841795 | 7
Date First Purchase | 679.5475673 | 0 | 0 | 0 | 8
Date First Purchase | 37822.50644 | 0 | 0 | 1160.841795 | 9
Geography Key | 0.000170418 | 0 | 0 | 36083.71548 | 7
Geography Key | 1.990727272 | 0 | 0 | 0 | 8
Geography Key | 236.4581096 | 0 | 0 | 36083.71548 | 9
Number Cars Owned | -0.072103841 | 0 | 0 | 1.254812169 | 7
Number Cars Owned | 2.37E+01 | 0 | 0 | 0 | 8
Number Cars Owned | 1.523630505 | 0 | 0 | 1.254812169 | 9
Yearly Income | 1.29E-06 | 0 | 0 | 1131929217 | 7
Yearly Income | 7.102048444 | 0 | 0 | 0 | 8
Yearly Income | 60633.72718 | 0 | 0 | 1131929217 | 9
(blank) | 392.8960717 | 0 | 0 | 0.113511443 | 11
Bike Buyer | Missing | 0 | 0.000123274 | 0 | 1
Bike Buyer | 0.256103576 | 8110 | 0.999876726 | 0.190514534 | 3

|||

I see the problem: you have Bike Buyer modeled as continuous when it should be discrete. In this case PredictProbability was giving you the probability that a Bike Buyer value exists (that is, is not missing). To get the confidence of a continuous prediction, use PredictStdev.
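For a continuous target, a query along these lines would return the predicted value together with its standard deviation. It is only a sketch: the input values (2 cars, 60000 yearly income) are made up for illustration, and the NATURAL PREDICTION JOIN assumes the model has input columns with exactly these names.

```dmx
-- Sketch: singleton prediction against a model where [Bike Buyer] is continuous.
-- Assumption: the input values below are invented for illustration.
SELECT
    Predict([Bike Buyer])      AS [Predicted Value],
    PredictStdev([Bike Buyer]) AS [Std Deviation]
FROM [v Target Mail]
NATURAL PREDICTION JOIN
(SELECT 2     AS [Number Cars Owned],
        60000 AS [Yearly Income]) AS t
```

A large standard deviation relative to the predicted value suggests the regression is not very sure of that prediction.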

Essentially, the model was building a linear regression to predict Bike Buyer rather than a classification model. Changing the content type to "Discrete" will fix your issue. If you are creating the model from the source table (rather than a cube), you can click "Detect" on the data types page of the wizard and it will automatically determine that Bike Buyer should be discrete.
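If you define the model in DMX instead of the wizard, the difference comes down to the content type on the predictable column. A minimal sketch follows; the model name is hypothetical, only a few of the input columns are shown, and the [Customer Key] key column and the data types are assumptions about the source table.

```dmx
-- Sketch: declare [Bike Buyer] as DISCRETE so the tree classifies instead of regressing.
-- Assumptions: hypothetical model name, [Customer Key] as the case key, reduced column list.
CREATE MINING MODEL [TM Decision Tree Discrete]
(
    [Customer Key]      LONG   KEY,
    [Yearly Income]     DOUBLE CONTINUOUS,
    [Number Cars Owned] LONG   DISCRETE,
    [Total Children]    LONG   DISCRETE,
    [Bike Buyer]        LONG   DISCRETE PREDICT
)
USING Microsoft_Decision_Trees
```

Once a model like this is trained, PredictProbability([Bike Buyer]) should vary from case to case, because each leaf node carries its own distribution over the states 0 and 1.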
