PORTS USED: Predict Probability in Decision Trees

Hello,

I installed the bike buyer example and I am learning the DMX language. I wrote the following query (using Microsoft Decision Trees):

SELECT
T.[Last Name],
[Bike Buyer],
PredictProbability(Predict([Bike Buyer])) AS [Probability]
From
[v Target Mail]
PREDICTION JOIN
OPENQUERY
(....... And so on..)

The result surprises me: in the result table, all the probabilities are equal.

Bike Buyer Probability
1 0.99994590500919611
0 0.99994590500919611
0 0.99994590500919611
0 0.99994590500919611
0 0.99994590500919611
1 0.99994590500919611

and so on.

Now I am wondering what PredictProbability means. I thought it was the probability that the prediction is correct, but all the probabilities are the same even though the input differs. Can somebody tell me what PredictProbability means, or am I using it wrong?

Thanx in advance,

Joris Valkonet

|||

This is an interesting query. I would write it as "PredictProbability([Bike Buyer])"; however, the two forms are semantically the same. Another thing to try is PredictNodeId([Bike Buyer]) to see exactly which node is used for the prediction, and then check that node to see its distribution.

It's possible that PredictProbability(Predict([Bike Buyer])) is exposing a bug and you should use the simpler syntax above.

HTH

-Jamie

|||

Thank you for your reply.
I modified my query to PredictProbability([bike buyer]) and this made no difference. The result is the same.

Then I checked the PredictNodeId([Bike Buyer]) and this is the result:

FirstName LastName bike buyer Expression
Abby Malhotra 000000003
Abby Prasad 000000003
Abby Rodriguez 000000003
Abby Srini 000000003
Abigail Brown 000000003
Abigail Bryant 000000003
Abigail Davis 000000003
Abigail Flores 000000003

Now the node ID is the root node of the tree, and the probability of the root node is indeed 0.99994590500919611. So if I understand this correctly, the root node of the decision tree is used for all probability predictions. Is this correct?

(Maybe my query is incorrect, so here is the whole DMX query.)

SELECT
t.[FirstName],
t.[LastName],
[bike buyer],
PredictNodeId([bike buyer]) AS [Predicted NodeId]
From
[v Target Mail]
PREDICTION JOIN
OPENQUERY([Adventure Works DW],
'SELECT
[Gender],
[FirstName],
[MiddleName],
[LastName],
[BirthDate],
[MaritalStatus],
[EmailAddress],
[YearlyIncome],
[TotalChildren],
[NumberChildrenAtHome],
[HouseOwnerFlag],
[NumberCarsOwned],
[AddressLine1],
[AddressLine2],
[Phone]
FROM
[dbo].[ProspectiveBuyer]
') AS t
ON
[v Target Mail].[First Name] = t.[FirstName] AND
[v Target Mail].[Middle Name] = t.[MiddleName] AND
[v Target Mail].[Last Name] = t.[LastName] AND
[v Target Mail].[Birth Date] = t.[BirthDate] AND
[v Target Mail].[Marital Status] = t.[MaritalStatus] AND
[v Target Mail].[Gender] = t.[Gender] AND
[v Target Mail].[Email Address] = t.[EmailAddress] AND
[v Target Mail].[Yearly Income] = t.[YearlyIncome] AND
[v Target Mail].[Total Children] = t.[TotalChildren] AND
[v Target Mail].[Number Children At Home] = t.[NumberChildrenAtHome] AND
[v Target Mail].[House Owner Flag] = t.[HouseOwnerFlag] AND
[v Target Mail].[Number Cars Owned] = t.[NumberCarsOwned] AND
[v Target Mail].[Address Line1] = t.[AddressLine1] AND
[v Target Mail].[Address Line2] = t.[AddressLine2] AND
[v Target Mail].[Phone] = t.[Phone]

|||

Something weird is happening here. Predict([Bike Buyer]), which is equivalent to saying just [Bike Buyer], returns the highest-probability value for the prediction. That means that whatever node the tree goes to, it will return the highest-probability state for that node. I can't see how you are getting different prediction results from the same node.

However, something else struck me as odd. The predicted probability of 0.999... doesn't seem right. In fact, the only time I have ever seen anything like this is when all the data at the node was missing, and the actual non-missing values only received the Bayesian prior. By default, prediction never returns missing (if you are asking for a value, it assumes you want an actual value). If you do a Predict([Bike Buyer], INCLUDE_MISSING) you will get predictions that include the missing value. If there were no observations of the states 1 and 0 in the training data, then they would only have their prior probabilities, which would be equal, and the choice of a state would be arbitrary.

What confuses me is that PredictProbability should behave the same way.
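One way to see the Missing state next to the real states is PredictHistogram, which returns a nested table with one row per state. The following is only a sketch: it assumes the model accepts [Number Cars Owned] and [Total Children] as inputs, and the input values are made up for illustration.

```dmx
// Singleton query: the two input values below are invented for illustration.
// FLATTENED expands the nested histogram table into ordinary rows.
SELECT FLATTENED
    PredictHistogram([Bike Buyer])
FROM
    [v Target Mail]
NATURAL PREDICTION JOIN
    (SELECT 2 AS [Number Cars Owned],
            0 AS [Total Children]) AS t
```

Each row of the histogram carries the state together with its $SUPPORT and $PROBABILITY, so a Missing row and its share of the probability should be visible directly.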

Can you do a SELECT FLATTENED NODE_DISTRIBUTION FROM [v Target Mail].CONTENT and post the results?

Thanks
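If the full content result is too large to post, the same query can probably be narrowed to a single node and a single attribute. A minimal sketch, assuming the node ID '000000003' reported by PredictNodeId earlier in this thread is the node of interest:

```dmx
// Return only the Bike Buyer rows of one node's distribution.
// '000000003' is the node ID that PredictNodeId returned in this thread.
SELECT FLATTENED
    NODE_UNIQUE_NAME,
    (SELECT ATTRIBUTE_NAME, ATTRIBUTE_VALUE, [SUPPORT], [PROBABILITY], [VARIANCE], VALUETYPE
     FROM NODE_DISTRIBUTION
     WHERE ATTRIBUTE_NAME = 'Bike Buyer') AS d
FROM
    [v Target Mail].CONTENT
WHERE
    NODE_UNIQUE_NAME = '000000003'
```

Dropping the outer WHERE clause would return the Bike Buyer rows for every node instead of just one.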

|||

Thanks for your analysis. I tried to post the whole result of the query "SELECT FLATTENED NODE_DISTRIBUTION FROM [v Target Mail].CONTENT", but then an unknown error occurred, I think because the result table is too big. So I will give just the top rows of the result.

Thanks....

ATTRIBUTE_NAME  ATTRIBUTE_VALUE  SUPPORT  PROBABILITY  VARIANCE  VALUETYPE
Bike Buyer  Missing  0.000432432
Bike Buyer  0.494048907  18484  0.999567568  0.249964584
Birth Date  5.45E-06  16892175
Birth Date  0.080783333
Birth Date  22673.85945  16892175
Date First Purchase  -0.000941648  76769.40019
Date First Purchase  3242.190525
Date First Purchase  37852.25763  76769.40019
Number Cars Owned  -0.079859089  1.295870198
Number Cars Owned  301.5988757
Number Cars Owned  1.502705042  1.295870198
Number Children At Home  -0.015796323  2.318367003
Number Children At Home  9.530872281
Number Children At Home  1.004057563  2.318367003
Total Children  -0.003992138  2.599718694
Total Children  2.60E+01
Total Children  1.844351872  2.599718694
Yearly Income  2.02E-06  1042319181
Yearly Income  116.9361238
Yearly Income  57305.77797  1042319181
36.04143269  0.166743652
Bike Buyer  Missing  0.000205761  0.00E+00
Bike Buyer  4858  0.999794239  5.14E-07
Bike Buyer  Missing  0.000536481
Bike Buyer  0.507518797  1862  0.999463519  0.249943468
Date First Purchase  -0.010374704  1160.841795
Date First Purchase  679.5475673
Date First Purchase  37822.50644  1160.841795
Geography Key  0.000170418  36083.71548
Geography Key  1.990727272
Geography Key  236.4581096  36083.71548
Number Cars Owned  -0.072103841  1.254812169
Number Cars Owned  2.37E+01
Number Cars Owned  1.523630505  1.254812169
Yearly Income  1.29E-06  1131929217
Yearly Income  7.102048444
Yearly Income  60633.72718  1131929217
392.8960717  0.113511443
Bike Buyer  Missing  0.000123274
Bike Buyer  0.256103576  8110  0.999876726  0.190514534

|||

I see the problem: you have Bike Buyer modeled as continuous where it should be discrete. In this case, PredictProbability was giving you the probability that Bike Buyer exists (that is, that it is not missing). To get the confidence of the prediction for continuous predictions, you want to use PredictStdev.
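As a rough sketch of what that looks like against the existing continuous model, the query below asks for the predicted value and its standard deviation. The two input columns and their values are assumptions chosen for illustration only.

```dmx
// For a continuous [Bike Buyer], Predict returns the estimated value and
// PredictStdev returns the standard deviation around that estimate.
// The input values below are invented for illustration.
SELECT
    Predict([Bike Buyer]) AS [Predicted Value],
    PredictStdev([Bike Buyer]) AS [Std Dev]
FROM
    [v Target Mail]
NATURAL PREDICTION JOIN
    (SELECT 2 AS [Number Cars Owned],
            0 AS [Total Children]) AS t
```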

Essentially, what was happening was that the model was creating a linear regression to predict Bike Buyer rather than a classification model. Changing the content type to "Discrete" will fix your issue. If you are creating the model from the source table (rather than a cube), you can click on "Detect" on the data types page of the wizard and it will automatically determine that Bike Buyer should be discrete.
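For anyone defining the model in DMX rather than through the wizard, the content type is declared per column. The definition below is only a sketch: the model name [TM Discrete Tree], the key column [Customer Key], and the data types are assumptions and need to be matched to the actual source table.

```dmx
// Hypothetical model definition; names and data types are assumptions.
// The point is that [Bike Buyer] is declared DISCRETE, so the decision tree
// builds a classification model instead of a regression.
CREATE MINING MODEL [TM Discrete Tree]
(
    [Customer Key]      LONG   KEY,
    [Number Cars Owned] LONG   DISCRETE,
    [Total Children]    LONG   DISCRETE,
    [Yearly Income]     DOUBLE CONTINUOUS,
    [Bike Buyer]        LONG   DISCRETE PREDICT
)
USING Microsoft_Decision_Trees
```

Once such a model is trained, PredictProbability([Bike Buyer]) should return the probability of the predicted state (0 or 1) and vary with the input.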

Friday, March 30, 2012
