Friday, March 30, 2012

Prediction with many attribute states

I have a large dataset of around 3 million records with accounting data for 2 years. Attributes are transaction amount (cont. / predict), account, cost centre, project, month and a few others. I want to predict any future transaction amount for a certain combination. For example; what will the next salary cost transaction amount in cost centre 123 probably be?

I have tried Decision trees and Neural nets. But the predictions are not good enough even if there should be clear patterns in normal accounting data.

I guess the problem is that many of the input attributes have many states. There are around 500 account, and 1000 cost centres, and 2000 projects etc. And the Decision tree doesn’t seem to be able to capture all the business rules in the company. I have tried to group the attribute states into groups based on their average amount, their parent account etc, but it doesn’t seem to solve the problem.

Please post any suggestion you might have how to improve the prediction. I will try them all and post back my findings!

/Erik

You may need to structure your model so that it creates independent models for all scenarios. Also if things like "Project" only have a few rows/state, there's likely not alot to learn from them.

To create independent models you need to move one of your attributes to a nested table. For example, if you thought that "accounts" were the most important you would create a model like this

CREATE MINING MODEL CostByAccount
{
Transaction LONG KEY,
AccountAmount TABLE
{
Account TEXT KEY,
Amount FLOAT CONTINUOUS PREDICT_ONLY
}
CostCenter LONG DISCRETE,
Project LONG DISCRETE,
Month TEXT DISCRETE,
...
} USING Microsoft_Decision_Trees(params)

This will create a different tree for each account based on input only for that account. To create this table in the UI, you will mark the source table as case and nested tables and then add Account as Key of the nested table.

|||

Thanx Jamie,

Seems like a good idea. Creating a forrest instead of a tree. The result looks as expected when browsing the created tree structres in the model viewer.

1. But is this kind of model supported by the accurancy chart? Can't seem to get it working. I add the case table, and the nested table (same table twice). But the drop-down "Predictabel column name" is empty.

2. How to write the predict query? Have used the query builder but it dosn't seem to work.

/Erik

|||

Actually, no, it doesn't work with the accuracy chart, so you would have to create your own accuracy test queries.

For predict, you should be able to do Predict(<Nested Table Name>,3) for example to get the 3 most likely categories. There are also additional tricks you can play, for example to get statistics you can do

Predict(<Nested Table Name>,INCLUDE_STATISTICS)

This will return all possible states with descriptive stats for each state. Since these functions return tables you can select from them, e.g.

SELECT (SELECT * FROM Predict(<Nested Table>, INCLUDE_STATISTICS) WHERE $Probability >0.25) as Result FROM MyModel ...

Will return all states with a 25% probability or higher.

|||

Can't follow you,

This is approx. what I would like to do. But it dosnt work. (A simplified version of the real model).

/Erik

SELECT
t.[TransactionID],
t.[Account],
t.[CostCentre],
t.[Project],
(t.[Amount]) as [ActualAmount],
(SELECT ([Amount]) as [EstimatedAmount] FROM [DesTree].[Transactions])
From
[DesTree]
PREDICTION JOIN
SHAPE {
OPENQUERY([Adb2],
'SELECT DISTINCT
[TransactionID],
[Account],
[CostCentre],
[Project],
[Amount]
FROM
[dbo].[Transactions]
ORDER BY
[TransactionID]')}
APPEND
({OPENQUERY([Adb2],
'SELECT
[Account],
[Amount],
[TransactionID]
FROM
[dbo].[Transactions]
ORDER BY
[TransactionID]')}
RELATE
[TransactionID] TO [TransactionID])
AS
[Transactions] AS t
ON
[DesTree].[Cost Centre] = t.[CostCentre] AND
[DesTree].[Project] = t.[Project] AND
[DesTree].[Transactions].[Account] = t.[Transactions].[Account] AND
[DesTree].[Transactions].[Amount] = t.[Transactions].[Amount]

|||

I think you want to do your nested select like this

SELECT FLATTENED

t.[TransactionID],
t.[Account],
t.[CostCentre],
t.[Project],
(t.[Amount]) as [ActualAmount],

(SELECT Account, Amount FROM Predict(Transactions) WHERE Account='MyAccount') as Prediction

FROM ...

The only problem here is that you can't compare the nested account to your input - only to a static string or parameter. E.g you can do WHERE Account=@.Account, but you can't do WHERE Account=t.Account.

No comments:

Post a Comment